This article (Paine 2016)
caught my attention recently. It talks about the case of Charles Reep, a former Royal Air Force Wing Commander
who was tracking play-by-play data for matches and serving as a quantitative
consultant for Football League teams as early as the 1950s.
https://fivethirtyeight.com/features/how-one-mans-bad-math-helped-ruin-decades-of-english-soccer/amp/?__twitter_impression=true&fbclid=IwAR0MNCiSu4nJIcGYvW5dRoTif1mzNc6MJzo8c-AFLU-mDWqZgWOCnT75tIw
The article recalls how Reep’s analysis led him to conclude that the number of passes
made before a shot is tied to scoring. His advice was that
shooting after three passes or fewer has a higher probability of producing a
goal.
But Reep was making a huge mistake. Put simply, he started
with each goal scored and looked at how many passes were made before it.
His starting point was goals scored. The problem is that most goals in
soccer do come after three passes or fewer, because that is the nature of the
game: play is sporadic, and passing moves are frequently disrupted by the
defense. What he did not count were the moves of three passes or fewer that
failed to produce a goal. That block of data is missing because he looked only at goals.
In an earlier article, Neil Paine of the website FiveThirtyEight
refuted that bit of wisdom gleaned from Reep’s agglomeration of soccer data.
https://fivethirtyeight.com/features/what-analytics-can-teach-us-about-the-beautiful-game/
But subsequent analysis has discredited
this way of thinking. Reep’s mistake was to fixate on the percentage of
goals generated by passing sequences of various lengths. Instead, he should
have flipped things around, focusing on the probability that a given sequence
would produce a goal. Yes, a large proportion of goals are generated on
short possessions, but soccer is also fundamentally a game of short possessions
and frequent turnovers. If you account for how often each sequence-length
occurs during the flow of play, of course more
goals are going to come off of smaller sequences — after all, they’re easily
the most common type of sequence. But that doesn’t mean a small sequence has a
higher probability of leading to a goal.
To the contrary, a team’s probability of scoring goes up
as it strings together more successful passes. The implication of this
statistical about-face is that maintaining
possession is important in
soccer. There’s a good relationship between a team’s time spent in
control of the ball and its ability to generate shots on target, which in turn
is hugely predictive of a team’s scoring rate and, consequently, its placement
in the league table. While there’s less rhyme or reason to the rate at which
teams convert those scoring chances into goals, modern analysis has ascertained
that possession plays a big role in creating offensive opportunities, and that
effective short passing — fueled
largely by having pass targets move to
soft spots in the defense before ever receiving the ball — is strongly
associated with building and maintaining possession. (Paine 2014)
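The difference between the two ways of counting can be sketched with a few lines of Python. The counts below are invented purely for illustration (they are not Reep’s or Paine’s data), and the two bucket names are hypothetical. The sketch computes both the share of all goals that came from each kind of sequence, which is what Reep looked at, and the fraction of sequences of each kind that ended in a goal, which is what he should have looked at.

```python
# Hypothetical counts, invented for illustration only (not real match data).
possessions = {"0-3 passes": 9000, "4+ passes": 1000}
goals = {"0-3 passes": 90, "4+ passes": 20}


def share_of_goals(goals):
    """Fraction of all goals that came from each bucket (Reep's view)."""
    total = sum(goals.values())
    return {bucket: count / total for bucket, count in goals.items()}


def conversion_rate(goals, possessions):
    """Fraction of possessions in each bucket that ended in a goal."""
    return {bucket: goals[bucket] / possessions[bucket] for bucket in goals}


shares = share_of_goals(goals)
rates = conversion_rate(goals, possessions)

for bucket in goals:
    print(f"{bucket}: {shares[bucket]:.1%} of all goals, "
          f"{rates[bucket]:.1%} of possessions scored")
```

With these made-up numbers, short sequences account for most of the goals simply because there are nine times as many of them, while the longer sequences convert at twice the rate.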
To reiterate, he should have tracked the number of
possessions and whether those possessions turned into goals. Given the complexity of the game, it was
perhaps understandable that Reep made this mistake, and given that
statistical analysis in sports was still rudimentary, it was perhaps
predictable. The unfortunate thing is that Reep was able to convince an entire
nation’s soccer establishment to go off on a wild goose chase, and not just any nation, but the nation where the
game was born and whose excellence in the game was globally recognized.
People should have known better. Maybe.
This brings us to an oft-repeated but rarely observed tenet
of applied statistics: correlation does not
equal causation. The saying may sound glib, but
it is remarkably accurate. If we find
some kind of correlation between two events, our habit and inclination is
to jump to the conclusion that the two events have a causal relationship; that
is, that one event caused the other to occur, or that we can reliably
predict the latter event from the occurrence of the
first. Unfortunately for us, that is rarely the case. Establishing
causality takes formal checking. Just because the statistics
show some kind of correlation between the two events, however minimal,
doesn’t necessarily mean that they have a causal relationship.
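One common way a correlation can arise without causation is a hidden third factor that drives both quantities. That scenario is an assumption chosen here for illustration, not something drawn from the Reep story. The sketch below simulates it with made-up data: the variables `a` and `b` never influence each other, yet they correlate strongly because both depend on `hidden`. All names and noise levels are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# A hidden factor drives both a and b; a and b never touch each other.
hidden = [rng.gauss(0, 1) for _ in range(5000)]
a = [h + rng.gauss(0, 0.5) for h in hidden]
b = [h + rng.gauss(0, 0.5) for h in hidden]

r_raw = pearson(a, b)

# Subtract the hidden factor and only independent noise is left.
a_noise = [x - h for x, h in zip(a, hidden)]
b_noise = [y - h for y, h in zip(b, hidden)]
r_adjusted = pearson(a_noise, b_noise)

print(f"correlation of a and b:            {r_raw:.2f}")
print(f"after removing the hidden factor:  {r_adjusted:.2f}")
```

The first figure comes out large and the second close to zero, even though nothing in the simulation lets `a` cause `b` or the reverse.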
In order to establish causality, a lot
of number crunching needs to happen, and a lot of statistical metrics need to
meet certain established thresholds before we can declare causality. That is a
completely different arm of the statistical sciences called inferential statistics.
Far too involved for me to try to explain here and now, even assuming I can
explain it. A rather large and dodgy assumption.
Another thing that Reep’s error
illustrates is survivorship bias. The story of Abraham Wald and the US
warplanes is a favorite on social media and among business writers because it
perfectly demonstrates the linear and direct thinking most people employ when
they see data or results without taking into account the underlying situation.
Abraham Wald was born in 1902 in the then
Austria-Hungarian empire. After graduating in Mathematics he lectured in
Economics in Vienna. As a Jew following the Anschluss between Nazi Germany and
Austria in 1938 Wald and his family faced persecution and so they emigrated to
the USA after he was offered a university position at Yale. During World War
Two Wald was a member of the Statistical Research Group (SRG) as the US tried
to approach military problems with research methodology.
One
problem the US military faced was how to reduce aircraft casualties. They
researched the damage received to their planes returning from conflict. By
mapping out damage they found their planes were receiving most bullet holes to
the wings and tail. The engine was spared.
The US military’s conclusion was simple:
the wings and tail are obviously vulnerable to receiving bullets. We need to
increase armour to these areas. Wald stepped in. His conclusion was surprising:
don’t armour the wings and tail. Armour the engine.
Wald’s
insight and reasoning were based on understanding what we now call
survivorship bias. Bias is any factor in the research process which
skews the results. Survivorship bias describes the error of looking only at
subjects who’ve reached a certain point without considering the (often
invisible) subjects who haven’t. In the case of the US military they were only
studying the planes which had returned to base following conflict i.e. the
survivors. In other words what their diagram of bullet holes actually showed
was the areas their planes could sustain damage and still be able to fly and
bring their pilots home. (Thomas 2019)
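The effect Wald spotted can be reproduced in a small simulation. The loss probabilities below are invented for illustration and are not historical figures, and the assumption that every plane takes exactly one hit in a uniformly random section is a deliberate simplification. The sketch compares where the hits landed on all planes with where they landed on the planes that made it back.

```python
import random

# Hypothetical chance that a plane is lost after a hit in each section.
# These numbers are made up for illustration, not historical data.
LOSS_PROBABILITY = {"engine": 0.8, "wings": 0.1, "tail": 0.1}


def simulate(n_planes, seed=0):
    """Give each plane one hit in a random section and record who returns."""
    rng = random.Random(seed)
    sections = list(LOSS_PROBABILITY)
    all_hits = {section: 0 for section in sections}
    returned_hits = {section: 0 for section in sections}
    for _ in range(n_planes):
        section = rng.choice(sections)  # every section equally likely to be hit
        all_hits[section] += 1
        if rng.random() >= LOSS_PROBABILITY[section]:
            returned_hits[section] += 1  # only these planes are inspected at base
    return all_hits, returned_hits


all_hits, returned_hits = simulate(10_000)
print("hits across all planes:     ", all_hits)
print("hits seen on returned planes:", returned_hits)
```

Across all planes the hits are spread roughly evenly, but among the returned planes engine hits are scarce, which is exactly the pattern that could be misread as "the engine is rarely hit".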
What Reep saw was goals. He was fixated
on them rather than on the big picture, and he fell into the trap of reaching the
first and most obvious conclusion rather than trying to explore the structure of
the game. Sometimes prior experience is very useful and not everything new is
golden.
Works Cited
Paine, Neil. 2016. "How One Man’s Bad Math
Helped Ruin Decades Of English Soccer." http://www.fivethirtyeight.com.
October 27. Accessed December 24, 2019.
https://fivethirtyeight.com/features/how-one-mans-bad-math-helped-ruin-decades-of-english-soccer/amp/?__twitter_impression=true&fbclid=IwAR0MNCiSu4nJIcGYvW5dRoTif1mzNc6MJzo8c-AFLU-mDWqZgWOCnT75tIw.
—. 2014. "What Analytics Can Teach Us About the
Beautiful Game." http://www.fivethirtyeight.com. June 12. Accessed
December 24, 2019.
https://fivethirtyeight.com/features/what-analytics-can-teach-us-about-the-beautiful-game/.
Thomas, James. 2019. "Survivorship Bias." McDreeamie
Musings. April 1. Accessed December 28, 2019.
https://mcdreeamiemusings.com/blog/2019/4/1/survivorship-bias-how-lessons-from-world-war-two-affect-clinical-research-today.