Saturday, January 4, 2020

Stats For Spikes-Correlation and Causation


This article (Paine 2016) caught my attention recently. It talks about the case of Charles Reep, a former Royal Air Force Wing Commander who was tracking play-by-play data for matches and serving as a quantitative consultant for Football League teams as early as the 1950s.


https://fivethirtyeight.com/features/how-one-mans-bad-math-helped-ruin-decades-of-english-soccer/amp/?__twitter_impression=true&fbclid=IwAR0MNCiSu4nJIcGYvW5dRoTif1mzNc6MJzo8c-AFLU-mDWqZgWOCnT75tIw


The article recalls how Reep’s analytics led him to conclude that the number of passes in a move is directly tied to scoring. His advice was that a team shooting after three passes or fewer has a higher probability of scoring a goal.

But Reep was making a huge mistake. Put simply, he started with each goal scored and looked at how many passes were made before it. The problem is that most goals in soccer do come after three passes or fewer, because that is the nature of the game: play is sporadic, and the passing game gets disrupted frequently by the defense. What he did not count were the short possessions that failed to produce a goal. That block of data is missing because he focused only on the goals.
In a previous article, Neil Paine of the website FiveThirtyEight refuted that bit of wisdom gleaned from Reep’s agglomeration of soccer data.

https://fivethirtyeight.com/features/what-analytics-can-teach-us-about-the-beautiful-game/

But subsequent analysis has discredited this way of thinking. Reep’s mistake was to fixate on the percentage of goals generated by passing sequences of various lengths. Instead, he should have flipped things around, focusing on the probability that a given sequence would produce a goal. Yes, a large proportion of goals are generated on short possessions, but soccer is also fundamentally a game of short possessions and frequent turnovers. If you account for how often each sequence-length occurs during the flow of play, of course more goals are going to come off of smaller sequences — after all, they’re easily the most common type of sequence. But that doesn’t mean a small sequence has a higher probability of leading to a goal.

To the contrary, a team’s probability of scoring goes up as it strings together more successful passes. The implication of this statistical about-face is that maintaining possession is important in soccer. There’s a good relationship between a team’s time spent in control of the ball and its ability to generate shots on target, which in turn is hugely predictive of a team’s scoring rate and, consequently, its placement in the league table. While there’s less rhyme or reason to the rate at which teams convert those scoring chances into goals, modern analysis has ascertained that possession plays a big role in creating offensive opportunities, and that effective short passing — fueled largely by having pass targets move to soft spots in the defense before ever receiving the ball — is strongly associated with building and maintaining possession. (Paine 2014)
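
The difference between the two ways of reading the same data can be shown with a small sketch. The counts below are invented purely for illustration and are not Reep's or Paine's figures; the bucket names and the two helper functions are likewise hypothetical. The sketch only shows that a bucket can supply most of the goals while having the lower scoring probability per possession.

```python
# Hypothetical counts, invented for illustration only (not real match data).
# Each entry maps a sequence-length bucket to (possessions, goals scored).
possessions = {
    "0-3 passes": (9000, 90),
    "4+ passes": (1000, 20),
}

total_goals = sum(goals for _, goals in possessions.values())


def share_of_goals(bucket):
    """Reep's view: of all goals, what fraction came from this bucket?"""
    return possessions[bucket][1] / total_goals


def goal_probability(bucket):
    """The flipped view: of all possessions in this bucket, what fraction scored?"""
    count, goals = possessions[bucket]
    return goals / count


for bucket in possessions:
    print(
        f"{bucket}: {share_of_goals(bucket):.0%} of goals, "
        f"{goal_probability(bucket):.1%} chance of scoring per possession"
    )
```

With these made-up numbers, short sequences account for roughly four out of five goals, yet each long sequence is twice as likely to end in a goal. Counting only goals, as Reep did, hides the second number entirely.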

To reiterate, he should have tracked the number of possessions and whether those possessions turned into goals. Given the complexity of the game, it was perhaps understandable that Reep made this mistake, and given that statistical analysis in sports was still rudimentary, it was perhaps predictable. The unfortunate thing is that Reep was able to convince an entire nation’s soccer establishment to go off on a wild goose chase, and not just any nation, but the nation where the game was born and whose excellence in the game was globally recognized. People should have known better. Maybe.
This brings us to an oft-repeated but rarely observed tenet of applied statistics: correlation does not equal causation. The saying may sound glib, but it is remarkably dead on. If we find some kind of correlation between two events, our habit and inclination is to jump to the conclusion that the two events have a causal relationship; that is, that one event caused the other, or that we can reliably predict the second event from the occurrence of the first. Unfortunately for us, that is rarely the case. Establishing causality takes formal mathematical checking. Just because the statistics show some correlation between two events, however small, doesn’t necessarily mean that they have a causal relationship.
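
One common way a correlation arises without causation is a hidden third factor that drives both measurements. The sketch below is a toy simulation under that assumption: the variables `hidden`, `a`, and `b` and the `pearson` helper are hypothetical names, and the data are random numbers, not soccer statistics.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# A hidden factor feeds into both a and b; a never appears in b's formula.
hidden = [random.gauss(0, 1) for _ in range(5000)]
a = [h + random.gauss(0, 1) for h in hidden]
b = [h + random.gauss(0, 1) for h in hidden]
r_observed = pearson(a, b)

# "Intervene" on a by setting it independently of the hidden factor.
# b is unchanged, because a was never one of its causes.
a_forced = [random.gauss(0, 1) for _ in range(5000)]
r_after_intervention = pearson(a_forced, b)

print(f"observed correlation:      {r_observed:.2f}")
print(f"after forcing a by hand:   {r_after_intervention:.2f}")
```

In the observed data `a` and `b` move together, but changing `a` directly does nothing to `b`. The correlation was real; the causal story was not.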

In order to establish causality, a lot of number crunching needs to happen, and a lot of statistical metrics need to meet established thresholds before we can declare causality. That is a completely different arm of the statistical sciences called inferential statistics. It is far too involved for me to try to explain here and now, even assuming I can explain it, which is a rather large and dodgy assumption.
Another thing that Reep’s error illustrates is survivorship bias. The story of Abraham Wald and the US warplanes is a favorite of social media and business writers because it perfectly demonstrates the linear, direct thinking most people employ when they see data or results without taking into account the underlying situation.

Abraham Wald was born in 1902 in the then Austria-Hungarian empire. After graduating in Mathematics he lectured in Economics in Vienna. As a Jew following the Anschluss between Nazi Germany and Austria in 1938 Wald and his family faced persecution and so they emigrated to the USA after he was offered a university position at Yale. During World War Two Wald was a member of the Statistical Research Group (SRG) as the US tried to approach military problems with research methodology.
One problem the US military faced was how to reduce aircraft casualties. They researched the damage received to their planes returning from conflict. By mapping out damage they found their planes were receiving most bullet holes to the wings and tail. The engine was spared.


The US military’s conclusion was simple: the wings and tail are obviously vulnerable to receiving bullets. We need to increase armour to these areas. Wald stepped in. His conclusion was surprising: don’t armour the wings and tail. Armour the engine.

Wald’s insight and reasoning were based on understanding what we now call survivorship bias. Bias is any factor in the research process which skews the results. Survivorship bias describes the error of looking only at subjects who’ve reached a certain point without considering the (often invisible) subjects who haven’t. In the case of the US military they were only studying the planes which had returned to base following conflict i.e. the survivors. In other words what their diagram of bullet holes actually showed was the areas their planes could sustain damage and still be able to fly and bring their pilots home. (Thomas 2019)
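
The effect described above can be reproduced with a toy simulation. Everything in it is an assumption made for illustration: each plane takes exactly one hit, hits land evenly across three sections, and the loss probabilities in `LOSS_PROBABILITY` are invented rather than taken from Wald's work.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical chance that a single hit in each section brings the plane down.
LOSS_PROBABILITY = {"engine": 0.6, "wings": 0.1, "tail": 0.1}
sections = list(LOSS_PROBABILITY)

hits_all = {s: 0 for s in sections}       # every plane that was hit
hits_returned = {s: 0 for s in sections}  # only planes that made it home

for _ in range(30000):
    section = random.choice(sections)  # hits are spread evenly by assumption
    hits_all[section] += 1
    if random.random() >= LOSS_PROBABILITY[section]:
        hits_returned[section] += 1    # plane survived and can be inspected

for s in sections:
    print(f"{s:>6}: {hits_all[s]} hits overall, {hits_returned[s]} seen on returning planes")
```

Although every section is hit about equally often, the returning planes show far fewer engine hits. An analyst who sees only the survivors would wrongly conclude that engines are rarely struck, when in this simulation an engine hit is simply the one a plane is least likely to survive.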

What Reep saw was goals. He was fixated on them rather than the big picture, and he fell into the trap of reaching the first and most obvious conclusion rather than trying to explore the structure of the game. Sometimes prior experience is very useful, and not everything new is golden.

Works Cited

Paine, Neil. 2016. "How One Man’s Bad Math Helped Ruin Decades Of English Soccer." http://www.fivethirtyeight.com. October 27. Accessed December 24, 2019. https://fivethirtyeight.com/features/how-one-mans-bad-math-helped-ruin-decades-of-english-soccer/amp/?__twitter_impression=true&fbclid=IwAR0MNCiSu4nJIcGYvW5dRoTif1mzNc6MJzo8c-AFLU-mDWqZgWOCnT75tIw.
—. 2014. "What Analytics Can Teach Us About the Beautiful Game." http://www.fivethirtyeight.com. June 12. Accessed December 24, 2019. https://fivethirtyeight.com/features/what-analytics-can-teach-us-about-the-beautiful-game/.

Thomas, James. 2019. "Survivorship Bias." McDreeamie Musings. April 1. Accessed December 28, 2019. https://mcdreeamiemusings.com/blog/2019/4/1/survivorship-bias-how-lessons-from-world-war-two-affect-clinical-research-today.
