
Wednesday, December 28, 2011

Problems with an Adjusted Plus Minus Metric in Football


Michael Essien

What would be the perfect, all-encompassing football statistic? Something that takes into account both offensive and defensive skill. Something that measures the value a player adds to his club. All in all, a statistic that quantifies the individual impact a player has on improving (or worsening) his club's ability to score goals and limit (or not) goals against.

Some people have made attempts at this in the past. One example is OptaJoe's tweets (@OptaJoe) about clubs' winning percentages with and without a player. Here is one example: "10 - Since January 2005, Everton have averaged 61 points per season with Arteta playing, compared to 51 points without him. Lynchpin." These statements are simple, easy to understand, and at first glance seem informative. On his blog 5 Added Minutes, Omar Chaudhuri has correctly pointed out that these statements tend to be entirely misleading. As Omar shows, the problem is that they do not control for the strength of the opponent, the venue of the game, or really anything else in these games.

My idea was to create a metric that would control for all of these factors to truly understand every player's worth to his club. Being a big ice hockey fan (specifically of the Boston Bruins, if you are wondering), I thought that the plus minus statistic might be applicable to football. For those of you not familiar with it, plus minus basically measures a team's net goals while a given player is on the ice/field. When the team scores a goal while the player is playing, his plus minus increases by one. Conversely, when the team concedes while he is playing, his plus minus decreases by one. The idea is that, over the season, the best players will have the highest plus minus.
I faced the same problem as before, though: plus minus does not control for the strength of the opponent, the strength of the player's own teammates, or where the game is being played. For example, a poor player on a top club would naturally have a higher plus minus than a good player on a poor club.
To fix this, I applied an analysis used in basketball to create an adjusted plus minus statistic. It was created by Dan Rosenbaum, and if you are interested, the explanation can be found here.

Without going into too many technical details, the adjusted plus minus metric is created using a massive regression. The right-hand-side variables are indicators for every player, while the left-hand side is goals for. Each observation is a stretch of a game during which no substitutions are made. Each player variable is a 1 if the player is playing at home during that stretch, a -1 if he is playing away, and a 0 if he is not playing. The significance of this methodology is that it controls for each player's team, venue, and opponents. If you want to know more about the methodology, read the link above. The data is from the 2010/2011 season and is provided by Infostrada Sports (@InfostradaLive on Twitter).
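To make the setup concrete, here is a minimal sketch of the design matrix in Python. The player names, the stretches of play, and the goal figures are all invented for illustration; the real model also accounts for the length of each stretch.

```python
import numpy as np

# Columns are players, rows are stretches of play with no substitutions.
# +1 = player on the pitch at home, -1 = on the pitch away, 0 = not playing.
players = ["A", "B", "C", "D"]
X = np.array([
    [ 1,  1, -1, -1],   # A and B at home against C and D
    [ 1, -1, -1,  1],
    [-1, -1,  1,  1],
    [ 1,  1,  1, -1],
    [-1,  1, -1,  1],
])
# Dependent variable: home goals minus away goals in each stretch
# (scaled to a per-90-minute rate in the full model).
y = np.array([1.0, 0.0, -1.0, 2.0, 0.0])

# Each least-squares coefficient is a player's adjusted plus minus.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(players, np.round(coefs, 2))))
```

The key design choice is that every teammate and opponent appears as its own column, so a player's coefficient is his contribution after netting out everyone he played with and against.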

The main problem with this, as some, including Albert Larcada (@adlarcada_ESPN), pointed out on Twitter, is that there is multicollinearity in the regression. It arises because, unlike in basketball, there are not many scoring events, so many players end up highly correlated in the model. This throws off the adjusted plus minus values for each player, so we should not take much from the results.

With that in mind, here are the results I came up with. Again, these results are likely not correct, but I thought people might be curious to see them anyway:


Because I (a) spent a lot of time on this and (b) think it is important to post work even when it doesn't work out, I went ahead and wrote this post. Keep in mind that the results above don't mean much: the standard errors on all the values are large enough that we cannot say any of them are statistically different from 0, which is another reason the results are not reliable. Still, I think the adjusted plus minus statistic could be the first step toward metrics that truly capture the actual value of a player. Most statistics used today (assists, goals, etc.) can be thrown off because they are highly dependent on the team the player plays for.

One way to fix the multicollinearity problem is to use a different statistic that occurs more often and is highly correlated with goals. I think the best option would be shots on goal. You could then create a statistic that controls for the player's team, opponents, and venue, and measures net shots on goal while he is on the field. Just a thought on something to look at in the future.

Special thanks to Simon Gleave (@SimonGleave on Twitter) for helping me with the data.

Wednesday, November 9, 2011

How to Succeed in the EPL: Chances Created and Chance Conversion

A common statistic that many people have begun to value and notice a lot recently is the chances created statistic. Chances created, according to Opta's website, is defined as "assists plus Key passes" where a Key Pass is "the final pass or pass-cum-shot leading to the recipient of the ball having an attempt at goal without scoring" (Opta is a company that tracks and generates a ton of data in soccer). So basically, any pass that leads to a shot is considered a chance created.
Swansea's Mark Gower is a perfect example
of a player highlighted by the chances
created statistic.

Chances Created
The appeal of this measure is that it values players on weaker teams more fairly than assists do. For a player on a weaker team, it is harder to record assists, since his teammates are less likely to score. Chances created is a fairer statistic because it depends far less on the strength of those teammates. Overall, it can highlight creative players who are often overlooked because they play for weaker teams and do not have as many assists.

Do Chances Created Actually Matter?
With all this in mind, I was curious about the actual worth of the chances created statistic. One way to measure this is to look at how chances created and wins are correlated. To make it a little easier, I looked at the relationship between goals scored and chances created for EPL teams. In other words, do teams that create more chances score more, and do teams that create fewer chances score less? The answer, in short, is yes: they are correlated. Below is a scatterplot of the relationship. There is a clear positive relationship between chances created and goals in the EPL last season. The coefficient is statistically different from 0 (p<.001), which is extremely strong evidence of a positive relationship.
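As a sketch of the check itself, the correlation and significance test take only a few lines. The club totals below are invented for illustration, and scipy is an assumed dependency.

```python
from scipy.stats import linregress

# Hypothetical (chances created, goals) season totals for six clubs.
chances = [290, 320, 350, 380, 410, 450]
goals   = [ 35,  40,  48,  55,  60,  70]

# Simple linear regression: slope, correlation, and p-value in one call.
fit = linregress(chances, goals)
print(f"slope={fit.slope:.3f}  r={fit.rvalue:.3f}  p={fit.pvalue:.4f}")
```

With the real 20-club data, the p-value on the slope is what tells you the relationship is unlikely to be noise.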



Chance Conversion Percentages
This is only half the story, though. Some teams get a lot of shots off but still score few goals, either because they are not good at shooting or because they take shots with a small chance of going in; these teams have a poor conversion percentage. The conversion percentage is defined as goals divided by the total number of shots (excluding blocked shots). Below is a scatterplot similar to the one above, this time with conversion percentage on the x-axis. The conversion rates are rounded to 2 decimal places, hence the bunching. Again, this shows a positive relationship between conversion percentage and goals: teams with higher conversion rates tend to score more, and vice versa. This relationship is also statistically different from 0 (p=.002). A quick note: the product of chances created and conversion rate is very close to the number of goals a club has scored. I'm pretty sure the discrepancy comes from blocked shots being included in shots attempted but not in conversion rates.
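For reference, the conversion figure is just a ratio; here is a quick sketch with invented numbers.

```python
def conversion_pct(goals, total_shots, blocked):
    """Goals divided by total shots, excluding blocked shots."""
    return goals / (total_shots - blocked)

# A hypothetical club: 65 goals from 500 shots, 80 of which were blocked.
print(round(conversion_pct(65, 500, 80), 2))  # -> 0.15
```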



EPL 2010-2011, Chances Created and Conversion %
With this in mind, I created a scatterplot of conversion rates and chances created for EPL teams last season. The plot shows that clubs found scoring success in different ways. The Manchester clubs did it by being efficient scorers; they had conversion percentages of .15 and .16. Chelsea and Tottenham were on the other end of the spectrum with higher chances created, but lower conversion percentages (.12 for both). The graphic also shows that West Ham did not struggle because they were not creating chances; they struggled because they had a low conversion percentage (.10). On the other hand, Birmingham struggled because they failed to create enough chances to score, despite a decent conversion percentage of .12.



EPL 2011-2012 thus far, Chances Created and Conversion %
What about this year? Below is the same scatterplot as above, this time for the current season. City's dominance really stands out: they lead in both chances created AND conversion percentage, hence the massive number of goals this year. Again, United seems to be scoring because of a high conversion percentage. QPR and United actually have a very similar number of chances created; United just finishes their chances at a much higher rate. Liverpool sticks out because of their high number of chances created but really low conversion percentage (.09).




Conclusion
The bottom line is that chances created and conversion rate are the keys to understanding goal scoring. A club can succeed with a high conversion rate (United) or by creating a lot of chances (Liverpool), and can really dominate by doing both well (City). The graphic above can also suggest what kind of players each club needs: Manchester United and Newcastle would benefit from picking up a creative midfielder who creates more chances, while Liverpool and QPR would benefit from a more efficient scorer. The scatterplot also tells us why some clubs struggle. Wigan needs to up their conversion percentage (currently a dismal .06) and Stoke needs to create more chances. City, on the other hand, should just continue to buy all the best players.

All data comes from eplindex.com (@EPLIndex)

Tuesday, October 11, 2011

An Analysis of the Performance of Promoted Clubs

Joey Barton, of newly promoted QPR


An aspect of English football that I love, and that does not exist in American sports, is promotion and relegation. It makes not just the race for first place exciting, but also the race to avoid the drop. In American sports, last-place teams often simply give up, a disappointment for fans.

I wanted to see exactly how promoted and relegated teams fare throughout the season. Some statistical research has already been done on the subject: Omar Chaudhuri, writer of the 5 Added Minutes blog, looks at conversion rates of promoted teams and their corresponding ability to stay in the top flight here. In part 1 of this post, I look at how promoted teams have done in their first season in the top flight. My original idea was that teams may struggle early in the season as they adjust to the higher level of competition, and eventually even out as the weeks go on. This also puts the performance of QPR, Swansea, and Norwich into perspective against past promoted teams' performances. I use data on promoted teams from the 2003/2004 through 2010/2011 seasons.

I've created 5 graphs to illustrate the performances of promoted teams. The first, below, shows how all the promoted clubs' point totals have progressed over the 38 games. On average, promoted teams earn around a point per week. The greenish, linear-looking line in the middle is the average; all the other jagged lines are the season point totals of individual promoted clubs. This graph isn't too informative, but it is an interesting graphic nonetheless.


The next graph is the same as the one above, but only shows the three clubs promoted this season, in comparison to the average points line and the linear points line. To clarify, the linear line illustrates what would happen if a team earned the same number of points every week and ended up with the average point total for promoted clubs, while the average line shows the average points earned through each week of the season. These may sound the same at first, but I will show in the next couple of paragraphs that there is an important distinction. Anyway, the graphic below illustrates that all 3 promoted clubs are faring about as well as the average promoted team. QPR started off a little stronger but has since returned to the average. Norwich and Swansea both started a little weaker but have improved to sit just above the average 7 weeks into the season. All 3 teams have 8 points so far, just above the point-per-week average for promoted teams.


Another way of looking at the first graph is through points per game of promoted teams, shown below. At first, clubs' points per game are a little spread out; as the season progresses, teams converge toward an average of 1 point per game, as mentioned above. Some clubs have done a little better and some a little worse, as is evident from the graph.


Next is the graph above, but again limited to the 3 teams promoted this season. Again, it shows that QPR started the campaign a little stronger but has since regressed to be even with Norwich and Swansea.


The final and most informative graph shows the cumulative points per game of promoted clubs. This graph answers my question of how promoted teams fare throughout the season. As you can see below, promoted teams seem to struggle up until week 7, then do better than their average point total until around week 20, and hover around the point-per-game mark for the rest of the season. There could be many explanations for this trend. Maybe clubs struggle at first and then adjust to the higher competition? Maybe clubs' transfer-window acquisitions (think QPR) start to pay off around week 7? It would be tough to tell what is really driving the trend, but the graph does highlight an interesting phenomenon.
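The cumulative points-per-game series behind that final graph is straightforward to compute; here is a sketch using an invented string of weekly results.

```python
POINTS = {"W": 3, "D": 1, "L": 0}

def cumulative_ppg(results):
    """Running points-per-game after each week of the season."""
    total, series = 0, []
    for week, result in enumerate(results, start=1):
        total += POINTS[result]
        series.append(total / week)
    return series

# A hypothetical promoted club's first 7 results.
print([round(x, 2) for x in cumulative_ppg("LDLWDWW")])
```

Averaging these series across every promoted club, week by week, produces the curve described above.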

I'm still working on doing a similar analysis of clubs that are relegated at the end of the season to analyze how their performance fluctuates throughout the season.

Tuesday, August 30, 2011

Expected Points Added (EPA) Leaders Through Week 3

Below are the Expected Points Added (EPA) leaders for the EPL through week 3. The week 1 leaders can be found in an earlier post here. To reiterate, EPA weights goals based on how important they are to the team's chance of winning the game. This is based on the notion that a go ahead goal in the 90th minute is worth more than the 5th goal in a 5-0 win.


Some interesting things to point out...

  • While Rooney has 5 goals this season, Welbeck's 2 goals have actually been more beneficial to United. In fact, Rooney doesn't even make the top 15 list above, since most of his goals came in the recent Arsenal blowout.
  • Dzeko tops the list by scoring frequently and in important situations. His average goal weight is a solid .51 expected points added, and the sheer fact that he has scored 6 goals puts him at the top.
  • It's still early in the season. Arteta is third on the list with only 1 goal (a late game-winner). Soon we'll start to see the top dominated by players who have scored a lot, and in important situations.

Sunday, August 21, 2011

Expected Points Added (EPA) Data Through EPL Week 1

Before the season I promised to post Expected Points Added (EPA) totals after each week of the season. Here are the EPA totals from week 1. If you don't know what EPA is, check out a full explanation here.

To summarize very basically, EPA measures how much each player's goals add to his team's expected points total. That is why you see some EPAs of 0 below: these players scored goals that added nothing to the team's expected points. For example, if a team is up 3-0 and a player scores a 4th in the 90th minute, that goal technically does not add to the team's chance of winning, because the team is already very likely to win.

Average Goal Weight (AGW) is just EPA divided by the number of goals a player has scored. It measures how important, on average, a player's goals are: a high AGW shows a player consistently scores clutch goals, while a low AGW shows he is scoring useless goals in blowouts.
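The bookkeeping for both numbers is simple; here is a sketch where each goal's weight (the change in the team's expected points when it was scored) is invented for illustration.

```python
# Invented per-goal weights for two players.
goal_weights = {
    "Dzeko":   [1.1, 0.4, 0.5, 0.3, 0.6, 0.2],
    "Klasnic": [0.0],
}

for player, weights in goal_weights.items():
    epa = sum(weights)          # Expected Points Added: sum of goal weights
    agw = epa / len(weights)    # Average Goal Weight = EPA / goals scored
    print(f"{player}: EPA={epa:.2f}, AGW={agw:.2f}")
```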


Dzeko has the highest EPA thanks to his go-ahead goal in the 57th minute, which equated to a little more than a point for City. Klasnic, Muamba, and Silva all scored goals that added no expected points for their teams.

If you have any questions, feel free to ask in the comment section. I'll be super busy this week between moving into my apartment at school and 3-a-days for preseason, but I'll try to keep some posts coming.

Wednesday, August 3, 2011

Refining The Win Probability Statistic


Last year I was planning to go to the Sloan Sports Conference but ended up not being able to make it. I was thinking about it again this year, and I decided it wouldn't be a bad idea to submit something for this year’s conference. At first I wasn’t going to, but why the hell not? Might as well go for it, I guess.

My win probability added statistic has generated some interest, and I think it gives some pretty interesting insight, so I’ve been working on expanding it. If you have no idea what win probability added is, check out my first post on win probability and another on win probability added. Anyway, thus begins my quest to refine and expand the win probability added statistic for submission to the conference. Comments, criticisms, and suggestions are very much appreciated and would help a lot.

The first fix I made was a simple change of name. The problem with “win probability added” is that it doesn’t actually calculate win probability added. For example, if two teams are tied in the 90th minute, the win probability under my old calculations was .333 for both teams. This doesn’t make sense, because at that point each team has close to a 0% chance of winning the game, not 1/3. The issue comes from modeling the statistic after the similar calculation in professional baseball. My fix is extremely simple: multiply all the values by 3. This turns win probability added into expected points added, which makes much more sense. If a player scores a go-ahead goal in the 90th minute, the Expected Points Added (easier to write EPA from now on) is almost 2; if a player scores a tying goal in the 90th minute, the EPA is almost 1. Much simpler and easier this way (I originally got the idea from @11tegen11’s similar analysis).
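To spell out why the factor of 3 works, expected points can be written directly from the outcome probabilities: 3 for a win, 1 for a draw, 0 for a loss. A sketch with invented probabilities:

```python
def expected_points(p_win, p_draw):
    """Expected league points: 3 * P(win) + 1 * P(draw) + 0 * P(loss)."""
    return 3 * p_win + 1 * p_draw

# Two teams level in the 90th minute: a draw is nearly certain.
before = expected_points(p_win=0.02, p_draw=0.96)   # about 1 point
# After a 90th-minute go-ahead goal, a win is nearly certain.
after = expected_points(p_win=0.97, p_draw=0.03)    # about 3 points

print(f"EPA of the goal: {after - before:.2f}")     # close to 2
```

Note that multiplying the old "1/3 each" tie value by 3 gives exactly 1 expected point, which is why the shortcut lines up with the direct formula.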

After this, I noticed the graphs were not nice, easy curves. Even though I took a big sample of games (about 10 years’ worth), there isn’t enough data to give smooth curves. To fix this, I created lines of best fit for each game situation. The home and away graphs for each minute and goal differential are below. Previously, a few situations gave unrealistic expected point totals because they occurred so rarely (like a 2-goal lead in the 5th minute); the smooth curves fix this problem. They also let me use equations to calculate EPA instead of the annoying process of referencing a massive Excel chart.
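The smoothing step can be sketched as a simple polynomial fit; the per-minute expected-points values below are invented stand-ins for the raw averages.

```python
import numpy as np

# Noisy raw expected points for one game state (say, home team up a goal),
# sampled every 10 minutes; values invented for illustration.
minutes = np.arange(0, 91, 10)
raw_ep  = np.array([1.90, 1.98, 2.04, 2.15, 2.21, 2.35,
                    2.47, 2.63, 2.80, 2.96])

# Fit a low-degree polynomial, then evaluate it at any minute instead of
# looking the value up in a giant spreadsheet.
coeffs = np.polyfit(minutes, raw_ep, deg=2)
smooth = np.polyval(coeffs, minutes)
print(np.round(smooth, 2))
```

The same fit can be done per goal differential and per venue, giving one equation per curve in the graphs above.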





I think there are a lot of possible paths to take from here. I’m going to recalculate the top goal scorers’ EPA using the equations. It won’t change much, but it’ll be nice to have some continuity, because I’ll be calculating EPA week by week for every goal next EPL season.

I’m also working on creating a database of the top goal scorers in the last 10 years in the EPL, with their goal totals and their EPA over the years. Looking at goals and EPA over time will hopefully give some insight into clutch (or not-so-clutch) goal scoring. If some players consistently have very high EPAs and others consistently low EPAs, it could be an indicator of clutch goal scoring in football.

Like I said before, I’d love comments and suggestions on where to go next: on the blog, via Twitter, or even by email.

Wednesday, July 27, 2011

Does More Possession=More Wins in the MLS?



In the past couple of blog posts I've looked at two common statistics and shown that they are not as meaningful as most people believe: shots on goal do not predict success very well, and assists favor players on better clubs. In keeping with this theme of misleading statistics in football, I decided to look at possession data. The commonly held notion is that the team that has the ball more (a possession percentage over 50) is more likely to win. This makes sense: a team with the ball more is more likely to score and less likely to concede. But does the data back it up? Does having more possession than your opponent mean you are more likely to win the game? I looked at the possession data from the MLS season so far, and what I found goes completely against what most people would think. So far this season in the MLS, the average possession percentage for teams that have won their game is 48.5%. Teams that win actually possess the ball less, which means the average possession percentage for losing teams is 51.5%.

To get even more specific, I broke the possession data down further. Winning home teams average 50.9% possession, and winning away teams average 43.4%. On the other side, losing home teams average 56.6% possession and losing away teams average 49.1%. The histograms below illustrate these figures. Overall, away teams average 47.3% possession and home teams 52.7%.


So what does all this mean? It seems possession percentage in the MLS does not predict success. Teams that possess the ball more don't win more; they actually lose more. Home teams also have a slight advantage in possession percentage compared with away teams.

What about teams that completely dominate possession? You might think that a team that had the ball much more often than their opponent would be much more likely to win. I defined "dominating possession" as having the ball more than 60% of the time. So far this season, teams that have dominated possession have a record of 10 wins, 19 losses, and 18 ties. Domination in possession? Yes. Domination in wins? No.

This analysis calls into question statements like "the Union had the run of play, they possessed the ball more and deserved the win." It's apparent that in the MLS, possession is not all that important when it comes to winning games. So what's the problem with possession? One reason could be that the best teams do not play possession football; the teams with the most success may play kick-and-run. Another possibility is that possessing the ball simply doesn't lead to wins. Either way, having the ball more than your opponent does not mean much in the MLS.

Monday, July 25, 2011

Why We Shouldn't Put Much Value in Assists



Last week I wrote a post on why shots on goal are a misleading statistic. In keeping with the analysis of the problems with some commonly kept statistics in football, I decided to look at assists. 

If you think about it, assists are highly misleading. Simply playing with good players boosts your assist total. Similar to shots on goal, not all assists are the same. There are assists where a player makes a short pass in the midfield that leads to a teammate dribbling through all the opposing defenders and finishing, and assists where a player makes a beautiful cross that their teammate simply has to tap into the open net. These obviously shouldn't be counted as having the same value to the team, yet they are. Hell, I could probably record an assist eventually in the EPL if I played for one of the top teams (OK, maybe an exaggeration, but you get the point).

First, let's look at the assist data for all the teams in the EPL. As the graph below shows, as a team's point total increases (basically, the better the team is), its assist total also generally increases. This is no surprise: we would expect better teams to score more goals and thus record more assists.



Basically, this means the assist statistic should favor players on better teams: they play with better teammates and should therefore have more opportunities for assists. Below is a screenshot from the EPL website of the top 20 players by assist total.



Nine players from the top 5 clubs are in the top 20 for assists. No players from the bottom 3 clubs are in the top 20, with the exception of Blackpool's Charlie Adam, who was just signed by Liverpool. It's easy to see that assist totals are higher for players on better clubs.

A better statistic, one not influenced by the quality of your teammates, is chances created. A chance created is defined as a pass that leads to a shot. Chances created are obviously not as dependent on your teammates and give a fairer, truer assessment of how much of a playmaker a player is for his team.

The next time a club is looking to sign a player based solely on their assists totals, they should take a more in depth look. Assists can tell an inaccurate, or at the least biased, story.

Monday, July 18, 2011

Do Shots on Goal Matter?


The major point of this blog is to test commonly held notions in football for their validity. After watching the US women lose to Japan yesterday, I started to think about shots on goal. I don't have the exact numbers, but I'm pretty sure the US crushed Japan in the shots-on-goal category. This made me wonder: do shots on goal matter? Most people would quickly say yes; it would make sense that more shots on goal mean more chances to score and thus more goals. The only problem is that some things in football just don't make sense. I wanted to see if shots on goal equate to success in two ways: 1.) Do more shots on goal mean more success for a team as a whole? 2.) Do more shots on goal mean more goals for a specific player? To answer these questions I used data from the MLS website. As an aside, mls.com has extensive statistics for every season in a bunch of categories, which is great to see. Anyway, the data is from the 2010 MLS season.


First question: Do more shots on goal mean more success for a team as a whole?

If this were true, we would expect points to increase as shots on goal increase at the team level. In other words, teams that have more shots on goal would be more successful. The graph below tells us a different story.


The graph shows there is no real relationship between shots on goal and points. Most teams cluster around just under 140 shots on goal for the season. The line of best fit shows a positive relationship, but it is not strong at all: the correlation is r=.1311. As a reminder, the correlation coefficient r measures the strength of the linear relationship between two variables: a value of 0 means no linear relationship at all, and a value of 1 means a perfect positive linear relationship. In this case, .1311 tells us the linear relationship is very weak.
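For anyone who wants to reproduce this kind of check, the correlation coefficient is one line with numpy; the team figures below are invented placeholders for the MLS data.

```python
import numpy as np

# Hypothetical season totals for eight clubs.
shots_on_goal = [138, 142, 121, 150, 139, 128, 145, 133]
points        = [ 46,  40,  48,  55,  37,  51,  44,  42]

# Pearson correlation between the two series.
r = np.corrcoef(shots_on_goal, points)[0, 1]
print(round(r, 4))
```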


Second question: Do more shots on goal mean more goals for a specific player?

The question here is similar: does the number of goals increase linearly as the number of shots on goal increases? The graph below gives us the answer.



This graph shows a stronger relationship than the one above, but it is still not very strong. The value of r here is .4722, and a correlation under .5 is generally considered weak. This means that for individual players, shots on goal are not a very good indicator of goals.

Here's my best explanation for why shots on goal are not a very informative statistic: not all shots on goal are the same. There are 40-yard weak rollers that the goalie easily saves, and there are 5-yard shots that the keeper barely gets a hand on. There are weak attempts by a center back getting forward, and there are breakaways by forwards. The shots on goal statistic counts all of these as equivalent, which obviously makes no sense. A better indicator for both questions above would be shots on goal from inside the box, which would exclude the shots that have no chance of going in. Not all shots inside the box are the same either, so the same problem remains to a degree, but I assume there would be a much stronger correlation between shots on goal inside the 18 and points, and between shots on goal inside the 18 and an individual player's goals. Unfortunately, I don't have the data to back up this claim (working on it). If and when I get data on shots inside the box, I'll post the graph and the correlation.

Even without that data, the point I'm making is still clear: shots on goal do not equate to team success and do not correlate strongly with goals for individual players, despite what most people assume. There are better statistics than shots on goal. This means statements like "New England had 5 more shots on goal than New York, they dominated the game" and "Donovan had 4 shots on goal in the game, he was due for a goal" are not necessarily valid. What if New England's shots on goal all came from outside the 18 and never had a chance of going in? And what if Donovan's shots on goal were all weak rollers? Shots on goal are often misleading.

Thursday, July 14, 2011

A Different Look at League Parity: MLB vs. EPL




I was intrigued after reading a post last month by Chris Anderson on his Soccer By the Numbers blog. The post compares the competitiveness of different football leagues in Europe; you can find it here. Anderson talks about "uncertainty of the outcome" as a measure of parity, which makes sense, as leagues where the outcome is not a sure thing are more equal.

With uncertainty of the outcome in mind, I took another approach to analyzing parity in a league: looking at the number of different champions. In the past 10 years, only 3 clubs have won the English Premier League, while in baseball, 9 different teams have won the World Series. Of course, this measure has flaws and is not a complete look at a league, but it does suggest that the outcome is not a foregone conclusion in baseball.

Does this mean professional baseball is a more balanced league than the EPL? If you've read Moneyball by Michael Lewis (and if you haven't, you should), you know that MLB faces payroll disparities similar to the ones in the EPL. So why the large difference in the number of winners? The answer is the playoffs.

In baseball, the 6 division winners plus 2 wild cards make the playoffs. There is one best-of-5 series, followed by 2 best-of-7 series, which adds up to only 11 wins to take home the World Series. Most people say the playoffs are different from the regular season: all previous records are thrown out the window and any team can beat any other team. While a team's play may change in the playoffs, there is a more important factor at work: the small sample of games. With such a small sample, it is not uncommon for a less skilled team to simply get lucky and beat a better team. Assume a team has a 30% chance of beating its opponent in any playoff game. That team still has a 16.3% chance of winning a best-of-5 series and a 12.6% chance of winning a best-of-7. All in all, upsets are not uncommon in the MLB playoffs, and these upsets are the force behind the multitude of World Series winners this decade.
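Those series probabilities can be checked exactly: a team wins a best-of series once it reaches the required number of wins, however many games its opponent takes along the way. A quick sketch:

```python
from math import comb

def series_win_prob(p, wins_needed):
    """Chance a team with per-game win probability p takes the series.

    Sums over k, the number of games the opponent wins before the
    team clinches with its final (wins_needed-th) victory.
    """
    return sum(
        comb(wins_needed - 1 + k, k) * p**wins_needed * (1 - p)**k
        for k in range(wins_needed)
    )

# A 30% underdog in a best-of-5 and a best-of-7:
print(round(series_win_prob(0.3, 3), 3))  # -> 0.163
print(round(series_win_prob(0.3, 4), 3))  # -> 0.126
```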

In contrast, we can look at the EPL. The EPL has no playoff system; the winner is determined by the most points after each team plays 38 games. Effectively, you can look at this as one long playoff with a much bigger sample size: 38 games. Historically, teams have had to win more than 25 games to win the league (with the exception of last season). If we look at an above-average team, what is their chance of winning more than 25 games? Let's take Liverpool from last season. For simplicity's sake I will only look at wins in this analysis. This may hurt a team with a lot of draws, but it makes the analysis much simpler. Last season Liverpool finished 6th with 17 wins, meaning they won about 45% of their games. I am also assuming that Liverpool's record is an accurate measure of their ability to win games; in other words, Liverpool really does have a 45% chance of winning any given game. The probability of Liverpool winning more than 25 games last year, given a 45% chance of winning each game, is 0.3%. For a team that won 25 games, or about 65% of their games (in the past 10 years that has been ManU, Chelsea, or Arsenal), the chance of winning more than 25 games is 42%. Because of the bigger sample size, upsets are much less likely in the EPL. Even for a good team like Liverpool (I don't think anyone would call Liverpool winning the league an upset), the probability of it happening is very low.
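The same kind of calculation produces the season-long numbers. A sketch, treating each of the 38 games as an independent trial with a fixed win probability (a simplification, as noted above; `prob_more_than` is my name for the helper):

```python
from math import comb

def prob_more_than(n_games, p_win, threshold):
    """P(wins > threshold) under a binomial model: independent games,
    fixed per-game win probability."""
    return sum(comb(n_games, k) * p_win**k * (1 - p_win)**(n_games - k)
               for k in range(threshold + 1, n_games + 1))

# Liverpool-like team: 45% per-game win chance, 38 games
print(round(prob_more_than(38, 0.45, 25), 4))
# Title contender: 25 wins from 38, about a 65.8% per-game chance
print(round(prob_more_than(38, 25/38, 25), 4))
```

The first probability comes out well under 1%, while the second is around 40%, which is the gap between a good team and a champion over a 38-game sample.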

Baseball's smaller sample size of games in the playoffs allows for upsets and gives the appearance of parity, with numerous teams winning the World Series. The EPL's larger sample size and lack of playoffs vastly reduces the chance of an upset, which leads to the same powerhouse teams winning over and over again. John Henry already has two championships this decade with the Red Sox. The way the leagues are set up, a third championship with the Red Sox is more likely than his first with Liverpool.

Tuesday, July 12, 2011

WPA and AGW Weekly Updates this Season

I just added the image on the right of the page ranking players by their WPA totals. The chart also includes each player's AGW and goal total for the season. I'll update it every week during the EPL season. Explanations of WPA and AGW are below.

WPA: Win Probability Added measures exactly what it sounds like it should: how much a player has added to their team's success through their goals. The way I calculate this is to sum how much each of a player's goals adds to the team's probability of winning. Raw goal totals are a flawed statistic because every goal is obviously not worth the same amount. The 5th goal in the 90th minute of a 5-0 win is not important; the 1st goal in the 90th minute of a 1-0 win obviously is. To quantify these values I accumulated the total record (wins, losses, and ties) of every game in the past 10 years in the EPL. This way, I could calculate the exact winning percentage at every different game situation for both teams. For example, I know that scoring the 2nd goal to make a game 2-0 at home in the 67th minute increases a team's chance of winning by 10.845983%. WPA takes into account the importance of each goal and shows how much, overall, a player has added to their team's chance of winning through their goals.

AGW: Average Goal Weight is simply how much, on average, each of a player's goals is worth. Mathematically, it is the player's total WPA divided by the number of goals they have scored. For example, one player may score only 5 goals on the season, whereas another may score 15. However, the first player could still have a higher AGW if they tended to score pivotal goals while the second player scored goals that mattered little.
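As a sketch, WPA and AGW can be computed like this. The `win_prob` table here is a tiny hypothetical stand-in for the full empirical table built from 10 seasons of results; every number in it is invented for illustration:

```python
# Hypothetical empirical win probabilities, keyed by
# (scoring team's goal difference before the goal, minute, venue).
# A real table would be built from 10 seasons of EPL match states.
win_prob = {
    (0, 67, "home"): 0.55,  # level at home in the 67th minute
    (1, 67, "home"): 0.80,  # one goal up at home in the 67th minute
    (0, 90, "home"): 0.40,
    (1, 90, "home"): 0.95,
}

def goal_wpa(diff_before, minute, venue):
    """Win probability added by a goal taking the scoring team
    from diff_before to diff_before + 1 at this point in the match."""
    return (win_prob[(diff_before + 1, minute, venue)]
            - win_prob[(diff_before, minute, venue)])

def agw(goal_events):
    """Average Goal Weight: total WPA divided by goals scored."""
    return sum(goal_wpa(*g) for g in goal_events) / len(goal_events)

# A player with two goals: one to go 1-0 up in the 67th minute at home,
# and a 90th-minute winner from level at home.
goals = [(0, 67, "home"), (0, 90, "home")]
total_wpa = sum(goal_wpa(*g) for g in goals)  # 0.25 + 0.55 = 0.80
```

The late winner is worth more than twice the earlier goal, which is exactly the distinction AGW is meant to capture.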

WPA and AGW are not perfect statistics, but they do provide a little more insight into a player's goal-scoring ability.

Monday, July 11, 2011

Answer to my Question via Twitter Posted Earlier

The question I asked earlier today via my Twitter account @SoccerStatistic was, "Which statistic correlates best with a team's point total?" The options were goals against, corners, goals for, and shots on target. The answer is surprising, to say the least.

Another way to ask the question is, "Given the goals against, corners, goals for, or shots on target total for a team in the EPL, which variable would allow you to best predict the team's point total?" It turns out the answer is not goals for, goals against, or even shots on target. Yep, it's the corner total. This means the number of corners a team accumulates during the season is a better indicator of the team's standing than any of the other variables. To me, this is mind-boggling. The point of the game is to score more goals than your opponent, yet the number of corners predicts point totals best.

The way to figure this out is with linear regressions between points and the 4 statistics in question, using season totals for EPL teams. A linear regression tells us how strong the linear relationship between two variables is with a number called the correlation coefficient. A value of 0 means there is absolutely no linear relationship, and a value of 1 means a perfect one. Below is a chart of the 4 variables and their correlation coefficients. The absolute values of the coefficients are given, as goals against obviously has a negative relationship with points.
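For anyone who wants to reproduce the check, here is a minimal sketch of the correlation calculation. The season totals below are made-up numbers for five hypothetical teams, not the actual EPL data:

```python
# Pearson correlation coefficient, computed from scratch.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative season totals for five hypothetical teams.
points        = [80, 71, 68, 47, 39]
corners       = [260, 240, 235, 190, 170]
goals_against = [33, 40, 43, 55, 66]

print(round(pearson_r(points, corners), 3))        # strong positive
print(round(pearson_r(points, goals_against), 3))  # strong negative
```

Taking the absolute value of each coefficient, as in the chart, lets the negative goals-against relationship be compared directly with the positive ones.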







Corners just edge out goals for and goals against as the strongest relationship. There is really only one explanation I can think of: corners result from pressure on the goal, and more corners mean more pressure, which corresponds with more wins and a higher point total. Still, the fact that the relationship is stronger than the relationships between points and goals for and points and goals against really amazes me.

A few things to point out: First, the best way to predict a team's success is still its goal differential. However, it is interesting that corners have the strongest relationship of the 4 variables above. Second, the relationship between corners and points shouldn't be read into too much. It doesn't mean that a team that goes out chasing corners will be more likely to win the game; rather, better teams tend to earn more corners through the way they play.

This also leads into something related that I will be working on in the near future. Is the number of goals scored by a player a good indication of the player's quality? Forwards are the highest-paid players in soccer, but what if goal scorers are significantly overvalued? Is it right to say "player X is better than player Y because he scored more goals this season"? I think there are a number of ways to test these questions, so check back in the coming week for some results and analysis.

Thursday, July 7, 2011

An Analysis of City Pre/Post Abu Dhabi Using the Transfer Price Index


Pretty soon I'm going to start writing the Manchester City statistical blog over at http://www.eplindex.com/ (@EPLIndex). I also just read Pay As You Play by Paul Tomkins. If you haven't read it and you're interested in statistics and football, you should really give it a read. The book outlines the trend in the EPL that money buys points, using what Tomkins calls the Transfer Price Index. More specifically, the higher the cost of the starting XI (Tomkins refers to this as £XI), the more a team tends to win. Of course, there are exceptions, but in general it seems to hold true. Anyway, while reading the book I thought it would be a good idea to analyze City using Tomkins's data, especially when I saw that my future fellow City blogger at EPL Index, Danny Pugsley (@danny_pugsley), wrote the "Expert View" for the City section. I'm no expert on the analysis Tomkins does, but I understand a good amount from reading the book. The subject rings especially true for City considering the recent Abu Dhabi takeover and sudden influx of large amounts of cash for the club.

Some notes before the analysis: First, the data I am using all comes from Pay As You Play, as mentioned above. Second, note that data is missing for years when City were not in the top flight. Third, the data in the book only goes through the 2009/2010 season, so the 2010/2011 season is missing.

Basically, I looked at 3 questions: 1.) Does City really spend more money since the Abu Dhabi takeover? 2.) Does a higher £XI cost equate to success for City in the EPL? 3.) Screw 1 and 2. What if City keeps buying Robinhos?

Does City really spend more money since the Abu Dhabi takeover?

Yeah, really dumb question; pretty obviously the answer is yes. Below is a graph comparing the league's average starting eleven cost with City's since 1992. In 2008/2009, City's £XI was higher than the league average for the first time since the 1994/1995 season. Remember, Abu Dhabi took over at the start of the 2008/2009 season. For the 2009/2010 season it skyrockets to over £120,000,000. City now has money to spend.



Does a higher £XI equate to success for City in the EPL?
The answer Tomkins gives for EPL clubs in his book is yes. Again, this makes sense: clubs that are able to spend more on players should be able to field higher-quality sides and win more. I wanted to analyze City's success specifically, so I looked at the data to see if their £XI rank in the EPL follows their league position. In other words, does City succeed more when they spend more? Looking at the graph below, the answer seems to be yes. The league position (green line) generally follows the club's £XI rank (orange line).



Screw 1 and 2. What if City keeps buying Robinhos?
The first two graphs seem to point to inevitable success for City. They have a lot of money and money can buy success, so they'll succeed, right? People will obviously point to some recent not-so-successful expensive purchases: Robinho, Jo, and Santa Cruz are the 3 big ones, with start percentages of 47, 16, and 16 respectively, despite a massive total cost of £69,000,000. A good way to show the efficiency of purchases is the cost per point used in Pay As You Play. Clubs that are efficient in this regard spend less money per point earned, while inefficient clubs spend more. The graph shows how much City spent each year for every point they earned. Not surprisingly, the cost per point has spiked since 2008. This may make it look like money is being wasted, but while City may not be getting as much bang for their buck, it likely won't matter in terms of success. According to Tomkins, the highest cost per point belongs to Chelsea in 2006/2007, and they finished 2nd that year. It seems that simply having a lot of money can trump the inefficiencies revealed by cost per point. Tomkins even refers to City's high cost per point on page 18: "Manchester City will certainly close the gap for this unwanted honour (although if they win the league, they won't care what people think; they could probably afford to pay £4m or £5m per point if it would guarantee them success)." So yes, City may make more poor purchases like Robinho, Jo, and Santa Cruz in the future. All in all, it doesn't matter that much: City has so much money that they'll win anyway.
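The cost-per-point metric itself is simple division. A sketch with made-up figures, not Tomkins's actual numbers:

```python
# Cost per point: the season's £XI divided by league points earned.
# The figures below are illustrative, not taken from Pay As You Play.
def cost_per_point(xi_cost, points):
    return xi_cost / points

# e.g. a hypothetical £120m starting XI earning 67 points
# works out to roughly £1.8m per point
city_like = cost_per_point(120_000_000, 67)
```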



Wednesday, July 6, 2011

Fun With Graphs

Often graphs can tell us a lot more about data than the numbers themselves; at the very least, they are usually easier to understand. I just downloaded Aaron Nielsen's (@ENBSports) amazing database from the 2010 MLS season and started playing around with it. Here are some interesting graphs I came up with:


This is probably a graph that already exists somewhere, but I made it anyway. It really highlights how much Seattle dominates attendance in MLS. I also added a bar for average attendance (which falls between Chicago and Salt Lake) for comparison.




Another graph highlighting domination (in this case probably in a negative sense) of one team over all the others. Every other team falls in the range of 1.4 to 1.8 cards per game, but it's clear that Toronto is an outlier at 2.17 cards per game.


This graph once again shows one team dominating a statistic. Dallas scored almost 20% of their goals from PKs; that's 1 of every 5 goals. That share nearly doubled every other team in MLS last season and was 10 times Seattle's percentage. Hmm. I'm not exactly sure what the explanation is. Is Dallas really good at diving? Are they favored by refs? Do they just get a lot of chances in the box? Something to look at in the future.


For the percentage of goals scored outside the 18, I took the 2 lowest teams, the 2 highest, and the league average. Dallas (likely because of their massive share of goals from PKs) and Columbus have the lowest percentages of goals scored from outside the 18; New England and Chivas USA have the two highest. This shows that not every team in MLS scores goals the same way. A high percentage of goals from outside the 18 doesn't necessarily mean a team is creative or better at long-distance shooting. More likely, it tells us the team struggled to score inside the 18, where the bulk of goals come from. Dallas and Columbus finished 4th and 5th last year, respectively, while New England and Chivas USA finished 13th and 15th.

Thursday, June 30, 2011

Another Look at Referee Bias: Extra Time Given

Yesterday I looked at referee bias in this past EPL season. It turned out that while referees appeared to favor the home team in areas like fouls, yellow cards, and red cards, the differences are more likely due to the advantage the home team has in a game. One statistic I did not look at, though, is the amount of extra time given.

Extra time has nothing to do with the relative abilities of the teams or the score of the game, unlike many other parts of soccer. In theory, it should be an objective amount that does not depend on whether the home or away team is leading. Yet in almost every game you see the home crowd jeering for the ref to end the game if their team is ahead, or cheering even louder for their team to come back if they are trailing. Based on this, referee bias would show up as shorter games when the home team is leading compared with when the away team is leading, the obvious logic being that the referee gives in to the home fans and unconsciously adjusts the extra time given.

To test this, I compared the length of games in which the home team was leading with the length of games in which the away team was leading. If there is indeed a referee bias, we should see that games are shorter when the home team is ahead.

Below are histograms (graphs showing how often each game length occurred) for the two categories above.


We can see the graphs are very similar, except for the tail on the right end of the away-win times. This is in line with our hypothesis that games in which the away team is leading get more extra time: referees gave more than 10 minutes of stoppage time to trailing home teams more often than to trailing away teams.

As in the previous post on referee bias, I ran a statistical test to see whether the difference is actually significant (in other words, not simply due to randomness). The mean length of games with the home team leading was 96.36 minutes, while the mean with the away team leading was 96.56 minutes.

Using the data, I ran a two-sample t-test. Basically, a t-test takes into account the number of observations, the means, and the standard deviations (a measure of spread) of two samples and tests whether the underlying means are equal. The test produces a p-value between 0 and 1, which answers the question: if the two means were actually the same (the time given were the same whether the home or away team was leading), what is the probability of seeing a difference at least as large as the one we actually observed? A p-value near 0 suggests the means are different, and one near 1 suggests they are the same. Generally, a p-value of .05 or lower is considered statistically significant, meaning we can rightfully say the means are not the same.
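A sketch of the test from summary statistics. The sample sizes and standard deviations below are hypothetical stand-ins, since the post only reports the means; with samples this large the t distribution is close to normal, so the p-value here uses a normal approximation rather than the exact t distribution:

```python
import math

# Welch's two-sample t statistic from summary statistics, with a
# normal approximation for the two-sided p-value (fine for large n).
def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    t = (mean1 - mean2) / se
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided tail probability
    return t, p

# Hypothetical: 190 games with the home team leading averaging
# 96.36 minutes vs 123 with the away team leading averaging 96.56,
# with an assumed spread of 1.5 minutes in each group.
t, p = welch_t(190, 96.36, 1.5, 123, 96.56, 1.5)
```

With numbers in this range the p-value lands well above .05, matching the conclusion below that the difference in means is not statistically significant.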

After doing the test, the p-value I got was 0.2013. While the raw means hint that referees give more time to trailing home teams, the difference is not statistically significant. In other words, we cannot conclude that referees give more extra time to trailing home teams than to trailing away teams.

All in all, referees are doing a good job in terms of not favoring home teams over away teams. Next time someone complains that the ref is favoring the home team, you can just tell them to look at the data.

Tuesday, June 28, 2011

Checking for Referee Bias in the 2010 EPL Season

Referee bias is a hot topic in any sport, not just soccer. People often accuse referees of favoring the home team in matches. The accusation makes sense: with a stadium full of fans rooting for one team, you would think it would be hard not to favor the home team just a little bit.


But is there a bias evident in the data? To check, I looked at data from last season in the EPL. Referees have control over a number of parts of the game; the ones I examined were fouls, yellow cards, red cards, and offsides. If refs exhibit a home bias in the EPL, they should call more fouls and offsides on the away team and give it more yellow and red cards. Pretty simple logic. Let's look at the data piece by piece.


Fouls:

Clearly, the graph shows that the average number of fouls is indeed higher for the away team. The away team is called for 13.04474 fouls per game on average, while the home team is called for only 12.09737. That's a difference of about a foul per game. I also ran a two-sample t-test for significance. Basically, a t-test takes into account the number of observations, the means, and the standard deviations (a measure of spread) of two samples and tests whether the underlying means are equal. The test produces a p-value between 0 and 1, which answers the question: if the two means were actually the same (fouls were the same for home and away), what is the probability of seeing a difference at least as large as the one we actually observed? A p-value near 0 suggests the means are different, and one near 1 suggests they are the same. Generally, a p-value of .05 or lower is considered statistically significant, meaning we can rightfully say the means are not the same.

Anyway, the p-value I got from the t-test on home and away fouls was .0003. In other words, we can say refs called more fouls on the away team at a statistically significant level.

Yellow Cards:



Next I looked at yellow cards. Again, the graph shows away teams received far more yellow cards on average than home teams: the home team averaged 1.413158 per game, while the away team averaged 1.955263, a difference of about .5 per game. I ran a t-test like the one above for fouls, this time for yellow cards. The p-value came out as effectively 0, meaning away teams were given more yellow cards than home teams at a statistically significant level.

Red Cards:



Third, I looked at red cards. If there is a home referee bias present, we would expect to see more red cards given to away teams. Like fouls and yellow cards above, the bias seems to continue. Looking at the graph, there were clearly more red cards given to away teams on average: home teams received .0605263 per game, and away teams received .1184211. In other words, away teams receive about twice as many red cards as home teams. Again, are these numbers significant? It turns out that, like fouls and yellow cards, they are. The p-value was .0042, again telling us that away teams received more red cards per game at a statistically significant level.

Offsides:

Finally, I looked at offsides. In this case, the home team was actually called more often. Huh? What's going on here? Home teams, on average, were called offside 2.35 times per game, while away teams were called only 2.223684 times per game. Are referees unbiased for offsides while being biased for fouls, yellows, and reds?


As always, we should check other possible explanations. One explanation for these four differences in calls could come not from referee bias but from the advantage home teams have over away teams. Maybe teams that are losing naturally foul more, receive more yellow and red cards, and get called offside less. It's obvious that home teams have a big advantage in the EPL: to name just one statistic, home teams scored an average of 1.63 goals per game, while away teams scored only 1.01. That's a pretty wide margin.

If the apparent bias were actually due to the home team's advantage, then losing teams should follow the same pattern as away teams: more fouls, more yellow and red cards, and, most importantly, fewer offside calls. So let's look at the data for losing versus winning teams side by side with the data for away versus home teams.

Fouls:
Disregarding the draws column on the far left, the graphs look similar. Both away teams and losing teams are called for more fouls.

Yellow Cards:
Again, if we look at the loss and win bars, they coincide closely with the away and home bars, respectively.

Red Cards:
Three in a row. The bars look strikingly similar for away versus home and loss versus win.

Offsides: Finally, we should expect losing teams to be called offside less than winning teams, just as away teams are called offside less than home teams...
Look at that! Winning teams are indeed called for more offsides than losing teams.


Conclusion: Based on the first half of the post, it truly appears that referees favor the home team with their calls. In fact, I convinced myself that was the case for a little while. However, the pattern really comes down to the advantage a home team has in a game rather than any referee bias. While this post doesn't show anything revolutionary about referee bias (admittedly, it would have been pretty cool to make a groundbreaking discovery proving refs favor home teams), it is a good reminder that data can be deceptive depending on how you look at it. It's important to examine all angles and understand what is going on beneath the surface before jumping to any conclusions.