Thursday, June 30, 2011

Another Look at Referee Bias: Extra Time Given

Yesterday I looked at referee bias in this past season for the EPL. It turned out that while referees favored the home team overall in parts of the game like fouls, yellow cards, and red cards, this is more likely due to the advantage the home team has in a game. One statistic I did not look at, though, is the amount of extra time given.

Extra time has nothing to do with the relative abilities of the teams or the score of the game, unlike many other parts of soccer. In theory, it should be an objective amount that does not depend on whether the home or away team is leading. In almost every game, though, you see the home crowd jeering for the ref to end the game if their team is ahead, or cheering even louder for their team to come back if they are trailing. Based on this, referee bias would be present if games in which the home team is leading are shorter than games in which the away team is leading. The logic is that the referee gives in to the home team's fans and unconsciously adjusts the extra time he gives.

To test this, I compared the length of games in which the home team was leading with the length of games in which the away team was leading. If there is indeed a referee bias, then we should see that games are shorter when the home team leads than when the away team leads.

Below are histograms (graphs showing the frequency of each dependent variable value) of the length of the game for the two categories above.


We can see the graphs are very similar, except for the tail on the right end of the away win graph. This is in accordance with our hypothesis that away teams that are leading face more extra time. It seems refs gave trailing home teams more than 10 minutes of stoppage time more often than they gave trailing away teams that much.

Like the previous post looking at referee bias, I did statistical analysis to see if the difference was actually statistically significant (in other words, the difference was not due to randomness). The mean length of game for leading home teams was 96.36 minutes, while the mean length of games for leading away teams was 96.56.

Using the data, I ran a two sample t-test. Basically, a t-test takes into account the number of observations, mean, and standard deviation (a measure of spread) of each group and tests whether the two means are equal. In the end, the test gives a p-value between 0 and 1. A p-value answers the question: if the two means were actually the same (time given for leading teams was the same for home and away), what is the probability that we would see a difference in the means at least as large as the one we actually saw? A p-value near 0 suggests that the means are different, while a large p-value means the data are consistent with the means being the same. Generally, a p-value of .05 or lower is considered statistically significant, meaning we can reasonably say the means are not the same.
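To make the test concrete, here is a minimal Python sketch. The post does not say which version of the t-test was run, so this assumes Welch's version (no assumption of equal spread) and approximates the p-value with a normal distribution, which is only reasonable for large samples. The function name `welch_t_test` and the game lengths are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def welch_t_test(a, b):
    """Return (t statistic, approximate two-sided p-value) for two samples."""
    # Standard error of the difference in means, without assuming equal spread.
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    t = (mean(a) - mean(b)) / se
    # Normal approximation to the t distribution; rough for small samples.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Made-up game lengths in minutes, for illustration only.
home_leading = [95.5, 96.0, 97.2, 96.1, 95.8, 96.9, 96.4, 95.9]
away_leading = [96.2, 96.8, 95.9, 97.5, 96.0, 96.6, 97.1, 96.3]

t, p = welch_t_test(home_leading, away_leading)
print(f"t = {t:.3f}, p = {p:.3f}")
```

With a full season of games the normal approximation gets close to what a statistics package would report using the t distribution itself.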

After doing the test, the p-value I got was 0.2013. While the means suggest that referees are giving more time to trailing home teams, the difference is not statistically significant. In other words, we cannot conclude that referees give more extra time to trailing home teams compared with trailing away teams.

All in all, referees are doing a good job in terms of not favoring home teams over away teams. Next time someone complains that the ref is favoring the home team, you can just tell them to look at the data.

Tuesday, June 28, 2011

Checking for Referee Bias in the 2010 EPL Season

Referee bias is a hot topic in any sport, not just soccer. People often accuse referees of favoring the home team in matches. The accusation makes sense: with a stadium full of fans rooting for one team, you would think it would be hard not to favor the home team just a little bit.


But is there a bias evident in the data? To look at this, I looked at data from last season in the EPL. Referees have control over a number of parts of the game. The parts I looked at were fouls, yellow cards, red cards, and offsides. If refs exhibit a home bias in the EPL, they would call more fouls and offsides and give more yellow and red cards to the away team. Pretty simple logic. Let's look at the data piece by piece.


Fouls:

Clearly, the graph shows that the average number of fouls is indeed higher for the away team. The away team is called for, on average, 13.04474 fouls, while the home team is called for only 12.09737. That's a difference of about a foul per game. I also ran a two sample t-test to test for significance. Basically, a t-test takes into account the number of observations, mean, and standard deviation (a measure of spread) of each group and tests whether the two means are equal. In the end, the test gives a p-value between 0 and 1. A p-value answers the question: if the two means were actually the same (fouls were the same for home and away), what is the probability that we would see a difference in the means at least as large as the one we actually saw? A p-value near 0 suggests that the means are different, while a large p-value means the data are consistent with the means being the same. Generally, a p-value of .05 or lower is considered statistically significant, meaning we can reasonably say the means are not the same.

Anyways, the p-value I came up with after running a t-test for home and away fouls was .0003. In other words, we can say that refs called more fouls on the away team at a statistically significant level.

Yellow Cards:



Next I looked at yellow cards. Again, looking at the graph, away teams received way more yellow cards on average than home teams. Specifically, the home team averaged 1.413158 per game, while the away team averaged 1.955263 per game. That's a difference of about .5 per game. Again, I ran a t-test similar to the one above for fouls, this time for yellow cards. This time, the p-value rounded to 0. This means more yellow cards were given to away teams than home teams at a statistically significant level.

Red Cards:



Third, I looked at red cards. If there is a home referee bias present we would expect to see more red cards given to away teams. Like fouls and yellow cards above, the bias seems to continue. Looking at the graph, there were definitely more red cards given to away teams on average. Home teams received .0605263 per game, and away teams received .1184211. In other words, away teams receive about twice as many red cards as home teams. Again, are these numbers significant? Turns out, like fouls and yellow cards, they are. The p-value was .0042, again telling us that away teams received more red cards per game at a statistically significant level.

Offsides:

Finally, I looked at offsides. In this case, the home team was actually called more for offsides. Huh? What's going on here? Home teams, on average, were called for offsides 2.35 times per game, while away teams were only called 2.223684 times per game. Are referees not being biased for offsides, while they are for fouls, yellows and reds?


As always, we should check all possible scenarios. The differences in these 4 types of calls could come not from referee bias, but from the advantage that home teams have over away teams. Maybe teams that are losing naturally foul more, receive more yellow and red cards, and get called for offsides less. It's obvious that home teams have a big advantage over away teams in the EPL: to name just one statistic, home teams scored, on average, 1.63 goals per game, while away teams scored only 1.01 goals per game. This is a pretty wide margin.

If the apparent bias was actually due to the home team's advantage, then losing teams would follow the same pattern as away teams. In other words, losing teams would be called for more fouls and receive more yellow and red cards. Most importantly, losing teams would be called for offsides less. Well, let's look at the data for losing teams compared with winning teams side by side with the data for away teams compared with home teams.
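The side by side comparison boils down to averaging the same statistic twice, once grouped by venue and once grouped by result. Here is a minimal Python sketch of that idea. The record layout, the helper name `mean_by`, and all the foul counts are assumptions made up for illustration, not the actual data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-team match records: (venue, result, fouls). Values are made up.
records = [
    ("home", "win", 11), ("away", "loss", 14),
    ("home", "loss", 13), ("away", "win", 12),
    ("home", "draw", 12), ("away", "draw", 13),
    ("home", "win", 10), ("away", "loss", 15),
]

def mean_by(records, key_index, value_index=2):
    """Average the value column within each group of the key column."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key_index]].append(row[value_index])
    return {key: mean(values) for key, values in groups.items()}

by_venue = mean_by(records, key_index=0)   # home vs away
by_result = mean_by(records, key_index=1)  # win vs draw vs loss
print(by_venue)
print(by_result)
```

If the away and loss averages move together, as they do in this toy data, the home advantage explanation is at least as plausible as referee bias.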

Fouls:
Disregarding the draws column on the far left, the graphs look similar. Both away teams and losing teams are called for more fouls.

Yellow Cards:
Again, if we look at the loss and win bars, they coincide closely with the away and home bars, respectively.

Red Cards:
Three in a row. The bars look strikingly similar for away versus home and loss versus win.

Offsides: Finally, we should expect losing teams to be called for fewer offsides than winning teams, just like how away teams are called for fewer offsides than home teams...
Look at that! Winning teams are indeed called for more offsides than losing teams.


Conclusion: Based on the first half of the post, it truly appears that referees favor the home team with their calls. In fact, I convinced myself that was the case for a little bit. However, it really comes down to the advantage a home team has in a game instead of any referee bias. While this post doesn't show anything revolutionary about referee bias (admittedly, it would have been pretty cool to make a groundbreaking discovery proving refs favor home teams), it is a good reminder that data can often be deceptive in the way that you look at it. It's important to look at all angles to really understand what is going on beneath the surface before you jump to any conclusions.

Friday, June 24, 2011

Win Probability Graphs and Regressions

Earlier in this blog I wrote a post on Win Probability in every possible game situation. I posted the Excel files but they aren't as informative as a graph. I made up graphs for home and away and +2, +1, 0, -1, and -2 goal differentials for every minute. I didn't make up graphs for GD's bigger than that because there is basically no point. The fact that a team has a 99.9% win probability when they are up 4-0 isn't that exciting.

Each graph has the line of best fit and a scatter plot of the data. The equations for those lines are also on the graph along with the r^2 value for correlation. The graphs are below to look at.

Some interesting things I noticed:

-Most graphs show a very strong relationship between minute and win probability. The only ones that don't really are when teams are away and are tied, when teams are home and up by 2, and when teams are away and down by 2. Not really sure why these three stick out.

-Some of the graphs have linear relationships, while others are quadratic. Again, not really sure why this is. Why does the win probability when you are at home and tied follow a quadratic curve while the win probability of a team at home and down by 1 is linear? Maybe people have ideas as to why this happens.

-For some of the scenarios (the +2 and -2 GD's for home and away) I didn't start the graph at minute 1 because the data points were a little all over the place. This happens because there are so few data points that the win probabilities are skewed. Example: There aren't many times when a team has a 2-0 lead in the 5th minute.

-I added the graphs of all the goal differentials together for comparison, one for home and one for away. They're interesting to look at.

-Finally, because of this we now have some basic equations to model a team's chance of winning a game. Feel free to use them and check them out.
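For anyone who wants to reproduce a line of best fit and its r^2 value, here is a minimal Python sketch of an ordinary least-squares fit. It only covers the linear case, not the quadratic curves, and the (minute, win probability) points are made up for illustration rather than taken from the graphs. The function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept, and r^2 for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, r^2 is the squared correlation between x and y.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Made-up (minute, win probability) points for a team leading by one goal.
minutes = [10, 30, 50, 70, 90]
win_prob = [0.62, 0.68, 0.76, 0.85, 0.97]

slope, intercept, r2 = fit_line(minutes, win_prob)
print(f"win_prob = {slope:.4f} * minute + {intercept:.3f}, r^2 = {r2:.3f}")
```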

Thursday, June 23, 2011

WPA and AGW: Van Persie is overrated

Well, maybe the title is a little exaggerated. What I really mean is that the value of Van Persie's goals last season is overstated. On the other hand, Darren Bent's goals were undervalued. The explanation comes from WPA, or "win probability added".

I explained win probability in the last post. If you haven't read it, check it out here. Because we have a probability for every game situation, I was able to weight goals by the added win probability a team has from that goal. In soccer, it is a little more complicated because teams can tie. To solve this, I use win percentages instead of win probability. To get a team's win percentage you weight a win as 1 point, a draw as 1/3 of a point, and a loss as 0. The sum of these divided by the number of games a team has played gives us the win percentage. I guess in this case it should be win percentage added instead.

The added part comes in by calculating how much a goal adds to a teams win percentage. Here are a couple of examples:
-A goal in the 95th minute to put the home team up by a goal would have a WPA of .666666. A tie game in the 95th minute gives the home team a win percentage of .33333 (almost every time they will draw the game). However, in this case the home team scored. Now the score is 1-0 in the 95th minute. Now the home team's win percentage is almost 1 (almost every time they will win the game). To get the WPA of the goal we subtract the win percentage before the goal (.3333) from the win percentage after the goal (1). This gives us a WPA of .666666

Basically, WPA gives more value to goals that are more important to the team. In the example above, that goal is obviously very important to the team. However, a goal in the 90th minute to put a team up by 6 would be worthless to the team. That goal would have a WPA of 0.
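The arithmetic in these examples can be sketched in a few lines of Python. The function names `win_percentage` and `wpa` are made up for illustration, and the sketch treats "almost every time" as exactly certain so the numbers match the example.

```python
def win_percentage(p_win, p_draw):
    """Win percentage as defined in the post: a win counts 1, a draw 1/3, a loss 0."""
    return p_win + p_draw / 3

def wpa(before, after):
    """Win percentage added by a goal; each argument is a (p_win, p_draw) pair."""
    return win_percentage(*after) - win_percentage(*before)

# The 95th-minute example: a certain draw becomes a certain win.
late_winner = wpa(before=(0.0, 1.0), after=(1.0, 0.0))
print(round(late_winner, 4))  # 0.6667

# A goal that changes nothing about the expected result has a WPA of 0.
print(wpa(before=(1.0, 0.0), after=(1.0, 0.0)))  # 0.0
```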

I calculated the WPA of the top scorers in the EPL last season (players with more than 10 goals). Interestingly enough, the list shook up a bit. The table is below.


Notably, Darren Bent moves up to first on the list, and Van Persie moves down to 8th. Beyond this, I wanted to know which players tend to score more important goals and which players score non-important goals. Obviously, Van Persie has a higher WPA than most of these players because he scored a lot more goals than them. 

The way I did this was to calculate the average WPA of a goal by a player. I called this the Average Goal Weight, or AGW. The list of the AGW versus goals is below.
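As a quick sketch of the calculation, AGW is just a player's total WPA divided by his number of goals. The Python below uses made-up WPA values for two hypothetical players, not the real table, to show how a player can lead on total WPA and still trail on AGW. The function name `average_goal_weight` is an assumption.

```python
def average_goal_weight(goal_wpas):
    """Mean WPA per goal; goal_wpas holds one WPA value for each goal scored."""
    return sum(goal_wpas) / len(goal_wpas)

# Made-up WPA values, one per goal, for two hypothetical players.
player_a = [0.45, 0.30, 0.50, 0.35]  # few goals, mostly decisive
player_b = [0.40, 0.10, 0.05, 0.30, 0.15,
            0.20, 0.25, 0.10, 0.20, 0.05]  # more goals, many in settled games

for name, goals in (("A", player_a), ("B", player_b)):
    total = sum(goals)
    print(f"Player {name}: total WPA {total:.2f}, AGW {average_goal_weight(goals):.2f}")
```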


Not surprisingly, Van Persie moves to the bottom of the list, and Bent stays at the top. So what does all this mean? I don't think it's a good idea to jump to the conclusion that Van Persie is not a good goal scorer. Despite everything, he scored 18 goals last season, which is good no matter how you score them. However, I think AGW is a good supplement to the top goal scorers list. Last season, Bent's goals added, on average, a whole 10 points more to the winning percentage than Van Persie's.

You shouldn't base your entire assessment of a goal scorer only on AGW. However, I think it's something to take into account.

Tuesday, June 21, 2011

Win Probability Added in Soccer

Everyone hears it all the time: A 2-0 lead is the most dangerous lead in soccer. But is it really? Thinking about this led me to wonder exactly how dangerous leads are in soccer. In fact, I wanted to find out what win, loss and draw percentages a team had in all situations. The best way to find this out is to analyze a lot of games and calculate the win, loss, and draw percentages in every possible game situation. To do this, I took into account the venue of the game (home versus away), the goal differential between the teams (team is up by 2, team is up by 1, game is tied, team is down 1, team is down by 2 etc) and the minute of the game. I took goal differentials of -5 to 5 and minutes 1-90. I thought these were probably the most important factors. You could maybe take cards into account too, but this is hard and makes it pretty complicated. Overall, there are 2*11*90 = 1980 combinations of game situations.

The idea relates to WPA in baseball. Basically, WPA is a measurement of how much a play adds to the chance the team wins a game. For example, how much does a 2 run home run help the team's chances in the 6th inning? In soccer, a question would be how much does a goal at home to give you a 2 goal lead in the 67th minute change your winning percentage in the game? Pretty simple concept.

To get the percentages for all of these situations I imported game data from the past 10 years of the EPL in to Excel. My Excel skills are not the best but with some help I was able to eventually get these to convert in to percentages for each game situation mentioned above. The basic idea is this: how often do teams with a 1 goal lead in the 40th minute at home win? How often do they draw? How often do they lose? This was done for every minute and every goal differential both home and away. The results truly tell us how dangerous a variety of leads are.
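The original work was done in Excel, but the same tally can be sketched in Python. This is only an illustration of the idea, not the actual spreadsheet logic: the input format (lists of goal minutes per match), the function names, and the four matches are all made up, and the sketch does not limit goal differentials to the -5 to 5 range.

```python
from collections import defaultdict

def tally_states(matches):
    """Count final results for every (venue, goal_diff, minute) state reached."""
    counts = defaultdict(lambda: {"win": 0, "draw": 0, "loss": 0})
    for home_goal_minutes, away_goal_minutes in matches:
        final_diff = len(home_goal_minutes) - len(away_goal_minutes)
        for minute in range(1, 91):
            # Goal differential from the home side's view at the end of this minute.
            diff = (sum(m <= minute for m in home_goal_minutes)
                    - sum(m <= minute for m in away_goal_minutes))
            for venue, sign in (("home", 1), ("away", -1)):
                team_final = sign * final_diff
                result = "win" if team_final > 0 else "draw" if team_final == 0 else "loss"
                counts[(venue, sign * diff, minute)][result] += 1
    return counts

def percentages(counts, state):
    """Convert the tallies for one state into fractions, or None if never seen."""
    tally = counts.get(state)
    if not tally:
        return None
    total = sum(tally.values())
    return {result: n / total for result, n in tally.items()}

# Made-up matches: (minutes of home goals, minutes of away goals).
matches = [([12, 70], [55]), ([30], [80]), ([], [44]), ([5, 60, 88], [])]
table = tally_states(matches)
print(percentages(table, ("home", 1, 40)))
```

With only four matches the percentages are meaningless, which is the same small-sample problem that makes rare situations noisy in the real data.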

Here's an example: The team is away, the game is tied, and it's the 67th minute. Any guesses on the win, draw and loss percentages? Well turns out the team has about a 19% chance of winning, a 51% chance of drawing, and a 30% chance of losing.

We can also test the "2 goal leads are the most dangerous leads theory". Let's say the team is home and it's the 35th minute. Here are the percentages for 1 and 2 goal leads:

1 goal lead: win: 78%, draw: 16%, loss: 6%
2 goal lead: win: 96%, draw: 2%, loss: 2%

In both respects the 2 goal lead is safer: the team wins more often and loses less often. The same holds true for all minutes and both home and away teams. A 2 goal lead is in fact not the most dangerous lead in soccer.

I'm also in the process of making a Java Applet to post here that lets you input the goal differential, venue, and minute, and spits out the win, loss and draw percentages. Again, my Java programming talents are not the best, so no promises on anything getting finished or uploaded soon. I uploaded the actual Excel files to a Google Sites page though if you're curious to look at other percentages. If you want to download the files click here and type in the search bar ".htm" without the quotes to find the files.

Next, I'm planning on relating this more to how WPA is used in baseball by using it to analyze specific players by calculating how much percentage they add to their team winning by scoring goals. Not sure how useful this statistic will actually be, but it's worth a shot.

Monday, June 20, 2011

Can a Formula for Points Predict the Final Standings?

Based on the post below determining a formula for points based on goals for and goals against, I naturally wondered if this formula was in any way projective. At the half way point of the season, could the goals for and goals against of a team predict the final standings?

The simplest way to do this is to compare the points formula standings accuracy with the accuracy of the standings by just doubling the points of teams at the half way point of the season. At first glance, it seems that the points formula is projective. The average points error for the MLS is 4.26, which is not off by much. However, if we compare it with the accuracy of simply doubling the midyear points, it is not as projective. The average points error for the MLS with this approach is only 2.75. Clearly, the points formula is not very good at predicting the final standings. You'd be better off just looking at the mid year standings.
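The comparison can be sketched in Python as follows. The post does not spell out how "average points error" is computed, so this assumes it is the mean absolute difference between projected and actual final points. All the numbers are made up for four hypothetical teams, and `formula_projection` is a stand-in for whatever the points formula would output.

```python
def mean_abs_error(predicted, actual):
    """Average absolute gap between predicted and actual final points."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Made-up numbers for four hypothetical teams, for illustration only.
final_points = [56, 48, 41, 30]
midseason_points = [27, 25, 19, 16]
formula_projection = [51.0, 50.5, 44.0, 35.5]  # stand-in for points-formula output

# The baseline: assume the second half of the season repeats the first.
doubled = [2 * p for p in midseason_points]

print("doubling:", mean_abs_error(doubled, final_points))
print("formula: ", mean_abs_error(formula_projection, final_points))
```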

Below is the chart for the 2009 season showing what is explained above.

An Accurate Formula for Points Using GF and GA

It may seem pretty obvious that goals for and goals against should be a good predictor of success in soccer. Teams that score a lot and concede a little should win more than teams that don't score often and concede a lot of goals. But how predictive is it? How accurately can we determine the standings based solely on the goals for and goals against of a team in a season?

Apparently, if the formula is tweaked enough, we can get it down to an average error of just above 3 points across all teams. Considering 3 points equates to only one win in most leagues, goals for and goals against can narrow down the error to only one win. Not bad.

The equation I used to do this is based on Bill James' Pythagorean Expectation formula. The formula is pretty simple: (Runs Scored)^2/((Runs Scored)^2 + (Runs Allowed)^2). Basically, Bill James calculates the winning ratio using the formula above.

Of course, soccer is not the same as baseball and a few adjustments have to be made. First, there are draws in soccer. How can the formula be adjusted to take into account draws? If you think about it, a draw is basically equal to 1/3 of a win (a draw counts for 1 point, and a win counts for 3 points). Therefore, we can calculate the winning percentage of a team as (wins + draws/3)/(total number of games), which is the same as points/(total number of games * 3). We can then convert the winning percentage back into points by multiplying it by 3 and by the number of games.

The next change I made was to change the exponent. Using data from the tables from 2000-2010 for the MLS and 97-98 to 09-10 for the EPL, the exponent that minimized the average error was 1.4.

Next, I added a coefficient to further minimize the average error. After fooling around with different exponents and coefficients, the combination of 1.4 for the exponent and .9 for the coefficient got the average error to right around 3 for both the MLS and the EPL. This gives an equation of .9*(GF^1.4)/((GF^1.4)+(GA^1.4)).
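Here is the equation as a short Python sketch. It assumes the result of the equation is a winning percentage that gets converted to points by multiplying by 3 points per game and by the number of games played. The function names and the example team are made up for illustration.

```python
def expected_win_pct(gf, ga, exponent=1.4, coefficient=0.9):
    """Adjusted Pythagorean expectation from goals for (gf) and goals against (ga)."""
    return coefficient * gf ** exponent / (gf ** exponent + ga ** exponent)

def expected_points(gf, ga, games, exponent=1.4, coefficient=0.9):
    """Scale the winning percentage to points (assumes 3 points available per game)."""
    return expected_win_pct(gf, ga, exponent, coefficient) * 3 * games

# Hypothetical team: 60 goals for, 40 against, over a 38-game season.
print(round(expected_points(60, 40, 38), 1))

# A team that scores exactly as many as it concedes gets the coefficient times 1/2.
print(round(expected_win_pct(50, 50), 2))  # 0.45
```

Keeping the exponent and coefficient as parameters makes it easy to try other values and see how the average error changes.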

Overall, the average error for the EPL was 3.21 points and the average error for the MLS was 2.85 points.

Interestingly enough, the equation works for both the EPL and the MLS. Doesn't seem to matter across these two leagues. I haven't looked, but I assume it would work for other leagues across the world.

Below is an example of one of the charts I created. This one is using the tables for one year in the MLS.