Pages

Monday 31 October 2016

Passing Motifs at a Player Level: Player Passing Style

This is a pretty exciting entry, so bear with me if it gets a bit long; I think it’s worth it…

Ever since the first entry on Passing Motifs, I have mentioned the potential of extrapolating the methodology to study passing styles at a player level. That first entry mentioned the idea set forth by Javier Lopez and Raul Sanchez to answer the question “Who can replace Xavi?”. Nevertheless, that particular example always left me wanting more, because the outcome was noticeably skewed towards players from Barcelona and a few other teams like Arsenal and, surprisingly, even Swansea. It made me think that the methodology was ignoring individual player traits and instead picking up stats that reflect the team the player plays for, not the player himself.

Ever since, I’ve been thinking about the best way to extract player passing style from passing motifs. Here are some of the ideas I’ve had:
  • One first objective is to neutralise the effect of a team’s passing style on its players. If a team proportionately uses ABAB a lot, then inevitably so will its players. Therefore, if you put Fernandinho in Barcelona, his motif frequencies will start to resemble those of the whole team without this having been something inherent to him all along. The idea I had was to look at how a player’s relative motif frequencies diverge from his team’s frequencies in each match. That is to say, if in a match 40% of Arsenal’s motifs were ABAC and 43% of the motifs Coquelin was involved in were ABAC, then Coquelin had a +3% for that motif in that match. Averaging over the whole season, Coquelin can be seen as a 5-dimensional vector in which each entry corresponds to his average divergence for each of the 5 motifs. When the performance of this vectorisation is measured through the methodology outlined in my previous entry, using data from the 2014-15 and 2015-16 seasons of the Premier League (only players with at least 18 appearances, to avoid outliers), this was the result:


The fairly negative z-scores reveal that this methodology has agreeable stability across those two seasons and is therefore picking up on some underlying quality of the players’ passing style.
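To make the divergence idea concrete, here is a minimal numpy sketch of the computation; the match counts are invented purely for illustration:

```python
import numpy as np

# Motif order throughout: ABAB, ABAC, ABCA, ABCB, ABCD.

def divergence_vector(team_counts, player_counts):
    """Player's relative motif frequencies minus his team's, for one match."""
    team = np.asarray(team_counts, dtype=float)
    player = np.asarray(player_counts, dtype=float)
    return player / player.sum() - team / team.sum()

# One match: the team completed these motif totals...
team_match = [12, 40, 18, 20, 10]
# ...and the player took part in this many of each:
player_match = [5, 18, 6, 7, 2]

match_divergence = divergence_vector(team_match, player_match)

# Over a season, the player's style vector is his average divergence:
season = [match_divergence]            # in reality, one vector per match
style_vector = np.mean(season, axis=0)
print(style_vector)
```

Note that the entries of each match’s divergence vector sum to zero by construction, since both sets of relative frequencies sum to one.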

  • Just as we did for team motifs, instead of considering the raw counts of motifs a player performed, we consider each player’s performance in a match as a 5-dimensional vector in which each entry is the percentage of the player’s total motifs corresponding to that motif. So we can represent a match played by Romelu Lukaku as 5% ABAB, 13% ABAC, 25% ABCA, etc. Averaging over a whole season, each player is represented by a 5-dimensional vector.


Once again, we’re reasonably happy that this vectorisation is picking up on stable player qualities.
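A quick sketch of this second vectorisation, again with made-up counts:

```python
import numpy as np

# Rows = matches in a hypothetical season for one player,
# columns = motif counts in the order ABAB, ABAC, ABCA, ABCB, ABCD.
matches = np.array([
    [2, 5, 9, 6, 14],
    [1, 4, 7, 5, 11],
    [3, 6, 8, 7, 16],
], dtype=float)

# Each match becomes the percentages of the player's own motif total...
per_match = matches / matches.sum(axis=1, keepdims=True)
# ...and the season representation is the average of those match vectors.
season_vector = per_match.mean(axis=0)
print(season_vector)   # five fractions summing to 1
```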

  • Another view of the data which I felt might be useful is to see each player’s match as the proportion of each motif his team performed that he participated in. That is to say, if Southampton completed 50 instances of ABAB in a match and Jordy Clasie participated in 25 of those, he would have a 50% score for ABAB in that match. If in that same match Southampton completed 80 instances of ABAC and Clasie participated in 20, he would have a 25% score for that motif. Applying this logic to the 5 different motifs and averaging over the whole season, each player is once again represented by a 5-dimensional vector. This is how well it performs:


Out of the three 5-dimensional vectorisations shown so far, this is by some margin the one which performs best. Both its z-scores are considerably lower than the other two’s, meaning it’s capturing pretty stable information for each player.
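In code, this third vectorisation divides the player’s motif counts by his team’s, match by match (toy numbers again, echoing the Clasie example):

```python
import numpy as np

# Per-match team motif totals (rows = matches, columns = the 5 motifs)...
team = np.array([[50, 80, 30, 40, 60],
                 [44, 70, 28, 36, 52]], dtype=float)
# ...and how many of each the player participated in:
player = np.array([[25, 20, 10, 8, 12],
                   [22, 21, 7, 9, 10]], dtype=float)

participation = player / team        # e.g. 25 of 50 ABABs -> 50%
season_vector = participation.mean(axis=0)
print(season_vector)
```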

  • In the first entry regarding passing motifs we mentioned how the motifs could be vectorised in a 15-dimensional vector for players. To refresh your memory, for an ABAC sequence a player could participate as the A player, the B player or the C player. It’s straightforward to count that looking at all 5 motifs there are 15 “participation” possibilities for each player. If we count how many times each player was each letter in each of the 5 motifs, we are left with a 15-dimensional vector representing each player. This is basically the methodology used in the “Who can Replace Xavi?” article.



Comparing things across different dimensions is rather difficult and not very standardised in mathematics, but I would dare say that this performs worse than the previous 5-dimensional vectorisations, especially considering Z-Score 1, which is the most important indicator.
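For the curious, here is one way the 15 pseudo-motif counts can be produced. The player ids and pass sequences are invented, and this is only a sketch of the counting logic, not the original implementation:

```python
from collections import Counter

MOTIF_NAMES = ["ABAB", "ABAC", "ABCA", "ABCB", "ABCD"]
# The 15 "pseudo-motifs": every (motif, letter) role a player can take.
ROLES = [(m, letter) for m in MOTIF_NAMES for letter in sorted(set(m))]

def motif_of(seq):
    """Relabel a 3-pass sequence of player ids as its anonymous motif."""
    letters, label = {}, ""
    for p in seq:
        letters.setdefault(p, "ABCD"[len(letters)])
        label += letters[p]
    return label, letters

def role_counts(sequences, player):
    """15-dimensional vector: how often `player` took each role."""
    counts = Counter()
    for seq in sequences:
        label, letters = motif_of(seq)
        if player in letters:
            counts[(label, letters[player])] += 1
    return [counts[role] for role in ROLES]

# Toy 3-pass sequences, written as the 4 touches' player ids:
seqs = [[1, 2, 1, 3],   # ABAC, player 1 is A
        [1, 2, 1, 2],   # ABAB, player 1 is A
        [4, 1, 2, 4]]   # ABCA, player 1 is B
print(role_counts(seqs, 1))
```

The 2 + 3 + 3 + 3 + 4 distinct letters across the five motifs are what give the 15 “participation” possibilities mentioned above.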

  • Finally, we can take this 15-dimensional idea and alter it slightly: rather than counting the totals of each pseudo-motif, we use their relative frequencies. For example, if Dimitri Payet performed the B in an ABAC 15 times out of 100 total motifs he participated in, that pseudo-motif gets a score of 15%. Once again, each player is represented by a 15-dimensional vector:


Immediately we appreciate that this is the best performing of all the vectorisations we have seen.

Now, the first thing we must say is that all 5 different ways of obtaining player vectors shown here show evidence of uncovering some stable, underlying qualities of players’ passing style. We have used the indicators to compare them and discuss which might be better, but there is no way of determining whether information that one of them picks up on is missed by another.

Here’s the advantage: there is no downside to combining them all. If we simply glue all these representations together into one long 45-dimensional (5+5+5+15+15) vector representation for players, then all the qualities each methodology picked up on are represented at some scale. If two players were similar across all representations, they will be similar in the long one as well; if two players were similar in some of the representations but not others, then they will be mildly similar, depending on how dissimilar they were in the others; etc.

Here is the performance of this long 45-dimensional vectorisation:


The results are very satisfying: the 45-dimensional representation proves to be a robust vectorisation of player passing style, more than 1 standard deviation below the mean distance between all players and more than 4 standard deviations below the Gaussian distances, even in this very high-dimensional space.
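Gluing the representations together is essentially a one-liner. In this sketch each block is standardised first so that no single representation dominates the distances; that normalisation choice is my assumption, and the data is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
n_players = 279   # players with enough appearances in the season

# Random stand-ins for the five per-player representations (5+5+5+15+15):
blocks = [rng.random((n_players, d)) for d in (5, 5, 5, 15, 15)]

def standardise(m):
    # z-score each feature across players
    return (m - m.mean(axis=0)) / m.std(axis=0)

glued = np.hstack([standardise(b) for b in blocks])
print(glued.shape)   # (279, 45)
```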

This vectorisation will surely provide me with a lot of material to explore for a good while; it’s even a little frustrating not finding an easy visual way to convey it to readers. Let’s settle for now on a hierarchical clustering dendrogram as a visualisation tool.
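For anyone wanting to reproduce this kind of visualisation, the dendrogram comes from standard agglomerative clustering. A sketch with scipy and random stand-in vectors (the Ward linkage is my choice here, not necessarily the one used for the PDF below):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
vectors = rng.random((20, 45))        # stand-ins for 45-dim player vectors
labels = [f"player_{i}" for i in range(20)]

Z = linkage(vectors, method="ward")   # agglomerative merge tree
print(Z.shape)                        # one row per merge: (n - 1, 4)
# dendrogram(Z, labels=labels) draws the tree (e.g. via matplotlib)
```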

Below is a link to the PDF of the hierarchical clustering dendrogram applied to the data set for the 2015-16 season of the Premier League (only players who played in over 18 matches). Since there are 279 players, the tree labels are really tiny and the image couldn't be uploaded onto the blog directly, but in the PDF you can use your browser's zoom to explore the results.

https://drive.google.com/file/d/0Bzvjb5fnv1HtZjFtRDJjUVBua0E/view?usp=sharing

If you'd rather not, here is a selection of the methodology's results:

  • Mesut Ozil has one of the most distinctive passing styles in the league. Cesc Fabregas is the player closest to him and together they form a subgroup with Juan Mata, Ross Barkley, Yaya Toure and Aaron Ramsey.
  • Alexis Sanchez is in a league of his own, but the players with the most similar passing style are Payet, Moussa Sissoko, Jesus Navas, Sterling and Martial.
  • Troy Deeney is in the esteemed company of Aguero, De Bruyne, Oscar and Sigurdsson.
  • David Silva, Willian, Eden Hazard and Christian Eriksen are all pretty similar.
  • Nemanja Matic, Eric Dier and Gareth Barry have a similar passing style.
  • M’Vila, Lanzini, Capoue, Puncheon, Ander Herrera and Drinkwater are all similar, pretty good and perhaps underrated.
  • Walcott, Iheanacho, Scott Sinclair, Jefferson Montero, Wilfried Zaha, Bakary Sako, Albrighton, Bolasie and Michail Antonio form a subgroup of similar wingers.
  • Giroud is more similar to some rather underwhelming strikers such as Gomis, Cameron Jerome and Papiss Cisse than to world-class strikers. The same can be said of Harry Kane being similar to Arouna Kone, Son and Marc Pugh. Maybe the methodology is not as convincing for strikers?
  • Shane Long and Odion Ighalo are good alternatives to Jamie Vardy.
  • Diego Costa and Lukaku are similar to Rooney.
  • Victor Moses, Aaron Lennon and Jordon Ibe are similar.
  • Mahrez is similar to Sessegnon, Nathan Redmond and Jesse Lingard. Did Southampton know this?
  • Matt Ritchie (ex-Bournemouth now at Newcastle) is in a group with Lallana, Alli, Pedro and Lamela. An opportunity for the taking?
  • Angel Rangel has (and has always had) unusual stats for a full-back.
  • The methodology recognises who the goalkeepers are and sets them apart, without this information being explicitly available in the datasets. The same applies to many other players from similar positions who are grouped together, like the CBs and full-backs.

This is a poor man’s substitute for actually exploring the dendrogram yourselves. Not to mention that a clustering dendrogram is not even the most faithful representation of the information collected by this vectorisation, but I’m more than happy with the results and feel there is some real promise in the methodology. If I can come up with better visualisations for the results, I’ll post those later on.

Please have a look through the results from the dendrogram and comment on whether you feel we’re getting close to convincingly capturing player passing style through passing motifs.

Distinguishing Quality from Random Noise: How do we know we’re getting valuable information?

One of the main challenges of football analytics is ensuring that our manipulation of the available data is in fact uncovering underlying “qualities” of teams and players, instead of just randomly picking up statistical noise or irrelevant facts. I could certainly take the available data and assign a number to each player by summing the number of blocked shots plus the square root of the number of headed shots inside the area, divided by the goal difference his team obtained with him on the pitch, multiplied by his number of interceptions. Can I use this number in any way to advise a club on whether they should buy him? Probably not. How can I know what is valuable?

Recall from the previous entries on team passing motifs that a main reason why I stated that the methodology was picking up on a stable quality of passing style was the fact that it was stable for consecutive seasons. If the methodology was just randomly assigning motif distributions, then surely there would be no consistency between different seasons.

The implication then is this: if a certain vectorisation of the data is in a certain sense “stable” across seasons, then this vectorisation is representative of an underlying quality of the data observations. Metrics intended to measure qualities which one would expect to be stable over seasons, such as “playing style” or “potential”, should be validated in this way.

The question then is how would the details of this validation go. In this entry, I’ll go through a “validating methodology” that I’ve been working on lately:

Take a vector representing a team or a player for a given season (something like the 5-dimensional vector representing a team in the passing motifs methodology). If my reasoning above is correct and the vector contains valuable information regarding that player/team, then if I consider the equivalent vector for the season directly before, the two should in theory be in some sense “close” to each other. The “closeness” of two vectors is of course a relative concept, so it should be measured in relation to the average distance between any pair of vectors.

As an example: If Juan Mata’s vector for 2014-15 is at a distance of 2.3 from his vector for 2015-16, and on average the distance between any two player vectors (not necessarily from the same player) in this context is 9.5, then we can say with reasonable certainty that Juan Mata’s vectors are “close” to each other.
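In code, the Juan Mata example boils down to a z-score; the standard deviation below is invented just to complete the arithmetic:

```python
mata_distance = 2.3   # distance between Mata's 2014-15 and 2015-16 vectors
mean_all = 9.5        # mean distance between any two player vectors
std_all = 3.0         # hypothetical spread of those pairwise distances

z = (mata_distance - mean_all) / std_all
print(z)              # about -2.4: Mata's vectors are unusually close
```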


The method I wrote takes as parameters the two vectorisation matrices for two consecutive seasons, normalises them, considers only players who have played at least 18 matches in each season, and prints out the following:



Here’s what we want to look at in this table: first of all, the lower the mean distance between the two vectors of each player, the better our methodology is according to the reasoning above. However, “low” is a very relative concept, so we need a reference against which to measure how low this number actually is. The methodology provides two such references:

  • The first and most important one is the mean distance between all vectors, not just between the two corresponding to each player. This gives us an idea of how far apart any two vectors in this context are, and whether the “closeness” of vectors of the same player is significant. Z-Score 1 is the mean distance between the vectors of the same player minus the mean distance between all vectors, divided by the standard deviation of the distances between all vectors. The lower this number (negative values welcome, of course), the better.
  • The second reference provided is the mean distance between simulated Gaussian vectors in the dimension of the problem. Z-Score 2 is the mean distance between the vectors of the same player minus the mean distance between the simulated Gaussian vectors, divided by the standard deviation of the distances between the Gaussian vectors. I feel this is also an important frame of reference because it gives a measure of just how “normalised” the scaled problem is. It also provides important “dimensional” context: if our vectorisation is in 15 dimensions rather than 5, the raw distances will increase, but this does not necessarily mean that the higher-dimensional vectorisation is less valuable, simply that the numbers we deal with in higher dimensions are naturally larger, and we need to know this to judge how “low” our mean distance really is. Hence the importance of the Z-Scores.
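Putting the whole validation together, here is a self-contained sketch of how the table’s numbers could be computed. The data is a random stand-in, and the real method also normalises the matrices and filters by appearances first:

```python
import numpy as np

def validation_report(season_a, season_b, seed=0):
    """Rows of season_a and season_b are the same players' vectors in two
    consecutive seasons. Returns the mean same-player distance, the mean
    all-pairs distance, Z-Score 1 and Z-Score 2."""
    same = np.linalg.norm(season_a - season_b, axis=1).mean()

    # Distances between every pair of vectors from the pooled seasons:
    pooled = np.vstack([season_a, season_b])
    d = np.linalg.norm(pooled[:, None, :] - pooled[None, :, :], axis=2)
    all_d = d[np.triu_indices(len(pooled), k=1)]

    # Reference: pairwise distances between standard Gaussian vectors
    # of the same dimension.
    g = np.random.default_rng(seed).standard_normal((400, season_a.shape[1]))
    gd = np.linalg.norm(g[:, None, :] - g[None, :, :], axis=2)
    gauss_d = gd[np.triu_indices(400, k=1)]

    z1 = (same - all_d.mean()) / all_d.std()
    z2 = (same - gauss_d.mean()) / gauss_d.std()
    return same, all_d.mean(), z1, z2

# Two "seasons" where each player's vector barely moves:
rng = np.random.default_rng(1)
base = rng.random((100, 5))
shifted = base + 0.05 * rng.random((100, 5))
same, mean_all, z1, z2 = validation_report(base, shifted)
print(round(z1, 2), round(z2, 2))   # both clearly negative
```

A vectorisation that assigned vectors at random would instead give z-scores hovering around zero, which is exactly the failure mode the validation is designed to expose.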

This entry was a bit technical and perhaps less interesting for the average football fan, but I thought it was important to explain it because it’s what I’ve been using to understand how to best translate the passing motifs problem to a player context. I’m looking to follow up this entry with an applied example comparing different vectorisations of passing motifs at a player level very soon (2-3 days hopefully if I can find the time), so stay tuned!

Monday 10 October 2016

New Season, New Ideas

I spent the majority of the summer off in Colombia and then Croatia and took a bit of a break from football and math. But now I’m back in London and settling into my routine, and even though I didn’t spend much time in front of the computer over the summer and have no finished results to show you yet, even on holiday I can’t stop my mind from drifting off towards football and new ideas that could be applied. I’m going to use this entry to tell you about the plans I made to explore some of these ideas during this new “Analytics Season” leading up to the 2017 OptaPro Forum in February, where I’m hoping to get the chance to present them.

Up until this point, I’ve spent most of the entries speaking about team passing sequences and the results of their quantification through network motifs. This is a very interesting topic, and I think there is still more to come from this. These are some areas where I still hope to do some more work:

  • The vectorisation through motif frequencies can be refined with some more information. For example, I’ve been thinking that different instances of ABAC represent very different kinds of combination play. An ABAC passing sequence can be composed of a short one-two between players A and B, after which A plays a long ball to player C. Alternatively, it can simply be composed of 3 short passes. The distance of the third pass should be the main factor differentiating types of ABAC, because in the vast majority of cases the ‘ABA’ part will be made up of short passes (if Coutinho gives it to Lallana and Lallana gives it back to Coutinho, we don’t expect either of those passes to have been long; otherwise Coutinho would have had to run a large distance after his first pass). When the weights of the first principal component of the lengths of each pass in all instances of ABAC are examined, almost 97% of the variance is on the length of the third pass. It remains to be seen whether this ‘refinement’ can be used to further discern and distinguish team playing style.

NOTE: Think of Principal Component Analysis as a method assigning coefficients to what features contain the most variance in a set of data. If we have the height and weight of a population of hippos and a population of zebras, the height of the whole set is roughly the same but the weight differs a lot, and Principal Components Analysis tells us precisely this: the weight is where the variance is.
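As a sanity check on that claim, here is a sketch with simulated ABAC pass lengths. The distributions are invented to mimic the pattern described above, and the PCA is done by hand on the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Columns: lengths (in metres) of the three passes of each simulated ABAC.
lengths = np.column_stack([
    5 + rng.normal(0, 1, n),     # pass 1 of the one-two: short
    5 + rng.normal(0, 1, n),     # pass 2 of the one-two: short
    20 + rng.normal(0, 12, n),   # pass 3: anything from short to very long
])

cov = np.cov(lengths, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
first_pc = eigvecs[:, -1]                # direction of greatest variance
explained = eigvals[-1] / eigvals.sum()
print(abs(first_pc).round(2), round(explained, 3))
```

With these made-up numbers the first principal component loads almost entirely on the third pass, just as the real ABAC data does.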

  • Even though we’ve spoken about “playing style” convincingly through this methodology, we still haven’t related this vectorisation to “success on the field” yet, which is actually what ultimately matters. It would be interesting for example to use Topological Data Analysis (you can read about it in my previous entry) to map out the motif vectorisation and discover where success in the league is being accumulated. We can also fit a probability distribution that gives the probability of a certain motif structure leading to a top 4 finish for instance. In this sense, we could potentially advise clubs on what they need to change in their passing play to increase their probability of finishing in the top 4, or of not being relegated, or of winning the league, etc.
  • As I said before and showed with the Xavi example, this ‘passing motifs’ idea can simply be extrapolated to a player level by vectorising each player’s frequency of participation in the different motifs. Once players are represented as vectors in this high-dimensional space, we can apply a whole arsenal of methodologies to answer questions such as which player is best suited to replace an outgoing player, which players have a similar style, or how individual players affect the passing motif structures of their teams. It remains to be seen whether this approach quantifies a meaningful underlying quality in players as we have shown it does for teams, but certainly, more information is preferable to less.
  • These topics (team style, player style and recruitment) can be combined; so for example we can advise clubs on how the recruitment of a certain player will affect their passing motif structure, and whether this change will improve their probability of a top 4 finish.

These are ambitious plans, but I think they are important ones, because I must admit that a (fair) criticism that could be thrown my way is that so far the results are very interesting theoretically but difficult to translate into practical, applied contexts within the industry. This is fair, but it doesn’t take away in the least from the value of the results obtained. The passing motifs methodology has a lot going for it. It proved to be consistent across different seasons, which is strong evidence that it identifies some underlying inherent property which we called “passing style”, instead of just randomly picking up statistical noise. It was also used to identify a passing style unique to Leicester City which was present even before their title-winning season, something that no one could have predicted or expected. As I said, it has a lot going for it.

The key counter-argument to this criticism is this (an opinion): there doesn’t have to be an obvious, direct and immediate practical application for theoretical work to be valuable to a field. I strongly suggest those with a true interest in the topic of Football Analytics read this fascinating entry from Statsbomb author Marek Kwiatkowski. Here’s an excerpt if you can’t be bothered to read the whole thing:

“(I) believe that we have now reached the point where all obvious work has been done, and to progress we must take a step back and reassess the field as a whole. I think about football analytics as a bona fide scientific discipline: quantitative study of a particular class of complex systems. Put like this it is not fundamentally different from other sciences like biology or physics or linguistics. It is just much less mature. And in my view we have now reached a point where the entire discipline is held back by a key aspect of this immaturity: the lack of theoretical developments. Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics. We are doing biology without evolution; physics without calculus; linguistics without grammar. As a result, instead of building a coherent and ever-expanding body of knowledge, we collect isolated factoids.”

When looking for conceptual theoretical developments, passing network motifs fit the bill of a consistent and robust concept with a clear underlying motivation (representation of “passing style”). Practical applications will inevitably follow from this maturation of the discipline, and I have already outlined above some much more practical approaches which can be looked into.

Finally, and before this entry gets any longer: this approach has given me valuable insight into a type of conceptual processing that can be done to raw football data in order to obtain a meaningful representation. Football events during a match are very dynamic, complex and interdependent, but they codify all the necessary information to determine results, quality, potential, etc. The network motifs approach suggests taking the constituent blocks of the passing events graph and applying an equivalence relationship on the identity of the nodes in order to study their nature (this simply means that instead of focusing on the specific players performing the passes, we consider any occurrence of a one-two as belonging to the same “class” of pass motif regardless of who the specific players were). It has made me think: why not attempt this with other types of events? Consider for example having a directed graph representing a team’s performance, but instead of the nodes representing players and the edges passes, each node can be seen to represent an “area” of the pitch and an edge is simply the act of going from one area to another through a pass, dribble, etc. A sequence in this network is simply a movement between different areas. The ‘equivalence relationship’ on the identity of the nodes which I think would be useful for this approach would work something like this: a play starting in one area, then moving to an area three spaces to the right through a pass, and then forward 2 spaces through a dribble, would be classified exactly the same regardless of the players involved and whether it started inside our own penalty box or from the halfway line.
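A tiny sketch of how this ‘area motif’ equivalence might be encoded; the grid, zone coordinates and action names are all invented for illustration:

```python
# Zones are (column, row) cells of a hypothetical pitch grid; each event
# is an (action, zone) pair.

def zone_motif(events):
    """Relabel a sequence of (action, (col, row)) events as relative moves,
    discarding both player identity and absolute pitch location."""
    motif = []
    for (_, prev), (action, cur) in zip(events, events[1:]):
        dx, dy = cur[0] - prev[0], cur[1] - prev[1]
        motif.append((action, dx, dy))
    return tuple(motif)

# Start in a zone, pass three cells right, then dribble two cells forward:
a = [("start", (0, 0)), ("pass", (3, 0)), ("dribble", (3, 2))]
# The same shape starting from a different part of the pitch...
b = [("start", (2, 1)), ("pass", (5, 1)), ("dribble", (5, 3))]
# ...is classified as exactly the same motif:
assert zone_motif(a) == zone_motif(b)
print(zone_motif(a))   # (('pass', 3, 0), ('dribble', 0, 2))
```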

Vectorising team or player performance through the frequency of the motifs in this context could lead to a very robust quantification of playing style, performance metric, probability of success… who knows?! I don’t yet, but let’s hope I can find out from here to February.