2020: Week 19 - Solution



You can find our full solution workflow below and download it from the Preppin' Data Community page!


This week was a follow up task from last week so that we could further analyse the Liverpool line-ups from this season's Premier League. 

Our first task was to use the previous solution's workflow, along with a couple of other data sources in order to further our analysis.

Step 1 - Calculate Goals Scored

The first step is to calculate how many goals each team scored, so we start by creating a branch just before the final aggregation tool on the previous workflow:

From here we can change the [No.] field to a Whole Number, then remove any unneeded fields for this part of analysis. These include:

  • Comp.
  • Date
  • Player Name
  • Subbed Off?
  • Subbed On?
  • Subbed On/Off
  • Sub Number
  • Mins Played 
  • Appearances
The next step will be to isolate each of the scores for each team. We can do this by splitting the score field using an automatic split on the [Result] field, leaving us with a column for each side of the '-' divider.


From here we can rename the split fields to [Home Goals] and [Away Goals] (Home goals will be the first number, and Away goals will be the second number). Then use the following calculations to determine whether Liverpool or the Opposition had scored the goals.

[Liverpool Goals]

IF [ Location] = 'H' THEN [Home Goals] 
ELSE [Away Goals]
END


[Opposition Goals]

IF [ Location] = 'A' THEN [Home Goals]
ELSE [Away Goals]
END


Now we have classified who scored the goal, we can remove the Home/Away goals field and we are left with this:


Next, we need to use an Aggregation tool to calculate the goals per match as there is currently a line per player, therefore this increases the amount of goals massively. The aggregation tool will need the following setup:


We group by [Formation], [Oppo Form.], [Opposition] and [No.] so that we can have a row for each match played. We need to group by [No.] as each team plays each other twice (Home & Away) each season. Then, if we average the [Liverpool Goals] and [Opposition Goals] we are left with the scoreline for each match that has been played. 


Now we have a single line for each match, we can again use an aggregation tool to calculate the numbers needed to answer our questions. The second aggregation tool for the combinations has the following setup: 


Group by [Formation] and [Oppo Form.] gives us all of the different combinations that we require, then calculating the Sum of [Liverpool Goals] and [Opposition Goals] gives us how many goals were scored in each combination. Then finally, bringing through the [number of rows] (Games Played) allows us to calculate how many times each combination of formation appears.

From here, we can use the following calculation to answer our question: 

[Avg Goals Scored]
zn([Liverpool Goals]/[Games Played])

[Avg Goals Conceded]
zn([Opposition Goals]/[Games Played])

Using the ZN function will allow all Nulls to be returned as 0.

That's all that is needed for Output 1


Step 2 - Join Positions & Players


For our 2nd output this week, we need to incorporate the other data sets that have been provided. First, is the formation positions which joins nicely with our existing workflow by joining on [Formation] = [Formation] and [Player Position] = [Player Number]: 


Now we have the Formation Positions joined, we can now add in the player positions. This is a little harder as the players are all in one single string with their Squad Number, Name, and Position.

We want to follow these step to prepare the Player Names for joining: 
  • Remove Numbers from [Player Name]. We don't need the squad numbers for this exercise.
  • Automatic Split on [Player Name]. This gives us a column for the name and their position. 
  • Rename [Position Letters]. Replace the letters G/D/M/A with the text that was provided in the challenge post. 
  • Remove [Player Name] Field.
  • Rename Split 1 to [Player Name], and Split 2 to [Player Position].
We should be left with two columns looking like this: 


Next, we have the problem of removing the first name of each player so that it can join with our existing workflow. The names have different structures just like the following examples: 
  • First Name Last Name (Andy Lonergan)
  • Single Name (Adrián)
  • Split Surname (Virgil Van Dijk or Sepp van den Berg)
To split out the correct names we need to use the following steps:
  • Split [Player Name]. Use a custom split to split all values in the player name by a ' ' (space).
  • Use IF calculations to fill in the blanks with the correct part of the name.
[Player Name - Split 2]
IF [Player Name - Split 2] = '' 
THEN [Player Name - Split 1]
ELSE [Player Name - Split 2]
END

[Player Name - Split 2]
[Player Name - Split 2]+' '+[Player Name - Split 3]+' '+[Player Name - Split 4]
  • Remove extra split fields. Split 1, Split 3, Split 4
  • Rename Split 2 to [Player Surname]
  • Remove extra spaces from [Player Surname]
  • Rename [Player Position] to [Player Preferred Position] 
We should be left with following: 


We can now join to our original flow and combine all our data sources together. 


The second join condition looks like this (Player Surname=Player Name):


Notice how there are some excluded from the Player list. These are players who haven't played in the Premier League this season so therefore they aren't needed going forwards.

Step 3 - Find Substitute Positions

Now, we have all of our data in one table, we need to work out what position the substitute takes when he enters the field of play. From the challenge post we are assuming that it is a direct swap and that the formation doesn't change. 

Our first step is to remove the following columns to clean up the table after all the joins.
  • Player Name-1
  • Formation
  • Oppo Form. 
  • Comp.
  • Location
  • Result
  • Opposition
  • Date
  • Player Position
  • Player Surname
We are now going to branch off into two separate streams. 


First is for players who have been substituted off of the pitch. We want to know what player has been subbed off, therefore we can use the following calculation to return the player name: 

[Player Off]
IF [Subbed Off?] = TRUE 
THEN [Player Name]
ELSE NULL
END

We can use a similar calculation to return the position as well:

[Position Off]
IF [Subbed Off?] = TRUE 
THEN [Position Name]
ELSE NULL
END

We can now identify what player has been subbed off and what position they were playing in. Therefore, we can remove any additional columns including:
  • Player Name
  • Subbed Off?
  • Subbed On?
  • Mins Played
  • Appearances
  • Player Name
  • Player Preferred Position
  • Position Name
  • Position Type
Currently, we have all players included therefore we want to filter out all of the players who haven't been substituted. We can use exclude all Null values from the [Player Off] field, which leaves us with all the player names who have been substituted off this season.

Next, we can follow a similar process but on a different branch for the player's who were substituted on. First we use this calculation to identify if the players that came on: 

[Player On]
IF [Subbed On?] = TRUE THEN [Player Name] END

Then we can remove the same fields as before, and filter to exclude all Null values in the [Player On] field. 

Now we have cleaned each branch it is time to join these back together. We can use the following join conditions:


We have identified the players who have been subbed on and off, we now need to join this table back to our table with all players in. With join conditions that look like the following: 


We have used a Right Join here as we want to include all of the values from our original table even if they are Null or don't match with our Subs On/Off table. 


Next, we need to identify what positions the subs that came on have taken up. This can be done by renaming [No.-1] to [Match No.] and using the following calculation: 

[Position Name]
IF [Position Name] = 'sub' THEN [Position Off]
ELSE [Position Name]
END

We can then use a filter to exclude nulls from [Position Name] field. This will exclude any subs that have not been used across the season.

Finally, we want to bring in our original Formation Positions table as a new data source. We are going to use this to join back our positions instead of using from earlier in the workflow. In order to join we need to exclude any 'Sub' from the position type, then use an aggregation tool to eliminate any duplicated fields. We group by [Position Type], [Formation Name], [Position Name].

We can then join back to our original flow with Position Name = Position Name & Formation Name = Formation Name. 


Step 4 - Calculate Playing Times

After the join we have an exploded data set containing lots of rows for each player and match, therefore we need to use an Aggregate tool to group these together and calculate the totals. Our aggregation looks like this:


We can then rename the following fields: 
  • [Position Type-1] to [Position Type]
  • [Number of Rows] to [No Times Played]
Finally, we just need to answer the questions from the challenge post:
  • How many times did each player play in each position? 
  • How many minutes did they play in each position? 
  • A player has a preferred type of position, how many times did they play not in that position type (Out of Position)?
To answer these questions we need to use the following calculations: 

[Out of Position?]
IF [Player Preferred Position] != [Position Type] 
THEN 1 
ELSE 0
END

This identifies if a player played out of position in that match. 

[Times OoP]
[Out of Position?]*[No Times Played]

This calculates the total time played out of position. 

Next we want to calculate the number of games each player is out of position across the whole table. We can use an LOD to calculate this: 

[Games OoP]

The last task is to remove any additional fields and we are ready to output the data!


Make sure to fill in the participation tracker, share using #PreppinData on Twitter and post you solutions onto our Tableau Forums community page so that we can compare our workflows! 

Popular posts from this blog

2024: Week 1 - Prep Air's Flow Card

How to...Handle Free Text

2023: Week 1 The Data Source Bank