2020: Week 3 Solution

You can see our full workflow solution below and download it here!


There are 3 main portions to this solution:

1. Preparing the data.
2. Calculating each of the new fields.
3. Joining all the new fields together.

In order to get to the point where we begin to figure out each of the 9 new fields we need to properly filter, shape, join, and prepare the data. You can see how our flow is broken into these three groups.


Preparing the Data

1. Import the game data.

  • Use a wildcard import which includes sheets with a matching pattern of "*Results".
  • In the import, untick everything that we do't actually need. We only need to keep:
    • [Visitor/Neutral] (renamed to [Visiting Team])
    • [Home/Neutral] (renamed to [Home Team])
    • [Date]
    • [PTS] (renamed to [Visitor PTS])
    • [PTS 1] (renamed to [Home PTS]).
Removing unplayed games.

2. Remove games that haven't been played yet.

As January isn't yet over, there are a lot of fixtures without any results. We can remove these just by right-clicking on the "Null" bar in [Visitor PTS] or [Home PTS] and selecting exclude.

3. Convert the date field from a string to an actual date.

Currently the date is stored as a string. Later on, we need to be able to compare dates, such as for the "L10" and "Strk" fields. We can't simply change the field type in this case, so instead we can use the DATEPARSE() and MID() functions to convert the string to a date.

[Date]

DATE(
  DATEPARSE(
      'MMM d yyyy'         //The format of the following string.
      , MID( [Date], 5 )    //Everything from the 5th char onwards of the [Date] field.
  )
)


4. Get the winner of each game in it's own field.

We don't actually need to know the specific score of each game, just who won. To this end, we can create the "Winner" field below to store this information and afterwards remove the "PTS" fields.

[Winner]

IF [Home PTS] > [Visitor PTS]
    THEN [Home Team]
    ELSE [Visiting Team]
END


NBA games never end in a draw (they just keep playing overtime) so we don't need to account for the event that [Home PTS] = [Visitor PTS].

5. Determine whether each game was between teams in the same conference or in different conferences.

Joining in the conference data.
One of our final fields, the [Conf] field, requires us to know the wins and losses for games within each team's own conference, so we need to label each game appropriately for later.

Firstly, we need to join in the "Team List" data TWICE. Join the data in on [Home Team] = [Team], and rename the new [Conference] field to [Home Conf]. Then, join in the data again on [Visiting Team] = [Team] and rename the new [Conference] field to [Visiting Conf].


Finally, create a new field called [In/Cross Conf?] to determine whether each game was played across conferences or within a conference.

[In/Cross Conf?]

IF [Home Conf] = [Visiting Conf]
    THEN "In Conf"
    ELSE "Cross Conf"
END

6. Pivot the teams so each team in each game has its own row.

Our final data is all about the stats for each individual team, so it makes sense that we should have a row per team, per game, instead of just a row per game. This will help us aggregate the data at the team level later.

Use a Columns-to-Rows pivot with the first group of pivoted fields being [Home Team] and [Visiting Team], and the second group of pivoted fields being [Home Conf] and [Visiting Conf]. These groups should get names [Team] and [Conf] automatically.


Afterwards, rename the [Pivot1 Names] field to [Home/Visiting]. This is important as two of the final fields are [H] and [A] (home and away) so we need to know for each team in each game whether they were the home team or the away team.

7. Finally, label whether each team won or lost.

Now each team in each game has its own row, we need a couple of fields to help us count wins and losses when we aggregate. In our example we use two numerical fields called [Win] and [Lose] which contain just ones and zeros so we can sum them up later.

[Win]
[Lose]

IF [Team] = [Winner]
    THEN 1
    ELSE 0
END


IF [Team] != [Winner]
    THEN 1
    ELSE 0
END
We're now ready to start on the second part: calculating all the new metrics for each team! These all branch away from this step, so each heading below returns to and continues from here.

Calculating the Fields

[W], [L], [Pct], & [Rank]

1. [W] & [L]

Use an aggregate step with the following settings:
  • GROUP by:
    • [Conf]
    • [Team]
  • SUM up:
    • [W]
    • [L]
We now have our overall wins and losses for each team. Following on from this we can calculate our win percentage field.

2. [Pct]

Create a new calculated field called [Pct] which divides the total number of games by the total number of wins. Round this to 3 decimal places to finish off.

[Pct]

ROUND(
  [W] / ([W] + [L])
  , 3
)


3. [Rank]

We now need to find the rank of each team within their conference. To do this we can use the self-join trick to perform the necessary table calculation. 

Firstly, duplicate the data in a fresh clean step.
Secondly, join the data to itself on:
  • [Pct] >= [Pct]
  • [Conf] = [Conf]   (as we're only comparing teams in the same conference as each other)
Finally, use an aggregate step to count the number of rows for each team and conference. This equates to the rank (modified competition ranking in this case, i.e. "1334" instead of "1234"). 
  • GROUP on:
    • [Team-1] (rename to [Team])
    • [Conf] (rename to [Conference])
  • SUM up:
    • [Number of Rows (Aggregated)] (rename to [Rank]).

[Home] & [Away]


Firstly, sum up the home and away wins and losses for each team using an aggregate step:
  • GROUP on:
    • [Home/Visiting]
    • [Team]
  • SUM up:
    • [Win]
    • [Lose]
Next, use a calculation to combine the wins and losses into the "12-3" format we're after:

[W/L]

STR([Win]) + "-" + STR([Lose])


Pivoting home and away games.
Finally, we need to pivot these so that we have [Home] W/L and [Away] W/L instead of just [W/L] with home and away on different rows. We can use a columns-to-rows pivot to accomplish this:
  • PIVOT:
    • [Home Team] (rename to [Home])
    • [Visiting Team] (rename to [Away])
  • AGGREGATE:
    • [W/L] using MIN or MAX

[Conf]


For the results within each teams conference we need to only include rows where the [In/Cross Conf?] value is "In Conf". We can use an aggregate step after filtering to simply sum up the wins and losses for each team.
  • GROUP on:
    • [Team]
    • [In/Cross Conf?] (filtered to just "In Conf")
  • SUM up:
    • [Lose]
    • [Win]
Afterwards, simply apply the W/L formatting function again to get [Conf].

[Conf]

STR([Win]) + "-" + STR([Lose])

[L10]


To get the most recent 10 games for each team, we can use another self-join to rank how recent each game was for each team and then filtering for the top 10.

Duplicate the data into a fresh clean step, and then join it to itself on:
  • [Team] = [Team]
  • [Date] >= [Date]
Next, use an aggregate step to rank how recent each game was for each team. We also need to make sure to retain whether each game was a win or a loss too.
  • GROUP on:
    • [Date-1]
    • [Team]
  • MIN of:
    • [Win]
    • Range of values filter.
    • [Lose]
  • SUM up:
    • [Number of Rows (Aggregated)]
Now create a "Range of Values" filter on [Number of Rows (Aggregated)] set with the maximum value as 10. Applying this filter leaves us with the 10 most recent games for each team.

We can now use an aggregate step to count the wins and losses for each team during their last 10 games:
  • GROUP on [Team]
  • SUM up:
    • [Win]
    • [Lose]
Finally, we can use the same function from the "Home and Away" section to format the wins and losses into a single field.

[L10]

STR([Win]) + "-" + STR([Lose])


[Strk]


This field required the most steps (in our solution) to calculate. The steps required are:

1. Use an aggregate step to get the most recent win and most recent loss for each team:
  • GROUP on:
    • [Team]
    • [Win] ("1" = most recent win, "0" = most recent loss)
  • MAX of [Date].
2. Use a rows-to-columns pivot so we have the dates of the most recent wins and losses on the same row for each team.
  • PIVOT:
    • [Win].
  • AGGREGATE:
    • [Date] using MAX.
3. Label whether the most recent result was a win or a loss so we know what kind of streak they're currently on. We use 1 and 0 as our labels here so we can join on them later.

[Last Result]

IF [Last Win] > [Last Loss]
    THEN 1
    ELSE 0
END


4. Retrieve the date that the streak began. If the [Last Result] is a win, then we take the date of their last loss, and vice versa.

[Streak Beginning]

IF [Last Result] = 1
    THEN [Last Loss]
    ELSE [Last Win]
END


5. We now need to retrieve all games for each team that form part of their current streak. We can then count how many rows, i.e. games, there are for each team to determine how long their streak is. 

Firstly, retrieve all the game information by joining our current aggregated and pivoted data back to the clean step at the end of the preparation section. We want to only bring back games for each team where the result is the same as the result of their most recent game and the date is after the date that their streak began. Thus, we want a join with the following settings:
  • [Team] = [Team]
  • [Last Result] = [Win]
  • [Streak Beginning] <= [Date].
As our [Last Result] is a 1 for a win and a 0 for a loss, and the original [Win] field is a 1 for a win and a 0 for a loss, this join ensures we only return games that match the most recent result for each team.

With this join completed, we can now use an aggregate step to count how games are in each team's current streak.
  • GROUP on:
    • [Team]
    • [Last Result]
  • SUM up:
    • [Number of Rows (Aggregated)] (and rename to [Strk])
6. Finally, we have to prefix the length of the streak with an 'L' or a 'W' to indicate whether it is a winning or losing streak.

[Strk]

IF [Last Result] = 1
    THEN "W"
    ELSE "L"
END + STR( [Strk] )


Joining it all together

We know have a series of steps, each with one row per team. So, we can use four joins in a row, each on [Team] = [Team]. After the fourth join, remove all the duplicated [Team] fields.


Finally, have two branching clean steps. In one clean step, filter the data to keep only rows where [Conference] = "Eastern" and in the other where [Conference] = "Western". You should now have two data sets: one containing the standings and stats for the Eastern conference and one containing the standings and stats for the Western conference!

Popular posts from this blog

2024: Week 1 - Prep Air's Flow Card

2023: Week 1 The Data Source Bank

How to...Handle Free Text