2020: Week 5 Solution




The Six Nations Championship started last weekend, so it was the perfect timing to do a Preppin Data themed challenge. This week we focused on the different types of ranking, in previous tasks we have used a technique to create a Modified Competition rank, so this week we are turning our focus to the Standard Competition rank.

The first task this week is to clean the data to remove any null values within the Half Time score fields. This is easily done with a right click and 'Exclude' on the null values.


Our next focus is calculating the Standard Competition rank on the Diff field. To create the rank, we use the same process as before, which includes a self-join and then summarise.

The first step is to create a clean step, which duplicates the data and can then be used within the self-join. We can then use a join tool on both of the clean steps, and with the join condition where the Diff from the 'View Data' clean is greater than or equal to the Diff from the Duplicate Data clean. This step will explode our data out and create 5,476 rows of data from the original 103.


Now that we have so many duplicated rows, we can use the aggregation tool to remove these duplicates but retain our ranking. To do this, we group by Venue and Date, then Sum Number of Rows. We can then rename the 'Number of Rows' field to 'Mod Rank' as this is now our modified ranking.

Now we have the modified rank, we can go on to calculate the Standard ranking. The first step is to identify how many rows have the same rank and have therefore been modified. We can calculate this by using an aggregate tool with a grouping on Mod Rank, then summing the Number of Rows again. This will provide us with how many times each rank appears within our data.


After renaming the new 'Number of Rows' to 'Shared Rank', now we have how many times each rank appears, we can calculate the Standard ranking. To do this we use the following calculation:

Stand Comp Rank
IF [Shared Rank] = 1 THEN [Mod Rank]
ELSE [Mod Rank] - [Shared Rank] +1
END

When breaking this calculation down, if the Shared Rank equals 1, then we want to return the original 'Mod Rank' as these ranks are the same on both the Modified and Standard ranks. However, if Shared Rank does not equal 1, then we want to subtract the Shared Rank value from the Mod rank then add 1. This repositions the ranks to the higher end of the shared ranks, therefore creating the Standard Competition rank as a result.

Now we have both the Modified Rank and Standard Rank, we can join back the original data flow and bring back the venue's and dates. To do this we use the join tool, and join on Mod Rank.


The next stage is to calculate the Best, Worst and Avg rank for each venue. To do this, we can first clean any fields that had been duplicated as a result of the join. We then need to duplicate the Standard Comp Rank twice, so that it appears as three different columns in our data.

After the duplication, we can use the aggregation tool to group by venue, then we can use the duplicated Standard Comp rank field to find the Min, Max and Avg rank. We also need to sum the Mod Rank field, as this will bring through the number of games that were played at each venue.



The final step is to rename the newly aggregated fields, and then prepare the flow for the output.



Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text