2020: Week 31 - Solution
This solution is from Tom Prowse and you can download our full solution here.
This week saw our yearly collaboration with Workout Wednesday, with a live virtual event. We looked into the history of the Olympic Games, and which countries/athletes have won the most medals. This is the solution for the Preppin Data half of the event, which can then be used to create your solution for Workout Wednesday.
Step 1 - Join Countries & Code
The first step with this challenge is to make sure that every medal has a Country and Country Code. To do this we are going to use a couple of joins with the following setup:
1. Join on Country
2. Join on Code
Our workflow should now look like this:
As a result of the join, both of our workflows now have a couple of fields that we no longer need. Therefore, we can remove Country Code and Country-1 from both of the branches.
Once we have removed the additional fields, we can then union both of the branches together, and stack the tables on top of each other. The fields should match up correctly and we should be left with 9 fields and around 33K rows.
Step 2 - Group & Clean Sports
Now we have successfully joined the Medallists & Country Codes tables together, we can start cleaning some of the sports so that they are a little more useful for our analysis.
First we want to group various disciplines together, so we made the following groupings:
- Sport
- Canoeing and Canoe/Kayak
- Swimming and Aquatics
- Discipline
- Beach volley. and Beach volleyball
- Cycling Road, Road Cycling, Cycling Track and Track Cycling
- Artistic and Artistic G.
- Modern Pentath, Modern Pentathlon and Modern pentathlon
- Mountain Bike and Mountain Biking
- Rhythmic and Rhythmic G.
- Synchronized S., Synchronized Swimming and Synchronised swimming
- Greco-Roman and Wrestling Gre-R
You can choose to do these manually, but we would recommend using the grouping features within Tableau Prep. The Common Characters, Pronunciation and Spelling options will allow you to quickly group to different disciplines together, but it is always useful to go and double check what groupings are actually being made!
Once we have made the groupings, we can then clean the distances within the Event field. We can use the following calculations:
1. Replace Kilometres with 'km'
REPLACE([Event],"kilometres","km")
2. Replace Meters with 'm'
REPLACE([Event],"metres","m")
3. Replace Metre with 'm'
REPLACE([Event],"metre","m")
Be careful with the order that you complete these replace calculations, because if you do Metres before the Kilometres step, then you will be left with Kilom resulting in the Kilometre calculation not working correctly.
Finally, we can use the TRIM clean functionality to remove any leading or trailing spaces from the Country field.
Our workflow now looks like this, and we have completed our first output that is required:
Workflow
Medallists Dataset Output
Step 3 - Aggregate & Pivot
Now we have completed the first output, we want to use an aggregation tool to create a row for each Event and Medal. To do this we use the aggregation tool and group on the following fields:
- Country
- Sport
- Medal
- Event
- Year
- Event_Gender
- Discipline
The aggregation setup looks like this:
Note, I have changed my view to 'Show List View' so that it is easier to view all of the fields. If yours looks a little different then that's fine, just make sure you have the same fields in the grouped section!
We now have a row for each event medal, therefore it's time to calculate which countries have won the most. First, we need to create a 'Number of Rows' field, to do this we can just create a calculation with the number 1 in it. We will call this field Count.
Next, we want to tidy up the fields so that we keep only the fields that we need. The fields that we keep are:
- Country
- Count
- Medal
- Year
The next step is to transform the data by using a Rows to Columns Pivot, using the Medal & Count fields. The setup looks like this:
After the pivot our data should now look like this:
Notice that there are a few nulls as a result of the pivot. This is where the country didn't win any medals in that category in that year, therefore we want to replace the Nulls with 0.
To replace the Nulls, you can use the ZN() function within a calculated field, or alternatively you can select the null value from the profile pane and replace with a 0.
Our table is now complete and can become our 2nd Output.
Country Medals Dataset
Step 4 - Host Cities
Our final task for this week is to clean the Hosts data set so that we have a nice clean table to identify the host cities.
The first step is to use the split functionality to separate the city and the country from the Host field. To do this we use the automatic split, which provides us with a column for each.
We then want to rename our fields so they represent the data, so rename Host - Split1 to Host City, and Host - Split2 to Host Country. We can also remove the Host field at this stage.
The next step is to tidy the dates, by converting them from a string to date format and also extracting the year that the games took place.
First we are going to convert both date fields to a date format, this can be done simply by changing the field type:
We then want to isolate the year from the Start Date, we can use the following calculation to do this:
Year
YEAR([Start Date])
The final task for this week's challenge is to do a final tidy of the Host Country:
- Trim Leading/Trailing Spaces
- Replace United Kingdom with Great Britain
After these changes we should be ready to output the table for our final output:
Host Cities Dataset
All of the above outputs can be downloaded here.
Once you have completed this challenge, don't forget to complete the Workout Wednesday which uses the data that you have just prepped to created a visualisation within Tableau.
After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja, @JennyMartinDS14, @JonathanAllenby & @TomProwse1
You can also post your solution on the brand new Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help!