2020: Week 43 - Solution

 


Solution by Tom Prowse and our full solution can be downloaded here.


This week's challenge was based around cleaning a formatted spreadsheet that focused on the most popular baby names in England & Wales from 2019. This sort of problem occurs often when a spreadsheet is formatted nicely in Excel, but not so nicely when trying to use in Prep!

Step 1 - Input Files

The first task is to input all of the tables for each gender. At first we are only given the option to use a single table from the input, and when bringing this into our workflow, a lot of the months are combined within the same fields. This isn't ideal for Prep!

Therefore, the first step is to use the Data Interpreter. This identifies the different tables within the spreadsheet and breaks these down so that each can be a separate input. 


Now we have separate inputs for each section we can identify which ones have the same structure, these include:

Boys

  • Table 5 A6:O18 = Jan - Apr
  • Table 5 A20:O32 = May-Aug
  • Table 5 A34:O46 = Sept-Dec
Girls

  • Table 5 A6:O18 = Jan - Apr
  • Table 5 A20:O31 = May-Aug
  • Table 5 A35:O46 = Sept-Dec

  • All of these tables have the same structure, therefore we can use the union tool to combine these together, and rename the Table Names to Gender. 

    Step 2 - Clean Additional Rows

    Within the Girls input there are a few additional rows which don't quite match the same format as our initial input. Therefore we will need to clean these separately and ensure that they are in the same format so we can combine them later. 

    First, we want to focus on Table 5 A32:C33 & Table 5 M32:O33 which have extra date for months May & Aug. Both of these tables are in a similar structure and require the same set of steps to transform the data.

    Starting with the May input we need to follow these steps: 

    1. Add Month field - Use a calculated field to add a string called 'May'
    2. Pivot - Use a columns to rows pivot on all of the fields apart from Month
    3. Identify Values Names - Using the following calculation we can identify whether the row is a Name, Rank or Count: 
    Pivot1 Values

    [Month]+' '+ 
    (IF REGEXP_MATCH([Pivot1 Names],'[[:alpha:]]')
    THEN "Name"
    ELSEIF int([Pivot1 Names])<=10
    THEN "Rank"
    ELSE "Count"
    END)

    This calculation can be broken down in the following way: 

    [Month]+' '+ - this adds the month and a space to the start of the string

    IF REGEXP_MATCH([Pivot1 Names],'[[:alpha:]]') THEN "Name" - identifies if there are letters and returns 'Name'

    ELSEIF int([Pivot1 Names])<=10 - If the number is below 10 then return Rank

    Then all other rows will be returned as 'Count'.

    We need to repeat these steps for the Aug table as well. Therefore, instead of repeating the process manually, we can use Tableau Prep's 'Save Steps as Flow' feature. To use this, we need to highlight the steps that we want to use again, then Right-Click and choose 'Save Steps as Flow':


    You have the option to save these to a file and also to your server. When saving to the Tableau Server, other people within your organisation can also access these steps, so this can be a really useful feature if you have a lot of repeatable processes. 

    When adding the saved steps onto our Aug input, we do this by pressing the '+' then choosing 'Insert Flow'. 


    You can then select the steps that we have just created, and this will be input into the workflow. All we need to do is update the Month field in the first step from May to August. 

    As these are similar steps, there is also the option to Group these together so that our workflow looks a bit cleaner. This was a feature released in 2020.3 and we can highlight multiple steps, and then right-click and Group them together.


    Now we have prepared and grouped our steps, we can combine these together using a union tool. This creates two columns (Metric & Fields) but we want the metrics to be in separate columns therefore we now need to use a rows to columns pivot:


    Our data now looks like this:


    The last input table that we need to clean is the additional rows for December. This is in a slightly different structure to the other additional rows, so we can't use the Saved Steps. Instead we are going to remove any null values, and then rename the fields to the following:

    • F1 - December Rank
    • F2 - December Name
    • F3 - December Count
    After renaming the fields our table looks like this: 


    We are now ready to combine all of our different data sources so that we have a single table. Using the initial union that we created for the first set of files, we can add the additional rows to this pivot as well, so our flow now looks like this:


    Step 3 - Pivot & Clean

    Now we have everything in a single table, it's time to make some transformations so that we can start building towards the desired outputs. 

    First, we want to pivot the data so that the monthly columns are in rows instead. However, we want to maintain columns for our Rank, Name, & Count so we can use a multiple pivot for this. Using the wildcard pivot entry, we want to add three columns (these can be added using the '+' in the top right corner of the Pivoted Fields section) with the following setup: 


    Now we have the data pivoted into the correct structure, we need to complete the following steps to clean the fields up a bit: 

    1. Filter Nulls - Remove any null values from the Name field
    2. Split Month - Using an automatic split on the Pivot1 Names field will split out the Month. We can also rename this field Month and remove the Pivot1 Names field.
    3. Split Gender - Using the Tables Names field, create a custom split to split off the first field before the full stop. 
    4. In the newly split field, we want to remove any numbers (the 2019), and then use a Replace calculation to remove the 'names' part of the string: Gender= REPLACE([Gender],'names','')
    5. Capitalise the first letter. We can use the following calculation to make the first letter a capital: 
    Gender
    REPLACE([Gender],LEFT([Gender],1),UPPER(LEFT([Gender],1)))

    The way that our data is currently structured is correct for the first output which should look like this: 



    Step 4 - Prepare Outputs

    To create our second input we need to use an aggregation to calculate the totals for the year. Using the aggregation tool we want to group by Gender & Name, then Sum the Count fields: 


    We now need to recalculate the ranks for the yearly totals using the following Rank calculation:

    2019 Rank

    Then finally we just want to keep the top 10, so we can use a range filter to keep ranks under 10:

    We are now ready for the 2nd output: 


    Both of the outputs can be downloaded here

    After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja@JennyMartinDS14@JonathanAllenby & @TomProwse1

    You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 

    Popular posts from this blog

    2023: Week 1 The Data Source Bank

    2023: Week 2 - International Bank Account Numbers

    How to...Handle Free Text