2022 Week 12 - Solution

Solution by Tom Prowse and you can download the workbook here.

This week take inspiration from the Gender Pay Gap Bot on Twitter, and take a look at historical data to get this into a nice, easy to understand structure.

Step 1 - Combine Files

The first step is to combine each of the different files from the different years. We have 5 different CSV files so we need to use the Wildcard Union within the Input step to stack these on top on one another.


After inputting all of the data files, we can then remove a lot of the fields that are included. We can remove fields straight from the input step by using the tick boxes on the right-hand side. 

We want to keep the following fields (these will be ticked): 

  • EmployerID
  • EmployerSize
  • DiffMedianHourlyPercent
  • File Paths
  • EmployerName
Our table should now look like this: 


Step 2 - Clean Report Year

Next we want to parse the year from each of the file names. After the wildcard union, Tableau Prep automatically creates the File Paths field which provides information about which input each of the rows come from. We can use this field to create a Year field for our data.

First, we want to split the data so that we only return the Year section at the end of the string. Fortunately, we can use the Automatic split to split the two years from the string into separate fields:


We can then rename the File Paths - Split 1 to Year, and then remove the File Path - Split 2 and File Paths fields so our data now looks like this:



Step 3 - Most Recent Report

To ensure we are including any companies that have changed their name, we now want to find out the latest year that each EmployerID had submitted a report. Therefore, we need to calculate the Max Year for each EmployerID by using a Fixed LOD calculation: 

Max Year 

Then we want to filter to return only where the Year = Max Year. This will only return the latest year that each EmployerID had submitted a report. 

At this stage we no longer need the Years, because we now have the latest name for each EmployerID, so we can keep only the EmployerName & EmployerID fields:


Then finally we can join this back to our original data set with an Inner Join where EmployerID = EmployerID


And then remove the duplicated EmployerID and the Employer Name from the original workflow - this will leave us with the updated company names.


Step 4 - Pay Gap Explainer

We now want to create a field that will help to explain the Pay Gap information in easy to understand English. 

As noted in the requirements there are a few different problems that might occur here. Therefore we follow these steps to create the string for each company:

1. Float Point Inaccuracies

String 

IF ABS([DiffMedianHourlyPercent])<10 THEN
LEFT(STR((CEILING(round(abs([DiffMedianHourlyPercent]),1)*100))/100),4)
ELSEIF ABS([DiffMedianHourlyPercent])<100 THEN
LEFT(STR((CEILING(round(abs([DiffMedianHourlyPercent]),1)*100))/100),5)
ELSE
LEFT(STR((CEILING(round(abs([DiffMedianHourlyPercent]),1)*100))/100),6)
END

This will ensure that the numbers are correct rounded to 2 decimals places.

2. Positive & Negatives

String  

IF RIGHT([String],1)='0' and LEN([String])>1
THEN LEFT([String],LEN([String])-1)
ELSE [String]
END

This will ensure that the positive and negatives are as we expect.

3. Correct Phrasing

IF [DiffMedianHourlyPercent]>0 THEN
"In this organisation, women's median hourly pay is "+[String]+"% lower than men's."

ELSEIF [DiffMedianHourlyPercent]<0 THEN
"In this organisation, women's median hourly pay is "+[String]+"% higher than men's."

ELSE "In this organisation, men's and women's median hourly pay is equal."
END

This creates the string for each of the three different options.

The final step of the challenge this week is to reorder the fields and then we are ready to output our data: 


You can download the full output here

After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja@JennyMartinDS14 & @TomProwse1

You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 


Popular posts from this blog

2024: Week 1 - Prep Air's Flow Card

2023: Week 1 The Data Source Bank

How to...Handle Free Text