Posts

2024: Week 41 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Input All Years

The first step is to input all of the sheets for each year's worth of data. We can do this using the 'union multiple tables' functionality within the input step, where we want to include all sheets matching the pattern '20*'.

We can then merge some of the columns together so that we have a complete table with no mismatched fields. The fields we need to merge are:

Total & Total Earnings
Salary/Winnings & Salary/winnings
Country, Nationality, & Nation

After merging the table should look like this:

Step 2 - Create Years and Monetary Amounts

Next we want to rename the Table Names field to Year and make sure it's a number field. Then we can turn to the monetary amounts by pivoting the Salary/Winnings, Total Earnings, and Endorsements fields using a Columns to Rows pivot.

This allows us to then remove the following in these fields:

Remove $ sign
REPLACE([Pivot1 Va
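The field merge in Step 1 can be sketched outside Tableau Prep. The Python below is a minimal illustration, not the workflow from the solution: it assumes each row is a dictionary, that the merged columns keep the names Total Earnings, Salary/Winnings, and Country (the last is an assumption), and the sample values are made up.

```python
# Columns that hold the same information under different names across years.
# The target name "Country" is an assumption; the other two targets match the
# field names the solution pivots later on.
MERGE_GROUPS = {
    "Total Earnings": ["Total Earnings", "Total"],
    "Salary/Winnings": ["Salary/Winnings", "Salary/winnings"],
    "Country": ["Country", "Nationality", "Nation"],
}


def merge_fields(row: dict) -> dict:
    """Collapse each group of mismatched columns into one column."""
    grouped = {name for names in MERGE_GROUPS.values() for name in names}
    # Keep every column that is not part of a merge group untouched.
    merged = {key: value for key, value in row.items() if key not in grouped}
    for target, sources in MERGE_GROUPS.items():
        # Take the first non-null value found among the source columns.
        merged[target] = next(
            (row[s] for s in sources if row.get(s) is not None), None
        )
    return merged


# Hypothetical row as it might come from one year's sheet.
example = {"Name": "A. Athlete", "Total": "$1", "Salary/winnings": "$2", "Nation": "X"}
print(merge_fields(example))
```

Taking the first non-null value per group works here because each year's sheet is expected to fill only one column in each group.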

2024: Week 41 - Forbes Highest Paid Athletes

Challenge by: Robbin Vernooij

Recently, one of the Data School Coaches, Robbin, set the following challenge. It seemed perfect for a Preppin' Data, so over to Robbin:

We'd like to get historical data on the highest paid athletes so we can do temporal analysis. Lucky us, it turns out Wikipedia has been tracking the Forbes list of the world's highest-paid athletes. Unlucky us, it is in an HTML table format with human-readable symbols and on a table-by-table basis. Now it's time for you to clean it up into one single dataset, so that it's ready for analysis.

Inputs

The data for this challenge comes from this Wikipedia page. There is a table for each year that looks like this (2024 example):

As well as a source table:

Requirements

Input the data
Bring all the year tables together into a single table
Merge any mismatched fields (there should not be any Null values)
Create a numeric Year field
Clean up the fields with the monetary amounts
  One way of doing this could be p

2024: Week 40 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Split Users

First up we want to split the Users field so that we have a separate row for each user. Currently, they are all in a single row and are separated by a ','. Therefore, we can use the custom split functionality to break these into separate fields.

From here we can remove the original field and then pivot the split fields using a Columns to Rows pivot with the word 'users' as the wildcard option.

At this stage we can rename the Pivot Values to User and then start to create some IDs from the user data.

User ID
LEFT([Users],7)

Private or Dealer
RIGHT([Users],1)

Then we want to keep only the 'D' for Dealers.

Dealership ID
MID([Users],8,3)

At this stage our Users table should now look like this:

Step 2 - Combine Ads Data

Once we have input the Ads data, we can then remove any null values from the sale_date field and make sure it's a Date field type.

We can then join this to our us
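For readers who want to check the ID calculations outside Tableau Prep, here is a rough Python equivalent of the split and the LEFT, RIGHT, and MID expressions. It is only a sketch: the function names are made up for illustration, and the sample user strings are hypothetical values shaped to fit the description (7-character User ID, 3-character Dealership ID, trailing 'P' or 'D').

```python
def parse_user(raw: str) -> dict:
    """Derive the three ID fields from one user string."""
    user = raw.strip()
    return {
        "user_id": user[:7],           # LEFT([Users], 7)
        "user_type": user[-1],         # RIGHT([Users], 1)
        "dealership_id": user[7:10],   # MID([Users], 8, 3); MID is 1-indexed
    }


def split_users(field: str) -> list[dict]:
    """Split the comma-separated field into one record per user."""
    return [parse_user(part) for part in field.split(",") if part.strip()]


def dealers_only(users: list[dict]) -> list[dict]:
    """Keep only the users flagged 'D' for Dealers."""
    return [u for u in users if u["user_type"] == "D"]


# Hypothetical input: two dealers and one private individual.
sample = "U000001AB1D, U000002P, U000003CD2D"
for dealer in dealers_only(split_users(sample)):
    print(dealer["user_id"], dealer["dealership_id"])
```

Note that the Dealership ID slice is only meaningful for dealer records, which is why the filter is applied before the value is used.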

2024: Week 39 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Each Day of Engagement

First, we want to ensure that there is a row for each day that a consultant is on an engagement. For this we can use the New Rows step within Prep to help pad out any missing days. Within the setup we want to add new rows between the engagement start and end dates and have an increment of 1 day.

We can then remove the weekends from the list by first identifying the weekday using the DATENAME function:

Weekday
DATENAME('weekday',[Work Day])

Then from this field we can exclude the weekends (Saturday & Sunday).

Finally we can calculate the number of calendar days for each engagement:

Calendar Days
DATEDIFF('day',[Engagement Start Date],[Engagement End Date])

At this stage the table should look like this:

Step 2 - Aggregate & Rank

The final part of the challenge is to aggregate our table as per the requirements. For this we can use the aggregate step where we g

2024: Week 40 - Vrroom

Challenge by: Abhishek Minz

Vrroom is an online platform for used cars where individuals and dealerships can advertise their vehicles. At present, five different dealerships are using the website. Vrroom's management team need to find out which dealership is taking the longest time (in days) to sell their vehicles through the platform.

Input

There are two CSV data sets this week:

1. The Adverts (ads) data set:
2. The Users data set:

There are 365 users of the website - each vehicle listed is classed as a different registration number.

Requirements

Input the data sets
Break the Users data set into individual records (you should have 365 rows)
The User data is formed from:
  1st 7 characters is the User ID
  The last letter signifies whether the user is a private individual ('P') or Dealership ('D')
  The 3 characters after the User ID for Dealerships is the Dealership ID
With the Ads data, remove any unsold vehicles
Join the data sets together
Find when an advert is first posted

2024: Week 39 - Preppin' Consultancy Ranks

Created by: Carl Allchin

Last week's challenge involved cleaning up the consulting engagements to ensure we didn't have any overlapping engagements. This week's challenge involves conducting some analysis on the engagements. We want to understand who our top earners are at each grade and for the organisation.

Input

One Excel file (the output from last week's challenge)

Requirements

Input the data
Create a row for each day a consultant is on the engagement
Remove weekend days
Work out how many calendar days occur in each engagement (incl. weekend days)
Aggregate the data to:
  Count the number of calendar days a consultant is on engagements for
  The total earned by each individual per engagement
  Retain the engagement number, initials and grade
Rank the consultants by day rate earned, per engagement:
  Overall rank
  Grade rank
Output the data

Output

7 data fields:
  Calendar Days
  Initials
  Engagement Order
  Grade Name
  Day Rate
  Overall Rank
  Grade Rank
718 rows (719 incl. headers)
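The first three requirements (one row per day, no weekends, calendar-day count) can be sketched in Python for anyone checking their Prep output. This is an illustration only: the function names are hypothetical, the day range is assumed to include both the start and end dates, and the calendar-day count mirrors the DATEDIFF('day', start, end) calculation from the Week 39 solution, which is a plain difference between the two dates.

```python
from datetime import date, timedelta


def work_days(start: date, end: date) -> list[date]:
    """One entry per weekday between start and end (both assumed inclusive)."""
    all_days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # weekday() returns 0-4 for Monday-Friday, 5 and 6 for Saturday and Sunday.
    return [day for day in all_days if day.weekday() < 5]


def calendar_days(start: date, end: date) -> int:
    """Difference in days between the two dates, weekends included."""
    return (end - start).days


# Hypothetical engagement running from Friday 5 to Tuesday 9 January 2024.
start, end = date(2024, 1, 5), date(2024, 1, 9)
print(work_days(start, end))      # Friday, Monday, Tuesday
print(calendar_days(start, end))  # 4
```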

2024: Week 38 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Initials Field

First we want to add the Engagements and the Initials data source into the workflow. From here we can join these together using an inner join on Consultant Forename and Initial ID.

We can then remove the Initial ID field and rename the Initial to Initial Forename. Then we can create another join but this time it will be for the Consultant Surname and Initial ID.

After removing and renaming the fields, we can create an Initials field using the Forename and Surname:

Initials
[Initial Forename]+[Initial Surname]

Our table should now look like this:

Step 2 - Engagement Dates & Grades

Next we want to create dates for when the engagement started and ended. We can do this with the MAKEDATE function:

Engagement Start Date
MAKEDATE(2024,[Engagement Start Month],[Engagement Start Day])

Engagement End Date
MAKEDATE(2024,[Engagement End Month],[Engagement End Day])

We can then correct some of the grad
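As a cross-check on the Initials and MAKEDATE calculations, the same logic can be written in a few lines of Python. This is a sketch under assumptions: the function names and the sample values are hypothetical, and the year is fixed to 2024 as in the calculations above.

```python
from datetime import date


def make_initials(initial_forename: str, initial_surname: str) -> str:
    """Equivalent of [Initial Forename]+[Initial Surname]."""
    return initial_forename + initial_surname


def engagement_dates(
    start_month: int, start_day: int, end_month: int, end_day: int, year: int = 2024
) -> tuple[date, date]:
    """Equivalent of the two MAKEDATE calculations.

    date() raises ValueError for an impossible combination such as 30 February,
    which is a quick way to spot bad month or day values.
    """
    return date(year, start_month, start_day), date(year, end_month, end_day)


# Hypothetical engagement: 15 March to 2 April 2024.
print(make_initials("T", "P"))
print(engagement_dates(3, 15, 4, 2))
```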