
Showing posts from October, 2024

2024: Week 42 - Strictly Come Dancing

Challenge by: Jenny Martin

Autumn is always the time of year that Strictly Come Dancing returns to our TVs in the UK. I always find it interesting which songs are chosen for the couples to dance to - particularly when there are repeats. With 22 series, this repetition is not surprising, so I set about gathering a dataset that would allow me to see what the most common song choices are.

Inputs
I used this as an opportunity to learn to webscrape Wikipedia using Python (with a lot of help from ChatGPT!), so the resulting table is a combination of each dance from each series:

Requirements
Input the data
One thing the data is missing is a year field for when the Series took place
Series 1 and 2 were both in 2004
All following series happen annually (Series 3 in 2005, etc.)
The webscraping isn't quite perfect and the table headers are repeated throughout the dataset - make sure these are removed
Split the Week field into a numeric value and put extra details in the theme week
Split this
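If you fancy trying the webscrape yourself, something along these lines in Python with pandas is one way in; the URL pattern, the column names and the series range are assumptions for illustration rather than the exact script behind the challenge:

```python
# Rough sketch only - the Wikipedia URL pattern, the 'Dance' column name
# and the series range are assumptions, not the exact script used here.
import pandas as pd

frames = []
for series in range(1, 23):  # 22 series to date
    url = f"https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_{series})"
    for table in pd.read_html(url):      # one DataFrame per HTML table on the page
        if "Dance" in table.columns:     # keep only the tables that list dances
            table = table.copy()
            table["Series"] = series
            frames.append(table)

df = pd.concat(frames, ignore_index=True)

# The scrape repeats the header row inside the data, so drop those rows
df = df[df["Dance"] != "Dance"]

# Series 1 and 2 were both in 2004; every later series is one year on
df["Year"] = df["Series"].apply(lambda s: 2004 if s <= 2 else 2004 + (s - 2))
```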

2024: Week 41 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Input All Years
The first step is to input all of the sheets for each year's worth of data. We can do this using the 'union multiple tables' functionality within the input step, where we want to include all sheets with the matching pattern '20*':
We can then merge some of the columns together so that we have a complete table with no mismatched fields. The fields we need to merge are:
Total & Total Earnings
Salary/Winnings & Salary/winnings
Country, Nationality, & Nation
After merging, the table should look like this:

Step 2 - Create Years and Monetary Amounts
Next we want to rename the Table Names field to Year and make sure it's a number field. Then we can turn to the monetary amounts by pivoting the Salary/Winnings, Total Earnings, and Endorsements fields using a Columns to Rows pivot:
This allows us to then remove the following in these fields:
Remove $ sign
REPLACE([Pivot1 Va
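For anyone working outside of Prep, a loose pandas equivalent of these steps might look like the sketch below; the workbook name, the sheet naming and the field names are assumptions:

```python
# A loose pandas equivalent of the steps above. The workbook name, sheet names
# ('2020', '2021', ...) and field names are assumptions for illustration.
import pandas as pd

sheets = pd.read_excel("forbes_athletes.xlsx", sheet_name=None)  # sheet name -> DataFrame
frames = []
for name, sheet in sheets.items():
    if name.startswith("20"):            # union only the year sheets ('20*')
        sheet = sheet.rename(columns={   # merge the mismatched field names
            "Total": "Total Earnings",
            "Salary/winnings": "Salary/Winnings",
            "Nationality": "Country",
            "Nation": "Country",
        })
        sheet["Year"] = int(name)        # Table Names -> numeric Year field
        frames.append(sheet)

athletes = pd.concat(frames, ignore_index=True)

# Columns to Rows pivot on the monetary fields, then remove the $ sign
money = ["Salary/Winnings", "Total Earnings", "Endorsements"]
long = athletes.melt(id_vars=[c for c in athletes.columns if c not in money],
                     value_vars=money, var_name="Earning Type", value_name="Amount")
long["Amount"] = long["Amount"].astype(str).str.replace("$", "", regex=False)
```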

2024: Week 41 - Forbes Highest Paid Athletes

Challenge by: Robbin Vernooij

Recently, one of the Data School Coaches, Robbin, set the following challenge. It seemed perfect for a Preppin' Data, so over to Robbin:
We'd like to get historical data on the highest paid athletes so we can do temporal analysis. Lucky us, it turns out Wikipedia has been tracking the Forbes list of the world's highest-paid athletes. Unlucky us, it is in an HTML table format, with human-readable symbols, presented on a table-by-table basis. Now it's time for you to clean it up into one single dataset, so that it's ready for analysis.

Inputs
The data for this challenge comes from this Wikipedia page. There is a table for each year that looks like this (2024 example):
As well as a source table:

Requirements
Input the data
Bring all the year tables together into a single table
Merge any mismatched fields (there should not be any Null values)
Create a numeric Year field
Clean up the fields with the monetary amounts
One way of doing this could be p
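If you'd rather grab the tables programmatically, a minimal pandas sketch could look like this; the URL, the 'Athlete' and 'Total earnings' column names and the newest-first ordering of the year tables are all assumptions for illustration:

```python
# Minimal sketch only - the URL, the 'Athlete' / 'Total earnings' column names
# and the newest-first table order are assumptions, not part of the challenge.
import pandas as pd

url = "https://en.wikipedia.org/wiki/Forbes%27_list_of_the_world%27s_highest-paid_athletes"  # assumed link
tables = pd.read_html(url)                                    # one DataFrame per HTML table

year_tables = [t for t in tables if "Athlete" in t.columns]   # keep the ranking tables
frames = []
for offset, t in enumerate(year_tables):
    t = t.copy()
    t["Year"] = 2024 - offset            # assumes the tables run newest-first
    frames.append(t)

athletes = pd.concat(frames, ignore_index=True)

# Turn a human-readable amount such as '$110 million' into a plain number
athletes["Total earnings"] = pd.to_numeric(
    athletes["Total earnings"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
```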

2024: Week 40 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Split Users
First up we want to split the Users field so that we have a separate row for each user. Currently, they are all in a single row and are separated by a ','. Therefore, we can use the custom split functionality to break these into separate fields:
From here we can remove the original field and then pivot the split fields using a Columns to Rows pivot, with the word 'users' as the wildcard option:
At this stage we can rename the Pivot Values to User and then start to create some IDs from the user data.
User ID
LEFT([Users],7)
Private or Dealer
RIGHT([Users],1)
Then we want to keep only the 'D' for Dealers.
Dealership ID
MID([Users],8,3)
At this stage our Users table should now look like this:

Step 2 - Combine Ads Data
Once we have input the Ads data, we can then remove any null values from the sale_date field and make sure it's a Date field type. We can then join this to our us
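For comparison, here's a loose pandas translation of the same flow; the file names, the 'Users' column layout and the join key are assumptions, while the string slicing mirrors the LEFT/RIGHT/MID logic above:

```python
# A loose pandas translation of the flow above. File names, the 'Users' column
# layout and the join key are assumptions; the slicing mirrors LEFT/RIGHT/MID.
import pandas as pd

users_raw = pd.read_csv("users.csv")
ads = pd.read_csv("ads.csv")

# Custom split + Columns to Rows pivot: one row per user code
users = (
    users_raw["Users"]
    .str.split(",", expand=True)   # split the single row into separate columns
    .melt(value_name="User")       # pivot those columns back into rows
    .dropna(subset=["User"])
)
users["User"] = users["User"].str.strip()

# LEFT([Users],7), RIGHT([Users],1) and MID([Users],8,3)
users["User ID"] = users["User"].str[:7]
users["Private or Dealer"] = users["User"].str[-1]
users["Dealership ID"] = users["User"].str[7:10]

dealers = users[users["Private or Dealer"] == "D"]   # keep only the Dealers

# Remove unsold vehicles, fix the date type, then join to the dealer users
ads = ads.dropna(subset=["sale_date"])
ads["sale_date"] = pd.to_datetime(ads["sale_date"])
combined = ads.merge(dealers, on="User ID", how="inner")  # assumes ads also carries a User ID field
```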

2024: Week 39 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Each Day of Engagement
First, we want to ensure that there is a row for each day that a consultant is on an engagement. For this we can use the New Rows step within Prep to help pad out any missing days. Within the setup we want to add new rows between the engagement start and end dates, with an increment of 1 day:
We can then remove the weekends from the list by first identifying the weekday using the DATENAME function:
Weekday
DATENAME('weekday',[Work Day])
Then from this field we can exclude the weekends (Saturday & Sunday).
Finally we can calculate the number of calendar days for each engagement:
Calendar Days
DATEDIFF('day',[Engagement Start Date],[Engagement End Date])
At this stage the table should look like this:

Step 2 - Aggregate & Rank
The final part of the challenge is to aggregate our table as per the requirements. For this we can use the aggregate step where we g
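Outside of Prep, the same padding-out of days could be sketched in pandas like this; the input file name is an assumption, but the field names follow the calculations above:

```python
# A pandas sketch of the same logic - the input file name is an assumption,
# but the field names match the calculations above.
import pandas as pd

engagements = pd.read_csv(
    "engagements.csv",
    parse_dates=["Engagement Start Date", "Engagement End Date"],
)

# New Rows step: one row per day from the start date to the end date (1 day increment)
engagements["Work Day"] = engagements.apply(
    lambda r: pd.date_range(r["Engagement Start Date"], r["Engagement End Date"], freq="D"),
    axis=1,
)
days = engagements.explode("Work Day")
days["Work Day"] = pd.to_datetime(days["Work Day"])

# DATENAME('weekday',[Work Day]) and then exclude Saturday & Sunday
days["Weekday"] = days["Work Day"].dt.day_name()
days = days[~days["Weekday"].isin(["Saturday", "Sunday"])]

# DATEDIFF('day',[Engagement Start Date],[Engagement End Date])
days["Calendar Days"] = (days["Engagement End Date"] - days["Engagement Start Date"]).dt.days
```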

2024: Week 40 - Vrroom

Challenge by: Abhishek Minz

Vrroom is an online platform for used cars where individuals and dealerships can advertise their vehicles. At present, five different dealerships are using the website. Vrroom's management team need to find out which dealership is taking the longest time (in days) to sell their vehicles through the platform.

Input
There are two csv data sets this week:
1. The Adverts (ads) data set:
2. The Users data set:
There are 365 users of the website - each vehicle listed is classed as a different registration number.

Requirements
Input the data sets
Break the Users data set into individual records (you should have 365 rows)
The User data is formed from:
The first 7 characters are the User ID
The last letter signifies whether the user is a private individual ('P') or a Dealership ('D')
The 3 characters after the User ID for Dealerships are the Dealership ID
With the Ads data, remove any unsold vehicles
Join the data sets together
Find when an advert is first posted
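As a quick illustration of those user-code rules, here's a tiny Python sketch that parses a single, made-up user code:

```python
# A tiny illustration of the rules above. The example code 'ABC1234XYZD' is
# made up, not a value from the real data set.
def parse_user(code: str) -> dict:
    user_id = code[:7]                    # first 7 characters = User ID
    user_type = code[-1]                  # last letter: 'P' private, 'D' dealership
    dealership_id = code[7:10] if user_type == "D" else None  # 3 chars after the User ID
    return {"User ID": user_id, "Type": user_type, "Dealership ID": dealership_id}

print(parse_user("ABC1234XYZD"))
# {'User ID': 'ABC1234', 'Type': 'D', 'Dealership ID': 'XYZ'}
```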