Posts

2024: Week 43 - MiniBytes Worst Selling Products

Challenge by: Jessica Kirk It's week 43 of the year, so it seems only right to return to the challenges that DS43 created for us earlier in the year:

2024: Week 42 - Solution

Solution by Tom Prowse and you can download the workflow here.
Step 1 - Year, Week, & Theme
After inputting the data source we can start to tidy up the year and weeks for each of the shows. To calculate each year we can use the requirements as a guide to when the show started:
Year
IF [Series]<=2 THEN 2004 ELSE 2002+[Series] END
Next we can extract the week information to include the number and theme. For this we want to split out the information by using an automatic split, which will remove the word 'week'.
Then we can use that split field to extract the week number and theme using a split on the ':' - an automatic split works here as well.
Finally we want to tidy up the weeks where there is semi-final or final in the brackets. This time we need to use a custom split on the ' ', which will separate the numbers and text, and then remove the punctuation from the text field. After we have parsed the final & semi-final text we can merge this field with the
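For readers who want to check the logic outside Tableau Prep, the year calculation and the week split can be sketched in Python. This is only an illustration, not the workflow itself: the function names are hypothetical, and the sample Week values ("Week 3: Movie Week", "Week 12 (Semi-final)") are assumed formats rather than rows taken from the dataset.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the Week field: the word "Week", a number, then an
# optional theme after a colon or inside brackets.
WEEK_PATTERN = re.compile(r"^\s*Week\s+(\d+)\s*[:(]?\s*(.*?)\s*\)?\s*$", re.IGNORECASE)


def series_year(series: int) -> int:
    """Mirror of IF [Series]<=2 THEN 2004 ELSE 2002+[Series] END."""
    return 2004 if series <= 2 else 2002 + series


def split_week(week: str) -> Tuple[int, Optional[str]]:
    """Return (week number, theme); theme is None when there is no extra text."""
    match = WEEK_PATTERN.match(week)
    if match is None:
        raise ValueError(f"Unexpected Week value: {week!r}")
    number, theme = match.groups()
    return int(number), theme or None


print(series_year(3))                      # Series 3 took place in 2005
print(split_week("Week 3: Movie Week"))
print(split_week("Week 12 (Semi-final)"))
```

A single regular expression replaces the three separate splits here; in Prep the step-by-step splits are easier to inspect, so this is a design shortcut rather than a like-for-like translation.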

2024: Week 42 - Strictly Come Dancing

Challenge by: Jenny Martin
Autumn is always the time of year that Strictly Come Dancing returns to our TVs in the UK. I always find it interesting which songs are chosen for the couples to dance to - particularly when there are repeats. With 22 seasons, this repetition is not surprising, so I set about gathering a dataset that would allow me to see what the most common song choices are.
Inputs
I used this as an opportunity to learn to webscrape Wikipedia using Python (with a lot of help from ChatGPT!), so the resulting table is a combination of each dance from each series:
Requirements
Input the data
One thing the data is missing is a year field for when the Series took place
Series 1 and 2 were both in 2004
All following series happen annually
Series 3 in 2005 etc.
The webscraping isn't quite perfect and the table headers are repeated throughout the dataset, make sure these are removed
Split the Week field into a numeric value and put extra details in the theme week
Split this
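The requirement to remove the repeated table headers can be illustrated with a small Python sketch. It assumes each scraped row is a dictionary keyed by column name and that a repeated header row is one where every value equals its own column name; the column names and sample rows below are placeholders, not data from the challenge input.

```python
from typing import Dict, List


def drop_repeated_headers(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows where at least one value differs from its column name."""
    return [
        row
        for row in rows
        if any(str(value).strip() != column for column, value in row.items())
    ]


# Placeholder rows: the middle one is a header repeated inside the data.
scraped = [
    {"Series": "1", "Week": "Week 1", "Song": "Song A"},
    {"Series": "Series", "Week": "Week", "Song": "Song"},
    {"Series": "1", "Week": "Week 2", "Song": "Song B"},
]

cleaned = drop_repeated_headers(scraped)
print(len(cleaned))
```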

2024: Week 41 - Solution

Solution by Tom Prowse and you can download the workflow here.
Step 1 - Input All Years
The first step is to input all of the sheets for each year's worth of data. We can do this using the 'union multiple tables' functionality within the input step, where we want to include all sheets with the matching pattern '20*':
We can then merge some of the columns together so that we have a complete table with no mismatched fields. The fields we need to merge are:
Total & Total Earnings
Salary/Winnings & Salary/winnings
Country, Nationality, & Nation
After merging the table should look like this:
Step 2 - Create Years and Monetary Amounts
Next we want to rename the Table Names field to Year and make sure it's a number field. Then we can turn to the monetary amounts by pivoting the Salary/Winnings, Total Earnings, and Endorsements fields using a Columns to Rows pivot:
This allows us to then remove the following in these fields:
Remove $ sign
REPLACE([Pivot1 Va
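The field merge in Step 1 amounts to taking, for each group of mismatched columns, the first value that is not null. A minimal Python sketch of that idea follows. The groups come from the list in Step 1, but which name each merged field keeps is an assumption here, and the sample row uses placeholder values.

```python
from typing import Dict, List, Optional

# Merged field name -> source columns that may hold its value.
# The choice of surviving name for each group is an assumption.
MERGE_GROUPS: Dict[str, List[str]] = {
    "Total Earnings": ["Total Earnings", "Total"],
    "Salary/Winnings": ["Salary/Winnings", "Salary/winnings"],
    "Country": ["Country", "Nationality", "Nation"],
}


def merge_mismatched_fields(row: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Collapse each group of mismatched columns into one field."""
    grouped = {name for names in MERGE_GROUPS.values() for name in names}
    # Columns outside the groups are carried over unchanged.
    merged = {key: value for key, value in row.items() if key not in grouped}
    for target, names in MERGE_GROUPS.items():
        # First non-null value wins, in the order listed above.
        merged[target] = next(
            (row[name] for name in names if row.get(name) is not None), None
        )
    return merged


# Placeholder row from a year whose sheet uses the alternative column names.
row = {"Name": "Athlete A", "Total": "$100", "Salary/winnings": "$60", "Nation": "Country X"}
print(merge_mismatched_fields(row))
```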

2024: Week 41 - Forbes Highest Paid Athletes

Challenge by: Robbin Vernooij
Recently, one of the Data School Coaches, Robbin, set the following challenge. It seemed perfect for a Preppin' Data challenge, so over to Robbin:
We'd like to get historical data on the highest-paid athletes so we can do temporal analysis. Lucky us, it turns out Wikipedia has been tracking the Forbes list of the world's highest-paid athletes. Unlucky us, it is in an HTML table format with human-readable symbols and on a table-by-table basis. Now it's time for you to clean it up into one single dataset, so that it's ready for analysis.
Inputs
The data for this challenge comes from this Wikipedia page. There is a table for each year that looks like this (2024 example):
As well as a source table:
Requirements
Input the data
Bring all the year tables together into a single table
Merge any mismatched fields (there should not be any Null values)
Create a numeric Year field
Clean up the fields with the monetary amounts
One way of doing this could be p

2024: Week 40 - Solution

Solution by Tom Prowse and you can download the workflow here.
Step 1 - Split Users
First up we want to split the users field so that we have a separate row for each user. Currently, they are all in a single row and are separated by a ','. Therefore, we can use the custom split functionality to break these into separate fields:
From here we can remove the original field and then pivot the split fields using a Columns to Rows pivot with the word 'users' as the wildcard option:
At this stage we can rename the Pivot Values to User and then start to create some IDs from the user data.
User ID
LEFT([Users],7)
Private or Dealer
RIGHT([Users],1)
Then we want to keep only the 'D' for Dealers.
Dealership ID
MID([Users],8,3)
At this stage our Users table should now look like this:
Step 2 - Combine Ads Data
Once we have input the Ads data, we can then remove any null values from the sale_date field and make sure it's a Date field type.
We can then join this to our us
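The split and the three string calculations from Step 1 translate to Python slicing as sketched below. Tableau's MID is 1-indexed, so MID([Users],8,3) corresponds to the slice [7:10]. The function names and the sample value "ABC1234XYZD" are hypothetical; the sample simply assumes a 7-character user ID, a 3-character dealership ID, and a final type letter.

```python
from typing import Dict, List


def parse_users(users_field: str) -> List[Dict[str, str]]:
    """Split a comma-separated users string into one record per user."""
    records = []
    for raw in users_field.split(","):
        user = raw.strip()
        if not user:
            continue  # skip empty pieces left by stray commas
        records.append(
            {
                "user_id": user[:7],         # LEFT([Users],7)
                "dealership_id": user[7:10],  # MID([Users],8,3)
                "user_type": user[-1],        # RIGHT([Users],1)
            }
        )
    return records


def dealers_only(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose type letter is 'D' (Dealer)."""
    return [record for record in records if record["user_type"] == "D"]


# Hypothetical input: one dealer and one other user.
parsed = parse_users("ABC1234XYZD, DEF5678UVWP")
print(dealers_only(parsed))
```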

2024: Week 39 - Solution

Solution by Tom Prowse and you can download the workflow here.
Step 1 - Each Day of Engagement
First, we want to ensure that there is a row for each day that a consultant is on an engagement. For this we can use the New Rows step within Prep to help pad out any missing days. Within the setup we want to add new rows between the engagement start and end dates and have an increment of 1 day:
We can then remove the weekends from the list by first identifying the weekday using the DATENAME function:
Weekday
DATENAME('weekday',[Work Day])
Then from this field we can exclude the weekends (Saturday & Sunday).
Finally we can calculate the number of calendar days for each engagement:
Calendar Days
DATEDIFF('day',[Engagement Start Date],[Engagement End Date])
At this stage the table should look like this:
Step 2 - Aggregate & Rank
The final part of the challenge is to aggregate our table as per the requirements. For this we can use the aggregate step where we g
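As a rough cross-check of Step 1, the same three operations (one row per day, weekends removed, calendar-day difference) can be sketched with Python's datetime module. The function names are hypothetical, the sample dates are made up, and the sketch assumes the generated rows include both the start and the end date.

```python
from datetime import date, timedelta
from typing import List


def engagement_work_days(start: date, end: date) -> List[date]:
    """One entry per weekday between start and end, both assumed included."""
    total_days = (end - start).days
    all_days = [start + timedelta(days=offset) for offset in range(total_days + 1)]
    # date.weekday() is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
    return [day for day in all_days if day.weekday() < 5]


def calendar_days(start: date, end: date) -> int:
    """Counterpart of DATEDIFF('day', start, end): whole days from start to end."""
    return (end - start).days


# Made-up engagement running from a Monday to the Tuesday of the following week.
start, end = date(2024, 9, 23), date(2024, 10, 1)
work_days = engagement_work_days(start, end)
print(len(work_days), calendar_days(start, end))
```

Using weekday numbers instead of day names avoids depending on the locale, which is the main difference from filtering on the output of DATENAME.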