
Showing posts from October, 2024

2024: Week 42 - Strictly Come Dancing

Challenge by: Jenny Martin

Autumn is always the time of year that Strictly Come Dancing returns to our TVs in the UK. I always find it interesting which songs are chosen for the couples to dance to - particularly when there are repeats. With 22 series, this repetition is not surprising, so I set about gathering a dataset that would allow me to see what the most common song choices are.

Inputs
I used this as an opportunity to learn to webscrape Wikipedia using Python (with a lot of help from ChatGPT!), so the resulting table is a combination of each dance from each series:

Requirements
Input the data
One thing the data is missing is a year field for when the Series took place
Series 1 and 2 were both in 2004
All following series happen annually (Series 3 in 2005, etc.)
The webscraping isn't quite perfect and the table headers are repeated throughout the dataset - make sure these are removed
Split the Week field into a numeric value and put extra details in the theme week
Split this
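If you fancy trying the webscrape yourself, something along these lines in Python with pandas is one way in; the URL pattern, the column names and the series range are assumptions for illustration rather than the exact script behind the challenge:

```python
# Rough sketch only - the Wikipedia URL pattern, the 'Dance' column name
# and the series range are assumptions, not the exact script used here.
import pandas as pd

frames = []
for series in range(1, 23):  # 22 series to date
    url = f"https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_{series})"
    for table in pd.read_html(url):      # one DataFrame per HTML table on the page
        if "Dance" in table.columns:     # keep only the tables that list dances
            table = table.copy()
            table["Series"] = series
            frames.append(table)

df = pd.concat(frames, ignore_index=True)

# The scrape repeats the header row inside the data, so drop those rows
df = df[df["Dance"] != "Dance"]

# Series 1 and 2 were both in 2004; every later series is one year on
df["Year"] = df["Series"].apply(lambda s: 2004 if s <= 2 else 2004 + (s - 2))
```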

2024: Week 41 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Input All Years
The first step is to input all of the sheets for each year's worth of data. We can do this using the 'union multiple tables' functionality within the input step, where we want to include all sheets with the matching pattern '20*':
We can then merge some of the columns together so that we have a complete table with no mismatched fields. The fields we need to merge are:
Total & Total Earnings
Salary/Winnings & Salary/winnings
Country, Nationality, & Nation
After merging, the table should look like this:

Step 2 - Create Years and Monetary Amounts
Next we want to rename the Table Names field to Year and make sure it's a number field. Then we can turn to the monetary amounts by pivoting the Salary/Winnings, Total Earnings, and Endorsements fields using a Columns to Rows pivot:
This allows us to then remove the following in these fields:
Remove $ sign
REPLACE([Pivot1 Va
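For anyone working outside of Prep, a loose pandas equivalent of these steps might look like the sketch below; the workbook name, the sheet naming and the field names are assumptions:

```python
# A loose pandas equivalent of the steps above. The workbook name, sheet names
# ('2020', '2021', ...) and field names are assumptions for illustration.
import pandas as pd

sheets = pd.read_excel("forbes_athletes.xlsx", sheet_name=None)  # sheet name -> DataFrame
frames = []
for name, sheet in sheets.items():
    if name.startswith("20"):            # union only the year sheets ('20*')
        sheet = sheet.rename(columns={   # merge the mismatched field names
            "Total": "Total Earnings",
            "Salary/winnings": "Salary/Winnings",
            "Nationality": "Country",
            "Nation": "Country",
        })
        sheet["Year"] = int(name)        # Table Names -> numeric Year field
        frames.append(sheet)

athletes = pd.concat(frames, ignore_index=True)

# Columns to Rows pivot on the monetary fields, then remove the $ sign
money = ["Salary/Winnings", "Total Earnings", "Endorsements"]
long = athletes.melt(id_vars=[c for c in athletes.columns if c not in money],
                     value_vars=money, var_name="Earning Type", value_name="Amount")
long["Amount"] = long["Amount"].astype(str).str.replace("$", "", regex=False)
```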

2024: Week 41 - Forbes Highest Paid Athletes

Challenge by: Robbin Vernooij

Recently, one of the Data School Coaches, Robbin, set the following challenge. It seemed perfect for a Preppin' Data, so over to Robbin:
We'd like to get historical data on the highest paid athletes so we can do temporal analysis. Lucky us, it turns out Wikipedia has been tracking the Forbes list of the world's highest-paid athletes. Unlucky us, it is in an HTML table format, with human-readable symbols, presented on a table-by-table basis. Now it's time for you to clean it up into one single dataset, so that it's ready for analysis.

Inputs
The data for this challenge comes from this Wikipedia page. There is a table for each year that looks like this (2024 example):
As well as a source table:

Requirements
Input the data
Bring all the year tables together into a single table
Merge any mismatched fields (there should not be any Null values)
Create a numeric Year field
Clean up the fields with the monetary amounts
One way of doing this could be p
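If you'd rather grab the tables programmatically, a minimal pandas sketch could look like this; the URL, the 'Athlete' and 'Total earnings' column names and the newest-first ordering of the year tables are all assumptions for illustration:

```python
# Minimal sketch only - the URL, the 'Athlete' / 'Total earnings' column names
# and the newest-first table order are assumptions, not part of the challenge.
import pandas as pd

url = "https://en.wikipedia.org/wiki/Forbes%27_list_of_the_world%27s_highest-paid_athletes"  # assumed link
tables = pd.read_html(url)                                    # one DataFrame per HTML table

year_tables = [t for t in tables if "Athlete" in t.columns]   # keep the ranking tables
frames = []
for offset, t in enumerate(year_tables):
    t = t.copy()
    t["Year"] = 2024 - offset            # assumes the tables run newest-first
    frames.append(t)

athletes = pd.concat(frames, ignore_index=True)

# Turn a human-readable amount such as '$110 million' into a plain number
athletes["Total earnings"] = pd.to_numeric(
    athletes["Total earnings"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
```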

2024: Week 40 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Split Users
First up we want to split the Users field so that we have a separate row for each user. Currently, they are all in a single row and are separated by a ','. Therefore, we can use the custom split functionality to break these into separate fields:
From here we can remove the original field and then pivot the split fields using a Columns to Rows pivot, with the word 'users' as the wildcard option:
At this stage we can rename the Pivot Values to User and then start to create some IDs from the user data.
User ID
LEFT([Users],7)
Private or Dealer
RIGHT([Users],1)
Then we want to keep only the 'D' for Dealers.
Dealership ID
MID([Users],8,3)
At this stage our Users table should now look like this:

Step 2 - Combine Ads Data
Once we have input the Ads data, we can then remove any null values from the sale_date field and make sure it's a Date field type. We can then join this to our us
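For comparison, here's a loose pandas translation of the same flow; the file names, the 'Users' column layout and the join key are assumptions, while the string slicing mirrors the LEFT/RIGHT/MID logic above:

```python
# A loose pandas translation of the flow above. File names, the 'Users' column
# layout and the join key are assumptions; the slicing mirrors LEFT/RIGHT/MID.
import pandas as pd

users_raw = pd.read_csv("users.csv")
ads = pd.read_csv("ads.csv")

# Custom split + Columns to Rows pivot: one row per user code
users = (
    users_raw["Users"]
    .str.split(",", expand=True)   # split the single row into separate columns
    .melt(value_name="User")       # pivot those columns back into rows
    .dropna(subset=["User"])
)
users["User"] = users["User"].str.strip()

# LEFT([Users],7), RIGHT([Users],1) and MID([Users],8,3)
users["User ID"] = users["User"].str[:7]
users["Private or Dealer"] = users["User"].str[-1]
users["Dealership ID"] = users["User"].str[7:10]

dealers = users[users["Private or Dealer"] == "D"]   # keep only the Dealers

# Remove unsold vehicles, fix the date type, then join to the dealer users
ads = ads.dropna(subset=["sale_date"])
ads["sale_date"] = pd.to_datetime(ads["sale_date"])
combined = ads.merge(dealers, on="User ID", how="inner")  # assumes ads also carries a User ID field
```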

2024: Week 39 - Solution

Solution by Tom Prowse and you can download the workflow here.

Step 1 - Each Day of Engagement
First, we want to ensure that there is a row for each day that a consultant is on an engagement. For this we can use the New Rows step within Prep to help pad out any missing days. Within the setup we want to add new rows between the engagement start and end dates, with an increment of 1 day:
We can then remove the weekends from the list by first identifying the weekday using the DATENAME function:
Weekday
DATENAME('weekday',[Work Day])
Then from this field we can exclude the weekends (Saturday & Sunday).
Finally we can calculate the number of calendar days for each engagement:
Calendar Days
DATEDIFF('day',[Engagement Start Date],[Engagement End Date])
At this stage the table should look like this:

Step 2 - Aggregate & Rank
The final part of the challenge is to aggregate our table as per the requirements. For this we can use the aggregate step where we g
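Outside of Prep, the same padding-out of days could be sketched in pandas like this; the input file name is an assumption, but the field names follow the calculations above:

```python
# A pandas sketch of the same logic - the input file name is an assumption,
# but the field names match the calculations above.
import pandas as pd

engagements = pd.read_csv(
    "engagements.csv",
    parse_dates=["Engagement Start Date", "Engagement End Date"],
)

# New Rows step: one row per day from the start date to the end date (1 day increment)
engagements["Work Day"] = engagements.apply(
    lambda r: pd.date_range(r["Engagement Start Date"], r["Engagement End Date"], freq="D"),
    axis=1,
)
days = engagements.explode("Work Day")
days["Work Day"] = pd.to_datetime(days["Work Day"])

# DATENAME('weekday',[Work Day]) and then exclude Saturday & Sunday
days["Weekday"] = days["Work Day"].dt.day_name()
days = days[~days["Weekday"].isin(["Saturday", "Sunday"])]

# DATEDIFF('day',[Engagement Start Date],[Engagement End Date])
days["Calendar Days"] = (days["Engagement End Date"] - days["Engagement Start Date"]).dt.days
```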

2024: Week 40 - Vrroom

Challenge by: Abhishek Minz

Vrroom is an online platform for used cars where individuals and dealerships can advertise their vehicles. At present, five different dealerships are using the website. Vrroom's management team need to find out which dealership is taking the longest time (in days) to sell their vehicles through the platform.

Input
There are two csv data sets this week:
1. The Adverts (ads) data set:
2. The Users data set:
There are 365 users of the website - each vehicle listed is classed as a different registration number.

Requirements
Input the data sets
Break the Users data set into individual records (you should have 365 rows)
The User data is formed from:
The first 7 characters are the User ID
The last letter signifies whether the user is a private individual ('P') or a Dealership ('D')
The 3 characters after the User ID for Dealerships are the Dealership ID
With the Ads data, remove any unsold vehicles
Join the data sets together
Find when an advert is first posted
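As a quick illustration of those user-code rules, here's a tiny Python sketch that parses a single, made-up user code:

```python
# A tiny illustration of the rules above. The example code 'ABC1234XYZD' is
# made up, not a value from the real data set.
def parse_user(code: str) -> dict:
    user_id = code[:7]                    # first 7 characters = User ID
    user_type = code[-1]                  # last letter: 'P' private, 'D' dealership
    dealership_id = code[7:10] if user_type == "D" else None  # 3 chars after the User ID
    return {"User ID": user_id, "Type": user_type, "Dealership ID": dealership_id}

print(parse_user("ABC1234XYZD"))
# {'User ID': 'ABC1234', 'Type': 'D', 'Dealership ID': 'XYZ'}
```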