2021: Week 8 - Solution


Solution by Tom Prowse and you can download the full workflow here

This week we were looking at data related to a Karaoke club and the different songs that their customers would sing throughout the various sessions. The challenge was a little of the tricky side as it involved lots of joins & analytical calculations which could get a bit confusing! Hopefully you stuck with it and could figure it out, but if not take a look at the solution below to help you out.

Step 1 - Calculate Time Between Songs

The first task this week is to calculate the time between the start of each song. This is a little tricky as each song only has a start time and is also on a separate row. We need to first give each song an ID, then use a join to bring the start time of the song and also the start time of the second song onto the same row.

After inputting the Karaoke Choices table, we want to create an ID for each song. We can do this by using a Row_Number calculation that looks like this: 

Song Order  
{ORDERBY [Date] ASC : ROW_NUMBER()} 

Now each of the songs has an ID, we can then use another calculation to identify the ID of the next song by adding 1. 

Next Song 
[Song Order]+1 

As a result of these calculations our songs now have two IDs: 


Now we have both of the IDs on the same line, we can use a self join on these IDs so that we bring through the song end time (next song start time). You can remove the Song Order, Artist, & Song name from the 2nd step as we are only interested in the date/time field. 

Our workflow looks like this: 

The join condition looks like this: 


We want to include all values from the Song Order so that we don't have any missing.

The table should look like this with both the song start time and the previous song start time on the same row: 


We can now calculate the time between each of the songs by using the following calculation: 

Time Between Songs
DATEDIFF('minute',[Previous Song End Time],[Date])

After calculating the time between each song, we can also identify the session that each song belongs to by using the following calculation: 

New Session?
INT([Time Between Songs]>=59
OR 
ISNULL([Time Between Songs]))

This returns a 1 if the time between songs is longer than 59 minutes.

Our data now looks like this: 


Step 2 - Create Session Number Field

Now we have identified whether the song was the start of a new session, we now need to fill down the songs to give each song a session number. 

First we need to do a self join where the song order >= song order which will massively increase our number of rows. After the self join we use an aggregate tool to fill in all of the session number . 

The workflow looks like this: 


The join conditions are: 


The aggregate setup: 


After these steps the table should look like this: 


Now we have a session number for each song, we also want to calculate a song number for each session. We use a similar Row_Number calculation as before but this time we need to partition by Session No. instead: 

Song Order Per Session
{ PARTITION [Session No.] :
{ ORDERBY [Date] ASC: 
    ROW_NUMBER()}}

As a result we now have a song order for each different session: 


Step 3 - Match Customer Timings

It's now time to combine the customer table with our workflow. To do this we need to make sure that we have the start time for each session so that we can join this with the customer's entry time. 

Therefore we need to use an aggregate tool on our current workflow to bring back the minimum date/time for each session: 


We've then renamed Date to Session Start so that it's clear what date we are working with when we come to do the join. The join conditions look like this where we need to ensure that we are returning all of the entry times for each session: 


As this has exploded the number of rows we have, we now need to use an aggregate and join to make sure we are returning the correct range of rows. 

The workflow looks like this:


Aggregation step: 


Self-Join conditions: 


Our table should now look like this: 



Step 4 - Final Cleaning

There are a couple more bits of cleaning that we need to do before we have the desired output. First, we need to remove any customers that arrived more than 10 minutes before their session started. We use this calculation within a filter: 

DATEDIFF('minute',[Entry Time],[Session Start])<=10

We now have a match between the Session No. and Customer ID which looks like this: 


We can now join this back onto workflow on the step before the aggregate to calculate the session start time. From this step the join conditions are: 


Then after the join and removing any extra fields, we have completed the challenge and the output should look like this: 


The full output can be downloaded here

After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja@JennyMartinDS14 & @TomProwse1

You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help!


Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text