2020: Week 17 - Solution




You can find our full solution workflow below and download it from the Preppin' Data Community page!



This week we looked at survey results about what Netflix shows people are watching. We want to find out what shows are being watched, how they are rated and on what devices they are being watched on.

Remove Duplicates

First, we are going to input the Form Responses table to start our workflow. At this stage, we don't require the rating fields, therefore we are going to remove these straight from the off (we'll bring them back in later). Our input step looks like this: 


The next step in this challenge is to remove any duplicate entries, where someone has answered the survey twice but with the same answers each time. The first step is to use an LOD analytical calculation to find the first time someone had responded. 


Within this calculation we are saying, for each Respondent return the Min from the Timestamp field. 

We can now use an aggregation and self join technique to remove any of the duplicate values. 

First we aggregate by grouping the First Response field, then we use a join on Timestamp = First Response. The steps look like this: 


Aggregate

Self-Join
Now we have removed the duplicates, we are ready to separate the workflow into two branches. 1 for the device numbers output, and another for the shows and ratings output.

Output 1 - Devices

Split Devices

The first step is to split the 'How have you been watching Netflix?' (device) field. We are going to use a combination of splits here to parse out the different devices. 
  1. Custom Split - Split all values with a separator of ','. This returns three columns as below: 

  2. Custom Split - On the 'split 1' field. Split all values with the separator on '&'. This returns an extra two columns like the following: 

We now have lots of individual columns with different devices split out. At this stage we have removed some columns that are no longer needed. These include:
  • How have you been watching Netflix?
  • How have you been watching Netflix? Split 1
  • What have you been binging during lockdown?
  • First Response-1
We now want to pivot the split fields, using a Columns to Rows pivot. We can use the wildcard pivot to auto select the fields, we can just type 'Pivot' and these will be automatically selected.


After the pivot step, we now have all of our different device choices in one column, so we can rename 'Pivot Values' to 'Device'. However, we also have some unwanted blanks and 'etc', therefore we can select these and 'Exclude':


Join Device List

Now we have cleaned and parsed the respondents devices, it is time to join this to our list of devices, and also create an 'Other' category for any that are not recognised. 

First we can input the 'Devices' file, and join this to our workflow. The join will be an inner join on 'Device': 


Notice that there are 4 records that have been excluded from the join, those are our 'Other' category. Therefore, we need to use another join to capture these. To do this we use a different branch, but join the workflow to the devices table again, this time with an outer join:



This time we have the 'Other' category that has been returned, and we can use the Group functionality to group these together and rename as 'Other'. 

We now have two different branches, one for recognised devices and another for 'other' devices, therefore we need to use a union tool to 'stack' these on top of each other and produce a single table again. The fields should match up, but you may need to remove some of the duplicated fields from the join!

The final step is to use an aggregate tool to calculate the totals for each device. The aggregate setup looks like this:

We are now ready to output the table, which should look like this: 


Output 2 - Shows & Ratings

Split Shows

Similar to the previous output, the first step we need to take is to split out all of the different shows that had been watched. Luckily this is slightly easier for us as they are all separated by a ','. Therefore we can use a Custom Split to split all fields with a separator of ',', and should have results that look like this: 


Again, similar to output 1, we want to use a pivot tool to bring all of the shows into a single column. We again use a Columns to Rows pivot, and use the wildcard pivot with the word 'Split'. 


Now we have all the shows in one column, we need to clean the field containing the shows. There are a few steps here: 
  • Remove any Blank rows
  • Remove 'What have you been binging during lockdown' field
  • Make the 'What have you... lockdown Split' field all uppercase. This will make sure that there aren't any problems with joins later.

Join Show List

Next, we can input the Netflix Shows table, and use a Custom Split on the separator of  '(' but this time we just want to return the first value. This isn't a perfect way of removing the years at the end of the shows, but it works well enough for the shows that we are working with. The last step before joining is to make the shows all uppercase, so that they match our original shows field. 

We can now join our original workflow with the Netflix Shows table by using an inner join on 'What have you been binging' = 'Show' 


We also need to include the ratings for each of the shows, so we need to input the Form Responses table again as we removed these at the start. This time we can remove the fields about the shows and devices as they aren't needed: 


As with a couple of the other steps, we need to use a pivot tool to get all of our ratings into a single column. Within the pivot tool, this time we use a wildcard pivot on the word 'How': 


Again, we need to use the split functionality to parse out the title of each of the shows from the pivoted column. This time it is a little easier, as we can just use an automatic split and Tableau Prep will extract the title for us: 


We now want to clean by the following steps: 
  • Remove 'Pivot1 Names'
  • Rename 'Pivot1 Names - Split1' to 'Shows'
  • Rename 'Pivot Values' to 'Rating'
  • Make 'Show' all uppercase
  • Exclude nulls from the 'Rating' field
Now we have prepared all the different tables, we now need to join these all together. Now this is a bit of a web of joins so follow closely as it could get confusing!

First, join our original workflow (Respondents & Netflix Shows) to the 2nd Form Responses table. We will be joining on the following: 



Next we want to join our current workflow to our Respondents and Shows table, but this time with an outer join, and only joining on 'What have you been binging':




We are now ready to determine what the 'other' shows are, and join these to our workflow. Therefore, from the 2nd Form Responses part of the workflow we want to 'Keep Only' the 'Other' values from the shows field. It's now time to join this to our original flow: 




The final step to bring all the required fields back together is to use a union tool to bring our Responses & Shows to our Other shows flow. 


Now we have all our required rows back into a single table, we can now use an aggregate tool to calculate the average ratings for each show: 


The final step of the challenge is to rank the shows based on their ratings. To do this we can use the Dense Ranking functionality: 


We are now ready to output our data for the 2nd output, it should look like this: 


Make sure to fill in the participation tracker, share using #PreppinData on Twitter and post you solutions onto our Tableau Forums community page so that we can compare our workflows! 

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text