2020 - Week 4 Solution



The challenge this week was all related to survey data and duplicate values! In this challenge we had to overcome the problem of customer's sharing their opinions multiple times meaning that the results of the Net Promoter Score are incorrect.

Step 1: Clean!

Our first task is to tidy up some of the fields where the data has been input with different structures. These fields include the Question Number, Country, Store, Dates & Time, and also Ages.

Question Number

To update the question number we need to join both of the tables together, this is a nice easy step as both tables have a question number field therefore these join nicely with an inner join.


We also want all of the questions to be in their own separate column. To do this we can use a Rows to Column pivot on the Question field, then bring through the MIN of Answer:



Age
The next step is to calculate the age of each of the respondents. This is easier now as we have each response per row, meaning we can use a DATEDIFF function to calculate the difference, in years, from the DoB to the 22nd Jan 2020. The calculation looks like this:

Age
DATEDIFF('year',[DoB],#2020-01-22#)

Country & Stores
The next area to clean is the Countries and Stores. This will involve us using the Grouping functionality within Tableau Prep. Firstly, we want to use an automatic group by common characters or pronunciation on the Country field. This will eliminate any rouge spellings or different formats, but it doesn't clean all of the options, therefore we will need to use a manual grouping to tidy up any results that have been left out. After the grouping you should have four countries:

  • England
  • Netherlands
  • Scotland
  • United States
We can now focus on cleaning the stores, and this is a little easier as we can just use a manual group to bring together the 'Amsterdam' and 'Amstelveen' fields.

Dates & Times

The final stage of the clean up is to format that dates and times in a similar fashion so that we can create a Creation Date Time field. There are various different ways of doing this so let us know if you have any alternatives!

Time -
First use the REPLACE function on the 'What time did you fill the survey' question to replace any '.' with a ':', as this is what is required to within the time formatting.

Next use the REGEX_Extract to extract the numbers that are related to the time. We can use the ':' as a separator so our calculation would look like this:

Clean Times
REGEXP_EXTRACT([What time did you fill the survey in?] , '(\d+:\d+)')

This returns one or more digits before and after the ':' symbol, or in other words, the Hours and Mins of the time.

We can now duplicate the Time field, this is so that we can compare the times and also retain the original survey results.

Next is to isolate the fields that contain 'Am' or 'Pm', we can do this by removing any numbers, punctuation, and extra spaces from the 'What time did you fill the survey in?' field. We now have a field with just the 'Am' or 'Pm' rows remaining, we can rename this to 'AM or PM'.

There are still some Times which fall through the net, therefore we need to use a LEN function to catch the rest. We can use the following calculation:

More Clean Times
IF LEN([What time did you fill the survey in?]) = 4
THEN 
LEFT([What time did you fill the survey in?],2) 
+ ':'
+ Right([What time did you fill the survey in?],2)
END

The LEN function allows us to count how many characters are within the given string, so therefore we are saying that if it is equal to 4 then split the first two characters from the left (hours), add a ':', then add the two characters from the right (Mins). This brings the 4 digits into the Time required format.

Now we have a few different fields that all have clean times within them. To bring them all together into one clean column, we can use the Merge functionality. This creates one clean column, and then from here we can add a ':00' to the end. The final step would be to clean up any unwanted fields, so we're just left with the clean times in one column.

Date - 
Now we have cleaned the times, we can turn our focus onto the Dates. Currently, the date field is a string, therefore we can change the Data Type to a date and Tableau automatically recognises that these are dates so it converts them into our required format.

We can now bring the Date and Time fields together to create our Completion Date field. To do this we can use the following calculation:

Completion Date
Str([What day did you fill the survey in?])
+ ' ' 
+ [Clean Time]

Note, make sure you add the STR function to convert the date back to a string before adding the time field (or else you will get an error).

The final steps are to convert the Completion Date field to a Date Time data type, and then use the DATEADD function to factor in whether the time was AM or PM. We did this with the following calculation: 

Completion Date
IF [AM or PM] = 'pm' 
THEN DATEADD('hour', 12, [Completion Date])
ELSE [Completion Date]
END

This adds 12 hours to the time if the time was noted as being in the afternoon (PM). Now we have the completion date, we can remove any unwanted fields and move onto the next task!


Step 2: Aggregate! 

The next part of the task is to determine whether this is the first or latest time that a person had completed the survey. We have a few tasks here, we need to:

  • Find the First answer for each customer in each store and country
  • Find the Latest answer for each customer in each store and country
  • Remove any answers that are not the First or Latest
First Answer -
First we are going to focus on the first time someone had completed the survey, so we can use the Aggregation tool to return the Min Completion Date for each Name, Country and Store. The aggregation is setup with the following: 



Latest Answer - 
To determine whether this was the latest response, we need to complete a rank of each of the responses. The first step to complete the rank, so to use an aggregate calculate the Number of Rows for each Country, Store and Name (this will allow us to see if there are multiple responses). 


We can then join this aggregated field, back onto the original data flow so we can see how many responses per person. The join condition should look like this: 


After the join we can then remove any duplicated fields, and then use a filter to return all of the rows that have a No of Rows that is greater than 1. This would be the filter condition: 

[Number of Rows (Aggregated)] > 1


Now we can use another aggregation tool, this time we want to group by the same fields, but instead return only the Maximum Date, as this will be the latest date that they completed the survey. The aggregation setup is like this:


We now have the First and Latest responses for each person, therefore we can use the union tool to 'stack' these on top of each other and into one table. As a result of the union we now have a Table Name field. This gives us the information about the First or Latest response, and we can rename these accordingly to get the Result field.

After the union, we can then join the new table with the original data flow (before the First Answer aggregate tool), using the Country, Store, and Name as the join conditions. We need to bring back all of the data from the union table, therefore the join will need to be a left join (or right depending on which table you have selected).

The join should look like this:


Finally we can tidy up the table by removing any unwanted fields, and also changing the 'Would you recommend C&BSco...' field to a whole number.


Step 3 - Calculate! 

The final step within this weeks challenge is to calculate the NPS score for each respondent. We need to classify each person based on the following logic:

  • 0-6 = Detractor
  • 7-8 = Passive
  • 9-10 = Promoter
Using this logic, we can use the following calculation to classify each respondent:

Net Promoter Status
IF [Would you recommend...] >=9 
THEN 'Promoter'
ELSEIF [Would you recommend...] >=7 
THEN 'Passive'
ELSE 'Detractor'
END

We also need to create a field which will be used in the next pivot step. We can create a '1' as a number, then use this within a Pivot step to determine which category each respondent falls under. The pivot tool should be setup like this: 



Then we are finished!


Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text