2021: Week 9 - Solution


Solution by Tom Prowse and you can download our workflow here


This week we had a guess challenge by Owen Barnes where we looked at working with strings and parsing out the important parts. It was a great challenge to practice your parsing and regex skills so let's see how we can solve it!

Step 1 - Reshape Data

The first part of the challenge is to input the customer Info table and then split apart each of the different values. Each of these values are a series of different characters, however they have a repeating pattern and can be split apart by using a ' ' (space). 

Our custom split looks like this: 

As a result we now have 5 separate column all containing an individual string that we want to parse later in the challenge. 

The next step is to reshape the data so that all of these different strings are in a single column and have an individual row. To do this we need to use a columns to rows pivot, and we can use a wildcard pivot on 'IDs' (or similar). The wildcard will bring through all of the rows where this pattern is matched therefore if we reload the data in the future and there are more than 5 columns, then it will automatically pick this up.


Our data has now been reshaped and looks like this:


Step 2 - Parse String

Now we have all of the strings into a single column we can start to extract the key information as per the challenge instructions. In this solution we have used regex however there are alternative ways using splits and other techniques. If you aren't comfortable with the regex syntax then I would recommend using a website such as regex101.com to help you along the way.

First we want to remove any blank or null values from the IDs Split, then we can use the following calculations to parse the string.

Phone Number

REGEXP_EXTRACT([IDs Split], '(\d{6})\,')

This returns the 6 digits (\d) before the comma (\,) from the IDs Split field.

Area Code & 1st Letter  

REGEXP_EXTRACT([IDs Split], '\,(\d\d[A-Z])')

This gets us the 2 digits (\d \d) after the comma (\,) and then any letter after those two digits ([A-Z]). The digits and letters are in brackets as this is what we want to be included in the output and not the comma. 

Quantity 

REGEXP_EXTRACT([IDs Split], '\,\d\d[A-Z](\d+)')

This time we have the same pattern as before, 2 digits and letter are the comma, however this time we don't want to include it (outside of brackets) and instead we want to include any one or more (+) of the numbers (\d). 

Product Code 

The final part is the product code, which is the last letters after the '-'. These are at the end of the string therefore we've used a split to return any characters from the right of the last '-': 

After all of the parsing our data looks like this: 


Step 3 - Join Area Code Locations

Now we have parsed the key information from the Customer Info table, we can input the Area Code Lookup table and combine them. 

First, we want to create a join calculation for the Area Code Lookup, 

Join Calc 

RIGHT(STR([Code]),2) + LEFT([Area],1)

This combine the two characters from the right side of the code with the first character from the area and this matches the area code in our customer info table.

We can then join this to the customer info table using these join conditions: 


Then after the join we can clean the table by removing any duplicated fields (Join Calc & Area Code + 1st Letter) then filtering the Area field by removing 'Clevedon', 'Fakenham', 'Stornoway'.

Our table should now look like this:


Step 4 - Join Product Lookup

We can now join our workflow with the Product Lookup table using the following condition: 


Then we want to clean the following fields:
  • Price - Remove the £ sign - REPLACE([Price], '£', '') 
  • Change the Price data type to a decimal number
  • Change the Quantity to a whole number
  • Calculate Revenue - [Price * Quantity] 
Then we can remove any duplicated rows and aggregate our data by using the aggregate tool with the following setup: 


As a result our data now looks like this: 


Step 5 - Rank Areas

The final task for this week is to rank the different areas based on their revenues. This is using the visual calculation editor where we group by Area and rank revenue in a descending order. 

Rank


Next we want to calculate the % of total per product and area. So the first step is to calculate the total revenue for each area using an analytical calculation (Fixed LOD): 

Total Revenue per Product, Area


We can now calculate the % of total for each area:

% of Total - Product

ROUND(([Revenue] / [Total Revenue per Product, Area]) * 100, 2)

Then finally we can round the Revenue to a whole number: 

Revenue

ROUND([Revenue],0)

And that's the challenge complete! The output should now look like this: 


The full output can be downloaded here.

After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @OwenBData@Datajedininja@JennyMartinDS14 & @TomProwse1

You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text