2020: Week 46 - Solution

Solution by Tom Prowse and you can download the full workflow here

This week we looked into incidents that have occurred in the aviation industry. The aim of the task was to parse some key information, and then count how many times each of our categories have happened. 

There were plenty of different ways of completing this week's challenge, and make sure you check out Mo Hassn's solution where he challenged himself not to use Regex or Tableau Prep's native grouping feature. 

Step 1 - Extract Information

First up we want to extract the key information about the Aircraft, Location and Date from each of the strings. There are various ways to do this including RegEx or Splits, and we have decided to use RegEx to parse out the information. If you aren't familiar with RegEx then I would recommend taking a look at Regex101.com which will help you along the way. 

Aircraft
REGEXP_EXTRACT([Incident],'(^.*?)\s+(at|near)')

Using this calculation, we are extracting anything before the first 'at or near' in the string - This is our aircraft.

Location
REGEXP_EXTRACT_NTH([Incident],'^.*?\s+(at|near)+\s+(.*?)\s+(on)',2)

This is similar to the Aircraft calculation, but this time we don't want to include the anything before the at or near, but include everything before the word 'on'. We use the Regexp_Extract_Nth calculation as the Regex_Extract will pick up the 'at or near' within the first set of brackets.

Date
REGEXP_EXTRACT_NTH([Incident],'(^.*?)\s+(at|near)\s+(.*?)\s+(on)\s+(.*?)\,\s+(.*?)$',5)

The final string Regex calculation can actually be used for all of the previous calculations but would depend on which 'Nth' you were extracting. For example, we want to extract the 5th set of brackets so we use 5. There are more efficient ways of writing Regex expressions, but I thought this was a good example of how you can reuse and repurpose the same expression for different parts of a string. 

Finally we can split off the last second of the string, after the ','. This will give us the Incident description. 


After extracting all of the information we now have a table that looks like this: 



Step 2 - Incident List Output

After extracting the key information, we can now look to create the incident list output. All we need to do for this, is transform the date field from a string to a date. Again, there are various ways of doing this but we have used the following calculation: 

Date
DATEPARSE("MMM dd yyyy", REGEXP_REPLACE([Date],'(\d+)[a-z]+', '$1') )

Again, we are using a Regex expression but this time we are replacing values instead of extracting them. In this Regex expression, we are identifying any digits that are followed by text (eg, 21st or 14th) and replacing that with whatever is in the 1st set of brackets (eg, just the numeric value). 

We can then change this to be a date field, rename Incident - Split1 to Incident Description, and finally remove the RecordID & Incident fields. Then we have our first output: 

Incident List


Step 3 - Group Key Words

The next step is to group the words that are similar or plurals. To do this we first want to split the Incident - Split1 field so that we have each word in a separate column by using a custom split on a ' '(space).

This creates multiple columns, each with a single word in: 


Now we have split each of the words out, we no long need the Incident - Split1 field so this can be removed. After removing, we now want to Pivot the words into a single column. We can use the Wildcard pivot to do this by using the word 'Split': 


After the pivot we can tidy things up by removing the Pivot1 Names field, and then Excluding any blank values from the Incident Split field. 

The next step is to use Tableau Prep's Grouping functionality to Group similar words. At this stage, you may need to play around with the different options to ensure the best fit, but it's good to check the Category list so that you ensure that the words are grouping correctly. We have used grouping by pronunciation and then manually selected a couple of values.

Step 4 - Create Category List

Now we have grouped the words, we can combine with the Category table. We are going to use a join for this, where we are joining the individual words to the category: 


The final step, is to count the totals for each of the categories by using an aggregation tool. In the setup we are Grouping by Category, and Sum Number of Rows:


After renaming the Number of Rows field, we should now have our second output: 


Both of the outputs can be downloaded here

After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja@JennyMartinDS14@JonathanAllenby & @TomProwse1

You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 


Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text