2020: Week 2 Solution



The challenge for this week was set by 100% club member Michael Heighway. This was based on a real life problem within his organisation when collecting data relating to dates and time. Now the date aspect was fine, however we entered into the realms of free text input with the time field. Our task is to help clean up the data and produce a nicely formatted output.

Again there are plenty of different solutions for this week so make sure you check out the #PreppinData solutions on Twitter, and Jenny has made a video solution, so make sure you take a look!


Step 1 - Extract Hours

The first step of the task is to extract the hour from the time field. There are various different ways of doing this including splitting, REGEX etc, but we have decided to use a mix of these.

The first calculation deals with all of the times with a ':' or a ';'.

Hour
IF CONTAINS([Time],':') 
OR CONTAINS([Time],';') 
THEN LEFT([Time],2)

ELSE NULL
END

This calculation determines if there is a ':' or ';' within the Time field, and if there is it extracts the first 2 characters from the Time field.

After this first calculation we are left with a column containing mostly 2 digits, but there are some single digits with punctuation and also some nulls. To tidy the single digits, we can use the 'Remove Punctuation' clean step within Tableau Prep.

The next task is to tidy the remaining Nulls. This is a little more difficult as there are some different structures within the Time field. The first step is to identify any hour rows that are Null (using the IsNull function):

Hour
INT(
IF ISNULL([Hour]) 
THEN LEFT([Time], 2)

ELSE STR([Hour])
END)

This calculation takes all of the values that are Null, within the Hour field, then returns the first two characters from the Time field. If the fields aren't Null then it will return the original Hour value.

By taking the first two characters, some of the hours will be returned correctly, however some are only one digit in length meaning that it is incorrect. To overcome this we have used the following calculation:

Hour
INT(
IF [Hour]>23 
THEN LEFT(STR([Hour]),1)

ELSE STR([Hour])
END)

We have used the logic that, if the two digits are greater than 23 then we want to just return the first digit, but if they are less than 23 then keep the two digits as before. This eliminates any values which could be incorrect.

The final step is to determine if the hour is AM or PM. Some of the rows indicate whether or not they are AM/PM therefore the first step will be to extract this information from the Time field. We have again used the Contains function to do this:

AM/PM
IF CONTAINS([Time],'a') 
OR CONTAINS([Time],'A') 
THEN 'AM'

ELSEIF CONTAINS([Time],'p') 
OR CONTAINS([Time],'P') 
THEN 'PM'

ELSE ''
END

This calculation is identifying whether there is an 'a' or 'p' within the time field then returning AM/PM as a result. Note, that Tableau Prep works in a case sensitive way, therefore you will need to include both the lower and upper case letters to ensure you return the correct values.

Once we have extracted whether the time is AM or PM, we can then use the following logic:

Hour
IF [AM/PM] = 'PM' 
THEN [Hour] + 12 

else [Hour] 
END

If the Hour is in the PM, then we will need to add 12 hours onto the figure, if it is AM then it can remain the same.

This is the first step complete and we have successfully extracted all of the hours from the Time field!


Step 2 - Extract Minutes

Now we have successfully extracted the hours from the Time field, next we need to extract the minutes and again we need to use multiple techniques as the Time is in different formats.

The first technique is to use the ':' or ';' as a separator, similar to the technique used earlier to obtain the hour. However, instead of taking 2 characters from the left, this time it is slightly more complicated fields are in the following formats:

16:38
2:51pm,
2;04PM

Therefore we can't use the Right functionality (yet) to bring through the 2 characters from the right side of the string as this will bring through unwanted characters. To extract the correct digits for the minutes, we need to use the REGEX_Extract() function, in the following calculation:

Mins
IF CONTAINS([Time], ':') 
OR CONTAINS([Time],';') 

THEN REGEXP_EXTRACT([Time],'\d\S(\d{2})')
else NULL
END 

When breaking down this calculation, we are saying... If the Time field contains a ':' or a ';' then we want to use the Regex_Extract function to identify when there is any digit (\d) followed by any non-whitespace character (\S). Every time this pattern occurs, we then want to extract the next two digits only (\d{2}). We can then use the rounded brackets to identify what part of the expression we would like to return. If you would like to learn more about the REGEX functionality then take a look at our blog here.

Using the example strings, we have highlighted what is being returned by each part of the Regex function:

'\d\S(\d{2})'

16:38
2:51pm,
2;04PM

Now we have extracted the majority of the minutes, we now need to extract any fields that have been returned as NULL. To do this we can use the following calculation with some further REGEX functionality:

Mins 
IF ISNULL([Mins]) THEN 

REGEXP_EXTRACT([Time],'(\d{2})\D')
ELSE [Mins]
END

This time we want to find any Null values within the Mins field, and use the Regex functionality to identify every time there is two digits (\d{2}), which are then followed by any non-digit character (\D) within the Time field.

The final calculation to extract the minutes is:

Mins
IF ISNULL([Mins]) 
THEN RIGHT([Time],2)

ELSE [Mins]
END

This time we can use the Right() function to tidy up any final values that are still Null. As we have extracted the majority of the minutes that contain a PM or similar, therefore we are just left with digits, meaning we can use the Right() function to extract the two most right characters.


Step 3 - Assemble DateTime Field

We now have a clean Hours and Mins field, we can now assemble the DateTime field that is required. To do we can use the MakeDateTime() function, to bring together the date and each part of the time fields. We used the following calculation:

Date Time
MAKEDATETIME([Date],
MAKETIME(int([Hour]),int([Mins]),00))

Make sure you have the correct field types within this calculation, as this will error if you don't have two date fields. You could also break this down into two different calculations where you create the time field first, then bring together to make the DateTime.

This is now complete, and after removing any unwanted fields we are ready to Output the data.



Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text