2020: Week 24 - Solution




You can find our full solution workflow below and download it from the Preppin' Data Community page!


This week we went back in time and looked at some historical battles that have happened across the UK throughout history. It included cleaning some webscraped data and also parsing some unusual dates. 

Step 1 - Split Strings

The first step is to bring in the data, this consists a string containing details about each battle that has taken place. Our aim is to split this string so that we can isolate some of the key details like Battle Name, War, Date etc. 

To do this we want to identify a repeating pattern that is similar across all rows, which will help to break apart the string into different sections. In this case, there is a repeating '<br />', so we can use this to split the string: 


We use the custom split to break apart all fields when the '<br />' occurs: 


After the split we now have our different categories in separate columns which can then be individually cleaned: 


We can now identify if there are any rows that have incomplete information. Within the 'DownloadData Column6' field, there are a couple of empty rows. We can exclude these from our table, therefore taking our row count from 65 to 63 rows. 

Step 2 - Clean Battle Names

After we have split the string into different categories, we can focus on cleaning each of the different fields. First, we want to clean the each of the Battle Names. There are a couple of ways to do this, however we have used a split on 'DownloadData - Split2' to return everything after '>': 


This removes the 'Span Style' text and leaves us with just the name of each of the battles as required. As a result, we now have a couple columns that aren't needed anymore, so this is a good time to clean up our fields. 

We have completed the following cleaning: 
  • Removed these fields: 
    • DownloadData - Split1
    • DownloadData - Split2
    • DownloadData
    • DownloadData - Split7
  • Renamed these fields: 
    • DownloadData - Split2 - Split1 to Battle
    • DownloadData - Split3 to Date
    • DownloadData - Split4 to War
    • DownloadData - Split5 to Victors
    • DownloadData - Split6 to Descriptions
We now have this table which is much more user friendly going forwards: 


Step 3 - Clean Dates

We can now turn our attention onto clean the next field, which is the dates. Currently, our dates aren't in a nice format, where some dates are within a range and others aren't. Therefore our first step is to only account for the start date, and we can use this calculation to do this: 

Date
IF REGEXP_MATCH([Date], '\d+\s?-\s?\d+.*')
THEN REGEXP_REPLACE([Date],'(\d+)(\s?-\s?\d+)(.*)','$1$3')
ELSE [Date] 
END 
This allows use to identify any dates that have a range (eg, 10-11 August, 1332) then if this pattern is matched, then replace it with only the first date (10) and everything after the space (August, 1332). 

Now we only have the start date for each battle, we can now use the Remove Punctuation functionality to remove any commas or other punctuation from the strings. 

Next, we want to use the dateparse function to change each of dates from a string to a date. However, some of our dates are in different formats, for example 14 October 1066, September 14 1402, 991 AD therefore we can't use just one dateparse but we will need to use an IF statement and REGEX to split each of these out. 

Date
IF
REGEXP_MATCH([Date],'\d+\sAD')
THEN date(DATEPARSE('yyy GG',[Date]))

ELSEIF
REGEXP_MATCH([Date],'.*?\d+\s\d+')
THEN date(DATEPARSE('MMMM dd yyyy',[Date]))

ELSEIF
REGEXP_MATCH([Date],'\d+\s.*?\d+')
THEN date(DATEPARSE('dd MMMM yyyy',[Date]))

ELSEIF
REGEXP_MATCH([Date],'\d+')
THEN date(DATEPARSE('yyyy',[Date]))
END

This identifies each of our different date structures using the REGEX_Match function, and then we are going to use a different DATEPARSE for each different format. 

The formats are as follows: 
  • 30 June 1643 - 'dd MMMM yyyy'
  • September 14 1402 - 'MMMM dd yyyy'
  • 991 AD - 'yyy GG'
  • 1578 - 'yyyy'
There is a good page on the Tableau website where you can learn more about the DateParse functionality.

After this clean up we now have a column of nicely formatted dates:


Step 4 - Clean Victors

The last steps that we have for this week's challenge is to clean the Victors field. This time there is a column with the result of the battle. Therefore, we want to use the split functionality again and split off everything after the '>': 


This removes the information about the Span and leaves us with the result of the battle. The final step here is to replace the word 'Victors:' with a blank: 

Victors
REPLACE([Victors - Split 1],'Victors: ','')
Then remove any leading spaces and any extra fields that aren't required for the output. 

We should be left with the following Output:



Make sure to fill in the participation tracker, share using #PreppinData on Twitter and post you solutions onto our Tableau Forums community page so that we can compare our workflows! 
 

Popular posts from this blog

2024: Week 1 - Prep Air's Flow Card

How to...Handle Free Text

2023: Week 1 The Data Source Bank