2020: Week 40 - Solution

 

Solution by Tom Prowse and you can download the workflow here. 


This week we analysed another impressive viz on Tableau Public and decided to take a look at the behind the scenes data prep that was required! We looked at Matthew Armstrong's - 'What Were William's Words Worth' viz and how to clean up the data from the Every Poet website. 

Step 1 - Clean HTML

The data is webscraped from a website, therefore is in the form of HTML, CCS & JavaScript, so our first task this week is to remove any rows of data which contains this. This is slightly tricky as there are lots of lines, but first we want to remove any extra spaces from the Line field using the Clean functionality.

Next we can use this calculation within a filter to help catch most of the unwanted lines:

Line
NOT
CONTAINS([Line],'<')
OR
CONTAINS([Line],'>')
OR
CONTAINS([Line],'()')
OR
CONTAINS([Line],'=')
OR 
CONTAINS([Line],'e9')
)

This calculation only returns lines that do not contain any of these punctuation symbols or 'e9'.

This doesn't quite clean the Line field so that it only contains lines from the poems, so we will use a Replace calculation to remove any unwanted '--'.

Line 
REPLACE([Line],'--','')

We now have a table which only contains the lines from the Poems
 


Step 2 - Line Order

Next we have to ensure that the first line of the poem matches with the title of the poem. Therefore, we first want to determine what order the lines are in, and give each a line number. To do this we can use the Rank functionality: 


This gives us a new line number for each line, in order, per poem:


We now want to identify whether the first line is the title of the poem. To do this we will duplicate the Line field, to create Line-1. From Line-1 we want to remove all punctuation using the clean tools so we are just left with the words only. 

Next we want to use a calculation to see if the first line is equal to the poem title. We use this calculation: 

First Line 

UPPER([Line-1])=UPPER([Poem])

Wrapping both of the fields in an UPPER will allow all letters to become uppercase, meaning there won't be any problems if some of the cases are different. 

Finally we want to keep any rows where the First Line field is True or if the First Line is False and the line number does not equal 1.

[First Line]
OR
(NOT [First Line] AND [Line #]!=1)

We now have the following table:


Step 3 - Line Numbering

Now we have the correct line order with only lines from the poems, we want to re-number these lines so that they start at 1. First, we want to use a FIXED LOD to find out what the first line number is for each poem: 


 We can then use the following calculation to adjust the line number as required: 

Line #
IF [First Line No.] = 1 THEN [Line #]
ELSE [Line #]-1
END

We now have the correct line number for each line of each poem: 


Step 4 - Split Words

The next step is to split each of the words so that they each have their own line. First, we want to make sure that all of the words are in the same case, so we will make Line-1 all Uppercase. Then can use a custom split, using a space as the separator, to split each word into a separate column:


After the split, we no longer need the Line-1 field so this can be removed. 

Now each word has a separate line, we can use a Columns to Rows pivot to bring all into a single column. Using the wildcard pivot on 'Split' makes this nice and easy:


After the pivot we have some null values which can be removed so we are left with the following table: 


Step 5 - Score Each Letter

The final step is to give each letter for each word a scrabble score. To do this we want to split each of the letters into a separate row. 

First, calculate how many letters each word has:

Word Length
LEN([Word])

Next, use the Pivot Names field to find what the position of each word was within the line. We do this by using a custom split to split off the Last 1 occurrence after the last space to just leave the number. We can then rename this Word #.

After this, we can now join the Scaffold data set onto our workflow. We join this using the following join conditions: 


Notice, how the number of rows has exploded. Don't worry, we will deal with this later.

Finally, we want to use the MID function to extra each of the different letters in the string. 

Letter
MID([Word],[Scaffold],1)

This will split each character onto a separate line.


Now we have a line for each letter, we can join the Scrabble table which documents all of the different scores. We join them on: 


Now for each letter we have the corresponding score as well:


Step 6 - Score for Each Word

The final step this week is to make a total score for each of the words, and then identify the highest scoring word in each poem. 

First, we want to use an aggregate tool to calculate the total for each word. We want to group by - Word, Word #, Line #, Poem & Line. Then we want to Sum the Score: 


Finally, we want to identify which word scored the highest in each poem. Therefore, we can use a FIXED LOD to find the Max score in each: 


Then we want to identify whether the field is the highest by creating a flag:

Highest Scoring Word?
[Highest Score]=[Score]

Our table now looks like this and is ready for the output: 


After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja, @JennyMartinDS14, @JonathanAllenby & @TomProwse1

You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 


Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text