2019: Week 30 Solution
You can view our full solution workflow below and download it here!
Just a foreword on this solution: this workflow has gone through a number of iterations as small quirks were located and corrected, so if your results don't quite match, that's fine! As long as you have a final data set with roughly 31k rows, 7 fields, a logical-looking [Comment Split] field, and Celsius temperatures that are both positive and negative, you've done great in my eyes.
Our full solution workflow.
1. Exclude all non-temperature Tweets.
We need all Tweets that have a date, then water temps, then air temps. Filtering on “: Water –” does the job here, though we could use RegEx to ensure every Tweet fits the matching pattern.

The filter settings.
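As a rough analogue outside of Prep, the same filter can be sketched with pandas. The sample Tweets below are assumptions, and the filter string uses a plain hyphen rather than the en-dash in the real Tweets:

```python
import pandas as pd

# Two illustrative Tweets; only the first follows the temperature format.
df = pd.DataFrame({"Text": [
    "Jul 27: Water - 62.1F 16.7C. Air - 66.0F 18.9C.",
    "Beach clean-up this Saturday!",
]})

# Keep only Tweets containing the ": Water -" marker, like the Prep filter.
temps = df[df["Text"].str.contains(": Water -", regex=False)]
```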
2. Extract the temperatures.
There are two main methods here:
- Use the SPLIT() function to extract the correct chunks of text.
- Use REGEXP_EXTRACT() to match on patterns.
In our solution, I opted for RegEx, but that was mostly because it was quicker to write the RegEx than to count how many spaces into each Tweet the temperatures were for the SPLIT() function. To help understand the RegEx below, use the link attached to each command to view it on RegEx101, which breaks down each part and demonstrates it in action.
[Water TempF]
REGEXP_EXTRACT(
[Text]
, '(\d+\.\d)[F]'
)

[Water TempC]
REGEXP_EXTRACT(
[Text]
, '(\d+\.\d)[C]' //pattern lost from the original post; a plausible reconstruction
)

[Air TempF]
REGEXP_EXTRACT(
[Text]
, 'Air.*?(\d+\.\d)[F]' //pattern lost from the original post; a plausible reconstruction
)

[Air TempC]
REGEXP_EXTRACT(
[Text]
, 'Air.*?(\d+\.\d)[C]' //pattern lost from the original post; a plausible reconstruction
)
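For comparison, the same extractions can be sketched with Python's re module. The sample Tweet is an assumption, and since the original post carried the exact expressions behind RegEx101 links, the patterns for the last three fields are hedged reconstructions from the Tweet format:

```python
import re

# An illustrative Tweet in the expected format.
tweet = "Jul 27: Water - 62.1F 16.7C. Air - 66.0F 18.9C."

water_f = re.search(r'(\d+\.\d)F', tweet).group(1)      # first F temperature
water_c = re.search(r'(\d+\.\d)C', tweet).group(1)      # first C temperature
air_f = re.search(r'Air.*?(\d+\.\d)F', tweet).group(1)  # F temperature after "Air"
air_c = re.search(r'Air.*?(\d+\.\d)C', tweet).group(1)  # C temperature after "Air"
```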
3. Pivot all the temperatures and categories.
We now want a field for all Celsius temperatures, a field for all Fahrenheit temperatures, and water & air on separate rows. To achieve this we can use a Columns-to-Rows pivot with two sets of pivot values. First, add [Water TempC] and [Air TempC] to Pivot1 Values. Then, add [Water TempF] and [Air TempF] to Pivot2 Values. This should result in separate fields for Celsius and Fahrenheit, as well as put Water & Air on their appropriate rows.

Pivoting the temperatures.
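Outside of Prep, this two-set pivot is roughly what pandas calls going from wide to long. A minimal sketch, assuming a single sample Tweet and the field names above:

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["tweet 1"],
    "Water TempC": [16.7], "Water TempF": [62.1],
    "Air TempC": [18.9], "Air TempF": [66.0],
})

# Rename so "Water"/"Air" becomes a suffix pd.wide_to_long can pivot into rows.
df = df.rename(columns={
    "Water TempC": "TempC_Water", "Water TempF": "TempF_Water",
    "Air TempC": "TempC_Air", "Air TempF": "TempF_Air",
})

# One row per measurement, with separate TempC and TempF columns.
long = pd.wide_to_long(df, stubnames=["TempC", "TempF"], i="Text",
                       j="Measurement", sep="_", suffix=r"\w+").reset_index()
```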
4. Split the Tweets up into their individual words.
This is by far the trickiest step, not least because the Custom/Automatic Split options struggle when trying to split a string field into more than 10 parts.
Before splitting the Tweets up, though, we first need to bin all the temperature info that occurs at the start of every Tweet. There are a few different ways to go about this, but we opted to use the REGEXP_REPLACE() function to replace the temperature and date information with nothing.
[Text]
REGEXP_REPLACE( //Replace...
[Text] //in the [Text] field...
, '^(.*?C\.\s+)' //everything up to the first instance of "C. "...
, "" //with nothing.
)
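The same replacement in Python's re, with an assumed sample Tweet; the anchored, lazy pattern mirrors the calculation above:

```python
import re

tweet = "Jul 27: Water - 62.1F 16.7C. Great day for a surf!"

# Strip everything up to and including the first "C." plus trailing whitespace,
# mirroring REGEXP_REPLACE([Text], '^(.*?C\.\s+)', "").
words_only = re.sub(r'^(.*?C\.\s+)', '', tweet)
```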
In order to get around splitting up the rest of the words into the 25 fields we need, our current workaround is to first split each Tweet up into 3 chunks, each containing no more than 10 spaces.
To do this, we first need to use FINDNTH() to check if each Tweet contains at least 10, 20, or 30 spaces.
[10th Space]
FINDNTH(
[Text]
, " "
, 10
)

[20th Space]
FINDNTH(
[Text]
, " "
, 20
)

[30th Space]
FINDNTH(
[Text]
, " "
, 30
)
Each of these fields returns the index of the 10th/20th/30th space in each Tweet, or 0 if it doesn’t exist (which is why the calculations in the next step test against 0). [30th Space] returns 0 for the whole data set. If it had returned even a single valid number, we’d have kept going with 40, 50, and so on until we found the maximum length of a single Tweet.
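FINDNTH() has no direct Python equivalent, but its behaviour (the 1-based index of the nth occurrence, or 0 when that occurrence doesn't exist) is easy to sketch:

```python
def find_nth(text, sub, n):
    """1-based index of the nth occurrence of sub in text, or 0 if absent,
    mirroring Tableau Prep's FINDNTH()."""
    idx = -1
    for _ in range(n):
        idx = text.find(sub, idx + 1)
        if idx == -1:
            return 0
    return idx + 1
```

So `find_nth("a b c d", " ", 2)` gives the position of the second space, and a Tweet with fewer than 30 spaces returns 0 for the 30th, just like [30th Space] does.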
The [Xth Space] fields.
We can now use these fields to break each Tweet into 3 separate fields.
[Text 1]
IF [10th Space] != 0
THEN LEFT(
[Text]
, [10th Space] - 1
)
ELSE [Text]
END

[Text 2]
IF [20th Space] != 0
THEN MID(
[Text]
, [10th Space] + 1
, [20th Space] - [10th Space] - 1 //MID's third argument is a length
)
ELSEIF [10th Space] != 0
THEN MID(
[Text]
, [10th Space] + 1
)
END

[Text 3]
IF [20th Space] != 0
THEN MID(
[Text]
, [20th Space] + 1
)
END
These calculations use the LEFT() and MID() functions to take chunks of text from the [Text] field. The first takes everything to the left of the 10th space, if it exists. The second takes everything between the 10th space and the 20th space, if the 20th space exists; if it doesn’t, it takes everything from the 10th space onwards. Finally, the last takes everything from the 20th space onwards, if it exists.
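The three calculations can be sketched together in Python (find_nth mirrors FINDNTH; Python slices are 0-based where Prep's LEFT()/MID() are 1-based):

```python
def find_nth(text, sub, n):
    # 1-based index of the nth occurrence of sub, or 0 if absent (like FINDNTH).
    idx = -1
    for _ in range(n):
        idx = text.find(sub, idx + 1)
        if idx == -1:
            return 0
    return idx + 1

def split_into_chunks(text):
    s10 = find_nth(text, " ", 10)
    s20 = find_nth(text, " ", 20)
    text1 = text[:s10 - 1] if s10 else text   # everything left of the 10th space
    if s20:
        text2 = text[s10:s20 - 1]             # between the 10th and 20th spaces
    elif s10:
        text2 = text[s10:]                    # from the 10th space onwards
    else:
        text2 = None
    text3 = text[s20:] if s20 else None       # from the 20th space onwards
    return text1, text2, text3
```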
5. Pivot all the words into a single field.
With 25 brand new fields, each containing a word or a NULL, we now have to pivot these into a single field. To do this we can just use another Columns-to-Rows pivot with all our [Text X – Split Y] fields in the Pivot1 Values.

Once we’ve done this pivot, we need to clean up the values a bit before moving on. The three things we need to take care of are:
- Make all the words lowercase. The common words we’re using later are all lowercase and Tableau Prep’s joins are case sensitive.
- Remove all punctuation. This is just to clean up a bunch of words that either are surrounded with random punctuation or aren’t even words at all. A few “non-words” will sneak through, like “+” and “$”, but we’re letting these slide.
- Filter out any blank or null words. After performing all the above steps we have a huge number of blank and null rows, which we can just exclude.
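The three clean-up rules can be sketched in Python. The sample words are assumptions, and note that Python's string.punctuation is stricter than Prep's option: it would also strip the "+" and "$" non-words that slip through in the workflow:

```python
import string

words = ["Surf's", "THE", "", None, "waves!"]

cleaned = []
for w in words:
    if w is None:          # drop nulls
        continue
    # Lowercase, then strip punctuation characters.
    w = w.lower().translate(str.maketrans("", "", string.punctuation)).strip()
    if w:                  # drop blanks
        cleaned.append(w)
```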
Pivoting all the words, plus all the changes.
6. Filter out the common words using a one-sided outer join.
Now we have all our Tweet words prepared, we need to remove any words that appear in the common English words data. First, use the ‘Remove Punctuation’ option on the [Words] field in the common words data; at least one word contains an apostrophe, so we need to remove it to match our prepared Tweet words.

Next, we can use a one-sided outer join between our Tweet data and the words data to remove any row where [Comment Split] equals a [Word] in the word data. Whether it’s left- or right-outer depends on whether your Tweet data is on the left or right.
Filtering using an outer join.
After performing this join, and removing any leftover unnecessary fields, you should have a data set that matches ours.
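In pandas terms this one-sided outer join is an anti-join; a minimal sketch with assumed sample data:

```python
import pandas as pd

tweets = pd.DataFrame({"Comment Split": ["surf", "the", "gnarly", "waves"]})
common = pd.DataFrame({"Word": ["the", "and", "of"]})

# Left outer join on the clause [Comment Split] = [Word], then keep only the
# rows with no match in the common-words list.
joined = tweets.merge(common, how="left", left_on="Comment Split",
                      right_on="Word", indicator=True)
result = joined[joined["_merge"] == "left_only"].drop(columns=["Word", "_merge"])
```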