2019: Week 24 Solution

You can view our full solution workflow below and download it here!

Our full workflow solution.

This week saw our first official foray into the world of RegEx. There’s a lot to learn when it comes to RegEx and some great resources to help are https://regexr.com/ & https://regex101.com/. In this solution post I’ll highlight the functions and RegEx we personally used as well as the overarching method for our solution. I'll break down the RegEx within the first calculation, however for the others and for a better understanding I'd recommend using one of the aforementioned resources which help break them down - I find Regex101 is particularly good for this.

Use REGEXP_EXTRACT() to find and pull out information

REGEXP_EXTRACT() is a function that allows you to look within a string and extract the first instance of a specific combination of characters. It takes two parameters:
  1. The string/string field.
  2. The RegEx that defines the character string you’re looking for.
Take the [Name] field we wish to create for example. Here’s the expression we used:

[Name]

REGEXP_EXTRACT(
    [Field_1]
    , ']\s(.*):'
)



Let’s break down this confusing ‘]\s(.*):’ expression.
  1. Firstly, you can ignore the ‘ ‘ as that just signifies where the expression starts and ends for Tableau.
  2. ] – This is the first character we’ve asked REGEXP_EXTRACT() to look for.
  3. \s – This signifies a space character.
  4. .* – A fullstop is like a wildcard; it means ‘any character’. An asterisk means ‘0+’ instances of the preceding character, so .* means ‘any combination of characters if they exist’. 
  5. (.*) – Brackets signify the section of the expression we actually want returned.
  6. : - This is the last character we’ve asked REGEXP_EXTRACT() to look for.
Putting this together means we want to find any characters (.*) that have a bracket and space ]\s directly in front of them and a colon : directly after them. This means we’ll obtain the highlighted section from the following string: “[01/06/2019, 19:55:57] Alessandro‬: O'er Milford Haven blown.”‬

We can use the REGEX_EXTRACT() function to also obtain the [DateTime] and [Text] information.

[DateTime]
[Text]

REGEXP_EXTRACT(
    [Field_1]
    ,'\[(.*)]'
)


REGEXP_EXTRACT(
    [Field_1]
    , ':\s(.*)'
)

Use RegExp_Replace() to help count words

REGEXP_REPLACE() works in a similar way to the REPLACE() function. However, instead of defining the exact string that needs replacing, you instead use a RegEx expression to define the characters that need replacing. This comes in handy when counting how many words are in each [Text]. This is because there’s a space between every word. If we remove all characters that aren’t spaces and +1 then we have the number of words.

Example: Here the spaces are highlighted.
There are 5 words and 4 spaces so: number of words = number of spaces + 1.

We can measure the length of a string using the LEN() function, hence we can get the [Number of Words] for each message like so:

[Text – Spaces Only]
[Number of Words]

REGEXP_REPLACE(
    [Text]
    , '[^\s]'
    , ""
)


LEN([Text - Spaces Only]) + 1

Join the holiday information to the chat logs

The Dates data contains whether each [Date] is a [Holiday?]. We can use DATEPARSE() to convert the [Date] field into an actual date and then join this information to the Chat data on [Date] = [Date].

[Date]

DATE(
  DATEPARSE(
      "dd MMMM yyyy",
      [Date] + " 2019"
  )
)


Check whether each text was sent whilst at work

Now that we have the [DateTime] and [Holiday?] information for each message we now need to check whether each message was sent whilst at work. This is defined as being on a ‘Weekday’ and between the hours of 9AM & 12PM or 1PM & 5PM. We can use an IF statement to return either 0 or 1 depending on whether the message meets this criteria:

[Texts while at work]

IF [Holiday?] = 'Weekday'              //If it is a weekday…
    AND
    DATEPART('hour',[DateTime]) >= 9   //and the message is after 9AM…
    AND
    DATEPART('hour',[DateTime]) < 17   //and before 5PM…
    AND
    DATEPART('hour',[DateTime]) != 12  //but not during the lunch hour
THEN 1                                 //then 1
ELSE 0                                 //else 0
END


Here we use 1 & 0 instead of TRUE and FALSE as we want to sum up the field later.

Aggregate to get counts and averages

To calculate the final [Texts], [Texts while at work], & [Number of Words] counts and the [Avg Words/Message] value we can use a single aggregation step with the following settings:
  • GROUP on [Name]
  • SUM [Number of Words]
  • SUM [Number of Records (Aggregated)] (rename as [Texts])
  • SUM [Texts while at work]
  • AVG [Avg Words/Message]

Calculate % sent whilst at work

Finally, the last major step required is to calculate the % of texts sent during work hours for each person. This can be done by dividing [Texts while at work] by [Texts] and then multiplying by 100. We also round this to 1 decimal place to get a nice-looking percentage.

[% sent from work]

ROUND(
    ([Texts while at work]/[Texts]) * 100
    ,1
)


Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text