2021: Week 7 Solution




Solution by Jonathan Allenby and you can download the full workflow here


This week in Preppin' Data we were highlighting the slightly confusing and inconsistent world of vegan shopping. A world where Sweet Chilli crisps, Tesco jam doughnuts, and Kellogg's Crunch Nut Clusters aren't vegan but somehow Bacon Rasher crisps, Co-op jam doughnuts, and Cookie Crisp cereal are somehow. Perhaps most enlightening of all however is that Jammie Dodgers use raspberry flavoured apple jam instead of just using raspberry jam. What a world.

Step 1 - Prepare the keywords

In order to check whether each product contains any of our non-vegan keywords within their ingredients we're going to take every keyword and E number and place them into a single field, one keyword per row, and append them to the list of products.

Prepare the E Numbers

There's a few different ways and points where you could attach an 'E' to the start of every number. One great way that we used was to use the REGEXP_REPLACE function to update the E Numbers field.

E Numbers   |   REGEXP_REPLACE([E Numbers], '(\d{3})', 'E$1')

What this does is finds every 3-digit set of numbers (\d{3}) in the [E Numbers] field, and prefixes them with the letter E: 'E$1'. $1 here refers to each 3-digit string captured, so '123' becomes 'E123'.

Stack both keyword fields

Before splitting the various keywords up we use a columns-to-rows pivot to put the E Numbers and Animal Ingredients into a single field. This means that we only have to use the Split function once and keeps the data a bit taller rather than wider.


Stack the individual keywords

Now we use a custom split on every comma to place every keyword into separate columns on each row.
We follow this with another columns-to-rows pivot to place every keyword onto their own row.
You should expect 16 new 'Keywords - Split X' fields after the split and 26 rows after the pivot.

Step 2 - Check whether each product contains each keyword

In order to check each product for each ingredients we're going to perform a full append; every keyword joined to every product.

Prepare the Append

For both the prepared keywords data and the untouched products data create a new field containing the same value. We called ours 'Append' and it just contains the string 'x'.

At this stage we also made all the Ingredients in the products data and all the Keywords in the keywords data lowercase using the 'clean' options as later we'll use the CONTAINS function which is case sensitive.

Append the keywords to the products

After this, use an inner-join between to two datasets on the 'Append' fields to perform the full append. You can delete the Append fields afterwards.

Check each keyword for each product

Finally, we can flag whether each product contains each keyword. For this you can use a simple CONTAINS function to give a TRUE flag if the product's ingredients contain the keyword and FALSE otherwise. A single TRUE value means the product isn't vegan.

Contains Ingredient?   |   CONTAINS([Ingredients/Allergens], [Keywords])

You should have 29 TRUE values and 1014 rows of data.

Step 3 - Prepare the Vegan Shopping List

Summarise the products into vegan and non-vegan.

In order to label whether each product is vegan or not we can use an Aggregate step and the 'Contains Ingredient?' field. Group on Product and Description and take the MAX of 'Contains Ingredient?'. 


A TRUE value is equivalent to 1 whilst a FALSE value is equivalent to 0, so taking the MAX returns TRUE if a product contains any of the non-vegan ingredients and is therefore not vegan. You should find there are 20 vegan products (FALSE) and 19 non-vegan products (TRUE).

Filter the products

Now you can simply filter the products to keep only the vegan products (FALSE), remove any unnecessary fields, and output the data.

Step 4 - Prepare the Non-Vegan List

Now we need to create a list of the non-vegan products with their new field that indicates why they're not vegan. To do this, we're going to go back to the step where we created the 'Contains Ingredient?' field and create a new branch separate from the one where we aggregated to get the Vegan list. Unfortunately, there's currently no one-step/dynamic method to concatenating fields or values across rows so it'll need a little manual work.

Filter the Products

Firstly, we can keep only the non-vegan products by filtering the 'Contains Ingredient?' field to exclude FALSE values or keep-only TRUE values which should leave us with 29 rows.

Rotate every keyword into their own field

In order to create our final 'Contains' field, we need one row per product with each keyword in their own field so we can concatenate them together. The first step in this is to create a header for each keyword. The quickest way of doing this is to simply duplicate the 'Keywords' field.

From here, we can use a rows-to-columns pivot with the duplicate keywords as the header field and the original field as the aggregated field using either a MIN or MAX aggregation (NULL values have no value so can't be min or max).

Concatenate the keywords into a single field

Now we know which non-vegan ingredients have been picked up for each product we can concatenate them together into a single field. As mentioned before, there isn't any fantastic way to do this other than writing a calculation. You can see our calculation below.

REGEXP_EXTRACT(
IIF( ISNULL([milk]), "", [milk] + ", " )
+ IIF( ISNULL([whey]), "", [whey] + ", " )
+ IIF( ISNULL([lactose]), "", [lactose] + ", " )
+ IIF( ISNULL([egg]), "", [egg] + ", " )
+ IIF( ISNULL([gelatin]), "", [gelatin] + ", " )
+ IIF( ISNULL([gelatine]), "", [gelatine] + ", " )
+ IIF( ISNULL([honey]), "", [honey] + ", " )
+ IIF( ISNULL([beeswax]), "", [beeswax] + ", " )
, '(.*)\, $'
)

We use the IIF functions to include the keyword followed by a comma and a space if the keyword exists and move on otherwise. This results in an extra comma and space at the end of the final value, so we use REGEXP_EXTRACT to get everything apart from these last two characters, however you can use a variety of methods, such as the LEFT and LEN functions to achieve the same result.


You should now be left with 19 rows and, after removing the unnecessary fields, 3 final fields. The data is ready for outputting and the challenge has been completed!

The full outputs can be downloaded here: 1. Vegan List and 2. Non-Vegan List.

After you finish the challenge make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja@JennyMartinDS14 & @TomProwse1

You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text