2020: Week 15 - Solution




You can find our full solution workflow below and download it from the Preppin' Data Community Forum page!



As ever, there are probably many different ways to come to the same answers… but this is the one that we found.

Split and Pivot

The first couple of steps are nice and easy:


Connect to the data, and then split the [Items] field based on a comma. This will give you three extra columns. Then pivot columns to rows, and filter out nulls. We’ve also renamed my fields to give me this data structure:



Upper Branch: Calculating Support

At that point, we split off our flow into two branches. In the first branch, we calculate the support for each item. Now, you could aggregate first - group by [Item], and count distinct of [TransactionID]. But then you’d also lose the total number of transactions, so you’d have to do another aggregation alongside it where you’d take the count distinct of [TransactionID] without grouping by [Item], and then join those two together.

That would be a faff, so this is where the new LODs come in handy:


The first thing we found out is that Prep doesn’t support table-scoped LODs the way that Tableau Desktop does - you can’t simply do something like {SUM([variable])}. So, what we did was create a dummy field, which was simply the value “x”. That let us calculate the total number of transactions like this:

[TotalTransactions]
{FIXED [Dummy]: COUNTD([TransactionID])}

We also calculated the total number of transactions containing each item by doing:

[CountOfItem]
{FIXED [Item]: COUNTD([TransactionID])}

That meant we could calculate support for each item by doing:

[Support]
[CountOfItem] / [TotalTransactions]

You could, of course, wrap this up into a single calculation, like:

{FIXED [Item]: COUNTD([TransactionID])} / {FIXED [Dummy]: COUNTD([TransactionID])}

Now you can aggregate to simplify your table: group by [Item], and take the AVG - or the MIN, or the MAX, but not the SUM - of [Support].

We’re done with the upper split for now, so let’s leave that there and go back to where we branched off after the pivot.

Lower branch: Creating the Association Rules

Back we go. Remember, this is the table we’ve got:



The first part of this lower branch is as follows:



We're creating the left hand side and the right hand side by duplicating the table, renaming the headers, and creating a dummy join field (again, every value is “x”). The only difference between the two nodes is whether we're renaming things to “RHSWord” or “LHSWord”:


Next, join both of those nodes together again. You’ll want to set it to [JoinDummy] = [JoinDummy], and also include [LHSItem] != [RHSItem]. This stops us generating rules for impossible association rules like “hand soap → hand soap”:


Finally, create the association rules, simply by adding the calculation:

[AssociationRule]
[LHSItem] + “ ---> “ + [RHSItem]

Lower branch: Calculating Confidence

The next bit of this lower branch is a bit more complicated.


The aim of the next few steps is to calculate the confidence for the association rule. This is the support for the right hand side of the rule out of all transactions which contain the left hand side of the rule. So, the two numbers we need are the number of transactions which contain the LHS item across the whole data set, and the number of transactions which contain the RHS item if they also contain the LHS item.

Remember, this is the data structure we’ve currently got:


So, we can first calculate the number of transactions which contain the LHS item across the whole data set with a fixed LOD:

[LHSItemCount]
{FIXED [LHSItem]: COUNTD([LHSTransactionID])}

That bit is nice and easy. But the next bit is a little trickier. At the moment, there’s a lot of rows here that we don’t need because we joined all LHS rows to all RHS rows in the previous step. We already have the figure for the number of transactions the LHS item occurs in from our Fixed LOD - now, to calculate the number of transactions the RHS occurs in given the LHS item, we only want to keep rows where the LHS item exists in the RHS transaction and the RHS item exists in the LHS transaction.

For example, here’s the all the rows we currently have for the association rule “hand soap → moisturiser”. We want to get rid of the rows highlighted because the LHS item, hand soap, does not exist in the RHS transaction list.


The way we did this was to calculate the RHS item count given the LHS item by doing this:

[RHSItemCount]
IF CONTAINS([LHSTransactionItemList], [RHSItem])
AND CONTAINS([RHSTransactionItemList], [LHSItem])
THEN 1
ELSE 0 END

This will return quite a lot of rows where the [RHSItemCount] is zero, like this - note that these are the same rows highlighted above:


And we want to get rid of those 0 rows, so we can simply filter out rows where RHSItemCount = 0. Now that we’ve filtered based on RHSItemCount, you can remove this field (and we're going to create a new field called that in the next step anyway).

The next step is to calculate the confidence. To get the number of transactions where the RHS item turns up given the LHS item, we can use another Fixed LOD:

[RHSItemCount]
{FIXED [AssociationRule]: COUNTD([RHSTransactionID])}

Finally, we can calculate confidence for the association rule! And all we have to do is this:

[Confidence]
[RHSItemCount]/[LHSItemCount]

The last step in this section is to aggregate this data into a nice simple table of association rules, LHS items, RHS items, and confidence:


Bringing it Back Together

The hard bit is over! All we need to do now is bring in the support for individual items. This will require two joins:


First, join on Item = LHSItem to get the support for the LHS item, rename [Support] to [LHSSupport], and remove the duplicate item field. Then do exactly the same thing for the RHS.

Calculating Lift

Only one step left! And it’s a really simple one. While we prefer thinking of lift in symmetrical terms (see the challenge blog), it’s easier to calculate with this data to use divide the confidence by the support for the RHS item given the LHS item. That means we can just do this:

[Lift]
[Confidence] / [RHS Support]

And that gives us our results:


This was a tricky challenge, but hopefully you got there, and hopefully you had fun! You may even have found a better way of doing it than we did - we’d love to hear what approach you took.

Make sure to fill in the participation tracker, and post you solutions onto our Preppin' Data Community Forum page so that we can compare our workflows!

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text