How to...complete Advanced Joins
Joining is one of the more powerful techniques for completing your data preparation for analysis. Adding in an additional dataset to what has been available before can either help give more context as to why something is happening or, answer more granular questions. In the previous post on How to... Join Datasets together, we covered Join Types and Join Conditions to get to the data you want in a straight forward way. However, there are many use cases of joins where sadly, the logic is more complex and that is what this post will cover.
Multiple Join Conditions
Frequently the way to link two datasets will be more complicated than a simple single join condition. Therefore, being comfortable adding multiple Join Conditions and understanding the impact of this is key. The key piece of information to understand about Multiple Join Conditions is that they behave as an AND rather than an OR statement. This means that if you have Multiple Join Conditions, all of them need to be met for the data record being assessed to be output by the Join Step.
In the example challenge, by joining the Actual Volume and Values dataset to the Profit dataset, it requires Multiple Join Conditions. By only having one Join Condition (in the image below, only Type is joined), each dataset only has 16 rows of data but the output of the join is 128 rows.
This is due to Type being equal to Join in the other dataset multiple times. Each dataset has 8 rows for each type as each type of product (bar and liquid) is sold in each of the 8 weeks recorded. Therefore, all the bar records are joining to the bar records in the other dataset (again 8) so 8 x 8 gives us 64 rows for each type and we have two different types, hence the 128 rows resulting from the join.
By adding an additional Join Condition to match the rows of data on a one-to-one basis, in this case adding is Week equal to Week, we get the desired result. The additional condition is formed by clicking on the plus in the circle in the top right corner of the Join configuration pane.
Non-equals Join Conditions
In many cases, when thinking about how to Join two datasets together, the search begins for common fields in the two datasets. As we think about common fields, we also often think about finding things that are equal to each other. This shouldn't be the case as there are many situations where using a condition other than equals can be very powerful. These examples are taken from the Preppin' Data challenge Week 8 in 2020.
Filtering with a Join
In the challenge there are a number of week's worth of sales that need to be compared to either Expected Values and Volumes that would make the company profitable and Budget values. The Profit data has one record for each week to be appraised against. The challenge also requests that only those weeks that exceed the Profit point are included.
This is where Join Conditions that only return values that exceed that taken from another dataset can come in particularly useful. Remember the less data being returned reduces the processing time in data preparation.
In the challenge, the required Join Condition is that the Actual Values and Volumes recorded exceed the Profit Minimum expectations. This is set using two Join Conditions that are 'greater than or equal to' rather than just a column from each dataset having values that are equal to each other:
There are actually four join conditions in use here. The Type of product and Week number should be equal from each dataset but to only return records that beat the minimum level of profitability, the 'Sales Volume' has to be Equal to or Greater Than (=>) 'Profit Min Sales Volume' and 'Sales Value' has to be Equal to or Greater Than 'Profit Min Sales Value'. All four of these conditions must be met for the records to be output from the Join Step. In this case, we can see a number haven't been:
Joining by a Range
When joining different datasets together, sometimes there isn't an exact match to join on. In these situations, if there is a range shown within one set of data, that the other dataset has a corresponding column that holds values within that range, Non-Equals Join Conditions can be used to join these together.
In the challenge set in Week 8 2020, there is a range of weeks for each budget level that the actual value and volumes need to be assessed against.
In this instance to make the range available to be a minimum and maximum value to have the actual week to be tested against, the value in 'Range of Weeks' column needs to be split.
This results in a data set with two additional columns - Start Week and End Week of the range:
These fields can then be used to form the range for the actual week in the original dataset. Using the minimum of the range as the Less Than or Equal (<=) to conditions and maximum week of the range as Greater Than or Equal to (>=) compared to Week, allows the relevant range records to be added to the Actual Volumes and Values.
As discussed earlier in this post, Multiple Join Conditions behave as an AND condition but what happens if you don't want all the records to meet all the Join Conditions you need to assess. In the Week 8 2020 challenge, this is the case when assessing the Budget dataset against Actual results. The output from the Join required is whether either Actual Volume OR Value is below the Budget for that week. This can be completed through Joins but we actually need two Joins and then to Union those results back together to get a complete set of records for the question.
The Join Conditions are very similar set up but we will just show one of the two. For the other, just switch Volume for Value. Type being equal to Type is a Join Condition, it's just held off the screen in this screen shot:
The output of these two Joins can then be simply stacked on top of each other. The technique to do this is a Union.
The ability to add the OR question gives much greater flexibility in analysis rather than having to work around Join Types and Conditions to form an ability to bring the required data back.
Practice Advanced Joins in the following challenges: