How to...Join Datasets together

Whether your data comes from Files, Databases or both, mastering Joins will be a key skill to learn. Joins are required as most software requires one large table of data to reference when conducting analysis. As data will come from a lot of different sources, being able to combine these is key. Data sources are joined together to allow the user to take columns from each data source and use them alongside each other in the output.

For this post, the examples will use the 2019 Week 29 inputs except the initial Join Condition example.

Join Logic and Terminology

The majority of data software uses similar logic and terminology when joining datasets together. Prep is no different but some of the functionality when joining datasets together is a lot more user friendly than other tools.




Here are two tables that we might want to join together for analytical purposes. If we have price, but we don't know how many we have sold, we can't be sure of revenue. Therefore, these two tables need to be put together to allow us to find out the revenue created. 

To do this, there are two aspects to Join:
  • Join Condition - determines how to link the two tables together
  • Join Type - determines what will be returned by the Join
Join Conditions can be simple matches or, more complex logic. In the above example, the Join Condition would be does the value in the data fields 'Packages' and 'Subscription Package' match.To be able to conduct the revenue analysis, we'd want to analyse the Price paid for each package along with the number sold. When joining datasets together, there can be multiple Join Conditions set so the condition isn't simply a case of a simple one value matching another in a different dataset. 

Joins in many tools, Tableau included, are often represented by a Venn diagram. The reason for this is, we need to consider what parts of the tables we return - this refers to the Type of Join. The terminology associated to Joins is aligned to this logic also. One dataset will be classed as the Left side, demonstrated as the left hand circle in the Venn diagram, and the other as the Right side, demonstrated as the right hand circle in the Venn diagram.

Inner

The inner join will only return records where the join condition(s) is met.

Left

The left join returns all rows from the the left table of data as well as the records from the right table that meet the join condition(s). Any rows that do not meet the join condition will not be carried through to the next step.

Left Only

The Left Only join returns records from only the left table that do not meet the join condition(s).

Right

Similar in theory to the Left Join. The Right Join returns everything from the right table as well as records from the left table that meet the join condition(s).

Right Only

Similar to Left Only. The Right Only Join will return just records from the Right table that does not meet the Join Condition set.

Full

The Full Join will join all records that match the join condition together but the dataset available after the join will also include the records where the join condition isn't matched. In the data fields from the other dataset where the join condition isn't met, the values will be null. 

Not Inner

The Not Inner Join will return the same as the Full Join except it will not return the records where the Join condition is met. 

How do Join datasets together in Prep?

Joining separate data sources is very easy in Prep. Simply drag one data source towards another and Prep will give a number of options:
By dragging the second data set (in this case 'Subscription Packages') to the original (blue) data flow, the option to Union or Join the data sets appear. By releasing the mouse click when hovered over the 'Join' option, Prep will automatically link the two flows of data together. 
Another approach to creating the Join between two data is to add a Join step and then drag the other input on to the created Join Step by releasing it above the 'Add' section. 

As can be seen on the flow above, an error is highlighted as the Join condition hasn't automatically been set. This occurs when Prep finds matching data field names in each of the two separate inputs. Be careful to check the condition makes sense though as even if the column headers match, it doesn't mean the fields with similar values are exactly the same.

By clicking on 'Add' in the 'Applied Join Clauses' section of the Join, the Join Condition can be set. This is done by selecting a field to assess from each data source and then setting the type of assessment made. Equals is the most common qualifier but Prep allows the user to set other qualifiers like Does Not Equal or Less Than which can save the user from filtering the result of the Join as well as meeting the need to deal with more complex logic.

Once the Join Condition is set, Prep will demonstrate clearly the results of the Join.

The Join Result at the bottom of the image demonstrates the number of rows the resulting dataset will have. When deciding what type of Join is needed, this can be very useful to assess the impact. The Join Type is altered by clicking on the different sections of the Venn diagram. Any areas in dark grey will be included in the resulting output. If you want to remove a section of the Venn diagram from being returned, click on that section again to deselect it. 

In which situations should each Join Type be used?

Just because the correct number of records are returned by the Join, doesn't mean the Join is correct. The Join Condition and Type of Join needs to be thought through thoroughly as to what data you are both including as well as excluding. 

There are a number of situations that require certain Join Types so let's consider a few of those:

Inner
The Inner join is the workhorse of Join techniques. When most people think of a Join in data, this is what they will consider. Inner Joins are where you are pulling together two data sources and only if the condition is met, then will the records from either table be returned. This is a way of completing a Complete dataset holding all the values in result of the Join. The Inner Join is a way of filtering out missing records which would be returned as a null value in a Left or Right Join and thus protecting calculations that require each value to be non-null.

Not Inner
This seems to be a bizarre Join Type on first appearances. Why join two datasets together that fundamentally won't join due to not matching the Join Condition set? Simply, data quality. By creating a Not Inner Join, the user can 'treat' the data that does not meet the condition to meet the condition of the Join is rejoined in the future. For error checking, this is a good technique to get used to. Depending on the data Left Only or Right Only joins would be used in a similar way but only returning the data from one of the tables that doesn't meet the join conditions.

Left
Left Joins are very common, almost to the same extent as an Inner Join. In a Left Join, as every record is returned from the left table, the right input could be considered as being 'appended' on to the end of the relevant rows in the left hand table. 

Joins can be a fantastic way to add additional context into your analysis as you can add data that previously didn't reside in the original source. There are challenges with non-unique join conditions that we will cover in future posts but hopefully this post will help think about how tables should be joined and, depending on the data you want returned, what type of join you want to use.
                                     ____________________________________________________

Here are some of the weekly challenges to practice Joins:

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text