How to...Union

Most software that works with data demands the user form a single table of data to work from. However, the world is often not that simple and often the user will find themselves with many tables of data to pull together to form that single table. On technique will frequently be found in data preparation for this task, its called Unioning.

What is a Union?

A Union can be thought of as stacking one dataset above another. Columns will be stacked on top of each other when they contain the same content. This requires the data structures to be very similar (we'll handle those differences in the next section). 

Let's take sales from our York and Leeds Store. Both are separate tables.


But when Unioned, the extra set of column headers are removed to leave the rows of data stacked on top of each other.


The Union is determined by whether the data fields are:
  1. Named the same in each data set - Any column with the same name, will be stacked above the column in the alternate data set of the same name.
  2. Positioned in same order - the first column will be stacked on top of the first column in the other data set, regardless of what is in it.
  3. Manually matched by the user - some software allows the user to make a matching of columns of their own choice before unioning the data sets together
In the example above, the tables could have been unioned by either their position or name. Currently, Tableau Prep only allows for the first type of Union. In Prep, columns can be united using the other techniques by renaming the relevant columns the same name to meet the demands of the first technique.

What if the data structure isn't identical?

In Prep, as the Union is based on matching column names, mismatching occurs when those names are not the same. For example, Scent has been captured under the correct column header but the same data has the column header of 'Type' in the alternate table. 



Therefore, in a normal union, even though the user might assume the columns should be stacked on top of each other in the resulting table. Where the columns are not stacked, rows from the other table will have nulls where not matching column is found.

In Prep, you will clearly see these mismatched fields as they are shown within the Union step. 

The mismatched fields can be merged within the configuration space of the step by selecting the fields to merge and selecting 'Merge Fields'. 

When to Union data?

So now you know how, you just need to know when to deploy the technique. Unioning data can happen for a lot of reasons but there are some consistent circumstances where Unioning data is a consistent strategy to deploy.

1. Monthly data sets
When working with other teams in your organisation or external third parties, analysts will often receive files on a monthly basis to work with. Automated data preparation flows can help save a lot of time each month to shape the data in to a consistent, useful state for analysis. However, if the data needs to build up over time, then the files need to be appended to each other. Unions are the perfect technique for this. Here's how the York store would change from monthly Sales files to a single file for analysis.


2. Downloading data sets from web-based sources
Pulling together similar data sets from web-based sources is another instance when uniting files can be useful. For example, taking team rosters from a single source, but different web pages can be unioned together. One technique to add is to use the URL, or sheet name, to leave a reference to the team in the data set. 

In this example, using ESPN's team rosters and Google Sheet's IMPORTHTML() function, to pull together a full list of players, you can union the two team roster tables together but in doing so the analyst would lose the team each player plays for. 


When using the Union step in Prep, Tableau adds the Table Name from where the data originates. This allows the analyst to split the team name off from the rest of the table source. 



3. During Mergers of companies
With the rise of cloud based software, more companies are using similar tools. This means, during mergers of departments and organisations allows data exports from these tools can be unioned together. Taking Salesforce as an example, pulling together the new list of Sales Pipeline using a Union will be very simple. 

Multiple tables...Wildcard Union?

Setting up a Union for each merge of two tables could be very time consuming. When using the Input step in Prep, you can select 'Wildcard Union' instead of a Single Table. The Wildcard has three sections that enables the user to select not just a single input but actually any folder, file or sheet that matches the pattern specified by the user. 


The orange sections highlighted above can be altered to have a '*' (asterix) inserted into it to act as a catch-all. The asterix symbol is known in data circles as a wildcard and allows the user to set the patterns for what should and shouldn't be captured in the Union. 

Let's take the following example: one Excel workbook but three sheets: York Sales; Leeds Sales; Reference Table. If the analyst wants to pull the York and Leeds Sales tables together then in the final orange section above, would set a matching pattern of '*Sales'. 

This pattern means that any characters in the Sheet name before the word 'Sales' would be ignored and therefore, the York and Leeds sheets would be absorbed. As the Reference Table doesn't meet the criteria, it wouldn't be absorbed in to the Union and wouldn't be available for analysis. 

Now you can easily pull similar data sets simply to make your data preparation and analysis easier.
                  ____________________________________________________________
If you want to practice using Unions, have a go at the following challenges:

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text