How to... Create Additional Data
The challenge with data preparation is often about cleaning and removing columns or rows of data, but that isn't always the necessary action to take. When preparing datasets for analysis, there is often the need to create additional data through certain techniques.
What situations require data creation?
The aim of the data preparation is to reduce the level of manipulation required when analysing the data. Therefore, any effort that can be made to simplify the 'data work' during analysis is one worth taking in the data preparation stage. Any column or row (record) that is not in the dataset, but should be for the analysis, needs to be created. There are multiple techniques for this purpose but firstly, let's explore where you shouldn't be creating extra columns of rows:
Dynamic Calculations in Desktop
Tableau Desktop has been designed to allow the user to conduct analysis at the speed-of-thought. Part of this design are the two calculations that are designed to factor in the data fields within the view:
If you want calculations that adapt with the dimensions on the view, then these needn't be set in the data preparation stage. However, if you want your columns to be clear on what they should contain and result in, then calculations should be completed in the data preparation stage.
Duplicate rows from Joins
One way to create additional rows is to join datasets together. If you join a dataset on to another where the granularity is identical, or a value is appended, then you are unlikely to create duplicate rows of data if there are matching data fields that represent the granularity of the data. However, if you have different levels of granularity, or the join conditions fields do not match, then joining data can create additional rows of data - not always in a useful way through.
If the granularity of the dataset that is being joined on has more detail within it, the join is likely to add additional rows by duplicating records in the original dataset. Any form of duplication will hinder the analysis as it is unlikely to make the analyst's life easy.
So what approach should be taken if additional columns or rows should be created?
Dynamic Calculations in Desktop
Tableau Desktop has been designed to allow the user to conduct analysis at the speed-of-thought. Part of this design are the two calculations that are designed to factor in the data fields within the view:
- Table Calculations - calculations that use the dimensions within the view to create additional analysis. As the dimension change, or the position of the dimensions, on the view, the table calculations update.
- Level of Detail calculations - there are two types of Level of Detail calculations that update as different dimensions are brought in to the view on Desktop, Exclude and Include. Therefore, the user sets the results from the calculations based on the dimensions used within the view.
If you want calculations that adapt with the dimensions on the view, then these needn't be set in the data preparation stage. However, if you want your columns to be clear on what they should contain and result in, then calculations should be completed in the data preparation stage.
Duplicate rows from Joins
One way to create additional rows is to join datasets together. If you join a dataset on to another where the granularity is identical, or a value is appended, then you are unlikely to create duplicate rows of data if there are matching data fields that represent the granularity of the data. However, if you have different levels of granularity, or the join conditions fields do not match, then joining data can create additional rows of data - not always in a useful way through.
If the granularity of the dataset that is being joined on has more detail within it, the join is likely to add additional rows by duplicating records in the original dataset. Any form of duplication will hinder the analysis as it is unlikely to make the analyst's life easy.
So what approach should be taken if additional columns or rows should be created?
How to create additional columns
Creating calculations adds an additional column to the dataset (unless that calculation is named identically to an existing data field in which case, the original field is overwritten). Calculations are either cleaning, or data creation functions:
- Cleaning - using a calculation to split longer strings or transforming existing data fields into new columns more useful for analysis.
- Creation - ratios, counts and totals can all be formed through calculations to make analysis easier or more thorough.
Any calculations completed in the data preparation can save users from having to know how to complete the same in Desktop, lowering the technical skills required to work with the data.
Additional columns can be created from existing rows using the Pivot - Rows to Column step. The shape of the data before the pivot will often see rows holding different measures. To make analysis easier in tools like Tableau Desktop, the Pivot step can be set-up to transform Rows into Columns. This will allow each measure within a row to become a column of its own.
As mentioned above, joins can create complexity if granularity is mixed or join conditions not clear. However, joins are the best way to add additional columns to a dataset, often from a separate dataset. The additional data can add a lot of context or further depth in the resulting analysis.
How to create additional rows
Just like the Pivot - Rows to Columns functionality, the Columns to Rows Pivot can create a lot of additional rows that will be useful for analysis. The Columns to Rows pivot is often used to take columns of multiple dates and transform them into a single column for the date and a separate column for the corresponding values. This makes analysis a lot easier as Tableau uses Dates in a very flexible way once they are in a single column.
Unions are frequently used to merge data sets with a very similar structure on top of each other. This adds additional rows for each additional dataset used. Unions are often used to add differently weekly or monthly datasets together to create a single file.
Scaffolding
The final technique to add rows is called scaffolding. This is the process of taking your core dataset and adding it on to a scaffold that includes all the records you require. Scaffold is often used to fill in missing dates or records that simply don't exist but where you require a record to exist, even if it shows a null or zero.
These techniques are not the only ways to create additional columns or rows in your dataset but will probably be the main techniques you use. Time spent completing these tasks in the data preparation stage can save hours of rework for lots of users of the dataset or missing the ability to analyse the data in that way if their skills don't allow them too.
The final technique to add rows is called scaffolding. This is the process of taking your core dataset and adding it on to a scaffold that includes all the records you require. Scaffold is often used to fill in missing dates or records that simply don't exist but where you require a record to exist, even if it shows a null or zero.
These techniques are not the only ways to create additional columns or rows in your dataset but will probably be the main techniques you use. Time spent completing these tasks in the data preparation stage can save hours of rework for lots of users of the dataset or missing the ability to analyse the data in that way if their skills don't allow them too.