How to... Remove data when Inputting
Making things easier for yourself in any task is always a good tactic. Within data preparation, you can simplify much of your subsequent work by making amendments at the initial data connection, which is created and shown in the Input step in Prep. Often you already know that certain elements of your dataset need to be changed or removed. This post covers the considerations to make and why they are important.
By clicking on the Data Type icon in the metadata grid within the Input step, you can change the data type, in this case from String to Number (whole). However, the sample values shown in the metadata grid will change to 'null' wherever the previous string values could not be converted.
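This behaviour is common to most data tools when a type is coerced. As a rough illustration (this is pandas, not Prep itself, and the column name is made up), the same thing happens when forcing a string column to numeric:

```python
import pandas as pd

# A string column mixing numeric-looking and non-numeric values.
# "order_id" is a made-up column name for illustration.
df = pd.DataFrame({"order_id": ["1001", "1002", "N/A", "1004"]})

# Coercing the column to a numeric type turns anything that cannot be
# parsed into a null (NaN), just as Prep shows 'null' in the sample values.
df["order_id"] = pd.to_numeric(df["order_id"], errors="coerce")

print(df["order_id"].isna().sum())  # → 1: one value could not be converted
```

The original values are not recoverable after the change, which is why it's worth checking the sample values in the metadata grid before committing to a type change.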
Why might you change your dataset before you've even loaded the data?
Datasets are proliferating and growing rapidly, so consideration has to be given to what is actually being brought into the tool. Any data that is brought in will be loaded into your computer's memory, so any chance to reduce the amount of data to be processed is worth taking. Tableau Prep samples the dataset on the initial load, with the full dataset only being processed when the output is run.
For Tableau Prep, the initial connection is the Input step, but Prep doesn't load all of the data instantly. Instead, it loads the metadata - the data about the data - in the Input step. This helps the user in two ways:
- Giving an easier overview of the data
- Preventing slow load times, as Prep does not have to process all of the data
Deselecting the fields you, and the dataset users, do not require will save that data from being processed by Tableau Prep.
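The saving from deselecting fields has a direct equivalent in most data tools. As a hedged sketch in pandas (the file contents and field names are made up), restricting the columns at read time means the unwanted data is never loaded at all:

```python
import io
import pandas as pd

# A stand-in for a wide source file with a column nobody needs.
raw = io.StringIO("order_id,region,notes,amount\n1001,North,misc,25.0\n")

# Loading only the fields you need - the equivalent of deselecting
# columns on the Input step - means the rest is never read into memory.
df = pd.read_csv(raw, usecols=["order_id", "amount"])

print(list(df.columns))  # → ['order_id', 'amount']
```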
Making amendments to your dataset here can also save time and complication further through the data preparation process.
Slow Performance, Slow Build, Slow Output
As already mentioned, performance is a big reason to remove unnecessary data whenever you get the opportunity. But it isn't just faster processing speeds that help; it's that the software can keep up with how you want to work, show you what you might want to do next, and then let you make those alterations. Nothing is more frustrating than the wasted time of watching a loading screen.
Reducing the number of rows, beyond the sampling Prep does by default, and reducing the number of columns both mean Prep has less data to process. With fewer rows, it is also easier to spot which Cleaning, Reshaping and Merging steps are required, and those steps run faster once the changes are made.
It isn't just the input and build time that is affected by carrying unnecessary data; it's the processing of the output too. Flows may be run very frequently through either Tableau Prep Builder or the server-based tool, Prep Conductor, which gives the user the ability to schedule flows to run on a regular basis.
After publishing a flow to your Tableau Server, you can then set a schedule if you have Prep Conductor available to you.
The schedules available to you are determined by what is set up by your Server Administrator.
If you have a very frequent schedule, like every 15 minutes or every hour, any unnecessary data being handled soon adds up to additional load on the server. Across hundreds of data sources processed each day, sometimes on very large datasets, this can reduce performance for everyone.
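A quick back-of-envelope calculation shows how fast this compounds (the 50 MB of unnecessary data per run is a made-up figure purely for illustration):

```python
# A flow scheduled every 15 minutes runs 96 times a day.
runs_per_day = 24 * 60 // 15

# Hypothetical: each run drags 50 MB of data nobody needs through the server.
wasted_mb_per_run = 50
wasted_gb_per_day = runs_per_day * wasted_mb_per_run / 1024

print(runs_per_day)                 # → 96
print(round(wasted_gb_per_day, 1))  # → 4.7 GB of needless processing, per flow, per day
```

Multiply that by hundreds of flows and the cost of leaving unnecessary columns and rows in place becomes very real.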
Changing Columns
The majority of changes made on the Input step are to the data fields, or columns, of your dataset. Simply deselecting the column in the Input step will remove the field from being loaded by Tableau Prep. Only do this if the column contains nothing that you need. Having messy data in a column is not sufficient as a reason for its removal. If any of that data could be useful to you, or your data source end users, then cleaning up that data can happen in subsequent steps.
Completely null columns are likely candidates for removal at this stage. These commonly occur in Excel files where a user has spaced out the data for formatting reasons, so it reads well when consumed in Excel. Such a column will likely appear with a header like 'F4' if it's the fourth column in the dataset. If all of your field names appear this way, you may need to use the Data Interpreter to find the column headers in the Excel worksheet. You can manually rename the data fields in the Input step, but this can be laborious.
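Dropping entirely null columns is a one-liner in most data tools. As an illustrative pandas sketch (the frame mimics an Excel sheet read with a blank spacer column; the names are made up):

```python
import pandas as pd

# A frame mimicking an Excel sheet where a blank spacer column has been
# read in; pandas names it "Unnamed: ..." where Prep would show "F4".
df = pd.DataFrame({
    "order_id": [1001, 1002],
    "Unnamed: 1": [None, None],   # the spacer column - entirely null
    "amount": [25.0, 40.0],
})

# Drop columns that contain nothing but nulls - the equivalent of
# deselecting them on the Input step.
df = df.dropna(axis=1, how="all")

print(list(df.columns))  # → ['order_id', 'amount']
```

Note `how="all"` is deliberate: a column with some nulls may still hold useful data and should be cleaned in a later step rather than removed here.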
Changing Rows
By changing the columns, you can change the values in the dataset. This is most often the result of changing data types: if the data type of a column is changed to numeric, Tableau Prep will replace any non-numeric values in that column with nulls.
If the records with a null result are not required in the flow, they can be filtered out at this stage too.
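Filtering out the rows that became null after a type change can be sketched in pandas as follows (an analogy for the Input step filter, with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"order_id": ["1001", "N/A", "1004"]})

# Coerce to numeric: the non-numeric value becomes null.
df["order_id"] = pd.to_numeric(df["order_id"], errors="coerce")

# If rows that became null are not needed in the flow, drop them now -
# the pandas equivalent of filtering them out on the Input step.
df = df.dropna(subset=["order_id"])

print(len(df))  # → 2: the non-numeric record has been removed
```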
Whilst with most datasets the difference may feel slight, it's the cumulative effect of removing unnecessary columns and rows that will save time in your analysis, as well as for others using shared data sources and server resources.