How to... keep your data up to date

Once a data set has been prepared for analysis, the next choice to make is whether the process needs to be repeated and, if so, how often. If the Input data source is updated, the likelihood is that your flow will need to be refreshed. Whether or not you have Prep Conductor will determine what options are available to you to maintain the data.

In this post, we will look at the two approaches to refreshing a data set, full and incremental, and what the difference is between them. For each technique, we will walk through how to set it up in Tableau Prep Builder and Prep Conductor, and offer a few key points to watch out for when using them.

What do we mean by refreshing data?

Refreshing data means the original input receives new data or has its existing data altered. Once there is new or altered data, you need to consider whether the changed data should be pushed through the data preparation flow. One of the advantages of Prep is that the preparation flow only has to be built once; it can then be reused by clicking the Run icon or by setting up a refresh schedule in Prep Conductor.

The refreshed data may have the following main changes:
  1. New rows - new data gets added as new rows
  2. Overwritten values - new data values overwrite existing values 
  3. New columns - new columns appear in the data set
Each of these types of change poses its own challenges, but all of them can be handled by the techniques covered in this post.
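
To make these change types concrete, here is a minimal pandas sketch with invented store names and figures, showing what each kind of change looks like in a table (Prep handles all of this through its interface rather than code):

```python
import pandas as pd

# A hypothetical snapshot of the input data (store names and figures invented).
original = pd.DataFrame({
    "Store": ["Leeds", "York"],
    "Profit": [120, 95],
})

# 1. New rows: an extra store appears in the refreshed source.
refreshed = pd.concat(
    [original, pd.DataFrame({"Store": ["Hull"], "Profit": [60]})],
    ignore_index=True,
)

# 2. Overwritten values: an existing figure is corrected in place.
refreshed.loc[refreshed["Store"] == "York", "Profit"] = 110

# 3. New columns: a field that did not exist before is added.
refreshed["Region"] = "North"

print(refreshed)
```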

When a data set is provided for analysis, a question you should ask is: will the data refresh? If the answer is yes, the data preparation flow will likely need to be re-run each time new data is available. Understanding the frequency of the refresh is important, as requests for 'live' reporting are sometimes constrained by how often the original source is updated. You should also assess how frequently the reporting actually requires fresh data; the source may be refreshing constantly, so if the reporting is only refreshed once a day, a cut-off point needs to be set.

What is the difference between Full and Incremental refreshes?

As the names suggest, a full refresh refers to a complete update of the data set, whereas an incremental refresh is only a partial refresh. With a full refresh, the original data set has all of its records removed and replaced with the records in the refreshed data source. A full refresh is therefore usually irreversible, so it should be checked before it is run to ensure that no data is removed without being properly replaced.

An incremental refresh uses only the latest new records to update an existing data set. Normally, these new records are added to the end of the existing data set. There is less risk associated with this type of refresh, as the added records can always be removed again, provided nothing has been overwritten.

In Tableau Prep Builder, the development team have sought to add extra flexibility by allowing incremental refreshes to overwrite existing data points, so that only the latest data points remain in the data source. For a full refresh, rather than overwriting the original data set, the refreshed data can be added to the end of the existing data set, forming a history table that captures snapshots of the data over time. These different approaches to full and incremental refreshes allow the user to set up refreshes for many different scenarios.
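
The underlying logic of the two refresh types can be sketched in a few lines of pandas. This is a conceptual illustration only, not how Prep implements refreshes; the function names and the `key` column are assumptions made for the example:

```python
import pandas as pd

def full_refresh(existing: pd.DataFrame, source: pd.DataFrame) -> pd.DataFrame:
    """Replace the whole output with the refreshed source."""
    return source.copy()

def incremental_refresh(existing: pd.DataFrame, source: pd.DataFrame,
                        key: str) -> pd.DataFrame:
    """Append only the source rows whose key value (for example a date)
    is newer than anything already written to the output."""
    new_rows = source[source[key] > existing[key].max()]
    return pd.concat([existing, new_rows], ignore_index=True)
```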

How to set up the different types of refresh

Full Refresh
A full refresh is set up in the Output step of Prep. At the final part of the Output configuration pane, there is an option to set a Full Refresh:

If you used Prep before the refresh options were introduced, any run of the Prep flow would have behaved like a full refresh. If the data set already existed it would be overwritten, but if the data set the Output step writes to doesn't already exist then it will be created. The main change the refresh options brought to Prep was that a Full Refresh could append all the new data to the end of the existing data set, rather than eradicating it.
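
As a rough sketch of that append option, the refreshed source is simply stacked beneath the existing output. Prep does not stamp the snapshots for you; adding a run date within the flow (the hypothetical 'Snapshot Date' field below) is one way to tell them apart later:

```python
import pandas as pd

def append_full_refresh(existing: pd.DataFrame, source: pd.DataFrame) -> pd.DataFrame:
    """Keep the existing rows and add the complete refreshed source,
    building a history table of snapshots over time."""
    snapshot = source.copy()
    snapshot["Snapshot Date"] = pd.Timestamp.today().normalize()
    return pd.concat([existing, snapshot], ignore_index=True)
```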

Incremental Refresh
The set-up of an incremental refresh begins with an additional option in the Input step. This functionality is a newer feature of Prep, which is why the Input step looks different here from how it has appeared in previous 'How to...' posts.

After the 'Set up Incremental Refresh' option has been ticked, a further set of options will appear; these need to be configured correctly to avoid error messages appearing on your flow.

The first configuration option is to select which data field will indicate that there is new data in the Input data source. In the example above, date has been selected, as profit figures are recorded daily per store and product type. Thus, when the next day's data is available, a new date will appear in the data set that needs to be incrementally added to the output data set.

The second element that requires configuration is which output to update. A Prep flow can result in multiple outputs, which is why this needs to be specified. In the example above, I have a single output and haven't renamed it from the default.

Finally, the name of the field in the output that matches the field being checked for new data is set. This is needed because data fields are often renamed as the data is prepared throughout the flow. In the example, the Date field hasn't been renamed, so the field in the output that is compared against the input is still Date.
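
Putting those three settings together, the check Prep performs can be approximated as follows. This is only an approximation of the logic, and the file names are hypothetical:

```python
import pandas as pd

# Hypothetical files; 'Date' is the field used to detect new data in
# both the input and the output, as in the example above.
existing_output = pd.read_csv("profit_output.csv", parse_dates=["Date"])
latest_input = pd.read_csv("profit_input.csv", parse_dates=["Date"])

# Find the most recent date already written to the output...
last_loaded = existing_output["Date"].max()

# ...keep only the input rows that are newer than that...
new_rows = latest_input[latest_input["Date"] > last_loaded]

# ...and append them to mimic an incremental refresh.
updated_output = pd.concat([existing_output, new_rows], ignore_index=True)
updated_output.to_csv("profit_output.csv", index=False)
```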

When running the Output, you can choose whether to run an Incremental Refresh or a Full Refresh both within Prep Builder and Prep Conductor. In Prep Builder, you can select this from the dropdown on the Output tool.

The type of refresh is not the only thing that can be controlled; you can also choose whether the data replaces or is added to the existing data set. This is set at the final stage of the Output configuration step:

What to watch out for when refreshing data sources

When setting up a refreshing flow within Prep, a few areas should be checked to confirm that the results of the Prep flow are as intended:

Changing data values?
When setting up an incremental refresh that adds new rows of data to an existing data set, the existing records need to be considered too. If there is any chance that those records might be updated, a full refresh should be run periodically to ensure that any updates to existing records are reflected in the data. Otherwise, only the new data will be added, and most users of the data set will incorrectly assume that everything is up to date.
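
A small, invented example of why this matters: if an existing figure is corrected at source but its date doesn't change, an incremental refresh keyed on date will never pick up the correction, whereas a full refresh will:

```python
import pandas as pd

# The output already holds the original figure; the source has since
# corrected it and added a new day (values invented for illustration).
output = pd.DataFrame({"Date": pd.to_datetime(["2024-01-01"]), "Profit": [120]})
source = pd.DataFrame({"Date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
                       "Profit": [150, 90]})

# Incremental refresh keyed on Date: only the 2024-01-02 row is appended,
# so the corrected 2024-01-01 value never reaches the output.
incremental = pd.concat(
    [output, source[source["Date"] > output["Date"].max()]],
    ignore_index=True,
)

# A periodic full refresh replaces everything, so the correction lands.
full = source.copy()
```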

Altering structure of sources
There is a difference between a new column that is simply another 'month' of data, which will be pivoted into new records, and a new column that is not expected within the flow. New date columns can be handled through the wildcard pivot option to prevent errors or unexpected consequences, as sketched below.
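
A rough pandas equivalent of that columns-to-rows pivot, with invented store and month values, shows how a newly added month column is absorbed automatically rather than breaking the flow:

```python
import pandas as pd

# Hypothetical 'wide' source where each month arrives as a new column.
wide = pd.DataFrame({
    "Store": ["Leeds", "York"],
    "Jan": [120, 95],
    "Feb": [130, 100],   # a newly added month column
})

# Pivot every non-Store column, similar in spirit to Prep's wildcard
# pivot, so new months become new rows instead of unexpected columns.
month_cols = [c for c in wide.columns if c != "Store"]
long = wide.melt(id_vars="Store", value_vars=month_cols,
                 var_name="Month", value_name="Profit")
```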

For new columns that are not expected in the data set, the effects on the output are harder to predict and are highly dependent on the contents of the flow. Here are some basic ideas as to what should happen:
  • If there are no pivots, the data is still likely to output but might not include the new column. 
  • Aggregation steps will remove the new column of data as it won't be included in the step and won't appear in the output unless that data set is added back in through a self-join.
New data, new input
If the new data created by a system or individual is saved as a data source under a different name, it may not get picked up by the existing Prep flow. The only way the newly named file will be picked up is if a Wildcard Union is set up to absorb it. The data structure of the file will also need to remain consistent for the flow to work as intended.
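
The same idea in pandas, with a hypothetical file-name pattern: any file matching the pattern is unioned on the next run, so a newly named file is absorbed as long as its structure is consistent:

```python
import glob
import pandas as pd

# Hypothetical naming convention: each extract is saved as profit_*.csv
# in the same folder with the same column structure.
files = sorted(glob.glob("profit_*.csv"))

# Union every matching file, similar in spirit to Prep's Wildcard Union.
combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
```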

Summary

When setting up data preparation flows, the updates made to the original source of the data need to be considered. The nature of those updates will alter how the flow should be designed to ensure the output is as intended. Once those considerations are made, choosing the type of refresh and whether the original data set should be appended to or overwritten is the final key set-up step. Get all these factors correct and you will be able to analyse the latest data on an ongoing basis with little ongoing effort.
