How to...decide Where to Prepare your Data?

Like many software providers in the data landscape, Tableau doesn't have just one tool in which you do absolutely every task. Frankly, a single tool would either be crazily complex to work with because of the number of options, or incomplete in terms of what the user needs. Having multiple tools within the Tableau platform does pose another question, though: where should we complete certain processes?

What Processes should we consider?

Data Preparation comes down to a number of key steps:
  • Inputting Data
  • Joining / Unioning Multiple Datasets
  • Pivoting
  • Cleaning 
  • Aggregating 
  • Outputting Data
Each of these could have a chapter of its own, with lots of hypothetical situations, but in reality the majority of these steps should take place in the Data Preparation tool. Joins, Unions and Pivots are common tasks at the Data Preparation stage, and handling them there removes that complexity for the user. Whilst flexibility might be required in some cases (specific visualisation styles, for example, demand a different data structure), the majority of datasets will need a relatively standard set-up for analysis.
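To make the Pivoting step concrete, here is a minimal sketch in plain Python of a columns-to-rows pivot, the kind of reshaping that is usually best done once at the preparation stage. The table, the field names and the function `pivot_columns_to_rows` are all hypothetical and only illustrate the idea; this is not how Tableau Prep implements its Pivot step.

```python
# Hypothetical wide table: one column per month (all names are illustrative).
wide_rows = [
    {"store": "Leeds", "Jan": 100, "Feb": 120},
    {"store": "York", "Jan": 80, "Feb": 95},
]


def pivot_columns_to_rows(rows, id_field, value_fields, name_field, value_field):
    """Turn one row per id with many value columns into one row per id/column pair."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            long_rows.append(
                {
                    id_field: row[id_field],
                    name_field: field,        # the old column header becomes a value
                    value_field: row[field],  # the old cell becomes the measure
                }
            )
    return long_rows


long_rows = pivot_columns_to_rows(wide_rows, "store", ["Jan", "Feb"], "month", "sales")
for row in long_rows:
    print(row)
```

The result has one row per store and month, which is the shape most analysis expects, so users of the dataset do not each have to repeat the reshaping.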

This leaves cleaning (including calculations) and aggregating as two processes that may fit in either the data preparation tool or the visualisation tool. With small datasets and simple calculations, the correct or 'best' place is more ambiguous. However, as the dataset grows and/or the calculations become more complex, this decision begins to determine how successfully your organisation will utilise data.

Data Preparation vs Visual Analytics

Considering which tasks should be completed in each tool can help us shape the allocation of work when it comes to data preparation. The answer is again on a spectrum and depends on each organisation's context, specifically how sophisticated it is in terms of a number of factors:

Data Literacy
Data Literacy, the understanding of data products (graphs, results etc), is a key determinant in deciding where you will conduct the cleaning and aggregation. Making data easy to work with is important, but ensuring answers formed from datasets are correct is even more important. If your peers do not have the understanding to take on the tasks, you'll need to ensure the work is completed before you make the data available to them. 

Size of Organisation
Having a team that is competent and able to complete the tasks is one thing, but the volume of work it would take to repeat a task multiple times across an organisation is a key consideration. If you are asking one person to complete a task once, it doesn't matter much where that task is completed. If that same task would need to be repeated hundreds or thousands of times by multiple individuals across an organisation, then it should be driven into the Data Preparation tool to reduce the volume of effort. Data Preparation tools are designed to take these repeated tasks and automate them once set up.

Quality of Technological Hardware
The hardware the tasks are processed on will have a strong impact on the time it takes to complete them. Companies across the world pay people highly but then equip them with older laptops or under-powered computers. This has a large impact on people's ability to work with data, and the problem is being exacerbated by increasing volumes of data. If the datasets for analysis are small, then basic Data Preparation may still be fine on the individual's computer. If the datasets are large, then a Data Preparation tool might be a better option. Data Preparation tools can often work with just a sample of the dataset (as Tableau Prep does automatically for large datasets) and only process the full dataset when required. This full processing is likely to be moved to a server (a computer with more processing power) once the full end-to-end data flow and logic are established.

History of Data Investment
If data solutions have been well invested in over time and continue to be so, the likelihood is that databases will contain clean, ready-to-use information. As analysis is conducted, any additional fields it creates may be added to the database for future use. If this isn't the case, then it's likely you'll be wrestling with messier data from multiple sporadic sources. This doesn't give a clear answer as to where you should do your data preparation, but you are more likely to need to iterate between the data visualisation tool, to find out which data is useful, and the data preparation tool, to set up more strategic data sources for future use.

All of these contextual factors will help guide you, but it is only when doing the actual work that you will settle on where to complete each task.

Performance - software

Prep is specifically designed to optimise the process of forming the data preparation flow and then executing it.

Sampling
When connecting to a dataset in Prep, the software runs a sampling algorithm so that you see a suitable profile of the data without having to process all the rows of the full dataset while you build the flow.

The sample is designed to represent what the preparation will need to include. Tableau Desktop also shows a small sample of data, but it is based only on a fixed number 'N' of rows. For many data sources this will just be the first 1,000 or 10,000 rows of data in the table you are connecting to. Therefore, if there are challenges that have recently arisen in the latest rows of a table, you might not see them until much further down the analytical process.
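The risk of a 'first N rows' preview can be shown with a small sketch in plain Python. The table below is made up, and the evenly spread sample (every 10th row) is only a stand-in chosen for illustration; it is an assumption and not Tableau Prep's actual sampling algorithm.

```python
# Hypothetical table of 10,000 rows in load order; only the most recent
# 100 rows (row_id 9900 onwards) contain a missing amount.
rows = [{"row_id": i, "amount": None if i >= 9900 else 10.0} for i in range(10_000)]


def count_missing(sample):
    """Count rows in the sample whose amount is missing."""
    return sum(1 for row in sample if row["amount"] is None)


first_n = rows[:1000]  # a 'first N rows' preview: 1,000 rows from the top
spread = rows[::10]    # an evenly spread sample: every 10th row, also 1,000 rows

print(count_missing(first_n))  # 0: the fault in the latest rows is invisible
print(count_missing(spread))   # 10: the same sample size exposes the fault
```

Both samples contain 1,000 rows, but only the one drawn from across the whole table reveals the problem in the newest data.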

Functionality
Data Preparation functionality was initially built into Tableau Desktop in the Data Connection window. However, that screen needed to stay clean and clear for users, so the need for additional functionality led to Prep.

Whilst basic tasks can be completed in Desktop, there are limitations, and this is where Prep comes into its own. Planning the required Data Preparation steps will often rule out Desktop as an option for completing the preparation. Multiple pivots, unioning datasets from different sources and pre-aggregating tables before joining are just a few of the tasks that you will need to complete in Tableau Prep rather than the visualisation tool.
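Why pre-aggregating before joining matters can be seen in a small sketch in plain Python. The two tables, their field names and the helper functions are hypothetical; the point is only that joining tables at different levels of detail repeats values from the less detailed table.

```python
from collections import defaultdict

# Hypothetical tables: orders has many rows per customer, targets has one.
orders = [
    {"customer": "A", "sales": 50},
    {"customer": "A", "sales": 70},
    {"customer": "B", "sales": 30},
]
targets = [
    {"customer": "A", "target": 100},
    {"customer": "B", "target": 40},
]


def total_sales_by_customer(order_rows):
    """Aggregate orders to one total per customer."""
    totals = defaultdict(int)
    for row in order_rows:
        totals[row["customer"]] += row["sales"]
    return dict(totals)


def join_targets(sales_totals, target_rows):
    """Join the aggregated sales to targets, one row per customer."""
    return [
        {
            "customer": t["customer"],
            "sales": sales_totals.get(t["customer"], 0),
            "target": t["target"],
        }
        for t in target_rows
    ]


# Aggregate first, then join: each target appears exactly once.
joined = join_targets(total_sales_by_customer(orders), targets)
total_target = sum(row["target"] for row in joined)

# Join at order level instead: customer A's target is repeated per order.
naive = [
    {**o, "target": t["target"]}
    for o in orders
    for t in targets
    if o["customer"] == t["customer"]
]
naive_total_target = sum(row["target"] for row in naive)

print(total_target)        # 140
print(naive_total_target)  # 240, because A's target of 100 is counted twice
```

Aggregating the orders to customer level before the join keeps both tables at the same level of detail, so summing the target afterwards gives the correct figure.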

Documentation
Ensuring that you don't just solve a problem once, but make the solution suitable for future use and maintainable, is a significant reason to drive data preparation into a tool like Prep. Being able to document the steps taken, by naming the logical steps as well as describing what happens within them, makes the process much more robust.

Agility vs Functionality is a key battle when deciding which tool should complete each task. If you prepare data separately in a tool like Prep, you remove the option for each user to do this individually, and with it some agility. Removing that flexibility might actually be useful, as it can prevent mistakes, optimise performance and complete tasks that would otherwise be impossible. There is no single answer as to where to prepare your data, but by being more considered, you'll improve everyone's ability to use data well.
