How to... Document your Data Preparation

Jobs are no longer for life as people look for variety and challenges in their career. Data roles are no different, as the skills are highly transferable between companies and industries. Therefore, the data sets you prepare are likely to be passed to others. To ensure the work continues to be understandable, up-to-date and of value, it will need to be well documented so you can walk other Preppers through your logic once you move on (or get a promotion).

Basic Documentation

Folder Structure
All documentation within the data preparation file is useless if you can't find the flow file in the first place. Keeping a folder that is available to you and your team for all the flows will help dramatically. My organisation uses Google Drive as this has the benefit of not just controlled sharing but also being available on any computer I log in to. Setting up a structure for those files is also key. In large organisations with more complex data flows, you will want to think about setting up folders for:
  • In Production - the holy grail of folders. Strict control is needed with this folder to avoid changes that break data sets your organisation relies on. Flow files should only enter this folder once fully tested.
  • Development Flows - the work in progress folder or sandbox
  • Testing - once a flow is developed, you want to lock which version is being tested
  • Archive - having a history of key versions can help learn about output changes
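
If you would rather script the folder set-up than create it by hand, a minimal sketch in Python might look like the following. The folder names come from the list above; the function name and the idea of passing in a root folder are my own assumptions, not anything Tableau Prep requires.

```python
from pathlib import Path

# Folder names taken from the list above; rename them to suit your team.
FLOW_FOLDERS = ["In Production", "Development Flows", "Testing", "Archive"]


def create_flow_folders(root: Path) -> list[Path]:
    """Create one sub-folder per stage under `root`. Safe to run repeatedly."""
    created = []
    for name in FLOW_FOLDERS:
        folder = root / name
        # exist_ok=True means an existing folder is left untouched.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_flow_folders(Path("Prep Flows"))` would create the four folders inside a (hypothetical) "Prep Flows" folder, for example within a synced shared drive.
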
File Name
The name of the file is also key. If you can't locate the correct file in a folder of flow files, there is no point spending time documenting the work in the first place. As a consultant, I have probably seen every type of naming convention that has ever been thought of. No single convention works better than the others, as the key question is: does everyone understand it? If they don't, then there is no point having a naming convention, as your colleagues will soon start to break it or chase you to find out where a certain file is. You have data to prepare, you don't have time for that!
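
Once a convention is agreed, it can be checked automatically. The sketch below assumes a purely hypothetical convention of `<team>_<subject>_v<two-digit version>` followed by the Prep flow extension (`.tfl` or `.tflx`); swap the pattern for whatever your team actually understands and uses.

```python
import re

# Hypothetical convention: <team>_<subject>_v<NN>.tfl or .tflx
# For example: "finance_monthly-sales_v03.tfl"
FLOW_NAME = re.compile(r"[a-z]+_[a-z0-9-]+_v\d{2}\.tflx?")


def follows_convention(filename: str) -> bool:
    """Return True when the whole file name matches the agreed pattern."""
    return FLOW_NAME.fullmatch(filename) is not None


def find_offenders(filenames: list[str]) -> list[str]:
    """Return the file names that break the convention, in their original order."""
    return [name for name in filenames if not follows_convention(name)]
```

Running `find_offenders` over the contents of a shared folder gives a quick list of files to rename before the convention starts to erode.
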

Input clarity / Source

Within the file, the key piece of documentation that needs to be crystal clear is where the input data has come from. This may sound obvious, but those source files / tables are likely to move or change structure over time. All Tableau tools can only read what is in the underlying source. As the source changes, so will the data being used by the flow and, therefore, the output.

Recording the original data source, location, file / table name and the frequency it updates (or not), will help you or your colleagues understand what should, and should not, change when the flow is next run.
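
One lightweight way to keep this record alongside the flow is a small structured file. The sketch below is only an illustration: the `SourceRecord` class and its field names are my own assumptions, mirroring the four items listed above, and the output is plain JSON that could be saved next to the flow file.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SourceRecord:
    """One input to a flow. Field names are illustrative, not a Prep feature."""
    source_system: str  # the original data source, e.g. a database or export
    location: str       # folder path, server or schema
    name: str           # file or table name
    refresh: str        # how often the source updates, or "never"


def to_json(records: list[SourceRecord]) -> str:
    """Serialise the records so they can be stored beside the flow file."""
    return json.dumps([asdict(record) for record in records], indent=2)
```

Anyone picking up the flow can then read the JSON file to see what each input is and how often it is expected to change.
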

Output 

Knowing which file you may overwrite is key in Data Preparation, as you may not be able to reverse the changes if you need to. The file location should be clear. Re-running the flow will create an output in the location specified, so someone moving that file has less impact (unless your resulting analysis is not pointing to this output).
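
Because an overwritten output may not be recoverable, one option is to copy the existing file somewhere safe before the flow is run again. The helper below is a sketch under my own assumptions: it runs outside Tableau Prep, and the function name, archive folder and timestamp format are all hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def back_up_output(output: Path, archive_dir: Path) -> Optional[Path]:
    """Copy `output` into `archive_dir` with a timestamp added to its name.

    Returns the path of the backup, or None when there is no output file yet.
    """
    if not output.exists():
        return None
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = archive_dir / f"{output.stem}_{stamp}{output.suffix}"
    # copy2 also preserves file metadata such as the modified time.
    shutil.copy2(output, backup)
    return backup
```

Run before the flow, this leaves a dated copy of the previous output in the archive folder, which also gives you the history of key versions mentioned earlier.
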

Within Tableau Prep there are a number of other stages where documentation can make the difference between work being easy or hard to pick up and, if necessary, fix.

Step Names


Clean Step
The Clean step (or 'Step') is the Swiss army knife of Tableau Prep, as one step can hold hundreds of different combinations of preparation. The 'Step' can include calculations, filters, string value cleaning, splitting fields, renaming or even deleting fields, and it can contain any combination of these actions together. Naming the Step after what it does is key to ensuring the users of the flow, and you, can check what it is doing.

The Prep developers have helped out with the inclusion of small icons that show some of the top-level actions happening within the Step.
For example, in this step there is a filter, a calculation and a field that has been removed. This gives the user a very good quick review of what is happening and a nudge as a reminder of what happened. Only a limited number of characters of the step name show on the flow pane and, therefore, the names need to be very concise.

Step Descriptions

Step Descriptions have the significant benefit of not only being much longer (200 characters) but also being able to be toggled between shown and hidden.
The grey quotation box, shown on the Union step above, can be clicked to show or hide the description on the flow pane. This allows the author of the flow to add much more detail as to what is happening at each step in the flow as the data is prepared for analysis.
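
If you draft descriptions outside of Prep, for example in a handover document, a short check can flag any that will not fit. This is only a sketch: the 200 character limit is the figure quoted above, and the function and the step names in the usage note are hypothetical.

```python
# Limit quoted in this post; confirm it against your version of Prep.
MAX_DESCRIPTION_CHARS = 200


def over_limit(descriptions: dict[str, str]) -> dict[str, int]:
    """Map each step name to the number of characters it is over the limit.

    Steps whose description fits within the limit are left out of the result.
    """
    return {
        step: len(text) - MAX_DESCRIPTION_CHARS
        for step, text in descriptions.items()
        if len(text) > MAX_DESCRIPTION_CHARS
    }
```

For instance, `over_limit({"Union times": "Stacks the clean and unclean time inputs."})` returns an empty dictionary because the description fits.
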

Colour

One change that was added very early on in the development of Tableau Prep was being able to assign the colours of Steps yourself to add visual documentation to your flow. This is useful as the developer whilst building the flow but also as someone picking up another's flow for maintenance or further development. There are two key steps where colour particularly makes a difference in Prep:

Joins 

When joining data sets together, or self-joining data as per the above image, using colour to show where the two different data sets come together becomes very useful when picking up someone else's flow and when checking you have used the correct data fields from the incoming data sets.

The colour logic I like to apply is the idea that mixing together the yellow and blue inputs creates a green output. Within the tool, the use of colour helps with setting up the step correctly, as we can instantly see the two data fields that are being joined on. This is especially useful on inner joins, as the fields are normally identical, so seeing where values have not been joined is more easily understood if you can see which input source has the mismatched fields.

The result of the join is shown in the example above as the colour of the join step. The green line running above the profile pane helps to demonstrate to the developer of the flow the data fields that the join will be creating.

Unions
In a Union step, colour can be used to demonstrate the mix of the two flows coming together. Unlike a join, my preference here is that the two flows are a blend of the inputs rather than a mix as the data structure is the same or similar.

Within the set-up of the step, the input flows' colours also show where each field has come from. In this example, the 'Time' field has come from the 'Clean Times' input and the '24 Hour Time Format' field has come from the 'Unclean Times' input. This helps to identify why field names differ and where you may want to go back 'upstream' in your preparation flow to either amend the naming or understand why it differs. The blank colour demonstrates where the values in the field have come from, except the nulls, which occur due to not having a corresponding field in the other data set.

Although documentation sounds laborious and time-consuming, in Prep this isn't the case. Editing step names, adding short descriptions or changing the colour of the steps can make development less prone to errors and handover much easier.
