How to... Decide Where to Store your Data?

One of the key considerations in data preparation is where to hold the Output. After all, what is the point of doing all that hard work if you then put the data somewhere that is:
  1. Inaccessible to the people who need to use the data
  2. Slow / unresponsive
  3. At risk of eradicating source data by overwriting other data incorrectly
Let's consider each of these scenarios in turn to see what to think about when writing your Outputs to a location.

Inaccessibility

Data Openness vs Data Security is a balance to consider and take seriously. With more data legislation coming in across the world, as the general public learns the value of their own data and what effect their data can have on their lives, the restrictions should not be taken lightly. At the same time, whilst not breaking any rules / legislation, giving people freedom to work with data will lead to future innovation and better, more efficient decisions. So how can this balance be struck? Well, let's first consider the absolute must nots:

Break the Law - PII
Personally Identifiable Information (PII) is data that can identify individuals. For operational reasons, you may need to be able to identify someone (i.e. to check the balance in their bank account) but for analytical purposes this shouldn't be the case. This isn't a book on data security but on preparing datasets for data analysis. Therefore, you shouldn't need to be able to pinpoint individuals, so data that can should be restricted to prevent improper use.
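As a rough illustration of restricting PII before the Output is stored, here is a minimal Python / pandas sketch (rather than a Prep flow); the table and column names are invented for the example. Direct identifiers are dropped and the customer key is pseudonymised, so rows can still be counted and joined without pinpointing an individual.

```python
import hashlib

import pandas as pd

# Hypothetical customer extract with a mix of PII and analytical columns.
customers = pd.DataFrame({
    "customer_id": [1001, 1002],
    "full_name": ["Ada Lovelace", "Charles Babbage"],
    "email": ["ada@example.com", "charles@example.com"],
    "account_balance": [2500.00, 180.50],
    "region": ["London", "Manchester"],
})

PII_COLUMNS = ["full_name", "email"]

def prepare_analytical_copy(df: pd.DataFrame) -> pd.DataFrame:
    """Drop direct identifiers and pseudonymise the key column."""
    analytical = df.drop(columns=PII_COLUMNS)
    # Hash the identifier so analysts can still join and count rows
    # without being able to point at a specific person.
    analytical["customer_key"] = df["customer_id"].astype(str).apply(
        lambda value: hashlib.sha256(value.encode()).hexdigest()[:12]
    )
    return analytical.drop(columns=["customer_id"])

print(prepare_analytical_copy(customers))
```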

Break the Law - Right to be Forgotten
Numerous pieces of legislation have been created, or restructured, over the last few years around the right of an individual to remove their data from your organisation's possession. Therefore, ensuring there is a clear trail of where data is used, what it is used for and that it will be deleted once it stops being relevant / used for that purpose is key.

Delete Operational data
Operational Systems, the technology systems that allow you to make payments, take orders or provide the services your organisation exists for, cannot be affected by your analytical queries. All of these systems rely on data and often hold that data in databases. If you are querying these systems directly, you are one poor query away from causing a lot of damage. The data in these systems, if legal and useful for analysis, should be copied into a specialist analytical environment. This way, analytical queries run against a database where, if you cause an issue, you will not stop the key systems that allow your organisation to operate.
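To make that separation concrete, here is a hedged sketch of what the copy step might look like in Python, using SQLite files to stand in for the two environments; the database names and the orders table are assumptions for the example. The operational source is opened read-only so the extract job cannot lock or alter it, and the copy is written into a separate analytical database.

```python
import sqlite3

import pandas as pd

# Assumed file names; in practice these would be separate servers.
OPERATIONAL_DB = "operations.db"
ANALYTICAL_DB = "analytics.db"

def copy_orders_to_analytics() -> None:
    # Open the operational database read-only so this job cannot
    # modify or lock the tables the business depends on.
    source = sqlite3.connect(f"file:{OPERATIONAL_DB}?mode=ro", uri=True)
    try:
        orders = pd.read_sql_query("SELECT * FROM orders", source)
    finally:
        source.close()

    # Write the copy into the analytical environment, where a bad
    # query only breaks a copy, not the order-taking system.
    target = sqlite3.connect(ANALYTICAL_DB)
    try:
        orders.to_sql("orders", target, if_exists="replace", index=False)
    finally:
        target.close()
```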

So with these aspects in mind, let's look at the other side of the coin and still see data as an asset and not a liability waiting to go wrong. Making data accessible is key for any organisation to progress and develop. 

Data for the Experts
Without awareness of what is in the data, poor decisions can be made. This isn't a matter of technical skills but of understanding the context of the data. Giving the experts on the data the right access to the information, so they can work with it and understand exactly what each column is doing, will ultimately ensure the data source gets documented and becomes useful. Otherwise, you are storing or potentially using data that could be misconstrued. Storing the data somewhere the business experts don't have the access credentials, or skills, to query will result in 'opinion' driven decisions rather than data driven decisions.

Documented Sources
Storing data where it isn't clear and obvious what it contains doesn't lead to any success. Curating data sources so they can be easily understood by the organisation's data users will unlock greater potential than saving small amounts of storage space. Clearly named column titles, creating views on top of tables to 'humanise' the language, or publishing data sources on easier to use platforms like Tableau Server are all ways to document a data source and make it more usable.
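As a small illustration of that 'humanising' step, the sketch below renames system-generated column titles into business-friendly ones using Python / pandas; in a database this would usually be done as a view over the raw table, and the column names here are made up for the example.

```python
import pandas as pd

# Hypothetical warehouse extract with system-generated column names.
raw = pd.DataFrame({
    "cust_id": [1, 2],
    "ord_dt": ["2023-01-05", "2023-01-09"],
    "ord_val_gbp": [120.00, 45.99],
})

# A 'humanised' layer: same data, clearer column titles and types.
friendly_names = {
    "cust_id": "Customer ID",
    "ord_dt": "Order Date",
    "ord_val_gbp": "Order Value (GBP)",
}

documented = raw.rename(columns=friendly_names)
documented["Order Date"] = pd.to_datetime(documented["Order Date"])
print(documented.dtypes)
```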

Slow / Unresponsive

One major consideration when deciding where to store data is the speed of response people demand. In an age when you can ask a question and get answers from all over the world via the internet in seconds, a dataset taking twenty seconds to load can feel positively glacial. At a previous job, I had two computers running at any one time so whilst one was loading a query, I could be making progress on a separate task on the other. This is not the set-up that will set your dataset's users up for success.

Ensuring datasets are responsive to the queries being made is key to putting data at the heart of your organisation. Not all data resides in this state, as lots of data stores will be on a slow, archaic database, but as data preparers our task is not just to clean datasets but to ensure the resulting dataset is going to be responsive wherever it is output. Ultimately, if it isn't, your users will tell you by not using the dataset.

Overwriting Risk

Although Tableau Desktop is a read-only tool, the same can't be said for Prep. Whatever alterations a user made in Tableau Desktop, they would never change the underlying data source. This allowed a user to experiment and ultimately 'try' new techniques or queries. If you enable the same level of freedom in Prep, you must be conscious that Prep could overwrite data that you might not be able to recover (if the updates are incorrect).
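One simple safeguard is to make the output step refuse to overwrite an existing file. The sketch below is a Python illustration of that idea rather than a Prep feature; the file path and column names are assumptions for the example.

```python
from pathlib import Path

import pandas as pd

def write_output_safely(df: pd.DataFrame, target: Path) -> Path:
    """Write df to target; if target already exists, write alongside it."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        # Keep the original intact and write a clearly labelled new copy,
        # rather than replacing data that may not be recoverable.
        target = target.with_name(f"{target.stem}_new{target.suffix}")
    df.to_csv(target, index=False)
    return target

# Example: publishing a flow's result into a sandbox area.
result = pd.DataFrame({"region": ["North", "South"], "sales": [100, 250]})
print(write_output_safely(result, Path("sandbox/sales_summary.csv")))
```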

This factor is nothing new for running data infrastructure. For decades, database environments have been a battle of limited resource to administer the needs of the users of those environments. If you reduce the workload of the administrator, you open up less experienced users to more responsibility. However, tilt that balance in the other direction and everyone is stuck waiting for the administrator before they can do anything. Creating a perfect equilibrium is clearly impossible but here are some factors that can help:

Read Only access
Giving people access to the raw data can help without placing huge processing loads on the data storage environment. At a large bank, there was initially resistance to giving Tableau Desktop to users for fear that an environment already stretched in terms of processing power would be tipped over the edge by increased demand. The opposite actually happened due to the quality of the drivers that Tableau uses. Queries became more optimised in the majority of cases and the value gained from running the environment increased dramatically as more users were getting value from the data assets.

Prep should be looked at in a similar way. The only difference is that instead of visual analysis being produced, cleaner, merged datasets will be produced. Giving the users a set location to publish these to (commonly nicknamed a 'sandpit' or 'playground') will not only empower your users to try to gain more value from the data, but can also lead to better specifications for future developments, as users will be able to prove what they need in order to empower themselves and others.

Training before Publishing
The Prep flow itself doesn't have to be run to produce an output. The process of going through and cleaning the data can be beneficial both for the user, to learn what they would like to achieve and how to do it, and for those who have permissions to 'hard code' the results (i.e. write them to the database), to see the process taken to get there. This means the user's requirements, which may have been spurious or iterative, are now clear before the administrator spends time working on them. Over time, the skills the individual develops will empower them to do this work themselves, as they will know how to use the tool and the environment to succeed.

So where to write that output?

Well, it depends. It depends on the user, the type of data involved, the responsiveness of the database, and the investment in the data platform and the key roles supporting it. Empowering the user and allowing them to learn over time will ensure data driven decision making develops in your organisation. Ensuring people who are still developing can't go too far wrong is good for both them and the administrator of the platform, as fixing mistakes can be time consuming.

Creating an analytical data store that allows users to work with their data has huge benefits for the organisation and is a worthwhile investment. Allowing users to have access to Tableau Prep, which is focused on making data preparation skills easier to learn, is another plus point in an organisation's evolution of data work. Combining the two is a strategy that has a lot of upside but may take some time to work towards. Having this set-up as a target is a great starting point if data seems locked-down or potential users don't have the skills to access the data sources they need.
