How to... think about Data Management
Data Management and Information Security are a growing focus area for consumers and companies. This post is not a guide to what the laws do and do not require; instead it highlights the areas most Tableau Prep users should consider. By looking at the types of sensitive data you may come across, we'll then be able to determine how to handle and use the data correctly.
You will have read throughout this book that I am in favour of more data use and proliferation. This comes from seeing many organisations lock data access down so tightly that it becomes impossible to make data-driven decisions, either through a lack of data or a lack of skills, because no-one has worked with the data enough to use data sets once they become available. This is not to say that you should be reckless with data access, but preventing data use can also cause harm to consumers through poor 'gut instinct' decisions.
What is Sensitive Data?
Data sensitivity is measured in many different ways depending on the organisation, the sector that organisation operates in and the subject matter of the data. Most data is classified by its sensitivity: how much of an issue its release, on purpose or by accident, would cause for the organisation releasing the information or the individuals the data is about. Typically 3-5 levels of data security exist in most organisations that I have worked with, and the bands are as follows:
Public
This information has probably been sourced from publicly available sources and is free to use. The data may come from government sources or just be data that is used openly, such as spatial data sets (post codes / zip codes), census responses or social media posts.
Confidential
This data has probably been processed, so information has been refined from it. The investment in time and effort might result in the holders of the data not wanting to openly share it with potential competitors. If this information were to leak out into the public realm there would be no consequences for the entities covered by the data.
Strictly Confidential
This is likely to be information on your sales, customers and products. This is information that you do not want your competitors to see and that you cannot share either. This data may contain lists of your customers' details, but nothing so sensitive that it would impact the customer if the data leaked into the public realm. This level of sensitivity is also likely to include the company's intellectual property, which may cover data assets like financial models and projections.
Restricted
This is the most sensitive data the organisation holds. For large organisations this can cover a vast spectrum of data. The data is likely to include sensitive customer data like banking details, but also demographic information which could potentially be used against someone if leaked to the public. Not all organisations will hold information like political affiliations or sexual orientation, but other data, like bank transactions, can act as a proxy for it.
Restricted data doesn't need to be specific to individuals; price sensitive information about a company is covered by this category too. Price Sensitive Information (PSI) relates to companies whose shares are traded. If PSI is not shared correctly, the holder of the information can trade shares knowing more than others do, and so has an unfair advantage. People with access to this information need to understand how to handle it, but also need to be prevented from trading shares or sharing the information with those that do. Breaking the rules around restricted information doesn't just risk damaging the organisation's reputation; it could also lead to fines and imprisonment for misuse.
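For readers who like to see an idea written down, the four bands above could be sketched as an ordered list in Python. This is only an illustration: the names `Sensitivity`, `can_access` and `combined_level` are made up for this post, and the rule that a combined data set takes the highest band of its inputs is my assumption of a common convention, not something your organisation is guaranteed to follow.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """The four bands described above, ordered from least to most sensitive."""
    PUBLIC = 1
    CONFIDENTIAL = 2
    STRICTLY_CONFIDENTIAL = 3
    RESTRICTED = 4


def can_access(clearance: Sensitivity, data_level: Sensitivity) -> bool:
    """Hypothetical rule: a user may read data at or below their clearance."""
    return clearance >= data_level


def combined_level(*levels: Sensitivity) -> Sensitivity:
    """Assumed convention: joined data takes the highest band of its inputs."""
    return max(levels)


# Joining public post codes to a customer list makes the result
# at least as sensitive as the customer list.
joined = combined_level(Sensitivity.PUBLIC, Sensitivity.STRICTLY_CONFIDENTIAL)
print(joined.name)
print(can_access(Sensitivity.CONFIDENTIAL, joined))
```

The point of the ordering is that sensitivity only ever goes up when data sets are combined, which is worth remembering when a flow joins a harmless-looking public table to customer data.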
How to manage data based on Sensitivity
Getting the balance right between over-protection, which prevents people from being able to do their work, and under-protection, which means sensitive data could be used incorrectly, is a challenge but far from impossible.
Having data sources that are held centrally but with the ability to gain access quickly is a good target to aim for. Removing the proliferation of small data sets held individually on colleagues' laptops and personal drives is likely to help ensure data is up-to-date and is correctly removed after it stops being relevant. Providing centralised data sources can also help keep data accurate, as multiple people will be using the data sets and will make adjustments if they find errors.
The challenge with centralised data sets is getting access in a timely fashion so people can answer the questions they have, when they have them. Centralised data sets often include a lot of data across the spectrum of sensitivity levels, so permission controls are tightly administered. A fast turnaround, or a devolved permission approval process in which several people close to the individual making the request can approve it, is important. Otherwise, no-one is able to grant access, or requests are held up by individuals who may be away from the office or busy.
For Data Preppers, having space in environments similar to the centralised stores is important so they:
- Get used to connecting to them
- Get used to using naming conventions for that software and data sources
- Get used to writing to them
- Get used to removing content from them once their data is no longer needed
All these skills will mean that when central sources need to be used and/or improved, the Data Prepper is ready to use the right terms and will probably have a process developed in Prep that mirrors what needs to happen in the centralised production environment.
Production vs Development servers
Production environments are not a topic covered deeply in this post or others to date. Data is rarely prepared perfectly first time, in the same way that a report or analytical dashboard is rarely right first time. Iterations are often needed based on the feedback of others using the asset that has been developed. This is where Development spaces are required, to test the data preparation flow as well as the resulting data set. Only once the asset has been tested and approved for wider use and for key reports should the flow be moved into a Production set-up. The Production environment is likely to be more tightly controlled, so most people will not have permission to write content to it. Nor should they, as this prevents mistakes that may be very difficult to resolve.
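The gate between Development and Production described above can be pictured as a simple check. The sketch below is Python rather than anything Tableau Prep or Tableau Server provides: `Flow`, `promote` and the `user_can_publish` flag are hypothetical names used purely to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """A hypothetical record of a data preparation flow and its status."""
    name: str
    tested: bool = False
    approved: bool = False
    environment: str = "development"


def promote(flow: Flow, user_can_publish: bool) -> Flow:
    """Move a flow to production only if it is tested, approved and the
    person promoting it holds write permission on production."""
    if not user_can_publish:
        raise PermissionError("this user cannot write to production")
    if not (flow.tested and flow.approved):
        raise ValueError(f"{flow.name} must be tested and approved first")
    flow.environment = "production"
    return flow


sales_flow = Flow("Weekly Sales", tested=True, approved=True)
promote(sales_flow, user_can_publish=True)
print(sales_flow.environment)
```

Both checks matter: the first keeps the number of people who can change Production small, the second makes sure untested work never reaches key reports.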
When to Delete Data
So if you understand the sensitivity of the data and have tested the data set in a development environment before publishing the same flow to a production environment, then you are fine? Well, not quite. You also need to consider when to delete data, either because it is no longer needed or because you lose the right to hold it. The most common situations where this occurs are:
Timing
Data becomes less relevant and potentially less accurate over time. When creating a data source, think about how long that data should be retained. Obviously, reaching the date initially designated for removing the table, or records, doesn't mean the data has to be removed. A reassessment can be made, but having a date against that decision means the data will be reappraised and not simply left.
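As a rough illustration of keeping a date against each data source, here is a small Python sketch. The names `DataSource`, `due_for_review` and `extend_retention` are hypothetical, and in practice this information would more likely live in a data catalogue or a tracking table than in code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataSource:
    name: str
    review_by: date  # retention date agreed when the source was created


def due_for_review(sources, today):
    """Return the names of sources whose review date has been reached."""
    return [s.name for s in sources if s.review_by <= today]


def extend_retention(source, new_review_by):
    """Record a reassessment: the data is kept, but with a new date against it."""
    if new_review_by <= source.review_by:
        raise ValueError("a reassessment must set a later review date")
    source.review_by = new_review_by


# Illustrative sources and dates only.
sources = [
    DataSource("campaign_responses", date(2024, 1, 1)),
    DataSource("store_locations", date(2030, 1, 1)),
]
print(due_for_review(sources, today=date(2025, 1, 1)))
```

Reaching the date triggers a decision rather than an automatic deletion, which matches the reassessment described above.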
Customer / Client leaves
Data should only be retained whilst it is relevant and you are allowed to retain it. Detailed customer data should be removed when a customer leaves. To be able to do that, you must know where all of that customer's data actually resides: in which tables and in which systems. If data has proliferated out to a lot of different sources and files, this becomes more difficult to do. Having data sets driven by centralised sources means that when customers are removed from the central source, the removal will propagate out into the other data sets.
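To make that last point concrete, here is a minimal Python sketch using plain lists of records. The table names, the `customer_id` field and the two helper functions are all invented for illustration; in a Prep flow the equivalent of the refresh would be an inner join back to the central customer table.

```python
def tables_holding(customer_id, tables):
    """List the tables that still contain rows for this customer."""
    return sorted(
        name
        for name, rows in tables.items()
        if any(row["customer_id"] == customer_id for row in rows)
    )


def refresh_from_central(central_ids, rows):
    """Rebuild a derived data set, keeping only customers still held centrally."""
    return [row for row in rows if row["customer_id"] in central_ids]


# Illustrative data: customer 2 has left and been removed from the central source.
central_ids = {1, 3}
tables = {
    "orders": [{"customer_id": 1, "value": 20}, {"customer_id": 2, "value": 35}],
    "contacts": [{"customer_id": 2, "email": "b@example.com"},
                 {"customer_id": 3, "email": "c@example.com"}],
}

print(tables_holding(2, tables))  # where the leaver's data still sits
tables = {name: refresh_from_central(central_ids, rows)
          for name, rows in tables.items()}
print(tables_holding(2, tables))  # empty once every table has been refreshed
```

The refresh only works for data sets that are actually rebuilt from the central source. A copy saved to someone's laptop is never refreshed, which is the argument for keeping data centralised in the first place.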
Overall, data management and information security aren't the most fun subjects, but they can make working with data much easier and faster if the situation is managed well. Thought also needs to be given to letting new Data Preppers learn within a controlled environment, with learning 'sandpits' created to allow their skills to grow and develop.