How to... use History Tables

Live data connections are fantastic to allow analysis to maintain it’s relevance and prevent repeated work each time a dataset can be updated which may be frequently. Building data visualisations that connect to ‘Live’ datasets means that the user of the analytics knows they are getting the latest view on the topic. 

The challenge with frequently updating data is what happens with records that are either ‘updated’ or removed if no longer relevant to the system that formed the data point in the first place. Forming history allows an analyst to record data points at a set point without fear that the data points will be overwritten and lost forever.

Why are History Tables required?

When analyzing data, there are many questions to ask of a dataset to get a comprehensive view of key trends in the dataset, along with finding outliers and unexpected results. However, once these point have been found, the questions of users and the analysts quickly turn to:

  •      Why are these trends or points showing what they do?
  •      Has it always been this way?
Forming context as found in the former question can be a tough challenge. Often working with experts on the subject can help with this. However, the second question often boils down to having historical data to draw upon. With datasets growing rapidly, and despite storage costs reducing for data, retaining sufficient historical data points to answer the latter question can be a challenge. 

Retaining too much information will swamp the analyst and the tool. Too little history and it will be impossible to judge whether a trend is present. Here is where a well-built History Table can make all the difference in your analysis. Retaining the right information about a situation can allow meaningful analysis to be formed. Let’s consider a few scenarios where History Tables can come in useful:
·     Staff progression – maintaining a view of who held which role when, along with how they have progressed through the organization can offer a lot of insight on how well your organization develops it’s people
·     Customer retention – Like staff retention, understanding what products / holdings a customer had with your organization before they left allows for better proposition development and improvement rather than just only looking at the current snapshot.
·     Operational Processes – customers often get frustrated with organisations when they are stuck going round in circles when trying to meet their need. If the organization doesn’t have a clear view of the journey the customer has had, it is less easy to improve the process for the future. If a system only holds the latest status for a customer, it is impossible to judge where operational issues might lie. 

Therefore, history tables allow data points to be stored for future analysis and offer insights that may have been otherwise lost if only Live views of the data were used. 

What to consider when forming History tables

As mentioned above, there is a constant battle between storing everything, or potentially losing useful data. So considerations should you make when forming your own History tables for use in your analysis:

Ability to Join to Live data
When using history tables, an analyst can form useful findings by just using the History table but by joining it to the Live data can create stronger, more pertinent results.

Relevant Information only
Unless you are a shoe retailer, holding your employee / customer’s shoe size isn’t going to help. Holding information about the products held and when they joined / left becomes useful to understand patterns of behavior. Too much information makes the analysis harder and slower to form so leaving just the principle data is key. 

Frequency of update
Taking a historical snapshot of dataset is useful but not too frequently. Often monthly views can begin to show patterns of behavior if you customer is transacting with you daily. The more frequent the update, the less movement you will notice. On the other hand, if you don’t capture your customer’s traits efficiently enough, you will not have the data points to show what is happening in your customer’s holdings and transactions. 

Level of Granularity
One data point per group of customers? One data point per customer? One data point per month, per customer? You will need to decide what is relevant to the analysis you may wish to conduct but whatever you decide, you may need to further aggregate the data for your analysis. This is only possible to go from more granularity to less as by aggregating data, you are removing detail from the dataset. When analyzing history over time, think about the comparison you may want to make – this month versus the same month last year? This Quarter vs the same quarter last year. This will all affect the volume of data the history table will need to retain.

All of these choices may change overtime, but importantly, by forming the History Tables, you are leaving yourself the opportunity to conduct analysis you might be able to otherwise. 

Performance

With all the focus on relevance, frequency and granularity of data above, all of these factors add into the performance when forming the analysis as well as using the analysis. Data software processes large datasets at an ever-increasing pace. But with History tables, the challenge of ensuring datasets remain small and concise enough to join to what can be already large datasets. Ensuring you have a clear join condition is important but also being clear on how many rows your join will create is important too. 

If you have 10 million customers, but then hold a monthly snapshot for the last year, then all of a sudden, you are analyzing 120 million records of data instead. This size of dataset is likely to start to slow analysis down when running calculations or rendering the visuals of your analysis. When forming the analysis, consider aggregating the 110 million rows of data in the history table to a lower granularity by reducing the monthly frequency. An alternative approach might be to stop analyzing the dataset on a per customer basis and form ‘clusters’ of customers with similar demographics or product holdings. 

The benefit of holding a separate history table is that you can treat the history table differently and then join on only the relevant data that you require. When removing data in analysis with your ‘live’ dataset, it is much easier to make mistakes that will have a bigger impact on your analysis rather than removing data from your history table that is then joined on to the live data. 

Data Regulations 

The major consideration about history tables is to respect the law on use of data. When a customer or employee leaves they have the right to have their data removed. However, a record that a customer, non-identifiable, had certain products or transactions, is useful information. As long as that information is not personally identifiable then these records can be held but they should not be otherwise. This becomes a lot easier to keep when the data is aggregated up to a higher level than at the individual level. 

Forming and using history tables will allow you to conduct better and deeper analysis than otherwise. Refining the relevance, frequency and granularity of the data stored in the history tables will ensure analytical value is received from the effort taken in forming the analysis. 

Popular posts from this blog

2024: Week 1 - Prep Air's Flow Card

How to...Handle Free Text

2023: Week 1 The Data Source Bank