How to... use Identifiers and Keys in data

As mentioned in multiple 'How to...' posts to date, databases are designed as powerful, performant and secure locations to store and work with data. One concept at the heart of the majority of the databases that you will use in organisation, are Keys and Identifiers (IDs). These databases that utilise these concepts are called Relational Databases as relationships are created between the tables of data through Keys and ID fields linking the various tables together. In this post we will look at ID fields, how they can be used as Keys between tables, and how to create ID fields in Tableau Prep.

What is an Identifier in data?

When using databases, like most computing software, numbers are processed more efficiently than other characters. When working with data, there is often the need to process large volumes of data and therefore, any efficiency in that processing can make a significant difference in the time it takes to prepare your data or form your analysis. This is where Identifier fields come into play for databases. Rather than repeatedly storing names, addresses or other long string values, a relational database architect will use separate look-up tables to store those values just once and associate an identifier to the value instead. This purchases table has a number of ID fields that can be used to link together other reference tables with the names of customers or products. These type of tables containing lists of transactions or measures are often referred to as a fact table.

Identifiers can also be used in a boolean way too indicating whether something is true or not. For example, yes is often encoded as 1 and no as 0. This type of logic often does not have an associated reference table but some data architects will still insist on the reference table clarifying this. 

What is a Key in a database?

The concept of Keys helps identify how to join tables together in databases. Databases can contain thousands of tables and therefore, trying to determine how to join tables together comes from the keys in the tables. Traditionally, there are two types of keys in database terminology:

  1. Primary Key - a unique identifier that can not be replicated in a table
  2. Foreign Key - an identifier that can be duplicated in a table of transactions but is unique in it's reference table
Let's look at the Customer table that has a Unique key for each customer.

This table would be used to add the Customer's name to the Purchases table seen earlier in this post. The Customer ID in this table needs to be Unique as we wouldn't want to wrongly attribute orders to the wrong customer. In the Purchases table, the Customer ID could be classed as a foreign key as the values in the column are duplicated but are used to join the Customer table.

How to use Keys and Identifiers in Prep

Keys in the context of Prep are very similar to the use of keys in a database. Keys in Prep are used to guide users on how to join tables together. Information about what data fields in a database table are likely to be held in separate documentation about the database. Keys are not explicitly shown in Prep in a different way to other data fields. This is where the Join configuration pane comes in particularly useful to see what values are matched as well as the resulting number of rows that are being created by the join tool.

When joining fact tables, like the Purchase table, you would expect to retain the number of rows originally found. Otherwise:

  • Increase the number of rows - you will have duplicated the purchases and simple counts or aggregation of values may over represent reality. 
  • Decrease the number of rows - you will have lost rows of data meaning records of sales or number of customers are going to be under-represented.
The reason why this view in Prep is useful is because it clearly shows the number of rows entering the Join step and resulting from the Join. In this example, there are seven rows of data from the purchases table and seven rows of data are returned. This is a sign the correct key has been joined up. The Join Clauses pane demonstrates that all Customer IDs have been matched as they remain black text. Any unmatched values will turn red and show which source table they come from. 

As mentioned earlier, identifiers don't have to be used as keys in join conditions. In the Purchase table, the 'Order Dispatched' column is very likely an identifier field as the only values present are ones and zeroes. These identifiers can easily be clarified if preparing the data for end users. This can be done by aliasing the values in a field. 

As the data field is numeric, the data type will need to be altered to be string to allow the number be changed to a string. The resulting data field desired would be:

How to make Identifier data fields in Prep

All of the above points rely on an identifier or key being available in your data source. However, if you are preparing your data source for storage in a database, you may need to create a key from the data you have. Let's use Superstore, the sample data set that is downloaded as part of the Prep or Desktop package and create a Category table that could act as a Category reference table.

Since Prep version 2020.1, the tool has the functionality to create Rank automatically. In this example, you can remove some of the complexity by creating rank based on the Category data field. 

By selecting this option, you will be presented by the Rank visual set-up which makes the process coding free. 

In the default set-up of the Rank function, the value of the rank is being ordered by the number of rows attributed to each Category. To get past this and create a simple ID for Category, change the Rank function in 'Order by' to 'Dense Rank'.

To create the reference table to be able to lookup these values, you can branch off an Aggregation step. Aggregation steps remove all data fields from your data set that are not used in step. In this step Group By Category and average the Rank to provide a single ID per Category. Rank can be renamed as Category ID.

The resulting table from the Aggregation step is very simple and can be output separately.

The 1, 2, 3 result is perfect for creating the IDs we want in the data set. You can remove the Category field from the original data set now.

The resulting flow looks like the image below although you may wish to add Outputs to use the data or write it to where you need to use it. The Full data set now only has the Category ID and not the Category name. This saves on space and can be processed faster. The reference table can be used when the names need to be added back into the data set if required.

If the data set is to update and increase the number of potential values in the Category field, care needs to be made to keep the original IDs set and assign the next rank value to the new values. 

Summary

Keys and Identifiers can be seen to pose unnecessary challenges when you are new to data preparation but getting used to them can actually save time and cost in the long term. By computing faster and taking less space to store, you can reduce the time it takes to complete your analysis and the cost of storing the data, especially when data sets get larger. Prep can help people see the effects of the joining the reference tables back into the data, helping individuals sense-check they are using the correct keys. Prep can also be used to create reference tables and write them too databases. 

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text