How to... deal with Nulls

Nulls, or the absence of data, are a fickle challenge within data preparation. Experienced data preppers will almost instinctively know how to deal with them, or at least manage the challenges that come with a dataset that contains nulls. Newer data preppers, often do not have the same set of use cases to draw on to know how to handle the null fields. Therefore, this post is looking to share the considerations you should make when working with a dataset with a null in.

What is a null?

A null is the absence of a value in a data field within a dataset. The absence of data is very different to a zero, a new row or a space. These are all values that although look similar to the absence of the data, they are actually a value of some kind. Nulls appear in datasets for many reasons including:
  • The result of mismatched fields in a Union
  • Mismatched fields in a Left, Right or Full Outer Join
  • No original data entry for that record but other data points for that record existing (ie other fields are not null)
Nulls are important in Data Preparation and thus, Prep will show the number of Null records in any data field in the profile pane at the top. 

Now you know what they are and where to find them, let's look at the considerations you need to make when preparing data for analysis.

When is a null ok?

To understand when a null value is acceptable, let's assess the impact of a null value. The most common place to find a null affecting the analysis conducted is when averaging values. Here is a simple dataset that in one case has nulls and another that has zero values instead of the nulls.
With Nulls

With Zeros

When using both these tables in Tableau Prep in exactly the same way, due to the null values, there are two different outputs. In Prep, adding an aggregation step after the Input and forming the average 'Volume' for each Type of product gives very different results.

When apply this technique to the dataset with Nulls, the values returned for Bars and Liquid is 56.6 and 566.6 respectively. However, applying the same technique to the the table containing zeros gives results of 42.5 for Bars and 425 for Liquid product types. 

So what is the maths on this situation and why does it differ? Well the numerator always stays the same for both tables, 170 for Bars and 1700 for Liquid. Therefore the difference is in the denominator for the average. With the null table, the denominator is 3 but with the zero table, the denominator is 4.  Prep is therefore not considering the null within the average. When a value is returned as zero, Prep includes the fact this record adds no volume but does have a record. 

This consideration is important as if a null is correct, ie a record didn't occur and shouldn't have, a null is the perfect entry for the record in the respective data field. However, if a null is present but it's due to a value not being recorded but should have been then the null should be replaced with a zero. Situations like this occur where a product is stocked in a shop but not sold. Nulls should be used where a product isn't stocked within a certain store and therefore, never had the possibility of being sold. 

What can you do when a null isn't ok?

As a data prepper, you may have to decide that the null should be a record with a value, even if that value might be zero. So what options do you have to replace the null?

Filter
If your analysis doesn't allow for a null value to be in place then the row may need to be filtered out. ISNULL() is a fantastic calculated function to allow you to simply assess whether a record has a null in a given field. ISNULL() returns a Boolean result of True or False and therefore, the 'True' values can be filtered out, thus removing the nulls. 


In this example, each row with a Null in the Volume data field will return a True. This data can then be filtered out if necessary.

Filtering out columns full of nulls is a wise choice too as it is very unlikely that it is useful for any reason, especially if the column doesn't have a name and returns a 'Fx' (x is a number) where Prep doesn't find a field name in the dataset.

ZN()
ZN() replaces a Null value for the data field that is placed inside the brackets with a zero. This is useful when you want the row to be considered in aggregations, especially averages like the example in the section above.

In this example, Prep will overwrite any Null values with a zero instead.

Merge
Not all Nulls are meant to be zeros, they might be due to data entry, or system, errors. This means that if you have a value within the dataset that can replace those nulls, a Merge could be used instead. The situation where this mostly occurs is during a Union as if data field names do not exactly match, the Unions will form separate columns for the differing named fields. 

Let's use Preppin' Data 2020 Week 8 challenge to show the effect of this. This challenge starts with a Wildcard Union but results in two columns for the Volumes.

The easiest way to merge these two columns back is in the Profile Pane. Select both columns by holding down Control, or Command if you are a Mac user, and then hit the '...' icon. Select merge and you will have merged those who fields together. This means that any null value will be replaced by a value from the other column. 

This results in the following dataset:

Nulls can be frustrating when you receive a dataset with them present. Hopefully this post has helped you to understand the choices you have, whether they should remain, or whether you need to remove them and how to go about that. 
                                           ________________________________________________

Try to handle Nulls in the following challenges:


Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text