How to... use Smarter Filtering

The basics of Filtering were covered in this previous post, but filtering doesn't have to be just manual clicks or setting date ranges to determine what data is included in your output. Calculations, joins and measuring percentage variance can all be used to make filters smarter and more automated. Removing manual work within the Data Preparation stage reduces the chance of mistakes and makes life easier if the dataset is shared with others once published.

Calculations

Any calculation can be used to trigger a filter, but certain types of calculation are more commonly used as filters than others.

Boolean
Testing whether a value in a data field meets a condition or not is a great way to clean data and remove the records that you do not want. A common test is whether a value exceeds a logical limit. In this example, it is expected that 'Percent Complete' cannot exceed 100%, so a Boolean filter calculation in Prep ensuring the value is 1 or less would remove any values that exceed 1.
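
As a minimal sketch, assuming the field is named [Percent Complete] and stored as a decimal, the filter calculation could be as simple as:

    [Percent Complete] <= 1

Used as a filter calculation in Prep, only the records returning True are kept, removing anything above 100%.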

Remember, percentages are often not multiplied by 100, as keeping them as decimals helps with formatting in Tableau Desktop.

Logical Calculations
IF and CASE logical calculations can contain quite complex logic that makes filtering easier by matching certain scenarios across multiple data fields. In this example, the dataset is sports data where the user only wants to return games in which a team that was winning at half time went on to lose by full time, or vice versa.
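
As a sketch of the idea, using hypothetical fields [Half Time Leader] and [Full Time Winner] to hold the team ahead at each point, the calculation could take a shape like:

    IF [Half Time Leader] <> [Full Time Winner] THEN 'Upset'
    ELSE 'Expected'
    END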

This calculation allows both a team losing at half time that went on to win the game, and a team winning at half time that went on to lose the game, to be given the same result, in this case 'Upset', so the 'Expected' results can be filtered out of the dataset.

IF statements work through their conditions in order, from first to last. As soon as one condition, or set of conditions, is met, the calculation returns the value set for it. The conditions can be very detailed, so IF statements can become large, complex calculations, but they are useful where that complexity is needed to determine whether the data should be filtered out or not.

Regex for Matching Patterns
For more complex filtering calculations, Regex can be a massive help. As covered in the Regex chapter, Regular Expressions allow the user to set a pattern that can be tested against. For example, if a letter is present in a field that is expected to be numerical, those records can be handled separately. Regex can be used to identify where a letter is present:
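
A sketch of such a check, using Tableau's REGEXP_MATCH function with a pattern that treats any character from A to Z as a letter:

    REGEXP_MATCH([For], '[A-Za-z]')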

This calculation tests whether a letter is present in the 'For' column, a column that should contain points scored, and will create a true / false response.

A filter can then be configured to return just those values in the column that don't contain any letters.

When the data field 'For' is then converted to a Number (Whole), no Null values are created, as they would have been wherever a value couldn't be converted.

Join Ranges

Using Join Conditions that intentionally exclude data is a technique that prevents data being added to the dataset only to be filtered out later. Join Ranges refer to join conditions that bring two datasets together using less than or greater than operators instead of just the most common condition, equal to. The Advanced Joins post covers these scenarios in a lot of detail, but one other technique that isn't included in that post and uses Join Ranges to filter is Scaffolding, covered in this post.
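
As an illustration with hypothetical fields, instead of a single 'equal to' clause a join could be set up with two range clauses:

    [Order Date] >= [Start Date]
    [Order Date] <= [End Date]

Only the rows where the Order Date falls between the Start and End Dates are created by the join.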

Self-service Data Preparation is enabling more people to complete their own analysis, but Tableau Prep is primarily used on the laptop and desktop computers of 'normal business users'. This means computing power is often limited, as most business laptops are not top specification. Join Range filters prevent rows from being formed during a join that would normally be filtered out at a later stage. This reduces the amount of data processed and therefore requires less computing power.

Percentage Variance

Another form of smarter filtering is testing key measures to ensure they are within a tolerance level. These tolerance levels will often depend on business logic and rules. Percentage Variance checks are good to place into a flow where the data will refresh, or where mistakes could be made through manual data input. There are two different techniques to consider, depending on how the errors may enter the data:

Manual Entry - Level of Detail Calculations
As manual entry is likely to be correct most of the time, with only the odd mistake, the dataset being used within the flow can itself set the benchmark and have its values tested against it.

With the addition of Level of Detail calculations in Tableau Prep Builder 2020.1.3, checking percentage variance became a lot easier, as Level of Detail calculations add an additional column to the dataset. Previously, the workaround to construct calculations in the Level of Detail style was to use an Aggregate step, which would then remove all other data fields. Let's use some Soap Production Cost data to see if there are any outliers:


To test whether there are any outliers in the dataset, first the 'normal' level needs to be established. This can be done by using a Level of Detail calculation at the level you choose. For this dataset, let's assume it's known that there is normally a different Production Cost for each Type of Product. This calculation can be formed using the Visual Editor:
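
Written out longhand rather than built through the Visual Editor, and assuming the fields are named [Type] and [Production Cost], it might look something like:

    {FIXED [Type] : AVG([Production Cost])}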

This gives us a benchmark to test the Production Costs against. Creating two calculations that add, and subtract, the percentage variance will allow each value to be tested. For this dataset, the tolerance level is 30% above and below the average Production Cost per Type. The Upper Check can be formed by creating this calculation:
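
Assuming the Level of Detail result above was named [Avg Production Cost per Type], a sketch of the Upper Check is:

    [Avg Production Cost per Type] * 1.3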


The Lower Check level is created by the following calculation:
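
Using the same assumed field name, the Lower Check might be:

    [Avg Production Cost per Type] * 0.7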

The tolerance can then be checked by assessing whether the Production Cost lies between the Upper and Lower Checks. This can be formed using the following IF statement in a new calculation:
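
A sketch of that IF statement, assuming the two checks were saved as [Upper Check] and [Lower Check], with illustrative result labels:

    IF [Production Cost] >= [Lower Check] AND [Production Cost] <= [Upper Check]
    THEN 'Within Tolerance'
    ELSE 'Outside Tolerance'
    END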

The flow can then be split based on whether the values are within the tolerance levels. The records with values that sit outside the checks can be branched off and sent for further checks and validation, either manually or by the business experts.

Reloaded Data - Join to Previous output
When a system goes wrong, or doesn't behave as expected, it often goes very wrong. Judging whether a data field is within tolerance won't work if all of the values have gone very wrong: if every value fundamentally changes, the averaging technique above will not catch the issue, as the average values are driven by the changed column itself. To resolve this, the new data load needs to be tested against the previous dataset.

Using an example where the Production Cost data system has suddenly stopped using decimal points, let's build the flow to see how to recognise the issue.

Many of the same steps as the earlier technique are used, but the key differences are:
Average the Production Cost per Type
This step changes the granularity of the data to the level at which you want to make the tolerance check. This calculation is done on the original data, not on the new dataset that is being fed in. In our example, we aggregate up to a single record per Product Type.

The benefit of using an Aggregate step is that it reduces the granularity to one record per element you want to join on. You will still need to calculate the Upper and Lower Checks based on the results from the Aggregate step.
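
As a rough sketch of the configuration, using the field names assumed in the earlier example:

    Grouped Fields:     [Type]
    Aggregated Fields:  [Production Cost] -> AVG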

Join the datasets together
Join the new dataset and original dataset together to add the Checks on to the new dataset. 
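
A sketch of the join clause, assuming both inputs contain the Product Type field as [Type]:

    [Type] = [Type]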

This results in a dataset where the same calculation as before can be used to determine whether the Production Cost is within the tolerance levels or not.

The filtering choices can then be made in the same way as per the earlier technique.

Combining Techniques

Using the Join Range and Percentage Variance techniques together can also automate the filtering to only ever leave the results you require. In the previous example, the Join Condition could be set to remove any values outside of the tolerance levels set:
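
As a sketch, reusing the assumed check fields from earlier, the join clauses could be set as:

    [Production Cost] <= [Upper Check]
    [Production Cost] >= [Lower Check]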

This prevents data that could lead to misinformed decisions from being processed and shared with users of the dataset.

Using these techniques can begin to provide safeguards when things go wrong, whether through manual data entry mistakes or through the systems that produce the datasets failing. Calculations, Join Ranges and Percentage Variance checks can all help break up datasets using a more rules-based approach.
