How to...Filter

One of the most important factors when cleaning data is deciding whether the data:
  • Can be cleaned up?
  • Should be ignored?
  • Has to be removed?
As soon as you decide on the latter option, then you need to filter out of your dataset. This sounds like a very easy decision to make but shouldn't be that easy, especially if you are preparing data for others to use. Being certain that you, and the data sets user, doesn't need this data going forward is a difficult challenge. Unless you are certain it isn't needed, don't remove the data unless it is the last step in the process before publishing for the following reasons:
  • Losing context - does that data help the user provide context to other data points?
  • Messy but manageable - Just because the data might be hard to tidy up, could it still be of value
  • Business logic - by your user having different business experience, does the data suddenly have meaning?
So let's get back to basics first to explore what is a filter through to where to use them.

What is a filter?

A filter in data preparation is the way to make a choice as to whether you want to keep or remove data from a data set. This is commonly classed as 'Keep Only' or 'Exclude' respectively. There are multiple ways of making a choice about what you would like to keep or get rid of:
  • Selection
  • Calculation
  • Wildcard 
  • Null Values
There are also two different forms of filters that can be applied as well:
  • Data Field filters - removing columns of data
  • Data Value filters - removing rows of data

What are the different type of filters?

Let's dig in to each of the filters to understand how to use each type of filter:

Selection
This is the most basic form of filtering. The experience is different is the user wants to filter out data field or data values.

To filter out data fields, this can happen in a through places within Prep. The Metadata grid is ideally designed for this. The tick box selections can be easily selected, or deselected, as required. The metadata grid is available on the Input step, as well as cleaning steps throughout the flow.


Fields can also be removed through the '...' option at the top of a data field.


For data value filters, the user selects, through a range of actions, what they want to apply the filter too. This can happen in multiple places within Tableau Prep:

1. Profile Pane
Within the Profile pane, the distribution of instances of a value are shown by the grey bar behind the values. This bar can be used as a filter. By right-clicking on the value or bar, you can keep or exclude the value in question. Multiple values can be selected by dragging your mouse over the items to select or using CTRL / Command key on your keyboard.

2. Data Pane

Within the Data Pane, you can select any of the values in the data pane at the bottom of the screen to decide whether to filter out all other values or remove the value selected. This option will keep or remove all of the records (rows) that feature this value.

3. From a Data Field
By clicking on the '...' when hovering over the data field in the profile pane, you can choose to filter by 'Selected Values' that allows you to tick the values you want to 'Keep Only' or 'Exclude' based on the selection you make at the top right of the dialogue box.


Calculation
The most common form of filters after selections, are calculations. The easiest way to set these filters will be through the grey bar separating the Flow Pane and the Profile / Data Pane.


By selecting the 'Filter Values...' option, the calculation dialogue box will open but notice the slight difference at the top of the screen as it will show 'Add Filter' instead of 'Add Field' that will appear at the top of of the same screen when creating calculated fields from scratch.
By creating a Boolean calculation, the true values will remain in the data set to use in future steps. The false values will be removed from any potential output. This type of filtering can also be triggered by clicking on the '...' at the top of the data field in the Profile Pane. The only difference is, the data field used to trigger the measure will be added to the calculation editor.

Wildcard
Found in the same '...' filter options at the top of a data fields as 'Calculation' and 'Selected Values', the wildcard filter will be familiar to Tableau Desktop users found when filtering by a Discrete Field.

The user is promoted to enter a value to identify within the field chosen. The value entered doesn't have to exactly match the full value found in the data field; this is sometimes known as a 'substring'. The user is prompted to choose where in the wider value that the substring should be found.

  • Contains - can be anywhere within the string
  • Starts With - The first characters of the value being assessed has to match the substring
  • Ends With - The last characters of the value being assessed has to match the substring
  • Exact Match - the substring entered needs to be a perfect match for the value in the data field
If the match occurs, the row of data for that value will continue to be used in the data set where 'Keep Only' is selected at the top of the Filter or otherwise it is removed from the dataset. Exact matches can be filtered out if the user chooses an 'Exclude' filter at the tip of the filter. 


Null Values

In this option, again triggered from the '...' option at the top of a data field in the Profile Pane, the user selects whether they want to keep rows that have null, or are non-null, values only. 

When to filter data fields out of your data?

Removing columns of data is a significant step within the data preparation process. Once this decision is taken, there is very little (apart from a confusing join) from an earlier step that will bring this data back from oblivion.

Being able to remove unnecessary data fields is an important step to take for the following reasons:
  • Removing blank columns - when using Excel or Text Files, you will often find totally blank columns that often have a space lurking in just one of rows. Remove these and don't look back!
  • Columns that are mostly nulls - just because certain rows are null, doesn't instantly mean that you should remove the full column. Is there a way to populate those rows of data using the Merge functionality? Is there reason why there should be nulls in the respective column? If so, leave the column in place, otherwise filter it out.

When to filter rows out of your data?

Like date fields, filtering out rows of data can be for many reasons. Taking care not to remove data that others might find useful in their analysis is a key consideration. By filtering values out of a dataset, all the rest of the information within that row is also removed.

Some of the reasons for filtering rows of data out of your data set include:

  • Removing data errors - it's difficult to correct erroneous data as it may be hard to prove what the correct value should be. Thus removing the data record is the only option.
  • Outside of scope - your dataset might be on the last complete year of transactions and therefore, there is no point having multiple years of data so the additional years should be filtered out.
The filter may be the most used cleaning technique in data preparation so this post could have gone on for a long time. Practising filtering and being comfortable with what is being removed by your filter is a key technique to master.
                      _________________________________________________________

You can practice filtering in the following challenges:

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text