How to... deal with Boolean data

My favourite data joke is "What's a ghost's favourite data type? BOO-lean".

Now that's out the way, let's talk about potentially the simplest data type but actually one that sits at the heart of a lot of what we do in data analytics and therefore is an important part of data preparation - Boolean data.

What does Boolean data mean?

Boolean comes from the mathematician George Boole. He was Cork University's first mathematics professor who's theorems eventually became applied to computing. Boolean data is simply a 'True' or 'False' response to a conditional statement or test. 

Why is it so useful in Data Analysis?

Simply, because the response of True or False is often encoded as 1 or 0 behind the scenes in the software we use, the performance of calculations that use Boolean data is very quick. Computing is based on 1s and 0s so Boolean data is easily processed by a computer. 

A simple column of 1 or 0 responses is actually amazingly useful in data analysis for many reasons beyond just performance:

Indicators
These can be analysed very simply in most data tools. If for example you want to count how many customers have a certain product type then summing a Boolean Indicator 1/0 is very simple in most tools. 

In Tableau Prep or Desktop, it would be easy to aggregated these values as either a SUM() or AVG().

Logical Tests to Filter
Determining whether sometime meet a condition, or not, is a key part of data analytics. Boolean tests will simply result in True / False responses allowing the user of the test to decide how to progress. These tests can be useful in cleaning data as well as building out more complex logic of what course of action to take to solve a specific problem. For example, using the banking data set above, if we are analysing only Customers with Credit Cards, checking if that customer has a 1 value in the 'Credit Card Ind' column gives us the perfect data set to work with. 

Aliasing
Getting a response of True / False or 1 / 0 might be useful as the analyst but won't likely be the response the user of the analysis will understand instantly. Therefore, using an alias to detail what the Boolean data type refers to is very useful. 
In Tableau Prep, the profile pane is the perfect place to change your values. Simply double clicking a value, you can replace the original value (in this case 1) and replace with a 'Yes'. Remember to change your numeric data type from an Integer to a String or Prep will not accept the change. 

Functions featuring Boolean logic

There are many calculation functions that will assess whether a value in a data field meets a condition to be output as either True or False, or uses Boolean logic. Here are some of the most useful:

IIF() 
Is short for, an Immediate IF function. IIFs are the equivalent of choosing what to output instead of just True or False where a condition is or isn't met. 

The way to set this calculation up is:
                                       IIF(Logical Test, If True return this, If False return this)

The function returns the True result (second part of the function) if the logical test is met, and the False result (third part of the function) if the logical test is not met. Using an IIF can save a lot of typing of traditional IF statements and thus prevent keying errors.

Contains()
This is a great function when working with string data to test whether certain words or terms are present in a string. The Contains function saves users splitting each word in a string out individually (or key characters) before understanding if they exist in certain rows. This can prevent data being converted to being massively long datasets and battles with duplication of values.

For example, if a single word is being looked for within a tweet, the Contains function can tell if that term exists without splitting each word to an individual row and having the values associated to that tweet replicated within the data set (ie number of retweets or likes).
                                                       Contains([Tweet],'Key Word')

The function returns True if the keyword is found within the given field, False if the keyword isn't found. The function is assessed for each row within the data set.

IsDate()
Tests whether a column is a valid date. Whilst working with manual date recording (for example, a team capturing records of events in Excel), it is very useful to be able to identify those dates that are not valid dates so as to be able to clean the inaccurate records to ensure the data is captured for analysis.
                                                                     IsDate([Date])

The function returns True if the Date field (shown as square brackets) is in the format yyyy-MM-dd. This means the year is two to four digits, month one or two digits, day as one or two digits. The date parts need to be valid, ie day value has to be less than or equal to 31 and the month values have to be below 13. The date field needs to have the date parts separated by hyphens.

Dates that are not possible, like the 31st Feb, are also returned as false so this function can help validate dates and not just check their formatting.

IsNull()
Nulls in data can often be valid but where you don't expect to find them in fields can be a sign of issues that need to be resolved. Branching off records that have unexpected nulls to either alert an analyst or for differing data preparation can save misanalysis from the data.
                                                                     IsNull([Field])

If the field isn't null, ie. it contains any value or string (including a space character), the function will return False. If the field being assessed returns True then the field will be a Null.
                          __________________________________________________________________

Boolean data type fields can be very powerful as demonstrated by the varying use cases mentioned above so practising their use is key to becoming competent at Data Preparation.

Popular posts from this blog

2024: Week 1 - Prep Air's Flow Card

How to...Handle Free Text

2023: Week 1 The Data Source Bank