How to...use Regex
Data can be complex; it's not all adding values together or counting things. String data fields can have a lot of complexity but utilising them is often essential in forming your data sets and analysis. Regex is a set of commands that make difficult string parsing challenges much more achievable.
What does Regex mean?
Regex is a shortened version of the name Regular Expressions. This refers to what Regex actually does, applies a calculation to a pattern of characters in a string of characters. Regex is much more flexible in the patterns it can identify than what can be achieved through other functions found in most data software like Left, Mid or Right.
The advantage of learning regex is that is a common set of commands exists across most data tools and therefore is highly transferable between data preparation tools.
The advantage of learning regex is that is a common set of commands exists across most data tools and therefore is highly transferable between data preparation tools.
How can it be used in Prep?
Prep is one of the many places you can make use of Regex. There are three main different ways to use Regex in Prep.
Regex_Extract or Regex_Extract_Nth
Extract allows the user to return a part of a string, or substring as it is also known. The part returned is captured within brackets within the regular expression - this is called the 'capturing group' and is shown by the part of the pattern inside parentheses.
Regex_Extract([String Field],'regex pattern')
The Regex_Extract_Nth is a way to bring back the Nth instance of the capturing grouping.
Regex_Match
The Match function returns a Boolean result of either True or False if a regex pattern is matched within a given string. The regex pattern can be only part of the overall string to still return a True result.
Regex_Match([String Field],'regex pattern')
Regex_Replace
The Replace function uses regex to identify a part of a string field to replace and then also informs the function what to replace it with.
Regex_Replace([String Field],'Substring to replace','Replacement')
The Match function returns a Boolean result of either True or False if a regex pattern is matched within a given string. The regex pattern can be only part of the overall string to still return a True result.
Regex_Match([String Field],'regex pattern')
Regex_Replace
The Replace function uses regex to identify a part of a string field to replace and then also informs the function what to replace it with.
Regex_Replace([String Field],'Substring to replace','Replacement')
What situations should I look to use Regex in rather than other string functions?
Replacing Common Mistakes
Many string fields come from manually entered text. This can mean that common issues are repeated within a data field that will need to be removed before accurate analysis can occur. For example, if you have times of different events captured but in different formats, the relevant values need to be extracted from the string values before you can begin your analysis.
In this instance, the time needs to be removed from this string. Using Regex_Extract this can be achieved as the string can be described by regex and then parentheses can be used to strip out the part we need for the analysis.
This gives the result required:
Removing spaces in Contact Details like UK Postcodes or Phone Numbers
When using contact details for marketing or customer analysis, there are common inconsistencies with how people capture their own details. Adding a space, + or hyphen within the formatting of a telephone number is common place but there is no single correct format. Therefore Regex can be used to make this data the same format to understand if there are duplicates for example.
Using a random telephone number from the UK, this format is what we would expect but let's assume the data analysis requires the space to be removed. Using Regex_Replace, we can use the '\s' command to identify the space and replace it with nothing which can be coded as two quote marks next to each other ''.
This gives the result required:
Many string fields come from manually entered text. This can mean that common issues are repeated within a data field that will need to be removed before accurate analysis can occur. For example, if you have times of different events captured but in different formats, the relevant values need to be extracted from the string values before you can begin your analysis.
In this instance, the time needs to be removed from this string. Using Regex_Extract this can be achieved as the string can be described by regex and then parentheses can be used to strip out the part we need for the analysis.
This gives the result required:
Removing spaces in Contact Details like UK Postcodes or Phone Numbers
When using contact details for marketing or customer analysis, there are common inconsistencies with how people capture their own details. Adding a space, + or hyphen within the formatting of a telephone number is common place but there is no single correct format. Therefore Regex can be used to make this data the same format to understand if there are duplicates for example.
Common Regex commands
In order to be able to utilise Regex functions, you will need to be able to form the Regex patterns. Here are the key parts of the Regex that should help you form the patterns you will come across:
\d any digit 0-9
\d+ one or more digits
\a-z any lower case letter
\A-Z any upper case letter
\D any non-digit
\s white space
^ start of a string
$ end of a string
( ) capture anything within the parentheses, like in Regex_Extract function
{ } exactly a number of the given substring, set-up by adding a number between the brackets ie {3}
For testing, you can use Prep or alternatively can get instant feedback on websites built for testing regex like regex101.com