Last updated: 2024/14/01
Regex (regular expressions) is very useful when you want to extract or replace data in text. Whenever your text data follows a recurring pattern, regex is an ideal tool. But how do you use it?
Table of contents
1 – What is regex
2 – How to begin with RegEx
3 – Starting development
4 – Basics - Extracting data from text
5 – Basics - Replacing data with the replace() function
6 – Basics - Named capturing groups
7 – Advanced - Extracting multiple values using only 1 Text Parser
8 – More examples
What is a Regex?
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern. It isn’t a programming language, but it could be seen as one since it has its own specific syntax.
Here is an addition from @alex.newpath:
First, a regex is a text string. For instance, foo is a regex. So is [A-Z]+:\d+.
Those text strings describe patterns to find text or positions within a body of text. For instance, the regex foo matches the string foo, and the regex [A-Z]+:\d+ matches string fragments like F:1 and GO:30.
Typically, these patterns (which can be beautifully intricate and precise) are used for four main tasks:
- to find text within a larger body of text;
- to validate that a string conforms to a desired format;
- to replace text (or insert text at matched positions, which is the same process);
- and to split strings.
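To make those four tasks a bit more concrete, here is a minimal sketch in plain JavaScript (the regex flavor Make uses, as noted further down). Only the pattern [A-Z]+:\d+ comes from the text above; the sample strings and variable names are my own assumptions, and this is not how Make runs it internally.

```javascript
// Assumed sample text; only the pattern [A-Z]+:\d+ comes from the post.
const text = "Seats F:1 and GO:30 are taken";

// 1. Find: the g flag returns every match instead of only the first.
const found = text.match(/[A-Z]+:\d+/g);

// 2. Validate: ^ and $ anchor the pattern, so the whole string must conform.
const isValid = /^[A-Z]+:\d+$/.test("GO:30");
const isInvalid = /^[A-Z]+:\d+$/.test("go:30");

// 3. Replace: swap every match for a placeholder.
const masked = text.replace(/[A-Z]+:\d+/g, "[seat]");

// 4. Split: cut a string on commas or semicolons (plus optional whitespace).
const parts = "F:1, GO:30;AB:7".split(/[,;]\s*/);

console.log(found);     // [ 'F:1', 'GO:30' ]
console.log(isValid);   // true
console.log(isInvalid); // false
console.log(masked);    // Seats [seat] and [seat] are taken
console.log(parts);     // [ 'F:1', 'GO:30', 'AB:7' ]
```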
How to begin with RegEx
To develop your own regex, a few very helpful tools and processes can help you get started. These are my personal recommendations; if you have anything to add, feel free to comment. Some of the things I use to build patterns:
- Regex101 for pattern development, syntax help and debugging
- Stack Overflow / the Make community for questions and answers if you get stuck
- A Make account and a good dose of perseverance
Starting development
Before creating the regex module in Make, I always develop the pattern in Regex101 first. Once you’ve successfully created your pattern, you can copy it over to the Make module. Steps to take:
- Make sure you set the Flavor in Regex101 to “ECMAScript (JavaScript)”. This is the flavor Make uses.
- When looking for patterns, use the “Quick reference” at the bottom right to search for commonly used patterns.
- Start the pattern development. If a pattern gets complex, split it up and start with something simple first.
- Once you have succeeded and are copying the pattern over to Make, make sure you group the part of the pattern you want to output with brackets, i.e. parentheses ( ). See more information in the example below.
Basics - Extracting data from text
When you want to extract data from a string in Make, you can use the “Match pattern” module within the “Text parser” app. Let’s say I get some HTML data with a URL (href) and I want to extract the URL. It would look like this:
Test string
<a href="www.google.com">Google</a>
Pattern
(?<=href=\").+(?=\")
Output
www.google.com
Now, as stated above, when you copy this over to Make you need to make sure the output you want to retrieve is grouped with brackets. The above pattern will return an empty output, since the part I want is not within brackets (even though Regex101 does show a match).
So the correct pattern would be:
(?<=href=\")(.+)(?=\")
In Regex101 you will now also see that it gets grouped.
Within Make, you can now use this pattern in the Text parser module and get the data you want to extract.
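If you want to see the difference between the full match and the group outside of Regex101, here is a small JavaScript sketch using the test string from above. It only illustrates the regex behaviour, not what the Make module does internally. The second test string with a target attribute and the [^"]+ variant are my own assumptions, added to show a case where the greedy .+ grabs too much. In a JavaScript regex literal the quotes don’t need a backslash, so I left those out.

```javascript
// Test string from the post.
const html = '<a href="www.google.com">Google</a>';

// Without brackets there is a full match, but no capture group to output.
const withoutGroup = html.match(/(?<=href=").+(?=")/);
console.log(withoutGroup[0]); // www.google.com
console.log(withoutGroup[1]); // undefined

// With brackets the same text is also available as group 1.
const withGroup = html.match(/(?<=href=")(.+)(?=")/);
console.log(withGroup[1]); // www.google.com

// Assumed variation: a second attribute adds more quotes,
// and the greedy .+ then runs up to the LAST quote.
const html2 = '<a href="www.google.com" target="_blank">Google</a>';
const greedy = html2.match(/(?<=href=")(.+)(?=")/)[1];
const strict = html2.match(/(?<=href=")([^"]+)(?=")/)[1];
console.log(greedy); // www.google.com" target="_blank
console.log(strict); // www.google.com
```

So if your HTML can contain more than one attribute, matching “anything except a quote” is probably the safer choice.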
Basics - Replacing data with the replace() function
Using the “Text parser” app you can also use the “Replace” module to replace some text. However, if you want to limit the number of operations your scenario uses, or want to easily replace multiple variables somewhere, the replace() function is ideal.
In the following example I have some text with HTML tags inside of it. For the output data I want clean text with all HTML tags removed. What we will do is replace() all tags with emptystring, which basically means we will delete those items.
The syntax of the replace() function is as follows:
replace(text; search string; replacement string)
text: the text you will search in
search string: the regex pattern between slashes
replacement string: what you are going to replace the matches with
In our example it looks like this:
{{replace(1.`raw data`; "/(<b>|<\/b>|<\/br>)/g"; emptystring)}}
The regex syntax is as follows:
/<regex pattern>/<regex flags>
In this example we have multiple different HTML tags, so we are using a group with multiple alternatives separated by “|”. Since the tags also occur multiple times, we don’t want to stop at the first match, so we are using the g flag to find all matches.
By using the replace() function you can either search for a certain pattern and remove it, or replace it with some other text.
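For comparison, this is roughly what the same replacement looks like in plain JavaScript. Treat it as a sketch of the regex behaviour only: the sample text and variable names are assumptions of mine, and Make’s replace() is its own function that I’m not reproducing here.

```javascript
// Assumed sample text containing the three tags from the pattern above.
const rawData = "<b>Bold</b> and plain</br>";

// With the g flag every tag is replaced by an empty string.
const clean = rawData.replace(/(<b>|<\/b>|<\/br>)/g, "");
console.log(clean); // Bold and plain

// Without the g flag only the first match is removed.
const firstOnly = rawData.replace(/(<b>|<\/b>|<\/br>)/, "");
console.log(firstOnly); // Bold</b> and plain</br>
```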
Basics - Named capturing groups
When you create a regex, you usually just get outputs like $1, $2, etc. However, this doesn’t tell you much about what kind of data was extracted.
This is where named capturing groups come into play, and they help A LOT when developing these regex patterns. Basically, a named group works exactly like any other group, but the output tells you the name of the group that extracted it.
Instead of the standard $1 and $2 we get the named groups h4 and p. The syntax for this is as follows:
(?<h4>(?<=\<h4\>).+?(?=\<\/h4>))|(?<p>(?<=\<p\>).+?(?=<\/p>))
Inside a group, right after the opening bracket (, we add the name of the group like ?<group name>. Everything captured inside this group will then be labeled with that name.
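Here is a minimal JavaScript sketch of how the named groups come out when you run such a pattern. The HTML snippet is an assumed sample, and I dropped the backslashes in front of < and > since they aren’t needed in this flavor; otherwise it is the same pattern as above.

```javascript
// Assumed sample HTML with one h4 and one p element.
const html = "<h4>Title</h4><p>Some text</p>";

// Same pattern as above, without the optional escapes on < and >.
const pattern = /(?<h4>(?<=<h4>).+?(?=<\/h4>))|(?<p>(?<=<p>).+?(?=<\/p>))/g;

// Every match fills exactly one named group; the other stays undefined.
const results = [];
for (const match of html.matchAll(pattern)) {
  for (const [name, value] of Object.entries(match.groups)) {
    if (value !== undefined) results.push({ name, value });
  }
}

console.log(results);
// [ { name: 'h4', value: 'Title' }, { name: 'p', value: 'Some text' } ]
```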
Advanced - Extracting multiple values using only 1 Text Parser
If you have a string of text containing multiple variables you want to extract, you have 2 options:
- Using multiple Text parser apps
- Using 1 Text parser app which searches all patterns & aggregates them
Since we (at Drivn) personally like to build our scenarios efficiently and easily manageable, we usually go for the second option. With the second option you reduce the number of operations you use, and you can extract all values at once.
In the following example we get a Dutch booking email for renting a boat, which holds multiple variables such as:
- Booking time (vaarduur)
- Food requirements (hapjes & eten)
- Amount of persons (aantal personen)
etc.
We want to extract all values at once, and then add these variables to a Google Sheet. The following RegEx101 shows how we do this and which groups we get.
As you can see in the screenshot, this regex pattern contains a lot of alternatives and the Global flag. This basically means that it will try to find all groups & won’t stop at the first match. The syntax looks like this:
(?<vaarduur>(?<=Vaarduur\: ).+)|(?<eten>(?<=Hapjes & Eten\: ).+)|(?<personen>(?<=Aantal Personen\: ).+)|(?<datum>(?<=Datum\: ).+)|(?<tijd>(?<=Tijd\: ).+)|(?<vragen>(?<=vragen\?\: ).+)|(?<naam>(?<=Naam\: ).+)|(?<email>(?<=Email\: ).+)|(?<telefoon>(?<=Telefoonnummer\: ).+)|(?<registratieDatum>(?<=Date\: ).+)|(?<registratieTijd>(?<=Time\: ).+)|(?<registratieWebAgent>(?<=User Agent\: )[.\n\s\S]+(?=\nRemote IP))|(?<registratieIP>(?<=Remote IP\: ).+)
When we run this with a Match pattern module, it will output multiple bundles. This means that every separate bundle will have its own variable, making it difficult to get all variables at once.
Since we want all variables at once, we will aggregate all bundles inside 1 array so we can easily extract them.
The last thing we have to do now is extract the values we get out of the Aggregator and put them in our Google Sheet. As you can see in the output of the aggregator, we will have multiple collections inside the array.
To retrieve a variable, we finally have to use a combination of map(), remove() and get():
What we basically do is:
- find the variable in the array
- remove all other empty variables
- make a string out of the final array we get
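The same idea can be sketched in plain JavaScript. This is only an analogy for the “many bundles, then aggregate, then pick the filled values” flow, not the Make functions themselves. The email text is an assumed sample, and I shortened the pattern to three of the groups (without the optional backslash before the colon) to keep it readable.

```javascript
// Assumed sample of the booking email, reduced to three lines.
const email = [
  "Vaarduur: 2 uur",
  "Aantal Personen: 6",
  "Naam: Jan Jansen",
].join("\n");

// Shortened version of the pattern above: three alternatives plus the g flag.
const pattern =
  /(?<vaarduur>(?<=Vaarduur: ).+)|(?<personen>(?<=Aantal Personen: ).+)|(?<naam>(?<=Naam: ).+)/g;

// Each match plays the role of one bundle: one group filled, the rest undefined.
const bundles = [...email.matchAll(pattern)].map((m) => ({ ...m.groups }));

// "Aggregate": walk over all bundles and keep only the non-empty values.
const booking = {};
for (const bundle of bundles) {
  for (const [name, value] of Object.entries(bundle)) {
    if (value !== undefined) booking[name] = value;
  }
}

console.log(bundles.length); // 3
console.log(booking);
// { vaarduur: '2 uur', personen: '6', naam: 'Jan Jansen' }
```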
More examples
Here is an example from @alex.newpath:
The regex below is borrowed from chapter 4 of Jan Goyvaerts’s excellent book. What you really want is an expression that works with 999 email addresses out of a thousand, an expression that doesn’t require a lot of maintenance, for instance by forcing you to add new top-level domains (“dot something”) every time the powers in charge of those things decide it’s time to launch names ending in something like .phone or .dog.
Email address regex:
(?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,63}\b
Let’s unroll this one:
(?i)
# Turn on case-insensitive mode
\b
# Position engine at a word boundary
[A-Z0-9._%+-]+
# Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus. Yes, some of these are rare in an email address.
@
# Match @
(?:[A-Z0-9-]+\.)+
# Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens. These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com
[A-Z]{2,63}
# Match two to 63 letters, for instance US, COM, INFO. This is meant to be the top-level domain. Yes, this also matches DOG. You have to decide whether you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced. 63 characters is the maximum length of a TLD, although you rarely find any longer than 10 characters.
\b
# Match a word boundary
Note that in Make the (?i) modifier is implemented as an option in the Text parser module, so it must be taken out of the regular expression pattern.
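Something similar applies if you test the pattern in plain JavaScript: there, case-insensitivity is normally set with the i flag instead of a leading (?i). A minimal sketch, where the sample text and addresses are made up:

```javascript
// Same pattern as above, with (?i) moved to the i flag; g finds all addresses.
const emailPattern = /\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,63}\b/gi;

// Assumed sample text: two valid addresses and one without a top-level domain.
const text = "Mail info@example.com or Support@post.example.org, not foo@bar.";

const addresses = text.match(emailPattern);
console.log(addresses);
// [ 'info@example.com', 'Support@post.example.org' ]
```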
And one from @JimTheMondayMan
In addition to the Text parser module that Bjorn pointed out, you can often just use the replace() function. There are cases, due to the differences in the implementation (like the multiline flag not being allowed), where using the Text parser is basically required. But in all my scenarios I think I have used the Text parser module twice. Everywhere else where I want to use Regex I just use good old replace().
Here is a little info on it from the documentation: String functions
Of course, you can use replace() to, well, replace with Regex. Most often, I use it as Bjorn used the Text parser module in his example, to extract data.
To use it this way requires a little change in perspective. Basically, to extract data, you need to MATCH the entire string, capturing (and replacing) only the part(s) you want.
To extract the same data as in Bjorn’s example using replace() you would get something like this:
replace(<a href="www.google.com">Google</a>;/.*href=\"(.+)\".*/; $1)
Some notes:
- In this case the global flag is not required.
- The multiline flag is not allowed in replace() (nor needed here).
- “$1” refers to the first capture group, the “(.+)” in the middle of the search string.
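To see why matching the entire string matters, here is the same trick as a plain JavaScript sketch. It shows the regex behaviour only; whether Make’s replace() treats a non-matching input the same way is an assumption you should test in your own scenario. The second input string is made up.

```javascript
// Test string from the example above.
const html = '<a href="www.google.com">Google</a>';

// The pattern consumes the whole string; $1 puts back only the captured part.
const url = html.replace(/.*href="(.+)".*/, "$1");
console.log(url); // www.google.com

// If the pattern does not match at all, JavaScript returns the input unchanged,
// so an "extraction" can silently hand back the full original text.
const noLink = "no link here".replace(/.*href="(.+)".*/, "$1");
console.log(noLink); // no link here
```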
Happy integrating!
If you have any questions, feel free to place a comment below.
~Bjorn