How to extract certain data using Regex in Make?

Last updated: 2024/14/01

Regex (regular expressions) is very useful whenever you want to extract or replace data in text. If your text data always follows a similar pattern, regex is the ideal tool. But how do you use it?

Table of contents

1 – What is regex

2 – How to begin with RegEx

3 – Starting development

4 – Basics - Extracting data from text

5 – Basics - Replacing data with the replace() function

6 – Basics - Named capturing groups

7 – Advanced - Extracting multiple values using only 1 Text Parser

8 – More examples

What is a Regex?

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check whether a string contains the specified search pattern. It isn’t a programming language, but it can feel like one because it has its own specific syntax.

Here is an addition from @alex.newpath:
First, a regex is a text string. For instance, foo is a regex. So is [A-Z]+:\d+.

Those text strings describe patterns to find text or positions within a body of text. For instance, the regex foo matches the string foo, and the regex [A-Z]+:\d+ matches string fragments like F:1 and GO:30.

Typically, these patterns (which can be beautifully intricate and precise) are used for four main tasks:

  • to find text within a larger body of text;

  • to validate that a string conforms to a desired format;

  • to replace text (or insert text at matched positions, which is the same process);

  • and to split strings.
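The four tasks above can be sketched in a few lines of TypeScript, which uses the same ECMAScript regex flavor this post later recommends selecting in regex101. The sample strings and variable names are made up for illustration; this is not Make syntax.

```typescript
// Sample text containing two fragments that fit the pattern [A-Z]+:\d+
const sampleText = "codes F:1 and GO:30 were found";

// 1. Find text within a larger body of text (the g flag returns every match).
const found: string[] = sampleText.match(/[A-Z]+:\d+/g) ?? [];

// 2. Validate that a whole string conforms to the format (anchored with ^ and $).
const isValid: boolean = /^[A-Z]+:\d+$/.test("GO:30");

// 3. Replace every match with other text.
const redacted: string = sampleText.replace(/[A-Z]+:\d+/g, "<code>");

// 4. Split a string on a pattern (here: a comma with optional spaces around it).
const parts: string[] = "F:1, GO:30,X:7".split(/\s*,\s*/);

console.log(found, isValid, redacted, parts);
```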

How to begin with RegEx

To develop your own regex, there are a few very helpful tools and habits that will get you started. These are my personal recommendations; if you have anything to add, feel free to comment. Some of the things I use to build patterns:

  1. Regex101 for pattern development, syntax help and debugging
  2. Stack Overflow / Make community for questions and answers if you get stuck
  3. A Make account and a good dose of perseverance

Starting development

Before creating the regex module in Make, I always develop the pattern in Regex101 first. Once you’ve successfully created your pattern, you can copy it over to the Make module. Steps to take:

  1. Make sure you set the Flavor in regex101 to “ECMAScript (JavaScript)”. This is the flavor Make uses.

  2. When looking for patterns, use the “Quick reference” panel in the bottom right to search for commonly used patterns.

  3. Start the pattern development. If a pattern gets complex, split it up and start with something simple first.

  4. Once your pattern works and you copy it over to Make, make sure the part you want to output is wrapped in brackets, i.e. parentheses ( ), so that it becomes a capturing group. See the example below for more information.

Basics - Extracting data from text

When you want to extract data from a string in Make, you can use the “Match pattern” module within the “Text parser” app. Let’s say I get some HTML data with a URL (href) and I want to extract the URL. It would look like this:

Test string
<a href="www.google.com">Google</a>

Pattern
(?<=href=\").+(?=\")

Output
www.google.com

As stated above, when you copy this over to Make you need to make sure the output you want to retrieve is grouped with brackets. The pattern above will return an empty output in Make, since the part I want is not inside brackets (even though regex101 shows a match).
So the correct pattern would be:

(?<=href=\")(.+)(?=\")

In regex101 you will now also see that the match is reported as a group.

Within Make, you can now use this pattern in the Text parser module to get the data you want to extract.
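To see why the brackets matter, here is a minimal TypeScript sketch of the same pattern (TypeScript uses the ECMAScript regex flavor). The full match always sits at index 0, while index 1 only exists when the pattern contains a capturing group. That Make only outputs the groups is taken from the explanation above; the function name extractHref is hypothetical.

```typescript
// Returns the href value captured by group 1, or null when nothing matches.
function extractHref(html: string): string | null {
  // The lookbehind (?<=href=") and lookahead (?=") check the surrounding text
  // without consuming it; (.+) is the capturing group that holds the URL.
  const match = html.match(/(?<=href=")(.+)(?=")/);
  return match?.[1] ?? null;
}

console.log(extractHref('<a href="www.google.com">Google</a>')); // www.google.com
```

Because .+ is greedy, a tag with several quoted attributes would be captured up to the last quote on the line. Using [^"]+ instead of .+ is a common way to stop at the first closing quote.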

Basics - Replacing data with the replace() function

Using the “Text parser” app you can also use the “Replace” module to replace some text. However, if you want to limit the number of operations your scenario uses, or want to easily replace multiple variables somewhere, the replace() function is ideal.

In the following example I have some text with HTML tags inside it. For the output I want clean text with all HTML tags removed. What we will do is replace() all tags with emptystring, which basically means we delete them.

The syntax of the replace() function is as follows:

replace(text; search string; replacement string)

text: the text to search in
search string: the regex pattern, written between slashes
replacement string: what each match is replaced with

In our example it looks like this:

{{replace(1.`raw data`; "/(<b>|<\/b>|<\/br>)/g"; emptystring)}}

The regex syntax is as follows:
/<regex pattern>/<regex flags>

In this example we have several different HTML tags, so we use a group with multiple alternatives separated by “|”. Since the tags also occur multiple times, we don’t want to stop at the first match, so we add the g flag to find all matches.

By using the replace() function you can either search for a certain pattern and remove it, or replace it with some other text.
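For comparison, the same pattern and g flag behave like this in TypeScript. This is only a sketch of the regex itself, not of Make’s replace() function; the sample input and the function name stripExampleTags are made up.

```typescript
// Removes the three tags from the example by replacing each match with an empty string.
function stripExampleTags(raw: string): string {
  // The alternatives are separated by | and the g flag replaces every match,
  // not just the first one.
  return raw.replace(/(<b>|<\/b>|<\/br>)/g, "");
}

console.log(stripExampleTags("<b>Hello</b> world</br>")); // Hello world
```

Passing a different replacement string instead of "" swaps each tag for that text rather than deleting it.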

Basics - Named capturing groups

Within the regex you create, you usually just get outputs like $1, $2, etc. These don’t tell you much about what kind of data was extracted.

This is where named capturing groups come into play, which help A LOT when developing regex patterns. A named group works exactly like any other capturing group, except that the output is labeled with the name of the group that captured it.

Instead of the standard $1 and $2, you get named groups, here h4 and p. The syntax for this is as follows:

(?<h4>(?<=\<h4\>).+?(?=\<\/h4>))|(?<p>(?<=\<p\>).+?(?=<\/p>))

When opening a group with (, add the name right after it as ?<group name>. Everything captured inside that group is then output under that name.
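As a rough illustration outside Make, the pattern above can be run in TypeScript, where named groups appear on the groups property of each match. The sample HTML and the function name collectNamed are assumptions for this sketch; the backslashes before < and > in the original are dropped because they are not needed.

```typescript
// Collects every named group that actually captured something, in match order.
function collectNamed(html: string): Array<{ name: string; value: string }> {
  const pattern = /(?<h4>(?<=<h4>).+?(?=<\/h4>))|(?<p>(?<=<p>).+?(?=<\/p>))/g;
  const results: Array<{ name: string; value: string }> = [];
  let match: RegExpExecArray | null;
  // With the g flag, exec() continues from the end of the previous match.
  while ((match = pattern.exec(html)) !== null) {
    for (const [name, value] of Object.entries(match.groups ?? {})) {
      // Only one alternative matches at a time; the other group stays undefined.
      if (value !== undefined) results.push({ name, value });
    }
  }
  return results;
}

console.log(collectNamed("<h4>Title</h4><p>Body text</p>"));
```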

Advanced - Extracting multiple values using only 1 Text Parser

If you have a string of text that contains multiple variables you want to extract, you have 2 options:

  • Using multiple Text parser apps
  • Using 1 Text parser app which searches all patterns & aggregates them

Since we (at Drivn) like to keep our scenarios efficient and easy to manage, we usually go for the second option. It reduces the number of operations you use, and you can extract all values at once.

In the following example we receive a Dutch booking email for renting a boat, which holds multiple variables such as:

  • Booking time (vaarduur)
  • Food requirements (hapjes & eten)
  • Amount of persons (aantal personen)
    etc.

We want to extract all values at once and then add them to a Google Sheet. We developed the pattern in regex101 to see which groups we get.

The regex pattern contains a lot of alternatives and the global flag. This means it will try to find all groups and won’t stop at the first match. The pattern looks like this:

(?<vaarduur>(?<=Vaarduur\: ).+)|(?<eten>(?<=Hapjes & Eten\: ).+)|(?<personen>(?<=Aantal Personen\: ).+)|(?<datum>(?<=Datum\: ).+)|(?<tijd>(?<=Tijd\: ).+)|(?<vragen>(?<=vragen\?\: ).+)|(?<naam>(?<=Naam\: ).+)|(?<email>(?<=Email\: ).+)|(?<telefoon>(?<=Telefoonnummer\: ).+)|(?<registratieDatum>(?<=Date\: ).+)|(?<registratieTijd>(?<=Time\: ).+)|(?<registratieWebAgent>(?<=User Agent\: )[.\n\s\S]+(?=\nRemote IP))|(?<registratieIP>(?<=Remote IP\: ).+)

When we run this with a Match pattern module, it will output multiple bundles. Every separate bundle holds its own variable, which makes it difficult to get all variables at once.

Since we want all variables at once, we aggregate all bundles into 1 array so we can easily extract them.

The last thing we have to do is extract the values from the Aggregator output and put them in our Google Sheet. The output of the aggregator is an array containing multiple collections.

To retrieve the variable we want, we finally use a combination of map(), remove() and get(). What we basically do is:

  • find the variable in the array
  • remove all the other, empty variables
  • make a string out of the final array we get
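The idea can be mimicked in TypeScript with a shortened version of the pattern (three of the fields only). Each regex match plays the role of one bundle, and merging the non-empty groups into a single object stands in for the aggregator plus map(), remove() and get(). This is an analogy, not how Make implements those modules; the sample email values and the function name parseBooking are invented.

```typescript
// Runs a global, multi-alternative pattern and merges all named captures into one record.
function parseBooking(email: string): Record<string, string> {
  const pattern =
    /(?<vaarduur>(?<=Vaarduur: ).+)|(?<personen>(?<=Aantal Personen: ).+)|(?<naam>(?<=Naam: ).+)/g;
  const record: Record<string, string> = {};
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(email)) !== null) {
    // Each match fills exactly one named group; the others are undefined ("empty").
    for (const [name, value] of Object.entries(match.groups ?? {})) {
      if (value !== undefined) record[name] = value;
    }
  }
  return record;
}

const sampleEmail = "Vaarduur: 2 uur\nAantal Personen: 6\nNaam: Jan";
console.log(parseBooking(sampleEmail));
```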

More examples

Here is an example from @alex.newpath:

The regex below is borrowed from chapter 4 of Jan Goyvaerts’s excellent book. What you really want is an expression that works with 999 email addresses out of a thousand, an expression that doesn’t require a lot of maintenance, for instance by forcing you to add new top-level domains (“dot something”) every time the powers in charge of those things decide it’s time to launch names ending in something like .phone or .dog.

Email address regex:

(?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,63}\b

Let’s unroll this one:

(?i) # Turn on case-insensitive mode

\b # Position engine at a word boundary

[A-Z0-9._%+-]+ # Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus. Yes, some of these are rare in an email address.

@ # Match @

(?:[A-Z0-9-]+\.)+ # Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens. These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com

[A-Z]{2,63} # Match two to 63 letters, for instance US, COM, INFO. This is meant to be the top-level domain. Yes, this also matches DOG. You have to decide if you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced. 63 letters is the current maximum length of a TLD, although you rarely find any longer than 10 characters.

\b # Match a word boundary

Note that in Make, the (?i) modifier is implemented as an option in the Text parser module, so it must be removed from the regular expression pattern.
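The same applies in plain ECMAScript: as far as I know a leading (?i) is not accepted there either, so in the TypeScript sketch below the case-insensitive mode is switched on with the i flag instead. The sample sentence and the function name findEmail are made up.

```typescript
// Email pattern from above with (?i) removed; the i flag makes it case-insensitive.
const emailPattern = /\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,63}\b/i;

// Returns the first email-like fragment in the text, or null if there is none.
function findEmail(text: string): string | null {
  const match = text.match(emailPattern);
  return match?.[0] ?? null;
}

console.log(findEmail("Contact: Jane.Doe@post.microsoft.com, thanks"));
// Jane.Doe@post.microsoft.com
```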


And one from @JimTheMondayMan:

In addition to the Text parser module that Bjorn pointed out, you can often just use the replace() function. There are cases, due to the differences in the implementation (like the multiline flag not being allowed), where using the Text parser is basically required. But in all my scenarios I think I have used the Text parser module twice. Everywhere else where I want to use Regex I just use good old replace().

Here is a little info on it from the documentation: String functions

Of course, you can use replace() to, well, replace with Regex. Most often, I use it as Bjorn used the Text parser module in his example, to extract data.

To use it this way requires a little change in perspective. Basically, to extract data, you need to MATCH the entire string, capturing (and replacing) only the part(s) you want.

To extract the same data as in Bjorn’s example using replace() you would get something like this:

replace(<a href="www.google.com">Google</a>;/.*href=\"(.+)\".*/; $1)

Some notes:

  1. In this case the global flag is not required.
  2. The multiline flag is not allowed in replace() (nor needed here)
  3. “$1” refers to the first capture group, the “(.+)” in the middle of the search string.
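The “match everything, keep only the capture” trick works the same way in TypeScript’s String.replace, which makes it easy to try out. A small sketch; the function name extractWithReplace is hypothetical.

```typescript
// Matches the ENTIRE string and replaces it with capture group 1,
// which leaves only the href value.
function extractWithReplace(input: string): string {
  return input.replace(/.*href="(.+)".*/, "$1");
}

console.log(extractWithReplace('<a href="www.google.com">Google</a>')); // www.google.com
```

One thing to keep in mind with this approach: when the pattern does not match at all, replace returns the input unchanged, so a failed extraction looks like the original text rather than an empty value.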

Happy integrating!

If you have any questions, feel free to place a comment below.
~Bjorn

27 Likes

Wowzers, thanks so much for the neat tutorial, Bjorn :muscle: Seasoned Make users swear by the usefulness of regex, and your post only further underlines it!

3 Likes

Thanks @Bjorn.drivn I’m literally going to bookmark this post for future regex projects. I’ve used some of these tools before, but this is a great overall resource for building regex expressions!

Thanks for the post!

2 Likes

I agree. Regular Expressions can be VERY useful!

In addition to the Text parser module that Bjorn pointed out, you can often just use the replace() function. There are cases, due to the differences in the implementation (like the multiline flag not being allowed), where using the Text parser is basically required. But in all my scenarios I think I have used the Text parser module twice. Everywhere else where I want to use Regex I just use good old replace().

Here is a little info on it from the documentation: String functions

Of course, you can use replace() to, well, replace with Regex. Most often, I use it as Bjorn used the Text parser module in his example, to extract data.

To use it this way requires a little change in perspective. Basically, to extract data, you need to MATCH the entire string, capturing (and replacing) only the part(s) you want.

To extract the same data as in Bjorn’s example using replace() you would get something like this:

replace(<a href="www.google.com">Google</a>;/.*href=\"(.+)\".*/; $1)

Some notes:

  1. In this case the global flag is not required.
  2. The multiline flag is not allowed in replace() (nor needed here)
  3. “$1” refers to the first capture group, the “(.+)” in the middle of the search string.

Jim

4 Likes

Thanks for that, @Drivn. I noted you are a Make regex flavor user, and hope you can shed some light on my issue.

I have a scenario - mailhook => iterator => awsS3 which:

  1. accepts an email;
  2. if it has attachments, forwards it to the iterator;
  3. if there is a particular “key:value” pattern in the email text, uses the “value” as the folder variable in the S3 Put function.

All works, except parsing out the “value” portion in the S3 module.

The pattern I use in the email if I want the folder changed is: [fpath:somefolder/anotherfolder…]. I put it as the first line of a forwarded or direct email.

The formula I use in the folder variable of the S3 module is:

{{if(indexOf("[fpath:"; "!=-1"); replace(1.text; "/^.?\[fpath:([^\s\]]+)]?.+$/s"; "$1"); "")}}

(with or without the quoted text)

The issue is that the capture - ([^\s\]]+) captures the ENTIRETY of the text field - that is, $1 shows ALL of the email’s passed 1.text field, not just the capture, even though the following text is clearly outside the capture parentheses. Any thoughts?

…also tried:

if(indexOf("[fpath:"; "!=-1"); replace(1.text; "/^.?\[fpath:([^\s\]]+)]\n"; "$1"); "")

with and without an ungreedy ? after the \n.

@bullit

I don’t think the problem is that “([^\s\]]+)” is capturing everything. Rather, you are matching/capturing nothing, so you are just getting back the original string. My guess is that the single line flag does not work with replace.

Try replacing "/^.?\[fpath:([^\s\]]+)]?.+$/s" with "/^.*?\[fpath:([^\s\]]+)]([^a]|[a])*/".

This will accomplish basically the same thing without requiring the single line flag.
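The effect is easy to reproduce in TypeScript. Whether Make’s replace() really ignores the s flag is my guess, so the sketch below simply runs the original pattern without that flag and compares it with the workaround. The sample email text is invented, and the closing bracket is escaped here, which matches the same character.

```typescript
// A multi-line email body with the folder tag on the first line.
const emailText = "[fpath:somefolder/anotherfolder]\nHello,\nsee attachment";

// Original pattern WITHOUT the s flag: "." cannot cross the line breaks,
// so ".+$" never reaches the end of the text and nothing matches.
// replace() then returns the input unchanged.
const withoutSingleLine = emailText.replace(/^.?\[fpath:([^\s\]]+)\]?.+$/, "$1");

// Workaround: ([^a]|[a])* matches any character, line breaks included,
// so the whole text is matched and replaced by the capture alone.
const withWorkaround = emailText.replace(/^.*?\[fpath:([^\s\]]+)\]([^a]|[a])*/, "$1");

console.log(withoutSingleLine === emailText); // true
console.log(withWorkaround); // somefolder/anotherfolder
```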


Jim - The Monday Man (YouTube Channel)

2 Likes

Brilliant. Thank you!

Thank you for the amazing tips!

We have combined HTTP, RegEx and Hash to keep track of changes in websites that have tables with the status of a series of licensing processes. If the hash of the captured portion of the HTTP response changes, we alert the user interested in that particular record and include the HTML table with the updated information.

It was a breakthrough for us since the service provider didn’t have any APIs.
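For readers who want to picture this approach, here is a minimal TypeScript sketch of the idea: capture the table with a regex, hash only that part, and compare the hash with the one from the previous run. Everything in it is an assumption for illustration, including the table pattern, the function names and the hash itself (a simple FNV-1a variant over UTF-16 code units as a stand-in; the post does not say which hash was used).

```typescript
// Simple 32-bit FNV-1a style hash, returned as 8 hex characters (illustrative stand-in).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hashes only the contents of the first <table>, so changes elsewhere on the page are ignored.
function tableFingerprint(html: string): string | null {
  const match = html.match(/<table[^>]*>([\s\S]*?)<\/table>/i);
  const inner = match?.[1];
  return inner === undefined ? null : fnv1a(inner);
}

// True when the table fingerprint differs from the one stored on the previous run.
function hasChanged(previousFingerprint: string | null, html: string): boolean {
  return tableFingerprint(html) !== previousFingerprint;
}
```

Hashing only the captured portion keeps unrelated page changes (counters, timestamps, ads) from triggering an alert.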

3 Likes

I came across this amazing resource when reading about regular expressions.

It has this gem of a definition for regular expressions (ie Regex):

What is a Regex?

First, a regex is a text string. For instance, foo is a regex. So is [A-Z]+:\d+.

Those text strings describe patterns to find text or positions within a body of text. For instance, the regex foo matches the string foo , the regex [A-Z]+:\d+ matches string fragments like F:1 and GO:30 , and the regex (?<=[a-z])(?=[A-Z]) matches the position in the string CamelCase where we shift from a lower-case letter to an upper-case letter.

Typically, these patterns (which can be beautifully intricate and precise) are used for four main tasks:

  • to find text within a larger body of text;

  • to validate that a string conforms to a desired format;

  • to replace text (or insert text at matched positions, which is the same process);

  • and to split strings.
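The position-matching regex from the quote is a nice one to try out, because it matches no characters at all, only the spot between a lower-case and an upper-case letter. A short TypeScript sketch (the function name splitCamelCase is made up):

```typescript
// Splits at every position where a lower-case letter is followed by an upper-case letter.
// The lookbehind and lookahead match a position, so no characters are removed.
function splitCamelCase(word: string): string[] {
  return word.split(/(?<=[a-z])(?=[A-Z])/);
}

console.log(splitCamelCase("CamelCase")); // [ 'Camel', 'Case' ]
```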

1 Like

Here’s probably the best email address validator I have come across.

The regex below is borrowed from chapter 4 of Jan Goyvaerts’s excellent book, Regular Expressions Cookbook. What you really want is an expression that works with 999 email addresses out of a thousand, an expression that doesn’t require a lot of maintenance, for instance by forcing you to add new top-level domains (“dot something”) every time the powers in charge of those things decide it’s time to launch names ending in something like .phone or .dog.

Email address regex:

(?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,63}\b

Let’s unroll this one:

(?i) # Turn on case-insensitive mode

\b # Position engine at a word boundary

[A-Z0-9._%+-]+ # Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus. Yes, some of these are rare in an email address.

@ # Match @

(?:[A-Z0-9-]+\.)+ # Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens. These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com

[A-Z]{2,63} # Match two to 63 letters, for instance US, COM, INFO. This is meant to be the top-level domain. Yes, this also matches DOG. You have to decide if you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced. 63 letters is the current maximum length of a TLD, although you rarely find any longer than 10 characters.

\b # Match a word boundary

Note that in Make, the (?i) modifier is implemented as an option in the Text parser module, so it must be removed from the regular expression pattern.

2 Likes

This is absolutely awesome @Bjorn.drivn!

If there’s one thing I might add, for people who have never used RegEx before, regexone.com is a great place to learn the basics. You can “cheat” on some of the tasks and complete them without the operators the lesson is trying to teach you, but if you follow the intended steps it’ll give you a decent overview of what you can do and more importantly, how to do it. :slight_smile:

1 Like

For people watching this topic and interested in updates…
We have just updated this topic with:

5 – Basics - Replacing data with the replace() function

6 – Basics - Named capturing groups

7 – Advanced - Extracting multiple values using only 1 Text Parser

1 Like

Here are a couple of videos that might help, especially with generating REGEX, when you aren’t a REGEX expert:

Chat GPT & Make.com: How to Generate REGEX to Match Text Values in Make.com

Chat GPT & Make.com How to Replace String Values in Make Scenarios With ChatGPT Regular Expressions

Andy @ Weblytica.com

2 Likes