How to use Regex in Make?

Regex (regular expressions) is very useful whenever you want to extract or replace data in text. If your text data always follows a similar pattern, regex is the ideal tool. But how do you use regex in Make?

What is a Regex?

Here is an addition from @alex.newpath:

First, a regex is a text string. For instance, foo is a regex. So is [A-Z]+:\d+.

Those text strings describe patterns to find text or positions within a body of text. For instance, the regex foo matches the string foo, the regex [A-Z]+:\d+ matches string fragments like F:1 and GO:30, and the regex (?<=[a-z])(?=[A-Z]) matches the position in the string CamelCase where we shift from a lower-case letter to an upper-case letter.

Typically, these patterns (which can be beautifully intricate and precise) are used for four main tasks:

to find text within a larger body of text;

to validate that a string conforms to a desired format;

to replace text (or insert text at matched positions, which is the same process);

and to split strings.
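
As a rough illustration, the four tasks can be sketched in TypeScript, which shares the ECMAScript regex flavor discussed later in this post. The patterns are the ones from the definition above; the sample strings and variable names are invented for this sketch.

```typescript
// Illustrative only: sample strings are invented for this sketch.
const sample: string = "Codes: F:1 and GO:30";

// 1. Find text within a larger body of text.
const found: string[] = sample.match(/[A-Z]+:\d+/g) ?? [];

// 2. Validate that a whole string conforms to a format (note the ^ and $ anchors).
const isValid: boolean = /^[A-Z]+:\d+$/.test("GO:30");

// 3. Replace text, here by inserting a space at each matched position.
const spaced: string = "CamelCase".replace(/(?<=[a-z])(?=[A-Z])/g, " ");

// 4. Split a string at each matched position.
const parts: string[] = "CamelCase".split(/(?<=[a-z])(?=[A-Z])/);

console.log(found, isValid, spaced, parts);
```

Validation differs from finding only in the anchors: without ^ and $, the pattern would accept any string that merely contains a matching fragment.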

How to begin with RegEx

To develop your own regex, there are a few very helpful tools and processes that help you get started. These are my own personal recommendations; if you have anything to add, feel free to comment. Some of the things I use and do to make patterns:

  1. Regex101 for pattern development, syntax help and debugging
  2. Stackoverflow for a lot of questions and answers if you get stuck
  3. A Make account and a good dose of perseverance

Starting development
Before creating the regex module in Make, I always develop the pattern in Regex101 first. Once you’ve successfully created your pattern, you can copy it over to the Make module. Steps to take:

  1. Make sure you set the Flavor in regex101 to “ECMAScript (JavaScript)”. This is the flavor Make uses.

  2. When looking for patterns, use the “Quick reference” at the bottom right to search for commonly used patterns.

  3. Start the pattern development. If a pattern gets complex, split it up and start with something simple first.

  4. Once you have succeeded and are copying the pattern over to Make, make sure you group the part of the pattern you want to output with brackets (parentheses). See the example below for more information.

An example
Let’s say I get some HTML data containing a URL (href) and I want to extract the URL. It would look like this:

Test string
<a href="www.google.com">Google</a>

Pattern
(?<=href=\").+(?=\")

Output
www.google.com

Now, as stated above, when you copy this over to Make you need to make sure the output you want to retrieve is grouped with brackets (parentheses). The pattern above will return empty output in Make, since the part I want is not inside a group (even though regex101 shows a match).
So the correct pattern would be:

(?<=href=\")(.+)(?=\")

In regex101 you will now also see the URL reported as a capture group (Group 1).

Within Make, you can now use this pattern in the Regex module and get the data you want to extract.
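
The difference between the two patterns can be reproduced outside Make. The sketch below uses TypeScript (same ECMAScript flavor) and is only an illustration: it assumes, as described above, that Make outputs capture groups rather than the whole match, and the variable names are made up.

```typescript
const html: string = '<a href="www.google.com">Google</a>';

// Lookarounds only: the URL is the whole match (index 0); there is no group 1.
const lookaroundOnly = html.match(/(?<=href=").+(?=")/);

// Same pattern with the wanted part wrapped in parentheses: the URL is now
// also available as capture group 1.
const grouped = html.match(/(?<=href=")(.+)(?=")/);

const wholeMatch: string | undefined = lookaroundOnly?.[0];
const missingGroup: string | undefined = lookaroundOnly?.[1];
const firstGroup: string | undefined = grouped?.[1];

console.log(wholeMatch, missingGroup, firstGroup);
```

Both patterns match the same text; only the second exposes it as a group, which is why regex101 shows a match for the first pattern while a tool that reads groups has nothing to return.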

More examples

Here is an example from @alex.newpath:

The regex below is borrowed from chapter 4 of Jan Goyvaerts’s excellent book, Regular Expressions Cookbook. What you really want is an expression that works with 999 email addresses out of a thousand, an expression that doesn’t require a lot of maintenance, for instance by forcing you to add new top-level domains (“dot something”) every time the powers in charge of those things decide it’s time to launch names ending in something like .phone or .dog.

Email address regex:

(?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,63}\b

Let’s unroll this one:

(?i) # Turn on case-insensitive mode

\b # Position engine at a word boundary

[A-Z0-9._%+-]+ # Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus. Yes, some of these are rare in an email address.

@ # Match @

(?:[A-Z0-9-]+\.)+ # Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens. These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com

[A-Z]{2,63} # Match two to 63 letters, for instance US, COM, INFO. This is meant to be the top-level domain. Yes, this also matches DOG. You have to decide if you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced. 63 characters is the maximum length of a DNS label, and therefore of a TLD, although you rarely find any longer than 10 characters.

\b # Match a word boundary

Note: in Make, case-insensitivity is an option in the Text parser module, so the (?i) modifier must be removed from the regular expression pattern.
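
For readers who want to try the pattern outside Make, here is a minimal TypeScript sketch. It drops the inline (?i) modifier in favor of the "i" flag, which plays the same role as the case-insensitive option mentioned above. The sample text and variable names are invented for illustration.

```typescript
// Same pattern as above, with (?i) replaced by the "i" flag; "g" finds every match.
const emailPattern = /\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,63}\b/gi;

// Invented sample text for this sketch.
const message: string =
  "Contact jane.doe@post.microsoft.com or BOB+news@example.dog today.";

// With the "g" flag, match() returns all matched substrings (or null if none).
const addresses: string[] = message.match(emailPattern) ?? [];

console.log(addresses);
```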


And one from @JimTheMondayMan

In addition to the Text parser module that Bjorn pointed out, you can often just use the replace() function. There are cases, due to the differences in the implementation (like the multiline flag not being allowed), where using the Text parser is basically required. But in all my scenarios I think I have used the Text parser module twice. Everywhere else where I want to use Regex I just use good old replace().

Here is a little info on it from the documentation: String functions

Of course, you can use replace() to, well, replace with Regex. Most often, I use it as Bjorn used the Text parser module in his example, to extract data.

To use it this way requires a little change in perspective. Basically, to extract data, you need to MATCH the entire string, capturing (and replacing) only the part(s) you want.

To extract the same data as in Bjorn’s example using replace() you would get something like this:

replace(<a href="www.google.com">Google</a>;/.*href=\"(.+)\".*/; $1)

Some notes:

  1. In this case the global flag is not required.
  2. The multiline flag is not allowed in replace() (nor needed here)
  3. “$1” refers to the first capture group, the “(.+)” in the middle of the search string.
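
The same match-everything-and-keep-the-group idea can be tried in TypeScript, whose String.replace() accepts the same kind of pattern and "$1" reference. This is only a sketch: it assumes Make's replace() behaves like the JavaScript method for this case, and the variable names are made up.

```typescript
const anchor: string = '<a href="www.google.com">Google</a>';

// The pattern matches the ENTIRE string; "$1" then replaces all of it with group 1.
const url: string = anchor.replace(/.*href="(.+)".*/, "$1");

// If the pattern does not match, replace() returns the input unchanged,
// so a result equal to the input means nothing was extracted.
const unchanged: string = "no link here".replace(/.*href="(.+)".*/, "$1");

console.log(url, unchanged);
```

The second call is worth keeping in mind when debugging: getting the whole input back usually means the pattern did not match at all, not that the group captured too much.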

Happy integrating!

If you have any questions, feel free to place a comment below.
~Bjorn

20 Likes

Wowzers, thanks so much for the neat tutorial, Bjorn :muscle: Seasoned Make users swear by the usefulness of regex, and your post only further underlines it!

2 Likes

Thanks @Bjorn.drivn I’m literally going to bookmark this post for future regex projects. I’ve used some of these tools before, but this is a great overall resource for building regex expressions!

Thanks for the post!

2 Likes

I agree. Regular Expressions can be VERY useful!

Jim

4 Likes

Thanks for that, @Drivn. I noted you are a Make regex flavor user, and hope you can shed some light on my issue.

I have a scenario - mailhook => iterator => awsS3 which:

  1. Accepts an email;
  2. if it has attachments, forward it to the iterator;
  3. if there is a particular “key:value” pattern in the email text, use the “value” as the folder variable in the S3 Put function.

All works, except parsing out the “value” portion in the S3 module.

The pattern I use in the email if I want the folder changed is: [fpath:somefolder/anotherfolder…]. I put it as first line in forward or direct email.

The formula I use in the folder variable of the S3 module is:

{{if(indexOf("[fpath:"; "!=-1"); replace(1.text; "/^.?\[fpath:([^\s\]]+)]?.+$/s"; "$1"); "")}}

(with or without the quoted text)

The issue is that the capture - ([^\s\]]+) captures the ENTIRETY of the text field - that is, $1 shows ALL of the email’s passed 1.text field, not just the capture, even though the following text is clearly outside the capture parentheses. Any thoughts?

…also tried:

if(indexOf("[fpath:"; "!=-1"); replace(1.text; "/^.?\[fpath:([^\s\]]+)]\n"; "$1"); "")

with and without an ungreedy ? after the \n.

@bullit

I don’t think the problem is that “([^\s\]]+)” is capturing everything. Rather, you are matching/capturing nothing, so you are just getting back the original string. My guess is that the single-line flag does not work with replace().

Try replacing "/^.?\[fpath:([^\s\]]+)]?.+$/s" with "/^.*?\[fpath:([^\s\]]+)]([^a]|[a])*/".

This will accomplish basically the same thing without requiring the single line flag.
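
The diagnosis can be reproduced in TypeScript. The sketch below is an illustration under assumptions: the email body is invented, and dropping the "s" flag stands in for the guess above that replace() ignores it.

```typescript
// Invented multi-line email body with the folder tag on the first line.
const body: string = "[fpath:reports/2024]\nHello,\nplease see the attachment.\n";

// Original pattern without a working single-line flag: "." cannot cross the
// newline, so nothing matches and replace() returns the input unchanged.
const original: string = body.replace(/^.?\[fpath:([^\s\]]+)]?.+$/, "$1");

// Suggested pattern: ([^a]|[a])* matches any character, newlines included.
const suggested: string = body.replace(/^.*?\[fpath:([^\s\]]+)]([^a]|[a])*/, "$1");

// [\s\S]* is a common ECMAScript spelling of the same "any character" idea.
const alternative: string = body.replace(/^.*?\[fpath:([^\s\]]+)][\s\S]*/, "$1");

console.log(original === body, suggested, alternative);
```

Whether the [\s\S]* variant is accepted by Make's replace() has not been checked here; it is shown only because it is the more familiar form in JavaScript.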


Jim - The Monday Man (YouTube Channel)

2 Likes

Brilliant. Thank you!

Thank you for the amazing tips!

We have combined HTTP, RegEx and Hash to keep track of changes in websites that have tables with the status of a series of licensing processes. If the hash of the captured portion of the HTTP response changes, we alert the user interested in that particular record and include the HTML table with the updated information.

It was a breakthrough for us since the service provider didn’t have any APIs.
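
A minimal TypeScript sketch of that idea, under several assumptions: the pages are plain strings standing in for the HTTP response, the table markup and record are invented, and the small FNV-1a-style hash is used only to keep the example dependency-free (the post does not say which hash was used). All function names are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units. Chosen only to keep the
// sketch dependency-free; the post does not say which hash was used.
function hashText(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Capture the first <table>...</table> block of a page (hypothetical markup).
function captureTable(page: string): string | null {
  const match = page.match(/<table[\s\S]*?<\/table>/i);
  return match ? match[0] : null;
}

// Returns the new hash when the captured portion changed, otherwise null.
function detectChange(page: string, previousHash: string): string | null {
  const table = captureTable(page);
  if (table === null) return null;
  const currentHash = hashText(table);
  return currentHash === previousHash ? null : currentHash;
}

// Invented pages standing in for two fetches of the same URL.
const before = "<html><table><tr><td>License 42</td><td>Pending</td></tr></table></html>";
const after = "<html><table><tr><td>License 42</td><td>Approved</td></tr></table></html>";

const storedHash = hashText(captureTable(before) ?? "");
console.log(detectChange(before, storedHash), detectChange(after, storedHash));
```

Hashing only the captured table, rather than the whole page, keeps unrelated page changes (ads, timestamps, session tokens) from triggering alerts.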

3 Likes

This is absolutely awesome @Bjorn.drivn!

If there’s one thing I might add, for people who have never used RegEx before, regexone.com is a great place to learn the basics. You can “cheat” on some of the tasks and complete them without the operators the lesson is trying to teach you, but if you follow the intended steps it’ll give you a decent overview of what you can do and more importantly, how to do it. :slight_smile:

1 Like