How to extract and remove elements from HTML code

rasmusmp · April 16, 2024, 9:44am

Hi good people

I am trying to pass some html code from one WordPress installation to another. But of course the two installations have slightly different configurations, so I have to alter the html of the content.

My use case is this:

First I want to identify certain elements in the html-code.
Then I want to remove some of them from the html-code. But the rest of the code has to remain html.

For example - here is an html input coming into make.com:

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Article Title</title>
</head>
<body>

<header>
    <h1>Welcome to My Article</h1>
</header>
<article>
    <h2>Introduction</h2>
    <p>This is the introduction of the article. It sets the stage for the content that will follow. Here you can include interesting facts, figures, and foreshadow the main points that will be covered.</p>
</article>
    
    <div class="subsection">
        <h2>Main Section</h2>
        <p>This is the main section of the article. This part is significantly important as it carries the core information intended for the reader. It's advisable to break it down into several paragraphs to enhance clarity and reader engagement.</p>

        <h3>Subsection A</h3>
        <p>Details about Subsection A come here. It’s good to include data, statistics, and other in-depth info that supports the main article topic.</p>
    </div>

<footer>
    <p>Copyright © 2024 Your Website</p>
</footer>

</body>

I want to get everything within the title tag and the article tag:

So “Article Title” should be output number 1.
And this should be output number 2:

<h2>Introduction</h2>
    <p>This is the introduction of the article. It sets the stage for the content that will follow. Here you can include interesting facts, figures, and foreshadow the main points that will be covered.</p>

Then I want the rest of the html, without the first two outputs and without the footer, to be output 3.

So first I will remove output 1, then remove output 2, then look for the footer tag and remove it. And then output 3 should be:

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>

<header>
    <h1>Welcome to My Article</h1>
</header>
    
    <div class="subsection">
        <h2>Main Section</h2>
        <p>This is the main section of the article. This part is significantly important as it carries the core information intended for the reader. It's advisable to break it down into several paragraphs to enhance clarity and reader engagement.</p>

        <h3>Subsection A</h3>
        <p>Details about Subsection A come here. It’s good to include data, statistics, and other in-depth info that supports the main article topic.</p>
    </div>

</body>

I have tried to work with arrays and with the built-in functions “replace” and “get”. But I can’t seem to get my head around the right solution.

Can any of you guide me towards the correct tools to use? Then I will try to code it following your examples.

Thank you in advance.

Best regards,
Rasmus MP

samliew · April 16, 2024, 12:51pm

Welcome to the Make community!

You can use a Text Parser “Match Pattern” module with this Pattern (regular expression):

<title>\s*(?<title>[^<]+)\s*<\/title>[\w\W]+<article>\s*(?<article>[\w\W]+?)\s*<\/article>

Proof https://regex101.com/r/NABD4d

Important Info

Global match must be set to NO!
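
For anyone who wants to try the pattern outside Make, here is a minimal Python sketch of the same idea. It is only an illustration, not how the Text Parser module is implemented. Two assumptions: the named groups are rewritten as `(?P<name>...)` because Python's `re` module does not accept the ECMAScript `(?<name>...)` spelling, and the sample HTML is a shortened version of the input from the question. Using `search()` returns a single match, which roughly mirrors setting Global match to No.

```python
import re

# Pattern from the reply above. The only change is (?<name>...) -> (?P<name>...),
# which is the named-group syntax Python's re module expects.
PATTERN = re.compile(
    r"<title>\s*(?P<title>[^<]+)\s*<\/title>[\w\W]+<article>\s*(?P<article>[\w\W]+?)\s*<\/article>"
)

# Shortened version of the sample input from the question (assumption).
html = """<head>
    <title>Article Title</title>
</head>
<body>
<article>
    <h2>Introduction</h2>
    <p>This is the introduction of the article.</p>
</article>
<footer>
    <p>Copyright © 2024 Your Website</p>
</footer>
</body>"""

# search() stops at the first match, similar to "Global match: No".
match = PATTERN.search(html)
title = match.group("title") if match else None
article = match.group("article") if match else None

print(title)    # output 1: the text inside <title>
print(article)  # output 2: the markup inside <article>
```

Note that the `[\w\W]+` between the two tags is greedy, so if the page contained several article tags this pattern would capture the last one, not the first.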

For more information, see Text Parser in the Make Help Center:

Match Pattern
The Match pattern module enables you to find and extract string elements matching a search pattern from a given text. The search pattern is a regular expression (aka regex or regexp), which is a sequence of characters in which each character is either a metacharacter, having a special meaning, or a regular character that has a literal meaning.

The complete list of metacharacters can be found on the MDN web docs website.

For a tutorial on how to create regular expressions, we recommend the RegexOne website.

For an easy, quick regex generator, try the Regular Expressions generator.

For experimenting with regular expressions, we recommend the regular expressions 101 website. Just make sure to tick the ECMAScript (JavaScript) FLAVOR in the left panel.

Hope this helps!

rasmusmp · April 16, 2024, 2:02pm

Hi Samliew

Thank you so much. This is an approach I haven’t tried yet.

I will test in a couple of hours whether it works in my environment, and I will reply again when I have the results.

Thank you for your help so far.

Best,
Rasmus MP

rasmusmp · April 17, 2024, 5:48pm

Hi again Samliew (or others?)

Once again, thank you so much for helping out. With your help and ChatGPT’s, I think I have now managed to build all the regex that I needed, even though I had never tried this language before.

One Make question, if I may?

The last operation I want to do is to use all the output provided to remove that text from the original html. What tool in make.com would you use to remove it?

Thank you in advance.

Best,
Rasmus MP

samliew · April 17, 2024, 11:19pm

The Text Parser also has a Replace module. Use two of those (or you can try using the built-in replace function).

Replace the matched text with {{emptystring}}
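
As a rough illustration of the replace-with-empty-string idea, here is a small Python sketch. It is not Make's implementation. The helper `remove_element` is a hypothetical name, the sample HTML is a shortened version of the input from the question, and the sketch assumes the tags themselves should be removed along with their content, as in the expected output 3. Each call plays the role of one Replace step whose replacement is an empty string.

```python
import re

# Shortened version of the sample input from the question (assumption).
html = """<head>
    <meta charset="UTF-8">
    <title>Article Title</title>
</head>
<body>
<header>
    <h1>Welcome to My Article</h1>
</header>
<article>
    <h2>Introduction</h2>
    <p>This is the introduction of the article.</p>
</article>
<div class="subsection">
    <h2>Main Section</h2>
</div>
<footer>
    <p>Copyright © 2024 Your Website</p>
</footer>
</body>"""


def remove_element(text: str, tag: str) -> str:
    """Hypothetical helper: delete the first <tag>...</tag> block, tags included.

    The leading whitespace before the opening tag is removed as well,
    so no empty line is left behind.
    """
    name = re.escape(tag)
    pattern = rf"\s*<{name}>[\w\W]*?<\/{name}>"
    # The empty replacement string corresponds to {{emptystring}} in Make.
    return re.sub(pattern, "", text, count=1)


# One removal per element, in the order described in the question.
rest = html
for tag in ("title", "article", "footer"):
    rest = remove_element(rest, tag)

print(rest)  # output 3: the remaining HTML
```

The pattern only handles tags without attributes, which is enough for the sample; real pages with attributes or nested elements of the same name would need a more careful pattern or an HTML parser.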

rasmusmp · April 17, 2024, 11:23pm

Thank you!

It was the emptystring that I didn’t know about.

samliew · April 17, 2024, 11:25pm

No problem, glad I could help!
