Overview of Different Web Scraping Techniques in Make 🌐

As this is a commonly asked question, I’ve created this post to explore the different methods for web scraping using Make. Each method offers a different level of complexity and control.

Traditional Web Scraping + Text Parser

If you don’t want to rely on external services, which may not be free, you can always fetch the content of the page using the HTTP “Make a request” module, then use a Text Parser “Match Pattern” module to find and return content from the page’s source code.

To do this effectively, you need to know how to set up regular expression patterns, which can get complex very quickly if you want to match multiple pieces of content across the page with a single Match Pattern module. Alternatively, you can use one Match Pattern module per piece of content you want to extract, but that uses more operations.
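
For example, here is a minimal sketch of a pattern using named capture groups, assuming a hypothetical page that puts the product title in <h1 class="title"> and the price in <span class="price"> (adjust the tags and classes to your target page):

    <h1 class="title">(?<title>[^<]+)</h1>[\s\S]*?<span class="price">(?<price>[\d.]+)</span>

Map the HTTP module’s “Data” output into the Match Pattern module’s text field, and each named group becomes its own output variable that you can map in later modules.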

Alternatives to consider:

  • XML “Perform XPath Query” —
    You can extract items using XPath expressions instead of regex, but you have to use one module per extraction.
  • Set Multiple Variables —
    You can use the replace() function with a pattern that matches everything around the content you want, removing the unwanted text and leaving just the “match” behind (minimal sketches of both approaches follow this list).
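
Both sketches below are illustrative and assume a hypothetical <span class="price"> element; adjust them to match your target page.

An XPath query for the “Perform XPath Query” module:

    //span[@class="price"]/text()

A Set Multiple Variables mapping that uses replace() to strip away everything around the price, leaving only the match (1.data stands for the HTTP module’s output; substitute your actual mapping):

    {{replace(1.data; /[\s\S]*<span class="price">|<\/span>[\s\S]*/g; emptystring)}}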


Hosted Web Scraping

If you don’t want to handle the scraping yourself, you can use dedicated apps like ScrapingBee and ScrapeNinja to get content from the page.

ScrapeNinja supports jQuery-like selectors in its extractor function, which is how you select elements on the page. This means no regular expressions are involved, but you can still use regex inside the extractor function if you wish.
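
For example, here is a minimal sketch of an extractor function. The h1 and .price selectors are assumptions, so point them at your target page’s actual elements, and check ScrapeNinja’s documentation for the exact extractor signature:

    // ScrapeNinja passes the fetched HTML and a cheerio instance to the extractor
    function extract(input, cheerio) {
      let $ = cheerio.load(input);
      return {
        // jQuery-like selectors, no regular expressions needed
        title: $('h1').first().text().trim(),
        price: $('.price').first().text().trim()
      };
    }

The returned object is the extractor’s result, so each property can be mapped in subsequent modules.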

The main advantage of hosted web scraping services like ScrapeNinja is that they can handle and bypass anti-scraping measures, and they run the page in a real web browser, loading all the content and executing the page-load scripts so the result closely matches what you see in your own browser, as opposed to the raw page HTML fetched by the HTTP module. Dedicated scraping services like these make scraping much more reliable, because they specialize in one thing and do it well.

If you want an example of ScrapeNinja usage, take a look at Grab data from page and url - #5 by samliew


Either of the Above + AI Structured Data Extraction

You can use either traditional HTTP scraping or a hosted web scraping service to fetch the source code of the target page, then feed it through an AI that transforms it into structured data (outputting variables/collections, or JSON that you then put through a Parse JSON module).

This gives you the flexibility to extract content into complex data structures (collections), but it requires some prompt engineering and setting up of the data structure, whether via fields (OpenAI) or JSON in the prompt itself (Groq).
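
For example, here is a minimal sketch of the JSON-in-the-prompt approach (the field names are hypothetical, and 1.data stands for the HTML fetched earlier):

    Extract the product details from the HTML below.
    Reply with ONLY valid JSON matching this structure, and nothing else:
    {"title": "", "price": "", "inStock": true}

    HTML:
    {{1.data}}

Map the AI module’s reply into a Parse JSON module to turn it into mappable variables; with OpenAI, you can define the same fields in the module itself and skip the Parse JSON step.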


AI-powered Web Scraping

This is probably the easiest and quickest way to set up, because all you need to do is describe the content that you want, instead of inspecting elements to create selectors or coming up with regular expression patterns.

The plus side is that such services combine BOTH the fetching and the extraction of the data in a single module (saving operations), doing away with the lengthy setup of the other methods.

Here is a simple example using the Dumpling AI “Extract data from URL” module: just map the URL variable in the module and add the fields that you want extracted from the page (you don’t even need to specify the type of data). You can set this up easily within a few seconds.

Also, if you don’t want structured data and just want to pass the page content to another AI for further analysis, you can use the “Scrape URL” module, which also removes unnecessary elements like headers and footers, leaving just the main/article content! This is extremely useful for preparing content for LLMs (e.g. OpenAI, HuggingFace).

To learn more about Dumpling AI, see the official documentation at Introduction - Dumpling AI Docs


To summarise: for those comfortable with regular expressions, traditional web scraping with the “Make a request” and “Match Pattern” modules gives precise control over data extraction, though it can become complex when dealing with multiple pieces of content. Hosted web scraping services like ScrapeNinja offer a more user-friendly approach, with jQuery-like selectors and the ability to handle anti-scraping measures. AI-powered web scraping with tools like Dumpling AI provides the easiest and quickest setup, requiring only a description of the desired content; you trade some control over the exact data points for ease of use.

Please leave a comment below if you have other ways you do web scraping.

View my profile for more useful links and articles like these (you need to be logged in to view forum profiles):

— @samliew —> connect with me

Professional Services

Need help with complex web scraping requirements, building a pattern for your Text Parser, AI prompt engineering, or have some other Make-related question?
—> Get Expert Help


P.S.: Did you know that about 70% of the questions asked on this forum are already covered in the Make Academy? Investing some effort into it will save you lots of time and frustration when using Make later!


(reserved post slot in case the above article needs extending in future)

If you have a question about implementing one of the above for your scenario, please start a new thread. You are more likely to receive help since newer questions are monitored closely.

Don’t forget to like and bookmark this topic so you can get back to it easily in future!

