Overview of Different Web Scraping Techniques in Make 🌐

As this is a commonly asked question, I’ve created this post to explore the different methods for web scraping using Make. Each method offers varying levels of complexity and control.

Traditional Web Scraping + Text Parser

If you don’t want to rely on external services (which may not be free), you can always fetch the content of the page using the HTTP “Make a request” module, then use a Text Parser “Match Pattern” module to find and return the content you want from the page’s source code.

To do this effectively, you need to know how to set up regular expression patterns, which can get complex very quickly if you want to match multiple pieces of content across the page with a single Match Pattern module. Alternatively, you can use one Match Pattern module per piece of content you want to extract, but that uses more operations.
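
For example, here is a minimal sketch of the kind of named-capture-group pattern you might paste into a Text Parser “Match Pattern” module (shown in JavaScript purely to demonstrate the matching; the HTML snippet and class names are made-up assumptions, so inspect your own page’s source and adjust):

```javascript
// A minimal sketch: one pattern that grabs two pieces of content at once.
// The HTML and the class names ("product-title", "price") are hypothetical;
// inspect the real page source and adjust. In Make, named capture groups
// typically become the Match Pattern module's output variables.
const html =
  '<h1 class="product-title">Blue Widget</h1> ... <span class="price">$19.99</span>';

const pattern =
  /<h1 class="product-title">(?<title>[^<]*)<\/h1>[\s\S]*?<span class="price">(?<price>[^<]*)<\/span>/;

const match = html.match(pattern);
console.log(match.groups.title); // "Blue Widget"
console.log(match.groups.price); // "$19.99"
```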

Alternatives to consider:

  • XML “Perform XPath Query” —
    You can extract items using XPath, but you need one module per extraction.
  • Set Multiple Variables —
    You can use an inverted regular expression with the replace() function to strip out unwanted content, leaving only the “match” behind (see the sketch after this list).
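
For illustration, here is a rough JavaScript equivalent of that replace() trick; the HTML and the pattern are made-up assumptions, and in Make you would build the same idea with the built-in replace() function inside a Set Multiple Variables module:

```javascript
// A rough sketch of the "strip everything except the match" trick.
// The HTML and pattern below are hypothetical; adjust for the real page.
const html =
  "<html><head><title>Blue Widget | Example Shop</title></head><body>…</body></html>";

// Capture the <title> text and replace the entire string with just that capture,
// effectively removing all the unwanted content around it.
const pageTitle = html.replace(/^[\s\S]*?<title>([\s\S]*?)<\/title>[\s\S]*$/, "$1");

console.log(pageTitle); // "Blue Widget | Example Shop"
```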

Need help with complex web scraping requirements, building a pattern for your Text Parser, AI prompt engineering, or have some other Make-related question?
—> Let’s Talk

Hosted Web Scraping

If you don’t want to deal with web scraping yourself, you can use apps such as ScrapingBee and ScrapeNinja to get content from the page.

ScrapeNinja supports jQuery-like selectors in its extractor function, which is how you select elements on the page. This means no regular expressions are required, though you can still use regex inside the extractor function if you wish.

The main advantage of hosted web scraping services like ScrapeNinja is that they can handle and bypass anti-scraping measures, running the page in a real web browser, loading all the content, and executing the page-load scripts, so the result closely matches what you see in your browser rather than just the raw HTML fetched by the HTTP module. Dedicated scraping services like these make scraping much more reliable, because they specialize in one thing and do it well.
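
For a rough idea of what an extractor function looks like, here is a minimal ScrapeNinja-style sketch; the function signature (raw HTML in, a cheerio instance for jQuery-like selection) and the selectors are assumptions for illustration, so check the ScrapeNinja documentation and your target page for the real details:

```javascript
// A minimal sketch of a ScrapeNinja-style extractor function.
// The signature and the selectors are assumptions; adjust to your page.
function extract(input, cheerio) {
  const $ = cheerio.load(input); // parse the rendered page HTML

  return {
    title: $("h1.product-title").first().text().trim(), // hypothetical selector
    price: $("span.price").first().text().trim(),       // hypothetical selector
    images: $("img.gallery")                             // hypothetical selector
      .map((i, el) => $(el).attr("src"))
      .get(),
  };
}
```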

If you want an example of ScrapeNinja usage, take a look at Grab data from page and url - #5 by samliew

Need help with complex web scraping requirements, building a pattern for your Text Parser, AI prompt engineering, or have some other Make-related question?
—> Book a Consultation

Either of the Above + AI Structured Data Extraction

You can use either the traditional HTTP scraping or the hosted web scraping method to fetch the source code of the target page, then feed it through an AI that transforms it into structured data (outputting variables/collections, or JSON that you then run through a Parse JSON module).

This gives you the flexibility to extract content into complex data structures (collections), but it requires some prompt engineering and setting up of the data structure, whether via fields (OpenAI) or JSON in the prompt itself (Groq).
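
As a minimal sketch of the idea (shown with a plain fetch call rather than the Make modules, and with an illustrative model name and made-up fields): you hand the page source to a chat-completion endpoint, ask it to reply with JSON only, and then parse the result. In Make, the OpenAI or Groq module plus a Parse JSON module play these roles:

```javascript
// A minimal sketch: page HTML in, structured JSON out.
// The model name and the fields in the schema are assumptions for illustration.
const pageHtml = "<html>…page source from the HTTP or ScrapeNinja module…</html>";

const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4o-mini", // illustrative model name
    response_format: { type: "json_object" }, // ask the model to reply with JSON only
    messages: [
      {
        role: "system",
        content:
          'Extract {"title": string, "price": string, "inStock": boolean} ' +
          "from the HTML the user provides. Reply with JSON only.",
      },
      { role: "user", content: pageHtml },
    ],
  }),
});

const completion = await response.json();
const extracted = JSON.parse(completion.choices[0].message.content);
console.log(extracted.title, extracted.price, extracted.inStock);
```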

Need help with complex web scraping requirements, building a pattern for your Text Parser, AI prompt engineering, or have some other Make-related question?
—> Submit Enquiry

AI-powered Web Scraping

This is probably the easiest and quickest method to set up, because all you need to do is describe the content that you want, instead of inspecting elements to create selectors or coming up with regular expression patterns.

The plus side is that such services combine BOTH fetching and extraction of the data in a single module (saving operations), doing away with the lengthy setup required by the other methods.

For example, with the Dumpling AI “Extract data from URL” module you can set this up within a few seconds: just map the URL variable in the module and add the fields that you want extracted from the page (you don’t even need to specify the type of data).

Also, if you don’t want structured data and just want to pass the page content to another AI for further analysis, you can use the “Scrape URL” module, which also removes unnecessary elements like headers and footers, leaving just the main/article content. This is extremely useful for feeding clean page content to LLMs (e.g. OpenAI or Hugging Face models), whether for analysis or for building training data.

To learn more about Dumpling AI, see the official documentation at API Reference - DumplingAI Docs


For those comfortable with regular expressions, traditional web scraping with the “Make a request” and “Match Pattern” modules allows precise control over data extraction. However, this method can become complex when dealing with multiple content points. Hosted web scraping services like ScrapeNinja offer a more user-friendly approach, with jQuery-like selectors and the ability to handle anti-scraping measures. AI-powered web scraping with tools like Dumpling AI provides the easiest and quickest setup, requiring only a description of the desired content. This method offers great ease of use but potentially less control over the specific data points.

View my profile for more useful links and articles like these (you may need to be logged in to view forum profiles):

— @samliew —> connect with me

Here is more information about the Dumpling AI integration in Make.

AI Agents

AI agents can be grounded in your own data and knowledge base for RAG (Retrieval-Augmented Generation). You can set one up in the dashboard and then call the Dumpling AI “Generate AI Agent Completion” module:

Runs AI Agent completion and returns the result

For more information, see the official documentation at Build Custom AI Agents, Simply.

Run JavaScript (with plugins)

If you need to run JavaScript/TypeScript with JS libraries (NPM packages) in your scenario, you can consider Dumpling AI’s “JavaScript Code Execution API” available via the “Run Javascript Code” module —

Run your javascript or typescript code and get the result back.

The official documentation on how to use NPM modules with this module can be found here.
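
To give a feel for it, here is a hedged sketch of the kind of snippet you might run in such a code-execution module, assuming the sandbox lets you require() an NPM package and hand back a return value (the package, the data, and the return convention are all assumptions; see the linked documentation for the exact rules):

```javascript
// A hedged sketch only: assumes the sandbox exposes require() for NPM packages
// and that the module's output is whatever you return.
const dayjs = require("dayjs"); // assumed-available NPM package

// Example task: reformat a list of ISO dates and compute their age in days.
const invoiceDates = ["2024-01-31", "2024-02-29", "2024-03-31"];

const result = invoiceDates.map((d) => ({
  date: dayjs(d).format("DD MMM YYYY"),
  daysAgo: dayjs().diff(dayjs(d), "day"),
}));

// Hand the array back to the scenario as the module's output.
return result;
```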

Dumpling AI also does so much more; see also:

Examples of How to use Dumpling AI

For more information, see the Dumpling AI tutorials, grouped by category:

  • YouTube & Videos
  • Image Generation
  • AI Agents & RAGs
  • Searching & Scraping
  • Other Data Extraction
  • Business & Social

Dumpling AI Tutorials

In short, Dumpling AI can replace several other paid services that, combined, would cost more than Dumpling AI itself, making it a noteworthy choice as the “multi-tool” of AI services.

How to Use

For more information on how to set this up, refer to these forum threads:

View my profile for more useful links and articles like these!

— @samliew —> Connect with me
