Issues with trying to build a page scraper

:bullseye: What is your goal?

Scrape two websites:
1. One that loads content as you scroll
2. An indexed content site

:thinking: What is the problem & what have you tried?

Hi!
Based on a request, I’ve been trying to build a web scraper for:

1. A website that loads content as you scroll; I can’t get past the infinite scroll to reach the links I’m looking to scrape.
2. An indexed website, where I can’t retrieve the total set of links for the specific section I’m scraping.

I’ve read and searched online and tried several approaches with the HTTP request, but either:
1. No links come up at all (from the page that loads as you scroll), or
2. Only one link comes up (and only from page one of the indexed site).

I’m having a really hard time and can’t work out how to do this with the HTTP request and parser modules.

Any ideas?

:camera_with_flash: Screenshots (scenario flow, module settings, errors)

I think the core issue here is that HTTP requests only fetch static HTML; they can’t execute JavaScript or handle infinite scroll.

That’s why you’re getting 0 or 1 link.

For website 1, you can use Apify; it has specific actors built for handling infinite-scroll pages.

For website 2, it’s possible to use HTTP + Iterator/Aggregator, but you need to take mapping, pagination, and parsing into account (see the sketch below).
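Roughly, the pagination loop looks like this sketch in plain JavaScript (the `?page=N` URL scheme is hypothetical; check how your target site actually paginates):

```js
// Minimal sketch of paginated link collection, assuming the site
// exposes plain "?page=N" URLs (hypothetical) and serves static HTML.
async function collectLinks(baseUrl) {
  const links = [];
  for (let page = 1; page <= 50; page++) { // safety cap on pages
    const res = await fetch(`${baseUrl}?page=${page}`);
    if (!res.ok) break; // stop when the site runs out of pages
    const html = await res.text();
    // Naive href extraction; see the regex notes further down the thread
    const found = [...html.matchAll(/href="([^"]+)"/g)].map((m) => m[1]);
    if (found.length === 0) break; // an empty page means no more results
    links.push(...found);
  }
  return links;
}
```

In Make itself, the same loop is a Repeater module feeding the page number into the HTTP module’s URL, a Text Parser matching the links, and an Array Aggregator merging everything into one bundle.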


Hi Karmic!!

Thank you so much for your response!
Sorry if this sounds too basic, but I’m not very experienced with Make yet (or at least with web parsing).
Can you provide a bit more detail on how I could do this? I’m both super stuck and a bit overwhelmed with frustration at the moment :frowning:


Welcome to the Make community!

So you basically need to “visit” the site to get the content. This is called web scraping. It can seem fairly simple, but gets complex very quickly if you run into the issues described below.

Incomplete Scraping; No Errors?

1. Anti-Scraping; Anti-Bot Measures

Are you getting no output from the HTTP “Make a request” module? This is likely because the website has employed anti-scraping measures, detected that the visit was not made by a human, and silently blocked the request by returning no content. Hence, you cannot use normal scraping integrations like the HTTP “Make a request” module to fetch pages from websites like these. This is NOT a Make platform, HTTP, Text Parser, or Regular Expression issue/bug.
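You can sometimes confirm this outside of Make: a blocked request often still returns HTTP 200, just with an empty body or a challenge page. A rough illustration (the URL is a placeholder):

```js
// Hedged illustration: anti-bot systems often return HTTP 200 with an
// empty body or a challenge page instead of a clear error.
const res = await fetch("https://example.com/listing", {
  headers: { "User-Agent": "Mozilla/5.0" }, // a browser-like UA alone is often not enough
});
const html = await res.text();
if (html.length < 500 || /captcha|access denied/i.test(html)) {
  console.warn("Likely blocked; a dedicated scraping service is needed.");
}
```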

Example: Scraping Bee Integration Runtime Error 400

2. Script Tags Do Not Run

Are you getting NO output from the Text Parser “HTML to Text” module? This is because there is NO text content in the HTML! The entire page content you are scraping is likely hosted in a script tag, dynamically generated and placed onto the page by JavaScript running in the user’s web browser (e.g. when the page loads, or when an action is taken, like scrolling).

Make is a server-side runtime environment, so the HTTP modules only fetch the initial page code, and all script tags are ignored by the Text Parser “HTML to Text” module because a script tag is not an HTML layout element. Furthermore, the HTTP “Make a request” module does not run any of those scripts, so no content is loaded onto the page. You’ll probably just get a default message telling you to enable JavaScript.
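If you inspect the raw HTML and find the data sitting in a script tag as a JSON blob, you can sometimes pull it out directly without running any JavaScript. A sketch (the `window.__DATA__` variable name and URL are hypothetical; view your page’s source to find the real ones):

```js
const html = await (await fetch("https://example.com/feed")).text();
// Look for an embedded JSON blob; "__DATA__" is a made-up name for illustration
const match = html.match(/window\.__DATA__\s*=\s*(\{.*?\});/s);
if (match) {
  console.log(JSON.parse(match[1])); // the content the browser would have rendered
} else {
  console.log("No embedded data; the page fetches its content after load.");
}
```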

3. Incorrect Regular Expression Pattern

Are you getting the same output as the input when using the Text Parser “Match Pattern” module? Your regular expression pattern may simply be incorrect. Bear in mind that every page is different, so a pattern usually only works for a specific page, and it needs to be built to handle the raw output from the website. One way of building and testing a regular expression pattern is with regex101.com, a popular tool that I use.
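For example, a pattern for pulling links out of raw HTML might look like this (test it against your actual page source on regex101.com first):

```js
const html = '<a href="/item/1">One</a> <a class="x" href="/item/2">Two</a>';
// Capture the href of every anchor tag, regardless of other attributes
const links = [...html.matchAll(/<a\s[^>]*href="([^"]+)"/gi)].map((m) => m[1]);
console.log(links); // ["/item/1", "/item/2"]
```

In the Text Parser “Match Pattern” module, the equivalent setup is the same pattern with Global match enabled.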

Running Page Scripts; Emulating User Input

For web scraping, one service you can use to get content from a page is ScrapeNinja.

ScrapeNinja lets you use jQuery-like selectors to extract content from elements via an extractor function. This is way easier than coming up with a valid and robust[1] regular expression pattern!
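To make that concrete, here’s a minimal extractor-function sketch in the cheerio-based style ScrapeNinja uses (the `.results a` selector is a placeholder; adjust it to your page’s structure):

```js
function extract(input, cheerio) {
  // Load the fetched HTML into a jQuery-like API
  const $ = cheerio.load(input);
  // Collect the href of every link inside a hypothetical ".results" container
  return $(".results a")
    .map((i, el) => $(el).attr("href"))
    .get();
}
```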

ScrapeNinja can also run the page in a real web browser, loading all the content and running the page-load scripts, so it closely simulates what you see in your browser, as opposed to just the raw page HTML. It can even perform user actions, like clicking elements on the page!

Example: Grab data from page and url

Some tools that ScrapeNinja has graciously provided for free

Use this to test the scraping parameters on web pages:

Use these to build and test the “extractor function”:

If you need help with the above tools, please start a new topic.

AI-powered Web Scraping

You can also use AI-powered web scraping tools like Dumpling AI.

This is probably the easiest and quickest way to set up, because all you need to do is describe the content you want via a prompt.

The plus side is that such services combine BOTH fetching and extraction of the data in a single module (saving operations), doing away with the lengthy setup and maintenance of the other methods described in the previous sections.

More information; Other methods

For more information on the different methods of web scraping, see my full community blog post here: Overview of Different Web Scraping Techniques in Make 🌐

Hope this helps! If you are still having trouble, please provide more details.

@samliew


1. A robust regular expression is one that is reliable and efficient, handles various potential inputs and edge cases, and fails gracefully. ↩︎