A web scraping method for many websites with different structures and content

Hi

I wonder if anyone can help me.

I want to scrape data from different companies’ websites and then feed the data (HTML text content) from each website to a ChatGPT module to get insights. The problem is that every company’s website differs in structure and content. Some examples of the kind of insights I want from each website are:

  • The kind of work the company is involved in
  • The company’s mission
  • Where they’re located
  • The company’s contact email address (if shared on the site)

As you can imagine, this requires guessing where this content will live on each website, and whether it exists at all. Given that each company’s website has a different structure, I currently have a very complex flow:

  • Right now, I’m having to develop a complex scraping process on Make.com, where I first ping (make an HTTP request to) the home page to extract all its HTML content. From the home page, I need to see if there’s a sitemap XML file to work out the website’s structure.

Sitemap file available scenario:

  • First I use regex to see if there’s a link to a sitemap XML file. I have to try several different regex patterns, e.g. to cover both the <a> tag and the <link> tag (see the Python sketch after this list).

  • Then I have to feed the result into another ChatGPT module to identify the most relevant sitemap link out of any duplicates and irrelevant sitemap links that come up.

  • If the regex fails, the fallback is to ping the home page, extract the entire home page content, and feed it to ChatGPT, asking it to find the sitemap link.

  • If there’s a sitemap link, then ping it (HTTP request) to retrieve the sitemap XML file, which gives the website’s structure and links.

  • Then feed those links to a ChatGPT module to ask it to intelligently identify which internal links are relevant to the insights I need from the company (as mentioned earlier).

  • Then from the resulting links, iterate through them, pinging each link to get the HTML text content.

  • Then finally feed the text to a ChatGPT module to generate insights.
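
For reference, the kind of regex step I mean looks roughly like this as a Python sketch (the pattern is illustrative, not something battle-tested):

```python
import re

# Illustrative pattern covering both <a href="..."> and <link href="...">
# forms; real-world HTML varies, so treat this as a starting point.
SITEMAP_LINK_RE = re.compile(
    r"""<(?:a|link)\b[^>]*href=["']([^"']*sitemap[^"']*\.xml)["']""",
    re.IGNORECASE,
)

def find_sitemap_links(html: str) -> list[str]:
    """Return every sitemap-looking URL referenced in the page markup."""
    return SITEMAP_LINK_RE.findall(html)
```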

No Sitemap file available scenario:

  • If there’s no sitemap link on the website, then the last fallback is to use an HTML Link Extractor module to parse the homepage HTML and extract all internal links (URLs under the same domain as the homepage); a rough sketch of this step follows after this list.
  • Then I have to feed these links to a ChatGPT module to ask it to intelligently identify which of the internal links are relevant for the insights I need (mentioned earlier, i.e. the company’s type of work, mission, location, contact email address, etc.).
  • Then I would iterate through the resulting links, pinging each one to get the HTML text content.
  • Then feed this data to ChatGPT to generate insights.
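
For illustration, the fallback amounts to something like this Python sketch (the function names are mine; Make’s HTML Link Extractor module does the equivalent):

```python
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class InternalLinkParser(HTMLParser):
    """Collect <a href> targets that stay on the homepage's domain."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.domain = urlparse(base_url).netloc
        self.links: set[str] = set()

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            absolute = urljoin(self.base_url, href)
            if urlparse(absolute).netloc == self.domain:
                self.links.add(absolute.split("#")[0])  # drop fragments

def extract_internal_links(homepage_url: str) -> list[str]:
    parser = InternalLinkParser(homepage_url)
    parser.feed(requests.get(homepage_url, timeout=10).text)
    return sorted(parser.links)
```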

Given this complex flow, is there a simpler and still effective web scraping method that can scrape many websites of different structures and content, and intelligently get the insights I’m after (as mentioned earlier)?

Please note, there are thousands of websites I need to scrape. All the URLs are listed in a Google Sheet, so I need to iterate through every URL in the sheet to scrape each site.

The other aspect is ensuring my IP doesn’t get flagged and blocked, and that I don’t run into any anti-bot barriers.

Any guidance would be much appreciated.

Hey @Harry2 ,
You can easily do it with Airtop.ai, please ping me on LinkedIn and I would be happy to share a similar scenario.

Hi Amir,

I sent an inquiry to your support team for https://clickup.com/ 2 days ago, but haven’t heard anything yet.

You can find me here - Amir Ashkenazi - Airtop | LinkedIn

Apologies, I sent you the wrong link. But I did go on your website https://www.airtop.ai/. I tried to sign up via Google OAuth and got “This page isn’t working”. I tried to reload, but the same issue occurs. To be honest, this doesn’t fill me with confidence.

Sorry about it. We are investigating the reason.

We found the reason: the process broke because you did not have a last name in your profile. We are fixing it and I will ping you when it’s done.

Solved. Please try signing up again.
Sorry about the inconvenience and thank you for reporting the bug!

Hi @Harry2,

this should be totally possible by going one of two routes:

  1. Dedicated tool - I recommend exa.ai (which is basically an AI search engine which gives you results from different sites summarized)
  2. Manual method - This is what you tried

If you prefer to do it manually, here’s what I would do:

  1. HTTP request to https://companydomain.com/sitemap.xml (I don’t understand why you would have to search through the website code for the sitemap, when it is almost always under the /sitemap.xml path, and otherwise declared in robots.txt); see the Python sketch after this list

  2. From there, I’d do the same as you did: with the help of a ChatGPT request to an inexpensive model like 4o mini, find out which pages are important (always include the About Us page and the homepage). Make sure ChatGPT returns the pages in JSON format

  3. Now use an iterator to send an HTTP request to every page ChatGPT found

  4. For each page, use the HTML to text module to strip all markup so only the text remains

  5. Now you can throw all the pages’ text into ChatGPT to do whatever you want with it
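
By the way, if you ever want to see what steps 1 to 4 look like outside of Make, here’s a minimal Python sketch under the same assumptions (the robots.txt fallback and the helper names are my own; the ChatGPT selection step is omitted):

```python
import re
import requests
from xml.etree import ElementTree

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(domain: str) -> list[str]:
    """Step 1: fetch /sitemap.xml (falling back to robots.txt) and list its URLs."""
    candidates = [f"https://{domain}/sitemap.xml"]
    try:
        robots = requests.get(f"https://{domain}/robots.txt", timeout=10).text
        candidates += [line.split(":", 1)[1].strip()
                       for line in robots.splitlines()
                       if line.lower().startswith("sitemap:")]
    except requests.RequestException:
        pass
    for url in candidates:
        try:
            resp = requests.get(url, timeout=10)
            if resp.ok:
                tree = ElementTree.fromstring(resp.content)
                # Note: a sitemap *index* would list child sitemaps here, each
                # needing one more fetch; that hop is omitted in this sketch.
                return [loc.text for loc in tree.findall(".//sm:loc", SITEMAP_NS)]
        except (requests.RequestException, ElementTree.ParseError):
            continue
    return []

def page_text(url: str) -> str:
    """Steps 3-4: fetch a page and crudely strip the markup, keeping only text."""
    html = requests.get(url, timeout=10).text
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()
```

The texts returned by page_text can then be concatenated and handed to ChatGPT (step 5).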

I hope I could help a bit :smiley:

Hi Amir,

Thanks. I’ll give it a try at some point.

Best regards

Hi @Juliusforster

Many thanks for your response. I’ll try both solutions and see which one yields better results. :slightly_smiling_face:

1- Re your first point: to be honest, ChatGPT gave me the suggestion to search through the website code for the sitemap, in case it’s not placed in the default location. I guess it’s a possible hallucination on ChatGPT’s part.

Once again, I appreciate the solutions. Would you be so kind, if you have time, as to look at the other question I submitted: https://community.make.com/t/bundlevalidationerror-validation-failed-for-1-parameter-s-missing-value-of-required-parameter-url/63096

I posted it a few weeks ago and no response so far, unfortunately. Would really appreciate any help on this.

Many thanks :smiley:

Hi @Harry2,

just made a quick video for you on how exactly it would work. It works really well, by the way!
Great idea!

Here you go:

If that works, feel free to mark this as solution so that others can find it as well :smiley:

If there’s anything else you need help with, let me know!

1 Like

Wow!! You actually went as far as creating a video for the solution. That’s a first for me to see in a forum. So much appreciated @Juliusforster . Thank you :smiley: :clap: :clap: :clap:
It’s late for me now, so I’ll try it tomorrow morning and let you know :slightly_smiling_face:
The only remaining problem is the one from the other question I posted in the forum. I’m really stuck on it, and unfortunately it’s directly related to this pipeline: iterating through all the companies’ website URLs in a Google Sheet and making an HTTP GET request for each one. I can’t get the HTTP GET request module to recognise the dynamic values from the relevant Google Sheet column.

If you could help me with the other question if you have time, that would make the entire pipeline work. I’ll mark both questions as solved, and thank you so much in advance :smiley: :pray: :pray:

1 Like

Sure thing :smiley:

Forgot to attach the blueprint for the solution. Here you go:
blueprint (7).json (78.0 KB)

Will look at the other issue tomorrow. Feel free to remind me there by tagging me in case I forget it.

1 Like

Hi @Juliusforster

Thank you for the solution you shared. I tried your blueprint. It worked well. :slightly_smiling_face:

I made one modification:

For your first ChatGPT module, which receives the parsed sitemap.xml, I changed the prompt where you put the example site structure referencing ‘moonira.com’. I replaced it with the dynamic ‘Company Website’ value from the Google Sheet, because ChatGPT was producing results specifically for the moonira.com website rather than for the websites from the Google Sheet. Otherwise, everything worked perfectly :slight_smile:

I have a few questions:

1- To make this pipeline more complete: I read on a forum, and also watched a YouTube tutorial, about making sure your IP address doesn’t get flagged and blocked when making many website requests (especially with thousands of websites to go through). The advice is to mimic human-like behavior by including request headers in the HTTP GET request module’s configuration. Because some request headers are static while others are specific to the user’s browser, one tutorial suggested extracting your own request headers from Google Chrome’s ‘Network’ tab, copying both the headers and their values (on the right side, which I partially covered with a black box for privacy reasons) into the HTTP GET request module.
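
In plain Python terms, the header idea looks something like this (the header values are generic examples, not copied from my real browser session):

```python
import requests

# Generic browser-like headers; the User-Agent below is an example string,
# not one taken from an actual session.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
}

resp = requests.get("https://example.com/", headers=BROWSER_HEADERS, timeout=10)
print(resp.status_code)
```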

I also asked ChatGPT about this. Apart from recommending to include the Request Headers, it also suggested the following:

Proxy Setup in HTTP Module

To use a proxy for each HTTP request:

  1. Set up a proxy service.
  2. Configure the HTTP module in Make.com:
  • Go to the HTTP Request module.
  • Click on Advanced Settings.
  • Locate the Proxy field and add your proxy URL:

http://username:password@proxyserver.com:port

  • This ensures that each request is routed through the proxy.

Optional: Rotate Proxies

  • If you have multiple proxies, store them in a Google Sheet or Data Store.
  • Use the same randomization logic (as described in Step 1) to randomly select a proxy for each request.
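
As a rough Python illustration of that rotation idea (the proxy URLs are placeholders):

```python
import random
import requests

# Placeholder proxy URLs; in the Make flow these would come from a
# Google Sheet or Data Store instead.
PROXIES = [
    "http://username:password@proxy1.example.com:8080",
    "http://username:password@proxy2.example.com:8080",
]

proxy = random.choice(PROXIES)  # rotate: pick a different proxy per request
resp = requests.get(
    "https://example.com/",
    proxies={"http": proxy, "https": proxy},
    timeout=15,
)
```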

Add Random Delay Between Requests

To avoid detection and mimic human-like behavior:

  1. Add a Sleep module:
  • In Make.com, add the Sleep module (under Tools) between each HTTP request.
  2. Use random delays:
  • Use a Set Variable module to calculate a random delay. Note that Make’s formula language doesn’t support JavaScript’s Math object; the equivalent using Make’s own floor() function and random keyword is:

{{floor(random * (max - min + 1)) + min}}

  • Replace min and max with your desired range in seconds (e.g., 2 and 5); Make’s Sleep module takes its delay in seconds (up to 300).
  • Map the delay variable to the Sleep module.

CAPTCHA Handling:

  • If websites trigger CAPTCHA challenges, consider integrating a CAPTCHA-solving service like 2Captcha or Anti-Captcha.

What are your thoughts on this? Do you have a better recommendation for mimicking human-like behavior when making many website requests (so as not to be detected as a bot)?

Hi @Harry2,
yup, my website, moonira.com, was just a placeholder for me to check if it works.

1 - Yes, you can totally do that! Especially the request headers that tell the site what browser and OS you use are good to include.

2 - Using a proxy will cost you money and especially a lot of time to set up correctly. I would recommend not to do that.

One quick thing: you aren’t actually sending a ton of requests to each page, right? You send one to the sitemap.xml and then 5 to 10 to the pages that ChatGPT thinks are important. Adding the headers and a random delay between requests to the different pages of one website is more than enough.
Here’s how you’d add a random time delay between 0 - 60 seconds:
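A minimal version of that, assuming Make’s Sleep module and its built-in random keyword: map {{floor(random * 61)}} into the delay field to get 0 to 60 seconds.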

If you want to go a step further (with proxies and real scraping), I recommend you check out apify.com.

That is the most used platform for scraping all kinds of stuff. They have a whole marketplace of scrapers and all of them use proxies by default. I think you get like $5 for free with which you can already do a lot of scraping :smiley:

P.S. If a website requires captcha, it gets a lot more complex. It is possible that some of the website scrapers on apify have that built in, but if you want to set something like this up yourself, you need some good coding knowledge :face_exhaling:

@Juliusforster

2- My second question: in your blueprint flow, I wanted to add a Google Sheets module at the end, where the output from the ChatGPT module that generates the one-line icebreaker is added/updated in a cell in a particular column of the Google Sheet. Let’s say this is column ‘K’, called ‘Ice-breaker’.

Which Google Sheets module should I use to add the icebreaker from the ChatGPT module into each company’s entry under the ‘Ice-breaker’ column? I couldn’t work out whether it was ‘Update a Row’ or ‘Update a Cell’!

The ‘Update a Row’ module asks for the ‘Row Number’. I can’t specify a fixed row number, because I need to add the values (icebreaker text) dynamically for each company’s row under the ‘Ice-breaker’ column.

The ‘Update a Cell’ module likewise asks for a specific cell reference, which is not what I need:

Which module should I use?

When you get the data from Google Sheets, doesn’t it also give you the row number as output? You can map that ‘Row Number’ output into the ‘Update a Row’ module and fill in only the ‘Ice-breaker’ column, so each icebreaker lands in the right row.

@Juliusforster

3- My third question: what happens if some websites are no longer active (i.e. the HTTP GET request times out)? Will the HTTP GET request module throw an error and interrupt the entire flow, or will the flow continue to the next entries in the Google Sheet? If the flow does get interrupted by broken websites/URLs, should we put a conditional step in place so that any HTTP request that doesn’t return a 200 (or 200-299) response status code gets recorded in a separate Google Sheet column (say, ‘HTTP Response Status Code’)? That way we’d know which websites are broken, while making sure the flow continues processing the remaining entries without interruption.

The whole flow will be interrupted. You can right-click the HTTP module and add an “error handler”. I recommend you search YouTube for “make.com error handler guide”; several good guides come up covering the different ways to handle errors like this :smiley:
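
For comparison, the “record the status and keep going” pattern you describe looks roughly like this in Python (the URLs are placeholders):

```python
import requests

def fetch_with_status(url: str) -> tuple[int | None, str]:
    """Return (status_code, html); a None status means the request itself failed."""
    try:
        resp = requests.get(url, timeout=15)
        return resp.status_code, resp.text if resp.ok else ""
    except requests.RequestException:
        return None, ""

# Log the status per company so broken sites can be written back to the
# sheet while the loop carries on with the remaining rows.
for url in ["https://example.com/", "https://no-longer-active.example/"]:
    status, html = fetch_with_status(url)
    print(url, status if status is not None else "request failed")
```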

1 Like