Scraping data from multiple pages based on the main webpage

In my work, I have successfully connected Google Sheets containing URLs of various homepages or AI tools to CloudConvert, converting these URLs into JPG images of the website homepages. I then use these images to train OpenAI to answer questions based on the homepage content. However, websites typically have multiple important pages, such as pricing or about us.

I want to create a scenario where, based on one URL, all relevant pages of a website are retrieved and converted into JPG images. This way, OpenAI can answer questions using information from all pages within the website. How can I automate the retrieval of these additional pages and their conversion into JPG images for comprehensive training?

Welcome to the Make community!

You’ll have to extract all the links on the current page, and then use an Iterator to fetch each of the linked pages.

samliew (request private consultation)

Join the Make Fans Discord server to chat with other makers!

How do I extract all the links on the current page automatically, without having to manually visit the home page and click on every link to get the URLs of those subsequent pages?

Something like this perhaps?

You can use the Text Parser “Match Elements” module to get the URLs.
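Outside of Make, the same link extraction can be sketched in a few lines of Python. This is a rough equivalent of the Text Parser "Match Elements" module's "HTTP address" pattern, not the exact regex Make uses, and the sample HTML is made up for illustration:

```python
import re

# Sample page HTML (made up for illustration).
html = (
    '<a href="https://www.make.com/en/pricing">Pricing</a> '
    '<a href="https://www.make.com/en/about">About</a>'
)

# Match every absolute http(s) URL in the page source.
links = re.findall(r'https?://[^\s"\'<>]+', html)
print(links)
```

Each match would then become one bundle for the Iterator, just as the "Match Elements" module outputs one bundle per matched URL.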

[screenshot of the scenario modules]


Module Export

You can copy and paste this module export into your scenario. This will paste the modules shown in my screenshots above.

  1. Copy the JSON code below by clicking the copy button when you mouseover the top-right of the code block
    [screenshot of the copy button]

  2. Enter your scenario editor. Press ESC to close any dialogs. Press CTRL+V (the paste keyboard shortcut on Windows) to paste directly into the canvas.

  3. Click on each imported module and save it for validation. You may be prompted to remap some variables and connections.

Click to Expand Module Export Code

JSON - Copy and Paste this directly in the scenario editor

{"subflows":[{"flow":[{"id":39,"module":"http:ActionSendData","version":3,"parameters":{"handleErrors":true,"useNewZLibDeCompress":true},"mapper":{"url":"https://www.make.com/en","serializeUrl":false,"method":"get","headers":[],"qs":[],"bodyType":"","parseResponse":false,"authUser":"","authPass":"","timeout":"","shareCookies":false,"ca":"","rejectUnauthorized":true,"followRedirect":true,"useQuerystring":false,"gzip":true,"useMtls":false,"followAllRedirects":false},"metadata":{"designer":{"x":-8,"y":-1522,"name":"Get page content"},"restore":{"expect":{"method":{"mode":"chose","label":"GET"},"headers":{"mode":"chose","collapsed":true},"qs":{"mode":"chose","collapsed":true},"bodyType":{"collapsed":true,"label":"Empty"},"parseResponse":{"collapsed":true}}},"parameters":[{"name":"handleErrors","type":"boolean","label":"Evaluate all states as errors (except for 2xx and 3xx )","required":true},{"name":"useNewZLibDeCompress","type":"hidden"}]}},{"id":40,"module":"regexp:GetElementsFromText","version":1,"parameters":{"continueWhenNoRes":false},"mapper":{"pattern":"##http_urls","text":"{{toString(39.data)}}","requireProtocol":false,"specialCharsPattern":""},"metadata":{"designer":{"x":238,"y":-1519,"name":"Get links on page"},"restore":{"expect":{"pattern":{"label":"HTTP address"}}},"parameters":[{"name":"continueWhenNoRes","type":"boolean","label":"Continue the execution of the route even if the module finds no matches","required":true}],"interface":[{"name":"match","label":"Match","type":"any"}]}},{"id":42,"module":"http:ActionSendData","version":3,"parameters":{"handleErrors":true,"useNewZLibDeCompress":true},"filter":{"name":"not self 
link","conditions":[[{"a":"{{40.match}}","o":"text:notequal:ci","b":"https://www.make.com/en"}]]},"mapper":{"url":"{{40.match}}","serializeUrl":false,"method":"get","headers":[],"qs":[],"bodyType":"","parseResponse":false,"authUser":"","authPass":"","timeout":"","shareCookies":false,"ca":"","rejectUnauthorized":true,"followRedirect":true,"useQuerystring":false,"gzip":true,"useMtls":false,"followAllRedirects":false},"metadata":{"designer":{"x":537,"y":-1520,"name":"Get linked page content"},"restore":{"expect":{"method":{"mode":"chose","label":"GET"},"headers":{"mode":"chose"},"qs":{"mode":"chose"},"bodyType":{"label":"Empty"}}},"parameters":[{"name":"handleErrors","type":"boolean","label":"Evaluate all states as errors (except for 2xx and 3xx )","required":true},{"name":"useNewZLibDeCompress","type":"hidden"}]}},{"id":46,"module":"util:ComposeTransformer","version":1,"parameters":{},"mapper":{"value":"{{toString(42.data)}}"},"metadata":{"designer":{"x":784,"y":-1521,"name":"Get page source code"},"restore":{},"expect":[{"name":"value","type":"text","label":"Text"}]}},{"id":47,"module":"util:TextAggregator","version":1,"parameters":{"rowSeparator":"\n","feeder":40},"mapper":{"value":"{{46.value}}{{newline}}"},"metadata":{"designer":{"x":1037,"y":-1520,"name":"Combine into single variable"},"restore":{"parameters":{"rowSeparator":{"label":"New row"}},"extra":{"feeder":{"label":"Get links on page - Match elements"}}},"parameters":[{"name":"rowSeparator","type":"select","label":"Row separator","validate":{"enum":["\n","\t","other"]}}],"advanced":true}},{"id":48,"module":"openai-gpt-3:CreateCompletion","version":1,"parameters":{"__IMTCONN__":107818},"mapper":{"select":"chat","max_tokens":"128000","temperature":"1","top_p":"1","n_completions":"1","response_format":"text","model":"gpt-4-turbo","messages":[{"role":"user","content":"Analyze the following content and provide a 
summary:\n\n{{47.text}}"}]},"metadata":{"designer":{"x":1283,"y":-1519,"name":"OpenAI"},"restore":{"parameters":{"__IMTCONN__":{"label":"OpenAI","data":{"scoped":"true","connection":"openai-gpt-3"}}},"expect":{"select":{"label":"Create a Chat Completion (GPT Models)"},"logit_bias":{"mode":"chose"},"response_format":{"mode":"chose","label":"Text"},"stop":{"mode":"chose"},"additionalParameters":{"mode":"chose"},"model":{"mode":"chose","label":"gpt-4-turbo (system)"},"messages":{"mode":"chose","items":[{"role":{"mode":"chose","label":"User"}}]}}},"parameters":[{"name":"__IMTCONN__","type":"account:openai-gpt-3","label":"Connection","required":true}]}}]}],"metadata":{"version":1}}


Thank you for your reply, and I like the idea, but I'm running into some problems. A website is filled with hundreds of linked pages for all sorts of different things, while I only actually need a few, like the pricing and security pages, ones that are actual heygen pages. Is it possible to get the links and then filter for only the ones we need? Also, I don't know why, but the "Get linked page content" module fails with an invalid URL error, even though the URL comes from the data block of the previous module (www.heygen.com), so I don't know why that's not working.




Yes, modify the existing filter in the middle.

One of the URLs in one of the bundles is probably a “relative” URL.

You can try adding the base domain http://www.heygen.com/ to the start of URLs that do not begin with http://www.heygen.com/.
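Both fixes (prepending the base domain to relative links, and filtering down to just the pages you need) can be sketched together in Python. The base domain and the list of wanted paths below are assumptions taken from the post, not anything Make generates for you:

```python
from urllib.parse import urljoin, urlparse

BASE = "https://www.heygen.com/"
WANTED = ("/pricing", "/security")  # only the pages we actually need

def normalize(link: str) -> str:
    # Prepend the base domain to relative links; absolute links pass through unchanged.
    return urljoin(BASE, link)

def keep(link: str) -> bool:
    # Keep only links on the heygen.com domain whose path is on the wanted list.
    parts = urlparse(link)
    return parts.netloc.endswith("heygen.com") and parts.path.rstrip("/") in WANTED

raw = ["/pricing", "https://www.heygen.com/security", "https://twitter.com/heygen"]
urls = [normalize(link) for link in raw]
print([u for u in urls if keep(u)])
```

In Make itself, the `keep` logic corresponds to the filter between the Iterator and the "Get linked page content" module, with one condition per wanted path.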


Okay, thank you, you've been a big help. Now my boss wants me to do things differently, so I've changed things up a bit, and I have a different question. I want to let OpenAI fetch the data from an image I have in Google Drive, then use the OpenAI "Text to Structured Data" module to structure the data so I can put it into Google Sheets. For instance, I have these categories: Billing Cycle, Pricing, Description, Type of AI. I want the info that OpenAI image vision fetched from these images, and then "Text to Structured Data" to arrange it in a structure so I can insert it into rows in Sheets under the right headers. But it doesn't seem to work for me for some reason. Overall, what do you think of OpenAI "Text to Structured Data"? Is it a good option, or do you recommend sticking with a "Parse JSON" module and asking OpenAI image vision to produce its answer in JSON format by mentioning that in the prompt?
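As an aside on the parse-JSON approach mentioned above: once the vision prompt is instructed to reply in JSON only, mapping that reply onto a sheet row is straightforward. The reply string below is hypothetical example output, and the category names are taken from the post:

```python
import json

# Hypothetical raw reply from an image-vision prompt that was asked
# to answer in JSON only, using the categories from the post.
reply = (
    '{"Billing Cycle": "Monthly", "Pricing": "$29", '
    '"Description": "AI video tool", "Type of AI": "Video generation"}'
)

record = json.loads(reply)

# Order the values to match the sheet's header row before appending.
headers = ["Billing Cycle", "Pricing", "Description", "Type of AI"]
row = [record.get(h, "") for h in headers]
print(row)
```

In Make, the `json.loads` step corresponds to the "Parse JSON" module, and `row` corresponds to the values mapped into the Google Sheets "Add a Row" module.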





Oh, and would you know why results differ with the same prompt? If I provide a screenshot of, for instance, the pricing page of an AI tool's website, the OpenAI image vision module captures the data within that image with perfect accuracy. However, when I use CloudConvert to generate a PNG or JPG image from a URL and then give that to image vision, the data is not extracted from the image nearly as accurately.

No problem, glad I could help!

Please create a separate topic for each question.

While it’s tempting to continue an existing thread, a more effective approach would be to start a new topic. It helps other community users to respond to your query, and keeps our space organised for everyone. If you start a new conversation you are also more likely to get help from other users. You can refer others back to a related topic by including that link in your question. Thank you for understanding and keeping our community neat and tidy.

The “New Topic” link can be found in the top-right of the header:

1. If anyone has a new question in the future, please start a new thread. This makes it easier for others with the same problem to search for the answers to specific questions, and you are more likely to receive help since newer questions are monitored closely.

2. The Make Community guidelines encourage users to mark helpful replies as solutions to help keep the Community organized.

This marks the topic as solved, so that:

  • others can save time when catching up with the latest activity here, and
  • allows others to quickly jump to the solution if they come across the same problem

To do this, simply click the checkbox at the bottom of the post that answers your question:
[screenshot of the solution checkbox]

3. Don’t forget to like and bookmark this topic so you can get back to it easily in future!

4. Do join the unofficial Make Discord server for live chat and video assistance