How to scrape only specific parts of a website?

Jayman · March 15, 2024, 10:09am

I want to send the website headlines to ChatGPT to evaluate.

Currently, when I use the HTTP request module, it scrapes the entire website.

Can I somehow restrict the scrape to only the H1, H2, and H3 titles? If I send the entire HTML code from a single website, the amount of data is too much for ChatGPT and/or it costs too much.

Or what might be a good module to manage the scraped HTML code before sending it to ChatGPT, so that I could cut away unneeded HTML elements?


Msquare_Automation · March 15, 2024, 11:17am

Hi @Jayman
You can use the Text parser module to match the heading tags out of the entire HTML file.

To match all the heading tags from h1 to h6, you can use the expression:
<h[1-6][^>]*>.*?</h[1-6]>
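
For anyone who wants to try the pattern outside Make first, here is a minimal Python sketch. It is not the Text parser module itself, only an illustration of the same regex idea. It narrows the range to h1-h3 as asked in the question, adds a backreference so the closing tag must match the opening one, and strips nested tags from the result. The `extract_headings` helper and the sample HTML are made up for the example. A regex like this is fine for simple pages, but it is not a full HTML parser.

```python
import re

# Same idea as the pattern above, limited to h1-h3.
# \1 forces the closing tag to use the same level as the opening tag.
# DOTALL lets ".*?" span line breaks; IGNORECASE also accepts <H1>.
HEADING_RE = re.compile(r"<h([1-3])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")  # any remaining tag inside a heading


def extract_headings(html):
    """Return a list of (level, text) tuples for h1-h3 headings (hypothetical helper)."""
    headings = []
    for level, inner in HEADING_RE.findall(html):
        text = TAG_RE.sub("", inner)   # drop nested tags such as <em> or <a>
        text = " ".join(text.split())  # collapse whitespace and line breaks
        headings.append((int(level), text))
    return headings


# Made-up sample page, standing in for the HTTP module's output.
sample_html = """
<html><body>
<h1 class="title">Main <em>headline</em></h1>
<p>Some body text that should not be sent.</p>
<h2>Second
 headline</h2>
<h4>Too deep, skipped</h4>
</body></html>
"""

for level, text in extract_headings(sample_html):
    print(f"h{level}: {text}")
```

Only the short heading texts would then need to be passed on to ChatGPT, instead of the whole page.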


If you require additional assistance, please don’t hesitate to reach out to us.
MSquare Support

Jayman · March 15, 2024, 12:29pm

Yep, thanks. For some reason the Text parser is not delivering any output. Any thoughts?

samliew · March 15, 2024, 1:35pm

Welcome to the Make community!

What is the website and what is the output of the HTTP module?

When reaching out for assistance with your regex pattern for a Text Parser module, it would be super helpful if you could share the actual content you’re trying to match. Screenshots of text can be a bit tricky, so if you could copy and paste the text directly here, that would be awesome! It ensures we can run it against test patterns effectively. If there’s any sensitive info, feel free to change it to something fictional yet still valid by keeping the format intact.

Providing clear text examples saves time on both ends and helps us give you the best possible solution. Without proper examples, we might end up playing a guessing game, and nobody wants that as it is a waste of time! You are more likely to get a correct answer faster. So, help us help you by sharing those text snippets. Thanks a bunch!

Please provide the input and output bundles of the modules by running the scenario (or get from the scenario History tab), then click the white speech bubble on the top-right of each module and select “Download input/output bundles”.

A.

Save each bundle contents in your text editor as a bundle.txt file, and upload it here into this discussion thread.

Uploading them here will look like this:

module-1-input-bundle.txt (12.3 KB)
module-1-output-bundle.txt (12.3 KB)

B.

If you are unable to upload files on this forum, alternatively you can paste the formatted bundles in this manner:

Either add three backticks ``` before and after the code, like this:

```
input/output bundle content goes here
```
Or use the format code button in the editor.

Providing the input/output bundles will allow others to replicate what is going on in the scenario even if they do not use the external service.

This will allow others to better assist you. Thanks!

Msquare_Automation · March 16, 2024, 6:16am

Hi @Jayman

Make sure you enter exactly the same pattern.
There are some `*` characters missing in the pattern you gave.
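
A small Python sketch of why the missing asterisks matter. The `stripped` pattern below is only a guess at what a copy with every `*` removed might look like; it is not taken from the screenshot. Without the quantifiers, `[^>]` and `.` each have to match exactly one or at most one character, so ordinary headings no longer match and the parser returns nothing.

```python
import re

# Made-up sample heading for the comparison.
html = '<h1 class="title">Hello world</h1>'

intended = r"<h[1-6][^>]*>.*?</h[1-6]>"  # pattern as posted earlier in the thread
stripped = r"<h[1-6][^>]>.?</h[1-6]>"    # hypothetical copy with every * removed

intended_matches = re.findall(intended, html)
stripped_matches = re.findall(stripped, html)

print(intended_matches)  # the full heading is matched
print(stripped_matches)  # empty list: nothing matches
```

If the pattern was copied from a rendered forum post, it is worth comparing it character by character with the original, since some editors treat `*` as formatting.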
