How to scrape only specific parts of a website?

I want to send a website's headlines to ChatGPT for evaluation.

Currently, when I use the HTTP request module, it scrapes the entire website.

Can I somehow restrict the scrape to only the H1, H2, and H3 headings? If I send the entire HTML code of a single page, the amount of data is too much for ChatGPT and/or it costs too much.

Or what might be a good module for trimming the scraped HTML before sending it to ChatGPT, so that I could cut away unneeded HTML elements?


Hi @Jayman
You can use the Text parser module to match the heading tags in the HTML returned by the HTTP module.

To match all heading tags from h1 to h6, you can use the expression:
<h[1-6][^>]*>.*?</h[1-6]>
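
For anyone who wants to try the expression outside of Make, here is a minimal Python sketch of the same idea. The sample HTML, the `extract_headings` helper, and the two regex flags (case-insensitive, dot matches newlines) are my own assumptions for illustration and are not part of the reply above; the Text parser module may behave differently depending on its settings.

```python
import re

# The expression from the reply, compiled with two assumed flags:
# IGNORECASE so <H1> also matches, DOTALL so a heading may span several lines.
HEADING_PATTERN = re.compile(r"<h[1-6][^>]*>.*?</h[1-6]>", re.IGNORECASE | re.DOTALL)

# Hypothetical sample page, standing in for the HTTP module's output.
sample_html = """
<html><head><title>Demo</title></head>
<body>
  <h1 class="main">Top story</h1>
  <p>Some paragraph text that should be dropped.</p>
  <h2>Second <em>headline</em></h2>
  <div><h3>Third headline</h3></div>
</body></html>
"""

def extract_headings(html: str) -> list[str]:
    """Return the text of each heading, with tags and extra whitespace removed."""
    headings = []
    for match in HEADING_PATTERN.findall(html):
        text = re.sub(r"<[^>]+>", "", match)     # drop the heading tag and any inner tags
        headings.append(" ".join(text.split()))  # collapse runs of whitespace
    return headings

headings = extract_headings(sample_html)
print(headings)  # ['Top story', 'Second headline', 'Third headline']
```

Sending only this short list of strings, rather than the full page, is what keeps the ChatGPT input small.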


If you require additional assistance, please don’t hesitate to reach out to us.
MSquare Support


Yep, thanks. For some reason, the Text parser is not delivering any output. Any thoughts?



Welcome to the Make community!

What is the website and what is the output of the HTTP module?

When reaching out for assistance with your regex pattern for a Text Parser module, it would be super helpful if you could share the actual content you’re trying to match. Screenshots of text can be a bit tricky, so if you could copy and paste the text directly here, that would be awesome! It ensures we can run it against test patterns effectively. If there’s any sensitive info, feel free to change it to something fictional yet still valid by keeping the format intact.

Providing clear text examples saves time on both ends and helps us give you the best possible solution. Without proper examples, we might end up playing a guessing game, and nobody wants that as it is a waste of time! You are more likely to get a correct answer faster. So, help us help you by sharing those text snippets. Thanks a bunch!

Please provide the input and output bundles of the modules by running the scenario (or get from the scenario History tab), then click the white speech bubble on the top-right of each module and select “Download input/output bundles”.

A.

Save each bundle's contents in your text editor as a bundle.txt file, and upload it here into this discussion thread.

Uploading them here will look like this:

module-1-input-bundle.txt (12.3 KB)
module-1-output-bundle.txt (12.3 KB)

B.

If you are unable to upload files on this forum, alternatively you can paste the formatted bundles in this manner:

  • Either add three backticks ``` before and after the code, like this:

    ```
    input/output bundle content goes here
    ```

  • Or use the format code button in the editor.

Providing the input/output bundles will allow others to replicate what is going on in the scenario even if they do not use the external service.

This will allow others to better assist you. Thanks!


Hi @Jayman

Make sure you enter exactly the same pattern.
Some * characters are missing from the pattern you used.
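
To illustrate why the missing asterisks matter, here is a small Python sketch comparing the full expression with a version that has every * removed. The stripped pattern is a hypothetical reconstruction, since the pattern that was actually entered is not shown in this thread, and the sample HTML is made up.

```python
import re

# Hypothetical sample input with two headings.
html = '<h1>Top story</h1><h2 class="sub">Second headline</h2>'

full_pattern = r"<h[1-6][^>]*>.*?</h[1-6]>"
# Assumed damaged version: the same pattern with every * removed.
stripped_pattern = full_pattern.replace("*", "")

full_matches = re.findall(full_pattern, html)
stripped_matches = re.findall(stripped_pattern, html)

print(full_matches)      # both headings are found
print(stripped_matches)  # [] because [^>] now demands exactly one character before ">"
```

Without the quantifiers, the tag part only fits a heading tag with exactly one extra character before the closing bracket, which would explain a Text parser that returns no output.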
