Get HTML content without the rubbish (buttons, ads, ...)

Hi all,

I am reading RSS data from various feeds, and that works OK.
But some feeds only have an HTML link to the full news story. I can fetch and iterate through that page, but when I do I also get junk that I don’t want, such as the site logo, menu, etc.

Here is an example of an RSS feed
https://pitchfork.com/feed/feed-news/rss

Within the RSS feed, you might see a link such as

Is there a way of getting just the story content (without the header menu, footer menu, signup button, and "read more" section)?

Thanks
Gar

Yes, you can use web scraping apps like ScrapeNinja and ScrapingBee where you can specify which sections of the website you want to return.

3 Likes

Have you tried the Text Parser module called HTML to Text?

2 Likes

OK, thanks.
This looks like what I need, but I am getting a 403 error with my RapidAPI key.
I have opened a new topic on that.

Thanks
Gar

Thanks, but I think that I want to keep the HTML content (images, etc.)

1 Like

Hi @samliew

This works well for me, thanks
But just wondering: what if I want to search for a class with a space in its name (e.g. “article main-content”)?
Is this possible?
I have tried extra single quotes: “‘.article main-content’”
I have also tried “.article.main-content”

Any ideas?
Thanks

.article.main-content should be correct. No single quotes.
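The reason, as far as I understand it: in HTML a space inside a class attribute separates two class names, so the element has both `article` and `main-content`, and the selector chains them with dots and no space. A minimal sketch of that conversion in plain JavaScript (`classAttrToSelector` is a hypothetical helper name for illustration, not part of cheerio or ScrapeNinja):

```javascript
// Hypothetical helper: turn the value of a class attribute into a CSS selector
// that matches elements carrying ALL of the listed classes.
function classAttrToSelector(classAttr) {
  return classAttr
    .trim()
    .split(/\s+/)              // whitespace separates class names in HTML
    .filter(Boolean)           // drop the empty entry produced by an empty string
    .map((name) => "." + name) // each class becomes ".name"
    .join("");                 // no space: one element must have every class
}

console.log(classAttrToSelector("article main-content")); // .article.main-content
```

By contrast, `.article .main-content` (with a space) is a descendant selector: it looks for a `main-content` element somewhere inside an `article` element, which is usually not what you want here.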

If you need further assistance, please provide the following:

1. Extractor function

Please provide the contents of the extractor function here. Paste the text formatted in this manner:

  • Either add three backticks ``` before and after the code, like this:

    ```
    input/output bundle content goes here
    ```

  • Or use the format code button in the editor:
    Screenshot_2023-10-02_191027

2 Likes

function (input, cheerio) {
  const $ = cheerio.load(input);
  return {
    title: $("h1").text().trim(),
    excerpt: $(".body__inner-container").text().trim(),
    body: $(".article.main-content").text().trim()
  };
}

The above code is what I tried.
The end result is that the excerpt and body values come out identical.

Thanks for any help

Running your function above, unmodified, in the ScrapeNinja Live Sandbox gives the correct result.
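If the two fields still come out identical against the live page, one possible cause (a guess, not something confirmed in this thread) is that one selector's element is nested inside the other's, since `.text()` returns the text of all descendants. A diagnostic sketch of the same extractor that reports match counts and nesting instead of text; the function name `extractor` and the returned field names are made up for illustration:

```javascript
// Diagnostic variant of the extractor; output field names are illustrative only.
function extractor(input, cheerio) {
  const $ = cheerio.load(input);
  const excerpt = $(".body__inner-container");
  const body = $(".article.main-content");
  return {
    // How many elements each selector matched (0 means the selector found nothing)
    excerptMatches: excerpt.length,
    bodyMatches: body.length,
    // true if an excerpt element sits somewhere inside a body element
    excerptInsideBody: body.find(".body__inner-container").length > 0,
    // true if a body element sits somewhere inside an excerpt element
    bodyInsideExcerpt: excerpt.find(".article.main-content").length > 0
  };
}
```

If either nesting flag comes back true, a narrower selector for the outer field may be needed; the right one depends on the page's markup.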

Screenshot_2024-02-10_190242

2 Likes