Get HTML content without the rubbish (buttons,ads...)

MrStack · February 5, 2024, 11:00pm

Hi all,

I am reading RSS data from various feeds, and it works OK.
But some feeds have an HTML link to the full news story. I can iterate through those links, but if I do, I also get some junk that I don’t want, such as the site logo, menu, etc.

Here is an example of an RSS feed
https://pitchfork.com/feed/feed-news/rss

Within the RSS feed, you might see a link such as

Is there a way of getting just the story content (without the header menu, footer menu, signup button, and "read more" section)?

Thanks
Gar

samliew · February 5, 2024, 11:57pm

Yes, you can use web scraping apps like ScrapeNinja and ScrapingBee where you can specify which sections of the website you want to return.

alex.newpath · February 6, 2024, 3:45pm

Have you tried the Text Parser module called HTML to Text?

MrStack · February 6, 2024, 10:37pm

OK, thanks.
This looks like what I need, but I am getting a 403 error with my RapidAPI key.
I have opened a new topic on that.

Thanks
Gar

MrStack · February 6, 2024, 10:38pm

Thanks, but I think I want to keep the HTML content (images, etc.).

MrStack · February 7, 2024, 8:55pm

Hi @samliew

This works well for me, thanks.
But just wondering: what if I want to select a class with a space in its name (e.g. “article main-content”)?
Is this possible?
I have tried extra single quotes, “‘.article main-content’”, and I have also tried “.article.main-content”.

Any ideas?
Thanks

samliew · February 7, 2024, 10:35pm

.article.main-content should be correct. No single quotes.
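For context, `class="article main-content"` in HTML assigns two separate classes, `article` and `main-content`, so there is no single class with a space in its name. A compound selector like `.article.main-content` matches elements that carry both classes, whereas `.article main-content` (with a space) would look for a `<main-content>` element nested inside `.article`. Below is a rough plain-JavaScript sketch of that matching rule; `matchesAllClasses` is a hypothetical helper written for illustration only, not part of cheerio or ScrapeNinja:

```javascript
// Hypothetical helper (not a cheerio API): mimics how a compound class
// selector such as ".article.main-content" is checked against a class attribute.
function matchesAllClasses(classAttr, selector) {
  // The class attribute is a whitespace-separated list of class names.
  const classes = new Set(classAttr.trim().split(/\s+/).filter(Boolean));
  // ".article.main-content" -> ["article", "main-content"]
  const required = selector.split(".").filter(Boolean);
  // Every class named in the selector must be present on the element.
  return required.length > 0 && required.every((name) => classes.has(name));
}

const classAttr = "article main-content";
console.log(matchesAllClasses(classAttr, ".article.main-content")); // true
console.log(matchesAllClasses(classAttr, ".main-content"));         // true
console.log(matchesAllClasses(classAttr, ".article.sidebar"));      // false
```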

If you need further assistance, please provide the following:

1. Extractor function

Please provide the contents of the extractor function here. Paste the text formatted in this manner:

Either add three backticks ``` before and after the code, like this:

```
^{input/output bundle content goes here}
```
Or use the format code button in the editor.

MrStack · February 10, 2024, 1:49am

function (input, cheerio) {
  let $ = cheerio.load(input);
  return {
    title: $("h1").text().trim(),
    excerpt: $(".body__inner-container").text().trim(),
    body: $(".article.main-content").text().trim()
  };
}

The above code is what I tried.
The end result is that the excerpt and body values come out the same.

Thanks for any help

samliew · February 10, 2024, 11:24am

I tried your function above with no modifications in the ScrapeNinja Live Sandbox, and it gives the correct result.
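Since the goal earlier in the thread was to keep the HTML content (images etc.), one possible variation is to return `.html()` rather than `.text()` for the body. This is only a minimal sketch, assuming the same selectors as the function posted above and cheerio's `.first()`, `.text()` and `.html()` methods. The name `extractArticle` and the `bodyText` / `bodyHtml` fields are illustrative assumptions, not something from the original function:

```javascript
// Sketch only: same selectors as the function posted above. The name
// extractArticle and the bodyText/bodyHtml fields are illustrative assumptions.
function extractArticle(input, cheerio) {
  const $ = cheerio.load(input);
  const body = $(".article.main-content").first();
  return {
    title: $("h1").first().text().trim(),
    // .text() strips all markup, so images and links are lost here.
    bodyText: body.text().trim(),
    // .html() returns the inner HTML of the first match, or null when
    // nothing matched, hence the fallback to an empty string.
    bodyHtml: (body.html() || "").trim(),
  };
}
```

In the ScrapeNinja extractor field, the same body would presumably go inside the anonymous `function (input, cheerio) { ... }` wrapper shown earlier.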

