MrStack
February 5, 2024, 11:00pm
1
Hi all,
I am reading RSS data from various feeds, and that works OK.
But some feeds have an HTML link to the full news story. I can iterate through those pages, but when I do I also get junk that I don’t want, such as the site logo, menus, etc.
Here is an example of an RSS feed:
https://pitchfork.com/feed/feed-news/rss
Within the RSS feed, you might see a link such as
Is there a way of getting just the story content (without the header menu, footer menu, signup button, and "read more" section)?
Thanks
Gar
samliew
February 5, 2024, 11:57pm
2
Yes, you can use web scraping apps like ScrapeNinja and ScrapingBee where you can specify which sections of the website you want to return.
Have you tried the Text Parser module called HTML to Text?
MrStack
February 6, 2024, 10:37pm
4
Ok thanks
This looks like what I need, but I am getting a 403 error with my RapidAPI key.
I have opened a new topic on that
Thanks
Gar
MrStack
February 6, 2024, 10:38pm
5
Thanks, but I think that I want to keep the HTML content (images etc)
Hi @samliew
This works well for me, thanks
But I was wondering: what if I want to search for a class whose name contains a space (e.g. “article main-content”)?
Is this possible?
I have tried
extra single quotes: “‘.article main-content’”
I have also tried “.article.main-content”
Any ideas?
Thanks
samliew
February 7, 2024, 10:35pm
7
.article.main-content should be correct. No single quotes.
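To spell out why: a class attribute such as `class="article main-content"` holds two separate classes, `article` and `main-content`, so the selector chains them with dots and no space. With a space, `.article main-content` would instead look for a `<main-content>` element nested inside `.article`. A minimal sketch of the conversion, where `classAttrToSelector` is a made-up helper name and the class names are assumed to be plain identifiers that need no escaping:

```javascript
// Hypothetical helper (not part of cheerio or ScrapeNinja): converts a class
// attribute value into a selector matching elements that carry ALL the classes.
// Assumes the class names contain no characters that need CSS escaping.
function classAttrToSelector(classAttr) {
  // Split on any run of whitespace and drop empty pieces
  const names = classAttr.trim().split(/\s+/).filter(Boolean);
  // Prefix each class with "." and join without spaces
  return names.map((name) => "." + name).join("");
}

console.log(classAttrToSelector("article main-content")); // .article.main-content
```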
If you need further assistance, please provide the following:
1. Extractor function
Please provide the contents of the extractor function here. Paste the text formatted in this manner:
Either add three backticks ``` before and after the code, like this:
```
input/output bundle content goes here
```
Or use the format code button in the editor:
MrStack
February 10, 2024, 1:49am
8
```
function (input, cheerio) {
  let $ = cheerio.load(input);
  return {
    title: $("h1").text().trim(),
    excerpt: $(".body__inner-container").text().trim(),
    body: $(".article.main-content").text().trim()
  };
}
```
The above code is what I tried.
The end result is that the excerpt and body values come back identical.
Thanks for any help
samliew
February 10, 2024, 11:24am
9
I tried your function above, with no modifications, in the ScrapeNinja Live Sandbox,
and it gives the correct result.
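If the two values still come back identical on a particular page, one possible cause (an assumption, since I have not checked that page's markup) is that both selectors resolve to overlapping elements, so `.text()` returns the same text for each. Since you mentioned wanting to keep images, here is a minimal sketch of the same extractor returning the inner HTML via cheerio's `.html()` instead of `.text()`. The names `extractHtml`, `bodyHtml` and `bodyMatches` are made up for illustration; in the scraper you would paste it as an anonymous `function (input, cheerio)` like your original.

```javascript
// Sketch only: same shape as the extractor posted above.
// Assumption: the scraper calls it with the page HTML and a cheerio instance.
function extractHtml(input, cheerio) {
  const $ = cheerio.load(input);
  const body = $(".article.main-content");
  return {
    title: $("h1").first().text().trim(),
    // .html() returns the inner markup of the first match (images and links
    // included), or null when nothing matched
    bodyHtml: body.first().html(),
    // How many elements matched; 0 or more than 1 hints at a selector problem
    bodyMatches: body.length
  };
}
```

Checking `bodyMatches` first is a cheap way to tell whether the selector is matching nothing, one element, or several before comparing the fields.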