Get only article text from a remote page

Usually RSS feeds contain the body text/content. Try using the data contained within the RSS feed.

You could probably try to configure a module that does web scraping (like ScrapeNinja, ScrapingBee), to extract the selectors you mentioned (article/paragraphs/headers).

Alternatively, you can try to use Feedly to aggregate different RSS feeds for you.

A possible solution would be to use AI (OpenAI GPT) to parse the raw web page data and return you only the article bits you need (structured data).

Screenshot_2024-01-04_100138

2 Likes