Get only article text from a remote page

So, i’m trying to pull information from articles posted on various websites. All the websites are different as they are coming from a news aggregator RSS feed.

I can store the URL of the article, but then I want another module to pull the text and store it into a table. This is then passed into openAI so that it can be summarised.

However, Im struggling to get only the relevant text of the page. At the moment i’m getting everything - sidebar info, CSS, and loads of ads. I don’t want this, i just want the article text, both H2,H3 etc, and paragraph text.

Is there a way i can do this which would likely work on most web pages and just pull the text I want?

Usually RSS feeds contain the body text/content. Try using the data contained within the RSS feed.

You could probably try to configure a module that does web scraping (like ScrapeNinja, ScrapingBee), to extract the selectors you mentioned (article/paragraphs/headers).

Alternatively, you can try to use Feedly to aggregate different RSS feeds for you.

A possible solution would be to use AI (OpenAI GPT) to parse the raw web page data and return you only the article bits you need (structured data).

Screenshot_2024-01-04_100138

2 Likes