So, i’m trying to pull information from articles posted on various websites. All the websites are different as they are coming from a news aggregator RSS feed.
I can store the URL of the article, but then I want another module to pull the text and store it into a table. This is then passed into openAI so that it can be summarised.
However, Im struggling to get only the relevant text of the page. At the moment i’m getting everything - sidebar info, CSS, and loads of ads. I don’t want this, i just want the article text, both H2,H3 etc, and paragraph text.
Is there a way i can do this which would likely work on most web pages and just pull the text I want?
Usually RSS feeds contain the body text/content. Try using the data contained within the RSS feed.
You could probably try to configure a module that does web scraping (like ScrapeNinja, ScrapingBee), to extract the selectors you mentioned (article/paragraphs/headers).
Alternatively, you can try to use Feedly to aggregate different RSS feeds for you.
A possible solution would be to use AI (OpenAI GPT) to parse the raw web page data and return you only the article bits you need (structured data).