Get only article text from a remote page

Dylan_Leighton · January 3, 2024, 5:55pm

So, i’m trying to pull information from articles posted on various websites. All the websites are different as they are coming from a news aggregator RSS feed.

I can store the URL of the article, but then I want another module to pull the text and store it into a table. This is then passed into openAI so that it can be summarised.

However, Im struggling to get only the relevant text of the page. At the moment i’m getting everything - sidebar info, CSS, and loads of ads. I don’t want this, i just want the article text, both H2,H3 etc, and paragraph text.

Is there a way i can do this which would likely work on most web pages and just pull the text I want?

samliew · January 4, 2024, 2:30am

Usually RSS feeds contain the body text/content. Try using the data contained within the RSS feed.

You could probably try to configure a module that does web scraping (like ScrapeNinja, ScrapingBee), to extract the selectors you mentioned (article/paragraphs/headers).

Alternatively, you can try to use Feedly to aggregate different RSS feeds for you.

A possible solution would be to use AI (OpenAI GPT) to parse the raw web page data and return you only the article bits you need (structured data).

Screenshot_2024-01-04_100138

Topic		Replies	Views
Readability / article extraction? Features api	5	1189	January 5, 2024
I am having trouble getting Chatgpt to read a webpage Getting Started connections	4	1090	August 2, 2024
How to summarize a scraped page? How To	2	670	December 7, 2023
Extract text content from HTML and save it to a Google Doc How To functions , connections	7	906	July 5, 2024
How to access and get data of content of a specific news from a website and add it to a google sheet How To google-sheets , http	1	19	April 24, 2025

Get only article text from a remote page

Related topics