Readability / article extraction?

RobertAndrews · October 6, 2023, 10:37am

Is there any way to extract the essential article body of a page?
Safari does this and calls it Reader View.
It is basically the Readability JavaScript library.

samliew · October 7, 2023, 3:24am

Every website has its own way of structuring its content, and not all pages contain the “article” or “main” HTML tag, so it’s mostly hit-or-miss.

If you’re always going to be scraping the same website with the same scenario, it’s better to create a selector to target the HTML tag, or use a Match Pattern module with a pattern that is always going to contain the main content you want to extract.
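To illustrate the "article" / "main" tag idea outside of Make, here is a minimal sketch in Python using only the standard library. The names `MainContentParser` and `extract_main_text` are made up for this example, and it assumes the page actually wraps its content in one of those two tags; when it does not, the function returns None, which is the "miss" case described above.

```python
from html.parser import HTMLParser


class MainContentParser(HTMLParser):
    """Collects text found inside <article> or <main> elements."""

    TARGET_TAGS = {"article", "main"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside a target tag
        self.skip_depth = 0   # nesting depth inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TARGET_TAGS:
            self.depth += 1
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.TARGET_TAGS and self.depth > 0:
            self.depth -= 1
        elif tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible text that sits inside a target tag.
        if self.depth > 0 and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html):
    """Return the article/main text, or None when neither tag is present."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks) if parser.chunks else None
```

Navigation, footers, and anything else outside the target tags are ignored, which is why a site-specific selector tends to be more reliable when the site does not use these tags.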

RobertAndrews · October 7, 2023, 6:02am

“Selector”? What is that capability? I can certainly use an HTTP GET to download the source page, but I don’t think Make can use either jQuery or CSS selectors. This leaves me having to concoct regexes to find content between a given <div class="including class names"> and the next unique <div class="unique class names"> (because merely matching the corresponding closing </div> wouldn’t cut it).
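For reference, the regex approach described here could look roughly like the following Python sketch. The function name `extract_between` and the class names in the usage note are hypothetical; the pattern assumes double-quoted class attributes and is not a general HTML parser.

```python
import re


def extract_between(html, start_class, stop_class):
    """Return the HTML between the opening <div> carrying start_class and
    the next <div> carrying stop_class, or None if either is missing.

    The result still includes the closing </div> of the start container,
    because the match runs up to the stop marker rather than to a
    balanced closing tag.
    """
    pattern = re.compile(
        r'<div[^>]*class="[^"]*\b' + re.escape(start_class) + r'\b[^"]*"[^>]*>'
        r'(.*?)'  # lazy, so the match stops at the first stop marker
        r'<div[^>]*class="[^"]*\b' + re.escape(stop_class) + r'\b[^"]*"',
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else None
```

Called as `extract_between(page_html, "post-body", "comments-section")`, it returns everything after the opening tag of the first div up to the start of the second. Nested divs inside the content are kept, which is the reason for anchoring on the next unique div instead of the first closing tag.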

I have previously used specialist text extraction tools’ APIs for this and it has worked well for news articles - but not for the kind of page I am currently working with.

samliew · October 7, 2023, 6:21am

For web scraping, some apps you can use are ScrapingBee and ScrapeNinja to get content from the page.

I’ve used ScrapeNinja, and you can use jQuery-like selectors there in the extractor function.

ScrapeNinja can also run the page in a web browser, so it closely emulates what users see, as opposed to just the page HTML fetched from the HTTP module.

If you want an example, take a look at Grab data from page and url - #5 by samliew

alex.newpath · October 7, 2023, 1:28pm

Have you tried the text parser module called “HTML to text”? It is one of the simple scrapers available for free inside the native Text Parser app.
