Readability / article extraction?

Is there any way to extract the essential article body of a page?
Safari does this and calls it Reader View.
It is basically the Readability JavaScript library.

Every website structures its content differently, and not all pages contain the “article” or “main” HTML tag, so it’s mostly hit-or-miss.

If you’re always going to be scraping the same website with the same scenario, it’s better to create a selector to target the HTML tag, or use a Match Pattern module with a pattern that is always going to contain the main content you want to extract.
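To illustrate the hit-or-miss part, here is a minimal Python sketch (outside Make, standard library only) that returns the text of the first "article" or "main" element and gives back nothing when the page has neither. The class and function names are made up for this example, and it is not how Reader View or Readability actually work.

```python
from html.parser import HTMLParser


class MainContentParser(HTMLParser):
    """Collects text found inside the first <article> or <main> element."""

    TARGETS = ("article", "main")

    def __init__(self):
        super().__init__()
        self.target = None  # tag we are currently inside, if any
        self.depth = 0      # nesting depth of that same tag
        self.done = False   # True once the first target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.target is None and tag in self.TARGETS:
            self.target, self.depth = tag, 1
        elif tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.target is not None and tag == self.target:
            self.depth -= 1
            if self.depth == 0:
                self.target = None
                self.done = True

    def handle_data(self, data):
        # Keep only non-empty text while inside the target element.
        if self.target is not None and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html):
    """Return the target element's text, or None if no target tag exists."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks) or None
```

A page without either tag returns None, which is the case where you would fall back to a site-specific selector or pattern.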


“Selector”? What is that capability? I can certainly use an HTTP GET to download the source page, but I don’t think Make can use either jQuery or CSS selectors. This leaves me having to concoct regexes to find content between a given <div class="including class names"> and the next unique <div class="unique class names"> (because the corresponding closing </div> alone wouldn’t cut it).
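For what it's worth, the kind of regex I mean looks roughly like this. It is written in Python only so it can be tested; the class names in the usage line are hypothetical, and the pattern would still need adapting to whatever regex flavour the Make module accepts.

```python
import re


def between_markers(html, start_class, end_class):
    """Return the HTML between a <div> carrying start_class and the
    next <div> carrying end_class, or None if the pair is not found."""
    pattern = re.compile(
        # Opening <div> whose class attribute contains start_class.
        r'<div[^>]*class="[^"]*\b' + re.escape(start_class) + r'\b[^"]*"[^>]*>'
        # Lazily capture everything after it...
        r'(.*?)'
        # ...up to (not including) the next <div> carrying end_class.
        r'(?=<div[^>]*class="[^"]*\b' + re.escape(end_class) + r'\b[^"]*"[^>]*>)',
        re.DOTALL,  # let "." span line breaks
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else None


# Hypothetical class names, for illustration only.
sample = '<div class="post-body main"><p>Hello</p></div><div class="comments">c</div>'
print(between_markers(sample, "post-body", "comments"))
```

Note that the capture still includes any closing tags that sit before the end marker, so some clean-up is needed afterwards.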

I have previously used specialist text extraction tools’ APIs for this and it has worked well for news articles - but not for the kind of page I am currently working with.

For web scraping, you can use apps such as ScrapingBee and ScrapeNinja to get content from the page.

I’ve used ScrapeNinja, and you can use jQuery-like selectors there in the extractor function.

ScrapeNinja can also run the page in a web browser, so it closely emulates what users see, as opposed to just the raw page HTML fetched by the HTTP module.

If you want an example, take a look at Grab data from page and url - #5 by samliew


Have you tried the Text Parser module called “HTML to text”? It is one of the simple scrapers available for free in the native Text Parser app.
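If it helps to picture what an HTML-to-text step does, here is a rough Python sketch of the general idea: drop the tags, skip script and style content, and collapse whitespace. This is only an assumption about the concept, not Make's actual implementation, and the names are invented for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps text nodes and ignores anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0  # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

The result is the whole page as text, navigation and footers included, so it does not isolate the article body on its own.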
