Extract article from web page?

RobertAndrews · August 28, 2022, 8:56pm

Do any Make-connected services have the ability to extract fields of an article on a web page?

Many API services exist to do this. I’ve even plugged in one or two via an HTTP call. They extract fields like body, title, images, date, etc.
And yes, I’m familiar with scrapers and doing that via Apify with Make.

I also know there are built-in ways to do this in iOS Shortcuts, because iOS has a Content Graph, so Shortcuts can surface those fields natively.

But are there any other options with Make?

I think there is nothing Make itself offers. How about connections?

JugaadiTech · August 29, 2022, 3:22am

Howdy! So Make is not specifically designed with web scraping in mind, but…

Do you have a specific use case in mind?
This is easier to answer in relation to specific functionality.

You can use the Make HTTP module, for example, to log in to a page, share cookies with the next HTTP module, and download simple HTML pages behind that login.

Then use the Text Parser module to, say, extract all images.
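To make the "extract all images" step concrete outside of Make, here is a minimal Python sketch using only the standard library. It assumes the HTML has already been downloaded (the equivalent of the HTTP module's output) and that the images are plain `<img src="...">` tags. The names `ImageCollector` and `extract_image_urls` are made up for illustration and are not part of Make.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the src of every <img> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags that have no src attribute
                self.urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url=""):
    """Return absolute image URLs found in an HTML string, in page order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.urls


if __name__ == "__main__":
    sample = '<p><img src="/a.png"><img src="https://cdn.example.com/b.jpg"/></p>'
    print(extract_image_urls(sample, "https://example.com/post"))
```

This only sees images present in the raw HTML; anything loaded later by JavaScript will not show up.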

This has many limitations depending on a combination of factors…

What frontend framework was used to make the website?

Is there dynamic content, and how is it loaded?

What anti-scraping measures have been taken by the website owner?

Is the content behind a login wall, and if so, how is login handled on the website?

Is a frontend framework used to rehydrate statically rendered content?

etc., etc.

Some of these limitations can be overcome if you have a solid understanding of how to read this part of the Chrome DevTools

Depending on the site and situation.

Over Simplified Baseline

APIs
non “cross-origin restricted” XHR
Basic HTML Pages
Statically Rendered Content.
run javascript on page before scraping.

Hope this is helpful.

RobertAndrews · August 29, 2022, 6:28am

Hi, I’m not asking about scraping generally.

That would require advance knowledge of the mark-up construction of every web page.

See the APIs listed at "10 News Article Extraction APIs & Free Alternatives List - December, 2023 | RapidAPI".

The services behind them are able to smartly extract body text, title, images etc, no matter the layout, and return them as structured data.
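As a rough illustration of what "structured data, no matter the layout" can mean at its simplest, here is a Python sketch that reads only the metadata many article pages publish about themselves: the `<title>` tag and Open Graph style `<meta>` tags. This is an assumption-laden stand-in, not how the listed services work; they presumably use far more capable heuristics to find body text. The function name `extract_article_fields` and the returned keys are hypothetical.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Gathers <title> text and <meta property|name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # keep the first value seen for each key
                self.meta.setdefault(key.lower(), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_article_fields(html):
    """Return a dict of article fields; a missing field is None."""
    parser = ArticleMetaParser()
    parser.feed(html)
    meta = parser.meta
    return {
        "title": meta.get("og:title") or parser.title_text.strip() or None,
        "image": meta.get("og:image"),
        "date": meta.get("article:published_time"),
        "description": meta.get("og:description") or meta.get("description"),
    }
```

Pages that publish no such tags will come back mostly empty, which is exactly the gap the dedicated extraction APIs fill.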

I’m asking whether Make either i) has any such native functionality (no) or ii) supports direct connections to any similar services.

The alternative is to use one of those APIs again.

JugaadiTech · August 29, 2022, 7:57am

The short answer is no.

But here are some random extra details in case they spark ideas for you.

One exception may be images.
It would be pretty straightforward to look at a news feed, get the top “x” links,
and get all images on each article within the main Make interface.

Anything past that would be a heavy lift.

All the APIs on that list can be used within Make via either:

The HTTP Module.
Make Custom Apps https://docs.integromat.com/apps/
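For a sense of what the HTTP module route amounts to, here is a Python sketch of the same request/response shape. Everything specific in it is an assumption: the endpoint `https://api.example.com/extract`, the `url` query parameter, the bearer-token header, and the response keys are placeholders, so substitute whatever your chosen extraction API documents.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace with the real one from your API's docs.
API_ENDPOINT = "https://api.example.com/extract"


def build_extract_request(article_url, api_key):
    """Build a GET request asking the (hypothetical) API to extract one article."""
    query = urlencode({"url": article_url})
    return Request(
        f"{API_ENDPOINT}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )


def parse_extract_response(body):
    """Pick out the fields of interest from a JSON response body."""
    data = json.loads(body)
    # Assumed response keys; a key the API does not return becomes None.
    return {key: data.get(key) for key in ("title", "text", "image", "date")}


def extract_article(article_url, api_key):
    """Call the API over the network and return the parsed fields."""
    request = build_extract_request(article_url, api_key)
    with urlopen(request, timeout=30) as response:
        return parse_extract_response(response.read().decode("utf-8"))
```

In Make, the URL, query string, and header would go into the HTTP module's fields, and the parsed JSON fields would then be available to map into later modules.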

A different direction:
this is where I would start.
https://www.make.com/en/integrations/inoreader
https://www.make.com/en/integrations/feedly
https://www.make.com/en/integrations/rss
https://www.make.com/integrations/opengraph-io
https://www.make.com/en/integrations/category/data-extraction-collection

I also noticed that some bookmark apps (like Raindrop) feed back quite a lot of data about a page/article to Make on request, and can create new bookmark lists from Make.

All pretty roundabout ways to get to what you are looking for.

RobertAndrews · August 29, 2022, 8:51am

Good suggestions. Yeah, I know Inoreader, Raindrop etc.
It won’t be Inoreader or a feed reader for this particular project, since the source articles won’t be coming in a list (feed).
The nice thing about that, though, is - never mind Inoreader alone - even RSS represents its own structured, clean article data. But don’t get me started on the availability of RSS feeds these days - not to mention, you’d have to check for an available feed.

So, the idea of using Raindrop is nice and sneaky… bookmark an article in there and suck out the raw article data if available.
I’ll also check out that data extraction category, that’s the one.

Otherwise, no real problem using an external API, I guess.
’Twas just a thought, to try something built-in.

Thanks.

JugaadiTech · August 29, 2022, 3:46pm

Honestly curious now how much is grabbable from Raindrop without using the “Make an API Call” module.

Let me know if you try it, and how it goes. (I might also beat you to it xD)
