Extract article from web page?

Do any Make-connected services have the ability to extract fields of an article on a web page?

Many API services out there exist to do this. I’ve even plugged in one or two via an HTTP call. They extract fields like body, title, images, date etc.
And yes, I’m familiar with scrapers and doing that via Apify with Make.

I also know there are built-in ways to do this in iOS Shortcuts, because iOS has a Content Graph, so Shortcuts can surface those fields natively.

But are there any other options with Make?

I think there is nothing Make itself offers. How about connections?

Howdy! So :make: Make is not specifically designed with web scraping in mind, but…

:question: Do you have a specific use case in mind?
This is easier to answer in relation to specific functionality.

You can use the :make: Make HTTP module, for example, to log in to a page, share cookies with the next HTTP module, and download simple HTML pages behind that login.

Then use the Text Parser module to, say, "extract all images".
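For anyone who wants to see the idea outside Make, here is a minimal Python sketch of the same "fetch the HTML, then pull out every image" step. It is not how the Text Parser module works internally; the `ImageCollector` class, the `extract_images` helper and the sample HTML are all made up for illustration, and it only sees images that are present in the static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.images.append(urljoin(self.base_url, src))


def extract_images(html, base_url):
    """Return absolute URLs for all <img> tags found in the HTML string."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.images


# Made-up sample page, standing in for the body returned by an HTTP request.
sample = (
    '<html><body><img src="/a.png"><p>Hi</p>'
    '<img src="https://cdn.example.com/b.jpg" alt="b"></body></html>'
)
print(extract_images(sample, "https://example.com/news/story"))
```

Images injected by JavaScript after page load will not show up this way, which is the same limitation as the HTTP module approach.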

This has many limitations, depending on a combination of factors…

  • What frontend framework was used to build the website?
  • Is there dynamic content, and how is it loaded?
  • What anti-scraping measures has the website owner taken?
  • Is the content behind a login wall, and if so:
    • How is login handled on the website?
  • Is a frontend framework used to rehydrate statically rendered content?
  • etc., etc.

Some of these limitations can be overcome, depending on the site and situation, if you have a solid understanding of how to read this part of the Chrome DevTools.

Oversimplified baseline

:white_check_mark: APIs
:white_check_mark: non “cross-origin restricted” XHR
:white_check_mark: Basic HTML Pages
:white_check_mark: Statically Rendered Content.
:x: Running JavaScript on the page before scraping.

Hope this is helpful.

Hi, I’m not asking about scraping generally.

That would require advance knowledge of the mark-up construction of every web page.

See the APIs at https://rapidapi.com/collection/news-article-extractor-api

The services behind them are able to smartly extract body text, title, images etc, no matter the layout, and return them as structured data.
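To make "structured data, no matter the layout" concrete, here is a rough Python sketch of one layout-independent signal: the Open Graph and `article:` meta tags that many news pages publish in their `<head>`. This is an assumption for illustration only; I don't know how the services behind those APIs actually work, and real extractors do much more (body text detection in particular). The `MetaExtractor` class, `extract_article_fields` helper and sample page are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects og:* and article:* meta tags plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content and key.startswith(("og:", "article:")):
                # Keep the first value seen for each key.
                self.fields.setdefault(key, content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.fields.setdefault("html_title", data.strip())


def extract_article_fields(html):
    """Return a small structured record; missing fields come back as None."""
    parser = MetaExtractor()
    parser.feed(html)
    f = parser.fields
    return {
        "title": f.get("og:title") or f.get("html_title"),
        "image": f.get("og:image"),
        "published": f.get("article:published_time"),
    }


# Made-up sample page for illustration.
sample_page = """<html><head>
<title>Fallback title</title>
<meta property="og:title" content="Example headline">
<meta property="og:image" content="https://example.com/lead.jpg">
<meta property="article:published_time" content="2023-01-15T09:00:00Z">
</head><body><p>Body text.</p></body></html>"""

print(extract_article_fields(sample_page))
```

Pages without these tags would return mostly empty fields, which is exactly the gap the dedicated extraction APIs are meant to fill.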

I’m asking if Make either i) has any such native functionality (no) or ii) it supports any direct connections to any similar services.

The alternative is to use one of those APIs again.

The short answer is no.

But here are some random extra details, in case they spark ideas for you:

One exception may be images.
It would be pretty straightforward to look at a news feed, get the top “x” links,
and get all images on each article within the main Make interface.

Anything past that would be a heavy lift.

All the APIs on that list can be used within Make via either:

  1. The HTTP module.
  2. Make Custom Apps (see “Getting started - Make Apps”).
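As a sketch of what the HTTP module route amounts to, the Python below builds the pieces you would enter into an HTTP request (method, URL, headers) and then picks fields out of the JSON reply. The `X-RapidAPI-Key` and `X-RapidAPI-Host` headers are the ones RapidAPI uses; everything else is an assumption: the `/extract` path, the `url` query parameter, the host name and the response field names are hypothetical and will differ per API, so check the documentation of whichever one you pick.

```python
import json
from urllib.parse import urlencode


def build_extract_request(article_url, api_host, api_key):
    """Describe the HTTP call: the same values go into an HTTP module's fields."""
    query = urlencode({"url": article_url})  # hypothetical parameter name
    return {
        "method": "GET",
        "url": f"https://{api_host}/extract?{query}",  # hypothetical path
        "headers": {
            "X-RapidAPI-Key": api_key,
            "X-RapidAPI-Host": api_host,
        },
    }


def parse_extract_response(body):
    """Keep only the fields of interest; the names here are hypothetical."""
    data = json.loads(body)
    return {key: data.get(key) for key in ("title", "text", "top_image", "published")}


request = build_extract_request(
    "https://example.com/a?b=1",
    api_host="article-extractor.example.com",  # placeholder host
    api_key="demo-key",  # placeholder key
)
print(request["url"])

# Made-up response body, standing in for what such an API might return.
fake_body = '{"title": "Example headline", "text": "Body text.", "extra": 1}'
print(parse_extract_response(fake_body))
```

In Make, the parsing half would typically be handled by letting the HTTP module parse the JSON response and mapping the fields in later modules.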

Different direction
This would be where I would start.

I also noticed that some bookmark apps (like Raindrop) feed back quite a lot of data about a page/article to Make on request, and can create new bookmark lists from Make.

:person_shrugging: All pretty roundabout ways to get to what you are looking for.

Good suggestions. Yeah, I know Inoreader, Raindrop etc.
It won’t be Inoreader or a feed reader for this particular project, since the source articles won’t be coming in a list (feed).
The nice thing about that, though, is - never mind Inoreader alone - even RSS represents its own structured, clean article data. But don’t get me started on the availability of RSS feeds these days - not to mention, you’d have to check for an available feed.
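To show what I mean by RSS being structured already, here is a small Python sketch that reads the standard RSS 2.0 item elements (`title`, `link`, `pubDate`, `description`) with the standard library. The `parse_rss_items` helper and the sample feed are made up for illustration, and it assumes a plain RSS 2.0 feed without namespaced extensions.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict per <item>, using the standard RSS 2.0 element names."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
            "summary": item.findtext("description"),
        })
    return items


# Made-up feed for illustration.
sample_feed = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Example headline</title>
      <link>https://example.com/news/story</link>
      <pubDate>Sun, 15 Jan 2023 09:00:00 GMT</pubDate>
      <description>Short summary of the story.</description>
    </item>
  </channel>
</rss>"""

for entry in parse_rss_items(sample_feed):
    print(entry["title"], entry["link"])
```

No layout guessing is needed, but it only helps when the site actually offers a feed.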

So, the idea of using Raindrop is nice and sneaky… bookmark an article in there and suck out the raw article data if available.
I’ll also check out that data extraction category, that’s the one.

Otherwise, no real problem using an external API, I guess.
'Twas just a thought, to try something built-in.



Honestly curious now how much is grabbable from Raindrop without using the “Make an API Call” module.

Let me know if you try it, and how it goes. (I might also beat you to it xD)