How to get the main image of any news website

I need to get the URL of the main image from different websites into a Google Sheet cell.

I'm getting the HTML through an HTTP call, but if I use Get images and then parse them, I get many bundles with images in different positions, and the position of the main image changes depending on the news source.

Is it possible to select only the image based on the meta tag?

Welcome to the Make community!

What if there is no meta og:image tag?


Which meta tag do you want to filter by? You can process each image result bundle and search for the presence of a string but I am not sure this is what you’re asking.


Most news sites have it, but if not then I would have to find the main image of the post.

But there is also this HTML:

"image": {
  "@type": "ImageObject",
  "url": "https://www.abc.com/imagespp/12354.jpg?w=900&h=500",
  "width": 900,
  "height": 500
}

I'm trying RapidAPI - Extract News, but I'd like to know whether it is possible to configure the HTTP module to do something like it…

I usually get 15 to 18 bundles with images: the main post image is buried among logos and other kinds of images.

Most news sites have this tag:

So I'd like to find just that image and put it in a Google Sheet…

If there is no social media tag, there is also this HTML that I'd like to extract from the rest of the HTML:

"image": {
  "@type": "ImageObject",
  "url": "https://www.abc.com/imagespp/12354.jpg?w=900&h=500",
  "width": 900,
  "height": 500
}

Instead of using Get HTML elements in the Text Parser, use the Text Parser's Match Pattern module. You can get the HTML by calling the page with the HTTP app's Make a request module; the full HTML is returned as a text string in the data object.

This doesn’t seem to be HTML but rather a JSON collection. You can use the Text Parser and create a regular expression to look for the URL of the image with this regular expression. The named group will be called URL.

\"url\"\s*:\s*\"(?<URL>http[^\"]+)\"
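If you want to check the pattern outside Make first, here is a minimal Python sketch. It assumes the page embeds JSON like the snippet quoted earlier in this thread; the `sample` string and the function name `find_image_url` are made up for illustration. Note that Python spells named groups `(?P<URL>...)`, and the sketch uses `[^"]+` for the URL so the match stops at the closing quote.

```python
import re

# Python writes named groups as (?P<name>...).
# [^"]+ keeps the capture from running past the closing quote.
URL_PATTERN = re.compile(r'"url"\s*:\s*"(?P<URL>http[^"]+)"')

def find_image_url(text):
    """Return the first "url" value found in the text, or None."""
    match = URL_PATTERN.search(text)
    return match.group("URL") if match else None

# Hypothetical sample modelled on the snippet quoted in this thread.
sample = '''
"image": {
  "@type": "ImageObject",
  "url": "https://www.abc.com/imagespp/12354.jpg?w=900&h=500",
  "width": 900,
  "height": 500
}
'''

print(find_image_url(sample))
```

This returns the first `"url"` it finds, so on a page with several JSON objects it may pick up a logo or author image rather than the post image.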

You can also look for the meta property using this regular expression

<meta\s*property\s*\=\s*\"twitter:image\"\s*content\s*=\s*\"(?<URL>http[^\"]+)\"\s*\/?>
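The meta tag lookup can be sketched the same way in Python. This is only an illustration under some assumptions: the helper name `find_meta_image`, the tag order (`og:image` first, then `twitter:image`), and the sample HTML are not from any module in Make, and the pattern expects the `property`/`name` attribute to come before `content`.

```python
import re

# Template for one meta tag; {tag} is filled in per lookup.
# Accepts property= or name=, and an optional "/" before ">".
META_TEMPLATE = (
    r'<meta\s+(?:property|name)\s*=\s*"{tag}"'
    r'\s+content\s*=\s*"(?P<URL>http[^"]+)"\s*/?>'
)

def find_meta_image(html, tags=("og:image", "twitter:image")):
    """Return the content URL of the first matching meta tag, or None."""
    for tag in tags:
        pattern = META_TEMPLATE.format(tag=re.escape(tag))
        match = re.search(pattern, html, re.IGNORECASE)
        if match:
            return match.group("URL")
    return None

# Hypothetical page head, reusing the image URL quoted in this thread.
sample_html = '''
<head>
<meta property="og:image" content="https://www.abc.com/imagespp/12354.jpg?w=900&h=500" />
</head>
'''

print(find_meta_image(sample_html))
```

Sites that put `content` before `property`, or use single quotes, would need a looser pattern or a real HTML parser.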


Thanks for your suggestion. I tried it but could not configure it so that I got only the main picture. So I resolved it by adding a Pocket module, which does the trick. Thanks very much for the suggestions; I'll keep working on the Match Pattern module later.


Heya @Gabriel_Rodriguez :wave:

Great to hear that you got the ball rolling with assistance from @samliew and @alex.newpath! :soccer:

Thank you very much for sharing the solution that worked for you! This way we keep our community organized and neat for other users.

Keep up the good job!
