Scrape Article Content from Redirected Google News Links

I’m trying to scrape the content of a specific article, but the starting point is a Google News link that redirects to the actual article. When I use the HTTP module with custom headers, the redirect isn’t followed, so I can’t reach the article content directly.
For example, the Google News link is: https://news.google.com/rss/articles/CBMieWh0dHBzOi8vd3d3Lm51bWVyYW1hLmNvbS90ZWNoLzE2NjkzODgteWFubi1sZS1jdW4tbGlhLWdlbmVyYXRpdmUtZXN0LTUwLWZvaXMtbW9pbnMtaW50ZWxsaWdlbnRlLXF1dW4tZW5mYW50LWRlLTQtYW5zLmh0bWzSAQA?oc=5
This link redirects to the actual article URL: https://www.numerama.com/tech/1669388-yann-le-cun-lia-generative-est-50-fois-moins-intelligente-quun-enfant-de-4-ans.html
I’ve tried using the following headers, but I’m still not able to access the article content:

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'CONSENT': 'YES+cb.20220419-08-p0.cs+FX+111'
}

Do you guys have any suggestions on how I can efficiently scrape the article content from the redirected Google News link?

Welcome to the Make community!

If you view the HTML source of the page, it looks like a client-side JavaScript redirect. This means that the “Follow redirects” option in the HTTP module won’t work, because that option only handles server-side redirects (HTTP 301/302 status codes).
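To make the distinction concrete, here is a rough Python sketch rather than anything Make-specific. The function name classify_redirect and the regex patterns are my own assumptions, and the pattern matching is only a heuristic. The idea is that a server-side redirect shows up as a 3xx status with a Location header, while a client-side one comes back as a normal 200 page whose HTML or script changes the location.

```python
import re

# Status codes that signal a server-side redirect via the Location header.
SERVER_REDIRECT_CODES = {301, 302, 303, 307, 308}

# Heuristic patterns for client-side redirects (assumed, not exhaustive).
CLIENT_REDIRECT_PATTERNS = [
    r"<meta[^>]+http-equiv=[\"']?refresh",   # <meta http-equiv="refresh" ...>
    r"(?:window\.)?location(?:\.href)?\s*=",  # location = "..." / location.href = "..."
    r"location\.replace\(",                   # location.replace("...")
]


def classify_redirect(status, headers, body):
    """Return 'server', 'client', or 'none' for an already-fetched response.

    status  -- HTTP status code (int)
    headers -- mapping of response header names to values
    body    -- response body as text
    """
    header_names = {name.lower() for name in headers}
    if status in SERVER_REDIRECT_CODES and "location" in header_names:
        return "server"
    if status == 200 and any(
        re.search(pattern, body, re.IGNORECASE)
        for pattern in CLIENT_REDIRECT_PATTERNS
    ):
        return "client"
    return "none"
```

A “Follow redirects” style option can only act on the first case, which is why the Google News page comes back as ordinary HTML instead of the article.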

The source contains three obvious copies of the article URL, so you could probably use a Text Parser “Match Elements” module to extract all the URLs on the page, and then filter them by domain name.

Or, you can just aggregate the results and use the built-in last function to get the last URL on the page.
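If it helps to see the logic outside of Make, both options can be sketched in Python roughly like this. It is a minimal illustration, not the Text Parser module itself: the URL regex, the helper names (extract_urls, urls_on_domain, last_url) and the SAMPLE_HTML markup are all assumptions, and the real Google News page will look different.

```python
import re
from urllib.parse import urlparse

# Simple URL matcher: stops at whitespace, quotes, angle brackets and backslashes.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>\\]+")

ARTICLE_URL = (
    "https://www.numerama.com/tech/1669388-yann-le-cun-lia-generative-"
    "est-50-fois-moins-intelligente-quun-enfant-de-4-ans.html"
)

# Hypothetical stand-in for the redirect page; not the real markup.
SAMPLE_HTML = f"""
<link rel="stylesheet" href="https://news.google.com/style.css">
<a href="{ARTICLE_URL}">Opening article</a>
<script>window.location.replace("{ARTICLE_URL}");</script>
"""


def extract_urls(html):
    """Return every URL found in the HTML, in document order."""
    return URL_PATTERN.findall(html)


def urls_on_domain(urls, domain):
    """Keep only URLs whose host is the domain or one of its subdomains."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            kept.append(url)
    return kept


def last_url(html):
    """Return the last URL on the page, or None if there is none."""
    urls = extract_urls(html)
    return urls[-1] if urls else None


if __name__ == "__main__":
    print(urls_on_domain(extract_urls(SAMPLE_HTML), "numerama.com"))
    print(last_url(SAMPLE_HTML))
```

Filtering by domain is the safer choice when you know the publisher in advance. Taking the last URL is simpler, but it depends on the page layout staying the same.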

Give it a go and let us know if you have any issues!

samliew
