I'm trying to scrape the content of a specific article, but my starting point is a Google News link that redirects to the actual article. When I use the http module and set custom headers, the request does not follow the redirect, so I never reach the article content.
For example, the Google News link is: https://news.google.com/rss/articles/CBMieWh0dHBzOi8vd3d3Lm51bWVyYW1hLmNvbS90ZWNoLzE2NjkzODgteWFubi1sZS1jdW4tbGlhLWdlbmVyYXRpdmUtZXN0LTUwLWZvaXMtbW9pbnMtaW50ZWxsaWdlbnRlLXF1dW4tZW5mYW50LWRlLTQtYW5zLmh0bWzSAQA?oc=5
This link redirects to the actual article URL: https://www.numerama.com/tech/1669388-yann-le-cun-lia-generative-est-50-fois-moins-intelligente-quun-enfant-de-4-ans.html
I’ve tried using the following headers, but I’m still not able to access the article content:
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'CONSENT': 'YES+cb.20220419-08-p0.cs+FX+111'
}
Do you guys have any suggestions on how I can efficiently scrape the article content from the redirected Google News link?
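
One avenue that might be worth checking, sketched here in Python under stated assumptions: the last path segment of the example link looks like URL-safe base64, and decoding it appears to yield a few binary bytes wrapped around the article URL itself. If that holds, the target URL could be recovered without any HTTP request. The function name decode_google_news_url is hypothetical, and the assumption that the URL is embedded in the token is based only on the example link above. It is not a documented Google behaviour and may not hold for other links.

```python
import base64
import re
from typing import Optional
from urllib.parse import urlparse


def decode_google_news_url(link: str) -> Optional[str]:
    """Try to recover an article URL embedded in a Google News link.

    Assumption (not documented by Google): the last path segment is
    URL-safe base64 that wraps the plain article URL in a few binary bytes.
    Returns None when no URL can be found in the decoded bytes.
    """
    # Last path segment, e.g. "CBMi..."; urlparse keeps "?oc=5" out of the path.
    token = urlparse(link).path.rstrip("/").split("/")[-1]
    # Restore the "=" padding that is stripped from the token in the link.
    padded = token + "=" * (-len(token) % 4)
    try:
        raw = base64.urlsafe_b64decode(padded)
    except ValueError:
        return None
    # Pick out the first run of printable ASCII that starts like a URL.
    match = re.search(rb"https?://[\x21-\x7e]+", raw)
    return match.group(0).decode("ascii") if match else None


example_link = (
    "https://news.google.com/rss/articles/"
    "CBMieWh0dHBzOi8vd3d3Lm51bWVyYW1hLmNvbS90ZWNoLzE2NjkzODgteWFubi1sZS1jdW4t"
    "bGlhLWdlbmVyYXRpdmUtZXN0LTUwLWZvaXMtbW9pbnMtaW50ZWxsaWdlbnRlLXF1dW4tZW5m"
    "YW50LWRlLTQtYW5zLmh0bWzSAQA?oc=5"
)
print(decode_google_news_url(example_link))
```

If the function returns None for a given link, that link presumably does not embed the URL this way, and fetching the page to look for the redirect target would still be needed.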