Scrape Content from Redirected Google News Links

Galina_Plavunova · August 20, 2024, 1:24pm

I read some advice about fetching the HTML and then extracting the link from it with a text parser, but the HTML I get for items from the RSS module linked to Google News does not contain a direct link. Can this problem be solved? How can I retrieve a direct link from the RSS module linked to Google News so that I can scrape the website further? Thanks in advance.

samliew · August 21, 2024, 12:55am

Welcome to the Make community!

To allow others to assist you with your scenario, please provide the following:

1. Relevant Screenshots

Please share screenshots of your scenario, any error messages, relevant module fields, and the filters in question. It would really help other community members to see what you’re looking at.

You can upload images here using the Upload icon in the text editor.

2. Scenario Blueprint

Please export the scenario blueprint file to allow others to view the mapped variables in the module fields. At the bottom of the scenario editor, you can click on the three dots to find the Export Blueprint menu item.

3. Output Bundles of Modules

Please provide the output bundles of the modules by running the scenario (or get them from the scenario History tab), then click the white speech bubble on the top-right of each module and select “Download input/output bundles”.

A. Upload as Text File

Save each bundle's contents in your text editor as a bundle.txt file, and upload it here into this discussion thread.

B. Insert as Formatted Code Block

If you are unable to upload files on this forum, alternatively you can paste the formatted bundles.
These are the two ways to format text so that it won’t be modified by the forum:

Method 1: Type code block manually

Add three backticks ``` before and after the content/bundle, like this:

```
content goes here
```

Method 2: Highlight and click the format button in the editor

Providing the input/output bundles will allow others to replicate what is going on in the scenario even if they do not use the external service.

Following these steps will allow others to assist you here. Thanks!

Galina_Plavunova · August 21, 2024, 2:54pm

Here are the output bundles. The first module is RSS, then I retrieve the HTML with the HTTP module, and then I try to extract the original links with the text parser. The problem is that the original links are not in the HTML, and “follow redirected links” also does not work with Google News links.
3_text_parser_output.txt (5.5 KB)
2_http_module_output.txt (261.2 KB)
1_rss_module_output.txt (4.7 KB)
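
For readers following along outside Make, the text-parser step described above can be approximated with the minimal Python sketch below. It is an illustration only, not the scenario's actual configuration: the names `LinkCollector` and `external_links` and the sample HTML are hypothetical. It collects every `<a href>` in a page and drops links that point back to Google. As the post notes, if the original link is absent from the HTML that the HTTP module returns, a filter like this returns an empty list.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html, skip_domains=("google.com",)):
    """Return absolute http(s) links whose host is not under a skipped domain."""
    collector = LinkCollector()
    collector.feed(html)
    result = []
    for link in collector.links:
        parsed = urlparse(link)
        host = parsed.hostname or ""
        # Ignore relative links and non-web schemes such as mailto:
        if parsed.scheme not in ("http", "https"):
            continue
        # Ignore the skipped domains and all of their subdomains
        if any(host == d or host.endswith("." + d) for d in skip_domains):
            continue
        result.append(link)
    return result


# Hypothetical sample; a real Google News page may contain no publisher link at all
sample_html = """
<a href="https://news.google.com/rss/articles/EXAMPLE">Google News wrapper</a>
<a href="https://example.org/story">Publisher page</a>
<a href="/relative/path">Relative link</a>
"""
print(external_links(sample_html))
```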

Galina_Plavunova · August 29, 2024, 9:03am

Does anyone have an idea how to get the original link from a Google News page?

samliew · August 30, 2024, 7:50am

I’ve found a method: you have to use a hosted scraper like DumplingAI or ScrapeNinja.

Here is an example:

[Screenshot: Screenshot_2024-08-30_150854]

Example Output

Others

I’ve used ScrapeNinja, which lets you use jQuery-like selectors in its extractor function, although this is more advanced. ScrapeNinja can also run the page in a real web browser, loading all the content and running the page-load scripts, so the result closely matches what you see, as opposed to the raw page HTML fetched by the HTTP module. If you want a ScrapeNinja example, take a look at Grab data from page and url - #5 by samliew
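
As a rough illustration of what an extraction step could do once a hosted scraper has returned the fully rendered article page, here is a minimal Python sketch. It is not ScrapeNinja's or DumplingAI's API. It rests on the assumption that the rendered publisher page declares its own address in a `<link rel="canonical">` tag or an `og:url` meta tag, which many news sites do but which is not guaranteed. The names `CanonicalFinder` and `find_direct_link` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the first canonical link and the first og:url meta tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            self.canonical = self.canonical or attr.get("href")
        elif tag == "meta" and attr.get("property") == "og:url":
            self.og_url = self.og_url or attr.get("content")


def find_direct_link(rendered_html):
    """Return the page's self-declared URL, or None if it declares none."""
    finder = CanonicalFinder()
    finder.feed(rendered_html)
    # Prefer the canonical link; fall back to the Open Graph URL
    return finder.canonical or finder.og_url


# Hypothetical rendered page, standing in for the scraper's output
rendered = """
<html><head>
<link rel="canonical" href="https://example.org/story">
<meta property="og:url" content="https://example.org/story?ref=og">
</head><body><p>Article text</p></body></html>
"""
print(find_direct_link(rendered))
```

If neither tag is present the function returns `None`, in which case the final URL reported by the scraper itself (if it exposes one) would be the thing to check.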

For more information on other methods of web scraping, see Overview of Different Web Scraping Techniques in Make 🌐

Hope this helps! Let me know if there are any further questions or issues.

— @samliew

P.S.: Did you know that the concepts behind about 70% of the questions asked on this forum are already covered in the Make Academy? Investing some effort in it will save you lots of time and frustration when using Make later!

Galina_Plavunova · September 2, 2024, 12:32pm

super, thank you so much!!!
