How to prevent duplicate data in Google Sheets when scraping multiple times daily?

Hi, I’m building an AI automation in Make.com and need some help with preventing duplicate data. Here’s the current setup of my scenario, which has 4 modules:

  1. HTTP (Make a request): Sends a request to the sitemap of nos.nl to retrieve all posts available at that moment.

  2. XML (Parse XML): Parses the XML data from the sitemap.

  3. Iterator: Creates bundles for each post from nos.nl.

  4. Google Sheets (Add a Row): Adds the URL from each bundle to a Google Sheet.

I run this scenario 3 times a day to ensure I catch all updates, but this often results in duplicate URLs being added to my Google Sheet.
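In code terms, steps 1–3 of the scenario amount to fetching the sitemap and pulling out each `<loc>` entry. Here is a minimal Python sketch of that parsing step (using a hard-coded sample instead of a live HTTP request, and assuming the standard sitemap.org XML namespace):

```python
import xml.etree.ElementTree as ET

# A tiny sample in the sitemap.org namespace; in the real scenario this
# text would come from the HTTP module's response body.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://nos.nl/artikel/1</loc></url>
  <url><loc>https://nos.nl/artikel/2</loc></url>
</urlset>"""

def extract_urls(xml_text):
    """Return every <loc> value from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

urls = extract_urls(SAMPLE_SITEMAP)
```

Each URL in that list then becomes one bundle from the Iterator, which is why running three times a day produces duplicates: the sitemap still contains the posts you already stored.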

What I’ve tried:

I was thinking about adding a Google Sheets (Search row) module between the Iterator and Google Sheets (Add a row) modules. I then planned to use a filter between Google Sheets (Search row) and Google Sheets (Add a row) with the condition “URL does not exist.” However, this doesn’t seem to work.

My question:

What’s the best way to configure this scenario to prevent duplicate data from being added to the Google Sheet? Is there a more reliable way to check if a URL already exists in the sheet before adding it?

Hello @EJS12,

What you had would have been a good way to go, but with a different filter.
For your filter operator, try “Text Does Not Contain (case-insensitive)” or “Text Not Equal To (case-insensitive)”.

Also, the way your Scenario is built, it will Search Rows for every item encountered, which potentially uses up a lot of Ops depending on how much data you’re retrieving. Once you get it working, maybe consider an alternative approach to save on Ops.

Alternative approach:
Read the entire sheet once, then use an Array Aggregator to aggregate the URL column into an array.
Then, for each new URL, only add a row if that URL doesn’t already exist in the array.
Better yet, collect each URL that should be added into a new array instead, then push that array to Google Sheets with a single Bulk Add Rows operation.

With this approach, whether you need to add 1 new URL or 20 new URLs, the scenario still uses only around 7 Ops.
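In plain code terms, the dedup step of that approach boils down to a membership check against the set of URLs already in the sheet. A Python sketch (in Make this is done with a filter, not code; the function name is illustrative):

```python
def new_urls(scraped_urls, existing_urls):
    """Return scraped URLs not already present, preserving order."""
    existing = set(existing_urls)
    fresh = []
    for url in scraped_urls:
        if url not in existing:
            fresh.append(url)
            existing.add(url)  # also guards against duplicates within one run
    return fresh
```

Note that adding each kept URL back into the set also protects you if the same URL appears twice in a single sitemap fetch.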

Hi Donald,

Thank you so much for your detailed response and the alternative approach! I really appreciate the thought you put into helping me save ops. I’ve been trying to implement your suggestions and have watched several videos on iterators and aggregators to better understand the process, but I’m still struggling to get it to work.

Here’s my updated setup based on your advice:

  1. HTTP (Make a request): Sends a request to the sitemap of nos.nl to retrieve all posts available at that moment.

  2. XML (Parse XML): Parses the XML data from the sitemap.

  3. Google Sheets (Search Rows): Reads the entire sheet.

  4. Array Aggregator: Aggregates the existing URLs into an array.

  5. Filter: Checks if the aggregated array does not contain the current URLs from the parsed data ({{22.array}} does not contain {{20.urlset.url.loc}}).

  6. Iterator: Creates bundles for the new URLs to add.

  7. Google Sheets (Bulk Add Rows): Adds all new URLs in bulk to the sheet.

Here’s the challenge I’m facing:

• I can’t seem to properly configure the filter between the Array Aggregator and the Iterator to ensure only new URLs are passed through.

• I’m unsure if the Iterator is needed here or if I’m overcomplicating the process by including it before the Bulk Add Rows.

I feel like I’m close but missing something critical. Could you clarify:

  1. The correct configuration for the filter to compare the URLs from the Array Aggregator against the new ones.

  2. Whether the Iterator is necessary in this setup, or if I can go straight from the Array Aggregator to the Bulk Add Rows module. If I’m wrong about the whole sequence, please let me know too.

Thanks again for your time and help—this is all new to me, and your guidance is greatly appreciated!

Best regards,

EJS12


Welcome to the Make community!

You are using a Text Operator to compare an array.

Try using an Array Operator to compare an array instead.

Yes, the type of filter operator IS extremely important.

Hope this helps! Let me know if there are any further questions or issues.

@samliew

P.S.: Investing some effort into the Make Academy will save you lots of time and frustration using Make.
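To illustrate why the operator type matters, here is a hedged Python sketch of the difference (Make’s filter operators are configured in the UI, not code; the semicolon-joined string is just a stand-in for how an array gets flattened when treated as text):

```python
existing = ["https://nos.nl/artikel/12345"]
candidate = "https://nos.nl/artikel/1234"

# Text-style operator: substring check against the array flattened to text.
# This reports a duplicate even though the URL is new (a false positive).
text_says_duplicate = candidate in ";".join(existing)

# Array-style operator: exact membership check. This is correct.
array_says_duplicate = candidate in existing
```

A text operator on an array can silently misbehave like this, which is why the filter must compare the value against the array with an array operator.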

Hey all,

Thanks to @samliew and @Donald_Mitchell for your help so far!

Unfortunately I still can’t figure out how to make it work, despite applying your feedback, doing the corresponding Make.com Academy courses and watching videos about it on YouTube.

I’m doing something wrong but I just can’t figure out what or where… :thinking:

So this is what I want:

My goal is to scrape this sitemap: https://nos.nl/sitemap/index.xml.

What I want to extract from this sitemap is: URL (loc) and Date (publication_date).

I’m only interested in URLs that start with “https://nos.nl/artikel/”.

I want to scrape it 3 times per day and put it in a Sheet.

To prevent duplicates from happening (because I’ll scrape it 3 times per day) I want to filter out URLs that I already have in my Sheet.

Could someone tell me what the right order of modules is? And which specific filters, functions etc. I should use and where?

I have tried many things and made a lot of mistakes (which accidentally cost me thousands of operations… oops…) but still I can’t figure it out. I’m still thinking the order in the image below would be best, but I would love to hear your thoughts.
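For reference, the selection logic described above (keep only article URLs that aren’t in the sheet yet, along with their publication dates) can be sketched in Python. This is only an illustration of the desired behavior; the function name and data shapes are assumptions, not anything from Make:

```python
ARTICLE_PREFIX = "https://nos.nl/artikel/"

def select_new_rows(entries, existing_urls):
    """entries: (url, publication_date) pairs parsed from the sitemap.
    Returns only article URLs not already in the sheet, preserving order."""
    seen = set(existing_urls)
    rows = []
    for url, date in entries:
        if url.startswith(ARTICLE_PREFIX) and url not in seen:
            rows.append((url, date))
            seen.add(url)
    return rows
```

In scenario terms this corresponds to: parse the sitemap, filter on the “artikel” prefix, compare against the aggregated array of existing URLs, and bulk-add only what survives.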

I’m facing a similar issue. I recommend subscribing to my post to get updates… Hopefully we get the help we need :slight_smile: