How should I remove duplicated output bundles?

Kaz_Suzuki · December 17, 2024, 11:33pm

Hi all,

My scenario extracts texts from images. I use an OpenAI module for the text extraction. After the text extraction is done, I use the Text Aggregator module to get the following output:

As you can see, the text extracted from 0002.jpg, 0003.jpg, and 0004.jpg is the same as the text extracted from 0001.jpg. Likewise, the text extracted from 0006.jpg and 0007.jpg is the same as the text extracted from 0005.jpg.

So, I would like to remove the portions enclosed with red boxes. I use the Text aggregator module to create the above result as shown below:

I tried things after reading related topics here, but I can’t find a solution to remove the duplicated texts from the Text aggregator output.

It would be great if anyone here can help. Thank you!

kudracha · December 18, 2024, 3:34pm

@Kaz_Suzuki, have you tried asking ChatGPT to filter out those duplicate texts? You could also ask ChatGPT for a possible regex solution. Just paste the output into ChatGPT and explain what you’re trying to do, then use the ‘Match pattern’ module (text = output text, pattern = pattern given by GPT).

Kaz_Suzuki · December 18, 2024, 6:43pm

Thank you very much for the reply. I tried, but I was not able to solve my issue by using Match Pattern + regex.

Kaz_Suzuki · December 18, 2024, 6:47pm

Thank you all who read my question and tried to come up with a solution.

I ended up using an Array Aggregator right after the Text Aggregator, and then used the distinct function to deduplicate. The Text Aggregator might not be needed, but it’s part of a somewhat complex scenario and I didn’t want to break anything. So I haven’t tried removing the Text Aggregator yet (I will probably try later).

Thank you!

Juliusforster · December 18, 2024, 6:49pm

Would the function deduplicate() work here?

I’ve never used it, but it might be worth a shot. It probably only works with simple arrays.

Kaz_Suzuki · December 18, 2024, 7:01pm

Thank you for the reply! I think deduplicate() is for a simple array whose items don’t have any other attributes. In this case, each array item includes the text and its associated image filename. So, I used distinct() like “distinct(; text)”.
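
For readers who want to see the difference spelled out, here is a rough Python analogue of the two approaches. It is only a sketch, not Make's implementation: the helper names `deduplicate` and `distinct_by`, the sample texts, and the keep-the-first-match behaviour are all assumptions made for illustration.

```python
def deduplicate(items):
    """Remove repeated values from a list of simple values, keeping the first of each."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def distinct_by(items, key):
    """Keep the first item for each value of `key` in a list of dicts (assumed behaviour)."""
    seen = set()
    result = []
    for item in items:
        value = item.get(key)
        if value not in seen:
            seen.add(value)
            result.append(item)
    return result


# Hypothetical bundles: each one carries the extracted text plus its image filename.
bundles = [
    {"filename": "0001.jpg", "text": "First page text"},
    {"filename": "0002.jpg", "text": "First page text"},
    {"filename": "0003.jpg", "text": "First page text"},
    {"filename": "0004.jpg", "text": "First page text"},
    {"filename": "0005.jpg", "text": "Second page text"},
    {"filename": "0006.jpg", "text": "Second page text"},
    {"filename": "0007.jpg", "text": "Second page text"},
]

# Comparing on "text" alone collapses the repeats even though the filenames differ.
unique_bundles = distinct_by(bundles, "text")
unique_filenames = [b["filename"] for b in unique_bundles]

# On a simple list of strings, plain deduplication is enough.
unique_texts = deduplicate([b["text"] for b in bundles])
```

Because every bundle has a different filename, the bundles are never identical as a whole, which is why the comparison has to be made on the text key only.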

Juliusforster · December 18, 2024, 7:23pm

Just wanted to comment that as well. Distinct is the correct one ^^
