How should I remove duplicated output bundles?

Hi all,

My scenario extracts texts from images. I use an OpenAI module for the text extraction. After the text extraction is done, I use the Text Aggregator module to get the following output:

As you can see, the text extracted from 0002.jpg, 0003.jpg, and 0004.jpg is the same as the one extracted from 0001.jpg. Likewise, the text extracted from 0006.jpg and 0007.jpg is the same as the one extracted from 0005.jpg.

So, I would like to remove the portions enclosed in red boxes. I use the Text Aggregator module to create the above result as shown below:

I tried things after reading related topics here, but I can’t find a solution to remove the duplicated texts from the Text aggregator output.

It would be great if anyone here can help. Thank you!

@Kaz_Suzuki, have you tried asking ChatGPT to filter out those duplicate texts? You could also ask ChatGPT for a regex solution: paste the output into ChatGPT and explain what you’re trying to do, then use the ‘Match pattern’ module (text = output text, pattern = the pattern given by GPT).

Thank you very much for the reply. I tried, but I was not able to solve my issue by using Match Pattern + regex.

Thank you all who read my question and tried to come up with a solution.

I ended up using an Array Aggregator right after the Text Aggregator, and then used the distinct function to deduplicate. The Text Aggregator might not be needed, but it’s part of a somewhat complex scenario and I didn’t want to break anything, so I haven’t tried removing it yet (I will probably try later).

Thank you!


Would the function deduplicate() work here?

Never used it, but it might be worth a shot. It probably only works with simple arrays.


Thank you for the reply! I think deduplicate() is for a simple array whose items don’t have any other attributes. In this case, each array item includes the text and its associated image filename, so I used distinct() keyed on the text, like “distinct(; text)”.
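
For anyone who finds this later, here is a rough Python sketch of the difference between the two approaches. It is not Make's implementation: `distinct_by` and `deduplicate` are hypothetical helpers written for illustration, the sample texts are made up, and keeping the first occurrence of each text is an assumption about how distinct() behaves.

```python
def deduplicate(values):
    """Remove repeated values from a simple array, keeping first occurrences."""
    return list(dict.fromkeys(values))


def distinct_by(items, key):
    """Keep the first item for each distinct value of `key`, preserving order.

    Assumption: this mirrors distinct(array; key) by keeping the first match.
    """
    seen = set()
    result = []
    for item in items:
        value = item[key]
        if value not in seen:
            seen.add(value)
            result.append(item)
    return result


# Hypothetical bundles shaped like the Array Aggregator output in this thread:
# each item carries the extracted text plus the image filename.
bundles = [
    {"filename": "0001.jpg", "text": "First document text"},
    {"filename": "0002.jpg", "text": "First document text"},
    {"filename": "0003.jpg", "text": "First document text"},
    {"filename": "0004.jpg", "text": "First document text"},
    {"filename": "0005.jpg", "text": "Second document text"},
    {"filename": "0006.jpg", "text": "Second document text"},
    {"filename": "0007.jpg", "text": "Second document text"},
]

unique_bundles = distinct_by(bundles, "text")
for bundle in unique_bundles:
    print(bundle["filename"], "->", bundle["text"])
```

Because the filenames differ, no two items are identical as a whole, which is why the comparison has to be made on the text key rather than on the entire item.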

Just wanted to comment that as well. Distinct is the correct one ^^