How to filter out similar articles from RSS feeds

Hi,

I’m trying to find a way to filter out “duplicate” posts and would like to ask for your help. Here is an example:

Feed 1. /the-bachelor-joey-graziadei-interview-season-28-spoilers/
Feed 2. /bachelor-season-28-joey-graziadei-boring-fan-reactions/
Feed 3. /reality-tv/the-bachelor-joey-graziadei-in-tears-as-he-reveals-biggest-fear-about-season-28/

I would like to add a filter before posting that checks, for example, whether any of the previous X article titles contains 5 or more of the same words as the latest title.

I upload images to Google Drive and rename them to the article titles, so the images carry the same titles too. Example: the-bachelor-joey-graziadei-interview-season-28-spoilers.jpg

I was thinking that I could fetch the last 10 image names from Drive and add a function so that a new image title cannot contain more than 5 of the same words as any of the last 10 uploaded images.
This would prevent posting similar articles… Can somebody suggest a solution for this, or a different idea for preventing it, please?

Tomas

Welcome to the Make community!

Yes, that is possible. You’ll need a minimum of two modules:

1. Normalize both URLs

First, replace slashes with hyphens and make all letters lowercase, so that words are compared case-insensitively.

2. Next, calculate how many duplicate words are removed

Calculate the difference between the original number of words and the number of unique words.

3. Lastly, add a filter to allow non-duplicate items through

Screenshot_2024-02-02_100220
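
For readers who think in code, the three steps above can be sketched roughly as follows. This is Python rather than Make formulas, so it is only an illustration of the logic, not the blueprint itself. The function names and the default threshold of 5 are made up for this example, and it assumes the words of both URLs are pooled together before duplicates are removed.

```python
def normalize(url_path: str) -> list:
    """Step 1: lowercase, turn slashes into hyphens, split into words."""
    return [w for w in url_path.lower().replace("/", "-").split("-") if w]


def duplicate_word_count(path_a: str, path_b: str) -> int:
    """Step 2: original number of words minus number of unique words."""
    # Assumption: both paths are pooled into one word list before deduplication.
    words = normalize(path_a) + normalize(path_b)
    return len(words) - len(set(words))


def passes_filter(path_a: str, path_b: str, threshold: int = 5) -> bool:
    """Step 3: let the item through only if fewer than `threshold` words repeat."""
    return duplicate_word_count(path_a, path_b) < threshold
```

With the Feed 1 and Feed 2 paths from the question, the count comes out at 5 (bachelor, joey, graziadei, season, 28), so a threshold of 5 would block the second article. Note that a word repeated inside a single URL would also be counted by this approach.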

Give it a go and let us know if you have any issues!

Note:

The step 2 calculation could be combined into the filter in step 3 if you want to save an operation:

Screenshot_2024-02-02_100210

Step 1 could also be combined directly into the filter if you can wrap your head around it, but I wouldn’t recommend it.

Download Blueprint

blueprint.json (28.3 KB)


Thanks, this looks like it could work, but first I need to get a dynamic list of the latest titles. Watch Files in a Folder from Google Drive has the ACID mark, so I cannot use it in the middle of a scenario; the same goes for Data Store. I am now trying “Search for files” from Google Drive, but I don’t know how to fill in “Query” to get a list of the latest uploaded files.

The idea is to compare the last uploaded file title to the next X uploaded file titles. So if I get the list of the latest uploaded files, let’s say:

Image 1 - Title 1
Image 2 - Title 2
Image 3 - Title 3…

I already use {{replace(30.img title; “/[”“:_/=´]/g”; emptystring)}} to rename images, and I could include lowercasing here for each image. Then I would like to compare Title 1 to Titles 2, 3, 4, … and if none of them matches on 5 of the same words, the scenario continues. It is a bot, so it needs to be dynamic: every time there is a new task to upload an image, I need to fetch the latest files again. :upside_down_face:
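
The comparison described here (one new title against each of the latest X titles) could look roughly like the sketch below. It is plain Python, not a Make module, and it uses set intersection rather than the count-difference trick from the earlier reply. `title_words`, `is_too_similar` and `min_shared` are hypothetical names for this example, and it assumes titles are separated by spaces, slashes or hyphens with any file extension such as .jpg already removed.

```python
import re


def title_words(title: str) -> set:
    """Lowercase a title or slug and split it on spaces, slashes and hyphens."""
    return {w for w in re.split(r"[\s/-]+", title.lower()) if w}


def is_too_similar(new_title: str, recent_titles: list, min_shared: int = 5) -> bool:
    """True if the new title shares at least `min_shared` words with any recent title."""
    new_words = title_words(new_title)
    return any(
        len(new_words & title_words(old)) >= min_shared
        for old in recent_titles
    )
```

The scenario would then continue only when `is_too_similar` is false for the freshly fetched list of titles.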

Hi, I tried the Search for Files module and it is fetching the latest images and titles. I would like to ask for your help with two things:
First, I want to clean up the titles: replace all of these [“”:|_/=?!"] with emptystring, keep only the apostrophe, and replace all blank spaces with “-”. I tried this, but it only makes everything lowercase.
make rename title

If I keep only ;"; there, it removes only the first " and not the second ".

Second thing: I can fetch the “names” of the images. I tried the Table Aggregator and it put 10 bundles into one. The Iterator makes each bundle No. 1. I need each of the 10 bundles to be selectable as a variable. What module should I use for this, please? Then I will be able to use the Set Multiple Variables module, pick each bundle as a value, and use your method.


You need to use a regular expression pattern with the global flag.

/[“”":|_/=?!]/g
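
To show the difference between replacing only the first match and replacing every match, here is a small Python sketch of the same cleanup. It is an analogy only: Python's `re.sub` replaces all matches by default, so the "first match only" behaviour is reproduced with `count=1`. The function names are invented for the example, and the hyphen and lowercase steps are taken from the question above.

```python
import re

# Characters to strip; the apostrophe is deliberately left out of the class.
UNWANTED = re.compile(r'[“”":|_/=?!]')


def clean_title(title: str) -> str:
    """Remove every unwanted character, hyphenate whitespace, lowercase."""
    stripped = UNWANTED.sub("", title)  # replaces all matches, like the g flag
    hyphenated = re.sub(r"\s+", "-", stripped.strip())
    return hyphenated.lower()


def remove_first_quote_only(title: str) -> str:
    """For contrast: count=1 stops after the first match."""
    return re.sub('"', "", title, count=1)
```

`remove_first_quote_only` leaves the second quote in place, which matches the behaviour seen without the global flag.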


I don’t understand this question. Perhaps table aggregator isn’t the right aggregator for this?

Every result (item/record) from a search/match module will output a bundle. To “combine” them into a single structure, you’ll need to use an aggregator of some sort.

Aggregators are modules that accumulate multiple bundles into one single bundle. An example of a commonly used aggregator module is the Array aggregator module. The next most popular aggregator is the Text Aggregator, which is very flexible and applies to many use cases.
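
As a loose analogy in Python (not Make's actual implementation), you can picture each bundle as a small record and an aggregator as something that collects one field from all of them into a single value. The `name` field, the sample file names and both function names below are assumptions made up for this illustration.

```python
# Hypothetical bundles, as a search module might emit them one by one.
SAMPLE_BUNDLES = [
    {"name": "title-1.jpg"},
    {"name": "title-2.jpg"},
    {"name": "title-3.jpg"},
]


def array_aggregate(bundles: list, field: str) -> list:
    """Collect one field from every bundle into a single list."""
    return [bundle[field] for bundle in bundles]


def text_aggregate(bundles: list, field: str, separator: str = "\n") -> str:
    """Join one field from every bundle into a single string."""
    return separator.join(str(bundle[field]) for bundle in bundles)
```

Once the names sit in one list, each entry can be picked out by position, for example the second title with `array_aggregate(SAMPLE_BUNDLES, "name")[1]`.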

There are other types of aggregator modules as well.


Hi, thank you, the regular expression works. I know the aggregator is wrong; I’m looking for a module that keeps all the bundles. The first picture shows the bundles from the Search module. In the second picture I need to have them instead of “text”, so I can pick each of them as a value for functions.


In your diffLenth calculation there are “6. item1” and “6. item2”. Instead I would like to put “Bundle 1 (title 1)” and “Bundle 2 (title 2)” there. The variable value in your first picture would then not be /the-bachelor-joey-graziadei-interview-season-28-spoilers/ but “Bundle 1 (title)”…