Get the 'Scraptio' module to pull back a list of links on a website that contain the word "about"

Hi there,

Hopefully an easy one - I’m trying out Scraptio and, after passing it the URL of a website, I want it to return all (full) links on that site that contain the word ‘about’ (as an example).

I’m using the ‘Scrape website links’ part of it, so I assume this will do the trick, but I’ve absolutely no idea what I’m supposed to add into it to get it to give me back the links I want. Here’s a screenshot. Can anyone help?

Welcome to the Make community!

Yes, that is possible. You’ll need a minimum of two modules: an HTTP module that fetches the page, and a Text Parser module that extracts the links from the response.

You can use a Text Parser “Match Pattern” module with this Pattern (regular expression):

<a[^<">]*?href="(?<url>[^<">]+?)"[^<>]*>(?<text>[^<">]*?about[^<">]*?)<\/a>

Proof

https://regex101.com/r/vZXzRO

Important Info

  • Warning: Global match must be set to YES!
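
To illustrate why Global match matters, here is a minimal Python sketch of the same pattern outside Make. It is not how the Text Parser is implemented; the sample HTML and the function names are made up for illustration, and Python writes named groups as (?P<name>...) where the pattern above uses (?<name>...). A single search stops at the first matching link, while iterating over all matches (roughly what Global match: Yes does) gives one result per link.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for named groups.
# re.IGNORECASE mirrors "Case sensitive: No" in the module export.
PATTERN = re.compile(
    r'<a[^<">]*?href="(?P<url>[^<">]+?)"[^<>]*>(?P<text>[^<">]*?about[^<">]*?)<\/a>',
    re.IGNORECASE,
)

# Hypothetical page content, invented for this example.
SAMPLE_HTML = (
    '<a href="/">Home</a>'
    '<a href="/about-us" class="nav">About us</a>'
    '<a href="/team">Learn about the team</a>'
)


def first_match(html):
    """Like Global match: No. Returns only the first matching link, or None."""
    m = PATTERN.search(html)
    return (m.group("url"), m.group("text")) if m else None


def all_matches(html):
    """Like Global match: Yes. Returns every matching link as (url, text)."""
    return [(m.group("url"), m.group("text")) for m in PATTERN.finditer(html)]


print(first_match(SAMPLE_HTML))
print(all_matches(SAMPLE_HTML))
```

Note that the pattern looks for “about” in the visible link text, not in the href value, which is why the /team link is also matched in this sample.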


For more information, see Text Parser in the Make Help Center:

Match Pattern
The Match pattern module enables you to find and extract string elements matching a search pattern from a given text. The search pattern is a regular expression (aka regex or regexp), which is a sequence of characters in which each character is either a metacharacter, having a special meaning, or a regular character that has a literal meaning.

Hope this helps!

Give it a go and let us know if you have any issues!

Module Export

You can copy and paste this module export into your scenario. This will paste the modules shown in my screenshots above.

  1. Copy the JSON code below by clicking the copy button that appears when you hover over the top-right corner of the code block.

  2. Enter your scenario editor. Press ESC to close any dialogs. Press CTRL+V (the paste keyboard shortcut on Windows) to paste directly onto the canvas.

  3. Click on each imported module and save it for validation. You may be prompted to remap some variables and connections.

View Module Export Code

JSON

{
    "subflows": [
        {
            "flow": [
                {
                    "id": 82,
                    "module": "http:ActionSendData",
                    "version": 3,
                    "parameters": {
                        "handleErrors": false,
                        "useNewZLibDeCompress": true
                    },
                    "mapper": {
                        "url": "https://samliew.com/",
                        "serializeUrl": false,
                        "method": "get",
                        "headers": [],
                        "qs": [],
                        "bodyType": "raw",
                        "parseResponse": false,
                        "authUser": "",
                        "authPass": "",
                        "timeout": "",
                        "shareCookies": false,
                        "ca": "",
                        "rejectUnauthorized": true,
                        "followRedirect": true,
                        "useQuerystring": false,
                        "gzip": true,
                        "useMtls": false,
                        "contentType": "",
                        "data": "",
                        "followAllRedirects": false
                    },
                    "metadata": {
                        "designer": {
                            "x": -725,
                            "y": -2086
                        },
                        "restore": {
                            "expect": {
                                "method": {
                                    "mode": "chose",
                                    "label": "GET"
                                },
                                "headers": {
                                    "mode": "chose",
                                    "collapsed": true
                                },
                                "qs": {
                                    "mode": "chose",
                                    "collapsed": true
                                },
                                "bodyType": {
                                    "collapsed": true,
                                    "label": "Raw"
                                },
                                "parseResponse": {
                                    "collapsed": true
                                },
                                "contentType": {
                                    "collapsed": true,
                                    "label": "Empty"
                                },
                                "data": {
                                    "collapsed": true
                                }
                            }
                        },
                        "parameters": [
                            {
                                "name": "handleErrors",
                                "type": "boolean",
                                "label": "Evaluate all states as errors (except for 2xx and 3xx )",
                                "required": true
                            },
                            {
                                "name": "useNewZLibDeCompress",
                                "type": "hidden"
                            }
                        ],
                        "expect": [
                            {
                                "name": "url",
                                "type": "url",
                                "label": "URL",
                                "required": true
                            },
                            {
                                "name": "serializeUrl",
                                "type": "boolean",
                                "label": "Serialize URL",
                                "required": true
                            },
                            {
                                "name": "method",
                                "type": "select",
                                "label": "Method",
                                "required": true,
                                "validate": {
                                    "enum": [
                                        "get",
                                        "head",
                                        "post",
                                        "put",
                                        "patch",
                                        "delete",
                                        "options"
                                    ]
                                }
                            },
                            {
                                "name": "headers",
                                "type": "array",
                                "label": "Headers",
                                "spec": [
                                    {
                                        "name": "name",
                                        "label": "Name",
                                        "type": "text",
                                        "required": true
                                    },
                                    {
                                        "name": "value",
                                        "label": "Value",
                                        "type": "text"
                                    }
                                ]
                            },
                            {
                                "name": "qs",
                                "type": "array",
                                "label": "Query String",
                                "spec": [
                                    {
                                        "name": "name",
                                        "label": "Name",
                                        "type": "text",
                                        "required": true
                                    },
                                    {
                                        "name": "value",
                                        "label": "Value",
                                        "type": "text"
                                    }
                                ]
                            },
                            {
                                "name": "bodyType",
                                "type": "select",
                                "label": "Body type",
                                "validate": {
                                    "enum": [
                                        "raw",
                                        "x_www_form_urlencoded",
                                        "multipart_form_data"
                                    ]
                                }
                            },
                            {
                                "name": "parseResponse",
                                "type": "boolean",
                                "label": "Parse response",
                                "required": true
                            },
                            {
                                "name": "authUser",
                                "type": "text",
                                "label": "User name"
                            },
                            {
                                "name": "authPass",
                                "type": "password",
                                "label": "Password"
                            },
                            {
                                "name": "timeout",
                                "type": "uinteger",
                                "label": "Timeout",
                                "validate": {
                                    "max": 300,
                                    "min": 1
                                }
                            },
                            {
                                "name": "shareCookies",
                                "type": "boolean",
                                "label": "Share cookies with other HTTP modules",
                                "required": true
                            },
                            {
                                "name": "ca",
                                "type": "cert",
                                "label": "Self-signed certificate"
                            },
                            {
                                "name": "rejectUnauthorized",
                                "type": "boolean",
                                "label": "Reject connections that are using unverified (self-signed) certificates",
                                "required": true
                            },
                            {
                                "name": "followRedirect",
                                "type": "boolean",
                                "label": "Follow redirect",
                                "required": true
                            },
                            {
                                "name": "useQuerystring",
                                "type": "boolean",
                                "label": "Disable serialization of multiple same query string keys as arrays",
                                "required": true
                            },
                            {
                                "name": "gzip",
                                "type": "boolean",
                                "label": "Request compressed content",
                                "required": true
                            },
                            {
                                "name": "useMtls",
                                "type": "boolean",
                                "label": "Use Mutual TLS",
                                "required": true
                            },
                            {
                                "name": "contentType",
                                "type": "select",
                                "label": "Content type",
                                "validate": {
                                    "enum": [
                                        "text/plain",
                                        "application/json",
                                        "application/xml",
                                        "text/xml",
                                        "text/html",
                                        "custom"
                                    ]
                                }
                            },
                            {
                                "name": "data",
                                "type": "buffer",
                                "label": "Request content"
                            },
                            {
                                "name": "followAllRedirects",
                                "type": "boolean",
                                "label": "Follow all redirect",
                                "required": true
                            }
                        ]
                    }
                },
                {
                    "id": 83,
                    "module": "regexp:Parser",
                    "version": 1,
                    "parameters": {
                        "pattern": "<a[^<\">]*?href=\"(?<url>[^<\">]+?)\"[^<>]*>(?<text>[^<\">]*?about[^<\">]*?)<\\/a>",
                        "global": true,
                        "sensitive": false,
                        "multiline": false,
                        "singleline": false,
                        "continueWhenNoRes": false,
                        "ignoreInfiniteLoopsWhenGlobal": false
                    },
                    "mapper": {
                        "text": "{{toString(82.data)}}"
                    },
                    "metadata": {
                        "designer": {
                            "x": -480,
                            "y": -2081
                        },
                        "restore": {
                            "parameters": {
                                "multiline": {
                                    "collapsed": true
                                },
                                "singleline": {
                                    "collapsed": true
                                },
                                "continueWhenNoRes": {
                                    "collapsed": true
                                }
                            }
                        },
                        "parameters": [
                            {
                                "name": "pattern",
                                "type": "text",
                                "label": "Pattern",
                                "required": true
                            },
                            {
                                "name": "global",
                                "type": "boolean",
                                "label": "Global match",
                                "required": true
                            },
                            {
                                "name": "sensitive",
                                "type": "boolean",
                                "label": "Case sensitive",
                                "required": true
                            },
                            {
                                "name": "multiline",
                                "type": "boolean",
                                "label": "Multiline",
                                "required": true
                            },
                            {
                                "name": "singleline",
                                "type": "boolean",
                                "label": "Singleline",
                                "required": true
                            },
                            {
                                "name": "continueWhenNoRes",
                                "type": "boolean",
                                "label": "Continue the execution of the route even if the module finds no matches",
                                "required": true
                            },
                            {
                                "name": "ignoreInfiniteLoopsWhenGlobal",
                                "type": "boolean",
                                "label": "Ignore errors when there is an infinite search loop",
                                "required": true
                            }
                        ],
                        "expect": [
                            {
                                "name": "text",
                                "type": "text",
                                "label": "Text"
                            }
                        ],
                        "interface": [
                            {
                                "type": "text",
                                "name": "url",
                                "label": "url"
                            },
                            {
                                "type": "text",
                                "name": "text",
                                "label": "text"
                            },
                            {
                                "type": "uinteger",
                                "name": "i",
                                "label": "i"
                            },
                            {
                                "type": "any",
                                "name": "__IMTMATCH__",
                                "label": "Fallback Match"
                            }
                        ]
                    }
                }
            ]
        }
    ],
    "metadata": {
        "version": 1
    }
}

samliewrequest private consultation

Join the Make Fans Discord server to chat with other makers!


Cheers. Since I’m using the Scraptio module (Scrape website links), how does that apply there? In other words, what do I put into that module to achieve the result I’m looking for? Any help is really appreciated.

In the “Tag 1” field, enter the letter a (the HTML anchor tag) to get links.

Then after this module you can perform the text filter.


Thanks Sam - could you give me an example of the exact ‘code’ I’d put in the Scraptio ‘Scrape website links’ module? (Sorry, I’m completely new to this whole filtering stuff!) I’m looking to pull back a list of links from the URL passed into the module that contain the word ‘about’ (as an example). Really appreciated!

Sorry, I’m not very familiar with Scraptio.

For more information, see the Scraptio documentation in the Help Center.

Alternatively, you can use either the solution I provided above, or use ScrapeNinja, or chat in the Discord server.


Thanks Sam, so in the solution you suggested you provided this code:

<a[^<">]*?href="(?<url>[^<">]+?)"[^<>]*>(?<text>[^<">]*?about[^<">]*?)<\/a>

If I was to use that to try and find (and bring back) any links on the homepage of the url (say, www.acme.com) that contained the word “about” - what would that code look like? I think if I see an example I’ll be able to get that to work…

I don’t think Scraptio can do this because I think it only returns links, and not the text like “about”, where you can apply a filter to it afterwards. (I’m not sure because I haven’t used Scraptio before and don’t know what the output looks like).

If you want me to give Scraptio a try, you can request a private consultation.


Oh, sorry - I meant in the ‘text parser’ suggestion you made earlier, ignore scraptio for now. What would be the code to put in the text parser?

I’ve already provided the “Module Export Code”, so you can conveniently paste the modules into your scenario and run it. Please follow the instructions above.


Thanks Sam, I’ll give that a go. Thanks for your help on this.

Oh, just to clarify - I only wanted the actual URL links that contained the word ‘about’ - nothing else. Would your suggestion do this?

Can you provide an example of a real web page and which link you want to extract?


Sure! Let’s say https://www.eurogamer.net/ and I just want the output to be the actual URL of any link that has the word “about” in it (so it would just bring back something like “www.eurogamer.net/about-us”).

I’m just after the URLs - nothing else.

It should work, see proof here https://regex101.com/r/iwzSxJ
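
If the goal is for the URL itself to contain “about”, rather than the link text, one option is to move the keyword test into the href. The pattern below is a hypothetical variant, not the one behind the regex101 link, and the sample HTML is invented. It is sketched in Python, so the named group is spelled (?P<url>...), whereas the Text Parser pattern earlier in the thread uses (?<url>...).

```python
import re

# Hypothetical variant: "about" must appear inside the href value itself.
# Only the url group is captured, since the link text is not needed here.
HREF_PATTERN = re.compile(
    r'<a\b[^<>]*?\shref="(?P<url>[^"<>]*about[^"<>]*)"',
    re.IGNORECASE,
)

# Invented page content for illustration.
SAMPLE_HTML = (
    '<a href="/">Home</a>'
    '<a href="/info/about-us">Company</a>'
    '<a href="/team">Learn about the team</a>'
)


def about_urls(html):
    """Return the href of every link whose URL contains 'about'."""
    return [m.group("url") for m in HREF_PATTERN.finditer(html)]


print(about_urls(SAMPLE_HTML))
```

In this sample only the second link is returned: the third link mentions “about” in its text but not in its URL, so it is skipped.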


Ah, so the expression to put into the text parser would be:

/<a[^<">]*?href="(?<www.eurogamer.net>[^<">]+?)"[^<>]*>\s*(?<text>[^<">]*?about[^<">]*?)\s*<\/a>/gi

… and the output would be the url of the about us page?

(Sorry, completely new to this!)

Nope.

  1. Paste my module export into your scenario.

  2. Change the URL of the HTTP module to https://www.eurogamer.net

  3. Run the scenario.


Ok, will do. Just need to pop out so will look around lunch time. Thanks so much for this!

Screenshot for your convenience


Hi Sam, this is great. I’ve set it up and managed to get it to output a URL into a Google Sheet.

However, the ‘output’ (in this instance) is just “/about-us”, whereas I’d like the output to be the ‘full’ URL of this link (i.e. “www.eurogamer.net/info/about-us”). Is there a way to tweak it to do that?
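
The pattern returns whatever is written in the href attribute, so a site that uses relative links will yield values such as “/about-us”. Below is a minimal sketch of one way to turn that into a full URL, using urljoin from Python’s standard library. The function name is illustrative, and the base URL is assumed to be the page that was fetched.

```python
from urllib.parse import urljoin

# Assumption: the page that was fetched by the HTTP module.
BASE_URL = "https://www.eurogamer.net/"


def to_absolute(base_url, href):
    """Resolve a relative href against the page URL; absolute hrefs pass through."""
    return urljoin(base_url, href)


print(to_absolute(BASE_URL, "/about-us"))
print(to_absolute(BASE_URL, "https://example.com/about"))
```

Inside Make, a comparable approach might be to prepend the site address to the mapped url value in the next module, for example https://www.eurogamer.net{{83.url}}, where 83 stands for whatever ID your Text Parser module has. This assumes the link is root-relative (starts with “/”).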