I have downloaded a page’s HTML using an HTTP GET request, and now I’m trying to extract the page’s category tags, which are shown in the text.
Here’s an example source URL: Example page (see Categories towards the bottom, above the map).
If the page categories include ‘Accommodation’, ‘Guest House-B&B’, ‘Restaurant’, etc., then I need to extract them into a list that I can then populate into a Google Sheet.
So far, I’ve got the HTML fragment that contains the category information (using ChatGPT to create the regex), but I cannot seem to extract just the category text.
The expression I’m using to extract the phrase so far is:
Categories:</strong>\s*((?:<[^>]+>[^<]+</[^>]+>\s*,?\s*)+)
And that’s giving me this output:
<a href="https://www.discovercarlisle.co.uk/eat-drink/category/accomodation" class="Accomodation EDNcategorycolor-default">Accomodation</a>, <a href="https://www.discovercarlisle.co.uk/eat-drink/category/guest-house-bb" class="Guest_House-B_B EDNcategorycolor-default">Guest House-B&B</a>
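For anyone who wants to reproduce this first step outside the scenario, here is a minimal Python sketch. It assumes Python's `re` module treats the pattern the same way the text parser does, which may not hold. The function name `find_category_block` and the `<p>`/`<strong>` markup wrapped around the links are made up for illustration; only the two `<a>` tags come from the output above.

```python
import re

# The first pattern from the post, unchanged.
CATEGORY_BLOCK = re.compile(
    r'Categories:</strong>\s*((?:<[^>]+>[^<]+</[^>]+>\s*,?\s*)+)'
)

def find_category_block(page_html):
    """Return the run of category links after 'Categories:', or None."""
    match = CATEGORY_BLOCK.search(page_html)
    return match.group(1) if match else None

# Hypothetical page fragment: the surrounding <p>/<strong> markup is assumed.
sample_html = (
    '<p><strong>Categories:</strong> '
    '<a href="https://www.discovercarlisle.co.uk/eat-drink/category/accomodation" '
    'class="Accomodation EDNcategorycolor-default">Accomodation</a>, '
    '<a href="https://www.discovercarlisle.co.uk/eat-drink/category/guest-house-bb" '
    'class="Guest_House-B_B EDNcategorycolor-default">Guest House-B&B</a></p>'
)

block = find_category_block(sample_html)
print(block)  # the two <a> tags, separated by ", "
```

The capture group holds the whole run of links, markup included, so a further step is still needed to get the bare category names.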
Note: The categories will change for each URL I do a HTTP request for.
In my scenario I need to do further text parsing to get just the category text, so I’m trying to reduce this output with a second text parser. (I imagine there’s a more succinct way to do this, but I’m new to this.) The second expression is:
>([^<]+)<
…and that’s not working either, despite what ChatGPT says. The result of the second text parser is empty.
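As a point of comparison, here is a minimal Python sketch of the second step, run against the output shown earlier. It assumes Python's `re` module is a fair stand-in for the text parser's regex engine, which may not be true, so it does not explain the empty result in the scenario. The function name `extract_categories` and the tighter pattern `<a\b[^>]*>([^<]+)</a>` are illustrative choices, not something taken from the blueprint.

```python
import re

# The fragment produced by the first expression (copied from the post).
fragment = (
    '<a href="https://www.discovercarlisle.co.uk/eat-drink/category/accomodation" '
    'class="Accomodation EDNcategorycolor-default">Accomodation</a>, '
    '<a href="https://www.discovercarlisle.co.uk/eat-drink/category/guest-house-bb" '
    'class="Guest_House-B_B EDNcategorycolor-default">Guest House-B&B</a>'
)

# The second pattern from the post: it matches any text between '>' and '<',
# so it also picks up the ", " separator between the two links.
loose_matches = re.findall(r'>([^<]+)<', fragment)

def extract_categories(html_fragment):
    """Return the text of every <a>...</a> element in the fragment."""
    # Hypothetical tighter pattern: capture text only inside anchor tags.
    return [text.strip() for text in re.findall(r'<a\b[^>]*>([^<]+)</a>', html_fragment)]

categories = extract_categories(fragment)
print(loose_matches)  # ['Accomodation', ', ', 'Guest House-B&B']
print(categories)     # ['Accomodation', 'Guest House-B&B']
```

Both patterns need every match returned rather than only the first one, which is why the sketch uses `re.findall`. Whether the text parser returns all matches by default is something I have not confirmed.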
blueprint (1).json (47.9 KB)
Any ideas how I can get just the data back that I need?