Goal:
Read an HTML string and find a particular node using some kind of query, as I would in JavaScript with document.querySelector, with XPath, or in any other language. I need to get the exact node.
Tried:
ChatGPT: Too non-deterministic. I get different results with the same prompt and input.
Convert to text, then use regular expressions: I get results from the wrong places and cannot add more logic to refine them.
Ideal solution: Convert HTML to XML, then use XPath. Or execute JavaScript code.
Unacceptable solutions: Invoking a paid service to convert HTML to XML. I expect built-in modules to handle this.
You can either use the Text Parser “Match Pattern” module, or the XML “Perform XPath Query” module. If you can’t do it then it’s probably a misconfiguration, or too advanced for you.
Alternatively, for web scraping, some apps you can use to get content from the page are ScrapingBee and ScrapeNinja.
I’ve used ScrapeNinja, and you can use jQuery-like selectors there in the extractor function.
ScrapeNinja can also run the page in a web browser, so it closely emulates what users see, as opposed to just the raw page HTML fetched by the HTTP module.
If you want an example, take a look at Grab data from page and url - #5 by samliew
samliew – request private consultation
Join the Make Fans Discord server to chat with other makers!
I’ve tried the Match Pattern module and I’m getting results, but they’re unreliable.
For instance, I need to match a date, but sometimes the HTML contains several dates in different places, and it’s impossible to identify which match is the one I need (regex alone doesn’t capture the context).
If I could convert the HTML to XML, I would definitely use XPath. But I can’t find a module to convert HTML to XML (I would use XSLT if I had to do it myself, but there is no XSLT module either).
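To illustrate the "convert HTML to something XML-like, then query it with XPath" idea outside of any particular module, here is a minimal sketch in Python using only the standard library. It is an assumption-laden illustration, not a Make feature: the `TreeBuilder` class, the sample HTML, and the label "Delivery date" are all hypothetical, and the standard library's `ElementTree` supports only a limited XPath subset.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Hypothetical helper: builds an ElementTree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # synthetic wrapper element
        self.stack = [self.root]        # currently open elements

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        cur = self.stack[-1]
        if len(cur):  # text that follows an already closed child element
            cur[-1].tail = (cur[-1].tail or "") + data
        else:
            cur.text = (cur.text or "") + data


# Sample input (made up): several dates, only one of which is wanted.
html = """
<p>Sent: 2024-01-02</p>
<table>
  <tr><td class="label">Order date</td><td class="value">2024-01-05</td></tr>
  <tr><td class="label">Delivery date</td><td class="value">2024-01-12</td></tr>
</table>
"""

builder = TreeBuilder()
builder.feed(html)
builder.close()

# Select the value cell of the row whose label cell reads "Delivery date".
node = builder.root.find(".//tr[td='Delivery date']/td[@class='value']")
delivery = node.text if node is not None else None
print(delivery)
```

The point of the sketch is that the query is tied to the row structure, so the other dates in the document are never candidates.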
If you need further assistance, please provide the following:
1. Screenshots of module fields and filters
Please share screenshots of the relevant module fields and filters in question. It would really help other community members to see what you’re looking at.
You can upload images here using the Upload icon in the text editor.
2. Scenario blueprint
Please export the scenario blueprint file to allow others to view the mappings and settings. At the bottom of the scenario editor, you can click on the three dots to find the Export Blueprint menu item.
(Note: Exporting your scenario will not include private information or keys to your connections)
Uploading it here will look like this:
blueprint.json (12.3 KB)
3. And most importantly, Input/Output bundles
Please provide the input and output bundles of the trigger/iterator/aggregator modules by running the scenario (or get from the scenario History tab), then click the white speech bubble on the top-right of each module and select “Download input/output bundles”.
A. Save each bundle’s contents in your text editor as a bundle.txt file, and upload it here into this discussion thread.
Uploading them here will look like this:
module-1-output-bundle.txt (12.3 KB)
B. If you are unable to upload files on this forum, you can alternatively paste the formatted bundles in one of these ways:
- Either add three backticks ``` before and after the code, like this:
```
input/output bundle content goes here
```
- Or use the format code button in the editor.
Providing the input/output bundles will allow others to replicate what is going on in the scenario even if they do not use the external service.
Following these steps will allow others to assist you here. Thanks!
I’ve signed up for ScrapeNinja. Do you know if I can provide the raw HTML as input instead of a URL? My content comes from an email, not a public website.
Hmm, no then. Those are web scrapers only. You didn’t mention the HTML was from an email earlier.
A regex should be reliable enough, provided you are skilled enough to build one.
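One way a regex can pick up context is to anchor it on a nearby label and allow only tags and whitespace between the label and the value. Below is a minimal sketch in Python, assuming a made-up email body in which the wanted date sits in the cell right after a "Delivery date" label; the same pattern idea should carry over to other regex engines, though syntax details may differ.

```python
import re

# Sample input (made up): three dates, only the delivery date is wanted.
html = """
<p>Sent: 2024-01-02</p>
<table>
  <tr><td>Order date</td><td>2024-01-05</td></tr>
  <tr><td>Delivery date</td><td>2024-01-12</td></tr>
</table>
"""

# Match the label first, then skip only tags and whitespace before the date,
# so dates elsewhere in the document cannot match.
pattern = re.compile(r"Delivery date\s*(?:<[^>]+>\s*)*(\d{4}-\d{2}-\d{2})")

match = pattern.search(html)
delivery_date = match.group(1) if match else None
print(delivery_date)
```

This only works when the label text is stable and sits directly before the value; if the layout varies between emails, the pattern needs to be adjusted for each variant.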