How to extract JSON from HTML Script ID Tag

Hello Make Community!

My scenario makes an HTTP GET request to a URL, and I’m trying to parse JSON that sits inside a script tag (identified by its ID) on the webpage. I tried using RegEx in the Text Parser, but I’m not getting the desired output. The goal is to have the scenario run daily, find the new records, and send them to a Google Sheet and a Slack channel.

The Script ID tag looks like this -

<script id="__NEW_DATA__" type="application/json">

And this is an example of what the JSON looks like but has many more records than this -

{"props":
  "pageProps": {
    "layoffs": [
      {
        "layoff_date": "2023-05-12T00:00:00.000Z",
        "layoff_ppl_impacted": 340,
        "company_code": "nuro",
        "company_name_if_no_company_code": "",
        "company_url_if_no_company_code": "",
        "layoff_url": "https://techcrunch.com/2023/05/12/autonomous-delivery-startup-nuro-to-lay-off-30-of-workforce/",
        "layoff_title": "Autonomous delivery startup Nuro to lay off 30% of workforce",
        "layoff_pct_impacted": "30.00",
        "website_clean": "nuro.ai",
        "company_short": "Nuro",
        "business_description_short": "Autonomous robotics",
        "related_list_id": "",
        "related_list_title": ""
      },
      {
        "layoff_date": "2023-05-12T00:00:00.000Z",
        "layoff_ppl_impacted": 200,
        "company_code": "veeam",
        "company_name_if_no_company_code": "",
        "company_url_if_no_company_code": "",
        "layoff_url": "https://www.bizjournals.com/columbus/inno/stories/news/2023/05/12/veeam-columbus-job-cuts.html",
        "layoff_title": "Columbus data backup company cut 200 jobs after CTO called tech layoffs a bad strategy",
        "layoff_pct_impacted": null,
        "website_clean": "veeam.com",
        "company_short": "Veeam",
        "business_description_short": "Backup, disaster recovery, and data management",
        "related_list_id": "",
        "related_list_title": ""
      },

Hi @Cameron_Cooper,

You can use a regular expression like this to get the JSON inside the script tag:

<script id="__NEW_DATA__".*?>\s*({.*})\s*<\/script>

It is just a standard ECMAScript-flavor regular expression that captures the content between the opening and closing script tags in the HTML. For your use case, you can match on the ID if it is known, or simply use type="application/json" as part of the regex to get the JSON from the HTML page.
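
For anyone who wants to test the pattern outside Make, here is a minimal Python sketch of the same idea. The sample HTML string and the function name extract_embedded_json are made up for illustration, and Make’s Text Parser uses its own regex engine, so treat this as an approximation rather than what the module does internally.

```python
import json
import re

# Same pattern as above; DOTALL lets "." match newlines inside the JSON.
PATTERN = re.compile(
    r'<script id="__NEW_DATA__".*?>\s*({.*})\s*</script>',
    re.DOTALL,
)

def extract_embedded_json(html: str) -> dict:
    """Return the parsed JSON found inside the __NEW_DATA__ script tag."""
    match = PATTERN.search(html)
    if match is None:
        raise ValueError("script tag with id __NEW_DATA__ not found")
    return json.loads(match.group(1))

# Hypothetical page body, shortened to one record.
sample_html = (
    '<html><body>'
    '<script id="__NEW_DATA__" type="application/json">'
    '{"props": {"pageProps": {"layoffs": '
    '[{"company_short": "Nuro", "layoff_ppl_impacted": 340}]}}}'
    '</script></body></html>'
)

data = extract_embedded_json(sample_html)
layoffs = data["props"]["pageProps"]["layoffs"]
print(layoffs[0]["company_short"])
```

One caveat: the greedy .* runs to the last closing brace that is followed by a closing script tag, so if the page has more script tags after this one, a lazy .*? may be the safer choice.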

My only concern, however, is that the JSON you added in the example is not valid, so you might need to manipulate it before it will work with the Parse JSON module.
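
To see why the snippet as posted would fail, here is a small Python check. The posted example has no opening brace after "props": (and is cut off at the end), so a strict parser rejects it. The strings below are shortened stand-ins for the real page content, and is_valid_json is a hypothetical helper name.

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

# Shape of the example as posted: no "{" after "props":
as_posted = '{"props": "pageProps": {"layoffs": []}}'
# Same data with the brace restored.
repaired = '{"props": {"pageProps": {"layoffs": []}}}'

print(is_valid_json(as_posted))  # False
print(is_valid_json(repaired))   # True
```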

blueprint (39).json (11.0 KB)

Hi @Runcorn

Thanks for the reply and helping out! I tested your suggestion but I’m still not getting any output.

This is what I have for the HTTP request and what I’m seeing in the Text Parser. I’ve also attached a snippet of the HTML code from the website -





Can you paste the HTTP GET module response here?

I want to run it through the parser and see if I need to adjust something based on the output.

Hi @Runcorn

I think I figured out the problem: I was running only the Text Parser module on its own, but when I ran the entire scenario, it was able to parse the JSON.

I am running into another issue: I want to set up a filter so that only the new data gets sent to the Slack channel. For example, if the scenario runs at a certain time of day, I want it to pull the new data from 1 day (24 hours) ago. I’ve tried setting up a date filter with a condition, but nothing is passing through.

This is what the date format and filter condition look like -


This will not work because either of those dates may include hours, minutes, and seconds, so the two values will never be exactly equal.

What you can do instead is change the filter type, or simply perform a string comparison between

formatDate(layoffDate;YYYY-MM-DD) and formatDate(addDays(now;-1);YYYY-MM-DD)
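
The reasoning can be sketched in Python: two timestamps that fall on the same calendar day are rarely equal down to the second, but their YYYY-MM-DD strings are. The helper name and the fixed "now" value are assumptions for illustration; in Make the same comparison is done with formatDate and addDays.

```python
from datetime import datetime, timedelta, timezone

def same_day_as_yesterday(layoff_date: str, now: datetime) -> bool:
    """Compare only the YYYY-MM-DD part of layoff_date with (now - 1 day)."""
    # The feed uses ISO 8601 with a trailing "Z"; swap it for an explicit offset.
    parsed = datetime.fromisoformat(layoff_date.replace("Z", "+00:00"))
    yesterday = now - timedelta(days=1)
    return parsed.strftime("%Y-%m-%d") == yesterday.strftime("%Y-%m-%d")

# Fixed "now" so the example is reproducible (hypothetical run time).
now = datetime(2023, 5, 13, 9, 30, 15, tzinfo=timezone.utc)
layoff = "2023-05-12T00:00:00.000Z"

parsed = datetime.fromisoformat(layoff.replace("Z", "+00:00"))
print(parsed == now - timedelta(days=1))   # False: times of day differ
print(same_day_as_yesterday(layoff, now))  # True: calendar dates match
```

This sketch keeps both values in UTC; if the scenario runs in another time zone, the calendar date of "now" can shift by a day.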

Hi @Runcorn

Does this look right? I changed the addDays value to -4 to capture records from previous days, but I’m not getting any output -

No. Instead of layoff_date in the first input field, use formatDate(layoff_date;YYYY-MM-DD), and in the second input (the condition value) use just the expression based on now.

Screenshot from 2023-05-15 22-08-30

Hi @Runcorn

This worked - thanks again!

I guess one final thing: if I wanted to run this more frequently during the day, say every two hours, is there any way for the scenario to remember what it has already sent? I know there’s no time in the date field, but I’m wondering if there’s a workaround.