Extract text, tags & elements from HTML into nested format

This is somewhat of a follow-up to a previous post.

I had to change my get query to pull in additional information, and the output from that query is now in HTML format:

<h1 class="heading-1" align="">
  <div class="heading-anchor" id="august-2024-news-and-updates"></div>
  <div class="heading-text">
    <div id="section-august-2024-news-and-updates" class="heading-anchor_backwards"></div>
      August 2024 News And Updates
  <a aria-label="link to august-2024-news-and-updates" class="heading-anchor-icon" href="#august-2024-news-and-updates"></a>

<h2 class="heading-2" align="">
  <div class="heading-anchor" id="new-business-sector-added"></div>
  <div class="heading-text">
    <div id="section-new-business-sector-added" class="heading-anchor_backwards"></div>
      New Business Sectors Added!
  <a aria-label="link to the-business-sector-added" class="heading-anchor" href="#new-business-sector-added"></a>
<p>Check out the new business sectors added</p>

<h2 class="heading-2" align="">
  <div class="heading-anchor" id="this-week-sports-news"></div>
  <div class="heading-text"><div id="section-sports-news" class="heading-anchor_backwards"></div>
    This Week's Sports News
  <a aria-label="link to sports-news" class="heading-anchor" href="#this-week-sports-news"></a>

<h3 class="heading-3" align="">
  <div class="heading-anchor" id="basketball-scores-matches"></div>
  <div class="heading-text">
    <div id="section-basketball" class="heading-anchor_backwards"></div>
      Basketball Matches & Scores
  <a aria-label="link to basketball" class="heading-anchor-icon" href="#basketball-scores-matches"></a>

<p>In this weeks basketball matches...</p>
<p>In this weeks basketball scores...</p>

<h3 class="heading-3" align="">
  <div class="heading-anchor" id="ping-pong-scores-matches"></div>
  <div class="heading-text">
    <div id="section-ping-pong" class="heading-anchor_backwards"></div>
      Ping Pong Matches & Scores
  <a aria-label="link to ping-pong" class="heading-anchor-icon" href="#ping-pong-scores-matches"></a>

<p>No Ping Pong matches this week</p>

<h2 class="heading-2" align="">
  <div class="heading-anchor" id="this-week-entertainment-news"></div>
  <div class="heading-text"><div id="section-entertainment-news" class="heading-anchor_backwards"></div>
    This Week's Entertainment News
  <a aria-label="link to entertainment-news" class="heading-anchor" href="#this-week-entertainment-news"></a>

<h1 class="heading heading-1 header-scroll" align="">
  <div class="heading-anchor anchor waypoint" id="july-2024-news-and-updates"></div>
  <div class="heading-text">
    <div id="section-july-2024-news-and-updates" class="heading-anchor_backwards"></div>
      July 2024 News & Updates
  <a aria-label="link to july-2024-news-and-updates" class="heading-anchor-icon" href="#july-2024-news-and-updates"></a>

<h2 class="heading-2" align="">
  <div class="heading-anchor" id="this-week-business-news"></div>
  <div class="heading-text"><div id="section-business-news" class="heading-anchor_backwards"></div>
    This Week's Business News
  <a aria-label="link to business-news" class="heading-anchor" href="#this-week-business-news"></a>

My desired output is a nested format:

h1: August 2024 News And Updates
anchor_id: august-2024-news-and-updates
  h2: New Business Sectors Added!
  anchor_id: new-business-sector-added
  h2: This Week's Sports News
  anchor_id: this-week-sports-news
    h3: Basketball Matches & Scores
    anchor_id: basketball-scores-matches
    h3: Ping Pong Matches & Scores
    anchor_id: ping-pong-scores-matches
  h2: This Week's Entertainment News
  anchor_id: this-week-entertainment-news

Some additional requirements:

  1. Only return data from this month (everything up to the second h1 tag)
  2. Only return the heading level (h1, h2 and h3) and the text for those heading tags; I don't care about the other HTML or CSS
  3. Extract the anchor ID for each of those sections

Planned approach:

  1. Text Parser - Match Pattern: Return the data before 2nd h1 tag (currently working on this)
  2. Extract the related anchor ID (I believe I should be able to extract it using regex and set it as a variable). I noticed I could potentially pull this from two different places, if one is easier to extract from than the other.
  3. Aggregate into nested format
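
To make steps 1 and 2 concrete, here is a rough sketch of the two regex patterns, written in Python only so they can be tested outside the scenario. The names `current_month`, `extract_headings` and `HEADING` are made up for illustration, and the sample is a shortened copy of the HTML above. It assumes every heading contains a `heading-anchor_backwards` div followed by an `<a href="#...">`, as in the sample, and takes the anchor ID from the `href` because that attribute always sits in the same position. The patterns need the "dot matches newline" (single line) option; whether they port unchanged to a Text Parser module is untested.

```python
import re

# Shortened sample in the same shape as the HTML in the post.
SAMPLE = """
<h1 class="heading-1" align="">
  <div class="heading-anchor" id="august-2024-news-and-updates"></div>
  <div class="heading-text">
    <div id="section-august-2024-news-and-updates" class="heading-anchor_backwards"></div>
      August 2024 News And Updates
  <a aria-label="link to august-2024-news-and-updates" class="heading-anchor-icon" href="#august-2024-news-and-updates"></a>

<h2 class="heading-2" align="">
  <div class="heading-anchor" id="this-week-sports-news"></div>
  <div class="heading-text"><div id="section-sports-news" class="heading-anchor_backwards"></div>
    This Week's Sports News
  <a aria-label="link to sports-news" class="heading-anchor" href="#this-week-sports-news"></a>

<h3 class="heading-3" align="">
  <div class="heading-anchor" id="basketball-scores-matches"></div>
  <div class="heading-text">
    <div id="section-basketball" class="heading-anchor_backwards"></div>
      Basketball Matches & Scores
  <a aria-label="link to basketball" class="heading-anchor-icon" href="#basketball-scores-matches"></a>

<p>In this weeks basketball matches...</p>

<h1 class="heading heading-1 header-scroll" align="">
  <div class="heading-anchor anchor waypoint" id="july-2024-news-and-updates"></div>
  <div class="heading-text">
    <div id="section-july-2024-news-and-updates" class="heading-anchor_backwards"></div>
      July 2024 News & Updates
  <a aria-label="link to july-2024-news-and-updates" class="heading-anchor-icon" href="#july-2024-news-and-updates"></a>
"""

# Group 1: heading level, group 2: heading text, group 3: anchor ID from the href.
HEADING = re.compile(
    r'<h([1-3])\b[^>]*>'                     # opening h1/h2/h3 tag
    r'.*?heading-anchor_backwards"></div>'   # skip to the end of the inner div
    r'\s*(.*?)\s*'                           # heading text, whitespace trimmed
    r'<a\b[^>]*href="#([^"]+)"',             # anchor ID without the leading '#'
    re.DOTALL,
)


def current_month(html):
    """Step 1: keep everything before the second <h1> tag."""
    starts = [m.start() for m in re.finditer(r'<h1\b', html)]
    return html[:starts[1]] if len(starts) > 1 else html


def extract_headings(html):
    """Step 2: return a flat list of heading level, text and anchor ID."""
    return [
        {"level": int(level), "text": text, "anchor_id": anchor}
        for level, text, anchor in HEADING.findall(html)
    ]


headings = extract_headings(current_month(SAMPLE))
for h in headings:
    print(f"h{h['level']}: {h['text']} ({h['anchor_id']})")
```

If a heading ever lacks the inner div or the link, the lazy `.*?` would run on into the next heading, so the pattern is only as reliable as the markup is consistent.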

Need help with:

  • Am I correct in thinking that I can extract the anchor tag value in the first Text Parser module using regex?
  • What is the best approach/module to build the nested format? Aggregate to JSON? Create JSON? Or Text Aggregator with groupings?
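
Whichever module ends up doing the aggregation, the nesting itself boils down to a stack walk over a flat, ordered list of headings. The sketch below shows that logic in Python purely as an illustration; the `nest` function and the `tag` / `text` / `anchor_id` / `children` field names are assumptions, not anything a module produces. The flat list is typed in by hand from the desired output above.

```python
import json

# Flat, ordered list of headings, copied from the desired output in the post.
FLAT = [
    {"level": 1, "text": "August 2024 News And Updates", "anchor_id": "august-2024-news-and-updates"},
    {"level": 2, "text": "New Business Sectors Added!", "anchor_id": "new-business-sector-added"},
    {"level": 2, "text": "This Week's Sports News", "anchor_id": "this-week-sports-news"},
    {"level": 3, "text": "Basketball Matches & Scores", "anchor_id": "basketball-scores-matches"},
    {"level": 3, "text": "Ping Pong Matches & Scores", "anchor_id": "ping-pong-scores-matches"},
    {"level": 2, "text": "This Week's Entertainment News", "anchor_id": "this-week-entertainment-news"},
]


def nest(headings):
    """Turn a flat heading list into a tree using a stack of open parents."""
    root = {"level": 0, "children": []}
    stack = [root]  # the last entry is the closest open parent
    for h in headings:
        node = {
            "tag": f"h{h['level']}",
            "text": h["text"],
            "anchor_id": h["anchor_id"],
            "children": [],
        }
        # Close every open heading at the same or a deeper level.
        while stack[-1]["level"] >= h["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append({"level": h["level"], "children": node["children"]})
    return root["children"]


tree = nest(FLAT)
print(json.dumps(tree, indent=2))
```

Each h2 ends up in the `children` of the most recent h1, and each h3 in the `children` of the most recent h2, which matches the indentation in the desired output.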