This is somewhat of a follow-up to a previous post, but I had to change my GET query to return additional information.
The output from my query is currently in HTML format:
<h1 class="heading-1" align="">
<div class="heading-anchor" id="august-2024-news-and-updates"></div>
<div class="heading-text">
<div id="section-august-2024-news-and-updates" class="heading-anchor_backwards"></div>
August 2024 News And Updates
</div>
<a aria-label="link to august-2024-news-and-updates" class="heading-anchor-icon" href="#august-2024-news-and-updates"></a>
</h1>
<h2 class="heading-2" align="">
<div class="heading-anchor" id="new-business-sector-added"></div>
<div class="heading-text">
<div id="section-new-business-sector-added" class="heading-anchor_backwards"></div>
New Business Sectors Added!
</div>
<a aria-label="link to the-business-sector-added" class="heading-anchor" href="#new-business-sector-added"></a>
</h2>
<p>Check out the new business sectors added</p>
<h2 class="heading-2" align="">
<div class="heading-anchor" id="this-week-sports-news"></div>
<div class="heading-text"><div id="section-sports-news" class="heading-anchor_backwards"></div>
This Week's Sports News
</div>
<a aria-label="link to sports-news" class="heading-anchor" href="#this-week-sports-news"></a>
</h2>
<h3 class="heading-3" align="">
<div class="heading-anchor" id="basketball-scores-matches"></div>
<div class="heading-text">
<div id="section-basketball" class="heading-anchor_backwards"></div>
Basketball Matches & Scores
</div>
<a aria-label="link to basketball" class="heading-anchor-icon" href="#basketball-scores-matches"></a>
</h3>
<p>In this weeks basketball matches...</p>
<p>In this weeks basketball scores...</p>
<h3 class="heading-3" align="">
<div class="heading-anchor" id="ping-pong-scores-matches"></div>
<div class="heading-text">
<div id="section-ping-pong" class="heading-anchor_backwards"></div>
Ping Pong Matches & Scores
</div>
<a aria-label="link to ping-pong" class="heading-anchor-icon" href="#ping-pong-scores-matches"></a>
</h3>
<p>No Ping Pong matches this week</p>
<h2 class="heading-2" align="">
<div class="heading-anchor" id="this-week-entertainment-news"></div>
<div class="heading-text"><div id="section-entertainment-news" class="heading-anchor_backwards"></div>
This Week's Entertainment News
</div>
<a aria-label="link to entertainment-news" class="heading-anchor" href="#this-week-entertainment-news"></a>
</h2>
<h1 class="heading heading-1 header-scroll" align="">
<div class="heading-anchor anchor waypoint" id="july-2024-news-and-updates"></div>
<div class="heading-text">
<div id="section-july-2024-news-and-updates" class="heading-anchor_backwards"></div>
July 2024 News & Updates
</div>
<a aria-label="link to july-2024-news-and-updates" class="heading-anchor-icon" href="#july-2024-news-and-updates"></a>
</h1>
<h2 class="heading-2" align="">
<div class="heading-anchor" id="this-week-business-news"></div>
<div class="heading-text"><div id="section-business-news" class="heading-anchor_backwards"></div>
This Week's Business News
</div>
<a aria-label="link to business-news" class="heading-anchor" href="#this-week-business-news"></a>
</h2>
My desired output is in a nested format:
h1: August 2024 News And Updates
anchor_id: august-2024-news-and-updates
    h2: New Business Sectors Added!
    anchor_id: new-business-sector-added
    h2: This Week's Sports News
    anchor_id: this-week-sports-news
        h3: Basketball Matches & Scores
        anchor_id: basketball-scores-matches
        h3: Ping Pong Matches & Scores
        anchor_id: ping-pong-scores-matches
    h2: This Week's Entertainment News
    anchor_id: this-week-entertainment-news
Some additional requirements:
- I only want to return data for the current month (everything before the second h1 tag).
- I only want the heading level (h1, h2, or h3) and the text of each heading; I don’t care about the rest of the HTML or CSS.
- I want to extract the anchor ID that belongs to each heading.
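To make these requirements concrete, here is a minimal sketch in Python (outside of any automation platform) using only the standard-library `html.parser`. The class name `HeadingCollector`, the helper `extract_headings`, and the shortened `SAMPLE` string are hypothetical and exist only for illustration. It assumes the anchor ID lives on the `div` whose class list contains `heading-anchor`, as in the HTML above.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect level, text and anchor ID for h1-h3, stopping at the second h1."""

    def __init__(self):
        super().__init__()
        self.headings = []    # finished headings, in document order
        self.current = None   # heading currently being read, or None
        self.h1_seen = 0
        self.done = False     # set once the second h1 is reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        if tag in ("h1", "h2", "h3"):
            if tag == "h1":
                self.h1_seen += 1
                if self.h1_seen == 2:
                    self.done = True  # ignore everything from here on
                    return
            self.current = {"level": tag, "text": "", "anchor_id": None}
        elif self.current is not None and tag == "div":
            # Exact class match, so "heading-anchor_backwards" is skipped.
            if "heading-anchor" in (attrs.get("class") or "").split():
                self.current["anchor_id"] = attrs.get("id")

    def handle_data(self, data):
        if self.current is not None and not self.done:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if self.current is not None and tag == self.current["level"]:
            # Collapse the newlines and spaces around the heading text.
            self.current["text"] = " ".join(self.current["text"].split())
            self.headings.append(self.current)
            self.current = None


def extract_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Shortened, hypothetical sample in the same shape as the HTML above.
SAMPLE = """
<h1 class="heading-1" align="">
<div class="heading-anchor" id="august-2024-news-and-updates"></div>
<div class="heading-text">
<div id="section-august-2024-news-and-updates" class="heading-anchor_backwards"></div>
August 2024 News And Updates
</div>
</h1>
<h2 class="heading-2" align="">
<div class="heading-anchor" id="this-week-sports-news"></div>
<div class="heading-text"><div id="section-sports-news" class="heading-anchor_backwards"></div>
This Week's Sports News
</div>
</h2>
<p>ignored paragraph</p>
<h1 class="heading heading-1 header-scroll" align="">
<div class="heading-anchor anchor waypoint" id="july-2024-news-and-updates"></div>
<div class="heading-text">July 2024 News & Updates</div>
</h1>
"""

headings = extract_headings(SAMPLE)
```

Each entry in `headings` is a flat record (`level`, `text`, `anchor_id`); nesting can be done as a separate step once the flat list is correct.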
Planned approach:
- Text Parser - Match Pattern: return the data before the second h1 tag (currently working on this).
- Extract the related anchor ID (I believe I should be able to extract it using regex and set it as a variable). I noticed I could potentially pull it from two different places, in case one is easier to extract than the other.
- Aggregate into the nested format.
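The regex side of this plan can be sketched in Python as below. The pattern names (`HEADING`, `DIV_ID`, `HREF_ID`) and helper functions are hypothetical, and the regex flavor of a Text Parser module may differ from Python's `re`, so treat the patterns as a starting point rather than something to paste in unchanged. The sketch tries both places the anchor ID appears: the `id` on the `heading-anchor` div first, then the `href` on the `<a>` tag as a fallback.

```python
import re


def before_second_h1(html):
    """Return everything before the second <h1 ...> tag (or all of it)."""
    starts = [m.start() for m in re.finditer(r"<h1\b", html, flags=re.IGNORECASE)]
    return html[:starts[1]] if len(starts) > 1 else html


# One match per h1/h2/h3 element: group 1 is the tag name, group 2 the body.
HEADING = re.compile(r"<(h[1-3])\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

# Place 1: id on the div whose class starts with the exact token "heading-anchor"
# (this deliberately does not match "heading-anchor_backwards").
DIV_ID = re.compile(r'class="heading-anchor(?:\s[^"]*)?"\s+id="([^"]+)"')

# Place 2: the fragment in the <a href="#..."> link.
HREF_ID = re.compile(r'href="#([^"]+)"')


def headings_via_regex(html):
    rows = []
    for m in HEADING.finditer(before_second_h1(html)):
        level, body = m.group(1).lower(), m.group(2)
        anchor = DIV_ID.search(body) or HREF_ID.search(body)
        # Drop the inner tags, then collapse whitespace to get the heading text.
        text = " ".join(re.sub(r"<[^>]+>", " ", body).split())
        rows.append((level, text, anchor.group(1) if anchor else None))
    return rows


# Shortened, hypothetical sample: the h3 has no heading-anchor div,
# so its ID comes from the href fallback.
SAMPLE = (
    '<h1 class="heading-1" align="">'
    '<div class="heading-anchor" id="august-2024-news-and-updates"></div>'
    '<div class="heading-text">August 2024 News And Updates</div></h1>\n'
    '<h3 class="heading-3" align="">\n'
    '<div class="heading-text">\n'
    '<div id="section-ping-pong" class="heading-anchor_backwards"></div>\n'
    'Ping Pong Matches & Scores\n</div>\n'
    '<a aria-label="link to ping-pong" class="heading-anchor-icon" '
    'href="#ping-pong-scores-matches"></a>\n'
    '</h3>\n'
    '<h1 class="heading heading-1 header-scroll" align="">July 2024 News & Updates</h1>'
)

rows = headings_via_regex(SAMPLE)
```

Regex works here because the generated markup is very regular; if the HTML structure varies, a real HTML parser is the safer choice.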
I need help with:
- Am I correct in thinking that I can extract the anchor ID value in the first Text Parser module using regex?
- What is the best approach/module to build the nested format: Aggregate to JSON, Create JSON, or Text Aggregator with groupings?
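Whichever module ends up doing the aggregation, the grouping logic itself is small. The sketch below is plain Python, not a specific module's behavior: it turns a flat list of `(level, text, anchor_id)` rows into nested JSON with a stack. The `rows` data is copied from the desired output above; the function name `nest` and the `children` key are assumptions made for illustration.

```python
import json

LEVELS = {"h1": 1, "h2": 2, "h3": 3}


def nest(rows):
    """Nest flat heading rows so each heading holds its deeper-level children."""
    root = {"children": []}
    stack = [(0, root)]  # (depth, node); the root sits at depth 0
    for level, text, anchor_id in rows:
        depth = LEVELS[level]
        node = {"tag": level, "text": text, "anchor_id": anchor_id, "children": []}
        # Pop until the top of the stack is a shallower heading: that is the parent.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root["children"]


rows = [
    ("h1", "August 2024 News And Updates", "august-2024-news-and-updates"),
    ("h2", "New Business Sectors Added!", "new-business-sector-added"),
    ("h2", "This Week's Sports News", "this-week-sports-news"),
    ("h3", "Basketball Matches & Scores", "basketball-scores-matches"),
    ("h3", "Ping Pong Matches & Scores", "ping-pong-scores-matches"),
    ("h2", "This Week's Entertainment News", "this-week-entertainment-news"),
]

tree = nest(rows)
print(json.dumps(tree, indent=2))
```

The key design choice is to keep extraction flat and do the nesting in one pass afterwards, so the parent of each heading is simply the most recent heading with a smaller level number.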