How to Automate Web Data Extraction Using Dumpling AI and Make.com
In this tutorial, I’ll walk you through setting up an automation in Make.com that extracts clean data from web pages using Dumpling AI’s Scrape URL module. This automation is perfect for users looking to efficiently collect structured content for tasks like content creation, research, or reporting. By leveraging Dumpling AI, you’ll get clear, text-only outputs that are easy to work with and integrate into your workflows. Follow this step-by-step guide to set up the automation seamlessly.
Step 1: Setting Up a Google Sheets Trigger
In this step, we’ll use Google Sheets as the data source that lists the URLs you want to scrape.
- Create a New Scenario in Make
  Log in to Make and click Create a New Scenario.
- Add the Google Sheets Module (a code sketch of what this trigger does follows the list)
  - Search for Google Sheets: Watch Rows and select it.
  - Connection: If you haven’t connected your Google account before, click Add to authenticate.
  - Spreadsheet: Choose the spreadsheet that contains your list of URLs.
  - Sheet Name: Specify the sheet (e.g., “Sheet1”).
  - Limit: Set to 1 so each run processes a single URL and the automation doesn’t get overloaded.
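Under the hood, Watch Rows simply polls the sheet for rows it hasn’t seen yet. For readers who prefer code, here is a minimal sketch of the same idea in Python using the gspread library; the credential file name, spreadsheet title, and “URL” column header are assumptions for illustration.

```python
import gspread

# Authenticate with a Google service account (assumed credential file).
gc = gspread.service_account(filename="service-account.json")
ws = gc.open("My URL List").worksheet("Sheet1")  # assumed spreadsheet title

rows = ws.get_all_records()  # list of dicts keyed by the header row
if rows:
    first = rows[0]          # Limit = 1: process a single row per run
    print(first["URL"])      # assumes a column headed "URL"
```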
Step 2: Extracting Clean Content Using Dumpling AI’s Scrape Module
The key to this automation is Dumpling AI’s Scrape URL module, which extracts clean content without the clutter of HTML tags.
- Add the Dumpling AI Scrape Module
  After setting up the Google Sheets trigger, search for Dumpling AI: Scrape URL and add it to your scenario. This module eliminates the need to manually filter out HTML tags, making the data easier to work with.
- Configure the Scrape Module (a sketch of the underlying request follows this list)
  - Connection: Select your existing Dumpling AI connection. If you don’t have one, click Add Connection and enter your API key.
  - URL: Map the URL from your Google Sheets module ({{1.URL}}).
  - Format: Choose Markdown for cleaner output.
  - Clean Data: Set to true to strip away all HTML elements.
  - Render JavaScript: Set to true if the webpage relies on JavaScript to load content dynamically.
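To make the configuration concrete, here is a rough sketch of the HTTP request the module effectively makes. The endpoint path, JSON field names, and response shape are assumptions modeled on the module’s options; check Dumpling AI’s API documentation for the exact contract.

```python
import os
import requests

DUMPLING_API_KEY = os.environ["DUMPLING_API_KEY"]

def scrape_url(url: str) -> str:
    resp = requests.post(
        "https://app.dumplingai.com/api/v1/scrape",  # assumed endpoint
        headers={"Authorization": f"Bearer {DUMPLING_API_KEY}"},
        json={
            "url": url,
            "format": "markdown",  # mirrors the Format option
            "cleaned": True,       # mirrors Clean Data (assumed field name)
            "renderJs": True,      # mirrors Render JavaScript (assumed field name)
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["content"]  # assumed response shape
```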
[Screenshot: example of content scraped by Dumpling AI’s Scrape URL module]
Step 3: Transforming the Scraped Content into Blog Posts Using OpenAI
Once the content is scraped, the next step is to use OpenAI’s capabilities to transform the raw data into a cohesive blog post.
- Add the OpenAI Module
  Search for and add OpenAI: Create Completion to your scenario.
- Configure the OpenAI Module (an equivalent API call is sketched after this list)
  - Model: Choose gpt-4o-mini.
  - Role Setup:
    - System Role: Use the prompt “Generate a blog post using the provided content.”
    - User Input: Use the scraped content from Dumpling AI ({{2.content}}).
  - Max Tokens: Set to 2048 to allow a comprehensive output.
  - Temperature: Leave at the default of 1 for balanced creativity.

OpenAI’s text generation turns the extracted content into a blog post in seconds, saving hours of manual writing.
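For reference, a minimal equivalent of this step with the official openai Python package looks like the sketch below (it assumes OPENAI_API_KEY is set in your environment). The prompts mirror the module configuration above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_blog_post(scraped_markdown: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=2048,  # mirrors the Max Tokens setting
        temperature=1,    # default, balanced creativity
        messages=[
            {"role": "system", "content": "Generate a blog post using the provided content."},
            {"role": "user", "content": scraped_markdown},
        ],
    )
    return response.choices[0].message.content
```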
Step 4: Creating Visual Assets with Dumpling AI Image Generation
To enhance the blog post, we will generate an AI-powered image related to the content.
- Add the Dumpling AI Image Generation Module
  Search for the Dumpling AI: Generate AI Image (Recraft v3) module and add it to your scenario.
- Configure the Module (a sketch of the underlying request follows this list)
  - Prompt: Use something like “Generate an image representing the main theme of: {{6.result}}”.
  - Size: Set to 1024x1024 for optimal resolution.
  - Style: Select Digital Illustration to match your blog’s aesthetic.
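As with the scrape step, here is a sketch of what the image-generation request might look like. The endpoint, model identifier, style value, and response shape are all assumptions based on the module’s options; consult Dumpling AI’s API reference for the real contract.

```python
import os
import requests

DUMPLING_API_KEY = os.environ["DUMPLING_API_KEY"]

def generate_image(blog_post: str) -> str:
    resp = requests.post(
        "https://app.dumplingai.com/api/v1/generate-ai-image",  # assumed endpoint
        headers={"Authorization": f"Bearer {DUMPLING_API_KEY}"},
        json={
            "model": "recraft-v3",  # assumed model identifier
            "prompt": f"Generate an image representing the main theme of: {blog_post[:500]}",
            "size": "1024x1024",
            "style": "digital_illustration",  # assumed value for Digital Illustration
        },
        timeout=120,
    )
    resp.raise_for_status()
    # Assumed response shape, matching the {{7.images.url}} mapping used later.
    return resp.json()["images"][0]["url"]
```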
Step 5: Generating Blog Tags Using OpenAI
Tags help categorize your blog content, improving SEO and discoverability.
- Add Another OpenAI Module
  - Add a second OpenAI: Create Completion module.
  - Prompt: Use “Generate SEO-friendly tags for the content: {{6.result}}”.
- Configure Tags Generation (a code sketch follows this list)
  - Max Tokens: Set to a small value (e.g., 64) to keep the output concise.
  - Temperature: Keep it below the default (0.7) to focus on relevant keywords.
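A sketch of the equivalent tag-generation call, mirroring the Step 3 example; the 64-token cap is an illustrative choice, not a value from the module docs.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_tags(blog_post: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=64,    # small cap keeps the tag list concise (illustrative value)
        temperature=0.7,  # lower temperature favors relevant keywords
        messages=[
            {"role": "user", "content": f"Generate SEO-friendly tags for the content: {blog_post}"},
        ],
    )
    return response.choices[0].message.content
```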
Step 6: Updating Google Sheets with All Outputs
Now that the content, tags, and image have been generated, we’ll store everything back in Google Sheets.
- Add the Google Sheets: Update Row Module (a gspread equivalent is sketched after this list)
  - Connection: Use the same Google Sheets account.
  - Spreadsheet ID: Select your existing spreadsheet.
  - Row Number: Map it to the row number from the trigger ({{1.ROW_NUMBER}}).
  - Fields to Update:
    - Content: {{6.result}} for the blog post.
    - Tags: {{10.result}} for the tags.
    - Image URL: {{7.images.url}} for the generated image.
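For completeness, here is a sketch of the equivalent write-back with gspread, reusing the connection style from Step 1. The sheet layout (URL in column A, then Content, Tags, and Image URL in columns B, C, and D) is an assumption; adjust the column indices to match your own spreadsheet.

```python
import gspread

gc = gspread.service_account(filename="service-account.json")  # assumed credential file
ws = gc.open("My URL List").worksheet("Sheet1")                # assumed spreadsheet title

def update_row(row_number: int, content: str, tags: str, image_url: str) -> None:
    # Columns B, C, and D hold Content, Tags, and Image URL in the assumed layout.
    ws.update_cell(row_number, 2, content)
    ws.update_cell(row_number, 3, tags)
    ws.update_cell(row_number, 4, image_url)
```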
Step 7: Testing and Activating the Automation
- Run a Test Scenario
  - Add a sample URL to your Google Sheet.
  - Run the scenario and confirm that:
    - The webpage content is scraped and cleaned.
    - OpenAI generates a blog post and relevant tags.
    - An image is created, and all outputs are saved back to Google Sheets.
- Activate the Scenario
  Once testing is complete, activate the scenario. The automation will then run whenever a new URL is added to your Google Sheet.
Get the Blueprint Featured in This Guide here!