Hey guys, I need help while trying to scrape data from a listing website.
The scenario runs without errors, but I’m unable to scrape all the listings on the site.
It only scrapes 3 listing results correctly. After that, the remaining output bundles contain HTML with a “Javascript disabled” message, which prevents me from scraping the data on subsequent pages.
I tried simulating normal user interaction by generating random values for the sleep modules placed just before the HTTP request modules, but it doesn’t seem to work.
Can anyone suggest a solution?
Attached below is a copy of my scenario for anyone looking to help.
I thought I could do it using just the HTTP request module, but after going through some of the articles on here, I realized that the HTTP request module cannot run client-side JavaScript.
So you basically need to “visit” the site yourself to get the content. This is called Web Scraping.
Web Scraping
For web scraping, you can use a service like ScrapeNinja to get content from the page.
ScrapeNinja lets you use jQuery-like selectors to extract content from elements via an extractor function. It can also run the page in a real web browser, loading all the content and running the page-load scripts, so the result closely matches what you see, as opposed to the raw page HTML fetched by the HTTP module.
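To give a rough idea of what selector-based extraction means, here is a minimal Python sketch that pulls the text out of every element with a given class. This is not ScrapeNinja's extractor or its API. It only uses Python's built-in `html.parser`, and the sample HTML, the class names (`listing`, `title`, `price`) and the helper `extract_text` are all made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.results = []
        self._depth = 0    # nesting depth inside a matching element
        self._buffer = []  # text pieces of the element being collected

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif self.class_name in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_text(html, class_name):
    """Hypothetical helper: roughly what a `.class_name` selector would return."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up listing markup, standing in for the fully rendered page.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="title">Flat A</span> <span class="price">$900</span></li>
  <li class="listing"><span class="title">Flat B</span> <span class="price">$1,200</span></li>
</ul>
"""

titles = extract_text(SAMPLE_HTML, "title")
prices = extract_text(SAMPLE_HTML, "price")
```

Keep in mind this only works on HTML that already contains the listings. If the site builds them with JavaScript, the page still has to be rendered in a real browser first.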
You can also use AI-powered web scraping tools like Dumpling AI.
This is probably the easiest and quickest option to set up, because all you need to do is describe the content that you want, instead of inspecting elements to create selectors or coming up with regular expression patterns.
The upside is that such services combine BOTH fetching and extracting the data in a single module (saving operations) and do away with the lengthy setup of the other methods.
I’m not sure if it will be the same for you, but when I had this issue, it was because I wasn’t passing Header information in the HTTP Get Response module.
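In case it helps, this is roughly what I mean by passing header information, written as a small Python sketch rather than a Make module. The header values and the example URL are assumptions, so copy the real ones from your own browser's network tab. In Make, the same name/value pairs would go into the Headers section of the HTTP module.

```python
import urllib.request

# Browser-like headers. These exact values are only an example.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url):
    """Build a GET request that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url, timeout=30):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Hypothetical listing URL, used here only to show the request being built.
request = build_request("https://example.com/listings?page=2")
```

Headers alone may not be enough if the site really needs JavaScript to render the listings, but it is a cheap thing to try first.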