I would like to scrape an entire website with Python

Hi!
I would like to create a scenario that will scrape an entire site and return all the scraped URLs to me.
I thought of using 0codeKit (1SaaS) with this code:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

visited_links = set()  # Set to store previously visited links

def get_internal_links_from_url(url, domain):
    global visited_links

    if url in visited_links:  # If the link has already been visited, skip it
        return []

    visited_links.add(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all links and filter those that are internal to the domain
    all_links = [a['href'] for a in soup.find_all('a', href=True)]
    internal_links = [urljoin(url, link) for link in all_links
                      if urlparse(link).netloc == domain or not urlparse(link).netloc]

    # Recursively crawl each internal link found
    for link in internal_links:
        if link not in visited_links:
            get_internal_links_from_url(link, domain)

    return list(visited_links)

def crawl_site(url):
    domain = urlparse(url).netloc  # Get the domain of the site
    return get_internal_links_from_url(url, domain)

result = {'data': crawl_site('https://www.***.fr')}
```
But when I run it, I get an error.

Do you know the right way to do this?

Thx!

I've used the ScrapeNinja module for all my web scraping; 0codeKit is better suited to running short snippets of code.
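If you still want the pure-Python route in a short-snippet runner, here is a minimal sketch of a bounded, iterative crawler. This is only an illustration, not the module's API: it assumes `requests` and `beautifulsoup4` are available in the environment, and the start URL, page limit, and timeout are placeholder values.

```python
# Minimal sketch of a bounded internal-link crawler.
# Assumptions: requests and beautifulsoup4 are installed;
# max_pages, timeout, and the start URL are illustrative placeholders.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_site(start_url, max_pages=50, timeout=10):
    domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
        except requests.RequestException:
            continue  # Skip pages that fail to load

        soup = BeautifulSoup(response.text, 'html.parser')
        for a in soup.find_all('a', href=True):
            link = urljoin(url, a['href'])
            # Keep only http(s) links on the same domain
            parsed = urlparse(link)
            if parsed.scheme in ('http', 'https') and parsed.netloc == domain:
                queue.append(link)

    return sorted(visited)

result = {'data': crawl_site('https://example.com')}  # placeholder URL
```

Using an explicit queue with a page cap avoids the unbounded recursion in the original snippet, which can hit Python's recursion limit on larger sites, and the timeout plus error handling keeps one dead page from aborting the whole run.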
