Web Scraping

Web scraping is a technique in which a program extracts data from the human-readable output of websites.

Download full website

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org

Website Copier | Download Sites | Website Ripper - Tools Bug

lxml.etree

XPath - language for XML queries
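A minimal sketch of querying parsed HTML with XPath through lxml.etree. The sample markup is made up for illustration (the class name is borrowed from the beautifulsoup example in these notes), and the block assumes the third-party lxml package is installed.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
HTML = """
<html><body>
  <div class="stylelistrow"><a href="/a">First</a></div>
  <div class="stylelistrow"><a href="/b">Second</a></div>
  <div class="other"><a href="/c">Third</a></div>
</body></html>
"""

# etree.HTML parses an HTML string and returns the root element.
root = etree.HTML(HTML)

# Select every <a> that is a direct child of a <div> whose class
# attribute is exactly "stylelistrow".
links = root.xpath('//div[@class="stylelistrow"]/a')
titles = [a.text for a in links]

# An XPath expression ending in /@href returns the attribute values as strings.
hrefs = root.xpath('//div[@class="stylelistrow"]/a/@href')

print(titles)
print(hrefs)
```

The same expressions should work on a downloaded page by passing the response body to etree.HTML instead of the inline string.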

beautifulsoup

Super short learning curve
Two-function API:
- parse
- search (find_all)

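import requests
from bs4 import BeautifulSoup

# Assumption: the page is fetched with requests; the URL is a placeholder
response = requests.get("http://example.org")
soup = BeautifulSoup(response.text, 'html.parser')  # parse

# search: every <div> with class "stylelistrow"
mydivs = soup.find_all("div", {"class": "stylelistrow"})
for i, div in enumerate(mydivs):
    print(i, div)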

Selenium (for javascript)
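A rough sketch of why Selenium helps with JavaScript-heavy pages: it drives a real browser, so the HTML can be read after scripts have run. The function name, its parameters and the choice of headless Chrome are assumptions made for illustration. It needs the selenium package (version 4 or later) and a local Chrome installation, and the `--headless=new` flag assumes a reasonably recent Chrome.

```python
def fetch_rendered_html(url: str, wait_css: str, timeout: float = 10.0) -> str:
    """Return the page HTML once an element matching `wait_css` is present."""
    # Imported lazily so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element appears in the DOM or `timeout` seconds pass.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

For example, fetch_rendered_html("http://example.org", "h1") would return the rendered page source, which can then be passed to BeautifulSoup or lxml for parsing.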

Headless browser

Headless Chromium
Zombie
slimerjs
puppeteer

Proxies

Best Web Scraping Toolkit - ZenRows
Bright Data - All in One Platform for Proxies and Web Scraping

AI Tools

Scrape and Monitor Data from Any Website with No Code
GitHub - laramies/theHarvester: E-mails, subdomains and names Harvester - OSINT
Overview - Reducto API
GitHub - mendableai/firecrawl: 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API. (37K Stars)
GitHub - ScrapeGraphAI/Scrapegraph-ai: Python scraper based on AI (19.3K Stars)
GitHub - scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python. (55K Stars)
Beautiful Soup Documentation — Beautiful Soup 4.13.0 documentation
GitHub - unclecode/crawl4ai: 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN (40.8K Stars)
GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown. (61.4K Stars)
GitHub - supermemoryai/markdowner: A fast tool to convert any website into LLM-ready markdown data. Built by https://supermemory.ai (1.5K Stars)
GitHub - mendableai/firegeo: 🔥 GEO-powered SaaS starter built with Firecrawl for brand monitoring, auth, and billing
