Web Scraping
Web scraping is a technique in which a program extracts data from the human-readable output of websites.
Download full website
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Website Copier | Download Sites | Website Ripper - Tools Bug
lxml.etree
XPath - a query language for XML
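As a minimal sketch of how lxml.etree and XPath fit together, the snippet below parses a small inline XML document and runs two queries against it. The catalog data and the helper names `titles_cheaper_than` and `book_ids` are hypothetical, made up for illustration; it assumes the third-party `lxml` package is installed.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
XML = b"""<catalog>
  <book id="b1"><title>Dune</title><price>9.99</price></book>
  <book id="b2"><title>Emma</title><price>4.50</price></book>
</catalog>"""


def titles_cheaper_than(xml_bytes, limit):
    """Return the title text of every <book> whose <price> is below `limit`."""
    root = etree.fromstring(xml_bytes)
    # `$limit` is an XPath variable; lxml binds it from the keyword argument.
    return root.xpath("//book[price < $limit]/title/text()", limit=limit)


def book_ids(xml_bytes):
    """Return the `id` attribute of every <book>, in document order."""
    root = etree.fromstring(xml_bytes)
    return root.xpath("//book/@id")


if __name__ == "__main__":
    print(titles_cheaper_than(XML, 5.0))
    print(book_ids(XML))
```

For real pages, `lxml.html.fromstring` accepts HTML that is not well-formed XML and supports the same `.xpath()` call.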
beautifulsoup
- very short learning curve
- two-function API: parse (BeautifulSoup) and search (find_all)
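import requests  # assumed HTTP client; the original snippet only shows `response.text`
from bs4 import BeautifulSoup
response = requests.get("http://example.org")
soup = BeautifulSoup(response.text, "html.parser")  # parse
mydivs = soup.find_all("div", {"class": "stylelistrow"})  # search by tag and class
for i, div in enumerate(mydivs):
    print(i, div)
print(soup.body.div.div)  # attribute navigation: first <div> inside the first <div> of <body>, or None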
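The parse and search steps can also be tried offline. The sketch below runs them on an inline HTML string, so no network request is needed. The sample markup and the helper name `extract_rows` are hypothetical; only the `stylelistrow` class name comes from the snippet in these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
HTML = """
<html><body>
  <div class="stylelistrow"><a href="/a">First</a></div>
  <div class="stylelistrow"><a href="/b">Second</a></div>
  <div class="footer">ignored</div>
</body></html>
"""


def extract_rows(html):
    """Return (link text, href) pairs for every div with class 'stylelistrow'."""
    soup = BeautifulSoup(html, "html.parser")                  # parse
    rows = soup.find_all("div", {"class": "stylelistrow"})     # search
    # `row.a` is the first <a> tag nested inside each matching div.
    return [(row.a.get_text(strip=True), row.a["href"]) for row in rows]


if __name__ == "__main__":
    print(extract_rows(HTML))  # [('First', '/a'), ('Second', '/b')]
```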
Selenium (for JavaScript-rendered pages)
Headless browser
- Headless Chromium
- Zombie ⭐ 5.6k
- slimerjs
- puppeteer ⭐ 94k
Proxies
Links
- https://www.toptal.com/python/web-scraping-with-python
- https://www.freecodecamp.org/news/how-to-scrape-websites-with-python
AI Tools
- GitHub - mendableai/firecrawl: 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API. ⭐ 99k
- GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown. ⭐ 93k
- GitHub - scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python. ⭐ 61k
- GitHub - unclecode/crawl4ai: 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN ⭐ 63k
- GitHub - ScrapeGraphAI/Scrapegraph-ai: Python scraper based on AI ⭐ 23k
- Scrape and Monitor Data from Any Website with No Code
- GitHub - laramies/theHarvester: E-mails, subdomains and names Harvester - OSINT ⭐ 16k
- Overview - Reducto API
- Beautiful Soup Documentation — Beautiful Soup 4.13.0 documentation
- GitHub - supermemoryai/markdowner: A fast tool to convert any website into LLM-ready markdown data. Built by https://supermemory.ai ⭐ 1.9k
- GitHub - mendableai/firegeo: 🔥 GEO-powered SaaS starter built with Firecrawl for brand monitoring, auth, and billing ⭐ 610
- How to Scrape Data From Any Website Using Deepseek - YouTube