Web Scraping
Web scraping is a technique in which a program extracts data from the human-readable output of websites.
Download full website
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Website Copier | Download Sites | Website Ripper - Tools Bug
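The wget command above follows links recursively but, because of --no-parent, never climbs above the starting directory. As a rough illustration of that per-page filtering step (not wget's actual implementation), here is a minimal standard-library sketch. The names LinkCollector and links_below and the sample URLs are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def links_below(html, base_url):
    """Return absolute links on the same host and under base_url's directory."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base = urlparse(base_url)
    parent_dir = base.path.rsplit("/", 1)[0] + "/"
    return [
        url
        for url in collector.links
        if urlparse(url).netloc == base.netloc
        and urlparse(url).path.startswith(parent_dir)
    ]


SAMPLE_HTML = """
<a href="page2.html">next</a>
<a href="sub/a.html">deeper</a>
<a href="/about.html">outside the start directory</a>
<a href="http://other.org/docs/x.html">another host</a>
"""

kept = links_below(SAMPLE_HTML, "http://example.org/docs/index.html")
print(kept)
```

A real mirroring crawl would also fetch each kept link, remember visited URLs, and rewrite links for local viewing, which is what --mirror and --convert-links take care of.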
lxml.etree
XPath - the query language for XML
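A small sketch of querying a document by path expression. The sample catalog and the helper titles_by_lang are invented for illustration. The code sticks to the XPath subset that findall understands, so it runs with lxml.etree and falls back to the standard library's xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers a compatible subset
    import xml.etree.ElementTree as etree

# Hypothetical sample document, used only for this example.
DOC = """
<catalog>
  <book lang="en"><title>Dune</title><price>9.99</price></book>
  <book lang="de"><title>Momo</title><price>7.50</price></book>
  <book lang="en"><title>Emma</title><price>5.00</price></book>
</catalog>
"""

root = etree.fromstring(DOC.strip())


def titles_by_lang(root, lang):
    """Return the titles of all <book> elements whose lang attribute matches."""
    path = f".//book[@lang='{lang}']/title"
    return [title.text for title in root.findall(path)]


print(titles_by_lang(root, "en"))  # titles of the English books
```

With lxml, root.xpath(...) accepts full XPath 1.0 expressions, for example //book[@lang='en']/title/text(), which returns the strings directly.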
beautifulsoup
- very short learning curve
- two-function API: parse (the BeautifulSoup constructor) and search (find_all)
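import requests
from bs4 import BeautifulSoup
# example.org is a placeholder; replace it with the page to scrape
response = requests.get("http://example.org")
soup = BeautifulSoup(response.text, 'html.parser')
mydivs = soup.find_all("div", {"class": "stylelistrow"})
for i, div in enumerate(mydivs):
    print(i, div)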
Selenium (for JavaScript-rendered pages)
Headless browser
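A minimal sketch of driving a headless browser with Selenium, assuming Selenium 4 and a local Chrome installation. The function name fetch_rendered_html and its timeout parameter are made up for this example.

```python
def fetch_rendered_html(url, timeout=30):
    """Load url in headless Chrome and return the page source after scripts ran."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout)  # give up if the page hangs
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


# Example call (needs Chrome and network access):
# html = fetch_rendered_html("http://example.org")
```

The returned HTML can then be passed to a parser such as BeautifulSoup, just like response.text from a plain HTTP request.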
Proxies
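One way to route requests through proxies, sketched with the standard library only. The proxy addresses below are placeholders, and build_proxy_opener and next_opener are hypothetical helper names. Cycling through a pool spreads requests over several exit addresses.

```python
import itertools
import urllib.request

# Placeholder proxy addresses; replace with proxies you are allowed to use.
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]
_pool = itertools.cycle(PROXIES)


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def next_opener():
    """Return an opener bound to the next proxy in the pool (round robin)."""
    return build_proxy_opener(next(_pool))


opener = build_proxy_opener(PROXIES[0])
# Example call (needs a live proxy):
# html = opener.open("http://example.org", timeout=10).read()
```

The requests library covers the same need with its proxies argument, for example requests.get(url, proxies={"http": proxy_url, "https": proxy_url}).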
Links
- https://www.toptal.com/python/web-scraping-with-python
- https://www.freecodecamp.org/news/how-to-scrape-websites-with-python
AI Tools
- Scrape and Monitor Data from Any Website with No Code
- GitHub - laramies/theHarvester: E-mails, subdomains and names Harvester - OSINT
- Overview - Reducto API
- GitHub - mendableai/firecrawl: 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API. (37K Stars)
- GitHub - ScrapeGraphAI/Scrapegraph-ai: Python scraper based on AI (19.3K Stars)
- GitHub - scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python. (55K Stars)
- Beautiful Soup Documentation — Beautiful Soup 4.13.0 documentation
- GitHub - unclecode/crawl4ai: 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN (40.8K Stars)
- GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown. (61.4K Stars)
- GitHub - supermemoryai/markdowner: A fast tool to convert any website into LLM-ready markdown data. Built by https://supermemory.ai (1.5K Stars)
- GitHub - mendableai/firegeo: 🔥 GEO-powered SaaS starter built with Firecrawl for brand monitoring, auth, and billing