Libraries
Top Python libraries of 2023 | Tryolabs
GitHub - KalyanKS-NLP/llm-engineer-toolkit: A curated list of 120+ LLM libraries category wise.
- LLM Training – Tools for fine-tuning and pretraining LLMs efficiently.
- LLM Application Development – Frameworks for integrating LLMs into real-world applications.
- RAG – Libraries for retrieval-augmented generation, enhancing knowledge retrieval.
- LLM Inference – Solutions for running models efficiently across different hardware.
- LLM Serving – Scalable frameworks for model deployment.
- Data Extraction – Tools for extracting structured insights from raw text.
- Data Generation – Resources for creating synthetic data.
- Agents – Autonomous AI agents powered by LLMs.
- LLM Evaluation – Tools for benchmarking and assessing model performance.
- LLM Monitoring – Observability tools for tracking model behavior.
- Prompts – Libraries for managing and optimizing prompt engineering.
- Structured Outputs – Solutions for enforcing structured responses from LLMs.
- LLM Safety and Security – Guardrails for ethical and secure AI usage.
The 10 main picks
- LiteLLM - call any LLM using OpenAI format, and more
- PyApp - deploy self contained Python applications anywhere
- Taipy - build UIs for data apps, even in production
- MLX - machine learning on Apple silicon with NumPy-like API
- Unstructured - the ultimate toolkit for text preprocessing
- ZenML and AutoMLOps - portable, production-ready MLOps pipelines
- WhisperX - speech recognition with word-level timestamps & diarization
- AutoGen - LLM conversational collaborative suite
- Guardrails - babysit LLMs so they behave as intended
- Temporian - the "Pandas" built for preprocessing temporal data
- GitHub - langgenius/dify: Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
Runners-up
Causal inference
- CausalTune - a library for automated tuning and selection for causal estimators.
- CausalPy - A Python package for causal inference in quasi-experimental settings.
- PyWhy-LLM - experimental library integrating LLM capabilities to support causal analyses.
CLI LLM Tools
- Chatblade - ChatGPT on the command line, providing utility methods to extract JSON or Markdown from ChatGPT responses.
- Elia - A terminal ChatGPT client built with Textual.
- Gorilla CLI - powers your command-line interactions with a user-centric tool. Simply state your objective, and Gorilla CLI will generate potential commands for execution.
- LLM - A CLI utility and Python library for interacting with Large Language Models, both via remote APIs and models that can be installed and run on your own machine. By the author of Datasette.
Code Tools
- Chainlit - "the Streamlit for ChatGPT", create ChatGPT-like UIs on top of any Python code in minutes!
- pydistcheck - Linter that finds portability issues in Python package distributions (wheels and sdists).
- pyxet - lightweight interface for the XetHub platform, a blob store with a filesystem-like interface and Git capabilities.
Code Review
- GitHub - luiyen/llm-code-review: A container GitHub Action to review a pull request by HuggingFace's LLM Model.
- Revolutionizing Code Review with Large Language Models: A Deep Dive into code2prompt and its Peers | by Pınar Ersoy | ANOLYTICS | Jun, 2024 | Medium
Computer vision
- deepdoctection - orchestrates document extraction and document layout analysis tasks using deep learning models.
- FaceFusion - Next generation face swapper and enhancer.
- MetaSeg - packaged version of the Segment Anything Model (SAM).
- VTracer - open source software to convert raster images (like jpg & png) into vector graphics (svg)
Data and Features
- Adala - Autonomous DAta (Labeling) Agent framework.
- Autolabel - Label, clean and enrich text datasets with LLMs.
- balance - simple workflow and methods for dealing with biased data samples when looking to infer from them to some target population of interest. By Meta.
- Bytewax - Python framework that simplifies event and stream processing. Because Bytewax couples the stream and event processing capabilities of Flink, Spark, and Kafka Streams with the friendly and familiar interface of Python, you can re-use the Python libraries you already know and love.
- Featureform - feature store. Turn your existing data infrastructure into a feature store.
- Galactic - cleaning and curation tools for massive unstructured text datasets.
- Great Expectations - helps data teams build a shared understanding of their data through quality testing, documentation, and profiling.
Scraping / Scrapers
- GitHub - mendableai/firecrawl: 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API. (37K Stars)
- GitHub - ScrapeGraphAI/Scrapegraph-ai: Python scraper based on AI (19.3K Stars)
- GitHub - scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python. (55K Stars)
- Beautiful Soup Documentation — Beautiful Soup 4.13.0 documentation
- GitHub - unclecode/crawl4ai: 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. (40.8K Stars)
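What "turning a page into LLM-ready markdown" means at its simplest can be sketched with the Python standard library alone. This is an illustrative sketch, not the API of Firecrawl, ScrapeGraphAI, Scrapy, Beautiful Soup, or Crawl4AI; the names `LinkExtractor`, `links_to_markdown`, and `SAMPLE_HTML` are hypothetical, and the HTML is an inline sample rather than a fetched page.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_to_markdown(html):
    """Return the page's links as a list of markdown link strings."""
    parser = LinkExtractor()
    parser.feed(html)
    return [f"[{text}]({href})" for text, href in parser.links]


# Hypothetical sample input; a real scraper would download this.
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.com/a">First page</a></li>
  <li><a href="https://example.com/b">Second <b>page</b></a></li>
  <li><a name="anchor-only">No href</a></li>
</ul>
"""

markdown_links = links_to_markdown(SAMPLE_HTML)
print("\n".join(markdown_links))
```

The libraries listed above add what this sketch leaves out: fetching, crawling across pages, JavaScript rendering, and tolerant handling of malformed markup.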
Data Visualization
- PyGWalker - turn your pandas DataFrame into a Tableau-style User Interface for visual analysis.
- Vizro - a toolkit for creating modular data visualization applications. By McKinsey.
Embeddings and Vector DBs
- Epsilla - a high-performance vector database management system focused on scalability and cost-effectiveness of vector search.
- LanceDB - open-source database for vector-search built with persistent storage, which greatly simplifies retrieval, filtering and management of embeddings.
- SeaGOAT - local search tool that uses vector embeddings to let you search your codebase semantically.
- Text Embeddings Inference - A blazing fast inference solution for text embeddings models.
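The core operation behind these tools, finding the stored vectors closest to a query vector, can be sketched in plain Python. This is a brute-force illustration under stated assumptions, not the API of Epsilla, LanceDB, SeaGOAT, or Text Embeddings Inference: the three-dimensional "embeddings" are made-up toy values, and `cosine_similarity`, `top_k`, and `index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (key, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Toy, hand-written vectors standing in for real model embeddings.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
print(results)
```

A real vector database replaces the linear scan with an approximate nearest-neighbour index and persistent storage, which is where the scalability and cost claims in the entries above come in.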
Federated learning
- Flower - A Friendly Federated Learning Framework.
- MetisFL - federated learning framework that allows developers to easily federate their machine learning workflows and train their models across distributed data silos without ever collecting the data in a centralized location.
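The idea of training across data silos without centralizing the data can be sketched as one aggregation step: each client trains locally and sends back only its model weights, which the server combines by a weighted average (commonly called federated averaging). This is a minimal sketch under stated assumptions, not the API of Flower or MetisFL; `federated_average` and the sample updates are hypothetical, and models are reduced to flat lists of floats.

```python
def federated_average(client_updates):
    """Combine client weights, weighting each client by its example count.

    client_updates: list of (weights, num_examples) tuples, where weights
    is a flat list of floats of the same length for every client.
    """
    total_examples = sum(n for _, n in client_updates)
    if total_examples == 0:
        raise ValueError("at least one client must report training examples")

    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        if len(weights) != dim:
            raise ValueError("all clients must send weights of the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_examples
    return averaged


# Hypothetical round: two clients, the second trained on 3x more data.
updates = [
    ([1.0, 2.0], 10),
    ([3.0, 4.0], 30),
]
global_weights = federated_average(updates)
print(global_weights)
```

Only the weight lists and example counts reach the server here; the raw training data never leaves the clients, which is the property the MetisFL description above emphasizes.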