Skip to main content

Building

Architecture

emerging-llm-app-stack

Emerging Architectures for LLM Applications | Andreessen Horowitz

Transformers, explained: Understand the model behind GPT, BERT, and T5 - YouTube

  • Positional encodings
  • Attention
  • Self attention
  • GPT3 - 45tb of text data

chat-gpt-working

Let’s Architect! Discovering Generative AI on AWS | AWS Architecture Blog

Building

LLM Working

Decoding Strategies

How to train your ChatGPT

Stage 1: Pretraining

  1. Download ~10TB of text
  2. Get a cluster of ~6,000 GPUs
  3. Compress the text into a neural network, pay ~$2M, wait ~12 days
  4. Obtain base model

Stage 2: Finetuning

  1. Write labeling instructions
  2. Hire people (or use scale.ai!), collect 100K high quality ideal Q&A responses, and/or comparisons
  3. Finetune base model on this data, wait ~1 day
  4. Obtain assistant model
  5. Run a lot of evaluations
  6. Deploy
  7. Monitor, collect misbehaviors, go to step 1

LLM Security

  • Jailbreaking
  • Prompt injection
  • Backdoors & data poisoning
  • Adversarial inputs
  • Insecure output handling
  • Data extraction & privacy
  • Data reconstruction
  • Denial of service
  • Escalation
  • Watermarking & evasion
  • Model theft

1hr Talk Intro to Large Language Models - YouTube

Awesome ChatGPT Prompts | This repo includes ChatGPT prompt curation to use ChatGPT better.

SynthID - Google DeepMind

Distillation Attack

A distillation attack occurs when someone systematically queries a proprietary AI model and uses its outputs to train a smaller or competing model, effectively ‘stealing’ its capabilities without access to the original weights.

Instead of copying parameters, the attacker copies behaviour by treating the target model as a teacher and learning from its responses at scale. For developers building AI products, this raises serious concerns around API exposure, rate limits, watermarking, and model output monitoring.

Dev Tools

Ollama / LM Studio

The easiest way to get up and running with large language models locally.

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

docker exec -it ollama ollama run llama2

docker exec -it ollama ollama run llama2-uncensored

docker exec -it ollama ollama run mistral

>>> /? # for help

Options

  • LM Studio – A polished desktop application with a built-in search for finding and downloading models from Hugging Face. It is the easiest way to visually manage models and adjust hardware settings like GPU offloading without touching a terminal.
  • Ollama – A lightweight command-line tool that runs as a background service to serve models via a simple API. It is the best choice for developers who want to integrate local AI into other apps or run models with a single terminal command.
  • MLX-LM – Apple’s official framework optimized specifically for Apple Silicon to achieve the highest possible inference speeds. It is the "performance king" for those comfortable with Python who want to squeeze every drop of power from their Mac's GPU.
  • Jan.ai – An open-source, privacy-focused assistant that provides a clean chat interface similar to ChatGPT but entirely offline. It is ideal for users who want organized chat history, file uploads, and a "set-it-and-forget-it" local workspace.
  • GPT4All – A beginner-friendly app designed to run efficiently on standard CPUs without needing a powerful graphics card. It features a built-in "LocalDocs" tool that lets you chat privately with your own PDF and text collections out of the box.

open-webui / openwebui / open webui (103K stars)

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with built-in inference engine for RAG, making it a powerful AI deployment solution.

GitHub - open-webui/open-webui: User-friendly AI Interface (Supports Ollama, OpenAI API, ...)

OpenWebUI, it provides a universal chat like interface for using any models via Ollama. We can add prompt templates and use the chat for our purpose. It is a solution for using LLMs but not allowing the parent companies to collect the data. It is based on Open-AI's api so integrating other models would be a challenge.

Alternative

oobabooga

A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

GitHub - oobabooga/text-generation-webui: A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

GitHub - oobabooga/text-generation-webui-extensions

Ludwig

Ludwig is an open-source, declarative machine learning framework that makes it easy to define deep learning pipelines with a simple and flexible data-driven configuration system. Ludwig is suitable for a wide variety of AI tasks, and is hosted by the Linux Foundation AI & Data.

Ludwig enables you to apply state-of-the-art tabular, natural language processing, and computer vision models to your existing data and put them into production with just a few short commands.

GitHub - ludwig-ai/ludwig: Low-code framework for building custom LLMs, neural networks, and other AI models

Ludwig

What is Ludwig? - Ludwig

Others

SAAS

Resources

LLM Visualization

Development with Large Language Models Tutorial - OpenAI, Langchain, Agents, Chroma - YouTube

document-based-question-answering-system