ClickHouse
ClickHouse is an open source column-oriented database management system capable of realtime generation of analytical data reports using SQL queries.
Key Features
- True column-oriented storage
- Vectorized query execution
- Data compression
- Parallel and distributed query execution
- Real time query processing
- Real time data ingestion
- On-disk locality of reference
- Cross-datacenter replication
- High availability
- SQL support
- Local and distributed joins
- Pluggable external dimension tables
- Arrays and nested data types
- Approximate query processing
- Probabilistic data structures
- Full support of IPv6
- Features for web analytics
- State-of-the-art algorithms
- Detailed documentation - Clean documented code
History
ClickHouse is developed by a Russian company called Yandex. It is designed for multiple projects within Yandex. Yandex needed a DBMS to analyze large amounts of data, thus they began to develop their own column-oriented DBMS. The prototype of ClickHouse appeared in 2009 and it was released to open-source in 2016.
Compression
Dictionary Encoding Delta Encoding Naïve (Page-Level)
In addition to general-purpose encoding with LZ4 (default) or Zstd, ClickHouse supports dictionary encoding via LowCardinality data type, as well as delta, double-delta and Gorilla encodings via column codecs.
Concurrency Control
ClickHouse does not support multi-statement transactions.
Data Model
ClickHouse uses the relational database model.
Foreign Keys
ClickHouse does not support foreign keys.
Indexes
ClickHouse supports primary key indexes. The indexing mechanism is called a sparse index. In the MergeTree, data are sorted by primary key lexicographically in each part. Then ClickHouse selects some marks for every Nth row, where N is chosen adaptively by default. Together these marks serve as a sparse index, which allows efficient range queries.
Joins
ClickHouse uses hash join by default, which is done by placing the right part of data in a hash table in memory. If there's not enough memory for hash join it falls back to merge join.
Logging
ClickHouse replicates its data on multiple nodes and monitors data synchronicity on replicas. It recovers after failures by syncing data from other replica nodes.
Parallel Execution
Intra-Operator (Horizontal) Inter-Operator (Vertical)
ClickHouse utilizes half cores for single-node queries and one replica of each shard for distributed queries by default. It could be tuned to utilize only one core, all cores of the whole cluster or anything in between.
Query Compilation
ClickHouse supports runtime code generation. The code is generated for every kind of query on the fly, removing all indirection and dynamic dispatch. Runtime code generation can be better when it fuses many operations together and fully utilizes CPU execution units.
Query Execution
Query Interface
Custom API SQL HTTP / REST Command-line / Shell
ClickHouses provides two types of parsers: a full SQL parser and a data format parser. It uses SQL parser for all types of queries and the data format parser only for INSERT queries. Beyond the query language, it provides multiple user interfaces, including HTTP interface, JDBC driver, TCP interface, command-line client, etc.
Storage Architecture
Disk-oriented In-Memory Hybrid
ClickHouse has multiple types of table engines. The type of the table engine determines where the data is stored, concurrent level, whether indexes are supported and some other properties. Key table engine family for production use is a MergeTree that allows for resilient storage of large volumes of data and supports replication. There's also a Log family for lightweight storage of temporary data and Distributed engine for querying a cluster.
Storage Model
Decomposition Storage Model (Columnar)
ClickHouse is a column-oriented DBMS and it stores data by columns.
Storage Organization
Indexed Sequential Access Method (ISAM) Sorted Files
Stored Procedures
Currently, stored procedures and UDF are listed as open issues in ClickHouse.
System Architecture
ClickHouse system in a distributed setup is a cluster of shards. It uses asynchronous multimaster replication and there is no single point of contention across the system.
Views
Virtual Views Materialized Views
ClickHouse supports both virtual views and materialized views. The materialized views store data transformed by corresponding SELECT query. The SELECT query can contain DISTINCT, GROUP BY, ORDER BY, LIMIT, etc.
Internals
Modern SQL in 2023 - ClickHouse - YouTube
"Building for Fast" by Alexey Milovidov, Amsterdam, June 2022 - YouTube
Why is ClickHouse so fast? | ClickHouse Docs
Commands
docker pull clickhouse/clickhouse-server
docker run -d -p 8123:8123 -p 9000:9000 --network=test-net --name some-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server
docker run -d --name some-clickhouse-server -p 8123:8123 -p 9000:9000 --network=test-net -e CLICKHOUSE_DB=my_database -e CLICKHOUSE_USER=username -e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 -e CLICKHOUSE_PASSWORD=password --ulimit nofile=262144:262144 clickhouse/clickhouse-server
# https://hub.docker.com/r/clickhouse/clickhouse-server/
echo 'SELECT version()' | curl 'http://localhost:8123/' --data-binary @-
Visual Interfaces from Third-party Developers | ClickHouse Docs
-- https://clickhouse.com/docs/en/getting-started/quick-start
clickhouse client
CREATE TABLE my_first_table
(
user_id UInt32,
message String,
timestamp DateTime,
metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp)
INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 )
SELECT *
FROM my_first_table
ORDER BY timestamp
SELECT
passenger_count,
avg(toFloat32(total_amount))
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_0.gz',
'TabSeparatedWithNames'
)
GROUP BY passenger_count
ORDER BY passenger_count;
New York Taxi Data | ClickHouse Docs
Links
- Fast Open-Source OLAP DBMS - ClickHouse
- GitHub - ClickHouse/ClickHouse: ClickHouse is a free analytics DBMS for big data
- ClickHouse - YouTube
- What Is ClickHouse? | ClickHouse Docs
- Getting Started with ClickHouse
- Used by - Zerodha, Cloudflare, Uber, Ebay (moved from postgres, elk, druid to clickhouse)
- Learn ClickHouse with Mark - YouTube
- ClickHouse Academy - "How to" Sessions - YouTube
- ClickHouse and The One Trillion Row Challenge
- DataOps Barcelona 2018 - Clickhouse Use Cases - YouTube
- YouTube
- The Secrets of ClickHouse Performance Optimizations at BDTC 2019 - YouTube
- Didi Migrates from Elasticsearch to ClickHouse for a new Generation Log Storage System
- Building a Robust Data Pipeline with Kafka and ClickHouse | The Write Ahead Log
- First ClickHouse research paper: How do you make a modern data analytics database lightning-fast?