Working
Writes
- Need to be lock-free and fast (no reads or disk seeks)
- Client sends write to one coordinator node in Cassandra cluster
- Coordinator may be per-key, per-client, or per-query
- Per-key coordinator ensures writes for the key are serialized
- Coordinator uses Partitioner to send query to all replica nodes responsible for key
- When X replicas respond (X = the client's chosen consistency level), coordinator returns an acknowledgement to the client
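A minimal sketch of the ack-counting at the coordinator; send_write is a hypothetical stub standing in for the RPC to a replica node:

    import concurrent.futures as futures

    def send_write(replica, key, value):
        # Hypothetical stub: in real Cassandra this is an RPC to a replica.
        return True

    def coordinator_write(key, value, replicas, X, timeout=2.0):
        # Send the write to all replica nodes responsible for the key, in parallel.
        with futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            pending = [pool.submit(send_write, r, key, value) for r in replicas]
            acks = 0
            for f in futures.as_completed(pending, timeout=timeout):
                if f.result():          # replica acknowledged the write
                    acks += 1
                    if acks >= X:       # X = client-chosen consistency level
                        return "ACK"    # acknowledge the client immediately
        raise TimeoutError("fewer than X replicas responded in time")

    print(coordinator_write("k", "v", ["r1", "r2", "r3"], X=2))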
Always writable: Hinted Handoff mechanism
- If any replica is down, the coordinator writes to all other replicas, and keeps the write locally (as a "hint") until the down replica comes back up
- When all replicas are down, the Coordinator (front end) buffers the write (for up to a few hours)
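A sketch of hinted handoff under the same assumptions; is_up and send_write are hypothetical stubs standing in for the failure detector and the replica RPC:

    import time

    def is_up(replica):
        return True  # hypothetical stub: real systems use a failure detector

    def send_write(replica, key, value):
        return True  # hypothetical RPC stub

    hints = []  # (down_replica, key, value, stored_at), kept locally

    def write_with_hints(key, value, replicas):
        for r in replicas:
            if is_up(r):
                send_write(r, key, value)
            else:
                # Replica is down: keep the write locally as a "hint".
                hints.append((r, key, value, time.time()))

    def replay_hints(max_age=3 * 3600):
        # Run when a down replica rejoins; hints expire after a few hours.
        for hint in list(hints):
            r, key, value, stored_at = hint
            if time.time() - stored_at > max_age:
                hints.remove(hint)          # too old: drop the hint
            elif is_up(r):
                send_write(r, key, value)   # hand the buffered write off
                hints.remove(hint)

    write_with_hints("k", "v", ["r1", "r2"])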
One ring per datacenter
- Per-DC coordinator elected to coordinate with other DCs
- Election done via Zookeeper, which runs a Paxos (consensus) variant
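A minimal sketch of such an election using the kazoo ZooKeeper client's Election recipe; the ensemble addresses, znode path, and node identifier are made up for illustration:

    from kazoo.client import KazooClient

    def lead():
        # Runs only while this node holds the per-DC coordinator role.
        print("elected per-DC coordinator; syncing with other DCs")

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # illustrative ensemble
    zk.start()
    # The election is backed by ZooKeeper, whose atomic broadcast protocol
    # (ZAB) is a Paxos-like consensus variant. run() blocks until this node
    # wins, then calls lead(); if the leader dies, a new one is elected.
    election = zk.Election("/dc-coordinator/us-east", identifier="node-42")
    election.run(lead)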
Writes at a replica node
- Log it in disk commit log (for failure recovery)
- Make changes to appropriate memtables
- Memtable = In-memory representation of multiple key-value pairs
- Typically an append-only data structure (fast)
- Cache that can be searched by key
- Write-back as opposed to write-through
- Later, when the memtable is full or old, flush it to disk
- Data File: An SSTable (Sorted String Table) - list of key-value pairs, sorted by key
- SSTables are immutable (once created, they don't change)
- Index file: An SSTable of (key, position in data sstable) pairs
- And a Bloom Filter (for efficient search)
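Putting the above together, a toy sketch of the per-replica write path; the flush threshold and file name are illustrative, and the SSTable and index here are simplified in-memory stand-ins for the on-disk files:

    import time

    COMMIT_LOG = open("commitlog.txt", "a")    # toy append-only disk log
    memtable = {}                              # key -> (value, timestamp)
    sstables = []                              # flushed, immutable tables
    MEMTABLE_LIMIT = 4                         # toy flush threshold

    def replica_write(key, value):
        ts = time.time()
        # 1. Append to the commit log first, for failure recovery.
        COMMIT_LOG.write(f"{ts} {key} {value}\n")
        COMMIT_LOG.flush()
        # 2. Update the memtable in memory (write-back: no read, no seek).
        memtable[key] = (value, ts)
        # 3. When the memtable is full or old, flush it as an SSTable.
        if len(memtable) >= MEMTABLE_LIMIT:
            flush_memtable()

    def flush_memtable():
        global memtable
        data = sorted(memtable.items())        # key-value pairs sorted by key
        index = {k: pos for pos, (k, _) in enumerate(data)}  # key -> position
        sstables.append((data, index))         # immutable once written
        memtable = {}                          # start a fresh memtable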
The Write Path
- Writes can be sent to any node in the cluster, which acts as the coordinator
- Writes are written to commit log, then to memtable
- Every write includes a timestamp
- Memtable flushed to disk periodically (as an SSTable)
- New memtable is created in memory
- Deletes are a special write case, called a "tombstone"
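Because every write is timestamped, a delete can be modeled as just another write; a tiny sketch with a made-up TOMBSTONE sentinel:

    import time

    TOMBSTONE = None     # made-up sentinel marking "deleted"
    memtable = {}        # key -> (value, timestamp)

    def write(key, value):
        memtable[key] = (value, time.time())   # every write is timestamped

    def delete(key):
        # A delete is a special write: a tombstone with its own timestamp,
        # so later merges know it supersedes older values of the key.
        memtable[key] = (TOMBSTONE, time.time())

    write("user:7", "alice")
    delete("user:7")
    print(memtable["user:7"])   # (None, <timestamp>), i.e., a tombstone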
What is an SSTable
- Immutable data file for row storage
- Every write includes a timestamp of when it was written
- A partition may be spread across multiple SSTables
- Same column can be in multiple SSTables
- SSTables are merged through compaction; only the entry with the latest timestamp is kept
- Deletes are written as tombstones
- Easy backups
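A simplified Bloom filter sketch showing why SSTable lookups are efficient: a "no" answer means the key is definitely not in that SSTable, so the file can be skipped with no disk I/O; hash scheme and sizes are toy choices:

    import hashlib

    class BloomFilter:
        # Simplified Bloom filter: k hash positions over a fixed bit array.
        def __init__(self, nbits=1024, nhashes=3):
            self.bits = [False] * nbits
            self.nbits, self.nhashes = nbits, nhashes

        def _positions(self, key):
            for i in range(self.nhashes):
                digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.nbits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p] = True

        def might_contain(self, key):
            # False => key definitely absent: skip this SSTable entirely.
            # True  => key may be present (false positives possible): go
            #          to the index file for the position in the data file.
            return all(self.bits[p] for p in self._positions(key))

    bf = BloomFilter()
    bf.add("user:7")
    print(bf.might_contain("user:7"))  # True
    print(bf.might_contain("user:9"))  # (almost surely) False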
The Read Path
- Any server may be queried, it acts as the coordinator
- Coordinator contacts the nodes holding the requested key
- On each node, data is pulled from SSTables and merged
- Consistency level < ALL performs read repair in the background (read_repair_chance, default: 10% of reads)
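A sketch of the per-node merge, modeling the memtable and SSTables as dicts of key -> (value, timestamp); the entry with the latest timestamp wins:

    def read_on_node(key, memtable, sstables):
        # A row may live in the memtable and in several SSTables at once.
        candidates = []
        if key in memtable:
            candidates.append(memtable[key])
        for table in sstables:               # in practice, Bloom-filtered first
            if key in table:
                candidates.append(table[key])
        if not candidates:
            return None
        # Merge: the entry with the latest timestamp is the answer.
        return max(candidates, key=lambda entry: entry[1])

    old = {"user:7": ("alice@old.example", 100.0)}   # older SSTable
    new = {"user:7": ("alice@new.example", 200.0)}   # newer SSTable
    print(read_on_node("user:7", {}, [old, new]))    # ('alice@new.example', 200.0)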
Compaction
- Data updates accumulate over time and SSTables and logs need to be compacted
- Compaction merges SSTables, i.e., merges the accumulated updates for each key (see the sketch after the Deletes list below)
- Run periodically and locally at each server, e.g., TimeWindowCompactionStrategy:
https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Deletes: don't delete item right away
- Add a tombstone to the log
- Eventually, when compaction encounters the tombstone, it will delete the item
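A toy compaction sketch in the same representation: merge per key, keep the latest-timestamped entry, and drop keys whose winning entry is a tombstone. (Real compaction retains tombstones for a grace period before purging; this sketch drops them immediately.)

    def compact(sstable_a, sstable_b):
        # Merge two SSTables (dicts of key -> (value, ts)) into a new one.
        merged = {}
        for table in (sstable_a, sstable_b):
            for key, (value, ts) in table.items():
                # Per key, keep only the entry with the latest timestamp.
                if key not in merged or ts > merged[key][1]:
                    merged[key] = (value, ts)
        # A key whose winning entry is a tombstone (None) is finally deleted.
        return {k: vt for k, vt in merged.items() if vt[0] is not None}

    a = {"x": ("v1", 1.0), "y": ("v2", 2.0)}
    b = {"x": ("v3", 3.0), "y": (None, 4.0)}   # y deleted: tombstone at ts=4.0
    print(compact(a, b))                        # {'x': ('v3', 3.0)}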
Reads: Similar to writes, except
- Coordinator can contact X replicas (e.g., in same rack)
- Coordinator sends read to replicas that have responded quickest in past
- When X replicas respond, coordinator returns the latest timestamped value from among those X
- Coordinator also fetches value from other replicas
- Checks consistency in the background, initiating a read repair if any two values are different
- This mechanism seeks to eventually bring all replicas up to date
- At a replica
- A row may be split across SSTables => reads need to touch multiple SSTables => reads slower than writes (but still fast)
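A sketch tying the read path together: the coordinator queries the X historically fastest replicas, returns the latest-timestamped value among them, and repairs stale replicas in the background; the Replica class is a toy stand-in for real replica nodes:

    import threading

    class Replica:
        # Toy replica: a key-value store plus an observed average latency.
        def __init__(self, avg_latency, store):
            self.avg_latency = avg_latency
            self.store = store                    # key -> (value, timestamp)
        def get(self, key):
            return self.store.get(key, (None, -1.0))
        def put(self, key, entry):
            self.store[key] = entry

    def coordinator_read(key, replicas, X):
        # Contact the X replicas that have responded quickest in the past.
        fastest = sorted(replicas, key=lambda r: r.avg_latency)[:X]
        responses = [r.get(key) for r in fastest]
        latest = max(responses, key=lambda entry: entry[1])  # latest timestamp
        # In the background, check the other replicas and repair stale ones,
        # eventually bringing all replicas up to date.
        threading.Thread(target=read_repair, args=(key, latest, replicas)).start()
        return latest[0]

    def read_repair(key, latest, replicas):
        for r in replicas:
            if r.get(key)[1] < latest[1]:   # stale (older timestamp)
                r.put(key, latest)          # bring the replica up to date

    r1 = Replica(0.01, {"k": ("new", 2.0)})
    r2 = Replica(0.02, {"k": ("old", 1.0)})
    print(coordinator_read("k", [r1, r2], X=1))  # 'new'; r2 repaired in background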