Design

Partitioner

Any server in the cluster could be the coordinator
So every server needs to maintain a list of all the other servers that are currently in the cluster
List needs to be updated automatically as servers join, leave, and fail
Cassandra uses gossip-based cluster membership
- Nodes periodically gossip their membership list
- On receipt, the local membership list is updated
- If any heartbeat is older than Tfail, that node is marked as failed
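
The gossip steps above can be sketched roughly as follows. This is a minimal illustration, not Cassandra's actual code: the `Entry` type, the per-node heartbeat counter, the `merge` and `failed_nodes` helpers, and the value of `T_FAIL` are all assumptions made for the example.

```python
from dataclasses import dataclass

T_FAIL = 10.0  # seconds; hypothetical value for the Tfail timeout


@dataclass
class Entry:
    heartbeat: int      # heartbeat counter, incremented by the owning node
    last_update: float  # local time at which this counter last increased


def merge(local, remote, now):
    """Merge a gossiped membership list (node -> heartbeat) into the local one."""
    for node, remote_hb in remote.items():
        entry = local.get(node)
        # Accept only entries that are new or carry a fresher heartbeat.
        if entry is None or remote_hb > entry.heartbeat:
            local[node] = Entry(remote_hb, now)


def failed_nodes(local, now, t_fail=T_FAIL):
    """Return the nodes whose heartbeat has not advanced within t_fail seconds."""
    return sorted(n for n, e in local.items() if now - e.last_update > t_fail)
```

For example, if node B's counter has not increased for longer than `T_FAIL` while A and C keep advancing, `failed_nodes` reports only B.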

Nodes join the cluster by communicating with any node
Cassandra finds these contact nodes (seed nodes) from a list of possible nodes in cassandra.yaml
Seed nodes communicate cluster topology to the joining node
Once the new node joins the cluster, all nodes are peers
States of nodes
- Joining
- Leaving
- Up
- Down

Cassandra uses a ring-based DHT but without finger tables or routing

Keys are mapped to servers by the partitioner, of which there are two kinds

RandomPartitioner: Chord-like hash partitioning
ByteOrderedPartitioner: Assigns ranges of keys to servers
- Easier for range queries (e.g., get me all Twitter users starting with [a-b])
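
The difference between the two partitioners can be illustrated with a small ring lookup. This is a sketch under stated assumptions: `Ring`, `md5_token`, and `byte_token` are hypothetical names, the node tokens are made up, and MD5 stands in for the Chord-like hash.

```python
import bisect
import hashlib


def md5_token(key):
    """Hash a key to an integer token, in the spirit of RandomPartitioner."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def byte_token(key):
    """Use the raw key bytes as the token, in the spirit of ByteOrderedPartitioner."""
    return key.encode("utf-8")


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: list of (token, node_name) pairs; kept sorted by token.
        self.entries = sorted(node_tokens)
        self.tokens = [t for t, _ in self.entries]

    def owner(self, token):
        # The owner is the first node clockwise whose token is >= the key's
        # token, wrapping around to the start of the ring if necessary.
        i = bisect.bisect_left(self.tokens, token) % len(self.tokens)
        return self.entries[i][1]


# Byte-ordered ring: adjacent keys land on the same node, so a range query
# for names starting with a-b touches a single server.
ordered = Ring([(b"h", "n1"), (b"q", "n2"), (b"\xff", "n3")])
print(ordered.owner(byte_token("alice")), ordered.owner(byte_token("bob")))
```

With hashed tokens the same two keys are scattered across the ring, which spreads load evenly but forces a range query to contact many nodes.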

NetworkTopologyStrategy: replication strategy for multi-DC deployments
- Replica count is set per DC (e.g., two replicas per DC or three replicas per DC)
- Per DC
  - First replica placed according to Partitioner
  - Then go clockwise around ring until you hit a different rack
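
The per-DC placement rule can be sketched as below. This is an illustration only: the ring layout, the `place_replicas` name, and the fallback to same-rack nodes when a DC has too few racks are assumptions for the example, not a description of Cassandra's implementation.

```python
import bisect


def place_replicas(ring, key_token, replicas_per_dc):
    """Pick replicas per DC by walking the ring clockwise from the key's token.

    ring: list of (token, node, dc, rack) tuples sorted by token.
    replicas_per_dc: dict mapping DC name -> desired replica count.
    """
    tokens = [entry[0] for entry in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    clockwise = ring[start:] + ring[:start]

    placement = {}
    for dc, count in replicas_per_dc.items():
        chosen, racks_used, skipped = [], set(), []
        for _, node, node_dc, rack in clockwise:
            if node_dc != dc or len(chosen) == count:
                continue
            if rack in racks_used:
                skipped.append(node)  # same rack as an earlier replica
            else:
                chosen.append(node)
                racks_used.add(rack)
        # Assumed fallback: reuse skipped nodes if the DC has too few racks.
        chosen += skipped[: count - len(chosen)]
        placement[dc] = chosen
    return placement
```

In each DC the first replica is the first node of that DC found clockwise from the key's token; later replicas skip nodes in racks that already hold one.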

Suspicion mechanisms to adaptively set the timeout based on underlying network and failure behavior
Accrual detector: Failure Detector outputs a value (PHI) representing suspicion
Applications set an appropriate threshold
PHI calculation for a member
- Inter-arrival times for gossip messages
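- PHI(t) = -log10(P_later(t_now - t_last)), where P_later(d) = 1 - CDF(d) is the probability, estimated from past inter-arrival times, that a heartbeat arrives more than d after the previous one
- PHI basically determines the detection timeout, but takes into account historical inter-arrival time variations for gossiped heartbeats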
In practice, PHI = 5 => 10-15 sec detection time
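
A rough sketch of the PHI calculation follows. It assumes inter-arrival times are exponentially distributed, so the probability of a heartbeat arriving later than the elapsed time is exp(-elapsed / mean). The class name, the window size, and the choice of distribution are assumptions for illustration.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window=1000):
        # Sliding window of recent inter-arrival times (hypothetical size).
        self.intervals = deque(maxlen=window)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record the arrival of a gossiped heartbeat at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level for the member at time `now`."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10);
        # using the closed form avoids underflow for long silences.
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
print(detector.phi(3.0 + 5 * math.log(10)))
```

Under this exponential assumption and a mean heartbeat interval of 1 second, PHI reaches 5 after about 11.5 seconds of silence, which is in line with the 10-15 sec figure above.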