Design of Zookeeper
Zookeeper - Service for coordinating processes of distributed applications
- Fundamentals
- Design goals
- Architecture
- Applications
Classic Distributed System
- Most systems, like HDFS, have one master and a number of slave nodes, and these slave nodes report to the master.
Fault Tolerant Distributed System
- A real fault-tolerant distributed system has a coordination service, a master, and a backup master
- If the primary fails, the backup takes over
What is Coordination?
- Group membership: A set of datanodes (tasks) that belong to the same group
- Leader election: Electing a leader between primary and backup
- Dynamic Configuration: Multiple services are joining, communicating and leaving (Service lookup registry)
- Status monitoring: Monitoring various processes and services in a cluster
- Queuing: One process enqueues work and another consumes it
- Barriers: All the processes joining the barrier and then leaving the barrier together
- Critical sections: Which process will go to the critical section and when?
What is ZooKeeper?
- ZooKeeper is a highly reliable distributed coordination kernel, which can be used for distributed locking, configuration management, leadership election, work queues
- Zookeeper is a replicated service that holds the metadata of distributed applications
- Key attributes of such data
- Small size
- Performance sensitive
- Dynamic
- Critical
- In very simple words, it is a central key-value store that distributed systems can use to coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines
- Exposes a simple set of primitives
- Very easy to program
- Uses a data model like a directory tree
- Used for
- Synchronisation
- Locking
- Maintaining Configuration
- Coordination service that does not suffer from
- Race Conditions
- Dead locks
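To make the primitive set concrete, here is a minimal sketch using the standard Java client (the ZooKeeper class from org.apache.zookeeper). The connect string localhost:2181, the znode path /config, and its value are illustrative, not prescribed by ZooKeeper.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.*;

    public class ZkHello {
        public static void main(String[] args) throws Exception {
            // Block until the session to the (example) ensemble is established.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Store a small value under the key (znode) "/config".
            zk.create("/config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Read it back; the read is atomic: we get the whole value or an error.
            byte[] data = zk.getData("/config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }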
Design Goals
- Simple
- A shared hierarchical namespace that looks like a standard file system
- The namespace has data nodes - znodes (similar to files/dirs)
- Data is kept in-memory
- Achieves high throughput and low latency
- High performance
- Used in large, distributed systems
- Highly available
- No single point of failure
- Strictly ordered access
- Synchronisation
- Replicated
- All servers have a copy of the state in memory
- A leader is elected at startup
- Followers serve clients; all updates go through the leader
- Update responses are sent when a majority of servers have persisted the change
- We need 2f+1 machines to tolerate f failures
- Ordered
- Zookeeper stamps each update with a number
- The number:
- Reflects the order of transactions
- Used to implement higher-level abstractions, such as synchronization primitives
- Fast
- Performs best where reads are more common than writes, at ratios of around 10:1
Data Model
- The way data is organised in a store is called its data model
- Think of it as highly available file system
- Znode: we store data in an entity called znode
- JSON data: The data that we store should be in JSON format
- No append operation: The znode can only be updated. It does not support append operations
- Data access (read/write) is atomic: a read or write either completes fully or fails with an error; there is no intermediate state such as a half-written znode
- Znode: Can have children
- So znodes inside znodes make a tree like hierarchy
- The top level znode is "/"
- The znode "/zoo" is a child of "/", which is the top-level znode
- duck is a child znode of zoo, denoted as /zoo/duck
- Unlike in a file system, "." and ".." cannot be used as path components
Data model - Znode - Types
- Persistent
- Such znodes remain in ZooKeeper until they are explicitly deleted. This is the default type of znode. To create one you can use the command: create /name_of_myznode "mydata"
- Ephemeral
- An ephemeral node gets deleted when the session in which it was created ends (disconnects). Though it is tied to the client's session, it is visible to other clients.
- An ephemeral node cannot have children, not even ephemeral children
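A short sketch of creating both znode types with the Java client, assuming an already connected ZooKeeper handle zk; the paths /name_of_myznode and /workers/worker-1 are example names, and the parent /workers is assumed to exist.

    // Persistent: remains until explicitly deleted (the default type).
    zk.create("/name_of_myznode", "mydata".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Ephemeral: removed automatically when the creating session ends.
    // Visible to all clients while it exists, but it cannot have children.
    zk.create("/workers/worker-1", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);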
Architecture
- Zookeeper can run in two modes
- Standalone
- In standalone mode, it runs on just one machine; we do not use standalone mode for practical purposes
- This is only for testing purposes
- It doesn't have high availability
- Replicated
- Run on cluster of machines called an ensemble
- High availability
- Tolerates failures as long as a majority of the servers are available
Architecture: Phase 1
Phase 1: Leader election (Paxos Algorithm)
- The machines elect a distinguished member - leader
- The others are termed followers
- The phase is finished when a majority of machines have synced their state with the leader
- If the leader fails, the remaining machines hold a new election, which takes around 200 ms
- If a majority of the machines is not available at any point, the leader automatically steps down
Architecture: Phase 2
Phase 2: Atomic broadcast
- All write requests are forwarded to the leader
- Leader broadcasts the update to the followers
- When a majority have persisted the change
- The leader commits the update
- The client gets success response
- The protocol for achieving consensus is atomic, like two-phase commit
- Machines write the change to disk before applying it to their in-memory state
Election in Zookeeper
- Centralized service for maintaining configuration information
- Uses a variant of Paxos called Zab (Zookeeper Atomic Broadcast)
- Needs to keep a leader elected at all times
- Each server creates a new sequence number for itself
- Let's say the sequence numbers are ids
- Each server gets the highest id so far (from the ZK file system), creates the next-higher id, and writes it into the ZK file system
- The server with the highest id is elected leader
- Failures:
- One option: everyone monitors current master (directly or via a failure detector)
- On failure, initiate election
- Leads to a flood of elections
- Too many messages
- Second option: (implemented in Zookeeper)
- Each process monitors its next-higher id process
- if that successor was the leader and it has failed
- Become the new leader
- else
- wait for a timeout, and check your successor again
- What about id conflicts? What if leader fails during election?
- To address this, Zookeeper uses a two-phase commit protocol (run after the sequence/id step) to commit the leader
- Leader sends NEW_LEADER message to all
- Each process responds with ACK to at most one leader, i.e., the one with the highest process id
- Leader waits for a majority of ACKs, and then sends COMMIT to all
- On receiving COMMIT, process updates its leader variable
- Ensures that safety is still maintained
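As a rough illustration, a common ZooKeeper election recipe implements the same "watch one neighbour" idea with sequential ephemeral znodes; in that recipe the lowest sequence number wins and each candidate watches its immediate predecessor rather than the leader directly, which avoids the flood of elections mentioned above. The sketch below assumes a connected handle zk and an existing persistent parent /election; all names are illustrative.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.*;

    // Each candidate registers itself with an ephemeral sequential znode.
    String me = zk.create("/election/candidate-", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    List<String> candidates = zk.getChildren("/election", false);
    Collections.sort(candidates);

    String myName = me.substring("/election/".length());
    if (candidates.indexOf(myName) == 0) {
        // Lowest sequence number: this process is the leader.
        System.out.println("I am the leader: " + me);
    } else {
        // Watch only the candidate just ahead of us; when its ephemeral znode
        // disappears, re-run this check instead of notifying everyone.
        String predecessor = "/election/" + candidates.get(candidates.indexOf(myName) - 1);
        zk.exists(predecessor, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                // re-run the election check (omitted in this sketch)
            }
        });
    }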
Election Demo
- If you have three nodes A, B, C with A as Leader. And A dies. Will someone become leader?
- If you have three nodes A, B, C and A and B die. Will C become Leader?
Why do we need majority?
- Imagine: We have an ensemble spread over two data centres.
- Imagine: The network between data centres got disconnected. If we did not need majority for electing Leader
- What will happen?
- Each data centre would elect its own Leader: no consistency and utter chaos (a split brain). That is why a majority is required. For example, if a 5-node ensemble is split 3/2, only the side with 3 nodes can form a majority and elect a Leader.
Sessions
- Let's try to understand how ZooKeeper decides to delete ephemeral nodes and how it takes care of session management.
- A client has list of servers in the ensemble
- It tries each until successful
- Server creates a new session for the client
- A session has a timeout period - decided by caller
- If the server hasn't received a request within the timeout period, it may expire the session
- On session expiry, ephemeral nodes are lost
- To keep sessions alive client sends pings (heartbeats)
- Client library takes care of heartbeats
- Sessions are still valid on switching to another server
- Failover is handled automatically by the client
- The application cannot remain completely agnostic of reconnections, because operations issued while the client is disconnected will fail
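A sketch of how the client library surfaces these session events through the default watcher; the connect string and the 10-second session timeout are example values.

    import org.apache.zookeeper.*;

    ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 10000, event -> {
        switch (event.getState()) {
            case SyncConnected:
                // Session established, or failed over to another server in the ensemble.
                System.out.println("connected");
                break;
            case Disconnected:
                // Temporarily cut off; operations fail until the library reconnects.
                System.out.println("disconnected, retrying other servers");
                break;
            case Expired:
                // The server expired the session: ephemeral znodes are gone and a
                // new ZooKeeper handle must be created.
                System.out.println("session expired");
                break;
            default:
                break;
        }
    });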
Multi Update
- Batches multiple operations together
- Either all succeed or all fail, in their entirety
- Possible to implement transactions
- Others never observe any inconsistent state
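A sketch of a multi update with the Java client (ZooKeeper.multi, available since 3.4), assuming a connected handle zk; the paths, values, and versions are illustrative.

    import java.util.Arrays;
    import org.apache.zookeeper.*;

    // Both operations are applied atomically: either both take effect or neither
    // does, so other clients never observe the intermediate state.
    zk.multi(Arrays.asList(
        Op.setData("/app/config", "v2".getBytes(), -1),   // -1 matches any version
        Op.create("/app/config-history/v2", "v2".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    ));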
Watches
- Clients can get notifications when a znode changes in some way
- Watches are triggered only once
- For multiple notifications, re-register
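A sketch of the one-shot watch and re-registration pattern, assuming a connected handle zk and an example znode /config.

    import org.apache.zookeeper.*;

    Watcher watcher = new Watcher() {
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // Read the new value and re-register this watcher for the next
                    // change, since a watch fires only once.
                    byte[] data = zk.getData("/config", this, null);
                    System.out.println("config changed: " + new String(data));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    };
    zk.getData("/config", watcher, null);   // initial read registers the watch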
ACLs - Access Control Lists
An ACL determines who can perform which operations on a znode
- An ACL is the combination of:
- authentication scheme
- an identity for that scheme
- and a set of permissions
- Authentication Scheme
- digest - The client is authenticated by a username & password
- sasl - The client is authenticated using Kerberos
- ip - The client is authenticated by its IP address
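A sketch of the digest scheme with the Java client, assuming a connected handle zk; the username, password, and path are example values.

    import org.apache.zookeeper.*;

    // Authenticate this session with a username:password under the "digest" scheme,
    // then create a znode whose ACL grants all permissions only to that identity.
    zk.addAuthInfo("digest", "alice:secret".getBytes());
    zk.create("/secure-config", "top-secret".getBytes(),
              ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);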
Use Cases
- Building a reliable configuration service
- A distributed lock service
- Only a single process may hold the lock at a time
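A sketch of the lock recipe, assuming a connected handle zk and an existing persistent parent /lock (names are illustrative). It follows the same pattern as the election sketch earlier: the client holding the lowest-numbered ephemeral sequential znode owns the lock, and the znode (and hence the lock) disappears automatically if the holder's session dies.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.*;

    // Queue up for the lock with an ephemeral sequential znode.
    String mine = zk.create("/lock/request-", new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    List<String> requests = zk.getChildren("/lock", false);
    Collections.sort(requests);

    // Lowest-numbered request holds the lock; otherwise watch the request just
    // ahead of ours and retry when it is deleted (as in the election sketch).
    boolean haveLock = mine.endsWith(requests.get(0));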
When not to use?
- To store big data because:
- The number of copies == number of nodes
- All data is loaded in RAM too
- Network load of transferring all data to all Nodes
- When you need extremely strong read consistency, because reads may be served by a follower and can briefly lag behind the leader
Katta - Lucene & more in the cloud
Katta is a scalable, failure-tolerant, distributed data store for real-time access.
Katta serves large, replicated indices as shards to handle high loads and very large data sets. These indices can be of different types; implementations are currently available for Lucene and Hadoop mapfiles.
- Makes serving large or high load indices easy
- Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
- Replicate shards on different servers for performance and fault-tolerance
- Supports pluggable network topologies
- Master fail-over
- Fast, lightweight, easy to integrate
- Plays well with Hadoop clusters
- Apache Version 2 License