Design of Zookeeper
Zookeeper - Service for coordinating processes of distributed applications
- Fundamentals
- Design goals
- Architecture
- Applications
Classic Distributed System
- Most systems, like HDFS, have one master and a number of slave nodes, and these slave nodes report to the master.
Fault Tolerant Distributed System
- A real fault-tolerant distributed system has a coordination service, a master, and a backup master
- If the primary fails, the backup takes over
What is Coordination?
- Group membership: A set of datanodes (tasks) that belong to the same group
- Leader election: Electing a leader between primary and backup
- Dynamic Configuration: Multiple services are joining, communicating and leaving (Service lookup registry)
- Status monitoring: Monitoring various processes and services in a cluster
- Queuing: One process enqueues work and another consumes it
- Barriers: All the processes joining the barrier and then leaving the barrier together
- Critical sections: Which process will go to the critical section and when?
What is ZooKeeper?
- ZooKeeper is a highly reliable distributed coordination kernel, which can be used for distributed locking, configuration management, leadership election, work queues
- Zookeeper is a replicated service that holds the metadata of distributed applications
- Key attributes of such data
- Small size
- Performance sensitive
- Dynamic
- Critical
- In very simple words, it is a central key-value store that distributed systems can use to coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines
- Exposes a simple set of primitives
- Very easy to program
- Uses a data model like a directory tree
- Used for
- Synchronisation
- Locking
- Maintaining Configuration
- Coordination service that does not suffer from
- Race Conditions
- Dead locks
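To make the primitive set concrete, here is a minimal sketch using the standard Java client (the ZooKeeper class from org.apache.zookeeper). The connect string localhost:2181, the znode path /config, and its value are illustrative, not prescribed by ZooKeeper.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.*;

    public class ZkHello {
        public static void main(String[] args) throws Exception {
            // Block until the session to the (example) ensemble is established.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Store a small value under the key (znode) "/config".
            zk.create("/config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Read it back; the read is atomic: we get the whole value or an error.
            byte[] data = zk.getData("/config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }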
Design Goals
- Simple
- A shared hierarchical namespace that looks like a standard file system
- The namespace has data nodes - znodes (similar to files/dirs)
- Data is kept in-memory
- Achieves high throughput and low latency
- High performance
- Used in large, distributed systems
- Highly available
- No single point of failure
- Strictly ordered access
- Synchronisation
- Replicated
- All servers have a copy of the state in memory
- A leader is elected at startup
- Followers serve clients; all updates go through the leader
- Update responses are sent when a majority of servers have persisted the change
- We need 2f+1 machines to tolerate f failures
- Ordered
- Zookeeper stamps each update with a number
- The number:
- Reflects the order of transactions
- Used to implement higher-level abstractions, such as synchronization primitives
- Fast
- Performs best where reads are more common than writes, at ratios of around 10:1
Data Model
- The way data is organised in a store is called its data model
- Think of it as highly available file system
- Znode: we store data in an entity called znode
- JSON data: The data that we store should be in JSON format
- No append operation: The znode can only be updated. It does not support append operations
- Data access (read/write) is atomic: a read or write either completes fully or fails with an error; there is no intermediate state such as a half-written znode
- Znode: Can have children
- So znodes inside znodes make a tree like hierarchy
- The top level znode is "/"
- The znode "/zoo" is a child of "/", which is the top-level znode
- duck is a child znode of zoo, denoted as /zoo/duck
- Unlike in a file system, "." and ".." cannot be used as path components
Data model - Znode - Types
- Persistent
- Such znodes remain in ZooKeeper until they are explicitly deleted. This is the default type of znode. To create one you can use the command: create /name_of_myznode "mydata"
- Ephemeral
- An ephemeral node gets deleted when the session in which it was created ends (disconnects). Though it is tied to the client's session, it is visible to other clients.
- An ephemeral node cannot have children, not even ephemeral children
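A short sketch of creating both znode types with the Java client, assuming an already connected ZooKeeper handle zk; the paths /name_of_myznode and /workers/worker-1 are example names, and the parent /workers is assumed to exist.

    // Persistent: remains until explicitly deleted (the default type).
    zk.create("/name_of_myznode", "mydata".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Ephemeral: removed automatically when the creating session ends.
    // Visible to all clients while it exists, but it cannot have children.
    zk.create("/workers/worker-1", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);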
Architecture
- Zookeeper can run in two modes
- Standalone
- In standalone mode, it runs on just one machine; we do not use standalone mode for practical purposes
- This is only for testing purposes
- It doesn't have high availability
- Replicated
- Run on cluster of machines called an ensemble
- High availability
- Tolerates failures as long as a majority of the servers are available
Architecture: Phase 1
Phase 1: Leader election (Paxos Algorithm)
- The machines elect a distinguished member - leader
- The others are termed followers
- The phase is finished when a majority of machines have synced their state with the leader
- If the leader fails, the remaining machines hold a new election, which takes around 200 ms
- If a majority of the machines is not available at any point, the leader automatically steps down
Architecture: Phase 2
Phase 2: Atomic broadcast
- All write requests are forwarded to the leader
- Leader broadcasts the update to the followers
- When a majority have persisted the change
- The leader commits the update
- The client gets success response
- The protocol for achieving consensus is atomic, like two-phase commit
- Machines write the change to disk before applying it to their in-memory state
Election in Zookeeper
- Centralized service for maintaining configuration information
- Uses a variant of Paxos called Zab (Zookeeper Atomic Broadcast)
- Needs to keep a leader elected at all times
- Each server creates a new sequence number for itself
- Let's say the sequence numbers are ids
- Each server gets the highest id so far (from the ZK file system), creates the next-higher id, and writes it into the ZK file system
- The server with the highest id is elected leader
- Failures:
- One option: everyone monitors current master (directly or via a failure detector)
- On failure, initiate election
- Leads to a flood of elections
- Too many messages
- Second option: (implemented in Zookeeper)
- Each process monitors its next-higher id process
- if that successor was the leader and it has failed
- Become the new leader
- else
- wait for a timeout, and check your successor again
- What about id conflicts? What if leader fails during election?
- To address this, Zookeeper uses a two-phase commit protocol (run after the sequence/id step) to commit the leader
- Leader sends NEW_LEADER message to all
- Each process responds with ACK to at most one leader, i.e., the one with the highest process id
- Leader waits for a majority of ACKs, and then sends COMMIT to all
- On receiving COMMIT, process updates its leader variable
- Ensures that safety is still maintained
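As a rough illustration, a common ZooKeeper election recipe implements the same "watch one neighbour" idea with sequential ephemeral znodes; in that recipe the lowest sequence number wins and each candidate watches its immediate predecessor rather than the leader directly, which avoids the flood of elections mentioned above. The sketch below assumes a connected handle zk and an existing persistent parent /election; all names are illustrative.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.*;

    // Each candidate registers itself with an ephemeral sequential znode.
    String me = zk.create("/election/candidate-", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    List<String> candidates = zk.getChildren("/election", false);
    Collections.sort(candidates);

    String myName = me.substring("/election/".length());
    if (candidates.indexOf(myName) == 0) {
        // Lowest sequence number: this process is the leader.
        System.out.println("I am the leader: " + me);
    } else {
        // Watch only the candidate just ahead of us; when its ephemeral znode
        // disappears, re-run this check instead of notifying everyone.
        String predecessor = "/election/" + candidates.get(candidates.indexOf(myName) - 1);
        zk.exists(predecessor, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                // re-run the election check (omitted in this sketch)
            }
        });
    }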
Election Demo
- If you have three nodes A, B, C with A as Leader. And A dies. Will someone become leader?
- If you have three nodes A, B, C and A and B die. Will C become Leader?
Why do we need majority?
- Imagine: We have an ensemble spread over two data centres.
- Imagine: The network between data centres got disconnected. If we did not need majority for electing Leader
- What will happen?
- Each data centre would elect its own Leader: no consistency and utter chaos (a split brain). That is why a majority is required. For example, if a 5-node ensemble is split 3/2, only the side with 3 nodes can form a majority and elect a Leader.
Sessions
- Let's try to understand how ZooKeeper decides to delete ephemeral nodes and how it takes care of session management.
- A client has list of servers in the ensemble
- It tries each until successful
- Server creates a new session for the client
- A session has a timeout period - decided by caller
- If the server hasn't received a request within the timeout period, it may expire the session
- On session expiry, ephemeral nodes are lost
- To keep sessions alive client sends pings (heartbeats)
- Client library takes care of heartbeats
- Sessions are still valid on switching to another server
- Failover is handled automatically by the client
- The application cannot remain completely agnostic of reconnections, because operations issued while the client is disconnected will fail
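A sketch of how the client library surfaces these session events through the default watcher; the connect string and the 10-second session timeout are example values.

    import org.apache.zookeeper.*;

    ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 10000, event -> {
        switch (event.getState()) {
            case SyncConnected:
                // Session established, or failed over to another server in the ensemble.
                System.out.println("connected");
                break;
            case Disconnected:
                // Temporarily cut off; operations fail until the library reconnects.
                System.out.println("disconnected, retrying other servers");
                break;
            case Expired:
                // The server expired the session: ephemeral znodes are gone and a
                // new ZooKeeper handle must be created.
                System.out.println("session expired");
                break;
            default:
                break;
        }
    });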
Multi Update
- Batches multiple operations together
- Either all succeed or all fail, in their entirety
- Possible to implement transactions
- Others never observe any inconsistent state
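A sketch of a multi update with the Java client (ZooKeeper.multi, available since 3.4), assuming a connected handle zk; the paths, values, and versions are illustrative.

    import java.util.Arrays;
    import org.apache.zookeeper.*;

    // Both operations are applied atomically: either both take effect or neither
    // does, so other clients never observe the intermediate state.
    zk.multi(Arrays.asList(
        Op.setData("/app/config", "v2".getBytes(), -1),   // -1 matches any version
        Op.create("/app/config-history/v2", "v2".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    ));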
Watches
- Clients can get notifications when a znode changes in some way
- Watches are triggered only once
- For multiple notifications, re-register
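A sketch of the one-shot watch and re-registration pattern, assuming a connected handle zk and an example znode /config.

    import org.apache.zookeeper.*;

    Watcher watcher = new Watcher() {
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // Read the new value and re-register this watcher for the next
                    // change, since a watch fires only once.
                    byte[] data = zk.getData("/config", this, null);
                    System.out.println("config changed: " + new String(data));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    };
    zk.getData("/config", watcher, null);   // initial read registers the watch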
ACLs - Access Control Lists
An ACL determines who can perform which operations on a znode
- An ACL is the combination of:
- authentication scheme
- an identity for that scheme
- and a set of permissions
- Authentication Scheme
- digest - The client is authenticated by a username & password
- sasl - The client is authenticated using Kerberos
- ip - The client is authenticated by its IP address
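A sketch of the digest scheme with the Java client, assuming a connected handle zk; the username, password, and path are example values.

    import org.apache.zookeeper.*;

    // Authenticate this session with a username:password under the "digest" scheme,
    // then create a znode whose ACL grants all permissions only to that identity.
    zk.addAuthInfo("digest", "alice:secret".getBytes());
    zk.create("/secure-config", "top-secret".getBytes(),
              ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);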
Use Cases
- Building a reliable configuration service
- A distributed lock service
- Only a single process may hold the lock at a time
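A sketch of the lock recipe, assuming a connected handle zk and an existing persistent parent /lock (names are illustrative). It follows the same pattern as the election sketch earlier: the client holding the lowest-numbered ephemeral sequential znode owns the lock, and the znode (and hence the lock) disappears automatically if the holder's session dies.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.*;

    // Queue up for the lock with an ephemeral sequential znode.
    String mine = zk.create("/lock/request-", new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    List<String> requests = zk.getChildren("/lock", false);
    Collections.sort(requests);

    // Lowest-numbered request holds the lock; otherwise watch the request just
    // ahead of ours and retry when it is deleted (as in the election sketch).
    boolean haveLock = mine.endsWith(requests.get(0));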
When not to use?
- To store big data because:
- The number of copies == number of nodes
- All data is loaded in RAM too
- Network load of transferring all data to all Nodes
- When you need extremely strong read consistency, because reads may be served by a follower and can briefly lag behind the leader
Katta - Lucene & more in the cloud
Katta is a scalable, failure-tolerant, distributed data store for real-time access.
Katta serves large, replicated indices as shards to handle high loads and very large data sets. These indices can be of different types; implementations are currently available for Lucene and Hadoop mapfiles.
- Makes serving large or high load indices easy
- Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
- Replicate shards on different servers for performance and fault-tolerance
- Supports pluggable network topologies
- Master fail-over
- Fast, lightweight, easy to integrate
- Plays well with Hadoop clusters
- Apache Version 2 License