Cluster and Native Service Management
Topic: Cluster and Native Service Management
Presenter: 马可
Interviewer: system
Additional Resources:
Job Referral
Candidates: https://commitway.com/job-refer
Search for “Databricks” to find job openings on Michael’s team
Hiring managers/team members: https://commitway.com/job-open
Cloud computing
Commodity hardware
Low cost, but individual machines are not reliable
Failure is the norm
Heterogeneous hardware
Supports different workloads
Compute and storage are decoupled
Massive scale
Resiliency
Cost efficiency
Multi geo-location
Resiliency - can failover across different regions
Locality - can deploy service close to customers
Data center topology
Rack
Switches (top-of-rack switch/router)
There are usually multiple levels of routers
Ethernet is the most popular protocol (more popular than ATM)
Why is data center management different?
Because we need to manage machines
What is a cluster?
A collection of machines
Can be bare-metal machines or VMs
Cluster management system
Control plane
Data plane
Functionalities
Service discovery
Inventory management
Allocation
Failure detection
Healing
Deployment
Policy management
Node lifecycle management
Workload lifecycle management
How is cluster management different?
Why not use a typical architecture? FE + BE + database?
Highly available?
Scalable?
Fault tolerance
Distributed
Entities to manage - some parts of the control plane need to be co-located with the services
Control plane may be a cache of the data plane
What is a native service?
Compare with traditional 3-tier or single-machine software
Fault tolerance - machines can fail
Multi-tenancy - needs to share resources with other processes
Usually container based
Declare resource demand
Cluster and native service management system
Borg
Google’s home-grown system
Job and service management
Autopilot
Microsoft’s home-grown system
Service management
Kubernetes
Originated at Google
Open source (OSS)
Concepts
| | Borg | Autopilot | Kubernetes |
| --- | --- | --- | --- |
| Management scope | Cell | Cluster | Cluster |
| Control plane | Borgmaster | Autopilot | Kubernetes master (control plane) |
| Data plane | Borglet | Shared | kubelet |
| Workload | Job | Machine type | ReplicaSet |
| Workload instance | Task (single container) | Machine | Pod (multi-container) |
| Node | Node | Physical machine | Node |
| Tenant | N/A | Environment | Namespace |
Microsoft Autopilot vs Service Fabric
Abstraction of services
Service Fabric: control plane and data plane are co-hosted, which helps scaling out
Service Fabric has supported multi-tenancy from the beginning (using job objects/containers). In Autopilot, every instance maps to a physical machine.
Service Fabric supports service store and partitioning (helps scaling out)
Interface
They all use a declarative interface.
Job description
Borg - BCL
Autopilot - *.init file with workload information
K8S - YAML
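To make the declarative idea concrete, here is a minimal sketch in Python. It does not reproduce BCL, Autopilot files, or Kubernetes YAML; `JobSpec`, `reconcile`, and the `name-index` instance naming are hypothetical names chosen for illustration. The user states the desired state, and the system computes the actions needed to reach it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical declarative job description: what should run, not how."""
    name: str
    replicas: int  # desired number of running instances


def reconcile(spec: JobSpec, running: list[str]) -> dict[str, list[str]]:
    """Compare desired state with observed state and return the actions needed."""
    wanted = [f"{spec.name}-{i}" for i in range(spec.replicas)]
    return {
        "start": [t for t in wanted if t not in running],
        "stop": [t for t in running if t not in wanted],
    }


# Observed state has one expected instance and one stray instance.
plan = reconcile(JobSpec("web", 3), ["web-0", "web-5"])
print(plan)
```

The caller never issues "start" or "stop" commands directly; it only edits the spec, and the control plane repeats the comparison until observed state matches.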
Workload lifecycle
Autopilot
Has complex handling for failures. Introduced probation state to avoid cascading failures.
It is tailored to long-running services
Availability constraints
Borg
Number of task disruptions
Autopilot
Failing limit
Kubernetes
Pod disruption budget
Autopilot needs to deal with potential data corruption. Borg jobs have no such need because storage is handled by separate infrastructure such as GFS.
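The availability constraints above share one idea: limit how many instances may be taken down voluntarily at once. A minimal sketch of that check follows, loosely modeled on a Kubernetes pod disruption budget with a minimum-available setting. The function names are hypothetical, and real systems track more state than two integers.

```python
def allowed_disruptions(healthy: int, min_available: int) -> int:
    """How many instances may be disrupted while keeping min_available healthy."""
    return max(0, healthy - min_available)


def can_evict(healthy: int, min_available: int, requested: int = 1) -> bool:
    """True if evicting `requested` instances stays within the budget."""
    return requested <= allowed_disruptions(healthy, min_available)


print(can_evict(healthy=5, min_available=4))  # one spare instance
print(can_evict(healthy=4, min_available=4))  # no spare instance
```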
Priority and quota
Priority
Borg
A positive integer
Bands: monitoring/production/batch/best-effort
No in-band preemption
Autopilot
K8S
Quota
Resource reservation
Part of admission control
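The Borg priority rules above can be sketched as follows. The band names come from the notes; the numeric band boundaries are invented for illustration, and the rule "no preemption inside a band" is applied to every band here, as the notes state it.

```python
import bisect

# Hypothetical lower bounds for each band; real values are not given in the notes.
BAND_BOUNDS = [1, 100, 200, 300]
BAND_NAMES = ["best-effort", "batch", "production", "monitoring"]


def band(priority: int) -> str:
    """Map a positive integer priority to its band."""
    if priority < 1:
        raise ValueError("priority must be a positive integer")
    return BAND_NAMES[bisect.bisect_right(BAND_BOUNDS, priority) - 1]


def can_preempt(incoming: int, victim: int) -> bool:
    """Higher priority wins, but never against a task in the same band."""
    return incoming > victim and band(incoming) != band(victim)


print(band(150), can_preempt(250, 150), can_preempt(260, 250))
```

Forbidding in-band preemption avoids cascades where one task evicts a slightly lower-priority peer, which in turn evicts another.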
Service discovery
Borg
Borg name service - a default name is created
Cell name/job name/task number
Host name/port stored in Chubby
Autopilot
DNS
Host file
Local discovery file
K8S
Service type
ClusterIP
NodePort
LoadBalancer
ExternalName
Discovering
Environment variable
DNS
Headless service
DNS
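For the Kubernetes discovery mechanisms above, the names a client looks up follow fixed conventions. The sketch below builds them by hand for illustration: the DNS form `<service>.<namespace>.svc.<cluster domain>` and the `<NAME>_SERVICE_HOST` / `<NAME>_SERVICE_PORT` environment variables. The helper function names are hypothetical, and `cluster.local` is the usual default domain, which a cluster may override.

```python
def service_dns_name(service: str, namespace: str,
                     cluster_domain: str = "cluster.local") -> str:
    """DNS name under which a Service is resolvable inside the cluster."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def service_env_vars(service: str, cluster_ip: str, port: int) -> dict[str, str]:
    """Environment variables injected into pods for a Service.

    The service name is upper-cased and dashes become underscores.
    """
    prefix = service.upper().replace("-", "_")
    return {
        f"{prefix}_SERVICE_HOST": cluster_ip,
        f"{prefix}_SERVICE_PORT": str(port),
    }


print(service_dns_name("redis-master", "default"))
print(service_env_vars("redis-master", "10.0.0.11", 6379))
```

Environment variables are fixed when a pod starts, so they only cover Services that already existed; DNS does not have that ordering constraint.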
Workload failure detection
Workloads can deadlock or fail in other ways
Borg
HTTP-based health-check URL
Autopilot
Watchdog
OK/Error/Warning
Key-value (KV) based health store
K8S
Liveness probe
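The watchdog model above (several checkers each reporting OK, Warning, or Error into a key-value health store) can be sketched as below. The rule that the worst report decides the overall status is an assumption made for this sketch, as are the names `Status` and `overall_status`.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    ERROR = 2


def overall_status(reports: dict[str, Status]) -> Status:
    """Combine per-watchdog reports; the most severe report wins.

    With no reports yet, this sketch assumes the workload is OK.
    """
    return max(reports.values(), default=Status.OK)


# Key-value health store: watchdog name -> latest status.
health_store = {"disk": Status.OK, "http": Status.WARNING, "memory": Status.OK}
print(overall_status(health_store).name)
```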
Demand may be predicted using ML, but it’s not always predicted correctly.
Node failure detection
Borg
Borglet poll
Autopilot
Watchdog
K8S
Node problem detector
Run as daemonset
Monitor
System logs
System stats
Custom plugin
Health checker
Node controller
Probe node
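Whether the master polls the node (Borg) or a controller watches node status (Kubernetes), the core check is a timeout on the last successful contact. A minimal sketch, with a hypothetical function name and an illustrative 40-second timeout that is not taken from the notes:

```python
def unhealthy_nodes(last_seen: dict[str, float], now: float,
                    timeout: float = 40.0) -> list[str]:
    """Return nodes whose last successful probe is older than `timeout` seconds."""
    return sorted(node for node, ts in last_seen.items() if now - ts > timeout)


# Timestamps are seconds on an arbitrary clock.
last_seen = {"node-a": 100.0, "node-b": 55.0, "node-c": 61.0}
print(unhealthy_nodes(last_seen, now=100.0))
```

The timeout trades detection speed against false positives: a short value reacts quickly but may mark a briefly overloaded node as dead.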
Healing
Autopilot
History and error based
Actions
Reboot - may solve memory leak
Reimage - may solve out-of-disk-space problems
Log rotation
Declare space needs
Replace - disk bad sectors, memory corruption
Migration
Can solve problems similar to those handled by reboot, reimage, and replace
Can solve noisy-neighbor problems
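History-based healing can be read as an escalation ladder: try the cheapest action first and move to a more drastic one if the machine fails again. The sketch below assumes a strict reboot, reimage, replace order and a simple "already tried" history; the actual Autopilot policy is not described in these notes and is likely richer.

```python
ESCALATION = ["reboot", "reimage", "replace"]


def next_action(history: list[str]) -> str:
    """Pick the cheapest repair action that has not been tried yet.

    `history` lists actions already applied to this machine for the
    current problem. Once everything has been tried, stay at the last step.
    """
    for action in ESCALATION:
        if action not in history:
            return action
    return ESCALATION[-1]


print(next_action([]))
print(next_action(["reboot"]))
print(next_action(["reboot", "reimage"]))
```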
Monitoring
Borg
Infrastore
Autopilot
Collection service
Cockpit - HTTPS endpoint on a public IP; can be used to access internal information
K8S
Borg
Borgmaster
5 replicas
Paxos-based replication/persistence
Leader election
Communication with borglet
Scheduler
Feasibility checking
Scoring
Best fit vs worst fit
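The two scheduler phases above can be sketched in a few lines: feasibility checking filters out nodes that cannot hold the task, and scoring ranks the rest. Best fit picks the node left with the least spare capacity (tight packing), while worst fit picks the one left with the most (spreading). The `Node` type and the leftover formula, which naively adds CPU and memory, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu: float  # cores
    free_mem: float  # GiB


def feasible(nodes: list[Node], cpu: float, mem: float) -> list[Node]:
    """Feasibility checking: keep only nodes with enough free resources."""
    return [n for n in nodes if n.free_cpu >= cpu and n.free_mem >= mem]


def leftover(node: Node, cpu: float, mem: float) -> float:
    """Scoring input: spare capacity remaining after placing the task."""
    return (node.free_cpu - cpu) + (node.free_mem - mem)


def pick(nodes: list[Node], cpu: float, mem: float,
         strategy: str = "best") -> Optional[Node]:
    """Choose a node using best fit (min leftover) or worst fit (max leftover)."""
    candidates = feasible(nodes, cpu, mem)
    if not candidates:
        return None
    chooser = min if strategy == "best" else max
    return chooser(candidates, key=lambda n: leftover(n, cpu, mem))


nodes = [Node("a", 4, 8), Node("b", 16, 64), Node("c", 1, 2)]
print(pick(nodes, 2, 4, "best").name, pick(nodes, 2, 4, "worst").name)
```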
How to avoid single point of failure?
Try to minimize / maximize square root of node number square
Minimum of 5 replicas
Tolerates 1 replica lost to a system failure
Plus 1 replica taken down for a system upgrade
Where are 5 replicas?
Usually in the same data center or at the same site (physical location)
Kubernetes: in theory, replicas can run in different data centers
Latency impacts the performance of the Paxos algorithm
Usually on different racks
Software may also fail
Borgmaster software may contain bugs
Blast radius
Try to limit the blast radius of lower tiers
Higher-level components can have a larger blast radius
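The choice of 5 replicas follows from majority-quorum arithmetic, which applies to Paxos-style replication in general. A group of n replicas needs a majority to make progress, so it tolerates n minus the majority size in failures. With 5 replicas that is 2: one unplanned failure while another replica is down for an upgrade. The function names below are illustrative.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of the replica group."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be down while a majority is still available."""
    return replicas - quorum(replicas)


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Note that 4 replicas tolerate no more failures than 3, which is why odd group sizes are the usual choice.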
Autopilot
Device manager
Strong consistency
Replicated using Paxos
Pull vs push
Satellite service
Poll state from device manager
Reliable design - will eventually get the latest update
Simple - push model still requires pulling as backup
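The pull model above is reliable because a satellite that misses intermediate updates still converges on its next poll. The sketch below shows this with a version counter. The class and method names are hypothetical, and the real device manager is a replicated service rather than an in-process object.

```python
class DeviceManager:
    """Holds the authoritative state; bumps a version on every change."""

    def __init__(self) -> None:
        self.version = 0
        self.state: dict[str, str] = {}

    def update(self, machine: str, desired: str) -> None:
        self.state[machine] = desired
        self.version += 1

    def snapshot(self) -> tuple[int, dict[str, str]]:
        return self.version, dict(self.state)


class Satellite:
    """Periodically pulls state; skips work when nothing has changed."""

    def __init__(self) -> None:
        self.version = -1
        self.state: dict[str, str] = {}

    def poll(self, dm: DeviceManager) -> bool:
        version, state = dm.snapshot()
        if version == self.version:
            return False  # already up to date
        self.version, self.state = version, state
        return True


dm, sat = DeviceManager(), Satellite()
dm.update("m1", "reboot")
dm.update("m1", "reimage")  # two updates happen between polls
print(sat.poll(dm), sat.state)
```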
Deployment service
Watchdog service
Repair service
Provision service
Collection service
Cockpit
K8S
API server
Cluster management service
Replication controller
Node controller
CIDR assignment
Inventory reconciliation
Node health monitoring
etcd - distributed key-value store; uses the Raft protocol
Key - object path
Value - binary representation of object
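The key/value layout above can be illustrated with a plain dictionary standing in for etcd. The `/registry/<resource>/<namespace>/<name>` path is the common default layout for namespaced objects, but the prefix is configurable, so treat it as an assumption. JSON bytes stand in here for the real binary encoding, and the helper names are hypothetical.

```python
import json


def object_key(resource: str, namespace: str, name: str,
               prefix: str = "/registry") -> str:
    """Key: the object's path in the store."""
    return f"{prefix}/{resource}/{namespace}/{name}"


def encode(obj: dict) -> bytes:
    """Value: a binary representation of the object (JSON used as a stand-in)."""
    return json.dumps(obj, sort_keys=True).encode("utf-8")


store: dict[str, bytes] = {}  # stand-in for the key-value store
key = object_key("pods", "default", "web-0")
store[key] = encode({"kind": "Pod", "name": "web-0"})

# A prefix scan lists all objects of one kind in one namespace.
pods = [k for k in store if k.startswith("/registry/pods/default/")]
print(key, pods)
```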
Architecture - control plane scalability
Omega
Basic idea: sharding
Optimistic locking
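Optimistic locking lets several schedulers work from their own snapshot of shared state and detects conflicts only at commit time. The sketch below uses a single version number for the whole cell, which is a deliberate simplification (a real system would detect conflicts at a finer granularity). `CellState` and its methods are hypothetical names.

```python
class CellState:
    """Shared cluster state guarded by a version number instead of a lock."""

    def __init__(self, free_cpu: dict[str, int]) -> None:
        self.free_cpu = dict(free_cpu)
        self.version = 0

    def snapshot(self) -> tuple[int, dict[str, int]]:
        return self.version, dict(self.free_cpu)

    def commit(self, seen_version: int, node: str, cpu: int) -> bool:
        """Apply a placement only if the state is unchanged since the snapshot."""
        if seen_version != self.version or self.free_cpu[node] < cpu:
            return False  # conflict: caller must re-read and retry
        self.free_cpu[node] -= cpu
        self.version += 1
        return True


state = CellState({"n1": 4})
v_a, _ = state.snapshot()  # scheduler A reads
v_b, _ = state.snapshot()  # scheduler B reads the same version
print(state.commit(v_a, "n1", 2))  # A wins
print(state.commit(v_b, "n1", 2))  # B conflicts and must retry
```

The approach works well when conflicts are rare; under heavy contention the retries themselves become the cost.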
Borg
Decouple functionalities
Separate scheduler based on snapshot state
Workload specific scheduling
Score caching
Equivalence classes
Random selection
Borglet probe
Read-only API
Sharded across replicas
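Score caching and equivalence classes, listed above, combine naturally: tasks with identical requirements form one class, so a node is scored once per class rather than once per task, and the cached score is dropped when the node changes. This sketch is an illustration only; `ScoreCache`, the tuple of requirements, and the scoring function are hypothetical.

```python
from typing import Callable


class ScoreCache:
    """Caches node scores per (task equivalence class, node) pair."""

    def __init__(self, score_fn: Callable[[tuple, str], float]) -> None:
        self.score_fn = score_fn
        self.cache: dict[tuple, float] = {}
        self.misses = 0  # how often the real scoring function ran

    def score(self, requirements: tuple, node: str) -> float:
        key = (requirements, node)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.score_fn(requirements, node)
        return self.cache[key]

    def invalidate(self, node: str) -> None:
        """Drop cached scores for a node whose state has changed."""
        self.cache = {k: v for k, v in self.cache.items() if k[1] != node}


free_cpu = {"n1": 8.0}
cache = ScoreCache(lambda req, node: free_cpu[node] - req[0])

# 1000 identical tasks share one equivalence class: one real scoring call.
scores = [cache.score((2.0,), "n1") for _ in range(1000)]
print(scores[0], cache.misses)
```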
Autopilot
Complex scale out design
Sharding
How to avoid conflict
Lock global resources (the lock may become a bottleneck); assumes infrequent access
Asynchronous protocol for updates/state modification; assumes an update may fail. Uses optimistic locking and 2-phase commit.
Kubernetes
Claim resources
Complexity is high
Borg
Borglet
Borgmaster polls borglet
Link shard for partition/aggregation
Autopilot
File sync
Application manager (local)
Local watchdog
K8S
Kubelet
Job opening
Resource provisioning across multiple clouds
Challenges: unified interface
How to scale up service stack on different clouds