Distributed Key Value Store

System DesignDatabases & StorageDistributed Systems

Topic: Distributed Key Value Store

Interviewer: 好雨

Interviewee: 仪丽

Level: L4 (Experienced Individual Contributor)


Mock System Design Interview Summary

Interview Overview

Date: 2/13/2022

Target level: 4

Duration: 45 minutes

Topic covered: Key Value Store

Drawing tool used: miro.com

Requirements

Functional requirements

Create, delete, get, no partial update, batch update

Key: string: value: string

Non requirements:

No need for transaction (ACID)

No sorted result needed

Non-functional requirements

Update triggers

10TB / year

Value = 1kb

QPS: 100k QPS

Latency 50ms

Durability

Eventual consistency -> optional strong consistency

Availability? Fault tolerant, no single point of failure, 99.99%

Scalability

Write intensive rather than read intensive

Non functional requirements

System Design

External APIs

System design

Config server:

Metadata, log

JSON serialization format

Request Manager -> config server: who owns the data

Request Manager -> communicate with actual store

JSON vs other?

JSON vs redis(?)

JSON is easier to use for frontend

What protocol to use?

A: HTTP: widely accepted

TCP is for internal use

Request manager -> config server call, is it for every request?

There can be a cache before the config server

Client vs server side cache?

Server side

Every request needs to hit config cache

Keep-alive:

The config server knows primary server is up or not, automatically switch to secondary if primary is down

Single point of failure

Data migration can use instruction from config server

How to handle config server failures?

Can make a backup for config server

How do you make sure there is not 2 active config server

Use sentinel to understand which one is the active config server

How do you support high concurrent writes?

Storage engine levelDB

Some data engine supports write intensive

Why they can support high write volume?

LSM tree as data structure. Supports write-heavy workloads

Why is it fast?

LSM tree: sorted string tables, divide data as segment

Can store key-value in each segment

(audience:

immutable memtable

Then sstable

Then compact

)

How to handle growing data?

Consistent hash

When client request data

Mapping table: node -> information of node

If one node fails, it will put the key value store to other nodes

Config server:

Contains mapping table

2 columns:

Data server info (ip address)

ID of node (e.g. A)

Where to store backup servers?

Also store in config server

Config server (mapping table, user ID, version number)

How is config cache updated?

If new node is added or removed, the config service will regenerate the mapping table

Config service will sync the data to the new cache

Eventual consistency -> strong consistency How to support it?

3 nodes for key-value store. 2 of them changed then we can send success to the client

Primary has the latest data, and secondary has some stale data

RAFT can help

Interviewer: can just route request to primary

When requests update the same key, how do we handle concurrency?

Can store a version number on key-value store

Version 10

Party A: Version 10->version 11

Party B: Version 10->version 11 (rejected)

Optimistic locking

Q: How do we use pessimistic lock?

A: I prefer version based (optimistic lock)

LevelDB:

Write data to memory, flush to disk periodically

Server goes down after writing to memory? How to ensure durability?

Durability is not an issue

Interviewer and Audience Feedback

Interviewer L4:

P2p - more complex

Choose master-slave is the right choice

The solution is workable

Things to improve:

Rushed the technical decision

But we can discuss the tradeoff ahead of the decision

Sometimes interviewee can dive deeper

LSM tree - is the right direction, can go deeper. No very clear during deeper dive

High level is good

Write heavy system?

Storage engine decides high write vs high read

Key value store: B+ tree (favoring read), LSM tree (favoring write)

I asked but deep dive

Audience:

Interviewee should drive requirement gathering

Candidate: encryption. We can say “we don’t need encryption”

If interviewee has not dived deep enough, then the interviewer can drive harder

Audience:

Distributed key value store, what is the emphasis

For example, single node CRUD, read,write intensive, then extend to distributed

The interviewee did not cover the single node

Depends the requirement of the interviewer

Audience

QPS: real data are added every year

1k, write QPS 300 or so. Not write heavy

17 TB

Write and update data

10 TB: how this is defined

100k QPS

On average 1k Byte

100k + key value size

100k write or read QPS?

Interviewer

Read write ratio, QPS will be more clear

Audience

100k QPS 10TB

Interviewee can calculate

Audience

Who should provide the numbers?

Audience

Write heavy, write intensive, distributed system

What database to use?

LevelDB LSM tree

Can system level - write heavy

Audience

Can add a queue, then put data from queue to database

The tradeoff we will get dirty data

Audience

Throughput of levelDB?

Read: B+ tree

LevelDB is a library. LSM tree

RothDB also uses levelDB

innoDB

Audience

Key value store

Default very high scale

First discuss a single node

Surprised that we directly levelDB? Should we design the deeper level

Interviewer

How to cover scalability, durability etc?

Audience

The interviewer asked about single machine design first, then move to scale the system

Key value store, why is it better than traditional database?

Because in a single machine writing is much faster

Audience

Level DB, how to design

Normal interview may not dive so deep

It seems online videos have similar flow (single -> multiple devices)

Audience

Existing solution name -> then internal dive

Audience

First list the requirement

Audience

Overlaps quite a bit with existing solutions

Audience

The key point is if the interviewer knows existing solutions

Audience

Bigtable - much more complex, tear (taobao),

We can choose a simple one, and easy to explain.

Audience

L4: lower layer is better, or application layer is better?

Such as appointment application

Which one is more suitable for L4

Audience

Which points do we want to deep dive into?

Want to ask about non-functional?

Key value p2p vs master slave. How to prevent master single point of failure

Request - client can directly talk to data server. I was hoping config server is moved to the client.

Need a sentinel (哨兵) for monitoring and switch between master and slave

(a) Request does not need to go through master

(b) how to do master failover

I think the config server is the master

Config server is a single point of failure?

Client can cache the configuration, reduces load on config server

Switch over

Config server can be a cluster

Primary DB - etcd . lots of components and assumptions

ADG is sharding. B, E, H is backup

Primary is already sharding

Write volume is big: how to scale: sharding

Metrics, master down, and slave is up, need to alert and monitor

Configuration cache can go to the client side

When we need to add machine, we can predict the growth - can put it to the client side

Config/data server heartbeat

mapping/hashing

Outdated mapping. Go back to config service to get the up-to-date mapping

Audience:

Request manager can be merged with config server?

Request manager is fine. You may not be able to add logic to load balancer

====