Distributed Key-value Store

System DesignDatabases & StorageDistributed Systems

Topic: Distributed Key-value Store

Interviewer: shihao (沉)

Interviewee: Tom (慢)

Level: L5 (Senior)


Group QR code

Topic

Mock System Design Interview Summary

Interview Overview

Date: 3/6/2022

Target level: L5

Duration: 45 minutes

Topic covered: Key-value store

Requirements

Functional requirements

Design the next generation of distributed KV store that will support core service like payment, authentication, authorization, etc

KV store need to scale linearly

We do not need to consider multi-data center deployment for now, but may need it after 5 years

Range query, batch inserts

Non functional requirements

Functional requirements

Support payment/authentication

Scale linearly

Multi data center after 5 years

Range query and batch query/update

Non functional requirements

Latency: Users should have real time experience

Availability: 99.9%

Robustness: user data should not get lost

Capacity planning

read/write ratio = 99/1

10k write QPS

=> 990k read QPS

Key <= 1kb

Value <= 8kb

Average k+v size 2kb

=> 2kb * 10k / s

System Design

Options:

Centralized LB => db shard (Google spanner) (recommended)

Pro: quick, easy to manage

Cons: the metadata table might be the bottleneck

Distributed hash table => Dynamodb

Pros: easy to scale

Latency - need some extra jumps - add more data in a single machine

Global LB try to find the best

External APIs

getDataRequest(user_token, key, version) - res, error return a string as result

Return a string as result, version_id

Error: error_code, error message

System design

Index: type of index for one cluster

How to implement batch read / batch write

Index:

B+ tree, log structured merge tree

B+ Look up tree is O(log(N))

Better for

LSM tree very good write performance

LSM is close to append

Write performance is good

Which one will you pick?

Read is heavy.

B+ tree can perform better

LSM read is not worse than B+ tree

LSM has good write performance

How do you perform range query?

LSM

2 batch writes, batch1, batch2

Batch 1 is successful

Transactions

Most strict level of isolation

Lock solution, optimistic, pessimistic lock.

Depends: Strong consistent read, or not

How to handle: High throughput?

Every write goes to master

Every read goes to master or slave, without consistency guarantee

Every read goes to master for consistency guarantee

Master - slave

Write comes

Write to master

If want robustness - master sync data to slaves

Master return succeed

Master + slave = N

Assume we read b copies

a + b > n => strong consistency read

Master can go down

Master election algorithm, e.g. paxo, raft

Does raft require read+write > N?

Read from master => strong consistency read

Paxo is only for election of master. This is the node to accept write.

If a cluster is down, can promote from another cluster

Zookeeper key1/cluster_1 => cluster_2

If read query needs to read across 2 different clusters?

Shard based on what?

Based on key range, or based on hash value

Read from 1-1000, 1-100 in shard 1, 101-1000 in shard 2

Need to query different shards

How to enforce ACID?

There can be an abstract layer on top of the shards

We can use multi-version control

Read timestamp <= my_timestamp

Time is not accurate

Spanner - true time

Computer network time protocol

Try to get pretty close clock

Same data center - there can be an atom clock

There can be a bottleneck

Interviewer and Audience Feedback

====

===

Interviewer

Hire

Describe tradeoff

Want to see more accurate technical terms

Range query

Forces B+ tree

Consistent model

Payment, authentication ACID

K/V size

Spanner uses B+ tree

Prefers B+ tree but interviewee is more familiar with LSM tree

Self implement B+ tree

Value in memory. Cluster index vs non cluster

Read-write conflict

Write-write conflict - ran out of time

Want to have snapshot

Distributed:

Consistent hashing

Range based - need separate meta data

Master may become a bottleneck

partition

WDR vs election based?

Distributed transaction

Interviewer’s design

Within shard we can put more meta data

Range

Timestamp 2PL

Get all locks

SAGA pattern?

How to reduce pressure for single master?

Only talk to single master: Need a new tablet / partition

Meta data is only for

CRUD for master?

Only for split and merge

Distributed B+ tree

QPS 10k

At the beginning when new client is created, will need to visit the root tablet

Then client can cache the root tablet content

Out of date metadata: then it will requery

For scale up: we can move the data gradually

How to handle hot tablet?

With range query.

Config table may create hot tablet

Change master slave into quorum based

Can use quorum read write. Read more

Interviewee:

Distributed system - how to implement consistency

Range query: If it’s a list then we may be issue queries in parallel

==

Soft skill:

Let the interviewer finish asking

===

Clarifying question:

Distributed key value store

How is it a distributed system?

Designing an infrastructure, will be a bit more detail

I gave the business requirement,

2 machines or more are distributed system

Key value store - is index the same as key?

Hashtable, want to use range query

Read heavy and range query heavy

Range query

If query contains range across multiple tablets

Reads: retrieve timestamp and then read from multiple tablets

Gossip:

Leaderless: no reliable timestamp

Similar to multi-range system

Design distributed key-value store

What does “next generation” mean?

Don’t want to use dynamo design.

2PL, 2 Phase Lock, SAGA

2 phase lock

Write-write conflict

Write 1-2-3

Write 2-3-4

We can lock 1-2-3

Then write

Then unlock, then commit

2 phase commit

Coordinator

Try pre-execute

Make all component ready for commit

Then coordinator can commit everything

Saga: try-catch confirm

Can I make a spanner system?

Interviewer: we don’t need people to memorize any particular paper.

Do keys need to be sorted?

I will probably use an order

WR - can guarantee strong consistency

Why is range query a transaction?

Single record is not a transaction

Split-merge

==

Distributed

Cache

Consistent

How to scale

Requirement gathering

API not important

Single node - indexing, business logic

Distributed - sharding, replication

==

Distributed transaction

==

2 phase commit is a basic protocol

Strong consistency

==

Can we use an existing system

Related to workflow

Everyone has different question