Design Google Drive

Topic: Design Google Drive

Interviewer: ken

Interviewee: 艾宇杰

Level: L4 (Experienced Individual Contributor)

Interview Notes:

Functional requirements

Support directories

Upload files

Download files

File sync - multiple clients

Out of scope

Permission

File sharing

Notification

Scale

File < 1 GB

50M signed up users. 10M DAU

10 GB free space

Upload 2 files per day. Average size 500 KB

1:1 read to write ratio

Average 1000 files per user

Estimates:

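Storage

50M users * 10 GB = 500 PB (upper bound, if every signed-up user fills the free quota)

QPS

10M DAU * 2 uploads per day = 20M uploads per day

QPS for upload: 10 million * 2 uploads / 10^5 seconds per day = 100 * 2 = 200 QPS (rounding 86,400 seconds up to 10^5)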

Peak QPS = 200 QPS * 5 = 1000 QPS

Metadata DB storage

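10 GB / 500 KB = 20,000 files per user at most (quota divided by average file size)

1000 files * 10M users = 10 billion file entries (counting only the 10M DAU; all 50M signed-up users would give 50 billion entries)

Metadata fields per file: file path, S3 path, user, date

10 billion files * 200 bytes of metadata per file = 2 TB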

===

Bandwidth

200 uploads per second * 0.5 MB per upload = 100 MB per second
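The estimates above can be re-derived in a few lines. This is only a sketch that plugs in the numbers from the notes; the constant names are made up for illustration, and decimal units (1 PB = 10^15 bytes) are assumed.

```python
# Back-of-envelope estimates using the figures from the notes.
# All constant names are illustrative; decimal units are assumed.
SIGNED_UP_USERS = 50_000_000
DAU = 10_000_000
QUOTA_BYTES = 10 * 10**9            # 10 GB free space per user
UPLOADS_PER_USER_PER_DAY = 2
AVG_FILE_BYTES = 500 * 10**3        # 500 KB average file
FILES_PER_USER = 1000
METADATA_BYTES_PER_FILE = 200
SECONDS_PER_DAY = 24 * 3600         # 86,400; the notes round this to 10^5
PEAK_FACTOR = 5

max_storage_bytes = SIGNED_UP_USERS * QUOTA_BYTES
uploads_per_day = DAU * UPLOADS_PER_USER_PER_DAY
upload_qps_rounded = uploads_per_day / 10**5          # the quick estimate
upload_qps_exact = uploads_per_day / SECONDS_PER_DAY  # the precise figure
peak_qps = upload_qps_rounded * PEAK_FACTOR
file_entries = FILES_PER_USER * DAU
metadata_bytes = file_entries * METADATA_BYTES_PER_FILE
upload_bandwidth = upload_qps_rounded * AVG_FILE_BYTES

print(f"max storage:      {max_storage_bytes / 10**15:.0f} PB")
print(f"upload QPS:       {upload_qps_rounded:.0f} (rounded), {upload_qps_exact:.0f} (exact)")
print(f"peak upload QPS:  {peak_qps:.0f}")
print(f"metadata storage: {metadata_bytes / 10**12:.0f} TB")
print(f"upload bandwidth: {upload_bandwidth / 10**6:.0f} MB/s")
```

The rounded and exact QPS figures differ by about 15%, which is well within the precision a capacity estimate needs.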

Non-functional requirements

Durable

Sync quickly

Minimize bandwidth

Scalable

Available

====

API

UploadFile

DownloadFile

GetFileDirectory
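To make the three calls concrete, here is a minimal in-memory sketch. The class name, method signatures, and path conventions are assumptions for illustration only; the notes name the operations but do not fix their parameters.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InMemoryDrive:
    """Toy stand-in for the service: maps absolute paths to file bytes."""

    files: dict[str, bytes] = field(default_factory=dict)

    def upload_file(self, path: str, data: bytes) -> None:
        # UploadFile: store (or overwrite) the content at the given path.
        self.files[path] = data

    def download_file(self, path: str) -> bytes:
        # DownloadFile: return the stored bytes; raises KeyError if missing.
        return self.files[path]

    def get_file_directory(self, directory: str) -> list[str]:
        # GetFileDirectory: list the immediate children of a directory.
        prefix = directory.rstrip("/") + "/"
        children = set()
        for path in self.files:
            if path.startswith(prefix):
                # Keep only the first path component below the directory.
                children.add(path[len(prefix):].split("/", 1)[0])
        return sorted(children)


drive = InMemoryDrive()
drive.upload_file("/docs/a.txt", b"hello")
drive.upload_file("/docs/sub/b.txt", b"world")
drive.upload_file("/c.txt", b"!")
print(drive.get_file_directory("/docs"))
print(drive.download_file("/docs/a.txt"))
```

In a real design the file bytes would live in blob storage and only the path-to-object mapping would sit in the metadata database.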

[31:25]

Pull / push new changes

[Reversed arrow?]

[25:58]

User

Device

File

File_id

Block_id

[19:37]

10M * 2 / 3600 / 24 ≈ 231 write QPS (exact figure behind the 200 QPS estimate)

Support of transaction

Relation / tables
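One possible relational layout for the entities listed above (User, Device, File, Block), sketched with SQLite so it runs anywhere. Table and column names are assumptions, not the schema from the interview. The `add_file` helper shows why transaction support matters: the file row and its block rows are written together or not at all.

```python
import sqlite3

# Hypothetical schema: one row per user, device, file, and file block.
SCHEMA = """
CREATE TABLE users   (user_id   INTEGER PRIMARY KEY,
                      name      TEXT NOT NULL);
CREATE TABLE devices (device_id INTEGER PRIMARY KEY,
                      user_id   INTEGER NOT NULL REFERENCES users(user_id));
CREATE TABLE files   (file_id   INTEGER PRIMARY KEY,
                      user_id   INTEGER NOT NULL REFERENCES users(user_id),
                      path      TEXT NOT NULL,
                      version   INTEGER NOT NULL DEFAULT 1,
                      UNIQUE (user_id, path));
CREATE TABLE blocks  (file_id   INTEGER NOT NULL REFERENCES files(file_id),
                      seq       INTEGER NOT NULL,
                      block_id  TEXT NOT NULL,
                      PRIMARY KEY (file_id, seq));
"""


def add_file(conn, user_id, path, block_ids):
    """Insert a file and its ordered blocks in a single transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        cur = conn.execute(
            "INSERT INTO files (user_id, path) VALUES (?, ?)", (user_id, path)
        )
        file_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO blocks (file_id, seq, block_id) VALUES (?, ?, ?)",
            [(file_id, seq, block_id) for seq, block_id in enumerate(block_ids)],
        )
    return file_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (user_id, name) VALUES (1, 'alice')")
conn.commit()
file_id = add_file(conn, 1, "/docs/a.txt", ["hash-0", "hash-1"])
print(conn.execute("SELECT seq, block_id FROM blocks ORDER BY seq").fetchall())
```

If a block insert fails partway through, the `with conn:` block rolls back the file row too, so readers never see a file that points at a partial block list.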

[10:24]

How does the blob storage connect back to client?

Both ways can work

API gateway can return the URL for upload. Client connects to blob store

Blob store can connect back to the client
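The first option (the gateway hands out an upload URL and the client talks to the blob store directly) can be sketched with a signed, expiring URL. This is a toy HMAC scheme for illustration, not the actual presigning algorithm of Amazon S3 or Google Cloud Storage; the host `blob.example.com`, the secret, and both function names are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical key shared by the API gateway and the blob store.
SECRET = b"demo-secret"


def _signature(object_key: str, expires_at: int) -> str:
    # Sign the method, object key, and expiry so none can be altered.
    message = f"PUT\n{object_key}\n{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_upload_url(object_key: str, expires_at: int) -> str:
    """Gateway side: issue a URL the client can PUT to until expires_at."""
    query = urlencode(
        {"expires": expires_at, "sig": _signature(object_key, expires_at)}
    )
    return f"https://blob.example.com/{object_key}?{query}"


def verify_upload_url(url: str, now: int) -> bool:
    """Blob-store side: accept only unexpired, untampered URLs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    object_key = parsed.path.lstrip("/")
    expires_at = int(params["expires"][0])
    expected = _signature(object_key, expires_at)
    return now <= expires_at and hmac.compare_digest(expected, params["sig"][0])


url = sign_upload_url("user-1/a.txt", expires_at=1000)
print(verify_upload_url(url, now=999))
print(verify_upload_url(url, now=1001))
```

With this approach the file bytes never pass through the application servers, which only handle the small metadata and signing requests.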

[6:28]

=====

Notification back to client

Client starts and connects

Long poll to notification service

Websocket, bidirectional

File has changed

Initiate a download to the API gateway
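The long-poll variant of this flow can be sketched in-process with a condition variable: the client's request blocks until the version moves past what it last saw, or until a timeout. `NotificationService` and its methods are hypothetical names, and a real service would hold an open HTTP request rather than a thread.

```python
import threading
from typing import Optional


class NotificationService:
    """Toy long-poll server: clients wait until the version number advances."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._version = 0

    def publish_change(self) -> int:
        # Called when a file changes; wakes every waiting client.
        with self._cond:
            self._version += 1
            self._cond.notify_all()
            return self._version

    def long_poll(self, last_seen: int, timeout: float) -> Optional[int]:
        # Block until there is a version newer than last_seen, or time out.
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._version > last_seen, timeout=timeout
            )
            return self._version if changed else None


service = NotificationService()
# Simulate another device uploading a change shortly after the poll starts.
threading.Timer(0.05, service.publish_change).start()
new_version = service.long_poll(last_seen=0, timeout=1.0)
if new_version is not None:
    print(f"file changed, version {new_version}: start download via API gateway")
```

A `None` result means the poll timed out with no change, and the client simply reconnects. A WebSocket avoids the reconnect cycle at the cost of keeping a bidirectional connection open per device.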

[1:20]

Sharding
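The notes do not record which sharding scheme was discussed. As one common option, the sketch below hashes the user ID so that all of a user's metadata lands on the same shard; the shard count and function name are assumptions.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user to a shard with a stable hash of the user ID."""
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for user_id in ("user-1", "user-2", "user-3"):
    print(user_id, "-> shard", shard_for_user(user_id))
```

Plain modulo hashing moves most keys when the shard count changes; consistent hashing is the usual alternative if resharding is expected.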

=====

Highly

Points to cover:

API:

list, upload, download, uploadChunk, downloadChunk

Client notification when version updated on server. Tradeoffs of poll vs push

Database Schema

Architecture choices:

If using S3/Google Cloud Storage: does the traffic go through the application server or not?

Push vs poll for propagating changes

Storage choices:

Cache

Database: SQL (MySQL) vs NoSQL (Cassandra, eventual consistency)

File storage: Amazon S3, Google Cloud Storage, HDFS
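The uploadChunk / downloadChunk calls pair naturally with content-addressed blocks: if each chunk is identified by its hash, a client only uploads the chunks the server does not already have, which serves the "minimize bandwidth" goal. The sketch below assumes fixed-size chunks and SHA-256 IDs; the chunk size and function names are illustrative.

```python
from __future__ import annotations

import hashlib

CHUNK_SIZE = 4  # tiny for the demo; a real client might use a few MB


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the file into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Content-addressed IDs: each chunk is named by its SHA-256 hash."""
    return [hashlib.sha256(c).hexdigest() for c in split_into_chunks(data, chunk_size)]


def chunks_to_upload(
    data: bytes, known_ids: set[str], chunk_size: int = CHUNK_SIZE
) -> dict[str, bytes]:
    """Return only the chunks whose IDs the server does not already store."""
    missing = {}
    for chunk in split_into_chunks(data, chunk_size):
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in known_ids:
            missing[chunk_id] = chunk
    return missing


old_version = b"aaaabbbbcccc"
new_version = b"aaaaXXXXcccc"  # only the middle chunk changed
server_has = set(chunk_ids(old_version))
delta = chunks_to_upload(new_version, server_has)
print(f"{len(delta)} of {len(split_into_chunks(new_version))} chunks need uploading")
```

Fixed-size chunking handles in-place edits well but shifts every later chunk when bytes are inserted; content-defined chunking is the usual refinement for that case.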

Bar raiser:

Familiar with Amazon S3, Google Cloud Storage or HDFS workflow

Tiered storage to save storage cost
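Tiered storage usually comes down to a policy that moves rarely accessed files to cheaper storage classes. The sketch below is one illustrative policy based on last-access age; the tier names and the 30-day and 180-day thresholds are assumptions, not values from the interview.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real values would come from access-pattern data.
HOT_WINDOW = timedelta(days=30)
WARM_WINDOW = timedelta(days=180)


def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier from how long ago the file was last accessed."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"    # fast, most expensive storage
    if age <= WARM_WINDOW:
        return "warm"   # infrequent-access storage
    return "cold"       # archival storage, cheapest, slowest to read


now = datetime(2024, 1, 1)
for days in (1, 90, 400):
    print(days, "days since last access ->", choose_tier(now - timedelta(days=days), now))
```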

Experienced IC

Soft Skills

Requirement gathering

Discuss tradeoffs

Clear presentation

Driving interview

Hard skills

Design quality

Knowledge of existing solutions/tradeoffs

Fit into larger context of project and product lifecycle

====

GFS

Soft skill

How to communicate and show my knowledge

Hard skill

API flow

Schema

===