TopK · commitway

Topic: TopK

Interviewer: ken

Interviewee: jumplus

Level: L4 (Experienced Individual Contributor)

Topic: TopK for YouTube

Mock System Design Interview Summary

Key Feedback

Soft skills

Clarifying requirements was the best soft skill demonstrated by the interviewee. The interviewee paused at each key junction to clarify requirements.

Hard skills

TopK for YouTube can be designed similar to a time-series database with multiple layers of database:

Redis (fast distributed cache)

periodically aggregate redis data into a relational database, e.g. at minute frequency

periodically aggregated higher frequency data into lower frequency data, e.g. hourly or daily.

The lower granularity database tables can be cleared out once the aggregation has been computed.

Interview Overview

Date:

Target level: L4 (experienced IC)

Duration: 45 minutes

Topic covered: Top K for YouTube

Drawing tool used: Whimsical

Requirements

Functional requirements

Trending systems

TopK for the last 24 hours for 300 videos

Non functional requirements

High availability,

low latency

high throughput/scalability,

Fault tolerant

Traffic:

100M users (Youtube 300M on iOS. https://www.statista.com/statistics/1252638/youtube-app-dau-worldwide/)

1 hour of video watching on average

Each video on average 5 minutes

Peak time estimate: youtube, bilibili

Read heavy

System Design

External APIs

getTopVideos(K, startTime, null)

TODO: record videos, or ingestion of data

System design diagram

Distributed design? Or single host design?

Landing page request

Call to calculate the most trending on the fly

Schema: noSQL, cassandra

Time, ListOfVideos(top K)

Date, one day

5 minutes, ListOfVideos (top K videos)

Top 200 video for 5 minutes

Database schema

Watching video:

Daemon to calculate 80% of video

Distributed message queue: schema

User Watch Count, for last few seconds

A = 3

B = 1

UserID, count (how many videos the user watched per second)

Consistent hashing by userID

CountMin

Fixed memory

Storage service -> aggregation

Aggregate count data with a single process

Interviewer and Audience Feedback

Message queue

Map reduce

Why do we need sharding for queue?

Precompute topK and save in cache

100-200

Add score for video

Drawback: the video may not be even

Estimate: we should know the number of videos

Distinguished videos

Queue 1M video => 1000 partitions => 1000 video id per

One video ID 2G cache per machine

The data needs to be kept. They need to be saved in database

Database - 1B video

Every video you need to count 5 minutes

5 minutes - aggregate 10 days

Only need to keep the 5 minute window for the past few days

DB are aggregated periodically

What database can we use?

Append.

Use the time series database

Video ID, key

Can use many different database, shard by video ID

Which process is writing into the database?

Redis cache, map reduce, spark

Put the data for every 5 minutes. Only for batch processing later.

15,000 QPS

Will not write to database 15,000 QPS

Will have a pre-aggregate to 5 minute window

15,000 QPS -> can save it in Redis

5 minutes, calculation

Redis (plus 1 plus 1)

Using Redis to do accumulation

Let’s say there are 1000 servers to serve 15,000 QPS

Redis in-memory database.

Key of the solution:

How do we shrink step-by-step

Why don’t we just have a storage in the database

If it crashes, it’s lost from memory

Redis - is fast due to C language and replica

It’s more reliable

Redis: High throughput writing, but mainly for short-term pre-processing

It still saves into a database

Do we need to go back to see every 5 minutes?

Specific time window: is a bonus

https://www.youtube.com/watch?v=kx-XDoPjoHw

Lossy count, count-min sketch

The key of this question is batch processing and stream processing

Can go deep on an algorithm

Spark streaming, I know the source code, or I design a similar system

List a few key problems

Highest click for historical

Eventually - no data loss

Fast data process

Want a slow path

Keep every click

Every click needs

Heap - new count

Throw away

Database - topK service, every 5 minutes

Asynchronous job

Can be provided by the backend

Do we have to scan the database?

N log (k)

Flink - every 10 seconds? Spark streaming?

Aggregation 5 minutes granularity

We may save data in file, and not database

Priority queue, can purge out

Add some questions for Level 5, Level 6

Materials — open to everyone, no sign-in