Newsfeed Real Time Comments

Topic: Newsfeed Real Time Comments

Interviewer: Shawn

Interviewee: Anna

Level: L5 (Senior)

Additional Resources:


Mock System Design Interview Summary

Interview Overview

Date: 1/30/2022

Target level: L5

Duration: 45 minutes

Topic covered: Realtime news feed comments

Drawing tool used: whimsical

Vote for future interview topics:

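Career Advancement Club event, Thursday 2/3/2022:

“Many members have told us they would like the Career Advancement Club to help them polish their resumes, so the club will hold a talk on resume preparation on the evening of Thursday, February 3. Please use the link below to sign up and upload your resume. To protect your privacy, you may hide personal information such as your name, phone number, and address in the uploaded resume.”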

https://forms.gle/jYZykQCQksjEj2P97

Requirements

Functional requirements

Design live comment, real time

User can see comments/posts in real time

Real time: latency as low as possible (target: 100 ms)

Any type of post, text or video, but let’s focus on comments

Can only friends comment on your posts?

Focus on the post you can see. When people

Focus on real time delivery

Deprioritize permission of comments

Scale estimate:

How many posts do we have?

100M posts

10s of millions of posts every second

Any posts can receive comments

100k/s comments can be delivered to potential readers

Every user can post x comments per second, but it’s out of scope

1M daily active users

Latency is the top priority

Don’t worry about calculation of scale too much

Non functional requirements

System Design

External APIs

Push model

Pull model

Push - can keep sending the messages

Pull model - will cause delays in message delivery

Experienced with Kafka as message queue

Q from interviewer: Why do you need to persist comments?

A from interviewee: some scenarios are not real time. E.g. if you click on an old post, you should still see its comments

Each comment will be produced into one topic

What counts as a topic? Each post can count as one topic.

This may require 10M message queues

Audience chat: reference LinkedIn design https://www.infoq.com/presentations/linkedin-play-akka-distributed-systems/?useSponsorshipSuggestions=true

Consumer: subscribes only to the topics of their own posts, not every topic

Q from interviewer: why only the consumer’s own topic?

A user may be looking at one comment

Scroll down

The comments may grow

Subscribe to the topics inside the user’s newsfeed window.

Size 5-10 posts / topics

Not every topic

Once a post is scrolled out of view, the consumer can unsubscribe from that topic

Q: What are the tradeoffs when a consumer subscribes to many topics vs fewer topics?
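A minimal sketch of the scrolling-window idea above, assuming the client tracks the visible post IDs as a plain list. The function name `diff_window` and the post IDs are hypothetical; the notes only say that the consumer subscribes to the 5-10 posts in its window and unsubscribes once a post leaves it.

```python
def diff_window(old_window, new_window):
    """Return (to_subscribe, to_unsubscribe) when the visible posts change."""
    old, new = set(old_window), set(new_window)
    return sorted(new - old), sorted(old - new)


# The user scrolls down by two posts: p1 and p2 leave the window, p6 and p7 enter.
before = ["p1", "p2", "p3", "p4", "p5"]
after = ["p3", "p4", "p5", "p6", "p7"]

to_subscribe, to_unsubscribe = diff_window(before, after)
print(to_subscribe)    # ['p6', 'p7']
print(to_unsubscribe)  # ['p1', 'p2']
```

Sending only the difference keeps the subscription traffic proportional to how far the user scrolled, not to the window size.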

Q: message queue - cannot directly push the message to the mobile client

A: Add a frontend service for the comment receiver. It consumes the message queue and translates each message to the right format

1M users: how do we fan out, for celebrity posts?

Let’s say there are millions of users looking at the same post.

How do we fan out

How does frontend service and comment receiver communicate?

A: CDN service can help deliver the content to the user

Q: CDN is not the push model

How do we let frontend service push the message to the user

Websocket can be a solution

Q: There are many frontend services and many comment receivers. How do we know which frontend service serves which comment receiver?

A: sharding, partition

Each comment receiver sends the frontend service the list of postIDs

Q: How does the frontend service know which post ID to send to which receiver?

A: It is dynamic. The user keeps on scrolling. The receiver may have a scrolling window. Let the frontend service know “here are the 10 comments I am going to read”

Q: The list of 5-10 postIDs keeps on changing. How does the frontend know the up-to-date list?

A: you can add a cache to the frontend service.

Cache / fast storage

1M posts, 1M users

How does the frontend know which post to send to which comment receiver?

A: The comment receiver connects to the frontend service, and the request carries the list of posts it is interested in (5-10 posts). If there is no update, the user is still looking at these posts. If there is an update, the user has moved to a different post.
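The answer above can be sketched as an in-memory registry held by one frontend server. This is only an illustration under assumptions: the class `SubscriptionRegistry`, its method names, and the connection and post IDs are hypothetical and not part of the interview.

```python
from collections import defaultdict


class SubscriptionRegistry:
    """In-memory mapping kept by one frontend server (hypothetical sketch)."""

    def __init__(self):
        self.posts_by_conn = {}                # connection id -> set of post ids
        self.conns_by_post = defaultdict(set)  # post id -> set of connection ids

    def update(self, conn_id, post_ids):
        """Replace the list of posts a connection is interested in."""
        new = set(post_ids)
        old = self.posts_by_conn.get(conn_id, set())
        for post_id in old - new:
            self.conns_by_post[post_id].discard(conn_id)
            if not self.conns_by_post[post_id]:
                del self.conns_by_post[post_id]
        for post_id in new - old:
            self.conns_by_post[post_id].add(conn_id)
        self.posts_by_conn[conn_id] = new

    def disconnect(self, conn_id):
        """Drop every subscription held by a closed connection."""
        self.update(conn_id, [])
        self.posts_by_conn.pop(conn_id, None)

    def receivers(self, post_id):
        """Connections that should receive a new comment on post_id."""
        return sorted(self.conns_by_post.get(post_id, ()))


registry = SubscriptionRegistry()
registry.update("user_a", ["p1", "p2", "p3"])
registry.update("user_b", ["p3", "p4"])
registry.update("user_a", ["p2", "p3", "p4"])  # user_a scrolled past p1

print(registry.receivers("p3"))  # ['user_a', 'user_b']
print(registry.receivers("p1"))  # []
```

The reverse index (`conns_by_post`) is what makes delivery cheap: a new comment on a post needs one lookup instead of a scan over all connections.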

Q:how does

A: Each comment sender sends to the frontend service. How do they communicate?

Websocket, or HTTP requests. For faster communication, can use websocket

Interviewer and Audience Feedback

Audience Scores

Soft skills

Hard Skills

Interviewer Feedback

Newsfeed, livefeed, requirement gathering is difficult

Designing a low-latency system is difficult

Should meet L4 bar

L5 has higher bar

Push vs poll

How to push which comment to which users

Push which posts to which comment receiver. Need more clarification

Use of Kafka/pubsub model for push. It may work, but there are a lot of topics. Every subscriber may subscribe to all posts. In the frontend service, each user may subscribe to many topics.

Kafka is acceptable.

Requirement gathering. It is satisfactory.

Asked the interviewee not to do calculation

Need more practice

Followed the interviewer’s suggestion

Hard Skill

Interviewee:

Based on research, not very workable solution

Facebook: Push vs pull

Push is fast enough

Write locally, read globally -> don’t quite understand

Write to local region. Read is global

Facebook post

Audience: world wide readable. Write to local

Facebook did not explain clearly

Write speed is high

Every comment is read, therefore

Read heavy: should “read”

Write - write relationship, not write comment

As the user continues to scroll, the post-to-user mapping is continuously updated. Each frontend service knows which posts each user is reading, and this is updated continuously.

Currently there is a relationship between user and posts

Cache stores this relationship

Write intensive

Every friend sends a comment (like). There are a lot of writes

When many people like a post, there are a lot of broadcasts

Websocket - bidirectional client-server communication

Payload can be pushed from the server to the client (mobile app, client app)

Server-sent events (SSE, over HTTP): initiated by the server and sent to the client

Client gets the message then renders on mobile client

YouTube, linkedin did a real time system for comment (comments/like)

They have an in-memory local cache to manage the client-to-frontend relationship

Fanout is large. 1M clients. Many frontend servers.

Each frontend server hosts 1000 clients. A large set of machines to handle connections

There is a dispatcher to manage the frontend-node-to-post mapping
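A rough sketch of the two numbers and the dispatcher mentioned above. The figures (1M clients, 1000 clients per frontend server) come from the notes; the class `Dispatcher`, its methods, and the node and post IDs are hypothetical.

```python
from collections import defaultdict

CLIENTS = 1_000_000        # from the notes
CLIENTS_PER_NODE = 1_000   # from the notes

# Ceiling division: number of frontend servers needed to hold every connection.
nodes_needed = -(-CLIENTS // CLIENTS_PER_NODE)
print(nodes_needed)  # 1000


class Dispatcher:
    """Tracks which frontend nodes have at least one viewer of each post."""

    def __init__(self):
        self.nodes_by_post = defaultdict(set)

    def register(self, node_id, post_id):
        """A frontend node reports that one of its clients is viewing post_id."""
        self.nodes_by_post[post_id].add(node_id)

    def route(self, post_id):
        """Frontend nodes that must receive a new comment on post_id."""
        return sorted(self.nodes_by_post.get(post_id, ()))


dispatcher = Dispatcher()
dispatcher.register("fe-1", "post-42")
dispatcher.register("fe-2", "post-42")
dispatcher.register("fe-2", "post-7")

print(dispatcher.route("post-42"))  # ['fe-1', 'fe-2']
print(dispatcher.route("post-99"))  # []
```

With this split, the dispatcher sends a comment once per interested frontend node, and each node fans out to its own connected clients.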

“Streaming a million likes/second: Real-time interactions on Live video” Linkedin

https://www.youtube.com/watch?v=yqc3PPmHvrA

Do they send the updates to frontend nodes at different data centers?

If there are multiple masters, it’s hard to

Write locally: write to a local data center

East User comment on a post

West user comment on a post

Write locally

Somewhere they need to merge the comments

There is probably an aggregator to merge the comments

Who manages the sequence?

Writing the relation: receiver is writing locally

Push center is reading globally (who is reading this post)

Write heavy - is writing the relation

Comment writing is a different issue.

LinkedIn youtube video addresses the write of the relations

The sequence may not be the same

Comment sequence is a different question. I can first see the answer and not the question.

Facebook: you can reply to previous comment. Then there is “previous ID”

If there are no causal relations, just use timestamp
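One way to read the two notes above: a reply is never shown before the comment it replies to, and everything else falls back to timestamp order. The sketch below assumes each comment is a dict with hypothetical fields `id`, `ts`, and `previous_id`; it is not Facebook’s actual algorithm.

```python
import heapq


def order_comments(comments):
    """Order by timestamp, but never emit a reply before its parent."""
    known_ids = {c["id"] for c in comments}
    children = {}  # parent id -> replies waiting for that parent
    ready = []     # min-heap of (timestamp, id) that may be emitted now
    for c in comments:
        parent = c.get("previous_id")
        if parent is None or parent not in known_ids:
            heapq.heappush(ready, (c["ts"], c["id"]))
        else:
            children.setdefault(parent, []).append(c)

    ordered = []
    while ready:
        _, comment_id = heapq.heappop(ready)
        ordered.append(comment_id)
        # The parent is now visible, so its replies become eligible.
        for child in children.pop(comment_id, []):
            heapq.heappush(ready, (child["ts"], child["id"]))
    return ordered


# Clock skew between data centers: the answer carries an earlier timestamp
# than the question it replies to.
comments = [
    {"id": "answer", "ts": 100, "previous_id": "question"},
    {"id": "question", "ts": 105, "previous_id": None},
    {"id": "other", "ts": 101, "previous_id": None},
]
print(order_comments(comments))  # ['other', 'question', 'answer']
```

A reply whose parent has not arrived yet is treated as a top-level comment here; a real client might instead hold it back until the parent shows up.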

Distributed system, data center write, synchronization problem

In distributed system, there is an expensive solution to ensure strict ordering

But in facebook, it is probably a tradeoff to choose low latency and sacrifice consistency

Local: data center local. Not client

Frontend service, local cache. Local to the user (same area)

Global: same post’s replicated in different

A post is created

Users anywhere can see the post. Posts are visible globally

There is a new comment. The server needs to find the frontend servers that maintain the user->post mapping

Every time a new comment -> collect mapping from different data centers

Cassandra -> quorum to decide

Tolerate the error
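For the quorum note above, the usual arithmetic is: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest acknowledged write only when read acks plus write acks exceed the replica count. The function names below are hypothetical and the replica count of 3 is an assumed example, not a figure from the interview.

```python
def quorum(replicas):
    """Majority of replicas, e.g. 2 out of 3."""
    return replicas // 2 + 1


def read_overlaps_write(replicas, write_acks, read_acks):
    """True when every read set must share at least one replica with the write set."""
    return read_acks + write_acks > replicas


REPLICAS = 3  # assumed replication factor
print(quorum(REPLICAS))                                                # 2
print(read_overlaps_write(REPLICAS, quorum(REPLICAS), quorum(REPLICAS)))  # True
print(read_overlaps_write(REPLICAS, 1, 1))                             # False
```

The last case is the low-latency choice: writing and reading with a single ack each is fast, but a reader may miss the newest comment, which is the error the design tolerates.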

Write transaction is fast

Read globally - not every data center gets the latest copy

Sync between global is expensive

Comments’ server will pull from different data centers’ aggregators

Global sync is expensive

A new post -> Will map the post to related users -> then push the post to the users’ database

They don’t need to sync the post globally

If there is a writer to write a comment

Is there a global service that knows the comment has been written?

Dispatch the comment to related readers?

Writer data center -> reader data center point to point connection

Writer -> reader is 1-to-N. There is not a global service in charge of all communications

Frontend server: after crashing the user can reconnect to a different frontend server

If there is no local data center disk copy

How does a new server know which post a user has finished reading?

Interviewer: I don’t think this needs to be sticky

Data center 1: frontend server crashes

User can connect to data center 2

Which user is reading which post

Other people need to fan out comments to me

The system needs to maintain user -> post mapping

Poll vs push: need to use push to justify low latency

Reduce network cost.

If websocket is not available, use server-sent events.

Long poll? The client connects through HTTP. The server holds on to the connection with keep-alive. When the server has an answer, it sends the response back. The client then reconnects to the server right away. Similar to websocket.

The more connections each server can hold, the better.

Long poll is less efficient than websocket

Long poll is different from push vs pull. Long poll is only a connection mechanism.

Websocket is much more efficient
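The long-poll loop described above can be simulated in-process with `asyncio`, using a queue in place of the HTTP connection. This is only a sketch: the function names, the timeouts, and the comment strings are hypothetical, and no real network I/O happens.

```python
import asyncio


async def long_poll(queue, timeout):
    """Server side of one request: hold it until a message arrives or time runs out."""
    try:
        return await asyncio.wait_for(queue.get(), timeout)
    except asyncio.TimeoutError:
        return None  # empty response; the client reconnects right away


async def client(queue, rounds, timeout):
    """Client side: issue a new long-poll request as soon as the last one returns."""
    received = []
    for _ in range(rounds):
        message = await long_poll(queue, timeout)
        if message is not None:
            received.append(message)
    return received


async def demo():
    queue = asyncio.Queue()

    async def commenter():
        await asyncio.sleep(0.01)
        await queue.put("first comment")
        await asyncio.sleep(0.01)
        await queue.put("second comment")

    task = asyncio.create_task(commenter())
    received = await client(queue, rounds=3, timeout=0.05)
    await task
    return received


print(asyncio.run(demo()))  # ['first comment', 'second comment']
```

Every delivered message costs one full request/response cycle here, which is the overhead the notes refer to when comparing long poll with a websocket that stays open.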

Traffic is too big for continuous pull

Push vs pull may still depend on the situation

E.g. a highly popular topic. Need to push to many clients. Say you need to push to 100M people

You can use multiple queues to push the message

Hot topic: pulling is better

Push: if only between colleagues and friends, no

Pull: depends on how fast delivery needs to be

Push: fire and forget

Kafka:

consumer group

User can be partitioned. Different queues

Each server may go to kafka to get the message they need to subscribe to
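One way to avoid a queue per post while still letting each server fetch only what it needs is to hash post IDs onto a fixed number of partitions. The sketch below is an assumption, not Kafka’s own partitioner: `NUM_PARTITIONS`, `partition_for`, and `partitions_needed` are hypothetical names, and CRC32 is used only because it is stable across processes.

```python
import zlib

NUM_PARTITIONS = 64  # assumed value, not from the interview


def partition_for(post_id, num_partitions=NUM_PARTITIONS):
    """Stable mapping from a post ID to a partition number."""
    return zlib.crc32(post_id.encode("utf-8")) % num_partitions


def partitions_needed(watched_post_ids, num_partitions=NUM_PARTITIONS):
    """Partitions a frontend server must consume to cover its clients' posts."""
    return sorted({partition_for(p, num_partitions) for p in watched_post_ids})


watched = ["post-1", "post-2", "post-3"]
needed = partitions_needed(watched)
print(needed)
```

The server then filters each consumed partition down to the post IDs its clients actually watch, trading some wasted reads for a bounded number of queues.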

Kafka:

Too much fan-out, it may crash kafka

Newsfeed - only for online users

No need to support offline users

Justin Bieber: need to fan out to millions or tens of millions

We can add more machines to handle the fanout

Offline user: can pull

Online user: live chat, requires push

Realtime: we don’t need to use queue

Just keep a map in memory
