News and subscription

Topic: News and subscription

Interviewer: @朱彦彬

Interviewee: @旅行者(traveller2050)

Level: L5 (Senior)

Mock System Design Interview Summary

Interview Overview

Date: 7/17/2022

Target level: L5

Duration: 45 minutes

Topic covered: News and Subscription

Drawing tool used: excalidraw

Starting 6:18 - ending 7:03

Requirements

Functional requirements

News aggregator like Google News

Users can subscribe to different categories of topics

Get news feed

There are 5 news sources

Crawling 5 news sources

Frontend: serve newsfeed

Backend: crawler

Discussion:

We only need to store the link in our database, not the content

Scaling requirements

200 bytes for each news info (URL, etc) - no need to cache content

Get latest news within 5 seconds

300M daily active users

System Design

External APIs

getNewsFeed(userId)

subscribe(userId, topicList)
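A minimal sketch of how the two external APIs could look, assuming in-memory dictionaries in place of the real subscription store and news store. The class names, field names, and the 500-item default limit are illustrative choices, not part of the interview design.

```python
from dataclasses import dataclass, field


@dataclass
class NewsItem:
    news_id: int
    title: str
    url: str        # only the link is stored, not the article content
    category: str


@dataclass
class NewsService:
    # Hypothetical in-memory stand-ins for the subscription and news stores.
    subscriptions: dict[int, set[str]] = field(default_factory=dict)
    news_by_category: dict[str, list[NewsItem]] = field(default_factory=dict)

    def subscribe(self, user_id: int, topic_list: list[str]) -> None:
        """subscribe(userId, topicList): add topics to the user's subscriptions."""
        self.subscriptions.setdefault(user_id, set()).update(topic_list)

    def get_news_feed(self, user_id: int, limit: int = 500) -> list[NewsItem]:
        """getNewsFeed(userId): return items from the user's subscribed topics."""
        feed: list[NewsItem] = []
        for topic in sorted(self.subscriptions.get(user_id, set())):
            feed.extend(self.news_by_category.get(topic, []))
        return feed[:limit]


svc = NewsService()
svc.news_by_category["finance"] = [
    NewsItem(1, "Markets", "https://example.com/1", "finance")
]
svc.subscribe(7, ["finance"])
print([item.title for item in svc.get_news_feed(7)])
```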

System design

We will pre-generate feed for all users

Now add the crawler

Now add ranking service

Q: Hidden source is low priority

A: let me remove

Add subscription service, users can subscribe to different categories of news

The above completes high level

Next let me focus on the ranking service.

Interviewee: what’s the scale of the service?

Interviewer: 300M daily active users

Interviewee:

300M DAU

QPS: 300M * 10 times a day / seconds per day; peak = 3 * average

Storage 300M * 10

Cache (500 news, userId, list of 500 news IDs)

We need a lot of cache because QPS is very high

First look up the relevant news ID, then look up separately the title and link

Cache for hot news

Tradeoff: we can either

Return a list of news IDs and let the client retrieve each news item separately, but this adds too much latency; or

Return the news IDs together with the news title and link in one request. The request may be slower, but the overall experience is better.
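The second option (IDs plus title and link in one response) could be sketched as below. This assumes two plain dictionaries standing in for the per-user feed cache and the hot-news cache; the function name, `page_size`, and the field names are hypothetical.

```python
def get_news_feed_hydrated(user_id, feed_cache, news_cache, page_size=20):
    """Return one page of the feed with title and link already attached.

    feed_cache: user_id -> list of news IDs (pre-generated feed)
    news_cache: news_id -> {"title": ..., "url": ...}
    """
    # Step 1: look up the relevant news IDs for this user.
    news_ids = feed_cache.get(user_id, [])[:page_size]
    # Step 2: look up title and link for each ID, server side, so the
    # client needs a single round trip. IDs missing from the news cache
    # (for example, purged items) are skipped.
    return [
        {"news_id": news_id, **news_cache[news_id]}
        for news_id in news_ids
        if news_id in news_cache
    ]


feed_cache = {1: [10, 11, 12]}
news_cache = {
    10: {"title": "A", "url": "https://example.com/a"},
    12: {"title": "C", "url": "https://example.com/c"},
}
print(get_news_feed_hydrated(1, feed_cache, news_cache))
```

In a real deployment the second step would likely be a batched cache read rather than a per-ID lookup, but the shape of the response stays the same.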

Feed section

300M DAU
QPS: 300M * 10 times a day / secs per day; peak = 3 * average
cache (500 news: userId, list of 500 news IDs)
cache for hot news
DB: NoSQL, Cassandra
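The back-of-envelope numbers in these notes can be worked out as follows. Two values are assumptions rather than figures from the interview: the peak is taken to be 3 times the average, and each news ID in the per-user feed cache is taken to be 8 bytes.

```python
DAU = 300_000_000                 # daily active users, from the requirements
REQUESTS_PER_USER_PER_DAY = 10    # from the notes
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400
PEAK_FACTOR = 3                   # assumption: peak = 3 * average
FEED_LENGTH = 500                 # news IDs cached per user, from the notes
NEWS_ID_BYTES = 8                 # assumption: 64-bit news IDs

average_qps = DAU * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_qps = PEAK_FACTOR * average_qps
feed_cache_bytes = DAU * FEED_LENGTH * NEWS_ID_BYTES

print(f"average QPS: {average_qps:,.0f}")                  # about 34,722
print(f"peak QPS:    {peak_qps:,.0f}")                     # about 104,167
print(f"feed cache:  {feed_cache_bytes / 1e12:.1f} TB")    # 1.2 TB
```

Under these assumptions the read path sees roughly 35K requests per second on average and around 100K at peak, and a fully pre-generated feed cache for every daily active user is on the order of a terabyte before replication.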

Ranking

core
data input: source, news, subscription, user, category

take the top news
consider the time + user preference
like, comment, share, activity

We can use a machine learning model to generate 500 news items in the feed for each user.

Q: what do you consider as the top news?

A: we can base it on the crawling information

Interviewer and Audience Feedback

Audience score

Interviewer

There are lots of possible questions. I wanted to focus on the category.

The interviewee didn’t cover it sufficiently.

Showed good knowledge of system design.

Suggestion: try to communicate with the interviewer to find which aspects are important

Requirement gathering:

Non-functional: we can cover the scaling factor and system quality (high availability, low latency) early on.

I wanted to cover urgent news. How do we optimize for this?

Can go into more depth on the database. It’s quite important.

Interviewee: I wasn’t sure which aspect of categories you wanted to hear about.

Interviewer: the first time you use a news app, it usually asks which categories you are interested in, such as finance, politics, etc.

We can do both pull and push. We can have a hybrid.

They are different because we need to support categories.

The categories are shared by many people.
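Because the categories are shared by many people, one option is to keep a single feed per category and merge the subscribed categories at read time (the pull side of the hybrid). A minimal sketch, assuming each category feed is a list of `(published_ts, news_id)` tuples already sorted newest first; the function and variable names are hypothetical.

```python
import heapq
from itertools import islice


def build_feed(subscribed, category_feeds, limit=500):
    """Merge the per-category feeds of a user's subscriptions, newest first.

    category_feeds: category -> list of (published_ts, news_id),
                    each list sorted by published_ts descending.
    """
    lists = [category_feeds.get(category, []) for category in subscribed]
    # heapq.merge with reverse=True expects inputs sorted largest first
    # and yields a lazily merged stream in the same order.
    merged = heapq.merge(*lists, key=lambda item: item[0], reverse=True)
    return [news_id for _, news_id in islice(merged, limit)]


category_feeds = {
    "finance": [(30, "f2"), (10, "f1")],
    "politics": [(20, "p1")],
    "sports": [(40, "s1")],
}
print(build_feed(["finance", "politics"], category_feeds))
```

With only about 20 categories, this stores 20 shared lists instead of one pre-generated list per user, at the cost of a small merge on each read.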

Interviewee: personalization

Interviewer: I didn’t want to cover it.

Interviewee self feedback

I realized pushing wastes too much time.

My design was good for personalized news.

However, if the categories are shared, then my design is over-engineered.

I am not familiar with ranking.

Crawler: similar to a standard web crawler design, but if we dive deep we will probably exceed the time.

Should dedupe and blacklist.
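Deduplication and blacklisting in the crawler could be sketched as a filter applied to every discovered link. The normalization rules here (lower-case scheme and host, drop the fragment, trim a trailing slash) and the blacklisted host are assumptions chosen for illustration; a production crawler would likely also dedupe on content.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different links compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Keep the query string (it may identify the article), drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class CrawlFilter:
    def __init__(self, blacklisted_hosts):
        self.blacklisted_hosts = set(blacklisted_hosts)
        self.seen = set()  # normalized URLs already accepted

    def accept(self, url):
        """Return True if the URL is new and not from a blacklisted host."""
        normalized = normalize_url(url)
        if urlsplit(normalized).hostname in self.blacklisted_hosts:
            return False
        if normalized in self.seen:
            return False
        self.seen.add(normalized)
        return True


crawl_filter = CrawlFilter({"spam.example.com"})  # hypothetical blacklist
print(crawl_filter.accept("https://News.Example.com/a/#top"))  # new link
print(crawl_filter.accept("https://news.example.com/a"))       # duplicate
print(crawl_filter.accept("https://spam.example.com/x"))       # blacklisted
```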

Audience Feedback

Interviewer: API is good

It would be good to show the data flow for each API.

Can adjust the order based on the API design

Interviewee: feels too much time pressure.

There are many subsystems.

Probably can expand the design ASAP instead of multi-stage drilling down

This one is different from Facebook/Twitter because the crawler constructs the feed

Hard skill

Audience: Crawling is time consuming

Interviewer: Google is an aggregator, so crawling is the right way

Audience: Is there a partnership between the aggregator and the news providers?

Interviewee: Google has 20,000 news sources. The majority are crawled; a small portion is pushed.

Interviewer: we can cover both types of sources, pushed and pulled

Interviewee: I should clarify with the interviewer

Interviewee: do we have a requirement for a news alert?

Interviewer: You should ask me that question.

Audience: crawling usually has long latency.

Interviewer: not core requirement

Interviewer: common types of feeds: Facebook/Twitter user-submitted content vs. news aggregator.

You don’t need to store old news. We can purge old news.

Front-page optimization

Pull and push hybrid

Only 20 categories

Important topics: use media authority as the weight. It can be pre-defined

Clickable/likable: more difficult to implement

Manually predefine weights for ranking within category
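Ranking within a category with predefined weights could look like the sketch below, which combines a per-source media authority weight with a recency decay (the notes mention considering time). The source names, the weights, the default weight, and the 6-hour half-life are all assumptions for illustration.

```python
# Hypothetical, manually predefined media authority weights.
SOURCE_AUTHORITY = {"source_a": 1.0, "source_b": 0.6}
DEFAULT_AUTHORITY = 0.1   # assumption: weight for unknown sources
HALF_LIFE_HOURS = 6.0     # assumption: score halves every 6 hours


def score(source, age_hours):
    """Authority weight multiplied by an exponential recency decay."""
    authority = SOURCE_AUTHORITY.get(source, DEFAULT_AUTHORITY)
    recency = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return authority * recency


def rank_category(items, top_n=100):
    """Return the top_n items of one category, highest score first."""
    return sorted(
        items,
        key=lambda item: score(item["source"], item["age_hours"]),
        reverse=True,
    )[:top_n]


items = [
    {"news_id": "a", "source": "source_a", "age_hours": 6.0},
    {"news_id": "b", "source": "source_b", "age_hours": 0.0},
    {"news_id": "c", "source": "source_a", "age_hours": 0.0},
]
print([item["news_id"] for item in rank_category(items)])
```

Because the weights are static and there are only about 20 categories, the ranked list per category can be recomputed periodically and cached, which also gives a reasonable front page for a cold-start user.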

Optimize the front page to load within xxx ms

Pull: regular news

Push service: urgent news

Long polling, WebSocket

You can monitor top 100 news

A simplified solution is to push whenever there is a new topic within the category
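The simplified push rule can be sketched as a fan-out by category. This assumes an in-memory outbox per user in place of real long-polling or WebSocket connections; the class and method names are hypothetical.

```python
from collections import defaultdict


class PushService:
    def __init__(self):
        self.subscribers = defaultdict(set)  # category -> set of user IDs
        self.outbox = defaultdict(list)      # user ID -> pending news IDs

    def subscribe(self, user_id, category):
        self.subscribers[category].add(user_id)

    def on_new_item(self, category, news_id):
        """Push a new item to every subscriber of its category.

        Returns the number of users notified. No read-state check is done,
        so an item may be pushed even if the user has already seen it.
        """
        users = self.subscribers.get(category, set())
        for user_id in users:
            self.outbox[user_id].append(news_id)
        return len(users)


push = PushService()
push.subscribe(1, "finance")
push.subscribe(2, "finance")
push.subscribe(3, "sports")
print(push.on_new_item("finance", "n1"))
```

In the design discussed here, the outbox would be replaced by whatever delivers urgent news to connected clients, while regular news stays on the pull path.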

2 parts: crawler, feed

Interviewer: I wanted to hear about ranking

Two important topics: cold start and front page

Audience: Is there blending between the top news service, the news push service, and the news first page service?

Interviewer: no blending. To simplify, we can still push news even if the user has read it.

Interviewee: fan out to cache?

Interviewer: fan out to in-memory DB

Interviewee: What happens when the in-memory DB crashes?

Interviewer: you can still build HA for in-memory DB

Audience: how deep should we go for personalization?

Interviewer: I initially tried to simplify, but if we want to go deep we can use clickstream and big data.

Interviewer: soft skill: use the right graph library. Plugin for excalidraw.