News and subscription
Topic: News and subscription
Interviewer: @朱彦彬
Interviewee: @旅行者(traveller2050)
Level: L5 (Senior)
Mock System Design Interview Summary
Interview Overview
Date: 7/17/2022
Target level: L5
Duration: 45 minutes
Topic covered: News and Subscription
Drawing tool used: excalidraw
Starting 6:18 - ending 7:03
Requirements
Functional requirements
News aggregator like Google News
Users can subscribe to different topic categories
Users can get a news feed
There are 5 news sources
Crawling 5 news sources
Frontend: serve newsfeed
Backend: crawler
Discussion:
We only need to store the link in our database, not the content
Scaling requirements
200 bytes per news item (URL, etc.); no need to cache content
Get latest news within 5 seconds
300M daily active users
System Design
External APIs
getNewsFeed(userId)
subscribe(userId, topicList)
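The two APIs above could be sketched as follows. This is a minimal in-memory illustration, not the design from the interview: the `NewsItem` fields, the `NewsService` class, the `limit` parameter, and the rule that a larger ID means newer news are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NewsItem:
    news_id: int
    title: str
    url: str        # only the link is stored, not the article content
    category: str


@dataclass
class NewsService:
    # In-memory stand-ins for the real stores (assumptions for illustration).
    subscriptions: dict[int, set[str]] = field(default_factory=dict)
    news: list[NewsItem] = field(default_factory=list)

    def subscribe(self, user_id: int, topic_list: list[str]) -> None:
        """Add the given topics to the user's subscriptions."""
        self.subscriptions.setdefault(user_id, set()).update(topic_list)

    def get_news_feed(self, user_id: int, limit: int = 20) -> list[NewsItem]:
        """Return the newest items from the user's subscribed categories."""
        topics = self.subscriptions.get(user_id, set())
        matching = [n for n in self.news if n.category in topics]
        # Assumes a larger news_id means a newer item.
        return sorted(matching, key=lambda n: n.news_id, reverse=True)[:limit]
```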
System design
We will pre-generate feed for all users
Now add the crawler
Now add ranking service
Q: The hidden source is low priority.
A: Let me remove it.
Add subscription service, users can subscribe to different categories of news
The above completes high level
Next let me focus on the ranking service.
Interviewee: what’s the scale of the service?
Interviewer: 300M daily active users
Interviewee:
300M DAU
QPS: 300M * 10 times a day / 86,400 seconds per day ≈ 35K on average; peak = 3 * average
Storage 300M * 10
Cache: 500 news items per user (userId -> list of 500 news IDs)
We need a lot of cache because QPS is very high
First look up the relevant news ID, then look up separately the title and link
Cache for hot news
Tradeoff: we can either
Return a list of news IDs and let the client retrieve each news item separately, but this adds too much latency
Return news IDs + news title and link in one request. The request may be slower, but the overall experience is better.
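The second option in the tradeoff above (look up the news IDs first, then attach the title and link before responding) could look roughly like this. The two dictionaries are hypothetical stand-ins for the per-user feed cache and the hot-news cache, and the sample entries are made up.

```python
# Hypothetical in-memory stand-ins for the two caches discussed above.
feed_cache = {"u1": [103, 101, 102]}   # userId -> ranked list of news IDs
news_cache = {
    101: {"title": "Markets rally", "url": "https://example.com/101"},
    102: {"title": "Election update", "url": "https://example.com/102"},
    103: {"title": "Storm warning", "url": "https://example.com/103"},
}


def get_news_feed(user_id: str, limit: int = 500) -> list[dict]:
    """Return IDs plus title and link in one response."""
    news_ids = feed_cache.get(user_id, [])[:limit]
    feed = []
    for news_id in news_ids:
        meta = news_cache.get(news_id)
        if meta is None:
            # Purged or not cached; a real system would fall back to the DB here.
            continue
        feed.append({"id": news_id, **meta})
    return feed
```

The client then needs a single round trip per feed load, at the cost of a larger response.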
Feed section
300M DAU
QPS: 300M * 10 times a day / seconds per day; peak = 3 * average
Cache: 500 news items per user (userId -> list of 500 news IDs)
Cache for hot news
DB: NoSQL (Cassandra)
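The back-of-the-envelope numbers above can be worked through in a few lines. The 8 bytes per news ID is an assumption (64-bit IDs) that was not stated in the interview; the other inputs come from the notes.

```python
DAU = 300_000_000
REQUESTS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 24 * 60 * 60
PEAK_FACTOR = 3

avg_qps = DAU * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_qps = PEAK_FACTOR * avg_qps

FEED_LENGTH = 500
BYTES_PER_NEWS_ID = 8   # assumption: 64-bit IDs
feed_cache_bytes = DAU * FEED_LENGTH * BYTES_PER_NEWS_ID

print(f"average QPS ~ {avg_qps:,.0f}")
print(f"peak QPS ~ {peak_qps:,.0f}")
print(f"feed cache ~ {feed_cache_bytes / 1e12:.1f} TB")
```

Under these assumptions the average is about 35K QPS, the peak about 104K QPS, and the pre-generated feed lists alone take about 1.2 TB, which is why the cache has to be sharded across many nodes.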
Ranking
core
Data inputs: source, news, subscription, user, category
Take the top news
Consider time + user preference
Likes, comments, shares, activity
We can use a machine learning model to generate 500 news items in the feed for each user.
Q: what do you consider as the top news?
A: We can base it on the crawling information
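The interview proposed a machine learning model for ranking. As a simpler stand-in, the sketch below combines the signals listed above (time, user preference, engagement) with a hand-written formula. The formula, the 6-hour half-life, the default preference of 0.1, and the dictionary field names are all assumptions for illustration.

```python
import math


def score(age_hours: float, category: str, engagement: int,
          preferences: dict[str, float], half_life_hours: float = 6.0) -> float:
    """Combine freshness, user preference, and engagement into one score."""
    freshness = 0.5 ** (age_hours / half_life_hours)   # halves every half-life
    preference = preferences.get(category, 0.1)        # small default for other categories
    popularity = math.log1p(engagement)                # dampen very large counts
    return freshness * preference * (1.0 + popularity)


def top_news(candidates: list[dict], preferences: dict[str, float],
             k: int = 500) -> list[int]:
    """Return the IDs of the k highest-scoring candidates."""
    ranked = sorted(
        candidates,
        key=lambda c: score(c["age_hours"], c["category"], c["engagement"], preferences),
        reverse=True,
    )
    return [c["id"] for c in ranked[:k]]
```

The output of `top_news` is what would be written into the per-user feed cache when feeds are pre-generated.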
Interviewer and Audience Feedback
Audience score
Interviewer
There are lots of possible questions. I wanted to focus on categories.
The interviewee didn’t cover them sufficiently.
Showed good knowledge of system design.
Suggestion: try to communicate with the interviewer to find which aspects are important
Requirement gathering:
Non-functional: we can cover the scaling factor and system quality (high availability, low latency) early on.
I wanted to cover urgent news. How do we optimize for this?
Can go into more depth on the database. It’s quite important.
Interviewee: I wasn’t sure which part of categories you wanted to hear about.
Interviewer: The first time you use a news app, it usually asks which categories you are interested in, such as finance, politics, etc.
We can do both pull and push. We can have a hybrid.
They are different because we need to support categories.
The categories are shared by many people.
Interviewee: personalization
Interviewer: I didn’t want to cover it.
Interviewee self feedback
I realized pushing wastes too much time.
My design was good for personalized news.
However, if the categories are shared, then my design is over-engineered.
I am not familiar with ranking.
Crawler: similar to a standard web crawler, but if we dive deep we will probably exceed the time.
Should dedupe and blacklist.
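The dedupe and blacklist step mentioned above could be sketched as below. The blacklist entry is a made-up host, and dropping the query string during normalization is an assumption that would wrongly merge articles on sites that identify them by query parameter.

```python
from urllib.parse import urlsplit, urlunsplit

BLACKLISTED_HOSTS = {"spam.example.com"}   # hypothetical blacklist entry


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop trailing slash, query string, and fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def filter_crawled(urls: list[str], seen: set[str]) -> list[str]:
    """Return new, non-blacklisted URLs and record them in `seen`."""
    fresh = []
    for url in urls:
        key = normalize(url)
        host = urlsplit(key).hostname
        if host in BLACKLISTED_HOSTS or key in seen:
            continue
        seen.add(key)
        fresh.append(key)
    return fresh
```

At the scale discussed, the `seen` set would live in a shared store rather than in process memory.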
Audience Feedback
Interviewer: API is good
It would be good to show the data flow for each API.
Can adjust the order based on the API design
Interviewee: felt too much time pressure.
There are many subsystems.
Could probably expand the whole design as early as possible instead of drilling down in multiple stages
This one is different from Facebook/Twitter because the crawler constructs the feed
Hard skill
Audience: Crawling is time-consuming.
Interviewer: Google is an aggregator, so crawling is the right way.
Audience: Is there a partnership between the aggregator and the news providers?
Interviewee: Google has 20,000 news sources. The majority are crawled; a small portion is pushed.
Interviewer: we can cover both types of sources, pushed and pulled
Interviewee: I should clarify with the interviewer
Interviewee: do we have a requirement for a news alert?
Interviewer: You should ask me that question.
Audience: Crawling usually has long latency.
Interviewer: not core requirement
Interviewer: There are two common types of feeds: user-submitted content (Facebook/Twitter) vs. news aggregators.
You don’t need to store old news. We can purge old news.
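Purging old news could be as simple as the periodic sweep sketched below. The one-week retention window and the `crawled_at` field name are assumptions. With Cassandra, a per-row TTL set at write time would likely replace this loop.

```python
import time
from typing import Optional

RETENTION_SECONDS = 7 * 24 * 60 * 60   # assumption: keep one week of news


def purge_old_news(store: dict[int, dict], now: Optional[float] = None) -> int:
    """Delete items older than the retention window; return how many were removed."""
    now = time.time() if now is None else now
    expired = [news_id for news_id, item in store.items()
               if now - item["crawled_at"] > RETENTION_SECONDS]
    for news_id in expired:
        del store[news_id]
    return len(expired)
```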
Front-page optimization
Pull and push hybrid
Only 20 categories
Important topics: use media authority as the weight. It can be pre-defined
Clickable/likable: more difficult to implement
Manually predefine weights for ranking within category
Optimize the frontpage to load within xxx ms
Pull: regular news
Push service: urgent news
Long polling, WebSocket
You can monitor top 100 news
A simplified solution: whenever there is a new topic within a category, we push it
2 parts: crawler, feed
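Because there are only about 20 categories and each is shared by many users, the front page can be built from a small shared top list per category instead of a per-user ranking. The sketch below assumes manually predefined source weights (the media authority idea above); the source names, the weights, and the tie-break on timestamp are made up for illustration.

```python
import heapq

# Hypothetical, manually predefined media-authority weights.
SOURCE_WEIGHT = {"source_a": 1.0, "source_b": 0.6}
TOP_N = 100

# category -> min-heap of (weight, timestamp, news_id), at most TOP_N entries
category_top: dict[str, list[tuple[float, int, int]]] = {}


def add_news(category: str, source: str, timestamp: int, news_id: int) -> bool:
    """Insert an item into its category's top list; return True if it made the list."""
    entry = (SOURCE_WEIGHT.get(source, 0.1), timestamp, news_id)
    heap = category_top.setdefault(category, [])
    if len(heap) < TOP_N:
        heapq.heappush(heap, entry)
        return True
    if entry > heap[0]:
        heapq.heapreplace(heap, entry)   # evict the current lowest-ranked item
        return True
    return False


def front_page(categories: list[str], limit: int = 20) -> list[int]:
    """Merge the shared per-category lists for one user's subscriptions."""
    merged = [e for c in categories for e in category_top.get(c, [])]
    return [news_id for _, _, news_id in heapq.nlargest(limit, merged)]
```

A `True` result from `add_news` is a natural trigger for the push path, since it means the category's top list changed.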
Interviewer: I wanted to listen to ranking
Two important topics: cold start and the front page
Audience: Is there blending between the top news service, the news push service, and the news front page service?
Interviewer: no blending. To simplify, we can still push news even if the user has read it.
Interviewee: fan out to cache?
Interviewer: fan out to in-memory DB
Interviewee: What happens when the in-memory DB crashes?
Interviewer: you can still build HA for in-memory DB
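The fan-out to an in-memory store discussed above could be sketched like this. The dictionaries stand in for the in-memory DB, and the 500-item feed length is reused from the earlier cache estimate. Following the "no blending" simplification, an item is pushed even if the user has already read it.

```python
from collections import defaultdict, deque

FEED_LENGTH = 500

# Hypothetical in-memory stand-ins: category -> subscribers, user -> feed.
subscribers: dict[str, set[str]] = defaultdict(set)
feeds: dict[str, deque] = defaultdict(lambda: deque(maxlen=FEED_LENGTH))


def subscribe(user_id: str, category: str) -> None:
    """Register a user as a subscriber of a category."""
    subscribers[category].add(user_id)


def push_urgent(category: str, news_id: int) -> int:
    """Fan an urgent item out to every subscriber's feed; return the fan-out count."""
    targets = subscribers.get(category, set())
    for user_id in targets:
        # Newest first; the oldest entry drops off once the feed is full.
        feeds[user_id].appendleft(news_id)
    return len(targets)
```

With 300M users, the loop over subscribers would be split across workers, and the feed store replicated for the high availability the interviewer mentions.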
Audience: how deep should we go for personalization?
Interviewer: I initially tried to simplify, but if we want to go deep we can use clickstream data and big data processing.
Interviewer: Soft skill: use the right graph library. There is a plugin for excalidraw.