Crawler
Topic: Crawler
Interviewer: Pinglu
Interviewee: Harry
Level: L5 (Senior)
Topic
Mock System Design Interview Summary
Interview Overview
Date: 1/2/2022
Target level: L5
Duration: 1 hour
Topic covered: Crawler
Drawing tool used: Whimsical
Requirements
Functional requirements
Distributed crawler
Non functional requirements
Latency requirements?
How many websites? When do we need to crawl?
Latency is flexible. Depends on how often the content changes
1 trillion URLs
Approach: start from a workable solution. Then make it scalable.
System Design
System design diagram
URL retriever
URL downloader
Content parser
Indexer
Interviewee: How to improve:
Control the speed of crawling -
Kafka to avoid loss
Interviewer: What’s metadata:
Metadata of the website.
It says “how often your client should access my website”
How to handle duplicate URL?
In metadata database, we have URL, and update time. Crawler can look at the URL.
To avoid the duplicate content, we can use hash to see if the content has changed or not.
We can adjust the scheduler if the content doesn’t change very often.
Database schema
ID,link, status, updatetime(i.e. Last crawl time)
Interviewer: How do you shard?
Interviewee: Calculate:
1 trillion websites
11k-12k per second if we crawl once a week
Interviewer:
Update frequency:
Most frequent: 5 minutes
Least frequent 1 month
For retriever or downloader
Need 2000 machines
1k websites/second
1 / s
1k
100 process for each machine
Need at least 10 machines
1000 machines
We can shard the URL based on the URL ID
Shard by ID, spread it to 10 machines
Revised ID:
We can shard by location
then shard by URL ID
Based on that hand to different Crawlers and Downloaders
How do you handle failures?
Failure 1: When website is down, we can add retry. Downloader can add the entry back to the queue
Another failure: message queue is down
Interviewee: clarify the failures
Interviewer: retriever and downloader
Interviewee:
We can assume the process is stateless
We can create a write-ahead log
Interviewer: How to detect failure of retriever and scheduler?
Interviewee:
Master can use heart beat, and restart if the process is down
We can also use zoo keeper
Interviewer: How to avoid infinite loop? Circular URLs
Interviewee:
We can use BFS to process the files
Mark all the URL that has been crawled
Avoid duplicate URL if the URL has been crawled
Interviewer: How do we extend the system?
Interviewee: Save in DFS. We can read the content differently.
Interviewer and Audience Feedback after the Interview
Clear design
Explain each detail components
Did well to ask for feedback
Answered the questions well
Strong technical knowledge
Scalability
URL size can be divided into multiple database
Biggest components
Retriever
Downloader
Content parser
Starts from retriever
Retrieve URL from database
Retriever put URL in queue for downloader
Downloader will get the URL from queue
Get IP from DNS server/cache
Downloaded URL and content goes to a new queue
Content parser processes content
HTML and raw data is saved in DFS
Content parser can get some new URLs, and put it into the database
Recurring job: for optimizing retriever and downloader
Robot.txt decides highest frequency you can retrieve
Recurring job saves it into DB
Question from the audience
Why retriever and downloader are different?
To handle different speed between retriever and downloader
Can also handle the speed for download
Do we need to have a full table scan?
We may use SQL
Look at the status and update time
Why not push the URL directly from parser directly to queue?
If there is no database, then you may not do dedupe
URL may have priority
Can set different priority in queue
Update time, schedule time
Schedule time can be computed
Exponential backoff
If there is no change
Interviewee:
Scheduler: is for speed control
Why do we use kafka?
How do you know the message can fit into queue?
You can first write into file system. Then you can send file location into the queue
Reason to split crawler and parser?
Audience
Ask for next steps
Should estimate the machines
Latency doesn’t apply to this case
How to partition, what’s partition key
Location and ID partition
Audience
Component is not clearly described?
Sharding
Don’t wait for the interviewer?
What needs sharding?
DFS and Database sharding
Sharding is usually based on visit patterns
Related website
Related time
Audience:
Please write a SQL statement for database retrieval
How to analyze?
NLP processing, keyword, elestatic search
Need to have different queue using different priorities
Details about fast changing URL vs slow changing URL
Basic
Bonus requirement: tiered access
Feature is not well defined
Usually uses Bloomfilter
Put seed into the database
Seed in database
Check bloomfilter
How to update bloomfilter
===
What happens if we are stuck?
Come back to the high level requirements
Latency
Scalability
Durability
Reliability - fault tolerant
There may be 3 points for reliability
User journey: MVP
Scale from small number of users -> high level of users
===
From simple to complex
Increase data size
Optimize: queue
Security
Monitoring
Fast and slow changing websites?
There are 2 types of interviewers
Some have strong idea
Some don’t have strong idea
Calibration:
What are the key points?
Completeness?
Or deep dive?
Fundamental
What are the product requirements
API, queue, database, cache
Should become familiar with basic components
Clarify requirement is the most important
==
Interviewer:
Requirement: all URLs, duplicate
Audience:
Should we ask interviewer
Audience
Missing the iterations
Step-by-step narrow down
Difference between L4 and L5?
L5:
requires trim down
Duplicate
Functional, non functional, user journey
Database, message queue
L5 - how to identify key feature, bottleneck, big problems
Infrastructure: machine, round trip, milliseconds
L5:
Multiple solutions, tradeoffs
Try to find the points
Different databases (sql/nosql)
Interviewers are emotional. Try to find one impressive point.
Deep dive, how to drill down?
Usually by the need of real work of the interviewer.
In real life, there is no well defined problems
Need a workable framework
Main bottleneck is the sites does not allow you to parse very quickly
The same domain may need to distribute to different crawler machines
==
Interviewer may deep dive based on the flow of the answer, and based on your expertise.
Should write down the requirements clearly before doing the design.