Metadata Delivery System
Topic: Metadata Delivery System
Interviewer: jyuan
Interviewee: Eric
Level: L5 (Senior)
System design: Public speaking / mock interview:
Mock System Design Interview Summary
Interview Overview
Date: 6/12/2022
Target level: L5
Duration: 45 minutes
Topic covered: Metadata delivery system
Drawing tool used: Excalidraw
Requirements
Design a metadata delivery system
Statements:
Customers can self-onboard their service(s) running within their infrastructure nodes to the metadata delivery system
Metadata delivery system is responsible for
delivering custom metadata to all the nodes associated with the service (custom metadata can expire based on customer configuration)
updating custom dynamic metadata periodically
Customers can force metadata delivery to a single node by making an API call to the metadata delivery system
Basic assumptions:
there is an existing API to get the metadata each time
there is an S3 object that maps each service name to all the infrastructure nodes associated with it
Start time 7:23. End time 8:08
Functional requirements
Customers have nodes (servers) in their own environment.
Customers need to deliver metadata to their servers. Example: tags, certificates, server related information. We can focus on certificate delivery.
Once a certificate renews, the system should deliver it to the customer’s node
Every node has a different certificate
The system should deliver the certificate before it expires
You can assume the certificate content is stored in a system like S3. We can retrieve the content from that system.
The product is an internal product. Customers can onboard their service.
They can create a new service (service name and expiration time).
The system can retrieve the nodes from S3.
Assumption:
The service-to-node mapping already exists
Data size: 10KB
100k services, > 10M nodes
Certificate expiry: customizable (assume about 1 hour in most cases)
Non functional requirements
Reliable
High availability
Have to deliver at least once before the certificate expires
Traffic estimate:
10M nodes * 24 deliveries a day (one per ~1 hour expiry) = 2.4 * 10^8 deliveries a day
2.4 * 10^8 / 86400 = ~2,800 deliveries per second
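As a cross-check, the estimate can be recomputed from the stated assumptions (10M nodes, 10KB per payload, roughly 1 hour expiry). This is a back-of-the-envelope sketch only: it assumes exactly one delivery per node per expiry window and ignores retries and forced refreshes. The constant and function names are illustrative.

```python
NODES = 10_000_000          # "> 10M nodes" from the requirements
PAYLOAD_BYTES = 10 * 1000   # "Data size: 10KB"
EXPIRY_SECONDS = 3600       # "mostly about 1 hour"
SECONDS_PER_DAY = 86_400


def estimate(nodes=NODES, expiry_s=EXPIRY_SECONDS, payload=PAYLOAD_BYTES):
    """Return (deliveries per day, deliveries per second, bytes per second)."""
    # Assumption: one delivery per node per expiry window, spread evenly.
    deliveries_per_day = nodes * SECONDS_PER_DAY / expiry_s
    deliveries_per_second = nodes / expiry_s
    bytes_per_second = deliveries_per_second * payload
    return deliveries_per_day, deliveries_per_second, bytes_per_second


per_day, per_sec, bps = estimate()
print(f"{per_day:.2e} deliveries/day, {per_sec:.0f}/s, {bps / 1e6:.1f} MB/s")
```

Under these assumptions the average load is roughly 2,800 deliveries per second and about 28 MB/s of payload. Peaks would be higher if many certificates renew at the same moment.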
System Design
External APIs
API Example:
GetMetaData(node_id, serviceName_id)
Register(serviceName_id, s3_file_link)
Refresh(serviceName_id, new_node_id)
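One possible way to read the three calls above is sketched below as an in-memory Python class. Only the call names and parameters come from the notes. The `Service` record, the `fetch_metadata` hook (standing in for the existing "get metadata" API), and the `forced_nodes` list are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    service_name_id: str
    s3_file_link: str  # S3 object holding the service -> nodes mapping
    forced_nodes: list = field(default_factory=list)  # hypothetical: pending forced deliveries


class MetadataDeliveryApi:
    """In-memory stand-in for the three calls listed above."""

    def __init__(self, fetch_metadata):
        # fetch_metadata is a hypothetical hook for the existing metadata API.
        self._fetch_metadata = fetch_metadata
        self._services = {}

    def register(self, service_name_id, s3_file_link):
        """Onboard a service by recording where its node mapping lives."""
        self._services[service_name_id] = Service(service_name_id, s3_file_link)

    def get_metadata(self, node_id, service_name_id):
        """Return the metadata for one node of a registered service."""
        if service_name_id not in self._services:
            raise KeyError(f"unknown service: {service_name_id}")
        return self._fetch_metadata(node_id, service_name_id)

    def refresh(self, service_name_id, new_node_id):
        """Record a forced delivery request for a single node."""
        self._services[service_name_id].forced_nodes.append(new_node_id)
```

A real API would likely also accept the expiration time at registration, as the feedback later in these notes points out.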
Architecture design
Add delivery service
Add alarm for system outages
How to handle hardware failures?
The customers should fix the hardware failures
Q: What if the customer node has network issues, and cannot be reached for 10 seconds. Meanwhile the delivery service also fails.
A:
We should require an acknowledgment for each successful delivery.
The API server can detect the failure, pick another delivery service, and retry delivering the certificate.
Q: what if the API server also fails?
A: there should be multiple API servers
Q: What if the whole system is down, e.g. all nodes of the metadata delivery system fail?
A: we can deploy to multiple regions
Q: how do you get the work to retry if we don’t store any data?
A: add a database storage to store necessary work
Q: What failure cases can you think about?
A: the clock may be wrong on the node.
Q: What if all nodes in the system fail?
A: We need to raise an alarm to the on-call; this requires manual recovery. We would bring the system back up gradually so that it can absorb the peak traffic.
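The answers above (acknowledge each delivery, fail over to another delivery service, store pending work so it can be retried) can be combined in a small sketch. Everything here is hypothetical: `WorkStore` is an in-memory stand-in for the database, each delivery service is modeled as a callable that returns True on acknowledgment, and `max_attempts` is an arbitrary cap.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    node_id: str
    payload: bytes
    acked: bool = False
    attempts: int = 0


class WorkStore:
    """In-memory stand-in for the database that holds pending work."""

    def __init__(self):
        self._items = {}

    def put(self, item):
        self._items[item.node_id] = item

    def pending(self):
        # Anything not yet acknowledged is still owed to the node.
        return [i for i in self._items.values() if not i.acked]


def deliver_pending(store, delivery_services, max_attempts=3):
    """Try each pending item against the delivery services until one acks."""
    for item in store.pending():
        for deliver in delivery_services:
            if item.attempts >= max_attempts:
                break
            item.attempts += 1
            try:
                if deliver(item.node_id, item.payload):
                    item.acked = True
                    break
            except ConnectionError:
                continue  # this delivery service is down; try the next one
```

Because unacknowledged items stay in the store, a later pass (or a different API server reading the same database) can pick them up again.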
Interviewer and Audience Feedback
Interviewer:
Soft skill
Good soft skill. Expected 15-20 minutes to clarify
Covered 4 important points
In the API design, we forgot the details from the requirements. We did not cover the onboarding of the API.
Overall I think soft skill is good
Hard skill
We may have missed some aspects of the design. I hinted that we should have a database to store customer information. We don’t have to store the node IDs, but we need to store service information.
In a large system failure, the delivery service uses synchronous calls, so the API server threads and the delivery service calls may all be occupied.
Interviewee:
Was not a strong interview
The use case is very specific. I was guessing what the key points of the interview were.
I forgot about some requirements during the design.
Regarding the synchronous calls, I didn’t think we needed to use a queue.
I think the cost of delivery is low.
Interviewer design
API server can get requests from customer:
Register request to register service name. We can save in the database.
Force get metadata request.
Metadata: there is a 3rd party API to retrieve the metadata.
We assume the certificate is the metadata. It’s a simplification of the real system.
The most important point is the periodic delivery of the metadata/certificates. The registration contains the service ID and the frequency of delivery.
Audience: The certificate should contain the expiration time.
Interviewer: the expiry is submitted via API
Audience: there are many types of metadata. Should we clarify?
Interviewer: yes.
Audience: synchronous vs async?
Interviewer: leaning toward asynchronous calls, because synchronous calls may block the worker's threads
We can use non-blocking IO in the worker.
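A minimal sketch of the non-blocking idea, using Python's `asyncio`. The `send` coroutine (the actual network call to a node), the 1 second timeout, and the `max_in_flight` limit are assumptions for illustration. The point is that a slow or unreachable node times out on its own instead of tying up a worker thread.

```python
import asyncio


async def deliver(node_id, send, timeout_s=1.0):
    """Deliver to one node; report failure instead of blocking on a slow node."""
    try:
        await asyncio.wait_for(send(node_id), timeout=timeout_s)
        return node_id, True
    except (asyncio.TimeoutError, ConnectionError):
        return node_id, False


async def deliver_all(node_ids, send, max_in_flight=100):
    """Deliver to many nodes concurrently with a cap on in-flight calls."""
    limit = asyncio.Semaphore(max_in_flight)

    async def guarded(node_id):
        async with limit:
            return await deliver(node_id, send)

    results = await asyncio.gather(*(guarded(n) for n in node_ids))
    return dict(results)  # node_id -> delivered?
```

The returned mapping tells the caller which nodes still need a retry.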
Audience: (1) How to quickly get requirements? (2) What did the interviewer expect from system failure? (3) Is it similar to a job scheduler?
Interviewer: (1) Ask more questions. I tried to scope down the question, e.g. limit it to certificates.
(2) I wanted to provide a hint that we need a database.
(3) Yes, it’s similar.
There are other follow-ups, e.g. exhausting the job thread pool.
Audience: What does S3 store?
Interviewer: Service name to node mapping. Any storage is fine; you can use database.
Audience: how to assign work to workers?
Interviewer: one worker for each service name. This may lead to hot partition.
Or we can assign different nodes to different workers.
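The two assignment options can be compared with a small hashing sketch. The hash-based partitioning, the function names, and the `service/node` key format are assumptions; the notes only say "one worker per service name" or "different nodes to different workers".

```python
import hashlib


def _bucket(key, num_workers):
    """Map a string key to a worker index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def worker_for_service(service_name, num_workers):
    # One worker per service: simple, but a service with many nodes
    # puts all of its deliveries on a single worker (hot partition).
    return _bucket(service_name, num_workers)


def worker_for_node(service_name, node_id, num_workers):
    # Per-node assignment: nodes of a large service spread across workers.
    return _bucket(f"{service_name}/{node_id}", num_workers)
```

With per-service assignment every node of a service maps to the same worker, while per-node assignment spreads a large service over the pool at the cost of each worker handling many services.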
Audience: why message queue?
Interviewer: to avoid API server being blocked.
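A rough sketch of how a queue keeps the API server from blocking: the handler only enqueues the request and returns, and a worker drains the queue separately. The names `handle_force_refresh` and `drain`, the "accepted" return value, and the in-process `queue.Queue` (standing in for a real message queue) are all assumptions.

```python
import queue


def handle_force_refresh(service_name, node_id, q):
    """API server side: enqueue the request and return immediately."""
    q.put((service_name, node_id))
    return "accepted"


def drain(q, deliver, max_items=100):
    """Worker side: deliver up to max_items queued requests, return the count."""
    done = 0
    while done < max_items:
        try:
            service_name, node_id = q.get_nowait()
        except queue.Empty:
            break
        deliver(service_name, node_id)
        q.task_done()
        done += 1
    return done
```

The API server's latency then depends only on the enqueue, not on how long the node takes to respond.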
Audience: how do we know the expiration time?
Interviewer: during registration.
Audience: how to implement expiry?
Interviewer: cron, or priority queue.
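The priority queue option could look roughly like the sketch below, built on Python's `heapq`. The `RenewalSchedule` class and the `lead_time_s` parameter (how long before expiry to deliver) are assumptions for illustration; timestamps are plain numbers in seconds.

```python
import heapq


class RenewalSchedule:
    """Min-heap of (deliver_at, node_id), ordered by the earliest deadline."""

    def __init__(self):
        self._heap = []

    def schedule(self, node_id, expires_at, lead_time_s=300):
        # Deliver lead_time_s before expiry so the node is never left
        # holding an expired certificate.
        heapq.heappush(self._heap, (expires_at - lead_time_s, node_id))

    def pop_due(self, now):
        """Remove and return every node whose delivery time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

A worker would call `pop_due` on a timer, deliver to the returned nodes, and reschedule each one with its new expiry.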
Audience: what’s the most important hard skill?
Interviewer: want to see a working design + failure case + peak hour.
Audience: asynchronous call
Interviewer: the earlier design used synchronous calls along API service -> Worker -> Node. I then needed to probe how it handles node crashes.
Audience: what is the database design?
Interviewer: the simplest design is service name + expiration time.
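A minimal sketch of that two-column design, using SQLite only because it ships with Python. The table name `services`, the column names, and storing the expiration as a duration in seconds are assumptions; any database would do.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open the service registry and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS services (
               service_name   TEXT PRIMARY KEY,
               expiry_seconds INTEGER NOT NULL CHECK (expiry_seconds > 0)
           )"""
    )
    return conn


def register_service(conn, service_name, expiry_seconds):
    """Insert a service, or overwrite its expiry if it already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO services VALUES (?, ?)",
            (service_name, expiry_seconds),
        )


def expiry_for(conn, service_name):
    """Return the configured expiry in seconds, or None if unregistered."""
    row = conn.execute(
        "SELECT expiry_seconds FROM services WHERE service_name = ?",
        (service_name,),
    ).fetchone()
    return row[0] if row else None
```

Node IDs are deliberately absent here, matching the feedback that only service information needs to be stored because the node mapping lives in S3.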
Audience: why does the worker talk to DB?
Interviewer: in most cases the worker talks to the DB. The message queue is for force-refreshing metadata.
Audience: what if a worker fails to call a node?
Interviewer: we can scale up workers, or we can use non-blocking IO to make calls.
A worker can keep retrying if the worker fails to call a node.
Audience: what if it fails many times?
Interviewer: we may throw the