Metadata Delivery System

Topic: Metadata Delivery System

Interviewer: jyuan

Interviewee: Eric

Level: L5 (Senior)


Mock System Design Interview Summary

Interview Overview

Date: 6/12/2022

Target level: L5

Duration: 45 minutes

Topic covered: Metadata delivery system

Drawing tool used: Excalidraw

Requirements

Design a metadata delivery system

Statements:

  1. Customers can self-onboard their service(s) running within their infrastructure nodes to the metadata delivery system

  2. Metadata delivery system is responsible for:

     a. delivering custom metadata to all the nodes associated with the service (custom metadata can expire based on customer configuration)

     b. updating custom dynamic metadata periodically

  3. Customers can force metadata delivery to a single node by making an API call to the metadata delivery system

Basic assumptions:

  1. there is an existing API to get the metadata each time

  2. there is an S3 object that maps each service name to all the infrastructure nodes associated with it

Start time 7:23. End time 8:08

Functional requirements

Customers have nodes: servers in their own environment.

Customers need to deliver metadata to their servers. Example: tags, certificates, server related information. We can focus on certificate delivery.

Once a cert renews, the system should deliver it to the customer’s node

Every node has a different certificate

The system should deliver the certificate before it expires

You can assume the certificate content is stored in a system like S3. We can retrieve the content from that system.

The product is an internal product. Customers can onboard their service.

They can create a new service (service name and expiration time).

The system can retrieve the list of nodes for a service from S3.

Assumption:

The service-to-node mapping already exists; building it is out of scope

Data size: 10KB

100k services, > 10M nodes

Certificate expiry: customizable (assume mostly about 1 hour)

Non functional requirements

Reliable

High availability

Have to deliver at least once before the certificate expires

Traffic estimate:

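Assuming each of the 10M nodes needs one delivery per ~1 hour expiry window:

10M / 3600 ≈ 2,800 QPS on average (10M * 24 = 2.4 * 10^8 deliveries a day)

The estimate written down during the interview (5 * 10^5 a day, 20 QPS) does not follow from these numbers.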
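The traffic estimate can be re-derived from the stated assumptions. The sketch below assumes exactly one delivery per node per expiry window (the notes only require "at least once before the certificate expires") and reads 10KB as 10,000 bytes; both are assumptions for illustration. Under them the average comes out near 2,800 deliveries per second, considerably higher than the figure written down during the interview.

```python
NODES = 10_000_000            # "> 10M nodes" from the assumptions
CERT_SIZE_BYTES = 10 * 1000   # "Data size: 10KB" (assumed to be per delivery)
EXPIRY_SECONDS = 3600         # "mostly about 1 hour"


def delivery_estimate(nodes, expiry_seconds, size_bytes):
    """Average load if every node gets one delivery per expiry window."""
    qps = nodes / expiry_seconds
    per_day = qps * 86_400
    bandwidth_mb_s = qps * size_bytes / 1_000_000
    return qps, per_day, bandwidth_mb_s


qps, per_day, mb_s = delivery_estimate(NODES, EXPIRY_SECONDS, CERT_SIZE_BYTES)
print(f"{qps:,.0f} deliveries/s, {per_day:,.0f} deliveries/day, {mb_s:.1f} MB/s")
```

This is an average; renewals that cluster at the same time, or a backlog after an outage, would push the peak higher.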

System Design

External APIs

API Example:

GetMetaData(node_id, serviceName_id)

Register(serviceName_id, s3_file_link)

Refresh(serviceName_id, new_node_id)
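To make the three calls concrete, here is a minimal in-memory sketch. The class name, the snake_case method names, the return values, and the placeholder certificate string are all hypothetical. Refresh is read here as the "force delivery to a single node" call from the requirements, which is an interpretation rather than something the notes confirm.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataDeliveryApi:
    """In-memory stand-in for the three calls listed above (hypothetical shapes)."""

    # service id -> link to the S3 object holding the service-to-node mapping
    services: dict = field(default_factory=dict)
    # (service id, node id) pairs waiting for a forced delivery
    forced: list = field(default_factory=list)

    def register(self, service_name_id: str, s3_file_link: str) -> None:
        """Onboard a service by recording where its node mapping lives."""
        self.services[service_name_id] = s3_file_link

    def get_metadata(self, node_id: str, service_name_id: str) -> str:
        """Return the metadata for one node of a registered service."""
        if service_name_id not in self.services:
            raise KeyError(f"unknown service: {service_name_id}")
        # Placeholder for the existing metadata API that the notes assume.
        return f"cert-for-{service_name_id}-{node_id}"

    def refresh(self, service_name_id: str, new_node_id: str) -> None:
        """Queue a forced delivery to a single node."""
        if service_name_id not in self.services:
            raise KeyError(f"unknown service: {service_name_id}")
        self.forced.append((service_name_id, new_node_id))
```

A caller would register a service once, for example with an illustrative link such as "s3://example-bucket/svc-a.json", and then call refresh whenever a single node needs an immediate delivery.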

Architecture design

Add delivery service

Add alarm for system outages

How to handle hardware failures?

The customers should fix the hardware failures

Q: What if the customer node has network issues and cannot be reached for 10 seconds, and meanwhile the delivery service also fails?

A:

We should provide an acknowledgment for each successful delivery.

The API server can detect the failure, find another delivery service instance, and retry delivery of the certificate.
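The acknowledge-then-fail-over idea in this answer can be sketched as follows. A delivery service is modelled as a plain callable that returns True once the node acknowledges; that modelling, the function name, and the attempts_per_service parameter are assumptions for illustration, not part of the design discussed.

```python
from typing import Callable, Optional, Sequence

# A delivery service returns True when the node acknowledged the certificate.
# It returns False or raises ConnectionError when no acknowledgment arrived.
DeliveryService = Callable[[str, bytes], bool]


def deliver_with_failover(
    node_id: str,
    cert: bytes,
    services: Sequence[DeliveryService],
    attempts_per_service: int = 2,
) -> Optional[int]:
    """Try each delivery service in turn until one reports an acknowledgment.

    Returns the index of the service that succeeded, or None if all failed.
    """
    for index, service in enumerate(services):
        for _ in range(attempts_per_service):
            try:
                if service(node_id, cert):
                    return index
            except ConnectionError:
                pass  # treated the same as a missing acknowledgment
    return None
```

A None result is the signal to persist the work and try again later rather than drop it.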

Q: What if the API server also fails?

A: There should be multiple API servers.

Q: What if the whole system is down, e.g. all nodes of the metadata delivery system fail?

A: We can deploy to multiple regions.

Q: How do you get the work to retry if we don’t store any data?

A: Add a database to store the pending work so it can be retried.
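One way to picture the stored work is a small table of pending deliveries that survives a restart. The sketch below uses Python's built-in sqlite3 module purely for illustration; the table name, columns, and status values are hypothetical, and a production system would likely use a replicated database.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the work store and create the table on first use."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS delivery_work ("
        " service_id TEXT NOT NULL,"
        " node_id TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " PRIMARY KEY (service_id, node_id))"
    )
    return db


def enqueue(db, service_id, node_id):
    """Record the work before attempting delivery so a crash cannot lose it."""
    db.execute(
        "INSERT OR REPLACE INTO delivery_work (service_id, node_id, status)"
        " VALUES (?, ?, 'pending')",
        (service_id, node_id),
    )
    db.commit()


def mark_delivered(db, service_id, node_id):
    """Called only after the node has acknowledged the delivery."""
    db.execute(
        "UPDATE delivery_work SET status = 'delivered'"
        " WHERE service_id = ? AND node_id = ?",
        (service_id, node_id),
    )
    db.commit()


def pending(db):
    """Work that still needs a delivery attempt, e.g. after a restart."""
    rows = db.execute(
        "SELECT service_id, node_id FROM delivery_work"
        " WHERE status = 'pending' ORDER BY service_id, node_id"
    )
    return rows.fetchall()
```

After a crash, a recovering server would call pending() and resume from whatever is still marked as not delivered.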

Q: What failure cases can you think about?

A: The clock may be wrong on the node.

Q: What if all nodes in the system fail?

A: Raise an alarm to the on-call engineer; this requires manual recovery. We would bring the system back up gradually so it can handle the peak traffic.

Interviewer and Audience Feedback

Interviewer:

Soft skill

Good soft skills. I expected 15-20 minutes of clarification.

Covered 4 important points

In the API design, we forgot some details from the requirements. We did not cover the onboarding API.

Overall, I think the soft skills were good

Hard skill

We may have missed some aspects of the design. I hinted that we should have a database to store customer information. We don’t have to store the node ID, but we need to store service information.

The delivery service uses synchronous calls. In a large system failure, the API server threads and the delivery service calls may be fully occupied.

Interviewee:

Was not a strong interview

The use case is very specific. I was guessing what the key points of the interview were.

I forgot about some requirements during the design.

Regarding the synchronous call, I didn’t think we needed to use a queue.

I think the cost of delivery is low.

Interviewer design

API server can get requests from customer:

Register request to register service name. We can save in the database.

Force get metadata request.

Metadata: there is a 3rd party API to retrieve the metadata.

We assume the certificate is the metadata. It’s a simplification of the real system.

The most important point is periodic delivery of the metadata/certificates. The registration contains the service ID and the frequency of delivery.

Audience: The certificate should contain the expiration time.

Interviewer: the expiry is submitted via API

Audience: there are many types of metadata. Should we clarify?

Interviewer: yes.

Audience: synchronous vs async?

Interviewer: leaning toward asynchronous calls, because synchronous calls may block the threads in the worker

We can use non-blocking IO in the worker.
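A minimal sketch of a non-blocking worker, using Python's asyncio. The deliver coroutine simulates the network call with a sleep, and the function names, the max_in_flight cap, and the timeout value are hypothetical. The point is that one slow node only delays its own task instead of holding a worker thread.

```python
import asyncio


async def deliver(node_id: str, delay: float) -> str:
    """Stand-in for a non-blocking network call to the node."""
    await asyncio.sleep(delay)
    return node_id


async def deliver_all(nodes, max_in_flight=100, timeout=1.0):
    """Deliver to many nodes concurrently.

    nodes is a list of (node_id, simulated_delay) pairs. The result list
    keeps the input order; None marks a node that timed out.
    """
    limit = asyncio.Semaphore(max_in_flight)  # cap on concurrent deliveries

    async def one(node_id, delay):
        async with limit:
            try:
                return await asyncio.wait_for(deliver(node_id, delay), timeout)
            except asyncio.TimeoutError:
                return None  # the caller can schedule a retry

    return await asyncio.gather(*(one(n, d) for n, d in nodes))
```

For example, asyncio.run(deliver_all([("a", 0.01), ("slow", 0.5)], timeout=0.1)) returns a result for "a" and None for the slow node, without the slow node delaying the fast one.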

Audience: (1) How to quickly get requirements? (2) What did the interviewer expect from system failure? (3) Is it similar to a job scheduler?

Interviewer:

(1) Ask more questions. I tried to scope down the question, e.g. limit it to certificates.

(2) I wanted to provide a hint that we need a database.

(3) Yes, it’s similar. There are other follow-ups, e.g. exhausting the job thread pool.

Audience: What does S3 store?

Interviewer: Service name to node mapping. Any storage is fine; you can use a database.

Audience: how to assign work to workers?

Interviewer: one worker for each service name. This may lead to a hot partition.

Or we can assign different nodes to different workers.
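The second option, spreading nodes rather than whole services across workers, can be sketched with a hash of the (service, node) pair. The function name and the choice of SHA-256 with a modulo are illustrative assumptions; the notes do not specify an assignment scheme.

```python
import hashlib


def worker_for(service_id: str, node_id: str, worker_count: int) -> int:
    """Map a (service, node) pair to a worker index in [0, worker_count).

    Hashing the pair, not just the service name, spreads the nodes of one
    large service over all workers instead of loading a single worker.
    """
    key = f"{service_id}/{node_id}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % worker_count
```

A plain modulo reassigns most nodes whenever worker_count changes; consistent hashing would be one way to limit that movement.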

Audience: why message queue?

Interviewer: to avoid API server being blocked.

Audience: how do we know the expiration time?

Interviewer: during registration.

Audience: how to implement expiry?

Interviewer: cron, or priority queue.
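The priority-queue option can be sketched with a min-heap keyed by the time the next delivery is due. The class name and the safety factor (deliver after half the expiry window, leaving room for retries) are assumptions for illustration, not values from the interview.

```python
import heapq


class RenewalQueue:
    """Min-heap of (due_time, service_id): the earliest deadline pops first."""

    def __init__(self):
        self._heap = []

    def schedule(self, service_id, now, expiry_seconds, safety=0.5):
        """Queue the next delivery for a fraction of the expiry window from now."""
        due = now + expiry_seconds * safety
        heapq.heappush(self._heap, (due, service_id))

    def pop_due(self, now):
        """Remove and return every service whose delivery is due at 'now'."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

A worker loop would call pop_due with the current time, deliver to the returned services, and call schedule again for each one. Compared with cron, the heap handles a different expiry per service without scanning every registration on each tick.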

Audience: what’s the most important hard skill?

Interviewer: want to see a working design + failure case + peak hour.

Audience: asynchronous call

Interviewer: previously the call chain API service -> Worker -> Node was synchronous. Then I needed to test how the design handles node crashes.

Audience: what is the database design?

Interviewer: the simplest design is service name + expiration time.

Audience: why does the worker talk to DB?

Interviewer: in most cases the worker talks to the DB. The message queue is for force-refreshing metadata.

Audience: what if a worker fails to call a node?

Interviewer: we can scale up workers, or we can use non-blocking IO to make calls.

A worker can keep retrying if it fails to call a node.

Audience: what if it fails many times?

Interviewer: we may throw the