Escalation and Notification System

System Design, Reliability & Scaling, Messaging & Streaming

Topic: Escalation and Notification System

Interviewer: Eric

Interviewee: jyuan

Level: L5 (Senior)


Design interview:

Public speaking / Workplace communication & behavior interview

Topic

Mock System Design Interview Summary

Interview Overview

Date: 5/29/2022

Target level: L5 senior engineer

Duration: 45 minutes

Topic covered: Escalation and notification system

Drawing tool used: Excalidraw

Requirements

Functional requirements

Library for different companies or a single company? Pager system for multiple companies

Companies should onboard with SSO service

User can define the rules

Group escalation. Teams are customizable

Tickets of multiple severity. Customize.

Different company/user/group can generate different rules

Users or services (generated by other monitoring systems) can trigger the notification

Non functional requirements

One company: 100,000 teams, 1 million employees, 10 tickets

100 companies, 1,000 TPS

High scalability

High availability

Latency -> a few seconds of delay is fine

Accuracy -> escalate at least once

System Design

External APIs

/api/v1/createTickets(ticketId, group)

Response: 200

createTime, group

/api/v1/updateTickets(ticketId, status)

NonRead, under investigation, pending deployment/fix, resolved

/api/v1/createRules(ruleName, group, ruleInformation)

/api/v1/updateRules

System design

Choosing a NoSQL database for scalability

Now add message queue / worker / different types of notifications

Q: what do you store in the metadata?

A: the alert rules, and group information

Q: how do you differentiate data from different companies?

A: user has signed in with SSO; frontend can resolve user’s SSO identity to company’s identity

Q: work through workflow

A: create ticket -> fetch from the metadata service -> create a task that resolves to the next escalation point

E.g. next 5 minutes

Ticket ID, next escalation time

Send the task to the queue; the worker handles it after the delay

Worker reads the escalation task, double check with metadata service, may work on task or drop the task (depending on if ticket has been resolved)

For different tickets, we can hash based on ticketIDs, then hand tasks to different workers

If worker gets a task that the worker does not own:

It can put the task back to the queue

Or push the task to the right worker

We can use ZooKeeper to handle sharding
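
A minimal sketch of the hash-based task ownership described above, assuming the list of live workers is supplied by a coordination service (the notes mention ZooKeeper; no ZooKeeper client is used here). The names owner_for_ticket and route are hypothetical.

```python
import hashlib


def owner_for_ticket(ticket_id: str, workers: list[str]) -> str:
    """Map a ticket ID to one worker using a stable hash.

    hashlib is used instead of the built-in hash() so that every
    process computes the same owner for the same ticket ID.
    """
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


def route(ticket_id: str, current_worker: str, workers: list[str]) -> str:
    """Decide what a worker does with a task it has just received."""
    owner = owner_for_ticket(ticket_id, workers)
    if owner == current_worker:
        return "process"
    # The notes list two options for a task the worker does not own:
    # put it back on the queue, or push it to the right worker.
    # This sketch picks the second option.
    return f"forward:{owner}"
```

Simple modulo hashing reassigns most tickets whenever the worker list changes; consistent hashing would reduce that churn, which is a design choice outside what the interview covered.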

What happens if the metadata changed?

Worker can pull the metadata

What priority queue is used?

Store object, and next escalation time
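
As a rough illustration of a priority queue keyed by next escalation time, the sketch below uses Python's heapq. The class and field names (EscalationTask, EscalationQueue, level) are assumptions for illustration, and a production queue would need to be durable rather than in memory.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class EscalationTask:
    # Only next_escalation_time takes part in ordering.
    next_escalation_time: float
    ticket_id: str = field(compare=False)
    level: int = field(compare=False, default=0)


class EscalationQueue:
    """In-memory min-heap ordered by next escalation time."""

    def __init__(self) -> None:
        self._heap: list[EscalationTask] = []

    def push(self, task: EscalationTask) -> None:
        heapq.heappush(self._heap, task)

    def pop_due(self, now: float) -> list[EscalationTask]:
        """Remove and return every task whose escalation time has passed."""
        due = []
        while self._heap and self._heap[0].next_escalation_time <= now:
            due.append(heapq.heappop(self._heap))
        return due
```

A worker would call pop_due on a timer, then check each ticket's status with the metadata service before escalating or dropping the task.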

Q: If the user received the email, why is it necessary for the worker to add the message back to the queue?

A: if the user does not respond, then it may be necessary for the message to be sent to the next escalation point

Q: What happens if the worker goes down?

A: Create a disaster recovery service for tasks that have not been processed by the worker. The disaster recovery service will call the frontend service.

For each worker, we can have a replica

Disaster recovery to read the metadata service; find messages that are not resolved; compare the escalation time vs next escalation time; resend it back to the queue
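
The recovery scan above could look roughly like the sketch below. It assumes each ticket record carries a status and a stored next escalation time, and it interprets the comparison as "the stored next escalation time has already passed". The record shape and the name find_tasks_to_requeue are hypothetical.

```python
def find_tasks_to_requeue(tickets: list[dict], now: float) -> list[tuple[str, float]]:
    """Return (ticket_id, next_escalation_time) pairs that look overdue.

    A ticket is re-sent to the queue when it is not resolved and its
    stored next escalation time is at or before the current time,
    which suggests the original task was lost with a failed worker.
    """
    overdue = []
    for ticket in tickets:
        if ticket["status"] == "resolved":
            continue
        if ticket["next_escalation_time"] <= now:
            overdue.append((ticket["ticket_id"], ticket["next_escalation_time"]))
    return overdue
```

Because the system only promises at-least-once escalation, re-queuing a task that is still alive elsewhere is acceptable as long as the worker re-checks ticket status before notifying.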

Status of the messages:

Succeed: add back to priority queue

Failed, e.g. invalid email, or 3rd party tool failed

If it’s not retriable, then drop the user

If it’s retriable, then put back to the queue

If the user presses “acknowledge” on the phone, then we can mark the message as “in progress”. We can drop future tasks.
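
The message statuses above amount to a small decision table. The sketch below is one possible encoding; the enum names and the status strings are assumptions, not part of the design discussed in the interview.

```python
from enum import Enum


class SendResult(Enum):
    SUCCEEDED = "succeeded"
    RETRIABLE_FAILURE = "retriable_failure"
    PERMANENT_FAILURE = "permanent_failure"  # e.g. invalid email


class Action(Enum):
    SCHEDULE_NEXT_ESCALATION = "schedule_next_escalation"
    RETRY = "retry"
    DROP_RECIPIENT = "drop_recipient"
    DROP_TASK = "drop_task"


def next_action(ticket_status: str, result: SendResult) -> Action:
    """Decide what the worker does after a notification attempt."""
    # An acknowledged or resolved ticket needs no further escalation.
    if ticket_status in ("in progress", "resolved"):
        return Action.DROP_TASK
    if result is SendResult.SUCCEEDED:
        # Delivered but not yet acknowledged: keep the escalation chain alive.
        return Action.SCHEDULE_NEXT_ESCALATION
    if result is SendResult.RETRIABLE_FAILURE:
        return Action.RETRY
    return Action.DROP_RECIPIENT
```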

Metrics: scaling up the workers

Q: How do we know the system is working?

A: depends on the metrics

Q: What if the frontend is down?

A: tracks the service. Data announce. Canary to verify end to end flow.

What if the system is comp

Interviewer and Audience Feedback

Interviewer:

Good candidate for L4

L5: borderline

Requirement gathering

Design was a bit confusing

Whether the worker is stateful or stateless (e.g. queue)

Discovery system. Worker goes down. Can be recovered better

Metadata service does too much

We may split into two services.

====

Interviewee:

May not have gathered requirements thoroughly

Spoke too quickly

Disaster recovery - can have more improvements. Scan database

Worker: data is in priority queue, but source of truth is in database. So worker is still stateless

====

Audience

Will it be different if this is shared across many companies vs. used internally by a single company?

A: much smaller scale for internal systems

A few minutes of drawing. He was silent. Does the interviewer need an explanation?

Interviewee: Drop architecture design. I can confirm with the interviewer after each stage

Interviewer: I usually don’t interrupt the interviewee

==

Audience

Interviewer said “Do you have something more to add?” What’s expected?

Interviewer:

No huge expectations. Monitoring by another party

My main concern was the design itself. There were small issues, so I was on borderline for L5

I felt meta service was too monolithic. I was hoping some refactoring into different services

API response should not be 200

===

Interviewee: what did you mean that I was missing a component?

Interviewer:

the metadata service is taking on too many responsibilities. Hoping to have more services

I was confused about the frontend.

Interviewee: I should improve names of the services

Interviewer: you can correct the name in the middle of the interview.

===

Interviewee: why not return 200?

Interviewer: GET should return 200. Create should return 201 or 202

Not a big deal

Not big issue overall, but we may not go to L5
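
For reference, the status code point can be sketched with Python's standard http.HTTPStatus. The helper status_for and its parameters are hypothetical; the mapping assumes 201 Created when the resource is created before the response is sent, and 202 Accepted when the request is only queued for later processing.

```python
from http import HTTPStatus


def status_for(operation: str, processed_async: bool = False) -> HTTPStatus:
    """Pick a success status code for an API operation."""
    if operation == "get":
        return HTTPStatus.OK  # 200
    if operation == "create":
        # 202 signals that work was accepted but is not finished yet.
        return HTTPStatus.ACCEPTED if processed_async else HTTPStatus.CREATED
    raise ValueError(f"unknown operation: {operation}")
```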

===

Audience: suggestion: what happens if the work fails?

You can proactively work on fault tolerance. (similar to running test case)

===

Audience: how to acknowledge

A: We can use the metadata service to update the database

Escalation. It’s part of the business logic. We may not need to go into a lot of details.

===

Audience: there is a state machine. Continuously evaluate. We may need to dedupe.

Worker should be similar to a cronjob. Continuously look at unhandled task

A: Priority queue. At the point of escalation, we can drop the service if it’s already handled. It’s more lightweight

Audience: Noisy neighbor. One service may flood other services

Need to guarantee everybody gets served
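
One common way to limit a noisy neighbor is to keep a queue per company and serve companies in round-robin order. The sketch below illustrates that idea only; it was not proposed in the interview, and the name FairQueue is hypothetical.

```python
from collections import deque


class FairQueue:
    """Round-robin across per-company queues so one company cannot starve others."""

    def __init__(self) -> None:
        self._queues: dict[str, deque] = {}
        self._order: deque[str] = deque()  # companies that have pending tasks

    def push(self, company: str, task: str) -> None:
        if company not in self._queues:
            self._queues[company] = deque()
            self._order.append(company)
        self._queues[company].append(task)

    def pop(self):
        """Return the next task, or None when nothing is pending."""
        if not self._order:
            return None
        company = self._order.popleft()
        queue = self._queues[company]
        task = queue.popleft()
        if queue:
            # The company still has work, so it goes to the back of the line.
            self._order.append(company)
        else:
            del self._queues[company]
        return task
```

With three tasks queued for one company and one for another, the second company's task is served second rather than last.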

===

Audience: why was there no database design?

Interviewee: we should consider it. It feels the time was too tight

===

Audience: Schedule state machine. 2 scenarios.

Email does not succeed. Should we put the message back to the queue?

A: put it to the priority queue. Invalid email, just drop the email. Retriable: put it back to the priority queue (not message queue)

If email succeeds. Should we wait for the acknowledgement?

Email cannot report whether it was read.

User should update the ticket

Worker may be down or may not be able to handle the request

Didn’t consider (200 threads, 3rd party may be down. Threads may be all blocked) we probably need another component.

====

Audience: why do we use priority queue?

1M paging. 99% can be put into the queue. Some may not be handled, it may be a waste of resource

Why do we use priority queue but not a scheduling service?

A: not familiar with scheduling service. In order based on time. Cronjob may require lots of resource to scan the database

Q: different ticket may have different time to handle. Some items may take 7 days. When we add a new item log(N) for 15 minutes. We may fill up the system.

A: can consider data management service.

Any suggestions:

Split metadata service into meta data and runtime data.

Priority queue: we may be able to throw message back to message queue. No need for disaster recovery

Message queue: is in-order

Confused about priority queue; priority queue adds status to the worker

Every time worker is down, we need recovery

Audience: we use DynamoDB, eventual consistency. Time sensitive. If the message sending failed, we rescan the table. There may be some delay.

Q: easy to extend. No relations.

Should everything go through message queue? Should we scan again?

A: Create and update should both go through message queue. There may be duplicate task. Use metadata service to dedupe
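
A minimal sketch of the dedupe step, assuming a task is identified by its ticket ID plus its scheduled escalation time. An in-memory set stands in for the metadata service here, and the name TaskDeduper is hypothetical.

```python
class TaskDeduper:
    """Remember which (ticket, escalation time) pairs were already enqueued."""

    def __init__(self) -> None:
        # Stand-in for state that the metadata service would hold.
        self._seen: set[tuple[str, float]] = set()

    def should_enqueue(self, ticket_id: str, next_escalation_time: float) -> bool:
        """Return True the first time a task is seen, False for duplicates."""
        key = (ticket_id, next_escalation_time)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```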

Q: ticket update is async. It may not be friendly to user.

A: database update is synchronous. Notification should be asynchronous.

Q: if nobody takes care of the ticket, should we have a timedb or re-pickup?

A: we can throw into priority, but it becomes