Metrics System (Time series DB)

Topic: Metrics System (Time series DB)

Interviewer: Bryan

Interviewee: Alina

Level: L4 (Experienced Individual Contributor)

Additional Resources:

Google Monarch time series database

Metrics System

Mock System Design Interview Summary

Interview Overview

Date: 10/31/2021

Target level: 4

Duration: 1 hour

Topic covered:

Drawing tool used: diagrams.net

Requirements

1M users, 10 writes, 1 read per second

Functional requirements

Data collection, CRUD, aggregation, calculation, storage, query, data visualization

Non-functional requirements

high availability prioritized over high consistency

high scalability

high performance

data consistency

Math constraints:

QPS:

write heavy

1M * 20 / 10 ^ 5 = 10 ^ 4

peak:

3 * 10 ^ 4

read:

10 write, 1 read

10 ^ 3 /s

storage:

3 years * 365 * 10 ^ 5 * 10 ^ 4 * 1KB / (1024 * 1024) ≈ 10 ^ 6 GB (about 1 PB)

bandwidth:

write: 1KB * 3 * 10 ^ 4 = 3 * 10 ^ 4 KB/s ≈ 30 MB/s

read: 1KB * 3 * 10 ^ 3 ≈ 3 MB/s

memory:

1M * 0.2 * 1KB = 2 * 10 ^ 5 KB ≈ 200 MB
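The estimates above can be re-derived with a short script. This is a minimal sketch that takes the working figures from the notes as given (average write rate of 10 ^ 4 per second, peak factor of 3, 10 writes per read, 1 KB per record, 3 years of retention). The variable names are illustrative, decimal units (1 MB = 1000 KB) and 86,400 seconds per day are assumed, and the 0.2 factor is read here as the fraction of users kept in memory, which is an assumption.

```python
# Back-of-envelope capacity estimate using the figures from the notes.
# Assumptions: decimal units (1 MB = 1000 KB), 86,400 seconds per day.

AVG_WRITE_QPS = 10**4      # average writes per second (taken from the notes)
PEAK_FACTOR = 3            # peak is 3x the average (taken from the notes)
WRITES_PER_READ = 10       # 10 writes for every read
RECORD_KB = 1              # size of one record in KB
RETENTION_YEARS = 3
SECONDS_PER_DAY = 86_400
USERS = 10**6
CACHED_FRACTION = 0.2      # assumed meaning of the 0.2 factor

peak_write_qps = AVG_WRITE_QPS * PEAK_FACTOR
avg_read_qps = AVG_WRITE_QPS // WRITES_PER_READ
peak_read_qps = avg_read_qps * PEAK_FACTOR

storage_kb = RETENTION_YEARS * 365 * SECONDS_PER_DAY * AVG_WRITE_QPS * RECORD_KB
storage_tb = storage_kb / 1000**3

write_mb_per_s = peak_write_qps * RECORD_KB / 1000
read_mb_per_s = peak_read_qps * RECORD_KB / 1000
memory_mb = USERS * CACHED_FRACTION * RECORD_KB / 1000

print(f"peak write QPS: {peak_write_qps}, peak read QPS: {peak_read_qps}")
print(f"storage: {storage_tb:.0f} TB")
print(f"bandwidth: write {write_mb_per_s:.0f} MB/s, read {read_mb_per_s:.0f} MB/s")
print(f"memory: {memory_mb:.0f} MB")
```

Using 86,400 seconds per day instead of the rounded 10 ^ 5 gives a slightly smaller storage figure, but the order of magnitude (about 1 PB) is the same.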

System Design

System design diagram

API

metric.send(userId, eventName, status, timestamp): statusCode

metric.get(eventName, timestamp, range): List / integer

What is the result? Trying to get a count per day, so the unit can be 1 day

Customer can change range to 1 hour, 1 day, 1 week, 1 month
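To make the two calls concrete, here is a minimal in-memory sketch of the API. `MetricStore` is a hypothetical stand-in for the real backend, timestamps are assumed to be integer seconds, and the `unit_seconds` parameter (the bucket size, such as 1 hour or 1 day) is an assumption that is not part of the signature in the notes.

```python
from collections import defaultdict


class MetricStore:
    """Hypothetical in-memory stand-in for the metrics backend."""

    def __init__(self):
        # event name -> list of (timestamp, user_id, status)
        self._events = defaultdict(list)

    def send(self, user_id, event_name, status, timestamp):
        """Record one event and return an HTTP-style status code."""
        if not event_name or timestamp < 0:
            return 400
        self._events[event_name].append((timestamp, user_id, status))
        return 200

    def get(self, event_name, timestamp, range_seconds, unit_seconds):
        """Return counts per bucket for [timestamp, timestamp + range_seconds)."""
        num_buckets = -(-range_seconds // unit_seconds)  # ceiling division
        buckets = [0] * num_buckets
        for ts, _user, _status in self._events.get(event_name, []):
            if timestamp <= ts < timestamp + range_seconds:
                buckets[(ts - timestamp) // unit_seconds] += 1
        return buckets


store = MetricStore()
for ts in (0, 3599, 3600, 7200):
    store.send("user-1", "add item", "isAdded", ts)

# Two hours of data in one-hour buckets; the event at 7200 is out of range.
hourly_counts = store.get("add item", 0, 7200, 3600)
print(hourly_counts)
```

Returning a list of per-bucket counts covers both shapes mentioned in the signature: the list can be plotted directly, and summing it gives the single count for the whole range.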

Push/Pull: choose push

Data flow:

Writes:

Client -> load balancer -> aggregator service -> message queue (Kafka) -> consumers:

Log service -> NoSQL database (Elasticsearch)

Visualization service (Logstash / self-built) -> Elasticsearch (index, range query)

Notification service -> SQL database (notification, sender/receiver for queries), PostgreSQL with replica

Reads:

Client -> LB -> Redis -> read service -> Elasticsearch, sliding window (the counts within the window)

Database schema:

NoSQL:

{

"index": eventName,

"timestamp": time,

"status": running,

"tenant": ["tenant1", "tenant2"]

}

(why is there “tenant”?)

SQL:

Discussion:

Interviewer: Go through API and schema

Interviewee:

metric.send(userId, eventName, status, timestamp): statusCode

Interviewer: What is the example?

Interviewee:

metric.send({

userId: userId,

eventName: "add item",

status: "isAdded",

timestamp: 2021,

})

Interviewer:

Let’s skip log and notification service

Can we see the schema for the visualization service?

Interviewee:

Schema is …

The read service will read from Elasticsearch

Interviewer:

How can the user see the visualization?

Interviewee:

Added Grafana (between Elasticsearch and the read service)

Removed Grafana -> the client can get a count based on range and display it in its own UI

Interviewer:

Why do we have Redis?

Interviewee:

The client can keep clicking refresh, and every refresh would otherwise read from the database

Redis:

{

"Input parameter": {

"eventName", "time interval"

},

"Count":

}
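The cache entry above, keyed by the query parameters and holding a count, can be sketched as a cache-aside read. This is only an illustration: a plain dictionary with a time-to-live stands in for Redis, `query_backend` is a hypothetical callable standing in for the read service, and the 5-second TTL is an assumed value.

```python
import time


class CountCache:
    """Cache-aside wrapper; a dict with expiry times stands in for Redis."""

    def __init__(self, query_backend, ttl_seconds=5.0, clock=time.monotonic):
        self._query_backend = query_backend  # hypothetical read-service call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (event_name, interval) -> (expires_at, count)

    def get_count(self, event_name, interval):
        key = (event_name, interval)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no database read
        count = self._query_backend(event_name, interval)  # cache miss
        self._entries[key] = (now + self._ttl, count)
        return count


backend_calls = []


def fake_backend(event_name, interval):
    # Stand-in for a slow range query against the database.
    backend_calls.append((event_name, interval))
    return 42


fake_now = [0.0]
cache = CountCache(fake_backend, ttl_seconds=5.0, clock=lambda: fake_now[0])
first = cache.get_count("add item", "1 hour")
second = cache.get_count("add item", "1 hour")  # served from the cache
fake_now[0] = 6.0                                # entry has expired
third = cache.get_count("add item", "1 hour")   # goes to the backend again
print(first, second, third, len(backend_calls))
```

With this shape, repeated refreshes inside the TTL cost one backend query, at the price of showing counts that can be up to one TTL old.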

Interviewer:

Computation in the database is really slow. How do we do real time?

Interviewee:

Can be real time. It’s ok to be slow.

Can I get some hint? For performance improvement.

Interviewer:

If you need to monitor

1M users, all traffic arriving at the beginning, say within 1 minute: lots of writes

Every 5 seconds: display all traffic in the monitor

Multiple millions of records in each query

Interviewee:

Can use sampling to produce a quick estimate and render it first

Then render a more accurate result
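The sampling idea can be sketched as follows: count matches in a random sample and scale up to the full population, then replace the estimate with the exact count once it is available. The function name and parameters are hypothetical, and uniform random sampling is an assumption; the notes do not say which sampling scheme was meant.

```python
import random


def estimate_count(events, predicate, sample_size, rng=None):
    """Estimate how many events match `predicate` from a uniform random sample."""
    rng = rng or random.Random()
    if sample_size >= len(events):
        # Sample would cover everything, so just count exactly.
        return sum(1 for event in events if predicate(event))
    sample = rng.sample(events, sample_size)
    hits = sum(1 for event in sample if predicate(event))
    return round(hits * len(events) / sample_size)


events = ["ok"] * 70_000 + ["error"] * 30_000


def is_error(event):
    return event == "error"


quick = estimate_count(events, is_error, 10_000, random.Random(42))
exact = sum(1 for event in events if is_error(event))
print(f"quick estimate: {quick}, exact: {exact}")
```

The estimate touches 10% of the records here, so it can be rendered first while the exact count is still being computed.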

Interviewer:

Can we aggregate before we write to the time series database?

Can we add the time interval as part of the key?

Interviewee:

Add a temporary storage for aggregation

Interviewer:

Kafka can aggregate

Other services can also aggregate

Aggregate before you save to Elasticsearch
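Pre-aggregation with the time interval as part of the key can be sketched as a fixed-window counter. This is a minimal illustration, not the design from the interview: `WindowAggregator` is a hypothetical name, the 5-second window is an assumed value, and timestamps are assumed to be integer seconds.

```python
from collections import Counter


class WindowAggregator:
    """Counts events per (event name, window start) before anything is stored."""

    def __init__(self, window_seconds=5):
        self.window = window_seconds
        self._counts = Counter()  # (event_name, window_start) -> count

    def add(self, event_name, timestamp):
        window_start = timestamp - timestamp % self.window
        self._counts[(event_name, window_start)] += 1

    def flush(self, now):
        """Return and remove every window that ended at or before `now`."""
        closed = {
            key: count
            for key, count in self._counts.items()
            if key[1] + self.window <= now
        }
        for key in closed:
            del self._counts[key]
        return closed


aggregator = WindowAggregator(window_seconds=5)
for ts in (0, 1, 4, 5, 9, 12):
    aggregator.add("login", ts)

# Six raw events become two rows to write; the window starting at 10 is still open.
rows = aggregator.flush(now=10)
print(rows)
```

Only the flushed rows are written downstream, so the write volume depends on the number of distinct event names and windows rather than on the number of raw events.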

Interviewer:

If our payload is huge, how can we transport the data to the database?

Interviewee:

Can split the load into small files

One worker is 1kb

Every second 1M records

Interviewee:

Can hash and send to different instances of aggregator
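Routing by hash can be sketched as below. The function name is hypothetical, and hashing on the event name is an assumption (the notes do not say which field is hashed). A cryptographic digest is used instead of Python's built-in `hash()`, because the built-in is randomized per process for strings and would route the same key differently on different machines.

```python
import hashlib


def pick_aggregator(event_name, num_instances):
    """Map an event name to a stable aggregator index in [0, num_instances)."""
    digest = hashlib.sha256(event_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


for name in ("add item", "login", "checkout"):
    print(name, "->", pick_aggregator(name, 4))
```

Because every record for a given event name lands on the same instance, that instance can aggregate it locally without coordinating with the others.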

Interviewer:

Example

Interviewee:

Missing data: the monitor waits for 2 days and there are no events. Can ask the client to resend.

Validation of data
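A validation step at the aggregator could look like the sketch below. The required fields follow the `metric.send` signature; the function name, the error messages, and the 300-second clock-skew allowance are assumptions made for illustration.

```python
REQUIRED_FIELDS = ("userId", "eventName", "status", "timestamp")


def validate_event(event, now, max_skew_seconds=300):
    """Return a list of problems with the event; an empty list means it is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if event.get(field) in (None, ""):
            errors.append(f"missing {field}")
    timestamp = event.get("timestamp")
    if isinstance(timestamp, (int, float)) and timestamp > now + max_skew_seconds:
        errors.append("timestamp is in the future")
    return errors


good = {"userId": "u1", "eventName": "add item", "status": "isAdded", "timestamp": 1000}
no_status = {"userId": "u1", "eventName": "add item", "timestamp": 1000}
from_future = dict(good, timestamp=5000)

print(validate_event(good, now=1000))
print(validate_event(no_status, now=1000))
print(validate_event(from_future, now=1000))
```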

Interviewer:

Walk through the whole design

Send request

Aggregator: missing data, validation

Kafka: push to different consumers

Log service: append to log entry

Visualization: can aggregate and write to Elasticsearch

Notification: may or may not need

Read:

Request

Check the Redis cache. Return the result on a cache hit

On a cache miss, the request goes to the read service, which reads from Elasticsearch

UI keeps on polling

Handle scale up:

Database sharding. Replicas make the service highly available. Shard key: time interval + event key

Kafka, Elasticsearch, and the read service can all have multiple instances
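The sharding scheme mentioned above (time interval plus event key) can be sketched as a composite key followed by a stable hash. The function names, the `name:bucket` key format, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def shard_key(event_name, timestamp, interval_seconds):
    """Build a composite key from the event name and the start of its time bucket."""
    bucket_start = timestamp - timestamp % interval_seconds
    return f"{event_name}:{bucket_start}"


def shard_for(key, num_shards):
    """Map a composite key to a stable shard index in [0, num_shards)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Both timestamps fall in the one-minute bucket starting at 120.
key_a = shard_key("login", 125, 60)
key_b = shard_key("login", 179, 60)
print(key_a, key_b, shard_for(key_a, 8))
```

Including the time bucket spreads a single hot event name across shards over time, while a query for one event and one interval still touches exactly one shard.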

Discussions during the Interview

Interviewer and Audience Feedback after the Interview

Interviewer:

Interviewee is nervous

I waited for interviewee to finish

Chat window: expressed a lot of prepared material

However, should pause and ask for requirements

Spent too much time upfront

Interviewer:

Design share screen

Our requests are big

Aggregator service. Event can be aggregated at the aggregator

Partition, event name, time interval.

Regardless of the time interval, can make it a batch

Can aggregate over 5 seconds

Currently individual records are sent to Kafka directly

Load in frontend is heavy

Kafka is heavily loaded

Range query will need to visit the whole database

Slow

Metrics monitoring is close to real time

Calculation in database is slow, hard to become real time

Every time UI will need to visit the database

Time series database:

Header - file type

Table split into blocks. Log-structured merge (LSM) tree, append only.

Reading is relatively slow

Recommend to do aggregation ahead of time

For example aggregate within 1 second

Time series database, CRUD, compaction

Small chunks at a time

Then will merge
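The append-only write path with later merging can be sketched as below. This is a toy model under stated assumptions, not how any particular time series database is implemented: `AppendOnlyStore` is a hypothetical name, blocks live in memory rather than on disk, and compaction simply merges every flushed block into one.

```python
import heapq


class AppendOnlyStore:
    """Toy LSM-style store: small sorted blocks that are merged by compaction."""

    def __init__(self, block_size=3):
        self.block_size = block_size
        self._buffer = []  # recent writes not yet flushed to a block
        self.blocks = []   # immutable blocks, each sorted by timestamp

    def append(self, timestamp, value):
        self._buffer.append((timestamp, value))
        if len(self._buffer) >= self.block_size:
            self.blocks.append(sorted(self._buffer))  # flush a small chunk
            self._buffer = []

    def compact(self):
        """Merge all flushed blocks into a single sorted block."""
        if self.blocks:
            self.blocks = [list(heapq.merge(*self.blocks))]

    def range_query(self, start, end):
        """Return rows with start <= timestamp < end; must scan every block."""
        rows = [r for block in self.blocks for r in block if start <= r[0] < end]
        rows += [r for r in self._buffer if start <= r[0] < end]
        return sorted(rows)


db = AppendOnlyStore(block_size=2)
for ts, value in [(5, "a"), (1, "b"), (3, "c"), (2, "d"), (9, "e")]:
    db.append(ts, value)

blocks_before = len(db.blocks)
db.compact()
print(blocks_before, len(db.blocks), db.range_query(2, 6))
```

Writes stay cheap because they only append, while a range query has to look at every block, which is one reason reads are relatively slow until compaction reduces the number of blocks.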

Cache use

Redis as cache

There is a monitor, so why does it need Redis to cache

ELK: Elasticsearch and Kibana can read the data and provide close to real-time visualization, without going through the client side.

It feeds the component directly

Hard skill: