Log Collection System

Topic: Log Collection System

Interviewer: system

Level: L5 (Senior)

Interview Notes:

[7:07]

Requirements

Functional

Install log agent service

Shared internal network, so we don’t need to worry about network bandwidth

Start from logs search

Logs should be saved for search for about 30 days

Non-functional

Tolerate data loss when log volume is too high

Tolerate latency

Scalability

Out of scope

No auth

Archiving / Cold storage

[7:16]

Design

Top level alternatives: Push vs pull model

Push:

install an agent. Advantage: Server can handle the load

Pull:

Pros: the server side can process at its own speed

Cons: security issues, plus the complexity of managing which server pulls from where.

Push log to Kafka

Discussion: Quotas for different clients?

Default: 50 QPS, 100 KB

Configurable

[7:21]

Internal API:

PUT updateServiceQuota(String service, int quota)

Not important

Add Rate limiter service
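
A minimal sketch of how a per-service quota could be enforced, assuming a token bucket (the notes do not say which algorithm was discussed). The class names `TokenBucket` and `QuotaRegistry` are hypothetical; `update_service_quota` mirrors the internal API named above, and the 50 QPS default comes from the notes.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class QuotaRegistry:
    DEFAULT_QPS = 50  # default quota from the notes

    def __init__(self):
        self._buckets = {}

    def update_service_quota(self, service, quota, now=None):
        # Replaces any existing bucket for this service with the new quota.
        self._buckets[service] = TokenBucket(quota, quota, now)

    def allow(self, service, now=None):
        bucket = self._buckets.get(service)
        if bucket is None:
            # Unknown services fall back to the default quota.
            bucket = TokenBucket(self.DEFAULT_QPS, self.DEFAULT_QPS, now)
            self._buckets[service] = bucket
        return bucket.allow(now)
```

The `now` parameter exists only so the behavior can be exercised deterministically; a real service would rely on the monotonic clock.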

Q: how to support search?

A: Add Spark / real-time processing that pulls from Kafka and writes to the search service

[7:24]

Add a log search UI and LB

Add an archive service that archives logs to S3

Improve performance:

Batch and compress messages (enable compression in Kafka)
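
A rough sketch of agent-side batching and compression using only the standard library. The function names and the `max_records` / `max_bytes` limits are assumptions for illustration. In practice a Kafka producer can do this itself through its batching and `compression.type` settings, so this only shows the idea.

```python
import gzip
import json


def make_batches(lines, max_records=500, max_bytes=100_000):
    """Groups log lines into batches bounded by record count and byte size."""
    batch, size = [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Close the current batch before it would exceed either limit.
        if batch and (len(batch) >= max_records or size + line_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += line_size
    if batch:
        yield batch


def compress_batch(batch):
    """Serializes a batch as JSON and gzips it into one payload."""
    return gzip.compress(json.dumps(batch).encode("utf-8"))


def decompress_batch(blob):
    """Inverse of compress_batch, as a consumer would apply it."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Log lines tend to be repetitive, so compressing a whole batch usually saves far more than compressing lines one at a time.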

Q: Why write a separate log agent instead of sending logs directly from the applications?

A: we can have 1 team to maintain it. We can also use one programming language

We can run the agent in a container. The agent needs to track an offset recording how much of the log has already been sent.

This requires an offset file
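
The offset file idea can be sketched as follows, assuming the agent tails a single append-only log file and stores a byte offset next to it. The function names and the `send` callback are hypothetical. Log rotation is not handled here.

```python
import os


def read_offset(offset_path):
    """Returns the last committed byte offset, or 0 if none was saved."""
    try:
        with open(offset_path, "r", encoding="utf-8") as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def write_offset(offset_path, offset):
    """Writes the offset atomically so a crash never leaves a torn file."""
    tmp = offset_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(str(offset))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, offset_path)


def ship_new_lines(log_path, offset_path, send):
    """Sends every complete line written since the saved offset."""
    offset = read_offset(offset_path)
    with open(log_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a partial line still being written
            send(line.rstrip(b"\n").decode("utf-8", errors="replace"))
            offset = f.tell()
    # Commit once after the loop: a crash mid-loop re-sends some lines
    # (at-least-once delivery) but never skips any.
    write_offset(offset_path, offset)
    return offset
```

Committing the offset after sending means duplicates are possible after a crash, which fits the stated tolerance for imperfect delivery better than committing first and risking silent gaps.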

[7:32]

Q: Are there alternative systems to Kafka?

A: We may use RabbitMQ or Amazon SQS, but Kafka is the best choice because it supports sharding (partitions). RabbitMQ and Amazon SQS do not support sharding

When there is more traffic than expected, we can easily add more shards and more consumers

Q: Why use Spark?

A: Not familiar with it. => Replace with a message queue consumer

Q: Choices of log search UI:

A: We can use Splunk.

Interviewer: query interpretation

[7:39]

Q: How to support metrics / structured search? Want to visualize / analyze the metrics

A: Add a separate queue for metrics collection

Add a real-time processing service.

Q: Want to see p90 latency per minute as a curve

Add a Grafana dashboard

We may need a time series database

Not sure which time series database is best

[audience discussion: use Prometheus]
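
To make the p90-per-minute request concrete, here is a small sketch that buckets latency samples by minute and takes the nearest-rank 90th percentile. The input shape `(unix_seconds, latency_ms)` and the function name are assumptions; a time series database or Grafana query would normally compute this.

```python
from collections import defaultdict


def p90_per_minute(samples):
    """samples: iterable of (unix_seconds, latency_ms).

    Returns {minute_start_seconds: p90_latency_ms}, one point per minute,
    which is the series a dashboard would plot as a curve.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[int(ts) // 60 * 60].append(latency_ms)

    result = {}
    for minute, values in sorted(buckets.items()):
        values.sort()
        # Nearest-rank percentile: rank = ceil(0.9 * n), in integer arithmetic.
        rank = (9 * len(values) + 9) // 10
        result[minute] = values[rank - 1]
    return result
```

Keeping every raw sample per minute is exact but memory-hungry. Systems at scale usually keep histograms or sketches instead, trading a little accuracy for bounded memory.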

Q: How to delete data? Compliance needs to reduce the duration of storage

A: S3 or Elasticsearch?

S3 supports expiration

Move to Glacier via lifecycle configuration

Interviewee: To search logs across different systems, the application side can add a request ID / request context to each entry. We can then join logs using that UUID
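
A short sketch of joining logs by request ID, assuming each log line is a JSON object with hypothetical `request_id` and `ts` fields (the notes do not specify a log format).

```python
import json
from collections import defaultdict


def group_by_request_id(log_lines):
    """Groups JSON log lines from any number of services by request_id."""
    traces = defaultdict(list)
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if not isinstance(record, dict):
            continue
        request_id = record.get("request_id")
        if request_id is not None:
            traces[request_id].append(record)
    # Order each request's entries by timestamp to reconstruct its path.
    for records in traces.values():
        records.sort(key=lambda r: r.get("ts", 0))
    return dict(traces)
```

This only works if every service propagates the same ID, typically by passing it along in a request header and including it in each log entry.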

===

Kafka

Default maximum message size is about 1 MB

When messages are too big, we may have head-of-line (HOL) blocking issues

A large message may block other messages from entering

Some services may have traffic peaks

Usually there is a buffer at the agent. When the buffer fills up, the agent writes from the buffer to the message queue

If a message is 10 MB, the agent can split it into smaller pieces
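
Splitting an oversized message could look like the sketch below. The chunk envelope (`index`, `total`, `data`) and the 1,000,000-byte chunk size are assumptions chosen to stay under the roughly 1 MB limit mentioned above. A real implementation would also carry a message ID so chunks from different messages are not mixed.

```python
def split_message(payload: bytes, max_chunk_bytes: int = 1_000_000):
    """Splits a payload into ordered chunks no larger than max_chunk_bytes."""
    # Ceiling division; an empty payload still yields one (empty) chunk.
    total = max(1, -(-len(payload) // max_chunk_bytes))
    return [
        {
            "index": i,
            "total": total,
            "data": payload[i * max_chunk_bytes:(i + 1) * max_chunk_bytes],
        }
        for i in range(total)
    ]


def join_chunks(chunks):
    """Reassembles chunks that may have arrived out of order."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    if not ordered or len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    return b"".join(c["data"] for c in ordered)
```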

===

Push vs pull:

Using push model. Should we add a load balancer in front of push?

Kafka may adjust partition automatically, but this may overload Kafka

We may want to reduce the dependency on 3rd party components

Upgrading Elasticsearch and Kafka may introduce new issues

Different OSes: there may be compatibility issues across operating systems

Push is preferable

If we grow the Kafka cluster, clients can use DNS / CNAME to look up the new Kafka machines.

Can ask if this is an internal service or external service.

Suggestion: Add LB between agent and Kafka

LB cluster

Different data centers

Ordering is not critical for this problem because log file entries contain timestamps.

The agent may need to aggregate before pushing, combining smaller requests into a batch

Requirements: do we need to handle peak traffic?

If we add LB, then rate limiter can be supported at LB

Handling peak requests

Rate limit: we may wait till we have quota to send

Load shedding: directly drop requests

Back pressure: ask the agent to reduce load
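
The difference between load shedding and back pressure can be sketched with a bounded buffer on the ingest side. The class name, the `high_watermark` threshold, and the returned status strings are all hypothetical. Rate limiting, the third option, would sit in front of this buffer.

```python
from collections import deque


class IngestBuffer:
    """Bounded buffer that signals back pressure before it starts shedding."""

    def __init__(self, capacity, high_watermark):
        self.capacity = capacity
        self.high_watermark = high_watermark
        self.queue = deque()
        self.dropped = 0

    def offer(self, record):
        """Returns 'accepted', 'slow_down', or 'dropped'."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # load shedding: drop the record outright
            return "dropped"
        self.queue.append(record)
        if len(self.queue) >= self.high_watermark:
            return "slow_down"  # back pressure: accepted, but ask the agent to slow down
        return "accepted"

    def drain(self, n):
        """Removes and returns up to n records for downstream processing."""
        out = []
        while self.queue and len(out) < n:
            out.append(self.queue.popleft())
        return out
```

Since the requirements tolerate data loss under heavy load, shedding is an acceptable last resort here. Back pressure lets a well-behaved agent keep buffering locally before anything is lost.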

Elasticsearch UI

Kibana is the default UI for Elasticsearch

Elasticsearch supports time series

Time series database

can support better aggregation

is more optimized for storing numbers

compresses time series data more effectively

Elasticsearch is more optimized for text search

We may add Spark in front of the time series database

It can pre-aggregate data, e.g. to minutes, to reduce the data load to downstream systems
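
As an illustration of the pre-aggregation step, the following sketch rolls raw data points up into one summary row per minute. It is plain Python rather than Spark, and the row fields (`count`, `sum`, `min`, `max`) are assumptions. The point is only that downstream systems receive one row per minute instead of every raw sample.

```python
def rollup_by_minute(points):
    """points: iterable of (unix_seconds, value).

    Returns a list of per-minute summary rows sorted by minute.
    """
    acc = {}
    for ts, value in points:
        minute = int(ts) // 60 * 60
        row = acc.get(minute)
        if row is None:
            acc[minute] = {
                "minute": minute,
                "count": 1,
                "sum": value,
                "min": value,
                "max": value,
            }
        else:
            row["count"] += 1
            row["sum"] += value
            row["min"] = min(row["min"], value)
            row["max"] = max(row["max"], value)
    return [acc[m] for m in sorted(acc)]
```

Counts, sums, minimums, and maximums can be merged again later (for example into hourly rows), whereas percentiles cannot be recovered from these fields alone.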

In practice, it appears Elasticsearch can support time series of metrics. Response time is about 10 seconds for logs from XX,XXX machines.

Raw data is stored in S3

Elasticsearch builds an index on top of the raw data

The Elasticsearch index can be saved in S3 or EBS

When sending to Kafka, the agent sends to an HTTP endpoint.
