Log Collection System
Materials — open to everyone, no sign-in
Topic: Log Collection System
Interviewer: system
Level: L5 (Senior)
Additional Resources:
System Design Interview
Join Us on Wechat
Interview Notes:
[7:07]
Requirements
Functional
Install log agent service
Share internal work, we don’t need to worry about network bandwidth
Start from logs search
Logs should be saved for search for about 30 days
Non functional
Tolerate data loss when too many logs
Tolerate latency
Scalability
Out of scope
No auth
Archiving / Cold storage
[7:16]
Design
Top level alternatives: Push vs pull model
Push:
install an agent. Advantage: Server can handle the load
Pull:
Pros: server side can process on its own speed
Cons: security issue. Complexity from management of which server pulls from where.
Push log to Kafka
Discussion: Quota for different client?
Default: 50qps, 100 kb
Configurable
[7:21]
Interal API:
PUT updateServiceQuota(String service, int quota)
Not important
Add Rate limiter service
Q: how to support search?
A: add spark / real time processing, pull from kafka and write it to search service
[7:24]
Add log search UI and LB
Add archive service, and archive log into S3
Improve performance:
Batch processing, and compress. (run compress mode in kafka)
Q: Why we write a separate log agent, instead of just sending log directly from the applications
A: we can have 1 team to maintain it. We can also use one programming language
We can run agent into a container. We need to know the offset how much log has been sent.
Need an offset file
[7:32]
Q: Are there alternative systems to Kafka
A: We may use RabbitMQ or AmazonSQS, but Kafka is the best choice because it supports sharding. RabbitMQ and AmazonSQS does not support shard
Sometimes there may be more traffic than expected, we can easily add more shards and more consumers
Q: Why use Spark?
A: not familiar with. => replace with a message queue consumer
Q: Choices of log search UI:
A: We can use Splunk.
Interviewer: query interpretation
[7:39]
Q: how to support metrics / structured search? Want to visualize / analyze the metrics
A: add a separate queue for metrics collection
Add real time processing service.
Q: want to see p90 latency of every minute by a curve
Add grafana dashboard
We may need a time series database
Don’t know which one is best for time series database
[audience discussion: use prometheus]
Q: How to delete data? Compliance needs to reduce the duration of storage
A: S3 or elastic search?
S3 supports expiration
Move to glacier by configuration
Interviewee: Search log across different systems. On application side, we can add request ID / request context. We can join logs using UUID
===
Kafka
Default message size 1MB
When the messages are too big, we may have HOL issues
A large message may block other messages from entering
Some service may have a peak
Usually there is a buffer at the agent. When buffer is filled, then agent will write from buffer to message queue
If 10MB, then can divide
===
Push vs poll:
Using push model. Should we add a load balancer in front of push?
Kafka may adjust partition automatically, but this may overload Kafka
We may want to reduce the dependency on 3rd party components
Upgrading of elastic search and kafka may generate new issues
Different OS - there may be compatibility issues between different OS
Push is more preferable
If we increase kafka cluster, we can use DNS / CNAME for clients to look up new Kafka machines.
Can ask if this is an internal service or external service.
Suggestion: Add LB between agent and Kafka
LB cluster
Different data centers
Ordering is not critical for this problem. Log file entries contain timestamp.
The agent may need to aggregate before pushing. Combine smaller requests into a batch
Requirements: do we need to handle peak traffic?
If we add LB, then rate limiter can be supported at LB
Handling peak requests
Rate limit - we may wait till we have quota to send
Load shedding - directly drop request
Back pressure - ask agent to reduce load
Elastic search UI
Kibana is the default UI for elastic search
Elastic search supports time series
Time series database
can support better aggregation
is more optimized to store numbers
Is more compressed to store time series data
Elastic search is more optimized for text search
We may add spark in front of time series database
It can pre-aggregate data, e.g. to minutes, to reduce the data load to downstream systems
In practice, it appears elastic search can support time series of metrics. Response time is about 10 seconds for logs from XX,XXX machines.
Raw data is stored in S3
Elastic search builds an index on top of raw data
Elastic search index can be saved in S3 or EBS
When sending to Kafka, it sends to the HTTP endpoint.
1:24:49:10
1:25:47:09