Distributed Log Collection (used)

Topic: Distributed Log Collection (used)

Interviewer: Zhi Liu

Interviewee: Jiayan

Level: L5 (Senior)

Additional Resources:

Distributed Log Processing

Target level: L5 (Senior Engineer)

Duration: 45 minutes

Topic covered: Distributed log collection

Functional requirements

Interviewer:

We run many servers that produce different types of log files on different machines. When there is an issue, the engineer needs to troubleshoot using the log files. We need a central system to collect and search logs.

The main focus today will be how to collect log into a central system.

We can install a log agent on the same machines as the services

Log generation and log collection in the same data center with the same internal network. No need to worry about network bandwidth

The core feature is to search the logs, and can extend to other functions

Log is saved for about 30 days.

Archive the data beyond 30 days in somewhere

Out of scope:

Authentication/authorization of the agents

We don’t need to worry about the bandwidth

Scaling requirements

We can tolerate some data loss. One server generates too much log files, so in some extreme cases we need to drop some log files.

We can tolerate some latency

The system needs to be scalable. It should handle log from all microservices in one company

System Design

The primary tradeoff to consider is using pull model or push model.

Pull model:

Pros:

central server can work at its own speed

Cons:

We need to open a port on each machine, which may create some security concerns.

We need to maintain the metadata which machines we need to pull log from

Push model:

Pros:

Easy to manage

Cons:

need consensus protocol to decide where to push the log files to.

We will assume a push model for the rest of our design.

We install a log agent on each node

The log agent sends the log to the kafka cluster

Interviewee: should we set the quota for each service?

Interviewer: yes, this should be configurable.

Interviewee: let’s assume we will build a configuration service to set the quota for each client. The configuration service is maintained by the logging team. We can provide an API to configure the client

Interviewer: let’s simplify and write the quota in a file

Interviewee:

We can enforce the quota in 2 ways:

leverage the quota system in kafka

Or we build a rate limiter service for the log agent to limit the amount of log it can send

1a. Log agent checks with rate limiter if it’s allowed to send log file

1b. Log agent compresses log and sends it to Kafka cluster

1c. Spark real time processing service reads the log from Kafka

1d. Spark real time processing service sends the log to elastic search

2a. User initiates a request to search the log file. The request is sent from search UI to the load balancer

2b. Load balancer forwards the search request to the search service

2c. Search service sends the request to elastic search service

3a. Archive service periodically reads from elastic search to retrieve the logs that are too old

3b. Archive service saves the old logs to an archiving storage, such as a bucket in Amazon S3

Interviewer: Why do we use log agents, instead of letting the application directly send the log files?

Interviewee:

We can use one log agent for 100 teams. The log agent can be maintained by one team.

We can use one language. If we let every team push the log directly, then we need to support the logging logic in multiple languages.

Centrally maintained log agent logic is easier to upgrade, e.g. add new functions such as rate limiter service. If we use Kubernetes we can add a log agent as a side cart to the container.

Interviewer:

Another benefit is to decouple the failure of the original service with the logging component. We can send log messages even if the original service fails. We can run the original service even if the log system fails.

Interviewee:

If the virtual machine running the original service and log agent fails, there may be a small amount of log that has not been sent to Kafka. To resolve this, we can write the offset of how much of the log file has been sent to Kafka. During recovery, we can start from the offset.

Interviewer: why do we need to use kafka? Are there other alternatives?

Interviewee: we can use other alternatives, such as RabbitMQ or amazon SQS. However,

Kafka is the best choice, because it can support sharding natively. Redis queue and amazon SQS doesn’t support shard. Here is why it’s important. We may need more shards to the consumer that generates a high volume of logs.

Interviewers: We can support sharding through RabbitMQ or Kinesis.

Interviewee:

Kinesis is similar to Kafka in terms of supporting sharding. However, RabbitMQ doesn’t support sharding out of the box, so the user put in extra effort to configure it.

Therefore, Kafka and Kinesis are good choices, and work similarly.

Interviewer: Why do we use spark for consuming output of Kafka?

Interviewee: Spark used widely for real-time processing, but I am not familiar with this part.

Interviewer: We should be able to write a Java component ourselves instead of Kafka.

Interviewee: Yes.

Interviewer: We save the data to elastic search. What does the search service box do?

Interviewee: Borrowing splunk as an example. We can make a complex query. The service needs to parse the query, then execute the query.

Interviewer: sounds like it’s for query interpretation.

Interviewee: Yes.

Interviewer:

application can emit different types of logs. One type of the log can contain metrics. For example, from the metrics logs may measure latency, and emit it as a number. It may contain additional information, for example service name, api name, number, and unit of measurement.

How will we support summarizing the statistics of the log files? For example, find out the average latency and 90 percentile latency?

Interviewee:

Because it’s a different type of logs. We can set up a separate Kafka for metrics collection so the log agent sends the log to a different kafka cluster for metrics collection.

We will need to do many time-based queries

Interviewer:

I may want to see the curve of latency distribution for last week.

Interviewee:

Let’s go to the UI first. Let’s say we have a Grafana metrics UI dashboard

The Grafana sends the request to a time series service. It can answer questions such as last month what is the p99 latency.

We can find a time series database.

As highlighted in the gray box above

4a. Log agent sends metrics data to a new kafka cluster

4b. Kafka sends the data to a time series database

4c. Users use Grafana to visualize the metrics. The request is sent to a load balancer

4d. Load balancer forwards the request to time series service

4e. Time series service queries the time series database for an answer.

Interviewer: if we have some logs stored for 30 days. How do we handle customer deleting data due to compliance?

Interviewee: usually we do not delete logs.

Interviewer: Let’s say initially we stored data in elastic search for 30 days, and in S3 for 90 days. However, the compliance team says we can only store for 50 days. How do we handle this situation?

Interviewee: S3 supports automatic expiration. We can change the expiration date from 90 days to 50 days. There are automatic triggers in S3, which we can also leverage to archive the data from S3 into Glacier.

Interviewer: anything to add?

Interviewee: We should generate UUID for a request context. We can write the UUID as part of the log. We aggregate log into elastic search. We can search across different services using UUID.

Distributed Log Processing: Audience Feedback

Soft skills

Hard skills

Interviewer Feedback

Strength:

It’s a workable solution

Did well in requirement gathering, gathered requirements thoroughly.

The interviewee was fluent and smooth

Area for Improvement:

The interviewee asked whether we can install an agent during the requirement gathering phase. This felt like an implementation detail.

The interviewee could gather more requirements, such as log size. For example, if we handle application log, the size limit can be a few hundred to 1 thousand characters per line.

The interviewee jumped too quickly into a specific solution from a particular vendor, for example, spark real time processing, kafka and elastic search. The interviewee could describe the solution in general first, then drill down to specific solutions.

If I encountered answers with specific choices of technology, I usually ask more specifically why the interviewee chooses the specific solution. Today, I found out the interviewee had experience with most of the solutions proposed, other than the spark stream processing. As such, the interviewee did well as an individual contributor candidate.

Interviewee Self Feedback

I should have asked about log size. I assumed too much based on my own experience.

I actually was not very familiar with ElasticSearch. I could add some explanation for the part that I am knowledgeable about, such as indexing.

Audience Feedback

Audience: the interviewee should clarify if it is application log, machine log that contains metrics, or audit log.

Interviewer: At the beginning, I brought up an application log. The interview could have confirmed it.

Audience: we should clarify the latency requirements

Interviewer: As we discussed at the beginning of the interview, we could tolerate some delays

Audience: We have not calculated the size of the storage

Interviewer: I am not too concerned about it.

Interviewee: I mentioned scalability is part of the requirement. Kafka can scale out of the box so calculation does not help the design as much.

Audience: What is the design bottleneck?

Interviewer: S3 and kafka provide scalability out of the box. The bottleneck is probably from the need to automatically adjust the sharding to scale up Kafka. We can use a configuration server to automatically adjust the sharding of Kafka. We may hit the bottleneck of Kafka at some point.

Audience: How do we handle traffic spikes?

Interviewer: Because we have an agent, the agent can smooth out the traffic spike, as it sends logs every minute.

Interviewee: we also have a rate limiter to throttle the agent from sending too many requests.

Interviewer: I don’t quite agree with the need of the rate limiter because the agent can send the log every minute.

Interviewee: We still need the rate limiter, because there is a situation where the agent cannot catch up with the output of the log files.

Interviewer: right.

Audience: Do we need to save the log to a database, and send the index to Kafka? Kafka may be too expensive to use.

Interviewee: Capacity of Kafka is usually bigger than normal databases. There is no problem writing to Kafka directly.

Interviewer: Kafka may be less expensive; databases can be more expensive, because it’s usually designed for long term storage.

Audience: We should estimate the storage. We cannot assume Kafka takes care of the scalability out of the box.

Audience: Kafka scalability does not scale up automatically. Kinesis can scale up dynamically better.

Audience: we probably don’t need to dig in deeply in the issue of scalability, unless the interviewer drills down on it. The interviewee should spend time on overall design.

Audience: Do we need to worry about data delivery to deliver at least once vs exactly once? Where do we dedupe data if we set up kafka to deliver at least once?

Interviewer: Dedupe is an important point, just I didn’t ask about it. Dedupe can happen at the processor.

Interviewee: because we can tolerate data loss, when we set up kafka, we will set it up using the “ack=0” configuration. When we send a message to kafka, we don’t worry if kafka has completed processing the data. In some extreme cases, such as Kafka master goes down. We probably don’t need dedupe.

Audience: How do we truncate S3? Interviewee mentioned S3 has TTL.

Interviewer: I am fine with using TTL in S3. I am not sure what happens if we set it to 90 days, then we set it to 60 days. It’s not a big concern. The problem can come up upon customer requests to delete.

Audience: What is an audit log?

Interviewer: We capture information such as during authentication, at some point in time, a user does something. It is used by the information security team.

Audience: It feels we can use elastic search or a time series database

Interviewer: Elastic search can create a reverse index, and provide free text search. The time series database is better at metrics aggregation, such as 90% latency. They solve different problems.

Audience: do we need to save all information into elastic search?

Audience: Real time processing can filter out some part that is not needed

Interviewee: usually we don’t filter, based on splunk convention.

Audience: do we save data into the database in addition to elastic search?

Interviewer: we may be able to save to an additional database during realtime processing.

Audience: How do we handle application log versus machine log differently?

Interviewee:

Machine log is usually for metrics. The data will be sent to the time series database.

Application log is usually for unstructured data. The data will be sent to elastic search.

See the two gray boxes in the picture, one labeled with “machine log route”, and another with “application log route”.

Audience: What if we want to ensure that no log can be lost?

Interviewee:

If it cannot be lost, then we can save the log in s3, which can guarantee no data loss.

If we use kafka, we should pay attention to acknowledgement mode setup: ack = 0, ack = 1, ack = all. We will have a tradeoff between providing low latency versus providing durability.

Ack = 0. With such configuration, master crashes lead to data loss.

Ack = 1: we write to write ahead log. If the node crashes then we can recover from the write-ahead log

Ack = all: requires a quorum of multiple nodes to acknowledge they received the data.

Audience: we can consider using Prometheus to support the pull model. We can open the Prometheus server port on the node. Grafana can read from prometheus as a data source