Job scheduler
Topic: Job scheduler
Interviewer: Jokerfly
Interviewee: Lu
Level: L5 (Senior)
Mock System Design Interview Summary
Interview Overview
Date: 10/17/2021
Target level: senior engineers (L5/SDEIII)
Duration: 1 hour
Topic covered: Build a job scheduler system to execute ML jobs, in the range of millions of jobs per day
Diagram tool used: draw.io
Requirements
Functional requirements
Throughput: about 10M jobs per day
Schedule pattern
Run at least once
Cron jobs: hourly/daily/weekly/monthly
Non-functional requirements
Highly available: can schedule at any time
Highly scalable
High durability
Requirement clarification during the interview:
Interviewer: Clients submit jobs.
Interviewer: Output goes to disk or cloud storage.
Interviewee: Can the client see the result? Interviewer: That is not a high-priority feature.
Interviewee: Does a job run exactly once, or can it run multiple times? (Idempotency) Interviewer: It is acceptable for a job to run multiple times.
System Design
System design diagram
Single responsibility principle for defining the services
Job Planner: accepts a new job definition from the client and fans the job out into tasks. Writes records for jobs and tasks.
Example: scheduledStartTime = 10/20 9am (stored as epoch time)
jobIntervals = 1 week
jobRecurringTime = 6 times
Result: 6 records in the “task table” (6 different tasks)
Job Scanner: periodically picks up the tasks that are ready to run.
API:
Initial design:
schedule(taskDescription, scheduledStartTime)
After some discussion with the interviewer, changed to:
schedule(taskDescription, scheduledStartTime, jobIntervals, jobRecurringTime)
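The fan-out that the Job Planner performs under the final API can be sketched as follows. This is a minimal illustration in Python, not the design produced in the interview: the Task record, the fan_out helper, the field names, the interval expressed in seconds, and the year 2021 for the 10/20 9am example are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative only.
    job_id: str
    sequence: int
    scheduled_start_time: int  # epoch seconds
    status: str = "Scheduled"


def fan_out(job_id: str, scheduled_start_time: int,
            job_interval: int, job_recurring_time: int) -> List[Task]:
    """Create one task per occurrence, spaced job_interval seconds apart."""
    return [
        Task(job_id, i, scheduled_start_time + i * job_interval)
        for i in range(job_recurring_time)
    ]


WEEK_SECONDS = 7 * 24 * 3600
# 10/20 9am as epoch seconds; the year and the UTC time zone are assumed.
start = int(datetime(2021, 10, 20, 9, 0, tzinfo=timezone.utc).timestamp())

# A weekly job that recurs 6 times yields 6 task records.
tasks = fan_out("job-1", start, WEEK_SECONDS, 6)
```

Materializing every task up front keeps the scanner simple, at the cost of writing one record per occurrence when the job is submitted.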
Database schema:
Reason for choosing DynamoDB: a NoSQL store was considered easier to scale horizontally.
Job states:
“Scheduled” -> “Enqueue” -> “Claimed” -> “Processing” -> “Successful/Failed”
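One way to enforce this lifecycle is an explicit transition table. The sketch below is an assumption about how the states could be guarded in code; the TaskState enum and transition helper are hypothetical names, and treating Successful and Failed as terminal (no retry transition) is a simplification that the interview did not settle.

```python
from enum import Enum


class TaskState(Enum):
    SCHEDULED = "Scheduled"
    ENQUEUE = "Enqueue"
    CLAIMED = "Claimed"
    PROCESSING = "Processing"
    SUCCESSFUL = "Successful"
    FAILED = "Failed"


# Allowed next states for each state, following the chain in the notes.
ALLOWED = {
    TaskState.SCHEDULED: {TaskState.ENQUEUE},
    TaskState.ENQUEUE: {TaskState.CLAIMED},
    TaskState.CLAIMED: {TaskState.PROCESSING},
    TaskState.PROCESSING: {TaskState.SUCCESSFUL, TaskState.FAILED},
    TaskState.SUCCESSFUL: set(),  # terminal (assumed)
    TaskState.FAILED: set(),      # terminal (assumed; no retry modeled)
}


def transition(current: TaskState, new: TaskState) -> TaskState:
    """Return the new state, or raise if the move is not in the chain."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Rejecting illegal moves, such as Scheduled straight to Processing, makes it harder for a buggy component to skip the Claimed step.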
Discussions During the Interview
Interviewer: How is the definition of the job submitted?
Interviewee:
“Scheduled” -> “Enqueue” -> “Claimed” -> “Processing” -> “Successful/Failed”
Interviewer: How often do we scan?
Interviewee: The task scanner will actively query for tasks that are ready to run. Is a 5-minute interval fine?
Interviewer: 5 minutes is acceptable given AI tasks take hours to run.
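A single scan pass can be sketched as below. This is a rough illustration under stated assumptions: the task table is modeled as an in-memory list of dictionaries, scan_due_tasks is a hypothetical helper, and the load estimate assumes one task per job spread evenly over the day, which gives roughly 35,000 tasks per 5-minute scan at 10M jobs per day.

```python
from typing import Dict, List

SCAN_INTERVAL_SECONDS = 5 * 60  # interval agreed in the discussion
JOBS_PER_DAY = 10_000_000       # from the functional requirements

# Average tasks per scan, assuming one task per job and an even spread.
SCANS_PER_DAY = 24 * 3600 // SCAN_INTERVAL_SECONDS
AVG_TASKS_PER_SCAN = JOBS_PER_DAY / SCANS_PER_DAY


def scan_due_tasks(task_table: List[Dict], now: int) -> List[Dict]:
    """Pick up tasks that are due and still in the Scheduled state."""
    due = [
        t for t in task_table
        if t["status"] == "Scheduled" and t["scheduled_start_time"] <= now
    ]
    for t in due:
        t["status"] = "Enqueue"  # hand over to the next stage
    return due


# Illustrative data: one overdue task, one future task, one already done.
table = [
    {"id": "a", "scheduled_start_time": 100, "status": "Scheduled"},
    {"id": "b", "scheduled_start_time": 900, "status": "Scheduled"},
    {"id": "c", "scheduled_start_time": 50, "status": "Successful"},
]
picked = scan_due_tasks(table, now=300)
```

Selecting everything at or before the current time, rather than only the last 5 minutes, means a task missed by a failed scan is still picked up by the next one.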
Interviewer: The task scanner may fail; it is a single point of failure.
Interviewee:
Partition based on scheduled start time.
GSI (global secondary index) on ScheduledStartTime.
GSI on status, to filter out completed tasks.
Main idea: tasks scheduled for 9:00am and 9:10am should belong to the same shard.
Shard at hour granularity.
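The hour-level sharding idea can be illustrated with a small key function. This is a sketch under assumptions: hour_bucket is a hypothetical helper, timestamps are epoch seconds in UTC, and the example date is chosen only for illustration. In a DynamoDB-style layout the bucket could serve as the index partition key, with the exact start time as the sort key.

```python
from datetime import datetime, timezone


def hour_bucket(scheduled_start_time: int) -> int:
    """Round an epoch-second timestamp down to the start of its hour."""
    return scheduled_start_time - scheduled_start_time % 3600


def ts(hour: int, minute: int) -> int:
    # Example timestamps on an assumed date (2021-10-20, UTC).
    return int(datetime(2021, 10, 20, hour, minute,
                        tzinfo=timezone.utc).timestamp())


nine = ts(9, 0)
nine_ten = ts(9, 10)
ten = ts(10, 0)

# 9:00am and 9:10am share a bucket; 10:00am falls into the next one.
same_shard = hour_bucket(nine) == hour_bucket(nine_ten)
next_shard = hour_bucket(ten) != hour_bucket(nine)
```

With this key, a scanner reads one bucket per hour instead of scanning the whole table.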
Interviewer:
The GSI makes scanning more efficient.
However, my concern is failure of the task scanner. Should we have 1 or 10 task scanner hosts?
Interviewee:
Yes, we can run multiple task scanners.
Interviewer:
Will the same task be picked up by all 10 task scanners? If so, the same task would be executed 10 times.
Interviewer and Audience Feedback After the Interview
Interviewee style:
The interviewee can consider driving the interview more, because the target level is senior.
Consider moving the API design closer to the beginning. This reduces back-and-forth discussion.
Scaling the job scanner:
There can be multiple scanners running.
Sharding between scanners can be based on a hash function, such that two scanners will not pick up the same job.
ZooKeeper can track which scanners are alive through heartbeat checks.
Consistent hashing can be used to scale up the job scanner count, since adding a scanner reassigns only a fraction of the jobs.
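The scanner-sharding feedback above can be sketched with a small consistent hash ring. This is an illustrative sketch, not the design from the interview: the ScannerRing class, the use of MD5 as the hash, and the 100 virtual nodes per scanner are all assumptions. Liveness detection (for example through ZooKeeper) is outside the sketch; it would only decide when to call add or remove.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _hash(key: str) -> int:
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ScannerRing:
    """Maps each task id to exactly one scanner on a hash ring."""

    def __init__(self, scanners: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: Dict[int, str] = {}  # ring point -> scanner name
        self._points: List[int] = []     # sorted ring points
        for scanner in scanners:
            self.add(scanner)

    def add(self, scanner: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{scanner}#{i}")
            self._ring[point] = scanner
            bisect.insort(self._points, point)

    def remove(self, scanner: str) -> None:
        # Called when a scanner is detected as dead.
        for i in range(self._vnodes):
            point = _hash(f"{scanner}#{i}")
            del self._ring[point]
            self._points.remove(point)

    def owner(self, task_id: str) -> str:
        # First ring point clockwise from the task's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(task_id))
        return self._ring[self._points[idx % len(self._points)]]


ring = ScannerRing(["scanner-1", "scanner-2", "scanner-3"])
task_ids = [f"task-{n}" for n in range(1000)]
before = {t: ring.owner(t) for t in task_ids}

# Adding a scanner moves tasks only onto the new scanner.
ring.add("scanner-4")
after = {t: ring.owner(t) for t in task_ids}
moved = [t for t in task_ids if before[t] != after[t]]
```

Each scanner then processes only the tasks for which it is the owner, so two scanners do not pick up the same task while ring membership is stable.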
How to handle tasks that repeat indefinitely:
Schedule the next task each time the current task is scheduled.
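This suggestion can be sketched as follows: instead of writing every future occurrence up front, the system writes exactly one upcoming task record whenever the current one is picked up. The helper name enqueue_and_schedule_next, the dictionary-based task table, and the field names are hypothetical.

```python
from typing import Dict, List


def enqueue_and_schedule_next(task: Dict, task_table: List[Dict],
                              job_interval: int) -> Dict:
    """Enqueue the current task and create the next occurrence."""
    task["status"] = "Enqueue"
    next_task = {
        "job_id": task["job_id"],
        "sequence": task["sequence"] + 1,
        "scheduled_start_time": task["scheduled_start_time"] + job_interval,
        "status": "Scheduled",
    }
    task_table.append(next_task)
    return next_task


HOUR_SECONDS = 3600
table = [
    {"job_id": "job-1", "sequence": 0,
     "scheduled_start_time": 0, "status": "Scheduled"},
]

# Each pickup leaves exactly one future Scheduled task for the job.
second = enqueue_and_schedule_next(table[0], table, HOUR_SECONDS)
third = enqueue_and_schedule_next(second, table, HOUR_SECONDS)
pending = [t for t in table if t["status"] == "Scheduled"]
```

The table then holds a bounded number of pending records per job, even when the job has no end date.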
Separation of jobs and tasks: each job can be executed multiple times, and each execution is a task. Therefore:
Jobs and tasks can be modeled separately.
Job environment preparation can be done once, while tasks can execute multiple times in the same environment.
System scaling is predictable, and can be planned ahead of time
SQL vs NoSQL: SQL can be considered a valid alternative.
It is a better fit if complex queries are required in the future.
A MySQL database can scale well with sharding/partitioning.
Scaling system in general:
All components in the design should support multiple instances