AI Automation Framework
Topic: AI Automation Framework
Interviewer: Li
Level: L6 (Staff)
Additional Resources:
System Design Interview - Distributed System
8/21/2024
YouTube for the event:
https://www.youtube.com/watch?v=-7hkkRGsCa0
Coach Ken LinkedIn: https://commitway.com/linkedin
Career Advancement Club (职场提升俱乐部)
Requirement
[41:39]
[connect to external services, auth]
[38]
[36]
[34]
[32]
High level design
[30]
[28:45]
API
[27:29]
DB schema of workflows
[26]
May not fit into one database instance
NoSQL database
Partition key: workflow ID
Sort key: creation_timestamp
[timeout?]
[workflow run in DB?]
[24]
[20]
Workflow service to schedule first run
Worker to schedule new runs
[crash recovery of workers?]
[19]
Workflow scheduler maintains a heartbeat with workers
[how to scale worker pool?]
Pull model or push model?
Pull model, because with push it is messy to keep track of worker status
[16]
Add message queue
Pull model: easier for maintenance
If the heartbeat fails 3 times, assume the worker is dead
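A minimal sketch of the heartbeat rule above, in Python. The 3-miss threshold comes from the notes; the 10-second interval, the class name `HeartbeatMonitor`, and its methods are assumptions made for illustration. Time is passed in explicitly so the logic is easy to check.

```python
from dataclasses import dataclass, field

MAX_MISSED = 3  # from the notes: 3 failed heartbeats => worker assumed dead


@dataclass
class HeartbeatMonitor:
    interval_s: float = 10.0  # assumed heartbeat period, not from the notes
    last_seen: dict = field(default_factory=dict)  # worker_id -> last heartbeat time

    def record(self, worker_id: str, now: float) -> None:
        """Called whenever a heartbeat from worker_id arrives."""
        self.last_seen[worker_id] = now

    def missed(self, worker_id: str, now: float) -> int:
        """Whole heartbeat intervals elapsed since the last heartbeat."""
        return int((now - self.last_seen[worker_id]) // self.interval_s)

    def dead_workers(self, now: float) -> list:
        """Workers that have missed at least MAX_MISSED heartbeats."""
        return sorted(w for w in self.last_seen if self.missed(w, now) >= MAX_MISSED)
```

The scheduler would call `dead_workers` periodically and hand the result to whatever component reschedules the affected runs.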
[pull model?]
[13:35]
Select pull
Each worker runs multiple Docker containers, which gives some level of security
How do we know the status?
Worker will update status
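A rough sketch of the pull model with worker-side status updates, in Python. The standard-library `queue.Queue` stands in for the message queue and a plain dict stands in for the workflow run DB; the function name `worker_loop`, the status strings, and the `execute` callback are all assumptions, not part of the discussed design.

```python
import queue


def worker_loop(worker_id, run_queue, status_db, execute):
    """Pull runs until the queue is empty, recording status for each run."""
    while True:
        try:
            run_id = run_queue.get_nowait()  # the worker pulls; nothing is pushed to it
        except queue.Empty:
            return
        status_db[run_id] = ("RUNNING", worker_id)
        try:
            execute(run_id)
            status_db[run_id] = ("SUCCEEDED", worker_id)
        except Exception:
            # The worker itself reports the failure; retry policy lives elsewhere.
            status_db[run_id] = ("FAILED", worker_id)
```

Because each worker only takes work when it is free, the scheduler does not need to track per-worker capacity, which is the maintenance advantage noted for the pull model.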
[11]
Relational or non-relational?
Non-relational: scale is very large
Still need strong consistency
Risk: a workflow run could be scheduled twice, on two separate machines
WorkerID acts as a lock
Conditional update
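A small sketch of using WorkerID as a lock through a conditional update, in Python. The in-memory `RunTable` is a stand-in for the workflow run table, and its method names and status values are hypothetical. In DynamoDB the same idea would typically be a conditional write (for example with a condition such as `attribute_not_exists`); in sharded MySQL, an `UPDATE ... WHERE worker_id IS NULL`.

```python
import threading


class RunTable:
    """In-memory stand-in for the workflow run table."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # models the store's per-row atomicity

    def put_run(self, run_id):
        with self._lock:
            self._rows[run_id] = {"worker_id": None, "status": "SCHEDULED"}

    def claim(self, run_id, worker_id):
        """Set worker_id only if no worker owns the run yet; True on success."""
        with self._lock:
            row = self._rows[run_id]
            if row["worker_id"] is not None:
                return False  # condition failed: another worker holds the lock
            row["worker_id"] = worker_id
            row["status"] = "RUNNING"
            return True

    def owner(self, run_id):
        with self._lock:
            return self._rows[run_id]["worker_id"]
```

If two machines pick up the same run, only the first `claim` succeeds; the second sees the condition fail and drops the run.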
[9:23]
DynamoDB or sharded MySQL
To ensure strong consistency
Partition key?
Workflow run vs scheduled time
Workflow run: easy to look up a given run, but finding runs that are due needs a full table scan
Scheduled time: easy to find runs that are due. Hard to query by run.
Scanning for due runs is more frequent, so optimize for the scan: partition by scheduled time
Secondary index. Sharded SQL or DynamoDB, strong consistency
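A sketch of the key choice above, in Python: runs are partitioned by scheduled time, with a secondary index by run ID for direct lookups. The one-minute bucket width, the class name `RunStore`, and its methods are assumptions for illustration; the notes do not fix a bucket size.

```python
from collections import defaultdict

BUCKET_S = 60  # assumed bucket width: one partition per scheduled minute


class RunStore:
    def __init__(self):
        self.by_bucket = defaultdict(dict)  # bucket -> {run_id: scheduled_ts}
        self.by_run_id = {}                 # secondary index: run_id -> bucket

    def put(self, run_id, scheduled_ts):
        bucket = scheduled_ts // BUCKET_S
        self.by_bucket[bucket][run_id] = scheduled_ts
        self.by_run_id[run_id] = bucket

    def due(self, now_ts):
        """Runs in the current bucket whose scheduled time has passed.

        A fuller scanner would also sweep earlier buckets for leftovers.
        """
        runs = self.by_bucket.get(now_ts // BUCKET_S, {})
        return sorted(r for r, ts in runs.items() if ts <= now_ts)

    def get(self, run_id):
        """Look up a run's scheduled time through the secondary index."""
        return self.by_bucket[self.by_run_id[run_id]][run_id]
```

The frequent operation (`due`) touches a single partition, while the rarer lookup by run ID pays for one extra index hop.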
Transactions: cannot commit more than 100 records in one transaction
[5]
How can worker retry the job?
If a worker has failed, its runs need to be rescheduled. The management service should update the workflow run DB.
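A sketch of the rescheduling step, in Python. The row layout (`worker_id`, `status`, `attempts`) and the function name `reschedule_runs` are hypothetical. The sketch restarts a run from the beginning and does not try to preserve intermediate results.

```python
def reschedule_runs(run_db, dead_worker_id):
    """Release every RUNNING run owned by dead_worker_id so it can be claimed again."""
    rescheduled = []
    for run_id, row in run_db.items():
        if row["worker_id"] == dead_worker_id and row["status"] == "RUNNING":
            row["worker_id"] = None       # release the WorkerID lock
            row["status"] = "SCHEDULED"   # make the run visible to other workers
            row["attempts"] = row.get("attempts", 0) + 1
            rescheduled.append(run_id)
    return sorted(rescheduled)
```

Tracking `attempts` is one way to cap retries so that a run that keeps killing workers is eventually parked for investigation.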
[ missing intermediate result ]
[2:48]
Worker mgmt service is a single point of failure
Need a good monitoring system. We may need manual investigation to reload the job
Or automatic failover
Automatic failover is a little risky. May prefer an engineer to investigate.
[ time is up ]
Worker mgmt service.
317 jobs/second
7:09-
100k concurrent runs
7:09-7:11
Not too sure about today’s performance
The system is complex
Availability
Data model
Not very confident about the design
==
Can retry be part of the workflow definition?