Machine Learning System Architecture and Development Cycle
Topic: Machine Learning System Architecture and Development Cycle
Presenter: Junzhi He
Additional Resources:
Sign Up Form:
Job Referral
Candidates: https://commitway.com/job-refer
Hiring managers/team members: https://commitway.com/job-open
Meeting Summary
Machine Learning Architecture and Development Cycle
Why a system for machine learning is developed
Tools, workflow, hardware, architecture
We will explain key concepts
We will help people bridge software engineering and machine learning engineering
We will cover recruiting talent, motivating teams, and managing up as a machine learning manager
Outline:
What is machine learning
Basic ML components and architecture
Model development cycle vs SWE development cycle
The future of machine learning and AI
How do you learn a new topic? Distributed computing and machine learning
Misconceptions
Common issues:
Initially I didn’t have an overall knowledge of architecture.
I knew individual pieces, but it’s hard to know the overall picture
Machine learning is the same
When we learn through books, it’s dry.
Key issue is the knowledge is distant from day-to-day work
Then we doubt ourselves
The Foolish Old Man Removes the Mountains (愚公移山)
Goal is good
But the process is not advisable.
What problem did it solve?
What challenges does it bring?
Key reasons
What does distributed system solve?
Cost vs product quality tradeoff
Data size increases, requiring systems to process lots of data
2015: 6ZB
2020: 44ZB
2030: 2500ZB
Value of Individual records decreases
24/7 service is required due to competition
Need to support a company of thousands to tens of thousands of people
Single machine system
Frontend
Java web app
MySQL
Now the system is a lot more complex.
All nodes are replicated
The application server is split into a web layer and an application layer
Lots of problems from increasing scale
Where to find other servers?
How to distribute load?
Need to manage the system architecture with software, not by hand
Different servers serve the same user
What if a server crashes? Will we have an avalanche (cascading failure)?
How to monitor and maintain servers?
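The first questions above (where to find other servers, how to distribute load, what happens when a server crashes) can be sketched with a toy in-memory registry. This is only an illustration: the class name ServiceRegistry, its methods, and the addresses are hypothetical, and a real system would use a dedicated discovery service with health checks.

```python
from itertools import count


class ServiceRegistry:
    """Toy in-memory registry: service name -> instance addresses (hypothetical)."""

    def __init__(self):
        self._instances = {}  # service name -> list of addresses
        self._healthy = {}    # address -> True/False
        self._counters = {}   # service name -> round-robin counter

    def register(self, service, address):
        """Answer "where to find other servers": record an instance."""
        self._instances.setdefault(service, []).append(address)
        self._healthy[address] = True
        self._counters.setdefault(service, count())

    def mark_down(self, address):
        """Called by monitoring when a server crashes."""
        self._healthy[address] = False

    def pick(self, service):
        """Answer "how to distribute load": round-robin over healthy instances."""
        candidates = [a for a in self._instances.get(service, []) if self._healthy[a]]
        if not candidates:
            raise RuntimeError(f"no healthy instance for {service!r}")
        return candidates[next(self._counters[service]) % len(candidates)]


registry = ServiceRegistry()
for address in ("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"):
    registry.register("web", address)

registry.mark_down("10.0.0.2:8080")  # simulate a crash
print([registry.pick("web") for _ in range(4)])
```

Routing around the crashed server keeps requests flowing, but the surviving servers now carry more load, which is exactly how an avalanche can start if capacity is tight.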
Lots of problems with lots of data
Partition of data
Data replication
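Partitioning and replication can be sketched together. The example below is a minimal illustration, assuming hash-based partitioning and placement of each partition on consecutive nodes; the function names, node names, and the replication factor are made up for this sketch and are not from the talk.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a record key to a partition with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replicas_for(partition, nodes, replication_factor):
    """Place one partition on `replication_factor` consecutive nodes."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    start = partition % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
for key in ("user:1", "user:2", "user:3"):
    p = partition_for(key, num_partitions=8)
    print(key, "-> partition", p, "on", replicas_for(p, nodes, replication_factor=2))
```

Partitioning spreads the data so that no single machine holds everything; replication keeps each partition readable when one node is lost.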
Before you study, write down problems and classify
Division of work
Find service
Find instance
Data partition
Avalanche
Consensus
Replication
Distributed application logic
Distributed lock
Operation
Configuration center
Monitor and observability
We need to have an overall system to learn
Machine learning:
Most important point: Complex patterns: there are patterns to learn, they are complex, there are historical
Learn: system has capacity to learn
Existing data: data is available
etc
Try to discover patterns
Which one is easiest for an ML system to solve?
Will a released prisoner commit another offense?
Based on buying patterns, will the buyer break the contract in the future?
Based on watching behavior, what video will they watch in the future?
Based on a questionnaire, what is the support rate of each candidate?
Answer: using past pattern to predict future pattern
difficult:
Support rate: not an easy problem for an ML system
Ethics: second offense
Easier:
Buying behavior to predict the risk of the buyer
Watching behavior to predict what they will watch
5 more examples
Chess
Mahjong
Chinese chess
Texas hold'em poker
Go
The current state of the game + previous behavior: can decide your next best action.
Q: Previous behavior?
A: The psychology of the opponent
The experiment can be repeated an unlimited number of times
It is harder than the behavior prediction
Machine learning applications
Recommender system
Acquiring new customers
Increasing customer satisfaction
Increasing long term customer engagement
Generating customer intelligence
Predicting number of visitors
Reducing cost
Increasing customer satisfaction
Predicting demand fluctuations
Fraud detection
NLP and CV: interaction with customers
Basic steps for ML system development
Model: pattern for prediction
Deploy the model
Use the model
ML system deployment cycle
Explore and process:
collect historical data
clean and explore data
very important
Discover correlations between signals and results
prepare/transform.
Modeling:
Develop and train model
Validate/ evaluate model
Deployment
Deploy to production
Monitor and update model & data
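The cycle above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions: the records are made up, the model is a one-feature least-squares line, and "deployment" is reduced to a pass/fail gate on validation error. The function names and the max_error threshold are hypothetical.

```python
from statistics import mean


def clean(records):
    """Explore and process: drop records with a missing feature or label."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def split(records, validation_fraction=0.25):
    """Hold out the last part of the data for validation."""
    cut = int(len(records) * (1 - validation_fraction))
    return records[:cut], records[cut:]


def train(train_set):
    """Modeling: fit y = a*x + b by ordinary least squares (closed form)."""
    xs = [x for x, _ in train_set]
    ys = [y for _, y in train_set]
    x_bar, y_bar = mean(xs), mean(ys)
    a = sum((x - x_bar) * (y - y_bar) for x, y in train_set) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return a, y_bar - a * x_bar


def evaluate(model, validation_set):
    """Validate: mean absolute error on held-out data."""
    a, b = model
    return mean(abs(y - (a * x + b)) for x, y in validation_set)


def run_cycle(raw_records, max_error=0.5):
    train_set, validation_set = split(clean(raw_records))
    model = train(train_set)
    error = evaluate(model, validation_set)
    return model, error, error <= max_error  # deploy only if the gate passes


raw = [(0, 3), (1, 4), (None, 9), (2, 5), (3, None), (4, 7), (5, 8), (6, 9), (7, 10)]
print(run_cycle(raw))
```

In a real system the last step would also keep monitoring the deployed model and its input data, and re-run the cycle when either one changes.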
Challenges from ML systems:
Part1 challenges
Different data sources need to be easy to access
Easily queryable: an infrastructure requirement, such as the systems in the three classic Google papers (commonly understood as GFS, MapReduce, and Bigtable)
We must have data to train
Some requirements are vague:
Competitor can customize prices without losing satisfaction
You need to make some assumptions.
Lots of failures are due to assumptions
Complexity of computation. The speed is too slow
What is a model?
a*2 + b = 5
a*1 + b = 4
Linear model
min(abs(5 - 2a - b) + abs(4 - a - b)) => solution determines the best a and b
If I have too much data, how do I do distributed optimization of the model?
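One common answer is data-parallel gradient descent: each worker computes a gradient on its own shard of the data, and a coordinator combines the gradients and updates a and b. The sketch below simulates this in a single process. It rests on several assumptions: it uses squared error instead of the absolute error in the notes (the gradient is simpler), the two extra points (3, 6) and (0, 3) are made up to lie on the same line, and the names shard_gradient and fit_distributed are hypothetical.

```python
def shard_gradient(a, b, shard):
    """Worker step: summed squared-error gradient over one shard of (x, y) pairs."""
    grad_a = grad_b = 0.0
    for x, y in shard:
        residual = a * x + b - y
        grad_a += 2 * residual * x
        grad_b += 2 * residual
    return grad_a, grad_b, len(shard)


def fit_distributed(shards, learning_rate=0.05, steps=2000):
    """Coordinator: combine per-shard gradients, then update a and b."""
    a = b = 0.0
    for _ in range(steps):
        # In a real cluster each call would run on a different machine.
        results = [shard_gradient(a, b, shard) for shard in shards]
        n = sum(r[2] for r in results)
        a -= learning_rate * sum(r[0] for r in results) / n
        b -= learning_rate * sum(r[1] for r in results) / n
    return a, b


# The two equations from the notes, (x=2, y=5) and (x=1, y=4), plus two made-up points.
shards = [[(2, 5), (1, 4)], [(3, 6), (0, 3)]]
a, b = fit_distributed(shards)
print(round(a, 4), round(b, 4))
print(abs(5 - 2 * a - b) + abs(4 - a - b))  # the loss from the notes, close to 0
```

Because the per-shard gradients are summed and divided by the total record count, the update is the same as if one machine had seen all the data, so only gradients cross the network, never the raw records.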
Part2 challenges
After you deploy a model, performance is worse than experiment
After you deploy a model, performance worsened
Are you 100% sure the logic is correctly implemented?
What if engineers implemented different code than data scientists?
What if data scientists make a mistake?
How to version the model?
If the model has an error or there is a bias in data, how do we troubleshoot?
Part 3 challenges
What if one team's values don't align with the others'?
ML engineers
Sales teams - they want engagement. They want to recommend whatever generates the highest revenue
Product team - they want good user experience
ML platform team: biggest ask is not to change the platform
Managers - they want people to use the new ML model
Needs lots of negotiation with other teams
Solution to ML system challenges
“Hidden Technical Debt in Machine Learning Systems”
ML challenges
Slides link
ML System challenges by root cause
1. Correctness
2. Data access
3. Cannot train practically
4. Prediction speed is low
5. Change in underlying data
6. Workflow for development
7. Balance of team needs
Infrastructure problems: 2, 3
Architecture problem: 2, 3, 4
Operation and tool problems: 1, 5, 6
Management and collaboration (how to make tradeoffs, how to build a culture of data-driven decisions; data can help, and so can involving domain experts): 1, 7
ML’s own problem: 3, 4
Biggest problem is business impact
Revenue
Cost
Customer satisfaction
Computation speed and QPS
Usually:
REST API
parse HTTP request
prepare feature
load model
pass feature to model
Calculate result
Send back HTTP response
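The steps above can be sketched as a single handler function. This is only an illustration with no real web framework: the request body is a JSON string, the "model" is a hypothetical set of linear weights, and the field names age and visits are made up.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model():
    """Load the model once and reuse it (hypothetical linear weights)."""
    return {"weights": [0.5, 2.0], "bias": 1.0}


def prepare_features(payload):
    """Turn the parsed request into the feature vector the model expects."""
    return [float(payload["age"]), float(payload["visits"])]


def predict(model, features):
    """Calculate the result: a weighted sum plus a bias."""
    return sum(w * f for w, f in zip(model["weights"], features)) + model["bias"]


def handle_request(body):
    payload = json.loads(body)            # parse HTTP request
    features = prepare_features(payload)  # prepare feature
    model = load_model()                  # load model (cached after first call)
    score = predict(model, features)      # pass feature to model, calculate result
    return json.dumps({"score": score})   # send back HTTP response


print(handle_request('{"age": 2, "visits": 3}'))
```

Splitting the handler this way makes it easy to time each step separately and see which one is slow.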
What if the feature preparation or result calculation step is too slow?
Things to improve:
Network protocol
Data format
Model I/O
Model inference speed
Hardware improvement
Easy way to improve
Cache the model in GPU
Model compression:
Low-rank factorization. Reduce/eliminate some layers
Knowledge distillation
Pruning
Quantization - 32-bit float, 16-bit float, 8-bit int. If the lower precision doesn't degrade the result, then it works
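A minimal sketch of the quantization idea, assuming symmetric linear quantization to the int8 range. The function names and the weight values are made up, and real frameworks provide their own quantization tools; this only shows the arithmetic and how to measure the error it introduces.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.12, -1.5, 0.9, 0.0]  # hypothetical model weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

The worst-case rounding error per weight is about half of scale. Whether that is acceptable has to be checked on the model's actual metrics, as the note says.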
Reference Uber ML system
Pre-compute result
Replication and partition. Scalable
Hardware: TPU (Tensor Processing Unit)
GPU is better for 1 dimensional calculation
Model optimization
The most expensive. Need to optimize based on hardware and ML code
Matrix calculation optimization
Parallel computing
For loop vectorization
Assembly language optimization
https://lmax-exchange.github.io/disruptor/
Compiler optimization?
Usually optimize based on single thread (Single thread vs multi-thread)
Model serving
Batch inference or online inference?
Batch cannot handle latest behaviors
Serving speed vs. model freshness is a tradeoff
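The tradeoff can be sketched with a server that answers from a precomputed batch table and only runs the model online when fresher behavior is supplied. The class HybridServer, the toy model most_watched, and the data are all hypothetical.

```python
from collections import Counter


def most_watched(history):
    """Toy model: predict the category the user watched most often."""
    return Counter(history).most_common(1)[0][0]


class HybridServer:
    def __init__(self, model):
        self.model = model
        self.precomputed = {}  # user id -> prediction from the last batch run

    def run_batch(self, histories):
        """Offline job: score every known user ahead of time."""
        self.precomputed = {user: self.model(h) for user, h in histories.items()}

    def serve(self, user_id, fresh_history=None):
        """Fast path: table lookup. Slow path: run the model on fresh behavior."""
        if fresh_history is not None:
            return self.model(fresh_history), "online"
        return self.precomputed[user_id], "batch"


server = HybridServer(most_watched)
server.run_batch({"u1": ["news", "news", "sports"]})
print(server.serve("u1"))
print(server.serve("u1", fresh_history=["sports", "sports", "sports", "news"]))
```

The batch answer is a dictionary lookup but reflects only what was known at the last batch run; the online answer reflects the latest behavior at the cost of running the model inside the request.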
Data infrastructure
Data lake, data warehouse
Key questions to ask:
Is rawer data better, or smaller binary data?
Do you read all data or part of data
Do you need to read often or write often?
Is some data loss acceptable?
Row or column based?
What format? Readable or machine optimized
Data lake: row based, raw
Data warehouse: column based, formatted
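The row versus column question can be illustrated with plain Python structures. This is only a sketch of the two layouts with made-up records, not of any particular storage engine.

```python
# Row based: each record is stored together, which suits writes and whole-record reads.
rows = [
    {"user": "a", "country": "US", "spend": 10.0},
    {"user": "b", "country": "CN", "spend": 20.0},
    {"user": "c", "country": "US", "spend": 5.5},
]


def to_columns(rows):
    """Column based: each field is stored together, which suits analytical scans."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

total_from_rows = sum(row["spend"] for row in rows)  # touches every record
total_from_columns = sum(columns["spend"])           # touches one column only
print(total_from_rows, total_from_columns)
```

Both layouts hold the same data. The answers to "do you read all data or part of data" and "read often or write often" decide which layout fits.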
Data governance: complex
Data management like code management
Review
Versioning
Refer to Esensoft
What if a column is deleted?
We need to know the baseline
ML system architecture 3
Consistent environment leads to consistent output
Development environment vs production environment
Feature engineering
Workflow management: Airflow, Prefect (they are schedulers)
Resource management: Kubernetes (K8s), Kubeflow
Development lifecycle, CD for ML
Other topics
Model training
Model monitor
Continual learning
ML architecture 5.1
The most important question is how to integrate the ML system with other systems
What can be slow?
What must be fast? E.g. serving.
Delay tolerance to return prediction:
Online: 80ms
Nearline: 800ms
Offline: One week
Chip Huyen: Designing Machine Learning Systems
How do you unify online and offline
We should understand why there needs to be a workflow?
The main reason
CI/CD: standard issue for development environments
Software: code and dependencies
Hardware: environment
Accessory: environment
CI/CD: reduce error for test and verification
Developer
Code
How to test an ML model? Offline testing is easy. But how to test on prod? Answer: test in production, on live traffic.
How to test research? Experiment tracking.
How to version an ML model? How do we roll back? Answer: versioning. This is a tradeoff in model serving.
Code splits into data, model
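The versioning and rollback questions above can be sketched with a toy in-memory model registry. The class ModelRegistry, its methods, and the metadata fields are hypothetical; production teams would normally use an experiment tracker or a registry service instead.

```python
class ModelRegistry:
    """Toy registry: every model gets a version number and can be rolled back."""

    def __init__(self):
        self._models = {}   # version number -> (model, metadata)
        self._history = []  # promoted versions, oldest first; the last one is live

    def register(self, model, metadata):
        """Store a trained model together with what produced it."""
        version = len(self._models) + 1
        self._models[version] = (model, dict(metadata))
        return version

    def promote(self, version):
        """Make a registered version the live one."""
        if version not in self._models:
            raise KeyError(f"unknown model version {version}")
        self._history.append(version)

    def rollback(self):
        """Return to the previously live version."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live_version(self):
        return self._history[-1] if self._history else None


registry = ModelRegistry()
v1 = registry.register({"a": 1.0, "b": 3.0}, {"data": "snapshot-A", "code": "commit-1"})
v2 = registry.register({"a": 1.1, "b": 2.9}, {"data": "snapshot-B", "code": "commit-2"})
registry.promote(v1)
registry.promote(v2)
print(registry.live_version)
print(registry.rollback())
```

Recording the data snapshot and code revision next to each model is what makes a rollback meaningful: the earlier version can be reproduced, not just reloaded.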
Build:
Experiment, model, code together
Automation test
Release
Model, entire ML pipeline, image, code as artifact
CD for ML is more complex
Scheduler and lower layer orchestrators
ML jobs
Research, analysis, model and engineers
Org:
Collaboration
Culture: data/result driven decision.
Make ML result a reference
Build good relationships widely, and act swiftly and decisively (广结善缘,雷厉风行)
Is it a good chance to join ML?
ML is now mature; growth is slower than in the initial stage
Longer term there will be more growth
MLE:
Closer to model
Closer to engineering
Choose based on your background
Should ML change direction?
Depends on your business impact
ML vs SDE? ML is more specific
Is it time to enter high tech or ML?
Specialization - needs people with lots of experience
Get ready
Technical trend - easier to guess based on experience
Business trend - hard to capture
ML is not a bubble
Picking a company: the business can change. The core is the technical barrier
Books
Full Stack Deep Learning
Chip Huyen: Designing Machine Learning Systems
Udacity: MLE - nanodegree - pretty good as training class
Hung-yi Lee (李宏毅): YouTube - all concepts well explained
Wang Zhe (王喆): books and Geek Time (极客时间) column
Wang Shusen (王树森): YouTube
Mu Li (李沐): Bilibili and book - paper
StatQuest - statistical book
ritvikmath
The Beauty of Mathematics (数学之美)
Whiteboard Derivations (白板推导)
Google's classic paper: Hidden Technical Debt in Machine Learning Systems
https://netflixtechblog.com/system-architectures-for-personalization-and-recommendation-e081aa94b5d8