InstaCart Payment
Topic: InstaCart Payment
Interviewer: Ellie
Interviewee: 南杨
Level: L5 (Senior)
Additional Resources:
Shopping Service Payment System
Mock System Design Interview Summary
Overview
Date: 10/24/2021
Target level: L5
Duration: 1 hour
Topic covered: Payment system
Diagram tool used: whimsical.com
Requirements
Functional requirements
Shopping service
Role:customer, shopper
Order:service as blackbox
Customer -> withholding/validate
3-5 days
customer -> instacart (actual charge)
instacart->shopper (delivery,tip)
instacart->customer(for refund)
Check instacart ->grocery payment
auditibility
User Journey Walk-through
Non functional requirements
Accuracy
Consistency »availability
Latency: low latency for sync <500ms
Moderate latency for async < 10 seconds
Fault tolerant
System Design
System design diagram
API:
StatusCode =witholdFund(identity, amount, paymentDetail)
Status Code=makePayment(payer,payee,amount,paymentDetails,idompotencyKey)
Steps for system module calls:
Placing order:
Customer place order through webapp
Webapp -> payment service to withhold fund
Payment service -> payment profile db to store credit information
Payment service -> payment DB to save
Webapp calls orderservice to place order
Order processor service:
Order processor to scan completed order
Order processor -> message queue to insert payment for completed order
Async payment processor -> retrieve payment requests from queue
Asnc payment processor calls payment service to call bank to send payment
Database schema:
Schema for payment DB:
Order db
Additional design
Discussions during the Interview
Functional requirement discussion:
Interviewer: initial requirement doesn’t include grocery/merchant, why?
Interviewee: added
API:
Interviewer: Do we need merchant information for API? withholdFund()
Interviewee: merchant not needed yet until shopper buys from merchant
Audience: is withholdFund() call idempotent?
SQL vs noSQL
consistency
SQL database (MySQL)
noSQL app server, 2-phase-commit
Scaling:
Transaction throughput: 10M MAU => 3 TPS. conservatively 300TPS
Storage: 10M x 10 x 5KB x 12 x 5(years)=30TB
Archive data to lower grade database
Shard DB. We may be able to
shard by payment ID
Shard by order ID (more preferred due to other tables)
Payment processor flow:
Can lock multiple records
Or can lock a single record
Steps to make payment:
Lock record
Call bank for idempotency ID?
Call bank to transact
Payment DB: MySQL
What if we have 10x customers? Grows to 10M customers
MySQL, shard by database ID
Fault tolerance
Scaling:
Order processor scaling: no need to scan all status
We can have a separate DB
Idempotency:
Payment service requests bankPaymentID
Payment service then will pay based on bankPaymentID
Interviewer and Audience Feedback after the Interview
Interviewer:
Soft skill
Good pacing
22-25 requirement
27-37 high level diagram
37-42 database, db selection
6:45 capacity estimate - can waste time. It is necessary but approximate is fine. Can be used for read-heavy, write-heavy
TPS
Storage
Upload and download bandwidth
TPS is good for selecting DB. Can ask about storage when TPS is estimated.
Message system: must estimate upload/download bandwidth
Suggestion:
Requirement -> Scale estimate (TPS, storage size, read heavy or write heavy, upload and download bandwidth).
Hard Skill
Refund, 2 phase commit
Selection of mySQL is good.
All payment related system, can just go to mySQL. ACID requirement is very high for banking/payment, much higher than sharding, so choice to use RDBMS is easy
Audience:
Audience: Is the asynchronous case over complicated ? Will synchronized transactions be easier to manage?
Interviewee: The company sends notification to customers and marks transactions as failure, stop, step to law step. After customers pay or get refilled, let the service center charge.(automation if large scale)
Asynchronous: before update it failed. Instacart prices are different from grocery stores.
Order processor is retrying. Jingdong has a strong synchronization requirement. It does not send out goods before payment succeeds.
Audience: Withdraw database transaction?
Interviewee: Shoppers already report how much to pay after 5 days. Try bank if it succeeds. No withdrawal, only log. If we try one day and still fail, save to a database to let a team solve. Coupon table and user balance table should be consistent.
Audience: Database set estimation? Unique activity. A user has multiple purchase times (modify in dau).
Postgresql the same machine database, multiple images. Only multiple images, isolation read and write.
Scale: second sales or data large hive query? Double side
Large data: data sharding.
Visiting at large scale: get categorized, get rid of old data , three month ago to put to second layer
Audience: Order processor is down, how do we record, when it comes back, how do we handle the failure?
Interviewee: If it is completely down, we need to wait. If there are only a few, idempotency guarantees the process can continue. Kafka . We have a backup service and timestamp. We need a timestamp as a key. Only scan index . It does not scan hard path.
Audience: How much bandwidth can one device handle?
Audience:
QPS limit depends on layer:
buffer: 100k QPS
MySQL: low
Tomcat: 100 QPS per second
Network card: usually no need to consider. Likely backend will be threshold.
Interviewee: webcrawler network may find network card to be a bottleneck.
Audience: Async vs sync?
Sync is better than async
Jingdong: 发货快。没法回滚
Usually there are more tables than what’s designed, e.g. order table, discount table, remaining balance table
Audience: how to handle large loads?
Audience: high QPS can be handled by data shard. Usually through Master (write) - slave (reads). Rarely use sharding because sharding is fragile
Can classify
Order: you can archive, cold data.
Hot data is for the recent.
Audience:
Data sharding: 跨机器。same machine can also shard
Audience:
Visit QPS is very large. Data size not big
Or data size big, but QPS is low - for example, hive query
You should ask QPS or data size
Normally: may be high on both QPS and Size.
Data size:
data sharding
Tiered access: hot and cold access (e.g. recent 1 month)
Large QPS
Throttle (DDoS)
Break transaction into small steps
Audience: If user does not know about withhold? What if we directly pay?
Interviewer:
InstaCart already provides credit card.
Price may still be different from estimated at the time of the user
Key point is price may change
Very likely the interviewer will tell you
Example: instagram - if you have not used it before, then interviewer should let you know the key workflows. Interviewee’s job is to translate from requirements to design.
In this design, keypoints are all written down in functional requirement. Interviewee should think from the owner’s perspective how to solve the problem
Audience: Can I ask the interviewer for workflow?
Interviewer: Design Instagram and you have not used it, yes. Payment: should think from owner’s perspective
Audience: What level of detail is expected?
Kaisi: Design should cover all user stories. E.g. price changes.
Audience: What if I have no bank experience, and is not aware of the step to withhold fund vs charging?
Audience: interviewer should guide the interviewee. Should define 1-2 classic scenarios. After the clarification, then interviewer and interviewee has a combined job to cover the flow clearly
Audience: Synchronize, asynchronous. Why use asynchronous?
Interviewee:
Directly went to asynchronous.. Payment will happen 3-5 days later than initial ordering.
Apart from withholding (synchronous), I went to async directly.
Synchronous: order processor/payment service may need to retry.
Audience: How to handle order processor down?
Interviewee:
If partial down, requires idempotency.
If it’s already in message queue: no problem
If it’s not in the message queue: a backup service will pick up the order
Audience: How do we store whether the order has been picked up or not
Interviewee: need a status field: one status is already queued
Audience: Why do we need order processor? Order service can just insert the database, then enqueue.
Interviewee: Order service can be better for separating logic for waiting 5 days to a different service (order processor) from order service.
Audience: is High availability, the same as fault tolerance
Audience:
High availability is for single point of failure
Fault tolerance includes high availability, but may include more business-related failure
Audience:
May shard by user
Audience:
Should use the most frequent query to decide how to shard
Massive scale up: may use existing solutions. 分库分表
Global 2ndary index is supported by DynamoDB
NoSQL: If you need high QPS, no need to join.
Interviewer:
NoSQL, SQL. Difference is more blurry
If there are strong pros and cons, then choosing SQL and NoSQL
if there is no strong pros and cons, then
SQL believers: Scott Shi: Prioritize SQL.
NoSQL believers: use NoSQL. Try to reduce requirement on relations
Audience: How to choose NoSQL, SQL if there is only one primary key.
Audience: depends on business requirement. If there are no consistent columns between different products, can use NoSQL such as MongoDB.
Audience: What if it’s ambiguous choose between SQL and NoSQL
Interviewer: The key is to discuss tradeoffs
Audience: Discuss 2 phase commit?
Interviewer: yes, can be a bonus feature to discuss
Audience:
Big data, no join: NoSQL
ACID: SQL
Audience:
One thing of system design interview is to look at your weakest point
One way is to go deep. One is to go wide (and not deep)
Audience:
Perhaps not focus on solution
Audience:
The tone. And getting the points of the interviewer.
However, the interviewee may be very high level, but does not follow the interviewer’s hints. That’s a red flag.
Example, interviewer may like the system design, interviewee may go for ML design.
Audience:
Use of tool:
Google Jamboard - hard to move and edit
Google drawing - high latency
Try to use the most convenient tool.