InstaCart Payment

System DesignE-commerce & Payments

Topic: InstaCart Payment

Interviewer: Ellie

Interviewee: 南杨

Level: L5 (Senior)

Additional Resources:


Shopping Service Payment System

Mock System Design Interview Summary

Overview

Date: 10/24/2021

Target level: L5

Duration: 1 hour

Topic covered: Payment system

Diagram tool used: whimsical.com

Requirements

Functional requirements

Shopping service

Role:customer, shopper

Order:service as blackbox

Customer -> withholding/validate

3-5 days

customer -> instacart (actual charge)

instacart->shopper (delivery,tip)

instacart->customer(for refund)

Check instacart ->grocery payment

auditibility

User Journey Walk-through

Non functional requirements

Accuracy

Consistency »availability

Latency: low latency for sync <500ms

Moderate latency for async < 10 seconds

Fault tolerant

System Design

System design diagram

API:

StatusCode =witholdFund(identity, amount, paymentDetail)

Status Code=makePayment(payer,payee,amount,paymentDetails,idompotencyKey)

Steps for system module calls:

Placing order:

Customer place order through webapp

Webapp -> payment service to withhold fund

Payment service -> payment profile db to store credit information

Payment service -> payment DB to save

Webapp calls orderservice to place order

Order processor service:

Order processor to scan completed order

Order processor -> message queue to insert payment for completed order

Async payment processor -> retrieve payment requests from queue

Asnc payment processor calls payment service to call bank to send payment

Database schema:

Schema for payment DB:

Order db

Additional design

Discussions during the Interview

Functional requirement discussion:

Interviewer: initial requirement doesn’t include grocery/merchant, why?

Interviewee: added

API:

Interviewer: Do we need merchant information for API? withholdFund()

Interviewee: merchant not needed yet until shopper buys from merchant

Audience: is withholdFund() call idempotent?

SQL vs noSQL

consistency

SQL database (MySQL)

noSQL app server, 2-phase-commit

Scaling:

Transaction throughput: 10M MAU => 3 TPS. conservatively 300TPS

Storage: 10M x 10 x 5KB x 12 x 5(years)=30TB

Archive data to lower grade database

Shard DB. We may be able to

shard by payment ID

Shard by order ID (more preferred due to other tables)

Payment processor flow:

Can lock multiple records

Or can lock a single record

Steps to make payment:

Lock record

Call bank for idempotency ID?

Call bank to transact

Payment DB: MySQL

What if we have 10x customers? Grows to 10M customers

MySQL, shard by database ID

Fault tolerance

Scaling:

Order processor scaling: no need to scan all status

We can have a separate DB

Idempotency:

Payment service requests bankPaymentID

Payment service then will pay based on bankPaymentID

Interviewer and Audience Feedback after the Interview

Interviewer:

Soft skill

Good pacing

22-25 requirement

27-37 high level diagram

37-42 database, db selection

6:45 capacity estimate - can waste time. It is necessary but approximate is fine. Can be used for read-heavy, write-heavy

TPS

Storage

Upload and download bandwidth

TPS is good for selecting DB. Can ask about storage when TPS is estimated.

Message system: must estimate upload/download bandwidth

Suggestion:

Requirement -> Scale estimate (TPS, storage size, read heavy or write heavy, upload and download bandwidth).

Hard Skill

Refund, 2 phase commit

Selection of mySQL is good.

All payment related system, can just go to mySQL. ACID requirement is very high for banking/payment, much higher than sharding, so choice to use RDBMS is easy

Audience:

Audience: Is the asynchronous case over complicated ? Will synchronized transactions be easier to manage?

Interviewee: The company sends notification to customers and marks transactions as failure, stop, step to law step. After customers pay or get refilled, let the service center charge.(automation if large scale)

Asynchronous: before update it failed. Instacart prices are different from grocery stores.

Order processor is retrying. Jingdong has a strong synchronization requirement. It does not send out goods before payment succeeds.

Audience: Withdraw database transaction?

Interviewee: Shoppers already report how much to pay after 5 days. Try bank if it succeeds. No withdrawal, only log. If we try one day and still fail, save to a database to let a team solve. Coupon table and user balance table should be consistent.

Audience: Database set estimation? Unique activity. A user has multiple purchase times (modify in dau).

Postgresql the same machine database, multiple images. Only multiple images, isolation read and write.

Scale: second sales or data large hive query? Double side

Large data: data sharding.

Visiting at large scale: get categorized, get rid of old data , three month ago to put to second layer

Audience: Order processor is down, how do we record, when it comes back, how do we handle the failure?

Interviewee: If it is completely down, we need to wait. If there are only a few, idempotency guarantees the process can continue. Kafka . We have a backup service and timestamp. We need a timestamp as a key. Only scan index . It does not scan hard path.

Audience: How much bandwidth can one device handle?

Audience:

QPS limit depends on layer:

buffer: 100k QPS

MySQL: low

Tomcat: 100 QPS per second

Network card: usually no need to consider. Likely backend will be threshold.

Interviewee: webcrawler network may find network card to be a bottleneck.

Audience: Async vs sync?

Sync is better than async

Jingdong: 发货快。没法回滚

Usually there are more tables than what’s designed, e.g. order table, discount table, remaining balance table

Audience: how to handle large loads?

Audience: high QPS can be handled by data shard. Usually through Master (write) - slave (reads). Rarely use sharding because sharding is fragile

Can classify

Order: you can archive, cold data.

Hot data is for the recent.

Audience:

Data sharding: 跨机器。same machine can also shard

Audience:

Visit QPS is very large. Data size not big

Or data size big, but QPS is low - for example, hive query

You should ask QPS or data size

Normally: may be high on both QPS and Size.

Data size:

data sharding

Tiered access: hot and cold access (e.g. recent 1 month)

Large QPS

Throttle (DDoS)

Break transaction into small steps

Audience: If user does not know about withhold? What if we directly pay?

Interviewer:

InstaCart already provides credit card.

Price may still be different from estimated at the time of the user

Key point is price may change

Very likely the interviewer will tell you

Example: instagram - if you have not used it before, then interviewer should let you know the key workflows. Interviewee’s job is to translate from requirements to design.

In this design, keypoints are all written down in functional requirement. Interviewee should think from the owner’s perspective how to solve the problem

Audience: Can I ask the interviewer for workflow?

Interviewer: Design Instagram and you have not used it, yes. Payment: should think from owner’s perspective

Audience: What level of detail is expected?

Kaisi: Design should cover all user stories. E.g. price changes.

Audience: What if I have no bank experience, and is not aware of the step to withhold fund vs charging?

Audience: interviewer should guide the interviewee. Should define 1-2 classic scenarios. After the clarification, then interviewer and interviewee has a combined job to cover the flow clearly

Audience: Synchronize, asynchronous. Why use asynchronous?

Interviewee:

Directly went to asynchronous.. Payment will happen 3-5 days later than initial ordering.

Apart from withholding (synchronous), I went to async directly.

Synchronous: order processor/payment service may need to retry.

Audience: How to handle order processor down?

Interviewee:

If partial down, requires idempotency.

If it’s already in message queue: no problem

If it’s not in the message queue: a backup service will pick up the order

Audience: How do we store whether the order has been picked up or not

Interviewee: need a status field: one status is already queued

Audience: Why do we need order processor? Order service can just insert the database, then enqueue.

Interviewee: Order service can be better for separating logic for waiting 5 days to a different service (order processor) from order service.

Audience: is High availability, the same as fault tolerance

Audience:

High availability is for single point of failure

Fault tolerance includes high availability, but may include more business-related failure

Audience:

May shard by user

Audience:

Should use the most frequent query to decide how to shard

Massive scale up: may use existing solutions. 分库分表

Global 2ndary index is supported by DynamoDB

NoSQL: If you need high QPS, no need to join.

Interviewer:

NoSQL, SQL. Difference is more blurry

If there are strong pros and cons, then choosing SQL and NoSQL

if there is no strong pros and cons, then

SQL believers: Scott Shi: Prioritize SQL.

NoSQL believers: use NoSQL. Try to reduce requirement on relations

Audience: How to choose NoSQL, SQL if there is only one primary key.

Audience: depends on business requirement. If there are no consistent columns between different products, can use NoSQL such as MongoDB.

Audience: What if it’s ambiguous choose between SQL and NoSQL

Interviewer: The key is to discuss tradeoffs

Audience: Discuss 2 phase commit?

Interviewer: yes, can be a bonus feature to discuss

Audience:

Big data, no join: NoSQL

ACID: SQL

Audience:

One thing of system design interview is to look at your weakest point

One way is to go deep. One is to go wide (and not deep)

Audience:

Perhaps not focus on solution

Audience:

The tone. And getting the points of the interviewer.

However, the interviewee may be very high level, but does not follow the interviewer’s hints. That’s a red flag.

Example, interviewer may like the system design, interviewee may go for ML design.

Audience:

Use of tool:

Google Jamboard - hard to move and edit

Google drawing - high latency

Try to use the most convenient tool.