flowchart TB
C(["Client / SDK"])
C -->|"submit (small) · 201"| API
C -->|"presign + PUT large list"| S3S[("S3 — staged task lists")]
API["API · Fargate · stateless
(decides sync vs async)"]
API -->|"large/open: 202 + ingest msg"| CQ[["SQS — control queue"]]
API -->|"small closed: sync write"| DDB
API -->|"leases · counters · hot-read cache"| RDS[("Redis / MemoryDB
coordination")]
CQ --> IW["Ingest worker
(stream · write · enqueue · latch)"]
IW -->|"read chunks"| S3S
IW -->|"write tasks
(shard only if large)"| DDB[("DynamoDB
jobs · runs · tasks")]
IW -->|"enqueue refs"| RQ[["SQS — per-run queues"]]
CQ -. "permanent fail" .-> DLQ[["DLQ + reconcile sweeper"]]
DLQ -.-> DDB
RQ --> TW["Task workers · Fargate"]
TW -->|"scrape"| GW(["gateway"])
TW -->|"result bodies"| S3R[("S3 — results")]
TW -->|"stats INCR"| RDS
TW -->|"terminal status"| DDB
TW -->|"on run complete"| NOT["Step Functions / webhooks"]
EXP["Results export (offline)"] -->|"full-set reads → zip"| S3R
Each arrow is a job for the engine best at it: S3 absorbs the giant body, SQS is the durable buffer, Redis owns hot coordination, DynamoDB stays the point-read/TTL system of record, the gateway/result flow is unchanged. The request thread returns in milliseconds.
flowchart TD
S["POST /v1/jobs"] --> Q{"size & type?"}
Q -->|"small closed
(tasks < threshold)"| SY["sync: write + enqueue
→ 201, single-partition"]
Q -->|"large closed / open / async"| AS["stage list → 202
+ ingest_job message"]
SY --> RS["tasks queryable now
cheap single-Query reads"]
AS --> RA["tasks appear after ingest
sharded writes · poll job status"]
Today both ceilings are hit because submit is always synchronous and every job is stamped
shard_count=16. Here, sharding and async are switched on only when the job is actually large or
open — so the common small job keeps the old, fast, single-partition path (the regression we found in #117).
flowchart LR A["ingest_job (SQS)"] --> B["stream staged
source, chunk N"] B --> C["write tasks
deterministic ids"] C --> D["enqueue refs"] D --> E{"more
chunks?"} E -->|yes| B E -->|no| F["SetRunTotal
(absolute, idempotent)"] F --> G["latch last_batch"] G --> H["run completes via
normal lifecycle"]
At-least-once safe (this is #120's core, already tested): deterministic task ids
(sha256(run_id:index)) make re-writes overwrite not duplicate; absolute SetRunTotal (not ADD)
can't double-count and is set before the latch so completion never fires early; re-enqueued refs are
deduped by the task worker's status guard. A crash at any step resumes on redelivery and converges; permanent
failures land in a DLQ with the run left recoverable via /rerun or DELETE.
| State | Engine | Why |
|---|---|---|
| Jobs · runs · idempotency · file-inputs | DynamoDB | low-volume point reads, native TTL, serverless, optimistic concurrency |
| Tasks + events (the bulk collection) | DynamoDB, conditionally sharded → Aurora Postgres if bulk ever dominates | shard writes only when large; small jobs single-partition. PG erases both write-cap & read-fan-out if it becomes the defining workload |
| Concurrency semaphore (leases) · rate limits · counters · hot-read cache | Redis / MemoryDB | atomic INCR/SET NX PX, native expiry — coordination is what it's best at |
| Large task lists (ingest source) | S3 (presigned) | keeps 1M URLs off the request body; streamed by the ingest worker |
| Result bodies | S3 | unchanged; presigned URLs, 24 h TTL |
task_id. Cheap, unchanged./rerun "load 1M rows into memory
under the 60 s wall" failure mode.| Concern | Today | Proposed |
|---|---|---|
| Large submit | sync, hits 60 s wall | 202 async, wall gone |
| Sharding | N=16 on every job | only large / open jobs |
| Small-job reads | 16-way fan-out | single Query |
| 1M-URL body | inline / capped | S3-staged stream |
| Full-set reads | hot scan, OOM risk | offline export |
| Semaphore / counters | DDB hot partition | Redis |
| Ingest failure | n/a (sync 503) | DLQ + reconcile sweeper |
ingest_job op + the S3-streamed carrier
(extend file_input) + DLQ & reconcile sweeper. Optionally move the semaphore/counters to Redis. This
is where the 60 s wall actually disappears for 1M.| Considered | Verdict |
|---|---|
| GSI to fix read fan-out | re-creates the write hot-partition on the index — moves the bottleneck |
| DDB on-demand / adaptive capacity | trap — per-key cap is physical; one job = one key |
| Redis as task store | not durable enough for paid 14-day results; RAM-priced retention |
| Wholesale DB swap to Postgres | best engine for the bulk problem, but rewrites storage + worker + loses serverless/TTL — too big for one spike |
| Fat/chunked task rows | breaks per-task status/results/pagination model |
| Raise ALB timeout | holds HTTP open for minutes, still WCU-bound — anti-pattern |
| Client chunking (AddTasks) | free baseline, works today — but pushes batching + partial-failure to the client |