Conveyor — proposed ingestion & storage architecture

A target design that resolves the bulk-ingest / read-fan-out tension surfaced reviewing #117 (sharding) & #120 (async submit).

stage → accept → ingest → process → serve async-by-default for large jobs conditional sharding polyglot storage

The problem in one line. One job can carry up to ~1M tasks. Writing them synchronously blows the 60 s ALB wall and DynamoDB's ~1000 WCU/s per-partition cap. Sharding fixes writes but taxes reads; doing it on every job taxes the 99% that are small. The fix is to stop writing on the request path, shard only when it pays, and put each kind of state on the engine built for it.

1Target architecture

flowchart TB
  C(["Client / SDK"])
  C -->|"submit (small) · 201"| API
  C -->|"presign + PUT large list"| S3S[("S3 — staged task lists")]
  API["API · Fargate · stateless
(decides sync vs async)"]
  API -->|"large/open: 202 + ingest msg"| CQ[["SQS — control queue"]]
  API -->|"small closed: sync write"| DDB
  API -->|"leases · counters · hot-read cache"| RDS[("Redis / MemoryDB
coordination")]
  CQ --> IW["Ingest worker
(stream · write · enqueue · latch)"]
  IW -->|"read chunks"| S3S
  IW -->|"write tasks
(shard only if large)"| DDB[("DynamoDB
jobs · runs · tasks")]
  IW -->|"enqueue refs"| RQ[["SQS — per-run queues"]]
  CQ -. "permanent fail" .-> DLQ[["DLQ + reconcile sweeper"]]
  DLQ -.-> DDB
  RQ --> TW["Task workers · Fargate"]
  TW -->|"scrape"| GW(["gateway"])
  TW -->|"result bodies"| S3R[("S3 — results")]
  TW -->|"stats INCR"| RDS
  TW -->|"terminal status"| DDB
  TW -->|"on run complete"| NOT["Step Functions / webhooks"]
  EXP["Results export (offline)"] -->|"full-set reads → zip"| S3R

Each arrow is a job for the engine best at it: S3 absorbs the giant body, SQS is the durable buffer, Redis owns hot coordination, DynamoDB stays the point-read/TTL system of record, the gateway/result flow is unchanged. The request thread returns in milliseconds.

2Submit decides sync vs async

flowchart TD
  S["POST /v1/jobs"] --> Q{"size & type?"}
  Q -->|"small closed
(tasks < threshold)"| SY["sync: write + enqueue
→ 201, single-partition"]
  Q -->|"large closed / open / async"| AS["stage list → 202
+ ingest_job message"]
  SY --> RS["tasks queryable now
cheap single-Query reads"]
  AS --> RA["tasks appear after ingest
sharded writes · poll job status"]

Today both ceilings are hit because submit is always synchronous and every job is stamped shard_count=16. Here, sharding and async are switched on only when the job is actually large or open — so the common small job keeps the old, fast, single-partition path (the regression we found in #117).

3Ingestion pipeline — durable, resumable

flowchart LR
  A["ingest_job (SQS)"] --> B["stream staged
source, chunk N"]
  B --> C["write tasks
deterministic ids"]
  C --> D["enqueue refs"]
  D --> E{"more
chunks?"}
  E -->|yes| B
  E -->|no| F["SetRunTotal
(absolute, idempotent)"]
  F --> G["latch last_batch"]
  G --> H["run completes via
normal lifecycle"]

At-least-once safe (this is #120's core, already tested): deterministic task ids (sha256(run_id:index)) make re-writes overwrite not duplicate; absolute SetRunTotal (not ADD) can't double-count and is set before the latch so completion never fires early; re-enqueued refs are deduped by the task worker's status guard. A crash at any step resumes on redelivery and converges; permanent failures land in a DLQ with the run left recoverable via /rerun or DELETE.

4State on the right engine (polyglot)

State	Engine	Why
Jobs · runs · idempotency · file-inputs	DynamoDB	low-volume point reads, native TTL, serverless, optimistic concurrency
Tasks + events (the bulk collection)	DynamoDB, conditionally sharded → Aurora Postgres if bulk ever dominates	shard writes only when large; small jobs single-partition. PG erases both write-cap & read-fan-out if it becomes the defining workload
Concurrency semaphore (leases) · rate limits · counters · hot-read cache	Redis / MemoryDB	atomic `INCR`/`SET NX PX`, native expiry — coordination is what it's best at
Large task lists (ingest source)	S3 (presigned)	keeps 1M URLs off the request body; streamed by the ingest worker
Result bodies	S3	unchanged; presigned URLs, 24 h TTL

5Read paths — bounded by design

Point reads (GetTask, history) — one partition, derived from task_id. Cheap, unchanged.
Run-scan (results listing) — single Query for small jobs; bounded N-way fan-out only for the large jobs that opted into sharding. Reads are the cheap side (3× the per-partition headroom of writes, eventually-consistent, parallel).
Full-set reads (export, "give me every result") — go through the offline export op to S3, never a synchronous hot-path scan. This is what kills the scary /rerun "load 1M rows into memory under the 60 s wall" failure mode.

6What changes vs today

Concern	Today	Proposed
Large submit	sync, hits 60 s wall	202 async, wall gone
Sharding	N=16 on every job	only large / open jobs
Small-job reads	16-way fan-out	single Query
1M-URL body	inline / capped	S3-staged stream
Full-set reads	hot scan, OOM risk	offline export
Semaphore / counters	DDB hot partition	Redis
Ingest failure	n/a (sync 503)	DLQ + reconcile sweeper

7Evolution path — ship in slices

Now (in-stack, low risk). Land #117 with conditional sharding (small closed jobs stay single-partition) and a lower default N (4–8). Add the offline-export route for full-set reads. No new infra.

Beta (the 1M path). Land #120's ingest_job op + the S3-streamed carrier (extend file_input) + DLQ & reconcile sweeper. Optionally move the semaphore/counters to Redis. This is where the 60 s wall actually disappears for 1M.

If/when bulk dominates. Move only the task + event store to Aurora Postgres (polyglot) — erases write-cap, read-fan-out, and the GSI dilemma in one move, while DDB keeps the metadata layer. A full DB swap is reserved for if Dynamo fights you in more than this one place.

8Why not the other options

Considered	Verdict
GSI to fix read fan-out	re-creates the write hot-partition on the index — moves the bottleneck
DDB on-demand / adaptive capacity	trap — per-key cap is physical; one job = one key
Redis as task store	not durable enough for paid 14-day results; RAM-priced retention
Wholesale DB swap to Postgres	best engine for the bulk problem, but rewrites storage + worker + loses serverless/TTL — too big for one spike
Fat/chunked task rows	breaks per-task status/results/pagination model
Raise ALB timeout	holds HTTP open for minutes, still WCU-bound — anti-pattern
Client chunking (AddTasks)	free baseline, works today — but pushes batching + partial-failure to the client