A layered approach to LLM provider rate limits, failures, and capacity constraints in BloodGPT. This page is descriptive: what each layer can do (capabilities) and what we actually use.
Scope: the transport layer only, i.e. throttling, retries, fallbacks, circuit breakers, and throughput observability. Schema and semantic concerns are separate planes (see llm-call-failure-classes for the three-tier separation: transport / schema / semantic).
Layers (by role)
Application
Business logic that makes the LLM call. It does not implement retries, fallbacks, or rate limiters; those belong to the lower layers (see no-self-rolled-queues).
In our stack: blood-gpt-dotnet (.NET API) plus Node.js services (analysis-worker, b2c-dashboard, patient-portal, recommendations-portal, algo-hub). Full routing: see llm-routing.
Event queue / orchestration
Capabilities:
- Concurrency limits per function
- Rate limiting at the event level
- Built-in queue with retry / backoff / scheduled retries
In our stack: inngest. Per the pattern in no-self-rolled-queues, the queue and retries live only here, not in the application.
LLM gateway
Capabilities:
- Cross-provider cascade fallback (on error / 429)
- Built-in circuit breaker (tracks failure rates per provider, auto-opens at a threshold)
- Endpoint rotation within a provider (global ↔ regional keys)
- Per-VK (virtual key) rate limiting (token + request limits, configurable reset period)
- Metrics export (Prometheus counters: requests, tokens, latency, error rate per provider/model)
- Native multi-SDK support (OpenAI / Anthropic / Google GenAI drop-in)
- Semantic caching (Weaviate-based, ~60 ms cache lookup vs ~2000 ms LLM call): the capability exists but is NOT enabled in our setup
In our stack: bifrost (target: Go-based, ~11 µs overhead, currently on stage) plus litellm (legacy, prod EU, being phased out: Python, ~100+ ms overhead under load, memory leak on streaming). Comparison and rationale: see llm-proxy-choice.
Observability (transport-aspect)
Capabilities in scope of this page (transport only):
- Trace + token counting per LLM call
- Cost aggregation per provider / model
- Metrics export for alerting (request / token / error counters, latency)
In our stack: langfuse (traces + cost per request, via callbacks) plus Bifrost Prometheus metrics (real-time counters).
Provider
The provider's own rate limits are a given constraint; managing them is not our responsibility (we adapt to them reactively).
In our stack: Vertex AI Gemini (tier details: vertex-gemini-quotas) plus OpenAI (page TBD).
Capabilities matrix: what is possible vs what is actually used
| Capability | Layer | Used |
|---|---|---|
| Concurrency limit per function | Event queue | TBD audit |
| Rate-limit per event | Event queue | TBD audit |
| Cross-provider cascade fallback | Gateway | partial (provider configs exist; prod cascade sequence TBD verify) |
| Endpoint rotation (global ↔ regional) | Gateway | provider keys set (vertex-eu + vertex-global); cascade between them TBD verify |
| Built-in circuit breaker | Gateway | thresholds TBD verify (Bifrost core feature; is it enabled in our config?) |
| Per-VK rate limit | Gateway | ❌ |
| Metrics counters export | Gateway | exists (Bifrost /metrics); scrape → Grafana TBD |
| Trace + token counting | Observability | ✅ |
| Cost tracking | Observability | ✅ |
| Threshold alerts on capacity (80%/90%) | Observability + Grafana | ❌ |
| Semantic caching | Gateway | ❌ (capability available, not enabled) |
Concept distinctions
Active vs reactive throttling
Reactive: "fire away, hit a 429, back off exponentially, retry" (the current default behavior of almost all LLM apps).
Active: "proactively hold yourself below a known threshold and never reach a 429" (industry best practice per TianPan).
Trade-off: active throttling gives predictable throughput and no rate-limit failures, but during bursts a request may wait in the queue for minutes. Reactive throttling is faster on average when capacity is available, but tail latency is unbounded and UX degrades periodically.
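The two modes can be contrasted in a few lines. This is a minimal sketch, not code from our services: `backoff_delay`, `pacing_interval`, and the 0.8 headroom factor are hypothetical names and values chosen for the illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Reactive: delay before retry number `attempt` (0-based) after a 429.

    Exponential backoff with full jitter: a uniform draw from
    [0, min(cap_s, base_s * 2**attempt)].
    """
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def pacing_interval(limit_per_minute: float, headroom: float = 0.8) -> float:
    """Active: minimum gap in seconds between requests so that the send rate
    stays at `headroom` (an assumed value) of the known per-minute limit."""
    if limit_per_minute <= 0 or not 0 < headroom <= 1:
        raise ValueError("limit must be positive and headroom in (0, 1]")
    return 60.0 / (limit_per_minute * headroom)
```

With a limit of 100 requests/min and 0.8 headroom, the active sender spaces requests 0.75 s apart and should not see a 429 from that limit; the reactive sender sends immediately and only waits after a 429 has already happened.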
Rate limiter vs circuit breaker
Rate limiter: caps throughput proactively (token bucket / fixed window). Goal: keep a spike from reaching the provider.
Circuit breaker: cuts off a provider reactively when the failure rate exceeds a threshold within a window (e.g. >50% errors over 30 sec → break for 60 sec). Goal: degrade gracefully during an outage.
They are complementary, not alternatives: the rate limiter prevents failures from accumulating; the circuit breaker bails out if the provider breaks anyway.
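A circuit breaker with the example thresholds above (>50% errors over 30 sec, break for 60 sec) might look like the sketch below. It is an illustration only and says nothing about how Bifrost implements its breaker: the class, the `min_calls` guard, and the direct open-to-closed transition (with no half-open probe) are simplifying assumptions. Time is passed in explicitly so the behavior is easy to test.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class CircuitBreaker:
    def __init__(self, failure_ratio: float = 0.5, window_s: float = 30.0,
                 open_s: float = 60.0, min_calls: int = 5) -> None:
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumed guard against tripping on 1-2 calls
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, ok)
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a call to the provider may be attempted at `now`."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.open_s:
            # Cool-down elapsed: close again and start with a clean window.
            self._opened_at = None
            self._events.clear()
            return True
        return False

    def record(self, ok: bool, now: float) -> None:
        """Record a call outcome and open the breaker if the ratio is exceeded."""
        self._events.append((now, ok))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        total = len(self._events)
        failures = sum(1 for _, good in self._events if not good)
        if (self._opened_at is None and total >= self.min_calls
                and failures / total > self.failure_ratio):
            self._opened_at = now
```

While `allow` returns False, the caller would route to a fallback provider instead of sending more traffic to the failing one.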
Multi-dimensional limits
A provider enforces several dimensions in parallel:
- RPM (binding for high-frequency small-token workloads)
- TPM (binding for long-context / large-output workloads)
- Concurrent requests (binding for long-running streaming)
Hitting any one of them returns a 429. Observability must show each dimension independently, not as an aggregate.
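To make "each dimension independently" concrete, the sketch below computes utilization per dimension and reports which one is binding. All names and numbers are hypothetical; the 0.8 / 0.9 alert levels mirror the 80%/90% thresholds listed in the capabilities matrix.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Limits:
    rpm: int
    tpm: int
    max_concurrent: int


@dataclass(frozen=True)
class Usage:
    requests_last_min: int
    tokens_last_min: int
    in_flight: int


def utilization(usage: Usage, limits: Limits) -> Dict[str, float]:
    """Fraction of each limit currently consumed, one entry per dimension."""
    return {
        "rpm": usage.requests_last_min / limits.rpm,
        "tpm": usage.tokens_last_min / limits.tpm,
        "concurrency": usage.in_flight / limits.max_concurrent,
    }


def binding_dimension(usage: Usage, limits: Limits) -> str:
    """The dimension closest to its limit, i.e. the one that will 429 first."""
    ratios = utilization(usage, limits)
    return max(ratios, key=lambda name: ratios[name])


def alert_level(ratio: float, warn: float = 0.8, critical: float = 0.9) -> str:
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warn"
    return "ok"
```

A workload at 20% of RPM but 95% of TPM looks healthy on an averaged gauge, yet it is about to be throttled; per-dimension ratios make that visible.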
Failover to a different model family (shared TPM gotcha)
Some providers share TPM across the model size variants of one family. OpenAI example: "Some model families have shared rate limits. All calls to any model in the given shared limit list will count towards that 3.5M."
→ Failing over gpt-5 → gpt-5-mini does not get around the limit. Failover must go to a different model family (e.g. OpenAI → Anthropic), not to a different size in the same family. This applies to our cascade strategy.
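One way to enforce this rule when building a cascade is to tag every model with the rate-limit group it draws from and drop candidates that share the primary's group. The sketch below is hypothetical: the group labels are invented, the gpt-5 / gpt-5-mini grouping simply follows the example above, and `anthropic-model` / `gemini-model` are placeholders rather than real model IDs.

```python
from typing import Dict, List

# Assumed mapping from model name to shared rate-limit group.
LIMIT_GROUP: Dict[str, str] = {
    "gpt-5": "openai/gpt-5",
    "gpt-5-mini": "openai/gpt-5",     # same shared limit as gpt-5 (per the example)
    "anthropic-model": "anthropic",   # placeholder name
    "gemini-model": "vertex/gemini",  # placeholder name
}


def cross_group_fallbacks(primary: str, candidates: List[str],
                          limit_group: Dict[str, str]) -> List[str]:
    """Keep only candidates whose rate-limit group differs from the primary's,
    preserving the original candidate order."""
    primary_group = limit_group[primary]
    return [m for m in candidates if limit_group[m] != primary_group]
```

For `gpt-5` with candidates `gpt-5-mini`, `anthropic-model`, `gemini-model`, the filter drops `gpt-5-mini` and keeps the other two in order.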
Endpoint rotation
A provider exposes several endpoints with independent capacity pools (e.g. Vertex global / europe-west4 / us-central1). On a 429 from one, try another. Vertex specifically supports rotation; AI Studio is global only, so no rotation is possible.
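The rotation logic itself is small. The following is a sketch under assumptions, not the gateway's implementation: `RateLimited` stands in for whatever error a client raises on a 429, and `send` is a hypothetical callable that performs the request against one endpoint.

```python
from typing import Callable, Optional, Sequence


class RateLimited(Exception):
    """Stand-in for a client error that carries an HTTP 429."""


def call_with_rotation(endpoints: Sequence[str],
                       send: Callable[[str], str]) -> str:
    """Try each endpoint in order; move on only when one answers with a 429.

    Other errors propagate immediately, since rotating does not help with them.
    """
    last_error: Optional[RateLimited] = None
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except RateLimited as exc:
            last_error = exc
    raise RateLimited("all endpoints are rate limited") from last_error
```

A call such as `call_with_rotation(["global", "europe-west4", "us-central1"], send)` returns the first successful response and raises only when every pool is exhausted, at which point a cross-provider fallback would take over.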
Token bucket vs leaky bucket vs fixed window
- Token bucket: burst-friendly (spend all tokens at once, refill at rate R)
- Leaky bucket: constant outflow (queue + drain at rate R)
- Fixed window: counter resets each period (what most providers use)
Trade-off: bursts vs smoothness vs implementation simplicity. Bifrost VK governance behaves roughly like a fixed window (per the "Reset Duration" config option).
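The behavioral difference is easiest to see side by side. The classes below are a minimal sketch with hypothetical names, not a description of Bifrost's governance code; time is passed in explicitly to keep the example deterministic.

```python
from typing import Optional


class TokenBucket:
    """Allows bursts up to `capacity`, then refills at `refill_per_s`."""

    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity
        self.updated = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class FixedWindow:
    """Counts requests per window; the counter resets at each window boundary."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self.current_window: Optional[int] = None
        self.count = 0

    def try_acquire(self, now: float) -> bool:
        window = int(now // self.window_s)
        if window != self.current_window:
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

One consequence of the fixed window worth keeping in mind: with a limit of 2 per 60 s, two requests at t=59 and two more at t=60 are all accepted, so the effective burst across a boundary can be twice the nominal limit.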
Adjacent topics
- Schema concerns: JSON validity, cleanSchema, and perturbation for structured output partially overlap with gateway capabilities but are a separate concern. See llm-call-failure-classes.
- Semantic concerns: content quality, factual accuracy, and hallucination scoring are a separate plane, handled through datasets / scoring / judges. See llm-judges, recognize-benchmark.
- Semantic caching: a gateway capability that sits right on the transport/semantic boundary (matches by embedding similarity, not exact match). NOT enabled in our setup; the medical context adds a risk of false-positive hits. Decision candidate, TBD.
Open questions
- Inngest concurrency / rateLimit audit: which of our functions have these fields set and which leave them unset. Needs a grep over apps/**/inngest/functions/.
- Bifrost cascade config on prod: which fallback sequences are actually active (via the INFRA configs prod branch).
- Bifrost built-in circuit breaker: is it enabled in our config, and with what thresholds? The feature exists in the core gateway, but the config parameters need verifying.
- Prometheus → Grafana → alerts: Bifrost exposes the metrics, but are scrape / dashboard / alert rules configured, or is the endpoint just sitting there?
- Alert threshold values: 80%/90% of which Vertex tier baseline? Depends on the unknown current tier (open question in открытый-вопрос—grant-кредиты-vs-tier-qualification).
- Capacity baseline calibration: the theoretical ceiling is 100 tests/min, the actual TPM-bound figure is ~40 (see capacity-baseline). Re-measure to see how this has changed.
Sources
- [TianPan: LLM API Resilience in Production](https://tianpan.co/blog/2026-03-11-llm-api-resilience-production), accessed 2026-05-17. Why backoff alone does not scale; recommended resilience stack pattern.
- [getmaxim.ai: How AI Gateways Tackle Rate Limiting](https://www.getmaxim.ai/articles/how-ai-gateways-tackle-rate-limiting-for-llm-apps/), accessed 2026-05-17.
- [getmaxim.ai: Retries, Fallbacks, and Circuit Breakers in LLM Apps: Production Guide](https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/), accessed 2026-05-17.
- [TensorZero retries + fallbacks docs](https://www.tensorzero.com/docs/gateway/guides/retries-fallbacks), accessed 2026-05-17. Model routing / variant fallbacks / function fallbacks in a Rust-based gateway.
- [Google Cloud Blog: Reduce 429 errors on Vertex AI](https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai), accessed 2026-05-17.
- Application-level circuit-breaker libraries: [Polly (.NET)](https://github.com/App-vNext/Polly), accessed 2026-05-17; also Resilience4j (Java/JVM) and opossum (Node.js).