A layered approach to LLM provider rate limits, failures, and capacity constraints in BloodGPT. This page is descriptive: what each layer can do (capabilities) and what we actually use.

Scope: the transport layer, i.e. throttling, retries, fallbacks, circuit breakers, and throughput observability. Schema and semantic concerns are separate planes (see llm-call-failure-classes for the three-tier separation: transport / schema / semantic).

Layers (by role)

Application

Business logic that makes the LLM call. It does not do retries, fallbacks, or rate limiting; those belong to the lower layers (see no-self-rolled-queues).

In our setup: blood-gpt-dotnet (.NET API) + Node.js services (analysis-worker, b2c-dashboard, patient-portal, recommendations-portal, algo-hub). Full routing: see llm-routing.

Event queue / orchestration

Capabilities:

  • Concurrency limits per function
  • Rate limiting at the event level
  • Built-in queue with retry / backoff / scheduled retries

In our setup: inngest. The pattern in no-self-rolled-queues: queue + retries live only here, not in the application.

LLM gateway

Capabilities:

  • Cross-provider cascade fallback (on error / 429)
  • Built-in circuit breaker (tracks failure rates per provider, auto-opens at a threshold)
  • Endpoint rotation within a provider (global ↔ regional keys)
  • Per-VK (virtual key) rate limiting (token + request limits, configurable reset period)
  • Metrics export (Prometheus counters: requests, tokens, latency, error rate per provider/model)
  • Native multi-SDK support (OpenAI / Anthropic / Google GenAI drop-in)
  • Semantic caching (Weaviate-based, ~60 ms cache lookup vs ~2000 ms LLM call): the capability exists but is NOT enabled in our setup

In our setup: bifrost (target: Go-based, ~11 µs overhead, currently on stage) + litellm (legacy, prod EU, being phased out: Python, ~100+ ms overhead under load, memory leak on streaming). Comparison and rationale: llm-proxy-choice.

Observability (transport-aspect)

Capabilities in scope of this page (transport only):

  • Trace + token counting per LLM call
  • Cost aggregation per provider / model
  • Metrics export for alerting (request / token / error counters, latency)

In our setup: langfuse (traces + cost per request, via callbacks) + Bifrost Prometheus metrics (real-time counters).

Provider

The provider's own rate limits are a given constraint; managing them is not our responsibility (we adapt to them reactively).

In our setup: Vertex AI Gemini (tier details: vertex-gemini-quotas) + OpenAI (page TBD).

Capabilities matrix: what is possible vs what is actually used

| Capability | Layer | Used |
| --- | --- | --- |
| Concurrency limit per function | Event queue | TBD audit |
| Rate-limit per event | Event queue | TBD audit |
| Cross-provider cascade fallback | Gateway | partial (provider configs exist; prod cascade sequence TBD verify) |
| Endpoint rotation (global ↔ regional) | Gateway | provider keys set (vertex-eu + vertex-global); cascade between them TBD verify |
| Built-in circuit breaker | Gateway | thresholds TBD verify (Bifrost core feature; is it active in our config?) |
| Per-VK rate limit | Gateway | |
| Metrics counters export | Gateway | exists (Bifrost /metrics); scrape → Grafana TBD |
| Trace + token counting | Observability | |
| Cost tracking | Observability | |
| Threshold alerts on capacity (80%/90%) | Observability + Grafana | |
| Semantic caching | Gateway | ❌ (capability available, not enabled) |

Concept distinctions

Active vs reactive throttling

Reactive: "fly, hit a 429, back off exponentially, retry" (the current default behavior of almost all LLM apps).

Active: "hold yourself back proactively below a known threshold and never reach a 429" (industry best practice per TianPan).

Trade-off: active gives predictable throughput and no rate-limit failures, but a request may wait in the queue for minutes during bursts. Reactive is faster on average when capacity is available, but tail latency is unbounded and UX periodically degrades.
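The trade-off can be made concrete with a minimal sketch. The function names, the 1 s base delay, the 60 s cap, and the 60 RPM limit are illustrative assumptions, not values from our configuration.

```python
import random
from typing import List, Optional

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Reactive: seconds to wait before retry number `attempt` (0-based) after a 429."""
    ceiling = min(cap, base * 2 ** attempt)
    # With an rng, apply "full jitter" so clients do not retry in lockstep.
    return rng.uniform(0.0, ceiling) if rng else ceiling

def paced_send_offsets(n_requests: int, rpm_limit: int) -> List[float]:
    """Active: seconds each request of a burst waits when paced evenly at rpm_limit."""
    interval = 60.0 / rpm_limit
    return [i * interval for i in range(n_requests)]

# Reactive: delays grow exponentially per retry (hypothetical base of 1 s).
delays = [backoff_delay(a) for a in range(5)]
# Active: a burst of 300 requests at an assumed 60 RPM never hits a 429,
# but the last request waits almost five minutes in the queue.
offsets = paced_send_offsets(300, rpm_limit=60)
```

Here `delays` is 1, 2, 4, 8, 16 seconds, and the last entry of `offsets` is 299 seconds, which is the "minutes in the queue" cost of active throttling.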

Rate limiter vs circuit breaker

Rate limiter: limits throughput proactively (token bucket / fixed window). Goal: keep a spike from reaching the provider.

Circuit breaker: cuts off a provider reactively when the failure rate exceeds a threshold over a window (e.g. >50% errors in 30 sec → break for 60 sec). Goal: degrade gracefully during an outage.

They are complementary, not alternatives: the rate limiter prevents failures from accumulating; the circuit breaker bails out if the provider breaks anyway.
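A minimal sketch of the circuit-breaker rule, using the example numbers above (>50% errors in 30 sec → break for 60 sec). This is not Bifrost's implementation: the class, the `min_calls` guard, and the absence of a half-open probing state are simplifying assumptions.

```python
from collections import deque

class CircuitBreaker:
    """Opens when the failure rate over a sliding window exceeds a threshold."""

    def __init__(self, threshold=0.5, window_s=30.0, cooldown_s=60.0, min_calls=5):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls      # assumption: do not trip on tiny samples
        self.events = deque()           # (timestamp, ok) pairs inside the window
        self.open_until = 0.0

    def allow(self, now: float) -> bool:
        """True if calls to this provider may proceed at time `now`."""
        return now >= self.open_until

    def record(self, ok: bool, now: float) -> None:
        """Record one call outcome and open the breaker if the rule is met."""
        self.events.append((now, ok))
        # Drop outcomes that fell out of the sliding window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()
        failures = sum(1 for _, good in self.events if not good)
        enough = len(self.events) >= self.min_calls
        if enough and failures / len(self.events) > self.threshold:
            self.open_until = now + self.cooldown_s
            self.events.clear()

breaker = CircuitBreaker()
for t, ok in enumerate([True, True, False, False, False]):
    breaker.record(ok, now=float(t))    # 3 of 5 calls fail -> opens at t=4
```

After the loop the breaker rejects calls until t=64; a production breaker would normally add a half-open state that lets a few probe requests through before closing fully.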

Multi-dimensional limits

The provider enforces several dimensions in parallel:

  • RPM (binding for high-frequency small-token workloads)
  • TPM (binding for long-context / large-output workloads)
  • Concurrent requests (binding for long-running streaming)

Hitting any one of them = 429. Observability should show each independently, not as an aggregate.
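A minimal sketch of why the dimensions need to be tracked separately. All type names and the limit values are hypothetical; real provider quotas differ per tier.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Limits:
    rpm: int
    tpm: int
    max_concurrent: int

@dataclass
class Usage:
    requests_last_min: int
    tokens_last_min: int
    in_flight: int

def utilization(usage: Usage, limits: Limits) -> Dict[str, float]:
    """Per-dimension utilization; report each value on its own, never averaged."""
    return {
        "rpm": usage.requests_last_min / limits.rpm,
        "tpm": usage.tokens_last_min / limits.tpm,
        "concurrent": usage.in_flight / limits.max_concurrent,
    }

def blocking_dimensions(usage: Usage, limits: Limits, next_tokens: int) -> List[str]:
    """Dimensions that the next request would exceed (any one of them means a 429)."""
    blocked = []
    if usage.requests_last_min + 1 > limits.rpm:
        blocked.append("rpm")
    if usage.tokens_last_min + next_tokens > limits.tpm:
        blocked.append("tpm")
    if usage.in_flight + 1 > limits.max_concurrent:
        blocked.append("concurrent")
    return blocked

# Hypothetical long-context workload: few requests, many tokens.
limits = Limits(rpm=1000, tpm=1_000_000, max_concurrent=50)
usage = Usage(requests_last_min=40, tokens_last_min=960_000, in_flight=5)
blocked = blocking_dimensions(usage, limits, next_tokens=50_000)
per_dimension = utilization(usage, limits)
```

In this example RPM sits at 4% while TPM is at 96% and blocks the next call; an aggregate view across dimensions would hide the dimension that is about to return 429.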

Failover to a different model family (shared TPM gotcha)

Some providers share TPM across the size variants of one model family. OpenAI example: "Some model families have shared rate limits. All calls to any model in the given shared limit list will count towards that 3.5M."

→ Failover gpt-5 → gpt-5-mini does not get around the limit. Failover must go to a different model family (e.g. OpenAI → Anthropic), not the same family in a different size. This applies to our cascade strategy.
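A minimal sketch of a cascade builder that respects this rule. The group mapping is an assumption for illustration (check the provider's published shared-limit lists), and `anthropic-model-a` / `gemini-model-b` are placeholder names, not real model IDs.

```python
from typing import Dict, List

# Hypothetical mapping: model -> shared rate-limit group.
SHARED_LIMIT_GROUP: Dict[str, str] = {
    "gpt-5": "openai-gpt-5",
    "gpt-5-mini": "openai-gpt-5",
    "anthropic-model-a": "anthropic",
    "gemini-model-b": "google-gemini",
}

def fallback_chain(primary: str, candidates: List[str],
                   group_of: Dict[str, str]) -> List[str]:
    """Keep only candidates whose limit group differs from every earlier hop."""
    used = {group_of[primary]}
    chain = []
    for model in candidates:
        group = group_of[model]
        if group not in used:
            chain.append(model)
            used.add(group)
    return chain

chain = fallback_chain(
    "gpt-5",
    ["gpt-5-mini", "anthropic-model-a", "gemini-model-b"],
    SHARED_LIMIT_GROUP,
)
```

`gpt-5-mini` is dropped because it draws on the same pool as `gpt-5`; the remaining hops each land on an independent limit.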

Endpoint rotation

A provider exposes several endpoints with independent capacity pools (e.g. Vertex global / europe-west4 / us-central1). On a 429 from one, try another. Vertex specifically supports rotation; AI Studio is global only, so no rotation is possible.
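A minimal sketch of rotation on 429. `RateLimited`, `call_with_rotation`, and the fake transport are hypothetical stand-ins; in our stack the gateway would perform this step, not application code.

```python
from typing import Callable, List, Tuple

class RateLimited(Exception):
    """Stand-in for an HTTP 429 response from one endpoint."""

def call_with_rotation(endpoints: List[str],
                       send: Callable[[str], str]) -> Tuple[str, str]:
    """Try each endpoint in order; move on only when the current one returns 429."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint, send(endpoint)
        except RateLimited as err:
            last_error = err            # this pool is exhausted, try the next one
    raise last_error                    # every pool returned 429

# Fake transport for illustration: the global pool is exhausted.
def fake_send(endpoint: str) -> str:
    if endpoint == "global":
        raise RateLimited("429 from global")
    return "ok"

served_by, response = call_with_rotation(
    ["global", "europe-west4", "us-central1"], fake_send
)
```

Only `RateLimited` triggers rotation in this sketch; other errors propagate immediately, since retrying a malformed request against another region would not help.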

Token bucket vs leaky bucket vs fixed window

  • Token bucket: burst-friendly (spend all tokens at once, refill at rate R)
  • Leaky bucket: constant outflow (queue + drain at rate R)
  • Fixed window: counter resets each period (what most providers use)

Trade-off: bursts vs smoothness vs implementation simplicity. Bifrost VK governance behaves approximately like a fixed window (per the "Reset Duration" config option).
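The three algorithms can be sketched side by side. These are textbook versions with an injected clock (`now` in seconds) for clarity; they are not how Bifrost or any provider implements limits, and all class names are illustrative.

```python
class TokenBucket:
    """Burst-friendly: up to `capacity` requests at once, refilled at a steady rate."""
    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0):
        self.capacity, self.refill_per_s = capacity, refill_per_s
        self.tokens, self.updated = float(capacity), now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

class LeakyBucket:
    """Constant outflow: every request is queued and released at a fixed interval."""
    def __init__(self, rate_per_s: float):
        self.interval = 1.0 / rate_per_s
        self.next_free = 0.0

    def schedule(self, now: float) -> float:
        """Return the time at which this request is released downstream."""
        release = max(now, self.next_free)
        self.next_free = release + self.interval
        return release

class FixedWindow:
    """Counter that resets at each window boundary."""
    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self.window_index, self.count = None, 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_s)
        if index != self.window_index:
            self.window_index, self.count = index, 0   # new window, reset counter
        if self.count < self.limit:
            self.count += 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_s=1.0)
burst = [bucket.allow(0.0) for _ in range(4)]        # three pass, the fourth is rejected
leaky = LeakyBucket(rate_per_s=2.0)
releases = [leaky.schedule(0.0) for _ in range(3)]   # spaced 0.5 s apart
window = FixedWindow(limit=2, window_s=60.0)
edge = [window.allow(t) for t in (59.0, 59.5, 59.9, 60.0)]
```

The fixed-window run shows its known weakness: two requests pass just before the boundary and another passes at 60.0, so up to twice the limit can land within a short span around a reset.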

Adjacent topics

  • Schema concerns: JSON validity, cleanSchema, perturbation for structured output. Partially overlaps with gateway capabilities but is a separate concern. See llm-call-failure-classes.
  • Semantic concerns: content quality, factual accuracy, hallucination scoring → a separate plane handled through datasets / scoring / judges. See llm-judges, recognize-benchmark.
  • Semantic caching: a gateway capability that sits right on the transport/semantic boundary (matches by embedding similarity, not exact match). Not enabled in our setup; the medical context adds a risk of false-positive hits. Decision candidate TBD.

Open questions

  • Inngest concurrency / rateLimit audit: which of our functions have these fields set and which leave them unset. Needs a grep over apps/**/inngest/functions/.
  • Bifrost cascade config on prod: which fallback sequences are actually active (via the INFRA configs prod branch).
  • Bifrost built-in circuit breaker: is it enabled in our config, and with which thresholds? The feature exists in the core gateway, but the config parameters need verifying.
  • Prometheus → Grafana → alerts: Bifrost exposes the metrics, but are scrape / dashboard / alert rules configured, or is the endpoint just sitting there?
  • Alert threshold values: 80%/90% of which Vertex tier baseline? Depends on the unknown current tier (open question in открытый-вопрос—grant-кредиты-vs-tier-qualification).
  • Capacity baseline calibration: the theoretical ceiling is 100 tests/min, the actual TPM-bound figure is ~40 (see capacity-baseline). Re-measure to see how it has changed.

Sources

Sources: 1 2 3 4 5 6.

Footnotes

  1. [TianPan: LLM API Resilience in Production](https://tianpan.co/blog/2026-03-11-llm-api-resilience-production), accessed 2026-05-17. Why backoff alone does not scale; recommended resilience stack pattern.

  2. [getmaxim.ai: How AI Gateways Tackle Rate Limiting](https://www.getmaxim.ai/articles/how-ai-gateways-tackle-rate-limiting-for-llm-apps/), accessed 2026-05-17.

  3. [getmaxim.ai: Retries, Fallbacks, and Circuit Breakers in LLM Apps — Production Guide](https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/), accessed 2026-05-17.

  4. [TensorZero retries + fallbacks docs](https://www.tensorzero.com/docs/gateway/guides/retries-fallbacks), accessed 2026-05-17. Model routing / variant fallbacks / function fallbacks in a Rust-based gateway.

  5. [Google Cloud Blog: Reduce 429 errors on Vertex AI](https://cloud.google.com/blog/products/ai-machine-learning/reduce-429-errors-on-vertex-ai), accessed 2026-05-17.

  6. Application-level circuit-breaker libraries: [Polly (.NET)](https://github.com/App-vNext/Polly), Resilience4j (Java/JVM), opossum (Node.js), accessed 2026-05-17.