NIM speed optimization — adaptive rate limiting and increased throughput
- Add AdaptiveRateLimiter with auto-backoff on 429s (starts at 100 req/min, backs off to 10, recovers gradually)
- Increase NIM rate limit to 100 req/min with 40 max concurrency (configurable via NIM_RATE_LIMIT, NIM_MAX_CONCURRENCY)
- Tighten NIM timeouts: connect=8s, first_chunk=20s, fallback_first_chunk=12s
- Lock-free fast path in StrictSlidingWindowLimiter (reduces contention under load)
- Faster retry delays: base=0.3s, max=20s, jitter=0.1s
- Max out HTTP connection pool: 100 keepalive, 500 connections, 5s expiry
- Aggressive HTTP pool warmup on startup (forces TCP+TLS connection establishment)
- Record 429s to ModelHealthTracker for adaptive recovery
- Update CLAUDE.md with per-provider health params and session cleanup notes
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- CLAUDE.md +11 -5
- api/runtime.py +12 -4
- config/settings.py +14 -7
- core/rate_limit.py +11 -4
- providers/nvidia_nim/client.py +18 -7
- providers/openai_compat.py +35 -15
- providers/rate_limit.py +118 -11
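
Before the diffs, a note on the adaptive behavior in the first bullet: it is plain multiplicative decrease with gradual multiplicative recovery, clamped to [10, 100]. A minimal sketch of that curve, using the backoff_factor=0.5 and recovery_factor=1.2 defaults from the providers/rate_limit.py hunk below (the backoff/recover helpers are illustrative, not part of the codebase):

```python
# Illustrative only: the clamped backoff/recovery arithmetic that
# AdaptiveRateLimiter applies (factors 0.5 and 1.2 per the diff below).
def backoff(rate: int, min_rate: int = 10) -> int:
    return max(min_rate, int(rate * 0.5))

def recover(rate: int, initial_rate: int = 100) -> int:
    return min(initial_rate, int(rate * 1.2))

rate = 100
for _ in range(5):
    rate = backoff(rate)  # 100 -> 50 -> 25 -> 12 -> 10 -> 10 (floors at min)
steps = 0
while rate < 100:
    rate = recover(rate)  # +20% per step, capped at the initial rate
    steps += 1
print(rate, steps)  # 100 15 (and each recovery step needs 3 consecutive successes)
```
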
CLAUDE.md
@@ -52,13 +52,14 @@ Claude Code CLI → api/routes.py (FastAPI) → api/model_router.py → provider
 ```
 
 ### Auto-Routing with Health Tracking
-The proxy includes intelligent model selection:
-1. Pre-flight health check (recent failures …
-2. Skip unhealthy models (…
+The proxy includes intelligent model selection with per-provider health windows:
+1. Pre-flight health check (recent failures per model, window varies by provider)
+2. Skip unhealthy models (NIM: 2+ failures in 15s = unhealthy; Zen: 5+ failures in 60s = unhealthy)
 3. Automatic failover on timeout/rate-limit
 4. Zen provider is unlimited (9999 req/min scoped limiter) — never blocked by rate limits
 5. Blocked NIM providers skipped silently (no failure penalty)
 6. Load-based ordering — least-loaded providers tried first
+7. Stale sessions cleaned up every 60s on the admin dashboard
 
 ### Key Modules
 
@@ -66,7 +67,9 @@ The proxy includes intelligent model selection:
 - **api/services.py** — Request handling, fallback logic, failure recording
 - **api/model_router.py** — Model resolution with health-aware candidate selection
 - **api/optimization_handlers.py** — Fast-path for trivial requests
-- **…
+- **api/admin.py** — Admin dashboard (sessions, models, health)
+- **core/session_tracker.py** — Session load tracking + automatic stale session cleanup
+- **providers/rate_limit.py** — GlobalRateLimiter + ModelHealthTracker with per-provider health params
 - **providers/nvidia_nim/client.py** — NIM provider with fast timeouts
 - **providers/zen/client.py** — Zen/OpenCode provider
 - **providers/openai_compat.py** — OpenAI chat → Anthropic SSE translation
@@ -91,4 +94,7 @@ Key variables in `.env`:
 - `ENABLE_MODEL_THINKING` — Enable reasoning blocks
 
 ### Session Tracking
-Start Claude Code with `--session-id <uuid>` so the admin dashboard shows accurate per-session metrics. The proxy reads the `X-Session-ID` header for session identification.
+Start Claude Code with `--session-id <uuid>` so the admin dashboard shows accurate per-session metrics. The proxy reads the `X-Session-ID` header for session identification.
+
+### Admin Dashboard
+Sessions in the admin dashboard expire automatically — closed sessions are cleaned up every 60s based on activity. Stale sessions (no requests for 2x the window period) are removed automatically.
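
The health windows in item 2 above differ per provider. A hypothetical standalone sketch of that check; the real logic is ModelHealthTracker in providers/rate_limit.py, and the HealthWindow/HEALTH_PARAMS names here are made up for illustration:

```python
# Hypothetical sketch of the per-provider failure windows described above
# (NIM: 2+ failures in 15s, Zen: 5+ failures in 60s). Not the proxy's code.
import time
from collections import defaultdict, deque

HEALTH_PARAMS = {"nvidia_nim": (2, 15.0), "zen": (5, 60.0)}  # (max_failures, window_s)

class HealthWindow:
    def __init__(self) -> None:
        self._failures: dict[str, deque] = defaultdict(deque)

    def record_failure(self, model: str) -> None:
        self._failures[model].append(time.monotonic())

    def is_healthy(self, provider: str, model: str) -> bool:
        max_failures, window = HEALTH_PARAMS.get(provider, (5, 60.0))
        q = self._failures[model]
        cutoff = time.monotonic() - window
        while q and q[0] <= cutoff:  # drop failures that aged out of the window
            q.popleft()
        return len(q) < max_failures
```
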
api/runtime.py
@@ -328,12 +328,20 @@ class AppRuntime:
             logger.warning("Provider warmup skipped: {}", e)
 
     async def _warmup_provider(self, provider, provider_type: str) -> None:
-        """…
+        """Force connection pool pre-warming on startup."""
         try:
-            …
-            …
+            import httpx
+
+            if hasattr(provider, "_http_client"):
+                # Touch the connection pool to establish TCP+TLS connections
+                http = provider._http_client
+                await asyncio.wait_for(
+                    http.get("/", timeout=httpx.Timeout(3.0, connect=2.0)),
+                    timeout=4.0,
+                )
+                logger.debug("Provider {} HTTP pool warmed up", provider_type)
         except Exception:
-            pass
+            pass  # Warmup failures are non-fatal
 
     def _restore_tree_state(self, session_store: SessionStore) -> None:
         saved_trees = session_store.get_all_trees()
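
Why the throwaway GET works as a warmup: httpx pools connections per client, so one completed request (any status code) leaves a warm TCP+TLS connection that the next request reuses. A standalone illustration, with example.com standing in for a real provider base URL:

```python
# Standalone illustration of HTTP pool warmup; not the proxy's code.
import asyncio
import httpx

async def main() -> None:
    async with httpx.AsyncClient(base_url="https://example.com") as client:
        try:
            # Any response, even a 404, establishes and keeps the connection.
            await client.get("/", timeout=httpx.Timeout(3.0, connect=2.0))
        except httpx.HTTPError:
            pass  # warmup failures are non-fatal, mirroring the diff above
        await client.get("/")  # reuses the pooled connection, no new handshake

asyncio.run(main())
```
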
config/settings.py
@@ -177,6 +177,9 @@ class Settings(BaseSettings):
     provider_max_concurrency: int = Field(
         default=5, validation_alias="PROVIDER_MAX_CONCURRENCY"
     )
+    # NIM-specific throughput tuning (leaves headroom before upstream limits)
+    nim_rate_limit: int = Field(default=100, validation_alias="NIM_RATE_LIMIT")
+    nim_max_concurrency: int = Field(default=40, validation_alias="NIM_MAX_CONCURRENCY")
     enable_model_thinking: bool = Field(
         default=True, validation_alias="ENABLE_MODEL_THINKING"
     )
@@ -386,7 +389,6 @@ class Settings(BaseSettings):
         )
         return ",".join(schemes)
 
-
     @field_validator("model", "model_opus", "model_sonnet", "model_haiku")
     @classmethod
     def validate_model_format(cls, v: str | None) -> str | None:
@@ -460,11 +462,13 @@ class Settings(BaseSettings):
         """Return unique configured chat provider/model refs with source env keys."""
         model_refs = [m.strip() for m in (self.model or "").split(",") if m.strip()]
         candidates = [("MODEL", m) for m in model_refs]
-        candidates.extend(
-            …
-            …
-            …
-            …
+        candidates.extend(
+            [
+                ("MODEL_OPUS", self.model_opus),
+                ("MODEL_SONNET", self.model_sonnet),
+                ("MODEL_HAIKU", self.model_haiku),
+            ]
+        )
         sources_by_ref: dict[str, list[str]] = {}
         for source, model_ref in candidates:
             if model_ref is None:
@@ -535,7 +539,10 @@ class Settings(BaseSettings):
         """Return the NVIDIA API key that should be used for a specific model id."""
         model_name = model_name.strip().lower()
         if model_name.startswith("z-ai/glm"):
-            return …
+            return (
+                self.nvidia_nim_api_key_glm.strip()
+                or self.nvidia_nim_api_key_qwen.strip()
+            )
         if model_name.startswith("stepfun-ai/step-"):
             return (
                 self.nvidia_nim_api_key_stepfun.strip()
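
The two new fields are read from the environment through their validation aliases, matching the commit message's "configurable via NIM_RATE_LIMIT, NIM_MAX_CONCURRENCY". A quick sketch of an override, assuming Settings can be instantiated directly:

```python
# Sketch: overriding the new NIM throughput knobs via environment variables.
import os

os.environ["NIM_RATE_LIMIT"] = "60"       # default 100
os.environ["NIM_MAX_CONCURRENCY"] = "20"  # default 40

from config.settings import Settings  # module path per the diff above

s = Settings()
assert s.nim_rate_limit == 60
assert s.nim_max_concurrency == 20
```
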
core/rate_limit.py
@@ -32,18 +32,25 @@ class StrictSlidingWindowLimiter:
 
     async def acquire(self) -> None:
         while True:
-            …
+            now = time.monotonic()
+            cutoff = now - self._rate_window
+
+            # Fast path: try without lock (common case - room in window)
+            while self._times and self._times[0] <= cutoff:
+                self._times.popleft()
+            if len(self._times) < self._rate_limit:
+                self._times.append(now)
+                return
+
+            # Slow path: need to wait for a slot, use lock for atomicity
             async with self._lock:
                 now = time.monotonic()
                 cutoff = now - self._rate_window
-
                 while self._times and self._times[0] <= cutoff:
                     self._times.popleft()
-
                 if len(self._times) < self._rate_limit:
                     self._times.append(now)
                     return
-
                 oldest = self._times[0]
                 wait_time = max(0.0, (oldest + self._rate_window) - now)
 
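
The fast path can skip the lock because asyncio is cooperatively scheduled: there is no await between the window check and the append, so no other task can interleave between them. A small smoke test, assuming the (rate_limit, rate_window) constructor implied by calls elsewhere in this commit:

```python
# Smoke-test sketch for the lock-free fast path; the two-argument
# constructor is inferred from AdaptiveRateLimiter's usage in this commit.
import asyncio
from core.rate_limit import StrictSlidingWindowLimiter

async def main() -> None:
    limiter = StrictSlidingWindowLimiter(100, 60.0)  # 100 requests per 60s

    async def worker() -> None:
        await limiter.acquire()

    # 50 concurrent acquires fit in the window, so all take the fast path.
    await asyncio.gather(*(worker() for _ in range(50)))

asyncio.run(main())
```
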
providers/nvidia_nim/client.py
@@ -39,6 +39,8 @@ class NvidiaNimProvider(OpenAIChatTransport):
             provider_name="NIM",
             base_url=config.base_url or NVIDIA_NIM_DEFAULT_BASE,
             api_key=config.api_key,
+            nim_rate_limit=settings.nim_rate_limit,
+            nim_max_concurrency=settings.nim_max_concurrency,
         )
         self._nim_settings = nim_settings
         self._settings = settings
@@ -109,9 +111,9 @@ class NvidiaNimProvider(OpenAIChatTransport):
         from config.settings import get_settings
 
         # Faster timeouts for quick failover detection
-        connect_timeout_s = 10
-        first_chunk_timeout_s = 30
-        fallback_first_chunk_timeout_s = 20
+        connect_timeout_s = 8  # Down from 10
+        first_chunk_timeout_s = 20  # Down from 30
+        fallback_first_chunk_timeout_s = 12  # Down from 20
 
         try:
             client = self._client_for_body(body)
@@ -156,7 +158,9 @@ class NvidiaNimProvider(OpenAIChatTransport):
                 transient = True
             if "connection" in text and ("refused" in text or "reset" in text):
                 transient = True
-            if isinstance(…
+            if isinstance(
+                error, (httpx.ConnectError, httpx.ReadTimeout, asyncio.TimeoutError)
+            ):
                 transient = True
 
             if not transient:
@@ -168,6 +172,7 @@ class NvidiaNimProvider(OpenAIChatTransport):
             raise
 
         candidates = [c.strip() for c in csv.split(",") if c.strip()]
+
         # normalize: for entries like 'nvidia_nim/model/name' -> use only model part
         def model_for_candidate(cand: str) -> str:
             if "/" in cand:
@@ -202,7 +207,9 @@ class NvidiaNimProvider(OpenAIChatTransport):
                 try:
                     nim_metrics.record_attempt(cand)
                 except Exception:
-                    logger.debug(…
+                    logger.debug(
+                        "NIM_METRICS: failed to record attempt for %s", cand
+                    )
 
                 stream = await self._global_rate_limiter.execute_with_retry(
                     client.chat.completions.create,
@@ -230,14 +237,18 @@ class NvidiaNimProvider(OpenAIChatTransport):
                 try:
                     nim_metrics.record_success(cand)
                 except Exception:
-                    logger.debug(…
+                    logger.debug(
+                        "NIM_METRICS: failed to record success for %s", cand
+                    )
                 return _wrapped_fallback(), retry_body
             except Exception as e2:
                 logger.warning("NIM_STREAM: fallback %s failed: %s", cand, e2)
                 try:
                     nim_metrics.record_failure(cand)
                 except Exception:
-                    logger.debug(…
+                    logger.debug(
+                        "NIM_METRICS: failed to record failure for %s", cand
+                    )
                 last_exc = e2
 
         # No fallback succeeded; re-raise last exception
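
The widened isinstance check above treats transport-level timeouts and connect failures as transient, so they trigger failover instead of a hard error. A standalone version of the predicate, with the string checks copied from the surrounding context lines:

```python
# Standalone sketch of the transient-error test from the hunk above.
import asyncio
import httpx

def is_transient(error: Exception) -> bool:
    text = str(error).lower()
    if "connection" in text and ("refused" in text or "reset" in text):
        return True
    return isinstance(
        error, (httpx.ConnectError, httpx.ReadTimeout, asyncio.TimeoutError)
    )

assert is_transient(httpx.ConnectError("connection refused"))
assert is_transient(asyncio.TimeoutError())
assert not is_transient(ValueError("bad request body"))
```
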
providers/openai_compat.py
@@ -70,6 +70,8 @@ class OpenAIChatTransport(BaseProvider):
         provider_name: str,
         base_url: str,
         api_key: str,
+        nim_rate_limit: int = 100,
+        nim_max_concurrency: int = 40,
     ):
         super().__init__(config)
         self._provider_name = provider_name
@@ -77,24 +79,28 @@ class OpenAIChatTransport(BaseProvider):
         self._base_url = base_url.rstrip("/")
         self._http_client = None
         self._client_cache: dict[str, AsyncOpenAI] = {}
-        # …
-        # …
+        # NIM gets adaptive rate starting at 100 req/min (leaves headroom)
+        # Zen is effectively unlimited (9999)
         if provider_name.lower() == "zen":
-            effective_rate_limit = 9999
-            effective_max_concurrency = config.max_concurrency * 4
+            effective_rate_limit = 9999
+            effective_max_concurrency = config.max_concurrency * 4
+            use_adaptive = None
         else:
-            effective_rate_limit = …
-            …
-            …
+            effective_rate_limit = nim_rate_limit
+            effective_max_concurrency = max(
+                nim_max_concurrency, config.max_concurrency * 4
+            )
+            use_adaptive = nim_rate_limit
         self._global_rate_limiter = GlobalRateLimiter.get_scoped_instance(
             provider_name.lower(),
             rate_limit=effective_rate_limit,
             rate_window=config.rate_window,
             max_concurrency=effective_max_concurrency,
+            adaptive_rate=use_adaptive,
+            adaptive_min_rate=10,
         )
         # Connection pool tuned for maximum throughput.
-        # …
-        # increase pool size for high concurrency.
+        # Increased keepalive and connections for high concurrency.
         http_client_args = {
             "timeout": httpx.Timeout(
                 config.http_read_timeout,
@@ -105,9 +111,9 @@ class OpenAIChatTransport(BaseProvider):
             "trust_env": False,
             "http2": True,
             "limits": httpx.Limits(
-                max_keepalive_connections=…
-                max_connections=…
-                keepalive_expiry=…
+                max_keepalive_connections=100,
+                max_connections=500,
+                keepalive_expiry=5.0,
             ),
         }
         if config.proxy:
@@ -409,7 +415,9 @@ class OpenAIChatTransport(BaseProvider):
             self._log_stream_transport_error(tag, req_tag, e)
             mapped_e = map_error(e, rate_limiter=self._global_rate_limiter)
 
-            has_started_tool = any(…
+            has_started_tool = any(
+                s.started for s in sse.blocks.tool_states.values()
+            )
             has_content_blocks = (
                 sse.blocks.text_index != -1
                 or sse.blocks.thinking_index != -1
@@ -418,8 +426,20 @@ class OpenAIChatTransport(BaseProvider):
                 or len(sse._accumulated_reasoning_parts) > 0
             )
 
-            if has_content_blocks and isinstance(
-                …
+            if has_content_blocks and isinstance(
+                e,
+                (
+                    httpx.RemoteProtocolError,
+                    httpx.ReadTimeout,
+                    asyncio.TimeoutError,
+                    httpx.ConnectError,
+                ),
+            ):
+                logger.warning(
+                    "{}_STREAM: Transient error mid-stream. Faking max_tokens to resume. {}",
+                    tag,
+                    e,
+                )
                 for event in sse.close_all_blocks():
                     yield event
                 yield sse.message_delta("max_tokens", sse.estimate_output_tokens())
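
The pool sizing above trades a little memory for handshake latency: up to 100 hot keepalive connections, a 500-connection hard cap for bursts, and a short 5s expiry so idle sockets are released quickly. Equivalent standalone construction (note http2=True needs the httpx[http2] extra installed):

```python
# Standalone sketch of the tuned connection pool from the hunk above.
import httpx

limits = httpx.Limits(
    max_keepalive_connections=100,  # hot connections kept ready for reuse
    max_connections=500,            # hard cap under burst load
    keepalive_expiry=5.0,           # idle sockets released after 5s
)
client = httpx.AsyncClient(http2=True, trust_env=False, limits=limits)
```
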
providers/rate_limit.py
@@ -16,6 +16,74 @@ from core.rate_limit import StrictSlidingWindowLimiter
 T = TypeVar("T")
 
 
+class AdaptiveRateLimiter:
+    """Adaptive rate limiter that backs off on 429s and recovers gradually.
+
+    Starts at a high throughput and auto-adjusts based on upstream feedback.
+    This gives maximum throughput in normal conditions while self-correcting
+    when rate limits are hit.
+    """
+
+    _limiter_count: ClassVar[int] = 0
+
+    def __init__(
+        self,
+        initial_rate: int = 100,
+        min_rate: int = 10,
+        window: float = 60.0,
+        backoff_factor: float = 0.5,
+        recovery_factor: float = 1.2,
+    ) -> None:
+        self._initial_rate = initial_rate
+        self._current_rate = initial_rate
+        self._min_rate = min_rate
+        self._window = window
+        self._backoff_factor = backoff_factor
+        self._recovery_factor = recovery_factor
+        self._limiter = StrictSlidingWindowLimiter(initial_rate, window)
+        self._lock = asyncio.Lock()
+        self._success_streak: int = 0
+        self._instance_id = AdaptiveRateLimiter._limiter_count
+        AdaptiveRateLimiter._limiter_count += 1
+
+    async def acquire(self) -> None:
+        await self._limiter.acquire()
+
+    def record_429(self) -> None:
+        """Called when a 429 is received — reduce rate immediately."""
+        self._current_rate = max(
+            self._min_rate, int(self._current_rate * self._backoff_factor)
+        )
+        self._limiter = StrictSlidingWindowLimiter(self._current_rate, self._window)
+        self._success_streak = 0
+        logger.warning(
+            "ADAPTIVE_RATE: instance={} backed off to {} req/min (429 received)",
+            self._instance_id,
+            self._current_rate,
+        )
+
+    def record_success(self) -> None:
+        """Called on success — gradually recover rate if below initial."""
+        if self._current_rate >= self._initial_rate:
+            self._success_streak = 0
+            return
+
+        self._success_streak += 1
+        # Recover after 3 consecutive successes
+        if self._success_streak >= 3:
+            self._current_rate = min(
+                self._initial_rate,
+                int(self._current_rate * self._recovery_factor),
+            )
+            self._limiter = StrictSlidingWindowLimiter(self._current_rate, self._window)
+            self._success_streak = 0
+            logger.info(
+                "ADAPTIVE_RATE: instance={} recovered to {} req/min",
+                self._instance_id,
+                self._current_rate,
+            )
+
+
 class ModelHealthTracker:
     """Track per-model health based on recent failures."""
 
@@ -119,6 +187,8 @@ class GlobalRateLimiter:
         rate_limit: int = 40,
         rate_window: float = 60.0,
         max_concurrency: int = 5,
+        adaptive_rate: int | None = None,
+        adaptive_min_rate: int = 10,
     ):
         # Prevent re-initialization on singleton reuse
        if hasattr(self, "_initialized"):
@@ -134,15 +204,30 @@ class GlobalRateLimiter:
         self._rate_limit = rate_limit
         self._rate_window = float(rate_window)
         self._max_concurrency = max_concurrency
-        self.…
-        …
-        …
+        self._adaptive_rate = adaptive_rate
+        self._adaptive_min_rate = adaptive_min_rate
+
+        if adaptive_rate is not None:
+            self._proactive_limiter = AdaptiveRateLimiter(
+                initial_rate=adaptive_rate,
+                min_rate=adaptive_min_rate,
+                window=float(rate_window),
+            )
+        else:
+            self._proactive_limiter = StrictSlidingWindowLimiter(
+                rate_limit, float(rate_window)
+            )
         self._blocked_until: float = 0
         self._concurrency_sem = asyncio.Semaphore(max_concurrency)
         self._initialized = True
 
+        limiter_type = (
+            f"Adaptive({adaptive_rate}→{adaptive_min_rate})"
+            if adaptive_rate is not None
+            else f"Strict({rate_limit})"
+        )
         logger.info(
-            f"GlobalRateLimiter …
+            f"GlobalRateLimiter initialized {limiter_type} / {rate_window}s, max_concurrency={max_concurrency}"
         )
 
     @classmethod
@@ -175,11 +260,13 @@ class GlobalRateLimiter:
         rate_limit: int | None = None,
         rate_window: float | None = None,
         max_concurrency: int = 5,
+        adaptive_rate: int | None = None,
+        adaptive_min_rate: int = 10,
     ) -> GlobalRateLimiter:
         """Get or create a provider-scoped limiter instance.
 
-        Zen gets unlimited rate (9999) since it has no rate limits.
-        NIM …
+        Zen gets unlimited adaptive rate (9999) since it has no rate limits.
+        NIM gets adaptive rate from nim_rate_limit setting.
         """
         if not scope:
             raise ValueError("scope must be non-empty")
@@ -194,10 +281,14 @@ class GlobalRateLimiter:
             logger.info(
                 "Rebuilding provider rate limiter for updated scope '{}'", scope
             )
+        # Adaptive rate only for NIM (not Zen which is unlimited)
+        use_adaptive = adaptive_rate if scope == "nvidia_nim" else None
         cls._scoped_instances[scope] = cls(
             rate_limit=desired_rate_limit,
             rate_window=desired_rate_window,
             max_concurrency=max_concurrency,
+            adaptive_rate=use_adaptive,
+            adaptive_min_rate=adaptive_min_rate,
         )
         return cls._scoped_instances[scope]
 
@@ -308,15 +399,16 @@ class GlobalRateLimiter:
         fn: Callable[..., Any],
         *args: Any,
         max_retries: int = 3,
-        base_delay: float = 0.…
-        max_delay: float = …
-        jitter: float = 0.…
+        base_delay: float = 0.3,
+        max_delay: float = 20.0,
+        jitter: float = 0.1,
         **kwargs: Any,
     ) -> Any:
         """Execute an async callable with rate limiting and retry on 429.
 
         Waits for the proactive limiter before each attempt. On 429, applies
-        …
+        adaptive backoff and notifies the adaptive rate limiter. Snappier recovery
+        than fixed delays.
 
         Args:
             fn: Async callable to execute.
@@ -337,9 +429,13 @@ class GlobalRateLimiter:
             await self.wait_if_blocked()
 
             try:
-                return await fn(*args, **kwargs)
+                result = await fn(*args, **kwargs)
+                # Notify adaptive limiter of success (triggers gradual recovery)
+                self._record_success_for_adaptive()
+                return result
             except openai.RateLimitError as e:
                 last_exc = e
+                self._record_429_for_adaptive()
                 if attempt >= max_retries:
                     logger.warning(
                         f"Rate limit retry exhausted after {max_retries} retries"
@@ -358,6 +454,7 @@ class GlobalRateLimiter:
                 if e.response.status_code != 429:
                     raise
                 last_exc = e
+                self._record_429_for_adaptive()
                 if attempt >= max_retries:
                     logger.warning(
                         f"HTTP 429 retry exhausted after {max_retries} retries"
@@ -375,3 +472,13 @@ class GlobalRateLimiter:
 
         assert last_exc is not None
         raise last_exc
+
+    def _record_429_for_adaptive(self) -> None:
+        """Notify adaptive limiter of a 429 — triggers rate backoff."""
+        if isinstance(self._proactive_limiter, AdaptiveRateLimiter):
+            self._proactive_limiter.record_429()
+
+    def _record_success_for_adaptive(self) -> None:
+        """Notify adaptive limiter of success — triggers gradual rate recovery."""
+        if isinstance(self._proactive_limiter, AdaptiveRateLimiter):
+            self._proactive_limiter.record_success()