
Claude Code Rate Limits Explained: Why You Hit 'API Error' at 6% Usage

Architecture · 8 min read
Claude Code · rate limits · troubleshooting · tutorial

You're on the Max plan. You check your usage dashboard: 6%. You fire up Claude Code, start a refactor, and thirty seconds later:

Rate limited. Please try again later.

You check usage again. Still 6%. You try a different session. Same error. You open the Anthropic status page — all green. You start questioning reality.

You're not alone. GitHub issues #29579, #29882, #28832, and #27336 are full of developers reporting the same thing. Hundreds of thumbs-up reactions. The common thread: the error message tells you nothing about which limit you hit, and the usage dashboard seems to contradict it.

Here's what's actually happening.

There Are Three Different Rate Limits

The confusion exists because Anthropic enforces three separate limit systems, but the error message for all of them is functionally identical. Your usage percentage only reflects one of the three.

1. Usage Caps (What Your Dashboard Shows)

This is the number you see in Settings. It measures total tokens consumed over two rolling windows:

  • 5-hour rolling window: the primary cap. Tokens in, tokens out, across all sessions, summed over the last 5 hours. When you hit 100%, you're locked out until older tokens age out of the window.
  • 7-day weekly cap: a secondary ceiling that prevents sustained heavy use from compounding. Most developers never hit this unless they're running agents around the clock.

Your plan tier determines the size of these windows. Max $100 gets a bigger bucket than Max $20. The utilization percentage on your dashboard is your current consumption relative to these caps.

When this triggers: you've been doing a lot of work across many sessions, and the cumulative token volume has crossed your plan's threshold.
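The rolling-window mechanics above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation, and the cap value is a hypothetical placeholder (exact per-plan token numbers aren't published):

```python
from collections import deque
import time

WINDOW_SECONDS = 5 * 60 * 60   # the 5-hour rolling window
CAP_TOKENS = 1_000_000         # hypothetical plan cap, not a published number

class RollingUsage:
    """Track (timestamp, tokens) events and sum the trailing window."""
    def __init__(self):
        self.events = deque()

    def record(self, tokens, now=None):
        self.events.append((now if now is not None else time.time(), tokens))

    def utilization(self, now=None):
        now = now if now is not None else time.time()
        # Drop events that have aged out of the 5-hour window
        while self.events and self.events[0][0] < now - WINDOW_SECONDS:
            self.events.popleft()
        used = sum(tokens for _, tokens in self.events)
        return used / CAP_TOKENS

usage = RollingUsage()
usage.record(60_000, now=0)             # work done at hour 0
usage.record(120_000, now=4 * 3600)     # work done at hour 4
print(f"{usage.utilization(now=4 * 3600 + 60):.0%}")  # 18% — both in window
print(f"{usage.utilization(now=9 * 3600):.0%}")       # 12% — hour-0 work aged out
```

The key property: utilization goes *down* on its own as old work falls out of the window, which is why a lockout at 100% clears without you doing anything.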

2. Throughput / Burst Limits

This is the one that catches developers off guard. Even with 6% utilization, you can be rate limited if you're sending too many requests concurrently or consuming too many tokens per minute.

Think of usage caps as a gas tank, and throughput limits as how fast you can pour. You might have a full tank (low utilization), but if you try to pump all the gas at once, the nozzle maxes out.

Throughput limits include:

  • Requests per minute (RPM): how many API calls you can make in a 60-second window
  • Tokens per minute (TPM): how many input + output tokens you can process per minute
  • Concurrent requests: how many in-flight requests can be active simultaneously

These limits are per-model, and they vary significantly by tier. Opus has much stricter throughput constraints than Sonnet or Haiku. If you're running multiple Claude Code sessions — or multiple agents across sessions — all hitting Opus simultaneously, you'll blow through throughput limits long before you dent your usage cap.

When this triggers: you have multiple sessions or agents running in parallel, you're using a high-demand model (Opus), or you're sending rapid-fire requests from automated workflows.
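To make the gas-tank distinction concrete, here's a minimal client-side throttle sketch showing how a per-minute budget rejects requests even when cumulative usage is low. The RPM figure is a placeholder, not a published Anthropic limit:

```python
from collections import deque
import time

RPM_LIMIT = 50  # hypothetical per-model requests-per-minute budget

class RpmThrottle:
    """Reject a request when the trailing 60-second window is full."""
    def __init__(self, rpm=RPM_LIMIT):
        self.rpm = rpm
        self.sent = deque()  # timestamps of recent requests

    def allow(self, now=None):
        now = now if now is not None else time.time()
        # Forget requests older than 60 seconds
        while self.sent and self.sent[0] <= now - 60:
            self.sent.popleft()
        if len(self.sent) < self.rpm:
            self.sent.append(now)
            return True
        return False  # caller should wait and retry

throttle = RpmThrottle(rpm=3)
results = [throttle.allow(now=t) for t in (0, 1, 2, 3)]
print(results)  # [True, True, True, False] — fourth call in the minute is rejected
```

Note that the fourth request is rejected after only three requests' worth of tokens — a tiny fraction of any usage cap. That's the "6% but rate limited" scenario in miniature.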

3. Server-Side Rate Limiting (429s During Peak Load)

Sometimes Anthropic's infrastructure itself is the bottleneck. During peak usage periods, the API returns HTTP 429 responses to manage overall system load, regardless of your individual account limits.

The telltale sign is a retry-after: 0 header in the response. This means the server is shedding load in the moment and expects capacity to free up almost immediately. It's not your usage, not your throughput — it's global demand exceeding supply.

When this triggers: high-traffic periods (weekday afternoons US time are common), new model launches, or during any period of unusually high platform demand.

How to Diagnose Which Limit You Hit

When you get a rate limit error, work through this checklist:

Check Your Utilization Percentage

Open your Anthropic usage dashboard. If it shows low utilization (under 50%) and you're being rate limited, you're almost certainly hitting a throughput limit, not a usage cap. The dashboard only reflects cumulative token consumption — it tells you nothing about your request velocity.

Count Your Active Sessions

How many Claude Code sessions do you have open right now? Each terminal tab, each VS Code instance, each agent running in a CI pipeline — they all share the same throughput pool. Three agents working in parallel on Opus can easily exceed concurrent request limits.

Look at the Response Headers

If you can inspect the raw API response (enable verbose logging in your client), look for:

  • retry-after: 0 — server-side load shedding. Wait a beat and retry.
  • retry-after: N (where N > 0) — you've hit a defined limit. N seconds is how long you should wait.
  • x-ratelimit-limit-requests and x-ratelimit-remaining-requests — your RPM budget and remaining balance.
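The header checks above fold into a small classifier. The header names are as listed; the returned labels are this sketch's own wording, not API output:

```python
def classify_rate_limit(status, headers):
    """Map a 429 response's headers to one of the three limit types."""
    if status != 429:
        return "not-rate-limited"
    retry_after = headers.get("retry-after")
    if retry_after == "0":
        return "server-side load shedding: wait a beat and retry"
    if headers.get("x-ratelimit-remaining-requests") == "0":
        return "RPM budget exhausted: throughput limit"
    if retry_after is not None:
        return f"defined limit: wait {retry_after}s"
    return "unknown: check usage dashboard for a cap"

print(classify_rate_limit(429, {"retry-after": "0"}))
print(classify_rate_limit(429, {"retry-after": "30",
                                "x-ratelimit-remaining-requests": "0"}))
```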

Check Which Model You're Using

Opus has significantly lower throughput limits than Sonnet or Haiku. If you're running everything through Opus — including boilerplate tasks, file reads, and simple edits — you're burning scarce throughput on work that doesn't require frontier-level reasoning.

Practical Fixes

For Throughput Limits

Stagger your agent runs. If you have multiple agents, don't start them all simultaneously. Space launches by 30-60 seconds to avoid burst spikes. Use cron scheduling if your workflow supports it — a 5-minute offset between agents smooths out the request curve dramatically.
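A minimal way to compute a staggered launch schedule, using the 30-60 second spacing suggested above (agent names are hypothetical placeholders):

```python
STAGGER_SECONDS = 45  # within the suggested 30-60s range

agents = ["refactor-agent", "test-agent", "docs-agent"]  # placeholder names

def stagger_schedule(commands, delay=STAGGER_SECONDS, start=0):
    """Return (launch_offset_seconds, command) pairs spaced `delay` apart."""
    return [(start + i * delay, cmd) for i, cmd in enumerate(commands)]

for offset, cmd in stagger_schedule(agents):
    print(f"t+{offset:>3}s  {cmd}")
# t+  0s  refactor-agent
# t+ 45s  test-agent
# t+ 90s  docs-agent
```

Feed the offsets into `time.sleep` before each launch, or into your scheduler of choice; the point is simply that launches never share the same 60-second RPM window.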

Avoid parallel sessions on Opus. Opus throughput is the tightest bottleneck in the system. If you need multiple agents running concurrently, route the simpler ones through Sonnet or Haiku and reserve Opus for the agent doing complex architectural reasoning.

Use sequential, not parallel, tool calls where possible. Some Claude Code workflows fire multiple tool calls in parallel. If you're close to RPM limits, sequential execution trades speed for reliability.

For Usage Caps

Implement model tiering. This is the single highest-leverage fix for usage cap pressure. Not every task requires the same model:

| Task | Model | Why |
|------|-------|-----|
| Code generation, refactoring | Sonnet | Best cost/quality ratio for structured output |
| Complex architecture decisions | Opus | Worth the token cost for reasoning depth |
| Linting, formatting, simple edits | Haiku | Fast, cheap, good enough |
| File search, context gathering | Haiku | No reasoning needed |

Developers who switch from "Opus for everything" to tiered routing typically see 40-60% reduction in token consumption with no meaningful quality drop.
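The tiering table reduces to a small lookup. This sketch uses tier names as placeholders; substitute your provider's actual model IDs:

```python
# Task-type → model tier, mirroring the tiering table above.
# Tier names are placeholders for real model IDs.
TIER_FOR_TASK = {
    "codegen": "sonnet",
    "refactor": "sonnet",
    "architecture": "opus",
    "lint": "haiku",
    "format": "haiku",
    "search": "haiku",
}

def pick_model(task_type, default="sonnet"):
    """Route a task to the cheapest tier that can handle it."""
    return TIER_FOR_TASK.get(task_type, default)

print(pick_model("architecture"))  # opus
print(pick_model("lint"))          # haiku
print(pick_model("something-new")) # sonnet — safe middle-tier fallback
```

Defaulting unknown tasks to the middle tier, rather than the top one, is what keeps the Opus budget reserved for work that actually needs it.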

Trim your context windows. Large context windows burn tokens fast. If your agents are loading entire codebases into context, scope them tighter. Use .clawport/memory/ files and CLAUDE.md contracts to pre-load only the context an agent needs, rather than letting it discover everything from scratch each session.

Enable prompt caching. Anthropic's prompt caching reduces token costs for repeated prefixes. If your agents share a common system prompt or project context, caching means that shared prefix only counts once toward your usage cap.
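Assuming the Anthropic Messages API's cache_control mechanism, here's a hedged sketch of marking a shared system prompt as cacheable. The model ID and context string are illustrative; check the current API docs for your model's exact identifier:

```python
# Shared prefix every agent reuses — the part worth caching.
SHARED_CONTEXT = "Project conventions, architecture notes, CLAUDE.md contract..."

def build_request(user_prompt, model="claude-sonnet-4-5"):
    """Build a Messages API payload with a cacheable system prompt."""
    return {
        "model": model,  # illustrative model ID
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SHARED_CONTEXT,
                # The cached prefix is billed once, then reused across calls
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_prompt}],
    }

req = build_request("Refactor the auth module.")
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

The user message changes per call, but the system block stays byte-identical, which is the condition for a cache hit.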

For Server-Side 429s

Implement retry with exponential backoff. A simple retry loop with 1s, 2s, 4s delays handles transient 429s gracefully. Most server-side rate limiting clears in seconds.
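A minimal sketch of that retry loop. `RateLimitError` stands in for whatever 429 exception your client raises; jitter is added so parallel agents don't retry in lockstep:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your client's 429 exception type."""

def with_backoff(call, retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on rate limits with 1s, 2s, 4s... delays plus jitter."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise  # give up after the last retry
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.25)
            sleep(delay)

# Demo: a call that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # ok — after two retries
```

Honor a nonzero retry-after header over the computed delay when one is present; the server's number is authoritative.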

Cache last-known-good responses. If your workflow involves repeated queries for relatively stable data (project structure, dependency trees, config files), cache the results locally. Don't re-query the API for information that hasn't changed.
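A minimal sketch of such a cache, with a TTL so stale data eventually refreshes (the TTL and keys are illustrative):

```python
import time

class LastKnownGoodCache:
    """Serve a cached result while it's fresh; re-query only after the TTL."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (timestamp, value)

    def get(self, key, fetch, now=None):
        now = now if now is not None else time.time()
        hit = self.store.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: no API call made
        value = fetch()
        self.store[key] = (now, value)
        return value

calls = []
def fetch_tree():
    calls.append(1)           # stands in for an expensive API query
    return ["src/", "tests/"]

cache = LastKnownGoodCache(ttl_seconds=300)
cache.get("tree", fetch_tree, now=0)
cache.get("tree", fetch_tree, now=100)  # served from cache, no second fetch
print(len(calls))  # 1
```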

Avoid peak hours for batch work. If you're running large batch operations (codebase-wide refactors, mass test generation), schedule them for off-peak hours. Early morning or late evening US time typically has the lightest load.

How ClawPort Helps

Rate limits become manageable when you can actually see what's happening. Flying blind — with a single utilization percentage and opaque error messages — is what makes this so frustrating.

ClawPort's cost dashboard shows per-model token breakdown across all your agents. You can see at a glance which agents are burning the most tokens and on which model. That's where you find your tiering opportunities — the agent spending 80% of your Opus budget on file reads is an obvious candidate for model demotion.

The cron scheduler lets you spread agent work across time windows instead of running everything in parallel. Define pipelines that stagger agent execution, and the throughput spikes that trigger burst limits flatten out.

The activity console gives you real-time visibility into what every agent is doing right now — how many are active, what they're working on, and whether they're blocked on rate limits. When one agent hits a limit, you can see it immediately instead of discovering the failure 20 minutes later.

And the sidebar status widget surfaces utilization at a glance so you don't have to context-switch to a browser tab to check whether you're approaching a cap.

The Real Fix Is Visibility

Rate limit errors in Claude Code are not bugs. They're the predictable result of a system with three different constraint types and one undifferentiated error message. Once you understand the distinction between usage caps, throughput limits, and server-side load shedding, you can diagnose any rate limit error in under a minute.

The pattern is always the same: check utilization (high = usage cap, low = throughput or server), count active sessions, check the model, and adjust accordingly. Model tiering alone solves most problems. Staggered scheduling solves the rest.


Running Claude Code agent teams? ClawPort gives you the cost dashboard, cron scheduler, and activity console to stay ahead of rate limits. Free and open source.