Design Principles
To ensure all users enjoy stable and reliable inference services, we adopt a tiered resource isolation and dynamic rate management mechanism:
- Fair Use: "Unlimited Plans" carry a monthly Token quota. Once an account exceeds its quota, it enters Basic Assurance Mode, which prevents abnormal load from a single account from affecting overall platform stability.
- Predictable Experience: Requests that exceed the threshold will immediately receive a standard HTTP 429 response, enabling clients to perform graceful degradation.
Rate Limits Overview
1. Standard Plans and Pay-as-you-go
The following plans apply to regular development, production, and commercial use. If any dimension (RPS / RPM / RPH / RPD) reaches its limit, the API returns HTTP 429 (Too Many Requests).
| Plan Type | RPS (Requests/Second) | RPM (Requests/Minute) | RPH (Requests/Hour) | RPD (Requests/Day) |
|---|---|---|---|---|
| Free Trial | 1 | 3 | — | — |
| Paid / Recharged Users | 1 | 6 | — | — |
| Enterprise / API Gateway | As per contract | As per contract | As per contract | As per contract |
Note:
Rate limits, concurrency capabilities, and timeout policies for dedicated Enterprise plans can be customized per account. Please contact your account manager for an exclusive SLA.
2. Unlimited Plans Fair Use Policy
"Unlimited Plans" adopt a "Monthly Token Quota + Excess Basic Assurance" model. Within the monthly Token quota, you can enjoy the high-speed inference service corresponding to that plan; once Token consumption reaches the monthly cap, the account will automatically enter Basic Assurance Mode, with rate limits adjusted as follows:
| Plan Tier | Monthly Token Quota | Excess RPS | Excess RPM | Excess RPH | Excess RPD |
|---|---|---|---|---|---|
| Unlimited 50M | 50,000,000 Tokens | 1 | 2 | 10 | 50 |
| Unlimited 200M | 200,000,000 Tokens | 1 | 8 | 40 | 200 |
| Unlimited 500M | 500,000,000 Tokens | 1 | 20 | 100 | 500 |
Important Notice:
Basic Assurance Mode only ensures basic API availability and is not suitable for production-grade high-concurrency loads. If you need to restore standard or higher performance, you can upgrade your plan at any time or contact your account manager to customize an Enterprise plan.
Rate Limit Dimensions Explained
| Dimension | Full Name | Meaning | Behavior Upon Trigger |
|---|---|---|---|
| RPS | Requests Per Second | Number of requests per second | Instant rejection upon exceeding |
| RPM | Requests Per Minute | Number of requests per minute | Remaining requests in the current minute window are rejected |
| RPH | Requests Per Hour | Number of requests per hour | Requests beyond the plan tier's hourly limit are rejected for the rest of the current hour window |
| RPD | Requests Per Day | Number of requests per day | Requests beyond the plan tier's daily limit are rejected for the rest of the current day window |
Calculation Logic:
The system counts each limit independently. For example, for an Unlimited 50M user after exceeding the quota, even if their RPD still has remaining quota, once RPM reaches 2, subsequent requests within that minute will also be rejected.
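The independent counting described above can be sketched as follows. This is an illustrative model only, not the platform's actual implementation: it assumes fixed (non-sliding) windows, the class name `MultiWindowLimiter` is hypothetical, and the limits are the Unlimited 50M excess values from the table above.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# Window length in seconds for each dimension (fixed windows are an assumption).
WINDOWS = {"RPS": 1, "RPM": 60, "RPH": 3600, "RPD": 86400}


@dataclass
class MultiWindowLimiter:
    """Hypothetical model: one independent counter per dimension."""
    limits: dict  # e.g. {"RPS": 1, "RPM": 2, "RPH": 10, "RPD": 50}
    counts: dict = field(default_factory=dict)  # (dimension, window index) -> count

    def allow(self, now: float) -> Tuple[bool, Optional[str]]:
        """Return (allowed, name of the dimension that rejected the request)."""
        keys = {dim: (dim, int(now // WINDOWS[dim])) for dim in self.limits}
        # Check every dimension first: a single exhausted counter rejects the request.
        for dim, key in keys.items():
            if self.counts.get(key, 0) >= self.limits[dim]:
                return False, dim
        # In this sketch, only accepted requests consume quota.
        for key in keys.values():
            self.counts[key] = self.counts.get(key, 0) + 1
        return True, None


# Unlimited 50M in Basic Assurance Mode: one request per second at t = 0, 1, 2.
limiter = MultiWindowLimiter({"RPS": 1, "RPM": 2, "RPH": 10, "RPD": 50})
results = [limiter.allow(t) for t in (0.0, 1.0, 2.0)]
# The third request is rejected by RPM even though RPH and RPD still have room.
```

The sketch never evicts old window counters, so a long-running process would need to prune `counts` periodically.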
Over-limit Response and Best Practices
1. HTTP Response Example
When a request triggers the rate limit, the API will return the following response:
HTTP/2 429 Too Many Requests
Content-Type: application/json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry later or upgrade your plan for higher throughput.",
    "type": "request_limit_exceeded"
  }
}
2. Client Best Practices
• Exponential Backoff
After receiving a 429, it is recommended to retry with intervals of 1s → 2s → 4s → 8s → ... Avoid high-frequency hard retries, as they may trigger longer throttling.
• Rate Limit Warm-up
For burst traffic (e.g., batch tasks), avoid instantly maxing out concurrency. Adopt gradual speed increases to allow requests to enter the system smoothly.
• Monitor Usage
You can view Token consumption and current plan usage under Model API KEY → Monthly subscription, so that you can plan upgrades in advance.
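The backoff and warm-up practices above could be implemented along these lines. This is a minimal client-side sketch under stated assumptions: `send` is a hypothetical callable that performs one API request and returns `(status_code, body)`, and the helper names `call_with_backoff` and `ramp_up_concurrency` are illustrative rather than part of any official SDK.

```python
import math
import time
from typing import Any, Callable, List, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, Any]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Retry on HTTP 429, waiting 1s, 2s, 4s, 8s, ... between attempts."""
    status, body = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        # Double the delay on each retry, capped at max_delay.
        sleep(min(base_delay * (2 ** attempt), max_delay))
        status, body = send()
    return status, body


def ramp_up_concurrency(target: int, steps: int) -> List[int]:
    """Concurrency levels that rise linearly to `target` over `steps` stages."""
    return [max(1, math.ceil(target * i / steps)) for i in range(1, steps + 1)]


# Example: a batch job that grows to 8 workers over 4 stages.
stages = ramp_up_concurrency(8, 4)  # [2, 4, 6, 8]
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting. The concurrency target should stay within the RPS and RPM limits of your plan.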