API Rate Limiting & SLA Documentation

Design Principles

To ensure all users enjoy stable and reliable inference services, we adopt a tiered resource isolation and dynamic rate management mechanism:

Fair Use: "Unlimited" plans carry a monthly Token quota. Once an account exceeds its quota, it enters Basic Assurance Mode, so that abnormal load from a single account cannot affect overall platform stability.
Predictable Experience: Requests that exceed the threshold will immediately receive a standard HTTP 429 response, enabling clients to perform graceful degradation.

Rate Limits Overview

1. Standard Plans and Pay-as-you-go

The following plans apply to regular development, production, and commercial calling scenarios. If any single dimension (RPS / RPM / RPH / RPD) reaches its upper limit, the API returns an HTTP 429 (Too Many Requests) response.

Plan Type	RPS (Requests/Second)	RPM (Requests/Minute)	RPH (Requests/Hour)	RPD (Requests/Day)
Free Trial	1	3	—	—
Paid / Recharged Users	1	6	—	—
Enterprise / API Gateway	As per contract	As per contract	As per contract	As per contract

Note:
Rate limits, concurrency capabilities, and timeout policies for dedicated Enterprise plans can be customized per account. Please contact your account manager for an exclusive SLA.

2. Unlimited Plans Fair Use Policy

"Unlimited Plans" adopt a "Monthly Token Quota + Excess Basic Assurance" model. Within the monthly Token quota, you can enjoy the high-speed inference service corresponding to that plan; once Token consumption reaches the monthly cap, the account will automatically enter Basic Assurance Mode, with rate limits adjusted as follows:

Plan Tier	Monthly Token Quota	Excess RPS	Excess RPM	Excess RPH	Excess RPD
Unlimited 50M	50,000,000 Tokens	1	2	10	50
Unlimited 200M	200,000,000 Tokens	1	8	40	200
Unlimited 500M	500,000,000 Tokens	1	20	100	500

Important Notice:
Basic Assurance Mode only ensures basic API availability and is not suitable for production-grade high-concurrency loads. If you need to restore standard or higher performance, you can upgrade your plan at any time or contact your account manager to customize an Enterprise plan.

Rate Limit Dimensions Explained

Dimension	Full Name	Meaning	Behavior Upon Trigger
RPS	Requests Per Second	Number of requests per second	Instant rejection upon exceeding
RPM	Requests Per Minute	Number of requests per minute	Remaining requests in the current minute window are rejected
RPH	Requests Per Hour	Number of requests per hour	Limited to the corresponding tier within the current hour window
RPD	Requests Per Day	Number of requests per day	Limited to the corresponding tier within the current day window

Calculation Logic:
The system counts each limit independently. For example, for an Unlimited 50M user after exceeding the quota, even if their RPD still has remaining quota, once RPM reaches 2, subsequent requests within that minute will also be rejected.
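
The independent counting described above can be sketched as one fixed-window counter per dimension. This is a minimal illustration, not the platform's actual implementation: the class name MultiWindowLimiter, the fixed-window alignment, and the choice not to count rejected requests are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Window length in seconds for each dimension in the table above.
WINDOW_SECONDS = {"RPS": 1, "RPM": 60, "RPH": 3600, "RPD": 86400}


@dataclass
class MultiWindowLimiter:
    """Hypothetical limiter: every dimension is counted independently."""

    limits: dict  # e.g. {"RPS": 1, "RPM": 2, "RPH": 10, "RPD": 50}
    counts: dict = field(default_factory=dict)  # (dimension, window index) -> admitted requests

    def allow(self, now: float) -> bool:
        keys = []
        for dim, limit in self.limits.items():
            key = (dim, int(now // WINDOW_SECONDS[dim]))
            if self.counts.get(key, 0) >= limit:
                # One dimension at its cap is enough to reject the request,
                # even if the other dimensions still have headroom.
                return False
            keys.append(key)
        # Only admitted requests are counted (an assumption of this sketch).
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1
        return True


# Excess limits of the Unlimited 50M tier, taken from the table above.
limiter = MultiWindowLimiter({"RPS": 1, "RPM": 2, "RPH": 10, "RPD": 50})
```

With these limits, a third request inside the same minute is rejected by the RPM counter even though the RPH and RPD counters are far from their caps. The sketch never evicts old windows, so a long-running version would need to prune the counts dictionary.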

Over-limit Response and Best Practices

1. HTTP Response Example

When a request triggers the rate limit, the API will return the following response:

HTTP/2 429 Too Many Requests
Content-Type: application/json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry later or upgrade your plan for higher throughput.",
    "type": "request_limit_exceeded"
  }
}
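
A client might detect this response as in the sketch below. It relies only on the status code and the JSON shape shown above; the function name rate_limit_message and the fallback text are hypothetical, and the status and body are passed in as plain values so that no particular HTTP library is assumed.

```python
import json
from typing import Optional


def rate_limit_message(status: int, body: str) -> Optional[str]:
    """Return the error message for a 429 response, or None for any other status."""
    if status != 429:
        return None
    fallback = "Rate limit exceeded."  # used when the body is missing or malformed
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return fallback
    error = payload.get("error") if isinstance(payload, dict) else None
    if isinstance(error, dict) and isinstance(error.get("message"), str):
        return error["message"]
    return fallback
```

Keying on the 429 status first, and treating the body as optional detail, keeps the client robust if the error payload changes.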

2. Client Best Practices

• Exponential Backoff

After receiving a 429, it is recommended to retry with intervals of 1s → 2s → 4s → 8s → ... Avoid high-frequency hard retries, as they may trigger longer throttling.
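
The retry schedule above can be sketched as follows. The names backoff_delays and call_with_backoff are hypothetical, and the retry cap, maximum delay, and optional jitter are assumptions that this document does not specify. The send argument stands in for whatever function issues the request and returns a (status, body) pair.

```python
import random
import time
from typing import Callable, Iterator, Tuple


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_delay: float = 60.0, max_retries: int = 5) -> Iterator[float]:
    """Yield waits of base, base*factor, ... capped at max_delay."""
    delay = base
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= factor


def call_with_backoff(send: Callable[[], Tuple[int, str]],
                      sleep: Callable[[float], None] = time.sleep,
                      jitter: float = 0.0,
                      **backoff_options) -> Tuple[int, str]:
    """Call send(), retrying only while the status is 429."""
    status, body = send()
    for delay in backoff_delays(**backoff_options):
        if status != 429:
            break
        # Optional random jitter spreads out retries from many clients.
        sleep(delay + random.uniform(0, jitter))
        status, body = send()
    return status, body
```

If every retry is also rejected, the last 429 response is returned to the caller, which can then degrade gracefully instead of retrying indefinitely.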

• Rate Limit Warm-up

For burst traffic (e.g., batch tasks), avoid instantly maxing out concurrency. Adopt gradual speed increases to allow requests to enter the system smoothly.
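
One way to implement a gradual speed increase is a linear ramp of per-minute request budgets, sketched below. The function names, the linear shape, and the default of four steps are illustrative assumptions; the platform does not prescribe a specific ramp.

```python
from typing import List


def ramp_schedule(target_rpm: int, steps: int = 4) -> List[int]:
    """Per-minute request budgets rising linearly to target_rpm."""
    if target_rpm < 1 or steps < 1:
        raise ValueError("target_rpm and steps must be positive")
    # Integer division keeps budgets whole; max(1, ...) avoids a zero budget.
    return [max(1, (target_rpm * i) // steps) for i in range(1, steps + 1)]


def request_spacing(rpm: int) -> float:
    """Seconds to wait between requests so one minute's budget is spread evenly."""
    return 60.0 / rpm
```

A batch job could hold each budget for one minute and space its requests by request_spacing(budget) seconds. The target must stay at or below the plan's RPM limit, and the RPS limit still applies to every individual second.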

• Monitor Usage

You can view Token consumption and current plan usage under Model API KEY → Monthly subscription, so that you can plan upgrades in advance.
