Google Cloud uses quotas to help ensure fairness and reduce spikes in resource use and availability. A quota restricts how much of a Google Cloud resource your Google Cloud project can use. Quotas apply to a range of resource types, including hardware, software, and network components. For example, quotas can restrict the number of API calls to a service, the number of load balancers used concurrently by your project, or the number of projects that you can create. Quotas protect the community of Google Cloud users by preventing the overloading of services. Quotas also help you to manage your own Google Cloud resources.
The Cloud Quotas system does the following:
- Monitors your consumption of Google Cloud products and services
- Restricts your consumption of those resources
- Provides a way to request changes to the quota value
In most cases, when you attempt to consume more of a resource than its quota allows, the system blocks access to the resource, and the task that you're trying to perform fails.
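Because a task that exceeds a quota fails instead of waiting, callers often retry after a delay. The following is a minimal sketch of exponential backoff with jitter. It assumes a hypothetical `QuotaExceededError` that your own client wrapper raises when a call is rejected for quota reasons; neither that exception nor `call_with_backoff` comes from a Google Cloud library.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical error raised by your own wrapper when a quota is exceeded."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff and jitter on quota errors."""
    for attempt in range(max_attempts):
        try:
            return func()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Wait base_delay * 2^attempt seconds, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so that the retry schedule can be inspected in tests without actually waiting.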
Quotas generally apply at the Google Cloud project level. Your use of a resource in one project doesn't affect your available quota in another project. Within a Google Cloud project, quotas are shared across all applications and IP addresses.
Rate limits
The following table lists the rate limits that apply to these models across all regions for the metric `generate_content_input_tokens_per_minute_per_base_model`:
Base model | Tokens per minute |
---|---|
base_model: gemini-1.5-flash | 4M (4,000,000) |
base_model: gemini-1.5-pro | 4M (4,000,000) |
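To stay under a tokens-per-minute limit, a client can keep its own running estimate of recent usage. The sketch below tracks input tokens over a trailing 60-second window. The `TokenBudget` class is hypothetical, and the service's own accounting may use a different window or rounding, so treat it as an approximation rather than a guarantee.

```python
from collections import deque

TOKENS_PER_MINUTE = 4_000_000  # Value from the table above.


class TokenBudget:
    """Tracks input tokens sent during the trailing window (client-side estimate)."""

    def __init__(self, limit=TOKENS_PER_MINUTE, window_seconds=60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._used = 0

    def _expire(self, now):
        # Drop events that have fallen out of the trailing window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def try_consume(self, tokens, now):
        """Record the request and return True if it fits, else return False."""
        self._expire(now)
        if self._used + tokens > self.limit:
            return False
        self._events.append((now, tokens))
        self._used += tokens
        return True
```

Pass a monotonic timestamp such as `time.monotonic()` as `now`; taking the time as an argument keeps the class deterministic and easy to test.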
Quotas by region and model
The requests per minute (RPM) quota applies to a base model and all versions, identifiers, and tuned versions of that model. The following examples show how the RPM quota is applied:
- A request to the base model, `gemini-1.0-pro`, and a request to its stable version, `gemini-1.0-pro-001`, are counted as two requests toward the RPM quota of the base model, `gemini-1.0-pro`.
- A request to each of two versions of a base model, `gemini-1.0-pro-001` and `gemini-1.0-pro-002`, are counted as two requests toward the RPM quota of the base model, `gemini-1.0-pro`.
- A request to a version of a base model, `gemini-1.0-pro-001`, and a request to a tuned version named `my-tuned-chat-model` are counted as two requests toward the RPM quota of the base model, `gemini-1.0-pro`.
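The counting rule in these examples can be sketched as a small tally. This is an illustration only: the `TUNED_MODEL_BASES` lookup is hypothetical, and stripping a trailing three-digit suffix to find the base model is an assumption that fits the identifiers shown here, not a documented naming rule.

```python
import re
from collections import Counter

# Hypothetical lookup: which base model each tuned model was derived from.
TUNED_MODEL_BASES = {"my-tuned-chat-model": "gemini-1.0-pro"}


def base_model_of(model_id):
    """Map a model identifier to the base model whose RPM quota it draws from."""
    if model_id in TUNED_MODEL_BASES:
        return TUNED_MODEL_BASES[model_id]
    # Assumption: versions differ from the base only by a suffix such as "-001".
    return re.sub(r"-\d{3}$", "", model_id)


def count_requests_by_base_model(model_ids):
    """Count requests per base model for a list of requested model identifiers."""
    return Counter(base_model_of(m) for m in model_ids)
```

For example, `count_requests_by_base_model(["gemini-1.0-pro-001", "my-tuned-chat-model"])` attributes both requests to `gemini-1.0-pro`, matching the last example in the list.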
View the quotas in the Google Cloud console
To view the quotas in the Google Cloud console, do the following:
- In the Google Cloud console, go to the IAM & Admin Quotas page.
- Click View Quotas in Console.
- In the Filter field, specify the dimension or metric.
Dimension (model identifier) | Metric (quota identifier for Gemini models) |
---|---|
base_model: gemini-1.5-flash, base_model: gemini-1.5-pro | You can request adjustments in the following: |
All other models | You can adjust only one quota: |
Increase the quotas
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.
RAG Engine quotas
For each service to perform retrieval-augmented generation (RAG) using RAG Engine, the following quotas apply, with the quota measured as requests per minute (RPM).
Service | Quota | Metric |
---|---|---|
RAG Engine data management APIs | 60 RPM | VertexRagDataService requests per minute per region |
RetrievalContexts API | 1,500 RPM | VertexRagService retrieve requests per minute per region |
base_model: textembedding-gecko | 1,500 RPM | Online prediction requests per base model per minute per region per base_model. An additional filter for you to specify is base_model: textembedding-gecko |
Service | Limit | Metric |
---|---|---|
Concurrent ImportRagFiles requests | 3 RPM | VertexRagService concurrent import requests per region |
Maximum number of files per ImportRagFiles request | 10,000 | VertexRagService import rag files requests per region |
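When a corpus has more files than a single ImportRagFiles request accepts, the file list has to be split across several requests. The sketch below shows only the splitting step; `chunk_files` and the example URIs are hypothetical, and no RAG Engine API is called.

```python
MAX_FILES_PER_IMPORT = 10_000  # Limit from the table above.


def chunk_files(file_uris, max_per_request=MAX_FILES_PER_IMPORT):
    """Yield consecutive slices of file_uris, each within the per-request limit."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    for start in range(0, len(file_uris), max_per_request):
        yield file_uris[start:start + max_per_request]
```

Each yielded slice would go into its own import request, and the concurrent-request limit in the same table still bounds how many of those requests can run at once.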
For more rate limits and quotas, see Generative AI on Vertex AI rate limits.
Batch requests
The quotas and limits for batch requests are the same across all regions.
Concurrent batch requests
The following table lists the quotas for the number of concurrent batch requests:
Quota | Value |
---|---|
aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs | 4 |
aiplatform.googleapis.com/gemini_pro_concurrent_batch_prediction_jobs | 4 |
aiplatform.googleapis.com/gemini_flash_concurrent_batch_prediction_jobs | 4 |
Batch request limits
The following table lists the size limit of each batch text generation request.
Model | Limit |
---|---|
gemini-1.5-pro | 50k records |
gemini-1.5-flash | 150k records |
gemini-1.0-pro | 150k records |
gemini-1.0-pro-vision | 50k records |
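As a quick planning aid, the per-request record limits can be turned into the number of batch requests a dataset needs. This is a sketch under the assumption that a dataset can be split evenly at record boundaries; `batch_jobs_needed` is a hypothetical helper, not part of any SDK.

```python
import math

# Record limits per batch text generation request, from the table above.
BATCH_RECORD_LIMITS = {
    "gemini-1.5-pro": 50_000,
    "gemini-1.5-flash": 150_000,
    "gemini-1.0-pro": 150_000,
    "gemini-1.0-pro-vision": 50_000,
}


def batch_jobs_needed(model, record_count):
    """Return how many batch requests are needed to cover record_count records."""
    if record_count < 0:
        raise ValueError("record_count must not be negative")
    limit = BATCH_RECORD_LIMITS[model]  # Raises KeyError for models not listed.
    return math.ceil(record_count / limit)
```

For instance, 120,000 records for `gemini-1.5-pro` need three requests, while the same records fit in a single `gemini-1.5-flash` request.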
Custom-trained model quotas
The following quotas apply to Generative AI on Vertex AI tuned models for a given project and region:
Quota | Value |
---|---|
Restricted image training TPU V3 pod cores per region (supported region: europe-west4) | 64 |
Restricted image training Nvidia A100 80GB GPUs per region (supported regions: us-central1, us-east4) | 8 (us-central1), 2 (us-east4) |
Text embedding limits
Each text embedding model request can have up to 250 input texts (generating 1 embedding per input text) and 20,000 tokens per request. Only the first 2,048 tokens in each input text are used to compute the embeddings.
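These limits suggest grouping input texts into request-sized batches before calling the embedding model. The following is a minimal sketch of such grouping. The `count_tokens` function is a stand-in that counts whitespace-separated words, which is an assumption and not the model's tokenizer, and the sketch conservatively counts each text's full length toward the per-request total.

```python
MAX_TEXTS_PER_REQUEST = 250
MAX_TOKENS_PER_REQUEST = 20_000


def count_tokens(text):
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def batch_texts(texts, token_counter=count_tokens):
    """Group texts into batches that respect the per-request text and token limits."""
    batches, current, current_tokens = [], [], 0
    for text in texts:
        tokens = token_counter(text)
        if tokens > MAX_TOKENS_PER_REQUEST:
            raise ValueError("a single text exceeds the per-request token limit")
        full = len(current) == MAX_TEXTS_PER_REQUEST
        over_budget = current_tokens + tokens > MAX_TOKENS_PER_REQUEST
        if current and (full or over_budget):
            # Close the current batch and start a new one for this text.
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Swap in a real token counter through the `token_counter` parameter for accurate budgeting. Texts longer than 2,048 tokens still fit in a batch, but only their first 2,048 tokens contribute to the embedding.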
Gen AI evaluation service quotas
The Gen AI evaluation service uses `gemini-1.5-pro` as a judge model, and uses mechanisms to ensure consistent and objective evaluation for model-based metrics. A single evaluation request for a model-based metric might result in multiple underlying requests to the Gen AI evaluation service. Each model's quota is calculated on a per-project basis, which means that any requests directed to `gemini-1.5-pro` for model inference and model-based evaluation contribute to the quota. Quotas differ from model to model. The quota for the Gen AI evaluation service and the quota for the underlying autorater model are shown in the table.
Request quota | Default quota |
---|---|
Gen AI evaluation service requests per minute | 1,000 requests per project per region |
Online prediction requests per minute for base_model: gemini-1.5-pro | See Quotas by region and model. |
Limit | Value |
---|---|
Gen AI evaluation service request timeout | 60 seconds |
Pipeline evaluation quotas
If you receive an error related to quotas while using the evaluation pipelines service, you might need to file a quota increase request. See View and Manage Quotas for more information. The evaluation pipelines service uses Vertex AI Pipelines to run `PipelineJobs`. See relevant quotas for Vertex AI Pipelines. The following are general quota recommendations:
Service | Quota | Recommendation |
---|---|---|
Vertex AI API | Concurrent LLM batch prediction jobs per region | Pointwise: 1 * num_concurrent_pipelines; Pairwise: 2 * num_concurrent_pipelines |
Vertex AI API | Evaluation requests per minute per region | 1000 * num_concurrent_pipelines |
Tasks | Quota | Base model | Recommendation |
---|---|---|---|
summarization, question_answering | Online prediction requests per base model per minute per region per base_model | text-bison | 60 * num_concurrent_pipelines |
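The recommendations in both tables are simple multiples of `num_concurrent_pipelines`, so they can be computed ahead of a quota increase request. In the sketch below, the function name and dictionary keys are hypothetical labels chosen for readability; they are not quota identifiers.

```python
def recommended_quotas(num_concurrent_pipelines, pairwise=False):
    """Return the recommended quota values for a number of concurrent pipelines."""
    if num_concurrent_pipelines < 1:
        raise ValueError("num_concurrent_pipelines must be at least 1")
    # Pairwise evaluation needs two batch prediction jobs per pipeline, pointwise one.
    batch_jobs_per_pipeline = 2 if pairwise else 1
    return {
        "concurrent_llm_batch_prediction_jobs": batch_jobs_per_pipeline * num_concurrent_pipelines,
        "evaluation_requests_per_minute": 1000 * num_concurrent_pipelines,
        # Applies to the summarization and question_answering tasks on text-bison.
        "text_bison_online_prediction_requests_per_minute": 60 * num_concurrent_pipelines,
    }
```

For three concurrent pairwise pipelines, this yields 6 concurrent batch prediction jobs, 3,000 evaluation requests per minute, and 180 online prediction requests per minute.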
Vertex AI Pipelines
Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.
Vertex AI Reasoning Engine
The following quotas and limits apply to Vertex AI Reasoning Engine for a given project in each region.
Quota | Value |
---|---|
Create/Delete/Update Reasoning Engine per minute | 10 |
Query Reasoning Engine per minute | 60 |
Maximum number of Reasoning Engine resources | 100 |
What's next
- To learn more about dynamic shared quota, see Dynamic shared quota.
- To learn about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and limits, see Understand quota values and system limits.