Let’s say that your Gradio demo goes viral on social media — you have lots of users trying it out simultaneously, and you want to provide your users with the best possible experience or, in other words, minimize the amount of time that each user has to wait in the queue to see their prediction.
How can you configure your Gradio demo to handle the most traffic? In this Guide, we dive into some of the parameters of Gradio’s
.queue() method as well as some other related parameters, and discuss how to set these parameters in a way that allows you to serve lots of users simultaneously with minimal latency.
This is an advanced guide, so make sure you know the basics of Gradio already, such as how to create and launch a Gradio Interface. Most of the information in this Guide is relevant whether you are hosting your demo on Hugging Face Spaces or on your own server.
By default, every Gradio demo includes a built-in queuing system that scales to thousands of requests. When a user of your app submits a request (i.e. submits an input to your function), Gradio adds the request to the queue, and requests are processed in order, generally speaking (this is not exactly true, as discussed below). When the user’s request has finished processing, the Gradio server returns the result to the user using server-sent events (SSE). The SSE protocol has several advantages over simply using HTTP POST requests:
(1) They do not time out — most browsers raise a timeout error if they do not get a response to a POST request after a short period of time (e.g. 1 min). This can be a problem if your inference function takes longer than 1 minute to run or if many people are trying out your demo at the same time, resulting in increased latency.
(2) They allow the server to send multiple updates to the frontend. This means, for example, that the server can send a real-time ETA of how long your prediction will take to complete.
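To make the second point concrete, here is a minimal sketch of a server streaming several updates over a single SSE connection. The framing (`event:` and `data:` lines followed by a blank line) is the standard SSE text format, but the event names and payload fields below are hypothetical and are not Gradio's actual wire format.

```python
import json

def sse_message(event, payload):
    """Format one server-sent event: an event name, a JSON data line, and a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"

def progress_stream(queue_position, seconds_per_request):
    """Yield one ETA update per queue position, then a final result message."""
    for position in range(queue_position, 0, -1):
        eta = position * seconds_per_request
        # Hypothetical "estimation" event carrying the user's place in line.
        yield sse_message("estimation", {"position": position, "eta_seconds": eta})
    # Hypothetical "complete" event carrying the prediction.
    yield sse_message("complete", {"output": "prediction goes here"})

messages = list(progress_stream(queue_position=3, seconds_per_request=2.0))
for message in messages:
    print(message, end="")
```

A single POST request could only return the final message. With SSE, the same connection carries every intermediate update as well.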
To configure the queue, simply call the .queue() method before launching an Interface, ChatInterface, or any Blocks. Here’s an example:
import gradio as gr
app = gr.Interface(lambda x: x, "image", "image")
app.queue()  # <-- Sets up a queue with default parameters
app.launch()
How Requests are Processed from the Queue
When a Gradio server is launched, a pool of threads is used to execute requests from the queue. By default, the maximum size of this thread pool is
40 (which is the default inherited from FastAPI, on which the Gradio server is based). However, this does not mean that 40 requests are always processed in parallel from the queue.
Instead, Gradio uses a single-function-single-worker model by default. This means that each worker thread is only assigned a single function from among all of the functions that could be part of your Gradio app. This ensures that you do not see, for example, out-of-memory errors due to multiple workers calling a machine learning model at the same time. Suppose you have 3 functions in your Gradio app: A, B, and C. And you see the following sequence of 7 requests come in from users using your app:
1 2 3 4 5 6 7
-------------
A B A A C B A
Initially, 3 workers will get dispatched to handle requests 1, 2, and 5 (corresponding to functions A, B, and C). As soon as any of these workers finishes, it will start processing the next request in the queue for the same function, e.g. the worker that finished processing request 1 will start processing request 3, and so on.
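The dispatch order described above can be reproduced with a small toy model. This is only a sketch, not Gradio's scheduler: it assumes every request takes exactly one time step and ignores the overall thread pool size. The helper name `dispatch_rounds` is made up for this illustration.

```python
from collections import defaultdict, deque

def dispatch_rounds(functions, concurrency_limit=1):
    """Group request numbers into rounds, assuming each request takes one time step.

    `functions` lists the function name of each incoming request, in arrival order.
    At most `concurrency_limit` requests per function run in the same round.
    """
    pending = defaultdict(deque)
    for number, fn in enumerate(functions, start=1):
        pending[fn].append(number)

    rounds = []
    while any(pending.values()):
        current = []
        for fn in pending:
            waiting = pending[fn]
            # Take the oldest waiting requests for this function, up to the limit.
            for _ in range(min(concurrency_limit, len(waiting))):
                current.append(waiting.popleft())
        rounds.append(sorted(current))
    return rounds

rounds = dispatch_rounds(list("ABAACBA"))
print(rounds)  # [[1, 2, 5], [3, 6], [4], [7]]

rounds_limit_2 = dispatch_rounds(list("ABAACBA"), concurrency_limit=2)
print(rounds_limit_2)  # [[1, 2, 3, 5, 6], [4, 7]]
```

With one worker per function, the four requests for A run one after another, so the sequence needs four rounds. Allowing two workers per function cuts this to two rounds.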
If you want to change this behavior, there are several parameters that can be used to configure the queue and help reduce latency. Let’s go through them one-by-one.
The first parameter we will explore is the
default_concurrency_limit parameter in
queue(). This controls how many workers can execute the same event. By default, this is set to 1, but you can set it to a higher integer such as 10, or even to None (in which case there is no limit besides the total number of available workers).
This is useful, for example, if your Gradio app does not call any resource-intensive functions. If your app only queries external APIs, then you can set the
default_concurrency_limit much higher. Increasing this parameter can linearly multiply the capacity of your server to handle requests.
So why not set this parameter much higher all the time? Keep in mind that since requests are processed in parallel, each request will consume memory to store the data and weights for processing. This means that you might get out-of-memory errors if you increase the
default_concurrency_limit too high. You may also start to get diminishing returns if the
default_concurrency_limit is too high because of costs of switching between different worker threads.
Recommendation: Increase the
default_concurrency_limit parameter as high as you can while you continue to see performance gains or until you hit memory limits on your machine. You can read about Hugging Face Spaces machine specs here.
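To see why a higher limit helps with lightweight, I/O-bound functions, here is a standard-library sketch that does not use Gradio at all. A semaphore plays the role of the concurrency limit and `time.sleep` stands in for a call to an external API. The function name `run_with_limit` and all the numbers are illustrative assumptions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_with_limit(num_requests, concurrency_limit, seconds_per_request=0.05):
    """Run fake I/O-bound requests, never more than `concurrency_limit` at once."""
    gate = threading.Semaphore(concurrency_limit)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def handle(request_id):
        with gate:  # wait for a free slot, like a request waiting in the queue
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(seconds_per_request)  # stand-in for an external API call
            with lock:
                state["running"] -= 1
        return request_id

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        results = list(pool.map(handle, range(num_requests)))
    elapsed = time.perf_counter() - start
    return results, state["peak"], elapsed

results, peak, elapsed = run_with_limit(num_requests=8, concurrency_limit=4)
print(f"peak concurrency: {peak}, total time: {elapsed:.2f}s")
```

With a limit of 4, the 8 requests finish in roughly two batches rather than eight sequential steps. The same experiment with a function that loads model weights would instead multiply memory use by the limit, which is the trade-off the recommendation describes.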
You can also set the number of requests that can be processed in parallel for each event individually. These per-event values take priority over the default_concurrency_limit parameter described previously.
To do this, set the concurrency_limit parameter of any event listener, e.g. btn.click(..., concurrency_limit=20), or in the Interface or ChatInterface classes, e.g. gr.Interface(..., concurrency_limit=20). By default, this parameter is set to the global default_concurrency_limit.
If you have maxed out the default_concurrency_limit and you’d like to further increase the number of requests that should be processed in parallel, you can increase the number of threads that can process requests from the queue.
You do this by setting the max_threads parameter in the launch() method. (The default value is 40.)
A more blunt way to reduce the wait times is simply to prevent too many people from joining the queue in the first place. You can set the maximum number of requests that the queue processes using the
max_size parameter of
queue(). If a request arrives when the queue is already of the maximum size, it will not be allowed to join the queue and instead, the user will receive an error saying that the queue is full and to try again. By default,
max_size=None, meaning that there is no limit to the number of users that can join the queue.
Paradoxically, setting a
max_size can often improve user experience because it prevents users from being dissuaded by very long queue wait times. Users who are more interested and invested in your demo will keep trying to join the queue, and will be able to get their results faster.
Recommendation: For a better user experience, set a
max_size that is reasonable given your expectations of how long users might be willing to wait for a prediction.
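The effect of a bounded queue can be illustrated with Python's standard `queue.Queue`, which rejects new items once `maxsize` is reached. This is only an analogy for what `max_size` does in `queue()`, not Gradio's implementation, and the helper name `try_join` is made up for this sketch.

```python
import queue

def try_join(request_queue, request):
    """Return True if the request joined the queue, False if the queue was full."""
    try:
        request_queue.put_nowait(request)
        return True
    except queue.Full:
        # In a real app, this is where the user would be told to try again later.
        return False

pending = queue.Queue(maxsize=3)  # plays the role of max_size=3
outcomes = [try_join(pending, f"request-{i}") for i in range(5)]
print(outcomes)  # [True, True, True, False, False]
```

The first three requests are accepted and the last two are turned away immediately, so nobody waits behind a queue longer than the size you chose.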
Another way to increase the parallelism of your Gradio demo is to write your function so that it can accept batches of inputs. Most deep learning models can process batches of samples more efficiently than processing individual samples.
If you write your function to process a batch of samples, Gradio will automatically batch incoming requests together and pass them into your function as a batch of samples. You need to set batch to True (by default it is False) and set a max_batch_size (by default it is 4) based on the maximum number of samples your function is able to handle. These two parameters can be passed into gr.Interface() or to an event in Blocks such as .click().
While batching is conceptually similar to having workers process requests in parallel, it is often faster than raising the default_concurrency_limit for deep learning models. The downside is that you might need to adapt your function a little bit to accept batches of samples instead of individual samples.
Here’s an example of a function that does not accept a batch of inputs — it processes a single input at a time:
def trim_words(word, length):
    return word[:int(length)]
Here’s the same function rewritten to take in a batch of samples:
def trim_words(words, lengths):
    trimmed_words = []
    for w, l in zip(words, lengths):
        trimmed_words.append(w[:int(l)])
    return [trimmed_words]  # one list per output component
The second function can be used with batch=True and an appropriate max_batch_size parameter.
Recommendation: If possible, write your function to accept batches of samples, and then set batch to True and the max_batch_size as high as possible based on your machine’s memory limits.
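The sketch below shows what a batched function receives and returns, without starting a Gradio server. It assumes the batching convention described above: one list per input and one list per output component. The `waiting` list simulates three queued requests and the name `trim_word` for the single-sample variant is introduced only for comparison.

```python
def trim_word(word, length):
    """Single-sample version: trim one word to `length` characters."""
    return word[:int(length)]

def trim_words(words, lengths):
    """Batched version: receives one list per input, returns one list per output."""
    trimmed_words = [w[:int(l)] for w, l in zip(words, lengths)]
    return [trimmed_words]

# Simulate what a batching queue would do with three waiting requests:
# collect their inputs column by column and call the function once.
waiting = [("gradio", 3), ("performance", 7), ("queue", 2)]
words, lengths = zip(*waiting)
(batched,) = trim_words(list(words), list(lengths))

# The same work done one request at a time, for comparison.
one_by_one = [trim_word(w, l) for w, l in waiting]

print(batched)  # ['gra', 'perform', 'qu']
```

Both approaches give the same results, but the batched version handles all three requests in a single call. For a deep learning model, that single call is where the efficiency gain comes from.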
If you have done everything above, and your demo is still not fast enough, you can upgrade the hardware that your model is running on. Changing the model from running on CPUs to running on GPUs will usually provide a 10x-50x speedup in inference for deep learning models.
It is particularly straightforward to upgrade your Hardware on Hugging Face Spaces. Simply click on the “Settings” tab in your Space and choose the Space Hardware you’d like.
While you might need to adapt portions of your machine learning inference code to run on a GPU (here’s a handy guide if you are using PyTorch), Gradio is completely agnostic to the choice of hardware and will work completely fine if you use it with CPUs, GPUs, TPUs, or any other hardware!
Note: your GPU memory is different from your CPU memory, so if you upgrade your hardware, you might need to adjust the value of the default_concurrency_limit parameter described above.
Congratulations! You know how to set up a Gradio demo for maximum performance. Good luck on your next viral demo!