Deploying a Gradio app with Modal

Introduction

Gradio is a great way to test and demo your machine learning apps using a simple and intuitive Python API. When combined with Modal's developer-first cloud infrastructure, you can leverage powerful GPUs to run larger models faster. And you don't need an account with a cloud provider or any config files.

In this tutorial, we will walk you through setting up a Modal account and deploying a simple Gradio app on Modal, then discuss some of the nuances around Gradio's sticky session requirement and handling concurrency.

Deploying a simple Gradio app on Modal

Let's deploy a Gradio-style "Hello, world" app that lets a user input their name and then responds with a short greeting. We're not going to use this code as-is in our app, but it's useful to see what the initial Gradio version looks like.

import gradio as gr

# A simple Gradio interface for a greeting function
def greet(name):
    return f"Hello {name}!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch()

To deploy this app on Modal you'll need to

define your container image,
wrap the Gradio app in a Modal Function,
and deploy it using Modal's CLI!

Prerequisite: Install and set up Modal

Before you get started, you'll need to create a Modal account if you don't already have one. Then you can set up your environment by authenticating with those account credentials.

Sign up at modal.com.
Install the Modal client in your local development environment.

pip install modal

Authenticate your account.

modal setup

Great, now we can start building our app!

Step 1: Define our `modal.Image`

To start, let's make a new file named gradio_app.py, import modal, and define our image. Modal Images are defined by sequentially calling methods on our Image instance.

For this simple app, we'll

start with the debian_slim image,
choose a Python version (3.12),
and install the dependencies - only fastapi and gradio.

import modal

app = modal.App("gradio-app")
web_image = modal.Image.debian_slim(python_version="3.12").uv_pip_install(
    "fastapi[standard]",
    "gradio",
)

Note that you don't need to install gradio or fastapi in your local environment - only modal is required locally.

Step 2: Wrap the Gradio app in a Modal-deployed FastAPI app

Like many Gradio apps, the example above is run by calling launch() on our demo at the end of the script. However, Modal doesn't run scripts; it runs functions - serverless functions, to be exact.

To get Modal to serve our demo, we can leverage Gradio and Modal's support for fastapi apps. We do this with the @modal.asgi_app() function decorator, which deploys the web app returned by the function. We then use the mount_gradio_app function to add our Gradio demo as a route in the web app.

with web_image.imports():
    import gradio as gr
    from gradio.routes import mount_gradio_app
    from fastapi import FastAPI


@app.function(
    image=web_image,
    max_containers=1,  # we'll come to this later
)
@modal.concurrent(max_inputs=100)  # allow multiple users at one time
@modal.asgi_app()
def ui():
    """A simple Gradio interface for a greeting function."""

    def greet(name):
        return f"Hello {name}!"

    demo = gr.Interface(fn=greet, inputs="text", outputs="text")

    # Mount the Gradio demo at the root path of a FastAPI app and return it
    return mount_gradio_app(app=FastAPI(), blocks=demo, path="/")

Let's quickly review what's going on here:

We use the Image.imports context manager to define our imports. These will be available when your function runs in the cloud.
We move our code inside a Python function, ui, and decorate it with @app.function which wraps it as a Modal serverless Function. We provide the image and other parameters (we'll cover this later) as inputs to the decorator.
We add the @modal.concurrent decorator which allows multiple requests per container to be processed at the same time.
We add the @modal.asgi_app decorator which tells Modal that this particular function is serving an ASGI app (here a fastapi app). To use this decorator, your ASGI app needs to be the return value from the function.

Step 3: Deploying on Modal

To deploy the app, just run the following command:

modal deploy <path-to-file>

The first time you run your app, Modal will build and cache the image, which takes about 30 seconds. As long as you don't change the image, subsequent deployments will only take a few seconds.

After the image builds, Modal will print the URL to your webapp and to your Modal dashboard. The webapp URL should look something like https://{workspace}-{environment}--gradio-app-ui.modal.run. Paste it into your web browser and try out your app!

Important Considerations

Sticky Sessions

Modal Functions are serverless, which means that each client request is considered independent. While this facilitates autoscaling, it also means that extra care is needed if your application requires any sort of server-side statefulness.

Gradio relies on a REST API, which is itself stateless. But it does require sticky sessions, meaning that every request from a particular client must be routed to the same container. However, Modal does not make any guarantees in this regard.

A simple way to satisfy this constraint is to set max_containers = 1 in the @app.function decorator and set the max_inputs argument of @modal.concurrent to a fairly large number - as we did above. This means that Modal won't spin up more than one container to serve requests to your app, which effectively satisfies the sticky session requirement.

Concurrency and Queues

Both Gradio and Modal have concepts of concurrency and queues, and getting the most out of your compute resources requires understanding how these interact.

Modal queues client requests to each deployed Function and simultaneously executes requests up to the concurrency limit for that Function. If requests come in and the concurrency limit has already been reached, Modal will spin up a new container - up to the maximum set for the Function. In our case, our Gradio app is represented by one Modal Function, so all requests share one queue and concurrency limit. Therefore, Modal constrains the total number of requests running at one time, regardless of what they are doing.
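
To make the arithmetic above concrete, here is a rough sketch of how a concurrency limit and a container cap bound the number of requests running at once. It is a simplified model for illustration only: the function name modal_capacity is hypothetical, and it ignores cold starts, autoscaling delays, and anything else Modal's scheduler actually does.

```python
import math


def modal_capacity(in_flight: int, max_inputs: int, max_containers: int) -> dict:
    """Simplified model (an assumption, not Modal's scheduler).

    Estimates how many requests run and how many wait for a single
    Function, given its per-container concurrency limit and container cap.
    """
    if in_flight == 0:
        containers = 0
    else:
        # Enough containers to cover the load, but never more than the cap
        containers = min(max_containers, math.ceil(in_flight / max_inputs))
    running = min(in_flight, containers * max_inputs)
    return {
        "containers": containers,
        "running": running,
        "queued": in_flight - running,
    }


# The settings used in this tutorial: one container, up to 100 inputs
print(modal_capacity(150, max_inputs=100, max_containers=1))
# The same load if three containers were allowed
print(modal_capacity(150, max_inputs=100, max_containers=3))
```

Under this model, the tutorial's settings (max_containers = 1, max_inputs = 100) leave 50 of 150 simultaneous requests waiting, while a cap of three containers would let all of them run. The trade-off is that more than one container breaks the sticky session workaround described above.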

Gradio, on the other hand, allows developers to utilize multiple queues, each with its own concurrency limit. One or more event listeners can then be assigned to a queue, which is useful for managing GPU resources for computationally expensive requests.
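
As a minimal sketch of Gradio's side of this, the example below assigns two event listeners to a shared queue using the concurrency_id and concurrency_limit arguments of Gradio event listeners. The task functions, the button labels, and the queue name "gpu" are hypothetical placeholders. Gradio is imported inside build_demo so that, as in the rest of this tutorial, it only needs to be installed where the app actually runs.

```python
def slow_gpu_task(prompt: str) -> str:
    """Stand-in for an expensive, GPU-bound call (hypothetical)."""
    return f"Generated: {prompt}"


def quick_task(text: str) -> str:
    """Stand-in for a cheap, CPU-only call (hypothetical)."""
    return text.upper()


def build_demo():
    import gradio as gr  # imported lazily; only required where the app runs

    with gr.Blocks() as demo:
        box = gr.Textbox(label="Input")
        out = gr.Textbox(label="Output")
        gen_btn = gr.Button("Generate")
        regen_btn = gr.Button("Regenerate")
        upper_btn = gr.Button("Uppercase")

        # Both listeners share the "gpu" queue, so at most one of these
        # events is processed at a time across the two buttons.
        gen_btn.click(slow_gpu_task, box, out,
                      concurrency_limit=1, concurrency_id="gpu")
        regen_btn.click(slow_gpu_task, box, out,
                        concurrency_limit=1, concurrency_id="gpu")

        # The cheap task gets its own, much higher limit.
        upper_btn.click(quick_task, box, out, concurrency_limit=20)

    return demo
```

Inside the Modal function from Step 2, the object returned by build_demo() could be passed to mount_gradio_app in place of the gr.Interface demo. Note that Modal's max_inputs still caps the total number of requests the container accepts, whatever the Gradio limits are.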

Thinking carefully about how these queues and limits interact can help you optimize your app's performance and resource utilization while avoiding unwanted results like shared or lost state.

Creating a GPU Function

Another option for managing GPU utilization is to deploy your GPU computations in their own Modal Function and call this remote Function from inside your Gradio app. This allows you to take full advantage of Modal's serverless autoscaling while routing all of the client HTTP requests to a single Gradio CPU container.
