Secure Your Data, Unlock AI: Deploy Open WebUI Locally with Remote Ollama GPU

7 min read
Author: aithemes.net

Introduction

Open WebUI (formerly Ollama WebUI) is an extensible, self-hosted user interface designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs, providing a ChatGPT-style experience with built-in RAG (Retrieval-Augmented Generation) capabilities. Its architecture allows for the separation of the interface (frontend) and the inference engine (backend), enabling optimized resource allocation.

In this post, we walk through a split-architecture deployment: the Open WebUI frontend runs inside a Docker container under WSL (Ubuntu), while inference is handled by a separate machine on the local network equipped with an NVIDIA GeForce RTX 3060 running Ollama. For this setup, we used the mistral-nemo:latest LLM and the embeddinggemma:latest embedding model on the Ollama server.

Running Environment

The setup splits work across two machines, keeping the frontend lightweight while dedicating the GPU entirely to inference:

  1. Frontend Host: A machine running Ubuntu 22.04 under WSL 2, which hosts the Open WebUI Docker container.
  2. Inference Server: External workstation on the local network (LAN) equipped with an NVIDIA GeForce RTX 3060, running the Ollama service.
  3. Network: Gigabit LAN connecting both machines.

Server Configuration: External Ollama Instance

Before deploying the frontend, the external Ollama instance must be configured to accept remote connections. By default, Ollama binds to 127.0.0.1.

  1. Set Environment Variable: On the external machine (Server), configure the OLLAMA_HOST variable to listen on all interfaces.

    • Linux: export OLLAMA_HOST=0.0.0.0 (note that export only affects the current shell; a persistent, service-level setup is sketched after this list)
  2. Firewall Configuration: Ensure port 11434 (default Ollama port) is open on the Server's firewall to allow inbound TCP traffic from the Frontend Host.

  3. Restart Service: Restart the Ollama application to apply changes.
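The exact commands depend on how Ollama is installed on the server. On a Linux machine where Ollama runs as a systemd service and ufw manages the firewall (both assumptions about your setup), steps 1–3 might look like the following sketch:

    # Make OLLAMA_HOST persistent for the systemd-managed service
    # (a plain "export" only affects the current shell session).
    sudo systemctl edit ollama.service
    # In the override file that opens, add:
    #   [Service]
    #   Environment="OLLAMA_HOST=0.0.0.0"

    # Open the default Ollama port for inbound TCP traffic (assumes ufw).
    sudo ufw allow 11434/tcp

    # Reload and restart so the new environment takes effect.
    sudo systemctl daemon-reload
    sudo systemctl restart ollama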

Installation Instructions: Open WebUI

On the Frontend Host (WSL), follow these steps to deploy Open WebUI:

  1. Verify Docker Installation: Ensure Docker is running within your WSL distribution.

    docker --version
    
  2. Execute Run Command: Deploy the container with the OLLAMA_BASE_URL environment variable pointing to your external server. Replace <REMOTE_GPU_IP> with the static IP address of the external machine hosting the Ollama server (a quick connectivity check is sketched after this list).

    docker run -d -p 3000:8080 \
      -e OLLAMA_BASE_URL=http://<REMOTE_GPU_IP>:11434 \
      -v open-webui:/app/backend/data \
      --name open-webui \
      --restart always \
      ghcr.io/open-webui/open-webui:main
    
    • -p 3000:8080: Maps host port 3000 to container port 8080.
    • OLLAMA_BASE_URL: Directs API calls to the remote GPU instance.
    • -v open-webui:/app/backend/data: Persists user data and chat history.
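Before opening the UI, it is worth confirming that the WSL host can actually reach the remote Ollama endpoint. A quick check from the WSL shell, using the same <REMOTE_GPU_IP> as above:

    # The root endpoint should answer "Ollama is running".
    curl http://<REMOTE_GPU_IP>:11434

    # /api/tags should return a JSON list of models installed on the server.
    curl http://<REMOTE_GPU_IP>:11434/api/tags

If these calls fail, revisit the OLLAMA_HOST and firewall settings on the server before troubleshooting the container.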

Accessing Open WebUI from Browser

Once the container is active, access the interface via the browser on your host. Navigate to http://localhost:3000.

The first user to sign up is automatically assigned Admin privileges. Create an account with an email and password to proceed. The interface acts as a gateway, routing all compute-intensive generation tasks to the remote NVIDIA GeForce RTX 3060.
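If the page does not load, the container status and startup logs can be inspected from the WSL shell:

    # Confirm the container is running and port 3000 is mapped.
    docker ps --filter name=open-webui

    # Follow the logs; the first launch can take a moment while the
    # web server and internal database initialize.
    docker logs -f open-webui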


Example Query Execution

We tested latency and end-to-end response handling with the following query:

Query: "Summarize in a table in maximum 5 rows the CUDA core architecture differences between Ampere and Hopper."

Process:

  1. Input: User types query in Open WebUI (WSL).
  2. Routing: Docker container forwards request via HTTP to <REMOTE_GPU_IP>:11434.
  3. Inference: Remote NVIDIA GeForce RTX 3060 processes the prompt.
  4. Output: Token stream is returned to the UI.

The response was generated at ~45 tokens/second (dependent on GPU VRAM and model parameter count), with negligible load on the local WSL host.
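For reference, the same round trip can be reproduced without the UI by calling the Ollama API on the remote server directly, which is handy for separating network issues from UI issues (a sketch; adjust the model and prompt as needed):

    # Stream a completion straight from the remote GPU, bypassing Open WebUI.
    curl http://<REMOTE_GPU_IP>:11434/api/generate -d '{
      "model": "mistral-nemo:latest",
      "prompt": "Summarize the CUDA core architecture differences between Ampere and Hopper."
    }'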


Verifying and Wiring the Ollama Connection

Per the Open WebUI docs, the app will auto-connect to Ollama if it can reach it. To verify and manage the connection:

  1. Navigate to Connection Settings: Go to Admin Settings → Connections → Ollama and click the Manage (wrench) icon.
  2. Confirm the Endpoint: In the Manage screen, ensure the Base URL points to your Ollama host (e.g., http://<REMOTE_GPU_IP>:11434).
  3. Pull and Verify Models:
    • From the Manage Panel: You can download models directly from this screen (or pull them on the server itself, as sketched after this list).
    • From the Chat: A quicker way is to type a model name (e.g., mistral-nemo:latest) into the chat model selector. If the model isn't available locally, Open WebUI will prompt you to download it via Ollama.
  4. Validate: After a successful connection test, the model selector in the chat sidebar should populate with the models available on your remote Ollama server. If not, re-check the Base URL and firewall settings.
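Models can also be pulled and listed directly on the inference server, which is sometimes faster than downloading through the UI:

    # On the remote GPU machine:
    ollama pull mistral-nemo:latest      # chat model used in this post
    ollama pull embeddinggemma:latest    # embedding model used for RAG
    ollama list                          # verify both models appear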

RAG: Adding Documents

Open WebUI provides built-in RAG support. We evaluated this by uploading technical documentation:

  1. Upload Documents:

    • Click the Workspace tab (or the + icon in the chat bar).
    • Select Documents.
    • Upload PDF or text files. The system automatically vectorizes the content.
  2. Collection Management:

    • Group documents into a Collection (e.g., "GPU-Manuals").
    • In a new chat, enable the collection by typing # and selecting the collection name.

Configuring RAG Embeddings

For effective RAG, Open WebUI allows you to specify the embedding model. When using a remote Ollama server, the embedding process also runs on that server, keeping all intensive computation off the client machine.

  1. Navigate to Admin Settings: Access the admin panel by clicking on your profile in the top left and selecting "Settings".
  2. Select Documents Settings: In the admin panel, go to the "Documents" tab.
  3. Configure Embedding Model:
    • Set the Embedding Model Engine to "Ollama".
    • Choose your desired embedding model from the Embedding Model dropdown. A recommended model is embeddinggemma:latest. If you don't have it, you can pull it through Ollama.
  4. Adjust RAG Parameters (Optional): You can also fine-tune the RAG process with the following settings:
    • Chunk Size: The size of the text chunks that documents are broken into.
    • Chunk Overlap: The number of tokens to overlap between chunks.
    • Top K: The number of retrieved chunks to be included in the context.

By configuring these settings, you ensure that the remote Ollama server handles the entire RAG workflow, from embedding to generation.
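To confirm that embeddings are indeed computed on the remote machine, the Ollama embeddings endpoint can be queried directly from the frontend host (a sketch, assuming embeddinggemma:latest has already been pulled; newer Ollama releases also expose /api/embed, which takes "input" instead of "prompt"):

    # Returns an embedding vector computed on the remote GPU.
    curl http://<REMOTE_GPU_IP>:11434/api/embeddings -d '{
      "model": "embeddinggemma:latest",
      "prompt": "test sentence for the RAG pipeline"
    }'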


Extending Capabilities with Tools and Functions

Open WebUI offers robust mechanisms to extend its capabilities through Tools and Functions, allowing for highly customized and powerful AI interactions.

  • Tools: These are Python scripts that empower Large Language Models (LLMs) to perform external actions. This can include web searches, image generation, or fetching real-time data (e.g., weather, stock prices). Tools essentially act as plugins for the LLM, enabling it to go beyond its pre-trained knowledge by interacting with external services. They are managed within the Workspace tabs of the Open WebUI interface.
  • Functions: These are also Python scripts, but they are designed to extend the Open WebUI platform itself. Functions can add support for new AI model providers, customize message processing, or introduce new UI elements. They operate within the Open WebUI environment, offering modular and fast enhancements to the platform's behavior. Administrators typically configure and manage functions through the Admin Panel.

Both Tools and Functions can be easily installed and enabled within the Open WebUI interface, with a vibrant community contributing a variety of options for import.


Conclusion

This distributed setup offers distinct advantages for efficient AI workflows:

  1. Local Network Isolation: The entire Open WebUI and Ollama stack operates within your private local network, so prompts and documents never leave your environment, preserving data privacy and security.
  2. Open-Source Solution: Both Open WebUI and Ollama are open-source projects. Ollama is licensed under the MIT License, while Open WebUI uses a custom license with a branding protection clause. This provides a transparent and community-driven alternative to proprietary solutions.
  3. Centralized Inference: Multiple Open WebUI clients can connect to the same central Ollama instance.
  4. Enhanced Capabilities: Open WebUI offers built-in Retrieval-Augmented Generation (RAG) and customizable Tools support, allowing for extended LLM functionalities and dynamic interaction with external data sources and services.

Open WebUI combined with a remote Ollama backend provides a robust, production-grade interface for local LLM deployment.
