Software Choices for a Multi-GPU AI Workstation
Author: Dennis Garcia

Inference Packages
I am sure most everyone has heard of Ollama. It is by far one of the most popular inference and model management packages available and is the basis for almost everything I run in the Hardware Asylum Labs.
At its lowest level, Ollama is a nice wrapper around an inference engine. The engine is based on Llama.cpp and was forked into Ollama's own managed code sometime in 2025. It requires GGUF models, which you can import from SafeTensors files or download directly from the Ollama.com website.
The command line lets you interact with a model and browse the library of models you have downloaded. When you run inference with Ollama you chat at the command line, which is also where you control the different aspects of the application. The real power, however, is the native and OpenAI-compatible endpoints, which you can use to build your own interfaces or integrate Ollama into other projects such as OpenWebUI, N8N and even VSCode. Basically, if a tool understands the OpenAI API, it can interact with Ollama.
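As a sketch of how that integration works, the snippet below sends a chat request to Ollama's OpenAI-compatible endpoint using only the Python standard library. The host, port, and model name are assumptions based on Ollama's defaults; adjust them for your setup.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default port 11434 and a model
# such as "gemma3" has already been pulled.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(model: str, prompt: str) -> str:
    """POST the payload to Ollama's OpenAI-compatible endpoint and
    return the assistant's reply text."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `chat("gemma3", "Hello")` returns the model's reply; because the endpoint speaks the OpenAI wire format, the official OpenAI client libraries work here as well by pointing their base URL at the same address.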
In the lab I have set up Ollama with my favorite models, including Gemma3, Qwen, Llama and GLM. Depending on my desired quantization level, I have three basic choices.
- Download the model from Ollama.com. This works well for the 30B models that I want to run at Q4.
- Import the model from SafeTensors. This works great for models I want to run at Q8 and requires building a modelfile to define some basic parameters for how the model should act by default. Parameters such as context length and temperature can be changed after the model loads but, if they are configured in the modelfile, I won't need to change anything after the fact.
- Import the model from GGUF files. This option can get a little technical depending on the desired outcome and quantization level. For instance, you can download pre-quantized versions from Huggingface, build a local modelfile and do the import. Or, if you are like me and cherish your bandwidth, download the FP16 SafeTensors version and use Llama.cpp to generate the GGUF version at the desired quantization level. This is really the only option if you want to import a Q6 or Q5 version of a model into Ollama.
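For the import options above, a minimal modelfile might look like the following. The model filename and parameter values here are examples, not recommendations; check the Ollama modelfile documentation for the full parameter list.

```
# Example modelfile — filename and values are illustrative
FROM ./glm-q6_k.gguf
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
```

The model is then imported with something like `ollama create my-glm -f Modelfile`. For the bandwidth-saving route, llama.cpp ships a `convert_hf_to_gguf.py` script that turns the SafeTensors download into an FP16 GGUF, and its `llama-quantize` tool then produces the Q6 or Q5 file referenced by `FROM` (script and tool names reflect recent llama.cpp builds and may differ in older releases).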
Llama.cpp is a very popular inference engine with an extremely good memory management system, allowing users to run AI models on anything from a laptop to a multi-GPU server.
On low-powered machines without a GPU, the inference engine runs on the CPU and uses system DRAM to hold the model weights and KV cache. Inference is painfully slow in this configuration but can still accomplish the task at hand.
On systems with a GPU, it prefers to load the model into VRAM and deliver some really good performance, provided the model fits. If you have more than one GPU and need more VRAM, it will spread the model across all available GPUs. If you need even more memory, it will spill the excess into DRAM. Sadly, when DRAM is used, overall inference slows down considerably because that portion runs on the CPU.
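The placement behavior described above can be illustrated with some back-of-the-envelope arithmetic. This is a simplified sketch, not llama.cpp's actual allocator: it treats all VRAM as one pool and spills whatever does not fit into DRAM, and the sizes are made-up examples.

```python
def plan_placement(model_gb: float, kv_cache_gb: float,
                   vram_per_gpu_gb: float, n_gpus: int):
    """Split a model between pooled VRAM and DRAM overflow.

    Simplified sketch: total VRAM across all GPUs is treated as a
    single pool, and anything that does not fit spills into DRAM,
    where it runs on the CPU.
    """
    total_needed = model_gb + kv_cache_gb
    total_vram = vram_per_gpu_gb * n_gpus
    on_gpu = min(total_needed, total_vram)   # fill VRAM first
    in_dram = total_needed - on_gpu          # the slow overflow
    return on_gpu, in_dram


# A 40 GB model plus 8 GB of KV cache on two 24 GB GPUs:
# the 48 GB load just fits in the 48 GB VRAM pool, nothing spills.
print(plan_placement(40, 8, 24, 2))   # (48, 0)

# The same load on a single 24 GB GPU spills 24 GB into DRAM,
# which is where the inference slowdown comes from.
print(plan_placement(40, 8, 24, 1))   # (24, 24)
```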
For my AI Workstation, I currently use Llama.cpp only for GGUF model conversions, though I have it set up for use with Augment Toolkit when it comes time to do dataset generation.
By now most everyone has heard of ChatGPT, and OpenWebUI is a clone of that interface that you can host locally. This is the interface I use, and it is really quite powerful.
On the surface, OpenWebUI is just a web application that sends messages to an inference engine and displays the responses in the browser. Several applications do similar things, but I find the additional features of OpenWebUI rather compelling.
Within the web UI you can select a model to chat with. This can be a local Ollama model or even a cloud-service model, which is one of the primary reasons for using this app. It also keeps track of your previous discussions, allowing you to re-read what was discussed or even continue where you left off.
The system also supports tool usage and has a built-in RAG system, user permissions, personalization and much more. If you choose to use this app, I would strongly recommend the Docker option, as it makes it extremely easy to install and keep updated.
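As a sketch of that Docker route, a minimal Compose file might look like the following. The image tag, ports, and volume name follow the project's published defaults at the time of writing; verify them against the current OpenWebUI documentation before use.

```yaml
# Example docker-compose.yaml for OpenWebUI — values follow the
# project's published defaults and may change between releases.
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                      # browse to http://localhost:3000
    volumes:
      - open-webui:/app/backend/data     # persists chats, users, settings
    restart: unless-stopped

volumes:
  open-webui:
```

With this in place, `docker compose up -d` starts the app, and `docker compose pull` followed by a restart is all it takes to stay updated.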

