Software Choices for a Multi-GPU AI Workstation
Author: Dennis Garcia

Performance and Datagen
Two of the most popular inference engines out there are vLLM and Llama.cpp. Aphrodite Engine is a fork of vLLM built for the PygmalionAI service. It is extremely fast, supporting a number of performance enhancements including continuous batching, paged attention, and caching through LMCache. While the endpoint is OpenAI compatible, it does have a few additional features specific to their platform.
I use this inference engine when I am doing LLM content generation since it is much faster than Ollama and can handle more concurrent connections without slowing down.
To install Aphrodite Engine I would recommend the Docker approach, as it is one of the fastest ways to get started. Even so, expect to search for a while to find the launch command and then spend a few more hours testing out the list of switches. I have also installed Aphrodite Engine as a Python project, and while it can be done, the dependency shuffle can be a real pain to get sorted.
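To save you some of that searching, here is a rough sketch of what a Docker launch can look like. Treat every detail as an assumption to verify against the project's own documentation: the image name, port, volume path, model path, and flags shown here are illustrative placeholders, not a tested command for your setup.

```shell
# Hypothetical launch command -- image tag, port, paths, and flags
# are placeholders; check the Aphrodite Engine docs for current values.
docker run --gpus all \
  -p 2242:2242 \
  -v ~/models:/models \
  alpindale/aphrodite-openai:latest \
  --model /models/your-model \
  --tensor-parallel-size 2
```

The `--tensor-parallel-size` value would match your GPU count on a multi-GPU workstation; on a single card you would drop it or set it to 1.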
Thanks to the OpenAI compatible endpoint you can use Aphrodite Engine with OpenWebUI and OpenClaw, though due to how the KV cache is set up I have discovered that the engine will crash when the cache is full. This makes it a little annoying to use as a daily driver, but it is great for content generation since turn-based inference doesn't run into the caching issue.
Speaking of content generation, most of us will be familiar with normal chat models where you can say "Generate me a long story about three strawberries in a forest." This prompts the LLM to write a fun story until it hits the output token limit. The same is true for article summaries or even code generation: simply save the chat logs and move on.
The difference that Augment Toolkit brings to the table is generating data for model training from existing documentation. That data can then be used to fine-tune LLMs, or even to pre-train models if you have very specific needs.
Fine-tuning an LLM is a rather involved process, but at the highest level it can accomplish two things: inject new knowledge into the LLM, and change how the LLM responds to certain types of input. A good example is creating a model for a law office where you want the model to respond in a legal lexicon. Within the model you are simply changing the weights, which in turn changes how the response is generated.
This is also how you inject new information into the model, with a rather large caveat: you are not actually adding parameters to the LLM, but by constraining how the output is generated you can make it seem like the model knows new things. With modern LLMs this has become less of a concern, since you can make your Llama 3 model talk like a "Gym Bro" simply by writing a system prompt, or teach it about law by combining a prompt with a RAG collection. What fine-tuning does is reduce the hallucination rate by reprogramming the weights for your particular segment.
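The prompt-time versus train-time distinction can be made concrete with two small data shapes. The legal-lexicon example comes from the article; the exact record formats below are assumptions, so match them to whatever your serving stack and training framework actually expect:

```python
# 1) Prompt-time steering: a system prompt changes style with no training.
system_prompt_chat = [
    {"role": "system", "content": "You respond in a formal legal lexicon."},
    {"role": "user", "content": "Summarize this contract clause."},
]

# 2) Train-time steering: a fine-tuning record that, repeated across a
# dataset, bakes the same behavior into the weights themselves.
fine_tune_record = {
    "messages": [
        {"role": "system", "content": "You respond in a formal legal lexicon."},
        {"role": "user", "content": "Summarize this contract clause."},
        {"role": "assistant", "content": "The clause herein provides that..."},
    ]
}
```

The first shape costs nothing but applies only per request; the second requires a training run but persists in the model, which is why it can push the hallucination rate down for a narrow domain.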
This is where Augment Toolkit is beneficial. Using an example from the project, you can create a dataset from Army service manuals. The system extracts the text and chunks it into smaller segments. These segments are then fed to an LLM as prompts to generate related responses, which are compiled into a JSON dataset. The process continues until there is no more source data, and in the end you have training data specific to the source material.
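The extract-chunk-generate-compile loop described above can be sketched as follows. This is not Augment Toolkit's actual code: the character-based chunking and the stand-in `generate()` function are assumptions for illustration, and a real pipeline would chunk on tokens or sentences and call a real LLM:

```python
import json

def chunk_text(text, chunk_size=200):
    """Split extracted text into fixed-size character chunks."""
    # Real pipelines usually chunk on token or sentence boundaries instead.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def generate(chunk):
    """Stand-in for an LLM call that writes a QA pair about one chunk."""
    return {
        "question": f"What does this passage cover? ({chunk[:40]}...)",
        "answer": chunk,
    }

def build_dataset(text):
    """Run every chunk through the generator and compile a JSON dataset."""
    records = [generate(c) for c in chunk_text(text)]
    return json.dumps(records, indent=2)

dataset = build_dataset("Field manual text " * 50)
```

The loop terminates naturally when `chunk_text` runs out of source material, mirroring the "continues until there is no more data" behavior, and the resulting JSON is in the rough shape a fine-tuning run would consume.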
Augment Toolkit is also highly customizable, allowing users to create their own workflows. It is rather complex, though mostly because it is a project from a single developer. That is not a bad thing, but it does require that you "think like them" if you want to get something done. There were also some questionable additions to the latest release that obscure the content generation just enough to make you question what is actually happening.

