Fine-Tuning LLMs on a Local Multi-GPU AI Workstation


    Model Training Software

    Axolotl LLM

    Axolotl LLM is a training application most commonly used in the cloud; it requires only a YAML file to configure a training run.  Because it is open source, you can also run it locally and modify things if needed.
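    As a rough sketch, a minimal Axolotl config might look like the following.  The model ID, dataset path, and hyperparameters here are placeholders for illustration, not a tested recipe:

    ```yaml
    # Minimal Axolotl fine-tuning config (illustrative values only)
    base_model: meta-llama/Llama-3.1-8B   # any Hugging Face model ID
    datasets:
      - path: data/my_dataset.jsonl       # hypothetical local dataset
        type: alpaca                      # prompt format of the dataset
    output_dir: ./outputs/my-finetune

    sequence_len: 2048
    micro_batch_size: 1
    gradient_accumulation_steps: 4
    num_epochs: 3
    learning_rate: 0.0002

    adapter: lora        # LoRA keeps the memory footprint modest
    lora_r: 16
    lora_alpha: 32
    lora_target_linear: true
    ```

    A run is then typically launched with something like "axolotl train config.yml" (or, on older versions, "accelerate launch -m axolotl.cli.train config.yml").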

    There are two challenges when it comes to open source LLM training.  The first is hardware and software demands.  Python is a powerful but fragile language that lets programmers achieve a lot, provided that everything is perfect.  Upgrade the wrong package and you may never get the system working again.  These packages are also tied to specific driver versions, and unlike on Windows there is often no fallback option: it either works or it doesn’t.

    The second issue is that only a small subset of AI professionals do model training, and even fewer are building the software.  As the laws of demand dictate, “if nobody cares, then nobody will work on it.”

    Combined, these create an environment of barely working code mixed with workarounds, because that is what works.  Speaking from experience, installing Axolotl is a real bitch, with a long list of requirements and dependencies that often don’t play nice together or are stuck on older versions and impossible to update.  I have had really good luck using the Docker version, as it removes many of the Python dependencies, but it can still trigger hardware ones, so you’ll need to do some testing.

    Augment Toolkit will generate an Axolotl YAML file, so if you use that tool you can go directly from dataset generation to training in one step.

    DeepSpeed

    DeepSpeed is an interesting project that adds memory management to LLM training.  By default, LLM training is bound by VRAM, so if your training job requires 1.4TB of memory you will need enough GPUs to satisfy that requirement.  Axolotl supports a number of training optimizations, DeepSpeed among them, that let you change the memory requirements.  Sometimes these will accelerate training but, more often than not, they help reduce costs through memory offloading.
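    Axolotl ships ready-made DeepSpeed configs, so enabling one is typically a single line in the training YAML (the exact path may differ by version):

    ```yaml
    # In the Axolotl training YAML, point at one of the bundled DeepSpeed configs
    deepspeed: deepspeed_configs/zero2.json
    ```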

    DeepSpeed’s ZeRO (Zero Redundancy Optimizer) optimizations change the memory dynamic by enabling training on systems with limited GPU memory.  For instance, I can train 4B parameter models on my AI workstation using the 2x RTX A4500 GPUs and the 128GB of system memory.  This process requires about 80GB of memory, and with DeepSpeed I can offload almost everything to DRAM, allowing the software to actively swap weights and tensor information in and out of the GPU during training.
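    A DeepSpeed config that offloads optimizer state and parameters to DRAM might look something like this sketch (the ZeRO-3 keys are standard DeepSpeed options, but the batch and precision settings will depend on your run):

    ```json
    {
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param":     { "device": "cpu", "pin_memory": true }
      },
      "bf16": { "enabled": "auto" },
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto"
    }
    ```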

    ZeRO stage 3 also has an option to offload to an NVMe drive, allowing you to train even larger models on limited hardware.  For instance, an 8B model needs around 150GB of memory to train, but since my AI workstation only has 128GB, that won’t be possible without offloading part of the job to NVMe.
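    Switching the offload device from "cpu" to "nvme" plus a scratch path is, in principle, all it takes; the sketch below assumes a hypothetical scratch directory at /mnt/nvme_offload:

    ```json
    {
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
          "device": "nvme",
          "nvme_path": "/mnt/nvme_offload",
          "pin_memory": true
        },
        "offload_param": {
          "device": "nvme",
          "nvme_path": "/mnt/nvme_offload"
        }
      },
      "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": false,
        "overlap_events": true
      }
    }
    ```

    The aio block tunes DeepSpeed’s asynchronous I/O for the NVMe swapping; expect to test this path carefully before relying on it.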

    Kinda

    Thing is, DeepSpeed is primarily used in the enterprise space for cloud training, and in that environment users will leverage the DRAM offload because NVMe offload is not available and DRAM is faster.  As such, the NVMe offload option has some rather serious bugs that will often crash the training job, and it still requires an extremely high amount of DRAM.  While this works for small models, it does severely limit the types of systems that can do model training and which models are supported.

    On a positive note, DeepSpeed and Axolotl are both actively developed and will often support LLM training on day one when new models are released, along with a support system if bugs are found.