Balancing Model Quality and Hardware Demands in AI Workstations

    Model Weights and Quantization

    Model weights are the learned parameters of the neural network that makes up the AI model. Weight memory scales linearly with parameter count, and precision is the numeric format used to store each parameter. A quantized model uses a lower-precision format, trading some accuracy for a smaller model size.

    The math to determine memory requirements is rather simple.

    Parameter Count x Bytes per Parameter = Total Memory

    Precision Format               Bytes per Parameter   Ex: 8B Model
    FP32 (full precision)          4 bytes               32 GB
    FP16 / BF16 (half precision)   2 bytes               16 GB
    INT8 (8-bit quantization)      1 byte                8 GB
    INT4 (4-bit quantization)      0.5 bytes             4 GB
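The formula above can be sketched in a few lines of Python; the bytes-per-parameter values come straight from the table:

```python
# Estimate model weight memory: parameter count x bytes per parameter.
BYTES_PER_PARAM = {
    "FP32": 4.0,   # full precision
    "FP16": 2.0,   # half precision (same size as BF16)
    "INT8": 1.0,   # 8-bit quantization
    "INT4": 0.5,   # 4-bit quantization
}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB for a given precision format."""
    return params_billions * BYTES_PER_PARAM[precision]

# An 8B-parameter model at each precision, matching the table:
for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_memory_gb(8, fmt):g} GB")
```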


    Most open-source models are downloaded in FP16 format, while FP32 is used for production models such as Gemini-3. INT4 is the most common precision for hobbyist use, as it strikes a balance for the consumer space, where high precision is often not required.

    Quantizing an AI model makes it physically smaller, allowing it to run on less capable hardware. The DeepSeek R1 example on a Mac Mini was heavily quantized, making it not only extremely slow but also not very useful.

    Unlike memory, which shrinks linearly with precision, model quality degrades non-linearly: at precisions below INT4 you begin to notice an increase in perplexity divergence, a drift between the model's output and the expected result.

    Metric   Example chance of token   Loss
    FP16     12.1234567890123456%      0%
    Q8       12.12345678%              0.06%
    Q6       12.123456%                0.1%
    Q5       12.12345%                 0.3%
    Q4       12.123%                   1%
    Q3       12.12%                    3.7%
    Q2       12.1%                     8.2%
    Q1       12%                       70%

    KV Cache

    KV Cache is a hidden memory allocation that serves as the “recall memory” for the LLM. As you feed information to the LLM, it is stored in the KV cache and acts as short-term memory for the model. Most of us simply call this “context”; I call it hidden memory because this value can be manipulated to make certain models run on systems that only have enough memory to hold the model weights.

    The math to determine the KV Cache is a little more complex.

    KV Cache per Token: 2 × Layers × KV Heads × Head Dimension × Bytes per Element

    Multiply the per-token value by the context (cache) length to get the total. For instance, Llama 3.1 8B has a context (cache) length of 128K, and using the full context length requires 16GB of memory at FP16. As with model weights, this requirement is reduced by quantization: at INT8 it needs only 8GB.
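The per-token formula can be sketched as follows. The architecture numbers for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) are taken from its published model configuration, not from the text above:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_element: int, context_tokens: int) -> int:
    """Total KV cache: 2 (K and V) x layers x KV heads x head dim x bytes,
    multiplied by the number of tokens held in context."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens

# Llama 3.1 8B at FP16 (2 bytes per element) with the full 128K context:
total = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                       bytes_per_element=2, context_tokens=128 * 1024)
print(f"{total / 2**30:.0f} GiB")  # 16 GiB
```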

    This is where it gets interesting. Most KV cache implementations are not quantized, so to save memory you reduce the context length instead. At only 16K, the total requirement drops to roughly 2GB at FP16.

    The actual KV cache size is highly dependent on the AI model being used, but the rough estimation is very similar to the model weight calculation.

    Activation Memory

    Total memory required for activations depends on whether you are inferencing (using) an LLM or training one. For inference, the requirement is roughly 5-10% of total memory, since only the current layer's activations need to be stored. When training a model, the total expands because every layer's activations must be kept for backpropagation.

    Framework and System Overhead

    This is the memory required by the serving framework, such as Ollama, Llama.cpp, vLLM, or SGLang. These systems need a certain amount of memory just to operate, and this is the one component that is not easily accounted for, due to varying optimizations and the use of system memory for offloading.

    So, why is this important? For an AI model to run, it must fit into the available memory. For example, to run Llama 3.1 8B at INT8 (quantized) with the full 128K context window, you will need a GPU with more than 16GB of VRAM. Depending on the inference framework, this can jump to 24GB if the KV cache is not quantized to match the model.
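Putting the pieces together, a rough fit check looks like the sketch below. The 10% activation fraction and 1.5GB framework overhead are ballpark assumptions drawn from the earlier sections, not fixed values:

```python
def fits_in_vram(weights_gb: float, kv_cache_gb: float, vram_gb: float,
                 activation_frac: float = 0.10,       # inference activations, ~5-10% (assumed)
                 framework_overhead_gb: float = 1.5   # serving framework overhead (assumed)
                 ) -> bool:
    """Rough check: weights + KV cache + activations + framework overhead."""
    total = weights_gb + kv_cache_gb
    total += total * activation_frac
    total += framework_overhead_gb
    return total <= vram_gb

# Llama 3.1 8B at INT8 (8 GB weights) with an INT8 KV cache at 128K (8 GB):
print(fits_in_vram(8, 8, vram_gb=16))  # False: overhead pushes past 16 GB
print(fits_in_vram(8, 8, vram_gb=24))  # True
```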

    If your GPU has only 8GB of VRAM, there is a strong chance the model will fail to load at all. Or, if you are using a framework such as Llama.cpp or Ollama, the overflow model weights will spill into system RAM and be forced onto the CPU for inference. CPUs are well-suited to tasks like spreadsheets, but they lack the parallel processing capability needed for efficient LLM inference.

    Running from the CPU is SLOWWW, and any system running an AI model without an accelerator will witness this firsthand.