Balancing Model Quality and Hardware Demands in AI Workstations


    Pebble in a Sandbox and Conclusion

    Now that we know what the requirements are, how do they factor into building an AI workstation?

I would like to start with an analogy. Say you have a 10x10 sandbox and someone throws in a pebble. It should be rather easy for you to find the pebble, but after each retrieval the sandbox grows by one unit in each direction, so a 10x10 becomes 12x12. The pebble is cast again and you need to find it. Repeat this several times and the sandbox will eventually grow to 100x100 or larger. The thing is, the pebble remains the same size and you start from the same location, so over time it will take longer and longer to find the pebble.

This analogy is a good way to think about how your hardware interacts with AI models. The parameter count of the model sets the size of the initial sandbox. As you use the model, the amount of sand in the sandbox grows as context and information accumulate, which ultimately requires your system to search through a much larger pool to generate each response. The larger the sandbox, the more information you have available, the better the response can be, and the slower the generation becomes.
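To put rough numbers on the growing sandbox: the memory a model holds for its context (the KV cache) grows linearly with the number of tokens, and every new token has to be checked against all of it. The sketch below uses an assumed 7B-class model shape (32 layers, 32 heads, 128-dim heads, fp16 values); real models vary, so treat the figures as illustrative only.

```python
# Sketch: why responses slow down as context grows.
# Each generated token attends over a key/value cache covering the whole
# context, and that cache grows linearly with context length.
# Model shape below is an ASSUMED 7B-class config, not a specific model.

def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache held for `context_len` tokens (fp16 values)."""
    # 2x for keys and values, cached at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len

for ctx in (2_048, 8_192, 32_768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache")
```

With these assumed numbers the cache runs about half a mebibyte per token, so a long chat session can quietly consume several gigabytes on top of the model weights themselves.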

Quality, memory use, and speed all scale together, with the edge cases, very large models or very long contexts, suffering the most.

To speed things up, we can increase the processing power. This equates to using a larger shovel or a magnifying glass to find the pebble, and much like in games, faster GPUs deliver better performance.

While not directly related to my sandbox analogy, the inference system you use can change your experience considerably. Systems like Llama.cpp and Ollama can split a model between CPU and GPU, allowing you to run larger models at the cost of speed. Larger models are already slow due to their size (the sandbox size), and without a GPU things get multiplicatively worse. Systems built on vLLM are limited to GPU memory only. This requires more GPUs to run larger models, but the extra processing power keeps up and delivers extremely fast performance. However, if you run out of context memory everything will crash, which makes hardware selection extremely important.
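For GPU-only serving, the decisive question is whether weights plus context fit in VRAM. The sketch below is a rough back-of-the-envelope check, not how vLLM actually budgets memory; the 4-bit weight size, the per-token KV figure (from the 7B-class shape assumed earlier), and the flat overhead allowance are all illustrative assumptions.

```python
# Sketch: a rough "will it fit?" check for GPU-only inference.
# All constants are ILLUSTRATIVE ASSUMPTIONS: ~0.5 bytes/param for
# 4-bit weights, a 7B-class KV cost per token, and a flat overhead
# allowance for activations and runtime buffers.

def fits_in_vram(n_params_b, context_len, vram_gib,
                 bytes_per_param=0.5,         # ~4-bit quantized weights
                 kv_bytes_per_token=524_288,  # 2*layers*heads*head_dim*2B
                 overhead_gib=1.5):           # activations, buffers, etc.
    """Return (fits, needed_gib) for a model of n_params_b billion params."""
    weights_gib = n_params_b * 1e9 * bytes_per_param / 2**30
    kv_gib = kv_bytes_per_token * context_len / 2**30
    needed = weights_gib + kv_gib + overhead_gib
    return needed <= vram_gib, needed

ok, needed = fits_in_vram(n_params_b=7, context_len=8_192, vram_gib=12)
print(f"needs ~{needed:.1f} GiB, fits in 12 GiB: {ok}")
```

Under these assumptions a 7B model with an 8K context squeezes into a 12 GiB gaming card, while a 70B model does not come close, which is exactly the VRAM-versus-speed tradeoff discussed above.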

    Conclusion

While we have just scratched the surface of the requirements for running AI models, I think the above is a nice primer for a detailed and more specific hardware discussion. As with most things you will find exceptions to the rule, and it is important to strike a balance between price and performance. For example, in the gaming space a GPU with more VRAM will let you run higher resolutions and thus larger monitors, but using that same GPU for AI models will get you the speed without enough memory to make it useful. Likewise, GPUs designed for this purpose usually have double the amount of memory and command a higher price.

Strangely enough, these GPUs are also slower in an attempt to make them “somewhat” more affordable, and this is where the tradeoffs begin, making the system build more about meeting your goals than about raw performance.

Be sure to check out the next article in this series, where these principles are applied to the AI systems used in the Hardware Asylum labs. They are not cutting edge, but they strike a balance and effectively min/max the needs of the lab while allowing me to accomplish my goals without spending too much. The trick isn’t throwing money at expensive hardware but leveraging the available hardware in the best way and matching that hardware with specific software.

    What might surprise most readers is that the current GPUs are actually three generations old and yet still provide enough value to be useful.