InferenceMax AI benchmark tests software stacks, efficiency, and TCO — vendor-neutral suite runs nightly and tracks performance changes over time

News coverage surrounding artificial intelligence almost invariably focuses on the deals that send hundreds of billions of dollars flying, or the latest hardware developments in the GPU or datacenter world. Benchmarking efforts, though, have almost exclusively focused on the silicon, and that's what SemiAnalysis intends to address with its open-source InferenceMax AI benchmarking suite. It measures the efficiency of the many components of AI software stacks in real-world inference scenarios (when AI models are actually "running" rather than being trained), and publishes the results on the InferenceMax live dashboard.

InferenceMax is released under the Apache 2.0 license and measures the performance of hundreds of AI accelerator hardware and software combinations in a rolling-release fashion, producing new results nightly against the latest software versions. As the project states, existing benchmarks are run at fixed points in time and don't necessarily show what current versions are capable of; nor do they capture the evolution (or even regression) of performance across an entire AI stack of drivers, kernels, frameworks, models, and other components.

Throughput vs. interactivity (Image credit: SemiAnalysis – InferenceMax)

By the old adage of "fast, big, or cheap: pick two," high throughput (measured in tok/s/gpu), meaning the GPU is kept fully utilized, is best obtained by serving many clients at once: LLM inference is dominated by matrix multiplication, which benefits from batching many requests together. However, serving many requests at once reduces how much compute the GPU can dedicate to any single one, so getting faster output for an individual user (say, in a chatbot conversation) means raising interactivity (measured in tok/s/user) at the expense of throughput. If you've ever seen ChatGPT respond as if it had a bad stutter, you've seen what happens when throughput is prioritized too heavily over interactivity.
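To make the tradeoff concrete, here is a rough toy model (not part of InferenceMax; the fixed and per-request latency figures are invented assumptions) showing how batching raises aggregate tok/s/gpu while lowering each user's tok/s/user:

```python
# Toy model of the throughput-vs-interactivity tradeoff under batching.
# All numbers are illustrative assumptions, not InferenceMax measurements.

def decode_step_latency_ms(batch_size: int) -> float:
    """Assumed latency of one decode step: a fixed overhead (weight streaming,
    kernel launches) plus a small per-request cost, reflecting that batched
    matrix multiplies amortize the fixed work across requests."""
    fixed_ms = 20.0        # assumed per-step overhead
    per_request_ms = 0.5   # assumed extra compute per batched request
    return fixed_ms + per_request_ms * batch_size

for batch in (1, 8, 32, 128):
    step_ms = decode_step_latency_ms(batch)
    tok_s_user = 1000.0 / step_ms   # interactivity: tokens/s seen by one user
    tok_s_gpu = tok_s_user * batch  # throughput: tokens/s across the whole GPU
    print(f"batch={batch:4d}  tok/s/user={tok_s_user:6.1f}  tok/s/gpu={tok_s_gpu:7.1f}")
```

Larger batches push tok/s/gpu up while tok/s/user falls, which is exactly the tension the benchmark sweeps across.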

As in any Goldilocks-type scenario, there's an equilibrium between those two measures for a general-purpose setup. The ideal configurations sit on the Pareto frontier curve, the outer boundary of a graph plotting throughput against interactivity, handily illustrated by the diagram below. And since GPUs are ultimately judged on a dollars-per-hour basis, whether that's the amortized purchase price plus power consumption or a rental rate, the best GPU for any given scenario is not necessarily the fastest one; it's the one that's most efficient.

The Pareto Frontier Curve (Image credit: SemiAnalysis – InferenceMax)
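For readers who want to see what sitting on the frontier means in practice, here is a minimal sketch (not InferenceMax's actual tooling; the configurations are made-up sample points) that extracts the Pareto frontier from a set of (interactivity, throughput) measurements:

```python
# Sketch: extract the Pareto frontier from (interactivity, throughput) points.
# A configuration is on the frontier if no other configuration matches or
# beats it on both axes. The sample points below are invented.

def pareto_frontier(points):
    """Return the points not dominated by any other point in the set."""
    frontier = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# (tok/s/user, tok/s/gpu) for a handful of hypothetical serving configurations
configs = [(5, 12000), (20, 9000), (50, 4000), (20, 7000), (100, 1000), (50, 3500)]
print(pareto_frontier(configs))  # [(5, 12000), (20, 9000), (50, 4000), (100, 1000)]
```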

InferenceMax notes that high-interactivity serving is pricier than high-throughput serving, although it can also be more profitable. The one true measure for service providers, then, is total cost of ownership (TCO), expressed in dollars per million tokens. InferenceMax attempts to estimate this figure for various scenarios, including buying and owning GPUs versus renting them.
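As a back-of-the-envelope illustration (the hourly rate and throughput figures below are assumptions, not InferenceMax results), the conversion from an hourly GPU cost and a measured throughput to dollars per million tokens looks like this:

```python
# Sketch: converting GPU hourly cost and measured throughput into a cost
# per million output tokens. All numbers are placeholder assumptions.

def cost_per_million_tokens(gpu_hourly_cost_usd: float,
                            throughput_tok_s_gpu: float) -> float:
    """Dollars spent per million tokens generated by one GPU."""
    tokens_per_hour = throughput_tok_s_gpu * 3600
    return gpu_hourly_cost_usd / tokens_per_hour * 1_000_000

# Example: a rented GPU at an assumed $2.50/hour, at different operating points
for tput in (1_000, 5_000, 10_000):  # tok/s/gpu
    print(f"{tput:6d} tok/s/gpu -> ${cost_per_million_tokens(2.50, tput):.3f} per 1M tokens")
```

Higher throughput at the same hourly cost directly lowers the dollars-per-million-tokens figure, which is why the efficiency framing above matters as much as raw speed.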
