Benchmarking AI Models

Inference Engine Performance Comparison

Test Specification

Prompt: Write 3 sentences about important historical facts in 20th century.
Max Tokens: 500
For each concurrency level, the same test is launched 5 times and average values are calculated. Pause for 10 seconds every iteration.

Hardware Setup

Two NVIDIA RTX 5060 Ti 16GB
AMD Ryzen 9 5950X 16-Core Processor
PCIe gen4
128GB non-ECC RAM (2667 MT/s)

AI Model Setup

llama.cpp

Launching unsloth/gpt-oss-20b-GGUF:Q4_K_M.

docker container run --gpus all --rm \
        -p 9999:9999 \
        -v /opt/HuggingFace/mymodels/unsloth/gpt-oss-20b-gguf:/model \
        ghcr.io/ggml-org/llama.cpp:full-cuda12 --server -m  /model/gpt-oss-20b-Q4_K_M.gguf \
        --n-gpu-layers all \
        --tensor-split 1,1 \
        --flash-attn on \
        --host 0.0.0.0 \
        --port 9999
        -c 65536 \
        -b 2048 \
        -ub 512 \
        -np 100 \
        -cb \
        -kvu \
        --split-mode layer \
        --api-key "llama-myAPIKey" \
        --no-mmap \
        --cache-type-k q8_0 \
        --cache-type-v q8_0

Launching google/gemma-4-26B-A4B-it-qat-q4_0-gguf.

/root/llama.cpp/build/bin/llama-server \
        -hf google/gemma-4-26B-A4B-it-qat-q4_0-gguf \
        --alias "llama-gemma-4-26B-it-qat-q4_0" \
        --no-warmup \
        --host 0.0.0.0 \
        --port 9999 \
        --n-gpu-layers all \
        --tensor-split 1,1 \
        --flash-attn on \
        -c 262144 \
        -b 2048 \
        -ub 512 \
        -np 50 \
        -cb \
        -kvu \
        --split-mode layer \
        --api-key "llama-myAPIKey" \
        --device CUDA0,CUDA1 \
        --no-mmap \
        --cache-type-k q8_0 \
        --cache-type-v q8_0

vLLM

Launching openai/gpt-oss-20b

docker run --rm --gpus all \
        -v /opt/HuggingFace/mymodels/openai/gpt-oss-20b:/model \
        --env "OMP_NUM_THREADS=8" \
        --env "VLLM_CPU_OMP_THREADS_BIND=0-7" \
        --env "HF_HUB_OFFLINE=1" \
        -p 8000:8000 \
        --ipc=host \
        --name vllm-gpt-oss-20b \
        docker.io/vllm/vllm-openai:latest /model \
        --served-model-name gpt-oss-20b \
        --gpu-memory-utilization 0.8 \
        --max-model-len 65536 \
        --tensor-parallel-size 2 \
        --enable-prompt-tokens-details \
        --api-key vllm-myAPIKey \
        --kv_cache_dtype fp8
#       --enforce-eager

Launching gemma-4-26b-a4b-it

docker run --rm  --gpus all \
        -v /root/vLLM/HuggingFace:/root/.cache/huggingface \
        --env "HF_TOKEN=hf_yourAPIKey" \
        -p 8000:8000 \
        --ipc=host \
        docker.io/vllm/vllm-openai:latest \
        cyankiwi/gemma-4-26B-A4B-it-qat-AWQ-INT4 \
        --served-model-name gemma-4-26b-it-qat-awq-int4 \
        --gpu-memory-utilization 0.8 \
        --max-model-len 65536 \
        --enable-auto-tool-choice \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4  \
        --tensor-parallel-size 2 \
        --enable-prompt-tokens-details \
        --api-key vllm-myAPIKey \
        --kv_cache_dtype fp8 \
        --chat-template examples/tool_chat_template_gemma4.jinja 
        #--enforce-eager

SGLANG

Launching openai/gpt-oss-20b

docker container run \
        --rm \
        -p 30000:30000 \
        --gpus all \
        --shm-size 32g \
        -v /opt/HuggingFace/mymodels/openai/gpt-oss-20b:/model \
        --ipc=host \
        --env HF_HUB_OFFLINE=1 \
        lmsysorg/sglang:v0.5.12-cu130 python3 \
        -m sglang.launch_server \
        --model-path  /model \
        --served-model-name gpt-oss-20b \
        --kv-cache-dtype fp8_e4m3 \
         --context-length 65536 \
        --host 0.0.0.0 \
        --port 30000 \
        --tensor-parallel-size 2 \
        --api-key sglang-myAPIKey \
        --mem-fraction-static 0.8 \
        --reasoning-parser gpt-oss

Ollama

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/opt/ollama/bin/ollama serve
Environment="OLLAMA_NUM_PARALLEL=10"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MODELS=/opt/ollama/Models"
Environment="HOME=/root"
Restart=always
RestartSec=10
Environment="PATH=$PATH"

[Install]
WantedBy=multi-user.target

Benchmark Results (gpt-oss:20b)

gpt-oss:20b Performance Summary

Screenshot From 2026-06-23 10-42-47.png

Summary of findings

The clear winner is vLLM with CUDA Graphs enabled. It delivers the best result across every tested concurrency level for all three measured categories: avg total wall time, avg completion time, and avg response tokens/sec.

Overall ranking by average performance

Engine	Avg Wall Time ↓	Avg Completion Time ↓	Avg Response Tokens/s ↑	50-Client Wall Time	50-Client Tokens/s
vLLM, CG: yes	5.89s	3.22s	58.29	9.03s	24.31
SGLANG	9.27s	5.90s	39.97	14.94s	13.52
vLLM, CG: no	11.04s	4.88s	29.86	13.29s	19.82
llama.cpp	16.31s	8.28s	36.01	43.10s	4.75
OLLAMA	20.77s	11.09s	27.09	47.46s	5.03

Key takeaways

vLLM with CUDA Graphs enabled is the best production choice. It remains consistently fastest from 1 to 50 concurrent clients. At 50 clients, it completes requests in 9.03s wall time, compared with 13.29s for vLLM without CUDA Graphs, 14.94s for SGLANG, 43.10s for llama.cpp, and 47.46s for OLLAMA.

CUDA Graphs make a major difference for vLLM. Compared with vLLM without CUDA Graphs, enabling CG reduces average wall time from 11.04s to 5.89s and nearly doubles average response throughput from 29.86 to 58.29 tokens/s.

SGLANG is the strongest non-vLLM competitor overall. It performs well on wall time, especially at medium and high concurrency, and is generally much better than llama.cpp and OLLAMA under load. However, it still trails vLLM with CUDA Graphs across the board.

llama.cpp is competitive only at very low concurrency. At 1 client, llama.cpp is almost tied with vLLM CG:yes on response throughput: 123.66 vs 124.22 tokens/s. But it degrades sharply as concurrency increases, falling to 4.75 tokens/s at 50 clients.

OLLAMA is the weakest performer in this test. It has the highest average wall time and completion time, and its throughput drops heavily as concurrency increases. At 50 clients, it reaches only 5.03 tokens/s.

Recommendation

Use vLLM with CUDA Graphs enabled as the default inference backend for this workload. It provides the best latency, completion speed, and throughput, and it scales far better under concurrent load. Use SGLANG as the most credible alternative. Avoid llama.cpp and OLLAMA for high-concurrency serving unless there are deployment constraints that matter more than throughput and latency.

Benchmark Results (gemma-4-26b-qat)

Conclusions

gemma-4-26b-qat

vLLM inference engine is able to provide consistent high throughput and low latency output especially with CUDA graph buffers enabled.
vLLM's peak response token per second isn't as high as with gpt-oss:20b; but still respectable avg 75.75 response tokens per second for single concurrent client.
The major limiting factor in my setup is lack of sufficient CUDA cores and limited memory bandwidth of RTX 5060 Ti GPU.
Like in the previous test, llama.cpp is able to provide fairly decent performance for up to 3-5 concurrent clients.
ollama is the worst performer only really suitable for single client deployment; and even with just single client, ollama peaks at avg 58.94 response tokens per second.

Winner: vLLM