Skip to main content

Benchmarking AI Models

Inference Engine Performance Comparison

Test Specification

  • Prompt: Write 3 sentences about important historical facts in 20th century.
  • Max Tokens: 500
  • For each concurrency level, the same test is launched 5 times and average values are calculated. Pause for 10 seconds every iteration.

Hardware Setup

  • Two NVIDIA RTX 5060 Ti 16GB
  • AMD Ryzen 9 5950X 16-Core Processor
  • PCIe gen4
  • 128GB non-ECC RAM (2667 MT/s)

AI Model Setup

llama.cpp

Launching  unsloth/gpt-oss-20b-GGUF:Q4_K_M.

 

docker container run --gpus all --rm \
        -p 9999:9999 \
        -v /opt/HuggingFace/mymodels/unsloth/gpt-oss-20b-gguf:/model \
        ghcr.io/ggml-org/llama.cpp:full-cuda12 --server -m  /model/gpt-oss-20b-Q4_K_M.gguf \
        --n-gpu-layers all \
        --tensor-split 1,1 \
        --flash-attn on \
        --host 0.0.0.0 \
        --port 9999
        -c 65536 \
        -b 2048 \
        -ub 512 \
        -np 100 \
        -cb \
        -kvu \
        --split-mode layer \
        --api-key "llama-myAPIKey" \
        --no-mmap \
        --cache-type-k q8_0 \
        --cache-type-v q8_0

 

 

Launching google/gemma-4-26B-A4B-it-qat-q4_0-gguf.

 

 

/root/llama.cpp/build/bin/llama-server \
        -hf google/gemma-4-26B-A4B-it-qat-q4_0-gguf \
        --alias "llama-gemma-4-26B-it-qat-q4_0" \
        --no-warmup \
        --host 0.0.0.0 \
        --port 9999 \
        --n-gpu-layers all \
        --tensor-split 1,1 \
        --flash-attn on \
        -c 262144 \
        -b 2048 \
        -ub 512 \
        -np 50 \
        -cb \
        -kvu \
        --split-mode layer \
        --api-key "llama-myAPIKey" \
        --device CUDA0,CUDA1 \
        --no-mmap \
        --cache-type-k q8_0 \
        --cache-type-v q8_0

 

 

 

vLLM

 

Launching openai/gpt-oss-20b

 

docker run --rm --gpus all \
        -v /opt/HuggingFace/mymodels/openai/gpt-oss-20b:/model \
        --env "OMP_NUM_THREADS=8" \
        --env "VLLM_CPU_OMP_THREADS_BIND=0-7" \
        --env "HF_HUB_OFFLINE=1" \
        -p 8000:8000 \
        --ipc=host \
        --name vllm-gpt-oss-20b \
        docker.io/vllm/vllm-openai:latest /model \
        --served-model-name gpt-oss-20b \
        --gpu-memory-utilization 0.8 \
        --max-model-len 65536 \
        --tensor-parallel-size 2 \
        --enable-prompt-tokens-details \
        --api-key vllm-myAPIKey \
        --kv_cache_dtype fp8
#       --enforce-eager

 

 Launching gemma-4-26b-a4b-it

 

docker run --rm  --gpus all \
        -v /root/vLLM/HuggingFace:/root/.cache/huggingface \
        --env "HF_TOKEN=hf_yourAPIKey" \
        -p 8000:8000 \
        --ipc=host \
        docker.io/vllm/vllm-openai:latest \
        cyankiwi/gemma-4-26B-A4B-it-qat-AWQ-INT4 \
        --served-model-name gemma-4-26b-it-qat-awq-int4 \
        --gpu-memory-utilization 0.8 \
        --max-model-len 65536 \
        --enable-auto-tool-choice \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4  \
        --tensor-parallel-size 2 \
        --enable-prompt-tokens-details \
        --api-key vllm-myAPIKey \
        --kv_cache_dtype fp8 \
        --chat-template examples/tool_chat_template_gemma4.jinja 
        #--enforce-eager 

 

 

 

SGLANG

 Launching openai/gpt-oss-20b

 

docker container run \
        --rm \
        -p 30000:30000 \
        --gpus all \
        --shm-size 32g \
        -v /opt/HuggingFace/mymodels/openai/gpt-oss-20b:/model \
        --ipc=host \
        --env HF_HUB_OFFLINE=1 \
        lmsysorg/sglang:v0.5.12-cu130 python3 \
        -m sglang.launch_server \
        --model-path  /model \
        --served-model-name gpt-oss-20b \
        --kv-cache-dtype fp8_e4m3 \
         --context-length 65536 \
        --host 0.0.0.0 \
        --port 30000 \
        --tensor-parallel-size 2 \
        --api-key sglang-myAPIKey \
        --mem-fraction-static 0.8 \
        --reasoning-parser gpt-oss

 

 

 

Ollama

 

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/opt/ollama/bin/ollama serve
Environment="OLLAMA_NUM_PARALLEL=10"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MODELS=/opt/ollama/Models"
Environment="HOME=/root"
Restart=always
RestartSec=10
Environment="PATH=$PATH"

[Install]
WantedBy=multi-user.target

 

 

 

 

Benchmark Results (gpt-oss:20b)

image.png

image.png

image.png

image.png

image.png

gpt-oss:20b Performance Summary


Screenshot From 2026-06-23 10-42-47.png

Summary of findings

The clear winner is vLLM with CUDA Graphs enabled. It delivers the best result across every tested concurrency level for all three measured categories: avg total wall time, avg completion time, and avg response tokens/sec.

Overall ranking by average performance

Engine Avg Wall Time ↓ Avg Completion Time ↓ Avg Response Tokens/s ↑ 50-Client Wall Time 50-Client Tokens/s
vLLM, CG: yes 5.89s 3.22s 58.29 9.03s 24.31
SGLANG 9.27s 5.90s 39.97 14.94s 13.52
vLLM, CG: no 11.04s 4.88s 29.86 13.29s 19.82
llama.cpp 16.31s 8.28s 36.01 43.10s 4.75
OLLAMA 20.77s 11.09s 27.09 47.46s 5.03

Key takeaways

vLLM with CUDA Graphs enabled is the best production choice. It remains consistently fastest from 1 to 50 concurrent clients. At 50 clients, it completes requests in 9.03s wall time, compared with 13.29s for vLLM without CUDA Graphs, 14.94s for SGLANG, 43.10s for llama.cpp, and 47.46s for OLLAMA.

CUDA Graphs make a major difference for vLLM. Compared with vLLM without CUDA Graphs, enabling CG reduces average wall time from 11.04s to 5.89s and nearly doubles average response throughput from 29.86 to 58.29 tokens/s.

SGLANG is the strongest non-vLLM competitor overall. It performs well on wall time, especially at medium and high concurrency, and is generally much better than llama.cpp and OLLAMA under load. However, it still trails vLLM with CUDA Graphs across the board.

llama.cpp is competitive only at very low concurrency. At 1 client, llama.cpp is almost tied with vLLM CG:yes on response throughput: 123.66 vs 124.22 tokens/s. But it degrades sharply as concurrency increases, falling to 4.75 tokens/s at 50 clients.

OLLAMA is the weakest performer in this test. It has the highest average wall time and completion time, and its throughput drops heavily as concurrency increases. At 50 clients, it reaches only 5.03 tokens/s.

Recommendation

Use vLLM with CUDA Graphs enabled as the default inference backend for this workload. It provides the best latency, completion speed, and throughput, and it scales far better under concurrent load. Use SGLANG as the most credible alternative. Avoid llama.cpp and OLLAMA for high-concurrency serving unless there are deployment constraints that matter more than throughput and latency.


Benchmark Results (gemma-4-26b-qat)

image.png

image.png

image.png

image.png

image.png

Conclusions

gemma-4-26b-qat
  • vLLM inference engine is able to provide consistent high throughput and low latency output especially with CUDA graph buffers enabled.
  • vLLM's peak response token per second isn't as high as with gpt-oss:20b; but still respectable avg 75.75 response tokens per second for single concurrent client.
  • The major limiting factor in my setup is lack of sufficient CUDA cores and limited memory bandwidth of RTX 5060 Ti GPU.
  • Like in the previous test, llama.cpp is able to provide fairly decent performance for up to 3-5 concurrent clients.
  • ollama is the worst performer only really suitable for single client deployment; and even with just single client, ollama peaks at avg 58.94 response tokens per second.

Winner: vLLM