e42.uk Circle Device

 

Quick Reference

llama.cpp on Strix Halo

llama.cpp on AMD Ryzen AI Max+ 395 w/Radeon 8060S

Configure the iGPU memory in Advanced to be Auto and iGPU Memory Size to be 0.5GB. This will allow the ROCm software to manage the memory split.

Using Artix Linux with OpenRC with the rocm-hip-sdk installed.

pacman -S rocm-hip-sdk

Get llama.cpp from github:

git clone --depth=1 https://github.com/ggml-org/llama.cpp

Build with cmake:

cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j$(nproc)

Listing the ROCm devices:

# llama-cli --list-devices
Available devices:
  ROCm0: AMD Radeon 8060S Graphics (64042 MiB, 77959 MiB free)

Run with a model:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-server --host 0.0.0.0 \
    --port 8080 \
    --flash-attn on \
    --cache-prompt \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --gpu-layers 99 \
    --ctx-size 32768 \
    --mmproj ../models/Huihui-Qwen3.6-35B-A3B-abliterated-mmproj-BF16.gguf \
    --model ../models/Huihui-Qwen3.6-35B-A3B-abliterated-Q8_0.gguf

Running Ministral-3-8B-Reasoning-2512-Q8_0 produces:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-cli --jinja --gpu-layers 99 \
    --ctx-size 32768 --model ../models/Ministral-3-8B-Reasoning-2512-Q8_0.gguf
[ Prompt: 836.3 t/s | Generation: 24.5 t/s ]

Pretty quick!

Running llama.cpp with Radeon RX 9070 XT (gfx1201)

This section will focus on compiling llama.cpp for a system that contains a AMD Ryzen 9 9950X. Which has an on-die GPU (gfx1036) and the discrete GPU, gfx1201.

cmake -S . -B build -DGGML_HIP=ON -DGGML_RPC=ON -DAMDGPU_TARGETS=gfx1201,gfx1036 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j$(nproc)

The process works much better on the RX 9070 XT (as expected). To see the list of devices that can run the model:

build/bin/llama-cli --list-devices

This should display something like:

Available devices:
  ROCm0: AMD Radeon RX 9070 (16304 MiB, 15770 MiB free)
  ROCm1: AMD Radeon Graphics (15617 MiB, 30068 MiB free)

The machine here has 32GiB of system RAM... strange numbers but whatever!

Running the Ministral-3-8B-Reasoning-2512-Q8_0.gguf model like this:

llama-cli --device ROCm0 --n-gpu-layers 99 --ctx-size 32768 \
    --model Ministral-3-8B-Reasoning-2512-Q8_0.gguf

Produces a quick response:

[ Prompt: 1416.6 t/s | Generation: 61.4 t/s ]

Running a model with the llama.cpp RPC Server

Add the compile time switch:

-DGGML_RPC=ON

Then see the page in References for more detail ;-)

TODO: Add more detail here

References

Running an MCP Server for File System Access

On Artix Linux and using podman.

pacman -S podman crun

Update /etc/containers/registries.conf so that docker.io is searched when an unqualified image is present in Dockerfile or specified on the command line.

unqualified-search-registries = ["docker.io"]

TODO: I should prefer my own solution than this random implementation. openaiclient

References

Quick Links: Techie Stuff | General | Personal | Quick Reference