
Hugging Face model acronyms (and what they mean for GPU VRAM) : Last update: 01/01/2026
Musings. AKA: The blog of a person who keeps downloading models and then acting surprised when VRAM evaporates.
Content Warning: You NEED a fiber connection to do model development, as you will be downloading multiple gigabytes of models regularly and often, and it will YOLO your bandwidth.
Tags: AI, hardware, QuickAndDirty, huggingface
The 30-second takeaway
If you’re deploying on a consumer NVIDIA GPU, “does it fit?” is mostly:
- Weights (the model files)
- KV cache (Key–Value cache) (memory used while generating tokens)
- A bit of overhead (runtime + buffers)
On a 12 GB GPU (e.g., RTX 3080 Ti), the sweet spot is usually:
- 7B–14B models in 4-bit quantization
- Context length around 4k–8k to start (long context can blow up KV cache)
- Use a runtime that matches the model format (GGUF ↔ llama.cpp, GPTQ/AWQ ↔ CUDA/PyTorch engines, etc.)
VRAM 101 (what uses memory during inference)
Inference (running a trained model to generate tokens) uses GPU VRAM mainly for:
- Weights: the learned parameters (that “7B / 14B / 32B” number)
- KV cache (Key–Value cache): per-token attention memory that grows with:
- context length (how many tokens you keep)
- batch size / concurrency (how many requests at once)
- Workspace / overhead: temporary buffers, kernels, allocator fragmentation
Quick weight sizing rule
Weight memory ≈ parameters × bytes per parameter
Common weight formats:
- FP16 (16-bit floating point) ≈ 2 bytes/param
- BF16 (bfloat16) ≈ 2 bytes/param
- INT8 (8-bit integer) ≈ 1 byte/param
- 4-bit quantization ≈ 0.5–0.7 bytes/param (depends on scheme + overhead)
So a rough mental model:
- 7B @ FP16 ≈ 14 GB (too big for a 12 GB card before KV cache)
- 7B @ 4-bit ≈ ~4–6 GB (much more realistic)
- 14B @ 4-bit ≈ ~8–10 GB (tight but often workable with sane context)
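To make that mental model concrete, here is a tiny back-of-envelope helper. It is my own sketch, not a library call, and the 4-bit figure of ~0.6 bytes/param is an assumption sitting in the middle of the 0.5–0.7 range; real downloads add metadata, embeddings and quant overhead, so treat the output as ballpark.

# Rough weight sizing: parameters x bytes per parameter (rule of thumb only).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,   # same for bf16
    "int8": 1.0,
    "4bit": 0.6,   # 0.5-0.7 in practice, depending on the scheme
}

def weight_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight memory in (decimal) GB, matching the rule above."""
    bytes_total = params_billions * 1e9 * BYTES_PER_PARAM[fmt]
    return bytes_total / 1e9

for name, params_b, fmt in [("7B fp16", 7, "fp16"), ("7B 4-bit", 7, "4bit"), ("14B 4-bit", 14, "4bit")]:
    print(f"{name}: ~{weight_gb(params_b, fmt):.1f} GB")   # ~14.0 / ~4.2 / ~8.4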
How to read a Hugging Face model page (fast)
When you open a Hugging Face model card, scan in this order:
1. Params: 7B, 14B, 32B (billions of parameters)
2. Format: safetensors, GGUF, GPTQ, AWQ, etc.
3. Precision / quantization: BF16, FP16, 8-bit, 4-bit, Q4_K_M…
4. Context length: 4k, 8k, 32k, 128k (sometimes listed as ctx)
If the page doesn’t clearly say #2–#4… assume you’ll learn the hard way.
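If you'd rather not squint at the card, you can also ask the Hub directly. A minimal sketch using the huggingface_hub client follows; the repo id is only an example, and file sizes on disk roughly track weight memory for the same precision.

# Sanity-check a repo from Python: list weight files and sum their sizes.
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("Qwen/Qwen2.5-7B-Instruct", files_metadata=True)  # example repo

weight_bytes = sum(
    (f.size or 0)
    for f in info.siblings
    if f.rfilename.endswith((".safetensors", ".gguf", ".bin"))
)
print(f"Weight files in the repo: ~{weight_bytes / 1e9:.1f} GB")
print("Some files:", [f.rfilename for f in info.siblings][:8])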
Glossary: acronyms you’ll see all the time (and why you should care)
File formats & packaging
- GGUF (GGML Unified Format)
  A single-file format used by llama.cpp for fast local inference. Often comes with quant labels like Q4_K_M, Q5_K_M, Q8_0.
  Why you care: GGUF is great when VRAM is tight or you want easy CPU/GPU offload.
- SafeTensors (safetensors file format)
  A safe, fast weight file format used widely with PyTorch/Transformers.
  Why you care: It’s the “normal” modern weight format on Hugging Face.
- ckpt (checkpoint)
  A saved snapshot of model weights (more common in training contexts).
  Why you care: Sometimes used informally for weight dumps; not always optimized for inference.
Precision / dtypes (how weights are stored)
- FP16 (16-bit floating point)
  Common for GPU inference; decent speed, moderate memory.
  VRAM impact: ~2 bytes/param.
- BF16 (bfloat16)
  Another 16-bit float with a wider exponent range than FP16.
  VRAM impact: ~2 bytes/param.
  Why you care: Often more numerically stable; many GPUs support it well.
- FP32 (32-bit floating point)
  Higher precision, high memory use.
  VRAM impact: ~4 bytes/param (rarely worth it for local inference).
- TF32 (TensorFloat-32)
  NVIDIA’s accelerated compute mode (mostly a compute format rather than a stored weight format).
  Why you care: You might see it mentioned in performance notes, but it’s not what your model download “is”.
Quantization (making weights smaller)
Quantization compresses weights so bigger models fit in your VRAM.
- INT8 (8-bit integer quantization)
  VRAM impact: ~1 byte/param.
  Why you care: Often good quality, but still larger than 4-bit.
- 4-bit quantization
  VRAM impact: ~0.5–0.7 bytes/param.
  Why you care: The main reason 7B–14B models fit nicely on 12 GB.
Specific 4-bit families you’ll see:
- GPTQ (Generalized Post-Training Quantization)
  A post-training quant method commonly used for NVIDIA GPU inference.
  Why you care: Many “GPTQ 4-bit” repos target fast CUDA inference.
- AWQ (Activation-Aware Weight Quantization)
  Quantization tuned to preserve accuracy by considering activations.
  Why you care: Often strong quality/speed tradeoffs at 4-bit.
- bitsandbytes (bnb, BitsAndBytes library)
  A popular library that enables loading models in 8-bit or 4-bit in PyTorch.
  Why you care: You’ll see “load_in_4bit” a lot (minimal loading sketch after this list).
- NF4 (NormalFloat 4)
  A 4-bit datatype used by bitsandbytes.
  Why you care: Often a “default good choice” for 4-bit loads.
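The loading sketch mentioned above: a minimal NF4 4-bit load with Transformers + bitsandbytes. The model id is just an example; swap in whichever 7B–14B repo you actually want.

# Minimal 4-bit (NF4) load via Transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example repo

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat 4
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                    # put what fits on the GPU
)

inputs = tokenizer("Explain KV cache in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))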
GGUF quant labels (llama.cpp world):
- Q4_K_M / Q5_K_M / Q8_0 (GGUF quantization variants)
Roughly: Q4 = smaller/faster, Q5/Q6 = middle ground, Q8 = larger/near-full quality.
Why you care: These labels largely determine VRAM/RAM footprint and output quality.
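And if you went the GGUF route instead, here is a minimal sketch with the llama-cpp-python bindings; the path and quant file name are placeholders for whatever Q4_K_M/Q5_K_M file you downloaded.

# Run a GGUF quant locally via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # start small; bigger ctx = bigger KV cache
    n_gpu_layers=-1,   # offload all layers to the GPU (reduce if you OOM)
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])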
Runtime-specific:
- EXL2 (ExLlamaV2 quant format)
  A quant format tuned for fast local NVIDIA inference.
  Why you care: Great speed; you’ll see values like 4.0 bpw.
- bpw (bits per weight)
  Average bits used per weight (e.g., 4.0 bpw ≈ 4-bit-ish).
  Why you care: Lower bpw fits more easily, but may reduce quality.
Context & generation memory
- ctx (context length)
  Maximum tokens the model can “see” at once (e.g., 4096 / 8192 / 32768).
  Why you care: Bigger ctx → bigger KV cache → more VRAM use.
- KV cache (Key–Value cache)
  Memory stored for attention so the model can generate efficiently.
  Why you care: This is the silent VRAM killer at long context or high concurrency; see the sizing sketch after this list.
- batch size / concurrency
  How many tokens/requests you process in parallel.
  Why you care: More concurrency = more KV cache and overhead.
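The sizing sketch promised above: a back-of-envelope KV cache estimate. It assumes a plain dense transformer with an FP16 cache and no cache quantization; GQA models (fewer KV heads than attention heads) come out smaller.

# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x ctx x batch x bytes/elem
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, batch=1, bytes_per_elem=2):
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * batch * bytes_per_elem
    return total_bytes / 1024**3

# Example: a Llama-2-7B-ish shape (32 layers, 32 KV heads, head_dim 128)
print(f"4k ctx:  ~{kv_cache_gb(32, 32, 128, 4096):.1f} GB")    # ~2 GB
print(f"32k ctx: ~{kv_cache_gb(32, 32, 128, 32768):.1f} GB")   # ~16 GB -- the silent killer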
Model scale & architecture
- B (billions of parameters)
  7B, 14B, etc.
  Why you care: Parameter count strongly correlates with weight memory and compute cost.
- MoE (Mixture of Experts)
  A model architecture where only some “experts” run per token.
  Why you care: Compute per token can be lower, but you still often need to store lots of total weights, so memory can still be big (worked example after this list).
- A3B / “activated 3B” (activated parameters)
  A MoE-style label indicating how many parameters are used per token.
  Why you care: “Activated params” ≠ “memory required.” You usually pay memory for the total experts stored.
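The worked example mentioned above, using made-up numbers for a hypothetical "30B total / 3B activated" MoE model at 4-bit:

# Why "activated 3B" is not "3B of VRAM": you still have to store every expert.
total_params_b = 30.0        # total parameters across all experts (hypothetical)
activated_params_b = 3.0     # parameters actually doing work per token
bytes_per_param_4bit = 0.6

weights_gb = total_params_b * 1e9 * bytes_per_param_4bit / 1e9
print(f"4-bit weights you must store:  ~{weights_gb:.0f} GB")          # ~18 GB
print(f"params doing compute per token: ~{activated_params_b:.0f}B")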
Runtimes (what actually runs the model)
- llama.cpp
  A popular local inference runtime (CPU-first, but supports GPU offload).
  Why you care: Best match for GGUF models; very practical for 12 GB GPUs.
- Transformers (Hugging Face Transformers library)
  The common Python runtime for model loading and inference.
  Why you care: Works well with safetensors, and quant via bitsandbytes.
- vLLM (a high-throughput serving engine)
  Optimized for serving (batching + smarter KV management); there is a minimal offline-API sketch after this list.
  Why you care: Good when you need an API and multiple concurrent requests.
- CUDA (Compute Unified Device Architecture)
  NVIDIA’s GPU compute stack.
  Why you care: Most high-performance local inference on RTX cards is CUDA-backed.
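The vLLM offline-API sketch mentioned above. The model id and settings are examples only; a 4-bit AWQ build is a sensible default on 12 GB, and max_model_len keeps the KV cache budget sane.

# vLLM's offline Python API (the server variant appears later in the compose example).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",   # example 4-bit AWQ repo
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain GGUF in one sentence."], params)
print(outputs[0].outputs[0].text)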
What’s “valid” on a 12 GB GPU (practical targets)
Recommended (“won’t ruin your evening”)
- 7B–14B @ 4-bit
  - GGUF Q4_K_M / Q5_K_M is a common “good default”
  - GPTQ/AWQ 4-bit builds for CUDA runtimes are also solid
- Context length: start at 4k–8k
- Concurrency: start with 1 request / small batch until you know your KV cache budget
Sometimes works (depends on context/offload/overhead)
- 20B-ish @ 4-bit with careful settings and/or CPU offload
- 14B @ INT8 (larger weights than 4-bit; can crowd out KV cache)
Usually painful on this setup
- 30B dense models (unless you accept heavy CPU offload and slowdowns)
- Huge context (32k/128k) plus bigger models (KV cache can dominate VRAM)
“If you see this filename…” (decode at a glance)
- model.Q4_K_M.gguf
  GGUF (GGML Unified Format) + Q4 (4-bit quant), K-variant, M profile.
  Expect smaller memory use; decent quality.
- model-awq-4bit
  AWQ (Activation-Aware Weight Quantization), 4-bit.
  Expect CUDA-friendly inference with the right runtime.
- model-gptq-4bit-128g
  GPTQ (Generalized Post-Training Quantization), 4-bit; extra suffixes usually describe grouping/packing (128g = group size 128).
  Expect a GPU-targeted quant build.
- pytorch_model.safetensors
  SafeTensors (safetensors file format) weights.
  Expect “standard” Transformers loading; quant depends on how you load it.
Picking a format (quick recommendations)
If you want the easiest path on a 12 GB RTX card:
- Choose GGUF (GGML Unified Format) + llama.cpp if you want:
  - simple local runs
  - easy CPU/GPU offload
  - lots of pre-quantized options
- Choose AWQ (Activation-Aware Weight Quantization) or GPTQ (Generalized Post-Training Quantization) if you want:
  - GPU-first inference in a Python/CUDA stack
  - potential speed improvements with the right runtime
- Choose bitsandbytes (bnb, BitsAndBytes library) if you want:
  - “just load it in 4-bit” convenience inside Transformers
  - minimal file-format wrangling
Common gotchas (the ones that bite first)
- Weights fit but it still OOMs (out of memory)
  That’s usually KV cache (Key–Value cache) + overhead.
- Long context is not “free”
  Increasing context increases memory during generation even if weights don’t change.
- MoE (Mixture of Experts) naming can be misleading
  “Activated parameters” tells you compute per token, not storage required.
- 4-bit ≠ 4-bit
  Different quant schemes trade quality/speed/overhead differently (GGUF Q4 variants, AWQ, GPTQ, NF4, EXL2, etc.).
Cheatsheet: what you should look for on Hugging Face
When browsing, you’re basically hunting for:
- 7B–14B
- 4-bit quant (GGUF Q4/Q5, AWQ 4-bit, GPTQ 4-bit, NF4)
- Context 4k–8k (at least to start)
- A runtime you actually plan to use
And if a model page doesn’t mention quant/format/context clearly… scroll away with confidence.
| “Gen” (common name) | RTX series / architecture | Typical “main” GPUs & VRAM | Tensor Core low-precision support (relevant to LLMs) | Ideal weight formats to use (what you should look for on Hugging Face) |
|---|---|---|---|---|
| 3rd gen | RTX 30 series / Ampere | 3060 12GB, 3070 8GB, 3080 10/12GB, 3090 24GB | Accelerates FP16, BF16, TF32, INT8, INT4 (plus others) (NVIDIA Developer) | Best overall: 4-bit quantized weights (e.g., AWQ (Activation-Aware Weight Quantization) / GPTQ (Generalized Post-Training Quantization) / EXL2 (ExLlamaV2 quant format) / GGUF (GGML Unified Format) Q4/Q5) to fit 7B–14B in 12GB. If you have 24GB: INT8 can be a nice quality bump. |
| 4th gen | RTX 40 series / Ada Lovelace | 4060 8/16GB, 4070 12GB, 4080 16GB, 4090 24GB (NVIDIA) | Still accelerates FP16/BF16/TF32/INT8, and adds FP8 support on 4th-gen Tensor Cores (NVIDIA Images) | Best overall: same as Ampere—4-bit weight quants are still the “fits + fast” default. Nice-to-have: FP8 weights/compute if your inference stack supports it well (more common in serving engines than in random HF repos). (NVIDIA Images) |
| 5th gen | RTX 50 series / Blackwell | 5070 12GB, 5070 Ti 16GB, 5080 16GB, 5090 32GB (NVIDIA) | Supports FP16/BF16/TF32/INT8, FP8 (2nd-gen FP8 Transformer Engine), and adds FP6 + FP4 support (NVIDIA Images) | Best overall today on HF: still 4-bit quants (AWQ/GPTQ/EXL2/GGUF) because they’re widely available. Emerging “ideal” for Blackwell: FP8/FP6/FP4 weight + kernel paths where supported (this depends heavily on the runtime/tooling, not just the GPU). (NVIDIA Images) |
| VRAM | Ideal weight choice | What it enables (roughly) |
|---|---|---|
| 8 GB | 4-bit | 7B comfortably; 13B sometimes tight |
| 12 GB | 4-bit | 7B–14B sweet spot |
| 16 GB | 4-bit (or INT8 for smaller models) | 14B very comfy; INT8 7B with extra headroom |
| 24 GB | INT8 (quality) or 4-bit (bigger models) | FP16/BF16 7B–13B; 4-bit ~30B-ish depending on overhead/context |
| 32 GB | INT8 (quality) or 4-bit (bigger models) | room for larger 4-bit models + more KV cache |
| Runtime (expanded) | What it is best for | Model file formats it commonly uses | Quant types you’ll commonly see (expanded) | GPU/VRAM notes (esp. 12 GB) | Typical setup style |
|---|---|---|---|---|---|
| llama.cpp (C/C++ local inference engine) | Easiest “run locally” + great CPU fallback/offload | GGUF (GGML Unified Format) | GGUF Q4/Q5/Q6/Q8 (GGUF quantization variants like Q4_K_M) | Very practical on 12 GB: can GPU offload some layers and keep rest in RAM; good when VRAM is tight | CLI, desktop UIs, local servers (often simple) |
| Transformers (Hugging Face Transformers library, usually PyTorch) | Python workflows, notebooks, agent/tool integration | safetensors (SafeTensors format) / PyTorch weights | bitsandbytes (bnb, BitsAndBytes library) with NF4 (NormalFloat 4) 4-bit or INT8 (8-bit integer) | 12 GB works well with 4-bit NF4 for 7B–14B; watch KV cache (Key–Value cache) at long context | Python scripts, notebooks, apps |
| vLLM (high-throughput LLM serving runtime) | Serving an API with batching & concurrency | Usually HF safetensors; sometimes specific quant packages | AWQ (Activation-Aware Weight Quantization), GPTQ (Generalized Post-Training Quantization), sometimes bnb (depends on build) | Best when you have multiple users/requests; KV/cache handling is efficient; 12 GB still likes 4-bit | API server (OpenAI-style endpoints common) |
| ExLlamaV2 (NVIDIA-optimized local runtime) | Fast single-user chat on RTX cards | EXL2 (ExLlamaV2 quant format) | EXL2 with bpw (bits per weight) like 4.0 bpw | Often very fast on 12 GB for 7B–14B; less flexible outside supported formats | Local chat UIs / Python wrappers |
| TensorRT-LLM (NVIDIA TensorRT LLM runtime) | Maximum NVIDIA performance, production deployment | Engine builds (compiled artifacts), not “download and go” | Often uses optimized kernels; can use FP16 (16-bit floating point), BF16 (bfloat16), FP8 (8-bit floating point) where supported | Great speed, but higher setup cost; you typically build an engine per GPU/config | Production/serving, more engineering |
| ONNX Runtime (Open Neural Network Exchange runtime) | Portability + some acceleration; enterprise stacks | ONNX (Open Neural Network Exchange) models | Depends on export; can do FP16 or INT8 quantization | | |
Running the Models up that hill
So, you have your model, but how to run it? You have several options.
- Run it straight on your machine using vLLM and Python
  - If it goes awry, you may break your Python libraries and corrupt your system.
- Run it in a virtual environment (python venv)
  - Use uv venv .venv to create a virtual environment and go from there.
- Run it within Docker containers
  - As long as you have the CUDA libraries set up correctly, you should have only slight overhead.
  - Far simpler to spin up/tear down
  - Least risk to your system
- Ollama or vLLM
  - Ollama is a good managed choice, restricted to the Ollama model library
  - vLLM is the choice for running Hugging Face models, as long as you have the Hugging Face CLI installed and your GPU set up correctly.
- CPU or GPU?
  - As mentioned above, some model runtimes will use both CPU and GPU. It's hit and miss and can hang your machine. Try to stick to the GPU, unless you're on a new M5 Mac, which has shared memory.
Docker Composition for a local agent model
name: vllm-mistral-neilhighley
services:
  vllm:
    image: vllm/vllm-openai:v0.12.0
    container_name: vllm-ministral
    ports:
      - "8333:8000"
    # Recommended by vLLM docker docs to avoid PyTorch shared-memory issues
    ipc: host
    # Hugging Face cache (faster restarts; avoids re-downloading)
    volumes:
      - ${HF_HOME:-~/.cache/huggingface}:/root/.cache/huggingface
    environment:
      # Put this in a .env file (see below)
      - HF_TOKEN=${HF_TOKEN}
    # Give the container access to the NVIDIA GPU
    # Works with Docker Compose v2 + NVIDIA Container Toolkit installed
    gpus: all
    # vLLM OpenAI-compatible server arguments
    # Mistral/Ministral model card recommends these flags for vLLM.
    command: >
      mistralai/Ministral-3-8B-Instruct-2512
      --tokenizer_mode mistral
      --config_format mistral
      --load_format mistral
      --enable-auto-tool-choice
      --tool-call-parser mistral
      --dtype auto
      --max-model-len "4096"
      --gpu-memory-utilization "0.3"
    networks:
      - nh-vllm-agent-network
networks:
  nh-vllm-agent-network:
    driver: bridge
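Once the stack is up, the container exposes an OpenAI-compatible API on port 8333 (mapped from 8000 above). Here is a minimal client sketch using the openai Python package; the model name must match the one passed to vLLM in the compose file, and the API key is a dummy value because nothing checks it locally.

# Minimal client for the vLLM OpenAI-compatible server started by the compose file above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8333/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Instruct-2512",  # must match the compose command
    messages=[{"role": "user", "content": "Give me a one-line status check."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)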
Use the docker-compose.yaml file above and increase --gpu-memory-utilization until it hits a wall. CPU memory is a secondary concern, but be careful using :latest for any containers, as it may end up downloading 8 GB on each docker compose up. That is fine if you’re on fiber, but it still takes time. Find the actual tag for the latest image, docker pull it locally, and use that until you need extra features.
Add OpenWebUI or OpenHands to the composition to create front ends or agent workspaces. Use an agent CLI to help you create an MCP server or a Python wrapper to enable RAG and other tools for your model.
IMHO> Taking :latest from Docker repositories is an accident waiting to happen.
So this page is a good starter and should also be helpful for my personal use of these models.
Links
https://artificialanalysis.ai/leaderboards/models
https://huggingface.co/models
https://huggingface.co/docs
https://www.geeksforgeeks.org/artificial-intelligence/large-language-model-llm/
https://yellow.systems/blog/llm-deep-dive
Model running
https://docs.vllm.ai/en/latest
BONUS: Logseq flashcards
Add this as a page to Logseq to help with retaining LLM knowledge.
- # AI & Infrastructure Flashcards (Expanded)
- ## 1. The Acronym Deep Dive
- What does AI stand for in the context of this blog? #card
- Artificial Intelligence: The broad field of creating systems that simulate human intelligence.
- What does ML stand for? #card
- Machine Learning: A subset of AI where systems learn patterns from data without explicit programming.
- What does LLM stand for? #card
- Large Language Model: Neural networks trained on massive text datasets to understand and generate language.
- What does NLP stand for? #card
- Natural Language Processing: The specialized field of AI dealing with human language interaction.
- What does RAG stand for? #card
- Retrieval-Augmented Generation: Giving an LLM a "search engine" to look up facts before answering.
- What does GGUF stand for? #card
- GPT-Generated Unified Format: The successor to GGML, used for running models efficiently on consumer hardware.
- What does HF stand for? #card
- Hugging Face: The platform used as the central hub for models, datasets, and AI collaboration.
- What does TGI stand for? #card
- Text Generation Inference: A high-performance toolkit developed by Hugging Face for deploying LLMs.
- What does VRAM stand for? #card
- Video Random Access Memory: The dedicated memory on a GPU that holds the model weights during inference.
- What does CUDA stand for? #card
- Compute Unified Device Architecture: NVIDIA’s parallel computing platform that allows AI software to use the GPU.
- What does LoRA stand for? #card
- Low-Rank Adaptation: A fine-tuning technique that allows adapting large models using very little compute.
- What does QLoRA stand for? #card
- Quantized Low-Rank Adaptation: A method that combines quantization with LoRA to fine-tune models on even smaller GPUs.
- What does GPTQ stand for? #card
- Generalized Post-Training Quantization: A 4-bit quantization method designed to run efficiently on GPUs.
- What does EXL2 stand for? #card
- ExLlamaV2: A high-performance quantization format specifically optimized for extremely fast inference on NVIDIA GPUs.
- What does AWQ stand for? #card
- Activation-aware Weight Quantization: A hardware-friendly quantization format that maintains higher accuracy than GPTQ.
- What does JSON stand for in the context of LLM outputs? #card
- JavaScript Object Notation: The standard data format used when you want an LLM to provide structured, machine-readable data.
- What does REST stand for in AI APIs? #card
- Representational State Transfer: The architectural style used by vLLM and Ollama to provide their web-based API endpoints.
- ## 2. Deployment & Infrastructure
- What is the vLLM URL? #card
- https://github.com/vllm-project/vllm
- What is the Ollama URL? #card
- https://ollama.com
- Name the 3 parts of the "Docker Composition" described in the blog. #card
- 1. Base Image (NVIDIA/CUDA), 2. Inference Engine (vLLM/Ollama), 3. Model Volume (Persistent Storage).
- Why is Shared Memory (shm_size) critical for LLM Docker containers? #card
- It allows different parts of the AI (like multiple GPUs) to talk to each other quickly without crashing.
- What is the "Hugging Face for the Win" philosophy? #card
- Focus on using the best available open-source models from the community hub rather than training everything from scratch.
- What is a Model Volume in Docker? #card
- A persistent storage folder that stays on your hard drive so you don't have to download 50GB models every time you restart a container.
- What is the primary benefit of PagedAttention? #card
- It manages the LLM's memory like a computer's RAM, preventing waste and allowing the server to handle more users at once.
- ## 3. Practical Guidance
- When should you use a "Small Language Model" (SLM)? #card
- When you need to run the AI locally on a laptop, phone, or device with limited VRAM.
- What is the blog's advice on "Local First"? #card
- Start by running models locally with Ollama to understand how they work before moving to expensive cloud setups.
- How does Temperature affect an LLM? #card
- Low Temperature (0.1) makes it focused and factual; High Temperature (0.8) makes it creative and random.
- What is the "Context Window"? #card
- The total amount of information (input + output) the model can "remember" during a single conversation.
- What does a "Quantized" model actually do to the math? #card
- It rounds the complex numbers in the model weights to simpler versions so they take up less space in memory.
- What is the "System Prompt" in Ollama? #card
- A set of instructions given to the model at the very start to define its personality or rules (e.g., "You are a helpful coder").