
LLM, SLM, Machine Learning, Artificial Intelligence and Huggingface for the win.

1st January 2026 - AI, Large Language Models

Hugging Face model acronyms (and what they mean for GPU VRAM). Last update: 01/01/2026

Musings. AKA: The blog of a person who keeps downloading models and then acting surprised when VRAM evaporates.

Content Warning: You NEED a fiber connection to do model development. You will be downloading multiple gigabytes of models regularly, and it will happily eat your bandwidth.

Tags: AI, hardware, QuickAndDirty, huggingface


The 30-second takeaway

If you’re deploying on a consumer NVIDIA GPU, “does it fit?” is mostly:

  1. Weights (the model files)
  2. KV cache (Key–Value cache) (memory used while generating tokens)
  3. A bit of overhead (runtime + buffers)

On a 12 GB GPU (e.g., RTX 3080 Ti), the sweet spot is usually a 7B–14B model in a 4-bit quant, with a modest context length so the KV cache doesn’t eat the rest.


VRAM 101 (what uses memory during inference)

Inference (running a trained model to generate tokens) uses GPU VRAM mainly for:

  1. Model weights (the parameters themselves)
  2. KV cache (Key–Value cache), which grows with context length and batch size
  3. Runtime overhead (CUDA context, activation buffers, framework allocations)

Quick weight sizing rule

Weight memory ≈ parameters × bytes per parameter

Common weight formats:

    • FP16 / BF16: 2 bytes per parameter
    • INT8: 1 byte per parameter
    • 4-bit (GPTQ, AWQ, NF4, GGUF Q4): roughly 0.5 bytes per parameter, plus a little quantization metadata

So a rough mental model: a 7B model is ~14 GB in FP16, ~7 GB in INT8, and ~3.5–4 GB in 4-bit, before KV cache and overhead.
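
A minimal sketch of that back-of-envelope math in Python (the parameter counts and byte sizes are nominal assumptions for illustration, not measurements from any specific model):

# Back-of-envelope: weight memory ≈ parameters × bytes per parameter.
# With 1 GB = 1e9 bytes, "billions of params × bytes per param" is already in GB.
# Byte sizes are nominal; real quantized files carry extra metadata (scales, zero points).

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.55,  # approximate; GGUF Q4_K_M, GPTQ, AWQ, NF4 all land near here
}

def weight_gb(params_billions: float, fmt: str) -> float:
    """Rough weight size in GB (ignores KV cache and runtime overhead)."""
    return params_billions * BYTES_PER_PARAM[fmt]

for size_b in (7, 14, 32):
    row = ", ".join(f"{fmt}: ~{weight_gb(size_b, fmt):.1f} GB" for fmt in BYTES_PER_PARAM)
    print(f"{size_b}B -> {row}")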


How to read a Hugging Face model page (fast)

When you open a Hugging Face model card, scan in this order:

  1. Params: 7B, 14B, 32B (billions of parameters)
  2. Format: safetensors, GGUF, GPTQ, AWQ, etc.
  3. Precision / quantization: BF16, FP16, 8-bit, 4-bit, Q4_K_M
  4. Context length: 4k, 8k, 32k, 128k (sometimes listed as ctx)

If the page doesn’t clearly say #2–#4… assume you’ll learn the hard way.
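
If you’d rather check programmatically, here is a hedged sketch using huggingface_hub: the repo id is a placeholder, and config keys like max_position_embeddings and torch_dtype are common but not guaranteed to exist in every config.json.

# Quick programmatic peek at a Hugging Face repo: files, sizes, and a few config hints.
# Requires: pip install huggingface_hub
import json
from huggingface_hub import HfApi, hf_hub_download

repo_id = "some-org/some-model-7B-Instruct"  # placeholder, swap in a real repo

api = HfApi()
info = api.model_info(repo_id, files_metadata=True)

# 1) File formats and sizes (safetensors? GGUF? how many GB on disk?)
for f in info.siblings:
    if f.size:
        print(f"{f.rfilename}: {f.size / 1e9:.2f} GB")

# 2) Precision and context-length hints from config.json (keys vary by architecture)
cfg = json.load(open(hf_hub_download(repo_id, "config.json")))
print("dtype:", cfg.get("torch_dtype"))
print("context length:", cfg.get("max_position_embeddings"))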


Glossary: acronyms you’ll see all the time (and why you should care)

File formats & packaging

    • safetensors (SafeTensors): the standard Hugging Face weight format; safe to load and memory-mappable.
    • GGUF: the llama.cpp file format (successor to GGML); single file, quantization baked in.
    • GPTQ / AWQ / EXL2: quantized weight packages aimed at GPU inference runtimes.
    • ONNX: a portable export format for ONNX Runtime.

Precision / dtypes (how weights are stored)

    • FP32: 4 bytes per parameter; training-grade, rarely needed for inference.
    • FP16 / BF16: 2 bytes per parameter; the usual “full precision” for inference.
    • FP8 / FP6 / FP4: newer low-bit float formats, with hardware support arriving in Ada and Blackwell.
    • INT8 / INT4: integer formats used by most quantization schemes.

Quantization (making weights smaller)

Quantization compresses weights (trading a little quality) so bigger models fit in your VRAM.

Specific 4-bit families you’ll see:

    • GPTQ (Generalized Post-Training Quantization)
    • AWQ (Activation-Aware Weight Quantization)
    • NF4 (NormalFloat 4, via bitsandbytes)
    • EXL2 (ExLlamaV2 quant format, variable bits per weight)

GGUF quant labels (llama.cpp world):

    • Q4_K_M, Q5_K_M, Q6_K, Q8_0 and friends: higher numbers keep more quality and use more VRAM.

Runtime-specific:

    • bitsandbytes (bnb) 4-bit/8-bit for Transformers, EXL2 for ExLlamaV2, compiled engine builds for TensorRT-LLM.


Context & generation memory

    • Context length (ctx): how many tokens the model can attend to; 4k, 8k, 32k, 128k.
    • KV cache (Key–Value cache): per-token memory that grows with context length and batch size; this is the usual surprise OOM.

Model scale & architecture

    • 7B / 14B / 32B: parameter counts in billions; the first thing that decides whether a model fits.
    • MoE (Mixture of Experts): only some experts are activated per token, but all experts still have to be stored.
    • SLM (Small Language Model): smaller models intended for laptops, phones, and low-VRAM GPUs.

Runtimes (what actually runs the model)

    • llama.cpp, Hugging Face Transformers, vLLM, ExLlamaV2, TensorRT-LLM, ONNX Runtime: see the comparison table further down.

What’s “valid” on a 12 GB GPU (practical targets)

Recommended (“won’t ruin your evening”)

    • 7B–14B models in 4-bit quants (GGUF Q4/Q5, AWQ, GPTQ, EXL2, NF4) at modest context lengths

Sometimes works (depends on context/offload/overhead)

    • 14B at longer contexts, INT8 7B, or bigger GGUF models with partial CPU offload

Usually painful on this setup

    • FP16/BF16 13B+, 30B-class models without heavy offload, or very long contexts that blow up the KV cache


“If you see this filename…” (decode at a glance)

    • model-7B-instruct.Q4_K_M.gguf: 7 billion parameters, instruction-tuned, GGUF format, Q4_K_M quant (llama.cpp-friendly, roughly 4 GB of weights).
    • model-14B-AWQ or model-14B-GPTQ-4bit: 4-bit GPU quants intended for vLLM / Transformers-style runtimes.
    • model.safetensors with no quant mentioned: probably FP16/BF16; multiply parameters by 2 bytes before downloading.

Picking a format (quick recommendations)

If you want the easiest path on a 12 GB RTX card:

    • Simplest: GGUF Q4/Q5 via llama.cpp (or a desktop UI built on it), with GPU offload.
    • Python-first: safetensors + bitsandbytes NF4 4-bit in Transformers.
    • Serving an API: AWQ/GPTQ 4-bit in vLLM.

Common gotchas (the ones that bite first)

  1. Weights fit but it still OOMs (out of memory)
    That’s usually KV cache (Key–Value cache) + overhead.
  2. Long context is not “free”
    Increasing context increases memory during generation even if weights don’t change (see the KV-cache sketch after this list).
  3. MoE (Mixture of Experts) naming can be misleading
    “Activated parameters” tells you compute per token, not storage required.
  4. 4-bit ≠ 4-bit
    Different quant schemes trade quality/speed/overhead differently (GGUF Q4 variants, AWQ, GPTQ, NF4, EXL2, etc.).
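
A rough KV-cache estimator, as a hedged sketch: the architecture numbers below (layers, KV heads, head dimension) are illustrative assumptions for a 7B-class model with grouped-query attention, not the spec of any particular checkpoint.

# Rough KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens x batch.
def kv_cache_gb(layers, kv_heads, head_dim, ctx_tokens, batch=1, bytes_per_value=2):
    """Approximate KV cache in GB for a given context length (FP16 cache by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * ctx_tokens * batch / 1e9

# Illustrative 7B-class shape (assumed): 32 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(32, 8, 128, ctx):.1f} GB of KV cache")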

Cheatsheet: what you should look for on Hugging Face

When browsing, you’re basically hunting for:

    • Parameter count (7B / 14B / 32B)
    • Format (safetensors, GGUF, GPTQ, AWQ, EXL2)
    • Quantization / precision (4-bit, INT8, FP16/BF16)
    • Context length (and whether you actually need all of it)

And if a model page doesn’t mention quant/format/context clearly… scroll away with confidence.

| “Gen” (common name) | RTX series / architecture | Typical “main” GPUs & VRAM | Tensor Core low-precision support (relevant to LLMs) | Ideal weight formats to look for on Hugging Face |
| --- | --- | --- | --- | --- |
| 3rd gen | RTX 30 series / Ampere | 3060 12GB, 3070 8GB, 3080 10/12GB, 3090 24GB | Accelerates FP16, BF16, TF32, INT8, INT4 (plus others) | Best overall: 4-bit quantized weights (AWQ / GPTQ / EXL2 / GGUF Q4/Q5) to fit 7B–14B in 12GB. If you have 24GB, INT8 can be a nice quality bump. |
| 4th gen | RTX 40 series / Ada Lovelace | 4060 8/16GB, 4070 12GB, 4080 16GB, 4090 24GB | Still accelerates FP16/BF16/TF32/INT8, and adds FP8 support on 4th-gen Tensor Cores | Best overall: same as Ampere; 4-bit weight quants are still the “fits + fast” default. Nice-to-have: FP8 weights/compute if your inference stack supports it well (more common in serving engines than in random HF repos). |
| 5th gen | RTX 50 series / Blackwell | 5070 12GB, 5070 Ti 16GB, 5080 16GB, 5090 32GB | FP16/BF16/TF32/INT8, FP8 (2nd-gen FP8 Transformer Engine), plus FP6 and FP4 | Best overall today on HF: still 4-bit quants (AWQ/GPTQ/EXL2/GGUF) because they’re widely available. Emerging for Blackwell: FP8/FP6/FP4 weight + kernel paths where supported (depends heavily on the runtime/tooling, not just the GPU). |
| VRAM | Ideal weight choice | What it enables (roughly) |
| --- | --- | --- |
| 8 GB | 4-bit | 7B comfortably; 13B sometimes tight |
| 12 GB | 4-bit | 7B–14B sweet spot |
| 16 GB | 4-bit (or INT8 for smaller models) | 14B very comfy; INT8 7B with extra headroom |
| 24 GB | INT8 (quality) or 4-bit (bigger models) | FP16/BF16 7B–13B; 4-bit ~30B-ish depending on overhead/context |
| 32 GB | INT8 (quality) or 4-bit (bigger models) | Room for larger 4-bit models + more KV cache |
| Runtime | Best for | Model file formats it commonly uses | Quant types you’ll commonly see | GPU/VRAM notes (esp. 12 GB) | Typical setup style |
| --- | --- | --- | --- | --- | --- |
| llama.cpp (C/C++ local inference engine) | Easiest “run locally” + great CPU fallback/offload | GGUF | GGUF Q4/Q5/Q6/Q8 variants like Q4_K_M | Very practical on 12 GB: can offload some layers to the GPU and keep the rest in RAM; good when VRAM is tight | CLI, desktop UIs, simple local servers |
| Transformers (Hugging Face Transformers library, usually PyTorch) | Python workflows, notebooks, agent/tool integration | safetensors / PyTorch weights | bitsandbytes (bnb) with NF4 4-bit or INT8 | 12 GB works well with 4-bit NF4 for 7B–14B; watch the KV cache at long context | Python scripts, notebooks, apps |
| vLLM (high-throughput LLM serving runtime) | Serving an API with batching & concurrency | Usually HF safetensors; sometimes specific quant packages | AWQ, GPTQ, sometimes bnb (depends on build) | Best when you have multiple users/requests; KV cache handling is efficient; 12 GB still likes 4-bit | API server (OpenAI-style endpoints common) |
| ExLlamaV2 (NVIDIA-optimized local runtime) | Fast single-user chat on RTX cards | EXL2 | EXL2 with bits per weight (bpw) targets like 4.0 bpw | Often very fast on 12 GB for 7B–14B; less flexible outside supported formats | Local chat UIs / Python wrappers |
| TensorRT-LLM (NVIDIA TensorRT LLM runtime) | Maximum NVIDIA performance, production deployment | Engine builds (compiled artifacts), not “download and go” | Optimized kernels; FP16, BF16, FP8 where supported | Great speed, but higher setup cost; you typically build an engine per GPU/config | Production/serving, more engineering |
| ONNX Runtime (Open Neural Network Exchange runtime) | Portability + some acceleration; enterprise stacks | ONNX models | Depends on export; FP16 and INT8 quantization | | |
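
As a hedged example of the Transformers + bitsandbytes NF4 path from the table above (the model id is a placeholder; 4-bit loading needs a CUDA build of bitsandbytes):

# Load a 7B-class model in 4-bit NF4 with Transformers + bitsandbytes.
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-model-7B-Instruct"  # placeholder repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 while weights stay 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spill to CPU if the GPU runs out
)

inputs = tokenizer("Explain KV cache in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))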

Running the Models up that hill

So, you have your model, but how do you run it? You have several options.

  1. Run it straight on your machine using vLLM and Python
    • If it goes awry, you may break your Python libraries and corrupt your system.
  2. Run it in a virtual environment (Python venv)
    • Use uv venv .venv to create a virtual environment and go from there.
  3. Run it within Docker containers
    • As long as you have the CUDA libraries set up correctly, you should see only slight overhead.
    • Far simpler to spin up/tear down
    • Least risk to your system
  4. Ollama or vLLM
    • Ollama is a good managed choice, limited to models in the Ollama library.
    • vLLM is the choice for running Hugging Face models, as long as you have the Hugging Face CLI installed and your GPU set up correctly.
  5. CPU or GPU?
    • As mentioned above, some runtimes will split work between CPU and GPU. It’s hit and miss and can hang your machine. Try to stick to GPU, unless you’re on a new M5 Mac, which has unified memory shared between CPU and GPU (see the offload sketch after this list).
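
A minimal sketch of partial CPU/GPU offload via the llama-cpp-python bindings (the GGUF path and layer count are placeholders; the right n_gpu_layers depends on the model and how much VRAM is actually free):

# Partial CPU/GPU offload with llama.cpp's Python bindings.
# Requires a CUDA-enabled build: pip install llama-cpp-python (with GPU support compiled in)
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-model-7b.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=28,   # layers offloaded to the GPU; lower this if you hit OOM
    n_ctx=4096,        # context length; bigger means more KV cache memory
)

out = llm("Q: What does Q4_K_M mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])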

Docker Composition for a local agent model

name: vllm-mistral-neilhighley

services:
  vllm:
    image: vllm/vllm-openai:v0.12.0
    container_name: vllm-ministral
    ports:
      - "8333:8000"

    # Recommended by vLLM docker docs to avoid PyTorch shared-memory issues
    ipc: host

    # Hugging Face cache (faster restarts; avoids re-downloading)
    volumes:
      - ${HF_HOME:-~/.cache/huggingface}:/root/.cache/huggingface

    environment:
      # Put this in a .env file (see below)
      - HF_TOKEN=${HF_TOKEN}
      

    # Give the container access to the NVIDIA GPU
    # Works with Docker Compose v2 + NVIDIA Container Toolkit installed
    gpus: all

    # vLLM OpenAI-compatible server arguments
    # Mistral/Ministral model card recommends these flags for vLLM.
    command: >
      mistralai/Ministral-3-8B-Instruct-2512
      --tokenizer_mode mistral
      --config_format mistral
      --load_format mistral
      --enable-auto-tool-choice
      --tool-call-parser mistral
      --dtype auto
      --max-model-len "4096"
      --gpu-memory-utilization "0.3"
    networks:
      - nh-vllm-agent-network

networks:
  nh-vllm-agent-network:
    driver: bridge

Use the docker-compose.yaml file above and increase --gpu-memory-utilization until it hits a wall. CPU memory is a secondary concern, but be careful using :latest for any container image, as it may end up downloading 8GB on each docker compose up. That is fine if you’re on fiber, but it still takes time. Find the actual tag of the current release, docker pull it locally, and stick with it until you need extra features.

Add OpenWebUI or OpenHands to the composition to create front ends or agent workspaces. Use an agent CLI to help you create an MCP server or a Python wrapper to enable RAG and other tools for your model. A minimal client example against the vLLM endpoint follows below.
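
A hedged sketch of talking to the composition above through its OpenAI-compatible endpoint (port 8333 matches the compose file; the model name must match whatever vLLM actually loaded):

# Minimal client for the vLLM OpenAI-compatible server started by the compose file.
# Requires: pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8333/v1",  # host port mapped in the compose file
    api_key="not-needed-locally",         # vLLM ignores this unless you configure an API key
)

resp = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Instruct-2512",  # must match the model vLLM is serving
    messages=[{"role": "user", "content": "Give me a one-line summary of KV cache."}],
    max_tokens=64,
    temperature=0.2,
)
print(resp.choices[0].message.content)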

IMHO: Taking :latest from Docker repositories is an accident waiting to happen.

So this page is a good starter and should also be helpful for my personal use of these models.

Links

https://artificialanalysis.ai/leaderboards/models
https://huggingface.co/models
https://huggingface.co/docs
https://www.geeksforgeeks.org/artificial-intelligence/large-language-model-llm/
https://yellow.systems/blog/llm-deep-dive

Model running

https://docs.vllm.ai/en/latest

https://ollama.com

BONUS: Logseq flashcards

Add this as a page in Logseq to help with retaining LLM knowledge.

- # AI & Infrastructure Flashcards (Expanded)
- ## 1. The Acronym Deep Dive
- What does AI stand for in the context of this blog? #card
	- Artificial Intelligence: The broad field of creating systems that simulate human intelligence.
- What does ML stand for? #card
	- Machine Learning: A subset of AI where systems learn patterns from data without explicit programming.
- What does LLM stand for? #card
	- Large Language Model: Neural networks trained on massive text datasets to understand and generate language.
- What does NLP stand for? #card
	- Natural Language Processing: The specialized field of AI dealing with human language interaction.
- What does RAG stand for? #card
	- Retrieval-Augmented Generation: Giving an LLM a "search engine" to look up facts before answering.
- What does GGUF stand for? #card
	- GPT-Generated Unified Format: The successor to GGML, used for running models efficiently on consumer hardware.
- What does HF stand for? #card
	- Hugging Face: The platform used as the central hub for models, datasets, and AI collaboration.
- What does TGI stand for? #card
	- Text Generation Inference: A high-performance toolkit developed by Hugging Face for deploying LLMs.
- What does VRAM stand for? #card
	- Video Random Access Memory: The dedicated memory on a GPU that holds the model weights during inference.
- What does CUDA stand for? #card
	- Compute Unified Device Architecture: NVIDIA’s parallel computing platform that allows AI software to use the GPU.
- What does LoRA stand for? #card
	- Low-Rank Adaptation: A fine-tuning technique that allows adapting large models using very little compute.
- What does QLoRA stand for? #card
	- Quantized Low-Rank Adaptation: A method that combines quantization with LoRA to fine-tune models on even smaller GPUs.
- What does GPTQ stand for? #card
	- Generalized Post-Training Quantization: A 4-bit quantization method designed to run efficiently on GPUs.
- What does EXL2 stand for? #card
	- ExLlamaV2: A high-performance quantization format specifically optimized for extremely fast inference on NVIDIA GPUs.
- What does AWQ stand for? #card
	- Activation-aware Weight Quantization: A hardware-friendly quantization format that maintains higher accuracy than GPTQ.
- What does JSON stand for in the context of LLM outputs? #card
	- JavaScript Object Notation: The standard data format used when you want an LLM to provide structured, machine-readable data.
- What does REST stand for in AI APIs? #card
	- Representational State Transfer: The architectural style used by vLLM and Ollama to provide their web-based API endpoints.
- ## 2. Deployment & Infrastructure
- What is the vLLM URL? #card
	- https://github.com/vllm-project/vllm
- What is the Ollama URL? #card
	- https://ollama.com
- Name the 3 parts of the "Docker Composition" described in the blog. #card
	- 1. Base Image (NVIDIA/CUDA), 2. Inference Engine (vLLM/Ollama), 3. Model Volume (Persistent Storage).
- Why is Shared Memory (shm_size) critical for LLM Docker containers? #card
	- PyTorch uses shared memory for inter-process communication (e.g., tensor-parallel workers across multiple GPUs); if it is too small, the container can crash mid-run.
- What is the "Hugging Face for the Win" philosophy? #card
	- Focus on using the best available open-source models from the community hub rather than training everything from scratch.
- What is a Model Volume in Docker? #card
	- A persistent storage folder that stays on your hard drive so you don't have to download 50GB models every time you restart a container.
- What is the primary benefit of PagedAttention? #card
	- It manages the LLM's memory like a computer's RAM, preventing waste and allowing the server to handle more users at once.
- ## 3. Practical Guidance
- When should you use a "Small Language Model" (SLM)? #card
	- When you need to run the AI locally on a laptop, phone, or device with limited VRAM.
- What is the blog's advice on "Local First"? #card
	- Start by running models locally with Ollama to understand how they work before moving to expensive cloud setups.
- How does Temperature affect an LLM? #card
	- Low Temperature (0.1) makes it focused and factual; High Temperature (0.8) makes it creative and random.
- What is the "Context Window"? #card
	- The total amount of information (input + output) the model can "remember" during a single conversation.
- What does a "Quantized" model actually do to the math? #card
	- It rounds the complex numbers in the model weights to simpler versions so they take up less space in memory.
- What is the "System Prompt" in Ollama? #card
	- A set of instructions given to the model at the very start to define its personality or rules (e.g., "You are a helpful coder").
