
SOP: Local LLM Container (Llama-3.1-14B) — Terminal Only Deployment

Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux


Purpose

To deploy a private, offline-capable Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with terminal-only interaction. This configuration prioritizes minimalism, privacy, and a reduced attack surface, with no graphical UI layer.


Scope

This SOP applies to:

  • Local workstation AI usage
  • Internal note processing, reasoning, and decision assistance
  • Environments where UI complexity is unnecessary or undesired

Not included:

  • Desktop GUI tools (e.g., Goose)
  • Time-based automation (See SOP #4)
  • Cloud AI or remote deployment

Technician Responsibilities

  • Deploy and verify container operation
  • Validate terminal interaction with LLM API
  • Communicate hardware compatibility and performance constraints

Client Responsibilities

  • Provide hardware, OS environment, and approval for usage data
  • Confirm expectations for privacy, offline capability, and speed

(Optional) IT/Compliance Responsibilities

  • Validate offline usage policies if required

Hardware Requirements

Minimum:

  • CPU: 8 cores
  • RAM: 16 GB
  • Disk: 20 GB free
  • GPU: Optional (CPU fallback supported)

Recommended:

  • CPU: 12+ cores
  • RAM: 32–64 GB
  • GPU: NVIDIA RTX 3090 or better
  • NVMe or SSD storage for models

GPU notes:

  • NVIDIA strongly preferred for local inference (CUDA support, ecosystem maturity)
  • AMD may not work for this use case (ROCm/HIP/Vulkan support is inconsistent and may fall back to CPU)
  • CPU-only performance is acceptable for low-volume reasoning tasks
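
To confirm a host meets these targets, a few standard commands suffice (Linux shown; nvidia-smi applies only when an NVIDIA driver is installed):

nproc        # logical CPU cores
free -h      # installed RAM
df -h .      # free disk space on the current volume
nvidia-smi   # GPU model, VRAM, and driver version (NVIDIA only)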

Software Requirements

Component    Windows 11        Linux
Docker       Docker Desktop    Docker Engine
Compose      docker compose    docker compose / Portainer
GPU Accel    NVIDIA preferred  NVIDIA preferred

Model Selection

This SOP uses:

Llama-3.1-14B-Instruct-Q4_K_M

Reasoning performance is roughly comparable to high-end GPT-4-class cloud models for general tasks (it is not specialized for coding).

Hereafter referred to as Llama-3.1-14B (Q4).


The following structure standardizes deployment files:

Docker/
  LLM_Inference/
    docker-compose.yml
Models/
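
A minimal sketch for creating the deployment folder from a shell (paths match the structure above; adjust to your environment):

mkdir -p Docker/LLM_Inference    # Linux; on Windows: mkdir Docker\LLM_Inference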

Windows 11 Deployment

Install Docker Desktop. Download from: https://www.docker.com/products/docker-desktop/

Create the model directory:
mkdir C:\Models

Place .gguf file in C:\Models.
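
If the model file still needs to be fetched, PowerShell's Invoke-WebRequest can download it directly into C:\Models. The URL below is a placeholder, not a verified source; substitute the actual .gguf download link:

Invoke-WebRequest -Uri "https://example.com/llama-3.1-14b-instruct-q4_k_m.gguf" -OutFile "C:\Models\llama-3.1-14b-instruct-q4_k_m.gguf"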

Create the Compose file at Docker\LLM_Inference\docker-compose.yml:

services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - C:/Models:/models   # forward slashes avoid YAML escaping issues on Windows
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped
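
Before starting the stack, the file can be syntax-checked; docker compose config parses the YAML and prints the resolved configuration, failing loudly on errors:

docker compose config   # run from Docker\LLM_Inference

Then start the container: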
cd Docker\LLM_Inference
docker compose up -d
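
The first startup can take a while as the model loads into memory. Two standard Compose commands confirm progress:

docker compose ps            # container state should be "running"
docker compose logs -f llm   # follow server logs; Ctrl+C stops following

Once the logs show the server listening on port 8000, verify the HTTP endpoint.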

Option A — curl (if installed):

curl http://localhost:8000/v1/models

Option B — PowerShell Invoke-RestMethod (curl alternative):

Invoke-RestMethod -Uri "http://localhost:8000/v1/models" -Method Get

If either returns JSON model info, the service is running.


Linux Deployment

Install Docker Engine and the Compose plugin:

sudo apt install docker.io docker-compose-v2 -y

(On systems that use Docker's own apt repository, the plugin package is named docker-compose-plugin instead.)

Create the model directory:

sudo mkdir -p /opt/Models

Download .gguf into /opt/Models.

Same as Windows, but with /opt/Models mount path.
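
Only the volume mapping changes from the Windows file; a sketch of the relevant lines:

services:
  llm:
    volumes:
      - /opt/Models:/models

Start the container: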

cd Docker/LLM_Inference
docker compose up -d

Option A — curl:

curl http://localhost:8000/v1/models

Option B — wget (curl alternative):

wget -qO- http://localhost:8000/v1/models

If either returns JSON model info, the service is running.


9.1 Technician Prompting via HTTP (Example)

curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}'

Alternative on Windows with PowerShell:

Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}' -ContentType "application/json"
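
On Linux, the reply text can be extracted from the JSON with jq (assuming jq is installed); the .choices[0].message.content path follows the OpenAI-style response shape served by llama.cpp:

curl -s -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}' | jq -r '.choices[0].message.content'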

Terminal-only mode is preferred when:

  • Privacy > convenience
  • Minimal UI overhead desired
  • Technician is comfortable with CLI tools
  • Future automation (e.g., n8n) will be layered on top

Technician verifies:

  • Container reachable on localhost:8000
  • Reasoning responses coherent
  • Restart persistence works:
docker compose restart llm
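
On Linux, restart persistence can also be exercised by restarting the Docker daemon itself; with restart: unless-stopped the container should come back on its own:

sudo systemctl restart docker
docker ps   # the llm container should reappear as "Up"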

Client verifies:

  • Expected reasoning output
  • Terminal usage acceptable for workflow

Optional Hardening

  • Block outbound Docker network access via firewall
  • Switch the container to network_mode: none (see the sketch after this list)
  • Manual updates only
  • Offline model storage (baseline behavior)
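
A sketch of the network_mode change. Note that network_mode: none conflicts with published ports, so the "8000:8000" mapping must be removed and the API becomes reachable only from inside the container (e.g., via docker exec):

services:
  llm:
    network_mode: none   # no network stack; remove the ports: mapping, which cannot be published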

Maintenance

  • Update models offline if needed
  • Re-run compose after updates
  • Back up the model directory for versioning (example below)
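
One simple approach on Linux is a dated tar archive of the model directory (paths per this SOP; adjust as needed):

tar czf "models-backup-$(date +%F).tar.gz" -C /opt Models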

Known Limitations

  • AMD GPU support is not guaranteed and may fail
  • CPU fallback is slower but functional
  • No UI layer is included in this SOP

Document Control

  • Version: 1.01.26
  • Editor: Elijah B
  • Next Review: Within 90 Days