SOP: Local LLM Container (Llama-3.1-14B) — Terminal Only Deployment¶
Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux
1. Purpose¶
To deploy a private, offline-capable Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with terminal-only interaction.
This configuration prioritizes minimalism, privacy, and reduced attack surface, without a graphical UI layer.
2. Scope¶
This SOP applies to:
- Local workstation AI usage
- Internal note processing, reasoning, and decision assistance
- Environments where UI complexity is unnecessary or undesired
Not included:
- Desktop GUI tools (e.g., Goose)
- Time-based automation (see SOP #4)
- Cloud AI or remote deployment
3. Responsibilities¶
Technician Responsibilities
- Deploy and verify container operation
- Validate terminal interaction with the LLM API
- Communicate hardware compatibility and performance constraints
Client Responsibilities
- Provide hardware, the OS environment, and approval for usage data
- Confirm expectations for privacy, offline capability, and speed
(Optional) IT/Compliance Responsibilities
- Validate offline usage policies if required
4. Requirements¶
4.1 Minimum Hardware¶
- CPU: 8 cores
- RAM: 16 GB
- Disk: 20 GB free
- GPU: Optional (CPU fallback supported)
4.2 Recommended Hardware¶
- CPU: 12+ cores
- RAM: 32–64 GB
- GPU: NVIDIA RTX-3090 or better
- NVMe or SSD storage for models
4.3 GPU Practical Notes (Realistic)¶
- NVIDIA strongly preferred for local inference (CUDA support, ecosystem maturity); a quick GPU visibility check follows this list
- AMD may not work for this use case (ROCm/HIP/Vulkan support is inconsistent and may fall back to CPU)
- CPU-only performance is acceptable for low-volume reasoning tasks
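If GPU acceleration is expected, a quick sanity check (assuming the NVIDIA driver and NVIDIA Container Toolkit are installed, or Docker Desktop with WSL2 GPU support on Windows) is to confirm the GPU is visible on the host and from inside a container:
# Host-level check: driver loaded and GPU visible
nvidia-smi
# Container-level check: Docker can pass the GPU through
docker run --rm --gpus all ubuntu nvidia-smi
If both commands list the GPU, container-level acceleration should be available.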
4.4 Supported OS¶
| Component | Windows 11 | Linux |
|---|---|---|
| Docker | Docker Desktop | Docker Engine |
| Compose | docker compose | docker compose / Portainer |
| GPU Accel | NVIDIA Preferred | NVIDIA Preferred |
5. Model Selection Note¶
This SOP uses:
Llama-3.1-14B-Instruct-Q4_K_M
Reasoning performance roughly comparable to high-end GPT-4-class cloud models (not specialized for coding).
Hereafter referred to as Llama-3.1-14B (Q4).
6. Directory Structure¶
The following structure standardizes deployment files:
Docker/
LLM_Inference/
Models/
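For reference, the populated tree is expected to look roughly as follows (file names are illustrative and match the compose file and model used later in this SOP); in the platform procedures below, the model directory maps to C:\Models on Windows or /opt/Models on Linux:
Docker/
    LLM_Inference/
        docker-compose.yml
    Models/
        llama-3.1-14b-instruct-q4_k_m.gguf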
7. Procedure — Windows 11¶
7.1 Install Docker Desktop¶
Download from: https://www.docker.com/products/docker-desktop/
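After installation, confirm Docker Desktop is running and the Compose plugin is available (exact versions will vary):
docker --version
docker compose version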
7.2 Prepare Model Storage¶
mkdir C:\Models
Place the downloaded .gguf model file in C:\Models.
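To confirm the model file is in place before starting the container:
Get-ChildItem C:\Models\*.gguf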
7.3 Create Compose File¶
Path: Docker\LLM_Inference\docker-compose.yml
services:
  llm:
    # llama.cpp server image; exposes an OpenAI-compatible HTTP API
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - C:/Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped
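If an NVIDIA GPU is available (Docker Desktop with WSL2 GPU support on Windows, or the NVIDIA Container Toolkit on Linux), a GPU-enabled variant of the same service might look like the sketch below. The server-cuda image tag and the --n-gpu-layers value are assumptions; verify them against the llama.cpp container documentation before use.
services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda   # CUDA build of the server (verify tag)
    volumes:
      - C:/Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
      --n-gpu-layers 99
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped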
7.4 Start LLM Container¶
cd Docker\LLM_Inference
docker compose up -d
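To confirm the container started and to watch the model load (the service name llm matches the compose file above):
docker compose ps
docker compose logs -f llm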
7.5 Test via Terminal (Windows)¶
Option A — curl (if installed):
curl http://localhost:8000/v1/models
Option B — PowerShell Invoke-RestMethod (curl alternative):
Invoke-RestMethod -Uri "http://localhost:8000/v1/models" -Method Get
If either returns JSON model info, the service is running.
8. Procedure — Linux¶
8.1 Install Docker Engine + Compose¶
sudo apt update
sudo apt install -y docker.io docker-compose-plugin
Note: docker-compose-plugin is published in Docker's official apt repository. If it is not found in the distribution's default repositories, the Compose v2 plugin is typically packaged as docker-compose-v2 (Ubuntu), or Docker can be installed by following Docker's official instructions.
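After installation, enable the daemon, optionally allow the current user to run Docker without sudo, and confirm versions:
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"   # log out and back in for this to take effect
docker --version
docker compose version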
8.2 Prepare Model Storage¶
sudo mkdir -p /opt/Models
Download the .gguf model file into /opt/Models.
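Before starting the container, confirm the model file is present and readable:
ls -lh /opt/Models/*.gguf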
8.3 Compose File¶
Same as the Windows compose file (section 7.3), but with the volume mount changed to /opt/Models, as shown below.
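Only the volumes entry changes; the rest of the file is identical to section 7.3:
    volumes:
      - /opt/Models:/models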
8.4 Deploy Container¶
cd Docker/LLM_Inference
docker compose up -d
8.5 Test via Terminal (Linux)¶
Option A — curl:
curl http://localhost:8000/v1/models
Option B — wget (curl alternative):
wget -qO- http://localhost:8000/v1/models
If either returns JSON model info, the service is running.
9. Usage Notes (Terminal-Only Workflow)¶
9.1 Technician Prompting via HTTP (Example)¶
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}'
Alternative on Windows with PowerShell:
Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}' -ContentType "application/json"
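For repeated terminal use, a small wrapper script can avoid retyping the request body. A minimal sketch for Linux (the script name ask.sh is illustrative and assumes jq is installed to build the JSON and extract the reply):
#!/usr/bin/env bash
# ask.sh: send one prompt to the local llama.cpp server and print the reply
# Usage: ./ask.sh "Explain recursion simply."
PROMPT="$1"
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$PROMPT" '{model: "local", messages: [{role: "user", content: $p}]}')" \
  | jq -r '.choices[0].message.content'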
9.2 Who Chooses This Option (Context)¶
Terminal-only mode is preferred when:
- Privacy > convenience
- Minimal UI overhead is desired
- The technician is comfortable with CLI tools
- Future automation (e.g., n8n) will be layered on top
10. Validation / Verification¶
Technician verifies:
- Container reachable on localhost:8000
- Reasoning responses coherent
- Restart persistence works (the API responds again after a restart):
docker compose restart llm
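After the restart, allow a few seconds for the model to reload, then confirm the API answers again:
curl http://localhost:8000/v1/models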
Client verifies:
- Expected reasoning output
- Terminal usage acceptable for workflow
11. Optional Lockdown (High Privacy)¶
- Block outbound Docker network via firewall
- Switch the container to network_mode: none (see the sketch after this list)
- Manual updates only
- Offline model storage (baseline behavior)
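A minimal sketch of the network_mode: none variant, shown with the Linux mount path. Note the trade-off: with the network stack disabled, published ports no longer apply, so the API is only reachable from inside the container (for example via docker exec, if a suitable client exists in the image); weigh this against the HTTP workflow in section 9 before adopting it.
services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    network_mode: none        # no network stack: no outbound traffic, no published ports
    volumes:
      - /opt/Models:/models
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 127.0.0.1
      --port 8000
    restart: unless-stopped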
12. Maintenance¶
- Update models offline if needed
- Re-run docker compose up -d after updates
- Back up the model directory for versioning (example below)
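A simple backup approach on Linux (the destination path is illustrative):
rsync -a /opt/Models/ /backup/Models-$(date +%Y%m%d)/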
13. Notes / Warnings¶
- AMD support not guaranteed and may fail
- CPU fallback slower but functional
- No UI layer included in this SOP
14. Revision Control¶
- Version: 1.01.26
- Editor: Elijah B
- Next Review: Within 90 Days