SOP: Local LLM Container (Llama-3.1-14B) — Terminal Only Deployment

Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux
1. Purpose
To deploy a private, offline-capable Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with terminal-only interaction.
This configuration prioritizes minimalism, privacy, and reduced attack surface, without a graphical UI layer.
2. Scope
This SOP applies to:
- Local workstation AI usage
- Internal note processing, reasoning, and decision assistance
- Environments where UI complexity is unnecessary or undesired
Not included:
- Desktop GUI tools (e.g., Goose)
- Time-based automation (See SOP #4)
- Cloud AI or remote deployment
3. Responsibilities
Technician Responsibilities
- Deploy and verify container operation
- Validate terminal interaction with LLM API
- Communicate hardware compatibility and performance constraints
Client Responsibilities
- Provide hardware, OS environment, and approval for usage data
- Confirm expectations for privacy, offline capability, and speed
(Optional) IT/Compliance Responsibilities
- Validate offline usage policies if required
4. Requirements
4.1 Minimum Hardware

- CPU: 8 cores
- RAM: 16 GB
- Disk: 20 GB free
- GPU: Optional (CPU fallback supported)
4.2 Recommended Hardware
- CPU: 12+ cores
- RAM: 32–64 GB
- GPU: NVIDIA RTX 3090 or better
- NVMe or SSD storage for models
4.3 GPU Practical Notes (Realistic)
- NVIDIA strongly preferred for local inference (CUDA support, ecosystem maturity)
- AMD may not work for this use-case (ROCm/HIP/Vulkan inconsistent, may fall back to CPU)
- CPU-only performance acceptable for low-volume reasoning tasks
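If an NVIDIA GPU is available, the Compose service defined later in this SOP can be granted GPU access via Docker's device reservation syntax. This is a sketch, not part of the baseline deployment; it requires the NVIDIA Container Toolkit on the host, and the syntax should be verified against your Docker/Compose version:

```yaml
services:
  llm:
    # Requires NVIDIA Container Toolkit on the host.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```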
4.4 Supported OS
| Component | Windows 11 | Linux |
|---|---|---|
| Docker | Docker Desktop | Docker Engine |
| Compose | docker compose | docker compose / Portainer |
| GPU Accel | NVIDIA Preferred | NVIDIA Preferred |
5. Model Selection Note
This SOP uses:
Llama-3.1-14B-Instruct-Q4_K_M
Reasoning performance roughly comparable to high-end GPT-4-class cloud models (not specialized for coding).
Hereafter referred to as Llama-3.1-14B (Q4).
6. Directory Structure
The following structure standardizes deployment files:

```
Docker/
  LLM_Inference/
    Models/
```

7. Procedure — Windows 11
7.1 Install Docker Desktop

Download from: https://www.docker.com/products/docker-desktop/
7.2 Prepare Model Storage
```
mkdir C:\Models
```

Place the .gguf file in C:\Models.
7.3 Create Compose File
Path: Docker\LLM_Inference\docker-compose.yml

```yaml
services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:latest
    volumes:
      - C:\Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
      --chat
    restart: unless-stopped
```

7.4 Start LLM Container
```
cd Docker\LLM_Inference
docker compose up -d
```

7.5 Test via Terminal (Windows)
Option A — curl (if installed):
```
curl http://localhost:8000/v1/models
```

Option B — PowerShell Invoke-RestMethod (curl alternative):
```powershell
Invoke-RestMethod -Uri "http://localhost:8000/v1/models" -Method Get
```

If either returns JSON model info, the service is running.
8. Procedure — Linux
8.1 Install Docker Engine + Compose
```shell
sudo apt install docker.io docker-compose-plugin -y
```

8.2 Prepare Model Storage
```shell
mkdir -p /opt/Models
```

Download the .gguf into /opt/Models.
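Before starting the container, it is worth confirming the multi-gigabyte download is intact. A minimal sketch using sha256sum; the expected checksum is an assumption here and must be taken from the model's actual download page:

```shell
#!/usr/bin/env sh
# Verify a downloaded .gguf against a published SHA-256 checksum.
# The expected checksum must come from the model's download source;
# this is a generic helper, not tied to any specific file.
verify_model() {
    model_path="$1"
    expected_sha256="$2"
    actual_sha256=$(sha256sum "$model_path" | awk '{print $1}')
    if [ "$actual_sha256" = "$expected_sha256" ]; then
        echo "OK: checksum matches"
        return 0
    else
        echo "FAIL: checksum mismatch" >&2
        return 1
    fi
}
```

Usage: `verify_model /opt/Models/<file>.gguf <published-sha256>`.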
8.3 Compose File
Same as Windows, but with the /opt/Models mount path.
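For reference, the resulting Linux Compose file might look like the following (same service as the Windows file in 7.3, with only the mount path swapped; the model filename is assumed to match your download):

```yaml
services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:latest
    volumes:
      - /opt/Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
      --chat
    restart: unless-stopped
```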
8.4 Deploy Container
```shell
cd Docker/LLM_Inference
docker compose up -d
```

8.5 Test via Terminal (Linux)
Option A — curl:
```shell
curl http://localhost:8000/v1/models
```

Option B — wget (curl alternative):
```shell
wget -qO- http://localhost:8000/v1/models
```

If either returns JSON model info, the service is running.
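In scripts, it can help to poll the API rather than test it once, since loading a large model can take a while after the container starts. A minimal polling sketch against the same /v1/models endpoint used above:

```shell
#!/usr/bin/env sh
# Poll the LLM API until it responds or the attempt budget runs out.
# Usage: wait_for_llm <attempts> <delay_seconds>
wait_for_llm() {
    attempts="$1"
    delay="$2"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -s silent, -f fail on HTTP errors, short per-try timeout
        if curl -sf --max-time 2 http://localhost:8000/v1/models >/dev/null 2>&1; then
            echo "LLM API is up"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "LLM API did not respond after $attempts attempts" >&2
    return 1
}
```

Example: `wait_for_llm 30 2` waits up to roughly a minute before giving up.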
9. Usage Notes (Terminal-Only Workflow)
9.1 Technician Prompting via HTTP (Example)
```shell
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Explain recursion simply."}]
  }'
```

Alternative on Windows with PowerShell:
```powershell
Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body '{ "model": "local", "messages": [{"role": "user", "content": "Explain recursion simply."}] }' -ContentType "application/json"
```

9.2 Who Chooses This Option (Context)
Terminal-only mode is preferred when:
- Privacy > convenience
- Minimal UI overhead desired
- Technician is comfortable with CLI tools
- Future automation (e.g., n8n) will be layered on top
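For scripted terminal use, the raw JSON returned by the chat endpoint in 9.1 is noisy. A naive sed-based sketch for pulling out just the assistant's reply; it assumes an OpenAI-style response shape and breaks on escaped quotes or multi-line replies, so prefer `jq -r '.choices[0].message.content'` where jq is installed:

```shell
#!/usr/bin/env sh
# Extract the assistant's reply text from an OpenAI-style chat
# completion response on stdin. Naive: fails on escaped quotes
# or multi-line content; use jq for anything robust.
extract_reply() {
    sed -n 's/.*"content" *: *"\([^"]*\)".*/\1/p'
}
```

Usage: `curl -s ... | extract_reply`.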
10. Validation / Verification
Technician verifies:
- Container reachable on localhost:8000
- Reasoning responses coherent
- Restart persistence operational: docker compose restart llm

Client verifies:
- Expected reasoning output
- Terminal usage acceptable for workflow
11. Optional Lockdown (High Privacy)
- Block outbound Docker network via firewall
- Switch container to network_mode: none
- Manual updates only
- Offline model storage (baseline behavior)
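The network_mode switch above can be expressed in the Compose file as shown in this sketch. Note that network_mode: none is incompatible with published ports, so the "8000:8000" mapping must be removed and the API can then only be reached from inside the container (e.g., via docker exec):

```yaml
services:
  llm:
    # Fully isolates the container from all networks.
    # The "ports" mapping must be removed when this is set.
    network_mode: none
```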
12. Maintenance
- Update models offline if needed
- Re-run compose after updates
- Backup model directory for versioning
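The backup step above can be sketched as a small script; /opt/Models and /opt/Backups are illustrative defaults, not paths mandated by this SOP:

```shell
#!/usr/bin/env sh
# Create a timestamped archive of the model directory for versioning.
# Arguments (both optional): <model_dir> <backup_dir>
backup_models() {
    model_dir="${1:-/opt/Models}"
    backup_dir="${2:-/opt/Backups}"
    mkdir -p "$backup_dir"
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$backup_dir/models-$stamp.tar.gz"
    # Archive the directory by name, relative to its parent.
    tar -czf "$archive" -C "$(dirname "$model_dir")" "$(basename "$model_dir")"
    echo "$archive"
}
```

Usage: `backup_models` (defaults), or `backup_models /opt/Models /mnt/usb/Backups` for offline media.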
13. Notes / Warnings
- AMD support not guaranteed and may fail
- CPU fallback slower but functional
- No UI layer included in this SOP
14. Revision Control
- Version: 1.01.26
- Editor: Elijah B
- Next Review: Within 90 Days