SOP: Local LLM Container (Llama-3.1-14B) — Terminal Only Deployment¶
Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux
1. Purpose¶
To deploy a private, offline-capable Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with terminal-only interaction.
This configuration prioritizes minimalism, privacy, and reduced attack surface, without a graphical UI layer.
2. Scope¶
This SOP applies to:
- Local workstation AI usage
- Internal note processing, reasoning, and decision assistance
- Environments where UI complexity is unnecessary or undesired
Not included:
- Desktop GUI tools (e.g., Goose)
- Time-based automation (see SOP #4)
- Cloud AI or remote deployment
3. Responsibilities¶
Technician Responsibilities
- Deploy and verify container operation
- Validate terminal interaction with the LLM API
- Communicate hardware compatibility and performance constraints
Client Responsibilities
- Provide hardware, the OS environment, and approval for usage data
- Confirm expectations for privacy, offline capability, and speed
(Optional) IT/Compliance Responsibilities
- Validate offline usage policies if required
4. Requirements¶
4.1 Minimum Hardware¶
- CPU: 8 cores
- RAM: 16 GB
- Disk: 20 GB free
- GPU: Optional (CPU fallback supported)
4.2 Recommended Hardware¶
- CPU: 12+ cores
- RAM: 32–64 GB
- GPU: NVIDIA RTX-3090 or better
- NVMe or SSD storage for models
4.3 GPU Practical Notes (Realistic)¶
- NVIDIA strongly preferred for local inference (CUDA support, ecosystem maturity); a quick GPU visibility check follows this list
- AMD may not work for this use case (ROCm/HIP/Vulkan support is inconsistent and may fall back to CPU)
- CPU-only performance is acceptable for low-volume reasoning tasks
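If GPU acceleration is expected, a quick sanity check (assuming the NVIDIA driver and NVIDIA Container Toolkit are installed, or Docker Desktop with WSL2 GPU support on Windows) is to confirm the GPU is visible on the host and from inside a container:
# Host-level check: driver loaded and GPU visible
nvidia-smi
# Container-level check: Docker can pass the GPU through
docker run --rm --gpus all ubuntu nvidia-smi
If both commands list the GPU, container-level acceleration should be available.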
4.4 Supported OS¶
| Component | Windows 11 | Linux |
|---|---|---|
| Docker | Docker Desktop | Docker Engine |
| Compose | docker compose | docker compose / Portainer |
| GPU Accel | NVIDIA Preferred | NVIDIA Preferred |
5. Model Selection Note¶
This SOP uses:
Llama-3.1-14B-Instruct-Q4_K_M
Reasoning performance roughly comparable to high-end GPT-4-class cloud models (not specialized for coding).
Hereafter referred to as Llama-3.1-14B (Q4).
6. Directory Structure¶
The following structure standardizes deployment files:
Docker/
LLM_Inference/
Models/
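For reference, the populated tree is expected to look roughly as follows (file names are illustrative and match the compose file and model used later in this SOP); in the platform procedures below, the model directory maps to C:\Models on Windows or /opt/Models on Linux:
Docker/
    LLM_Inference/
        docker-compose.yml
    Models/
        llama-3.1-14b-instruct-q4_k_m.gguf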
7. Procedure — Windows 11¶
7.1 Install Docker Desktop¶
Download from: https://www.docker.com/products/docker-desktop/
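After installation, confirm Docker Desktop is running and the Compose plugin is available (exact versions will vary):
docker --version
docker compose version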
7.2 Prepare Model Storage¶
mkdir C:\Models
Place the downloaded .gguf model file in C:\Models.
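To confirm the model file is in place before starting the container:
Get-ChildItem C:\Models\*.gguf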
7.3 Create Compose File¶
Path: Docker\LLM_Inference\docker-compose.yml
services:
  llm:
    # llama.cpp server image; exposes an OpenAI-compatible HTTP API
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - C:/Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped
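If an NVIDIA GPU is available (Docker Desktop with WSL2 GPU support on Windows, or the NVIDIA Container Toolkit on Linux), a GPU-enabled variant of the same service might look like the sketch below. The server-cuda image tag and the --n-gpu-layers value are assumptions; verify them against the llama.cpp container documentation before use.
services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda   # CUDA build of the server (verify tag)
    volumes:
      - C:/Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
      --n-gpu-layers 99
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped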
7.4 Start LLM Container¶
cd Docker\LLM_Inference
docker compose up -d
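To confirm the container started and to watch the model load (the service name llm matches the compose file above):
docker compose ps
docker compose logs -f llm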
7.5 Test via Terminal (Windows)¶
Option A — curl (if installed):
curl http://localhost:8000/v1/models
Option B — PowerShell Invoke-RestMethod (curl alternative):
Invoke-RestMethod -Uri "http://localhost:8000/v1/models" -Method Get
If either returns JSON model info, the service is running.
8. Procedure — Linux¶
8.1 Install Docker Engine + Compose¶
sudo apt update
sudo apt install -y docker.io docker-compose-plugin
Note: docker-compose-plugin is published in Docker's official apt repository. If it is not found in the distribution's default repositories, the Compose v2 plugin is typically packaged as docker-compose-v2 (Ubuntu), or Docker can be installed by following Docker's official instructions.
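After installation, enable the daemon, optionally allow the current user to run Docker without sudo, and confirm versions:
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"   # log out and back in for this to take effect
docker --version
docker compose version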
8.2 Prepare Model Storage¶
sudo mkdir -p /opt/Models
Download the .gguf model file into /opt/Models.
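Before starting the container, confirm the model file is present and readable:
ls -lh /opt/Models/*.gguf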
8.3 Compose File¶
Same as the Windows compose file (section 7.3), but with the volume mount changed to /opt/Models, as shown below.
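Only the volumes entry changes; the rest of the file is identical to section 7.3:
    volumes:
      - /opt/Models:/models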
8.4 Deploy Container¶
cd Docker/LLM_Inference
docker compose up -d
8.5 Test via Terminal (Linux)¶
Option A — curl:
curl http://localhost:8000/v1/models
Option B — wget (curl alternative):
wget -qO- http://localhost:8000/v1/models
If either returns JSON model info, the service is running.
9. Usage Notes (Terminal-Only Workflow)¶
9.1 Technician Prompting via HTTP (Example)¶
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}'
Alternative on Windows with PowerShell:
Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}' -ContentType "application/json"
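For repeated terminal use, a small wrapper script can avoid retyping the request body. A minimal sketch for Linux (the script name ask.sh is illustrative and assumes jq is installed to build the JSON and extract the reply):
#!/usr/bin/env bash
# ask.sh: send one prompt to the local llama.cpp server and print the reply
# Usage: ./ask.sh "Explain recursion simply."
PROMPT="$1"
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$PROMPT" '{model: "local", messages: [{role: "user", content: $p}]}')" \
  | jq -r '.choices[0].message.content'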
9.2 Who Chooses This Option (Context)¶
Terminal-only mode is preferred when:
- Privacy > convenience
- Minimal UI overhead is desired
- The technician is comfortable with CLI tools
- Future automation (e.g., n8n) will be layered on top
10. Validation / Verification¶
Technician verifies:
- Container reachable on localhost:8000
- Reasoning responses coherent
- Restart persistence works (the API responds again after a restart):
docker compose restart llm
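After the restart, allow a few seconds for the model to reload, then confirm the API answers again:
curl http://localhost:8000/v1/models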
Client verifies:
- Expected reasoning output
- Terminal usage acceptable for workflow
11. Optional Lockdown (High Privacy)¶
- Block outbound Docker network via firewall
- Switch the container to network_mode: none (see the sketch after this list)
- Manual updates only
- Offline model storage (baseline behavior)
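A minimal sketch of the network_mode: none variant, shown with the Linux mount path. Note the trade-off: with the network stack disabled, published ports no longer apply, so the API is only reachable from inside the container (for example via docker exec, if a suitable client exists in the image); weigh this against the HTTP workflow in section 9 before adopting it.
services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    network_mode: none        # no network stack: no outbound traffic, no published ports
    volumes:
      - /opt/Models:/models
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 127.0.0.1
      --port 8000
    restart: unless-stopped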
12. Maintenance¶
- Update models offline if needed
- Re-run docker compose up -d after updates
- Back up the model directory for versioning (example below)
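A simple backup approach on Linux (the destination path is illustrative):
rsync -a /opt/Models/ /backup/Models-$(date +%Y%m%d)/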
13. Notes / Warnings¶
- AMD support not guaranteed and may fail
- CPU fallback slower but functional
- No UI layer included in this SOP
14. Revision Control¶
- Version: 1.01.26
- Editor: Elijah B
- Next Review: Within 90 Days