
SOP: Local LLM Container (Llama-3.1-14B) — Terminal Only Deployment

Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux


Purpose

To deploy a private, offline-capable Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with terminal-only interaction. This configuration prioritizes minimalism, privacy, and a reduced attack surface, with no graphical UI layer.


Scope

This SOP applies to:

  • Local workstation AI usage
  • Internal note processing, reasoning, and decision assistance
  • Environments where UI complexity is unnecessary or undesired

Not included:

  • Desktop GUI tools (e.g., Goose)
  • Time-based automation (See SOP #4)
  • Cloud AI or remote deployment

Technician Responsibilities

  • Deploy and verify container operation
  • Validate terminal interaction with LLM API
  • Communicate hardware compatibility and performance constraints

Client Responsibilities

  • Provide hardware, OS environment, and approval for usage data
  • Confirm expectations for privacy, offline capability, and speed

(Optional) IT/Compliance Responsibilities

  • Validate offline usage policies if required

Hardware Requirements

Minimum:

  • CPU: 8 cores
  • RAM: 16 GB
  • Disk: 20 GB free
  • GPU: Optional (CPU fallback supported)

Recommended:

  • CPU: 12+ cores
  • RAM: 32–64 GB
  • GPU: NVIDIA RTX 3090 or better
  • NVMe or SSD storage for models

GPU notes:

  • NVIDIA strongly preferred for local inference (CUDA support, ecosystem maturity)
  • AMD may not work for this use case (ROCm/HIP/Vulkan support is inconsistent and may fall back to CPU)
  • CPU-only performance is acceptable for low-volume reasoning tasks
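
To confirm a host meets these targets, a few standard commands suffice (Linux shown; nvidia-smi applies only when an NVIDIA driver is installed):

nproc        # logical CPU cores
free -h      # installed RAM
df -h .      # free disk space on the current volume
nvidia-smi   # GPU model, VRAM, and driver version (NVIDIA only)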

Software Requirements

Component    Windows 11        Linux
Docker       Docker Desktop    Docker Engine
Compose      docker compose    docker compose / Portainer
GPU Accel    NVIDIA preferred  NVIDIA preferred

Model Selection

This SOP uses:

Llama-3.1-14B-Instruct-Q4_K_M

Reasoning performance is roughly comparable to high-end GPT-4-class cloud models for general tasks (it is not specialized for coding).

Hereafter referred to as Llama-3.1-14B (Q4).


The following structure standardizes deployment files:

Docker/
  LLM_Inference/
    docker-compose.yml
Models/
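
A minimal sketch for creating the deployment folder from a shell (paths match the structure above; adjust to your environment):

mkdir -p Docker/LLM_Inference    # Linux; on Windows: mkdir Docker\LLM_Inference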

Windows 11 Deployment

Install Docker Desktop. Download from: https://www.docker.com/products/docker-desktop/

Create the model directory:
mkdir C:\Models

Place .gguf file in C:\Models.
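
If the model file still needs to be fetched, PowerShell's Invoke-WebRequest can download it directly into C:\Models. The URL below is a placeholder, not a verified source; substitute the actual .gguf download link:

Invoke-WebRequest -Uri "https://example.com/llama-3.1-14b-instruct-q4_k_m.gguf" -OutFile "C:\Models\llama-3.1-14b-instruct-q4_k_m.gguf"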

Create the Compose file at Docker\LLM_Inference\docker-compose.yml:

services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - C:/Models:/models   # forward slashes avoid YAML escaping issues on Windows
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped
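
Before starting the stack, the file can be syntax-checked; docker compose config parses the YAML and prints the resolved configuration, failing loudly on errors:

docker compose config   # run from Docker\LLM_Inference

Then start the container: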
cd Docker\LLM_Inference
docker compose up -d
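
The first startup can take a while as the model loads into memory. Two standard Compose commands confirm progress:

docker compose ps            # container state should be "running"
docker compose logs -f llm   # follow server logs; Ctrl+C stops following

Once the logs show the server listening on port 8000, verify the HTTP endpoint.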

Option A — curl (if installed):

curl http://localhost:8000/v1/models

Option B — PowerShell Invoke-RestMethod (curl alternative):

Invoke-RestMethod -Uri "http://localhost:8000/v1/models" -Method Get

If either returns JSON model info, the service is running.


Linux Deployment

Install Docker Engine and the Compose plugin:

sudo apt install docker.io docker-compose-v2 -y

(On systems that use Docker's own apt repository, the plugin package is named docker-compose-plugin instead.)

Create the model directory:

sudo mkdir -p /opt/Models

Download .gguf into /opt/Models.

Same as Windows, but with /opt/Models mount path.
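
Only the volume mapping changes from the Windows file; a sketch of the relevant lines:

services:
  llm:
    volumes:
      - /opt/Models:/models

Start the container: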

cd Docker/LLM_Inference
docker compose up -d

Option A — curl:

curl http://localhost:8000/v1/models

Option B — wget (curl alternative):

wget -qO- http://localhost:8000/v1/models

If either returns JSON model info, the service is running.


9.1 Technician Prompting via HTTP (Example)

curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}'

Alternative on Windows with PowerShell:

Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}' -ContentType "application/json"
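
On Linux, the reply text can be extracted from the JSON with jq (assuming jq is installed); the .choices[0].message.content path follows the OpenAI-style response shape served by llama.cpp:

curl -s -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "local",
"messages": [{"role": "user", "content": "Explain recursion simply."}]
}' | jq -r '.choices[0].message.content'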

Terminal-only mode is preferred when:

  • Privacy > convenience
  • Minimal UI overhead desired
  • Technician is comfortable with CLI tools
  • Future automation (e.g., n8n) will be layered on top

Technician verifies:

  • Container reachable on localhost:8000
  • Reasoning responses coherent
  • Restart persistence works:
docker compose restart llm
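
On Linux, restart persistence can also be exercised by restarting the Docker daemon itself; with restart: unless-stopped the container should come back on its own:

sudo systemctl restart docker
docker ps   # the llm container should reappear as "Up"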

Client verifies:

  • Expected reasoning output
  • Terminal usage acceptable for workflow

Optional Hardening

  • Block outbound Docker network access via firewall
  • Switch the container to network_mode: none (see the sketch after this list)
  • Manual updates only
  • Offline model storage (baseline behavior)
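
A sketch of the network_mode change. Note that network_mode: none conflicts with published ports, so the "8000:8000" mapping must be removed and the API becomes reachable only from inside the container (e.g., via docker exec):

services:
  llm:
    network_mode: none   # no network stack; remove the ports: mapping, which cannot be published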

Maintenance

  • Update models offline if needed
  • Re-run compose after updates
  • Back up the model directory for versioning (example below)
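
One simple approach on Linux is a dated tar archive of the model directory (paths per this SOP; adjust as needed):

tar czf "models-backup-$(date +%F).tar.gz" -C /opt Models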

Known Limitations

  • AMD GPU support is not guaranteed and may fail
  • CPU fallback is slower but functional
  • No UI layer is included in this SOP

Document Control

  • Version: 1.01.26
  • Editor: Elijah B
  • Next Review: Within 90 Days