
SOP: Local LLM Container (Llama-3.1-14B) — Terminal Only Deployment

Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux


1. Purpose

To deploy a private, offline-capable Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with terminal-only interaction.
This configuration prioritizes minimalism, privacy, and reduced attack surface, without a graphical UI layer.


2. Scope

This SOP applies to:

  • Local workstation AI usage
  • Internal note processing, reasoning, and decision assistance
  • Environments where UI complexity is unnecessary or undesired

Not included:

  • Desktop GUI tools (e.g., Goose)
  • Time-based automation (see SOP #4)
  • Cloud AI or remote deployment


3. Responsibilities

Technician Responsibilities

  • Deploy and verify container operation
  • Validate terminal interaction with the LLM API
  • Communicate hardware compatibility and performance constraints

Client Responsibilities

  • Provide hardware, OS environment, and approval for usage data
  • Confirm expectations for privacy, offline capability, and speed

(Optional) IT/Compliance Responsibilities

  • Validate offline usage policies if required


4. Requirements

4.1 Minimum Hardware

  • CPU: 8 cores
  • RAM: 16 GB
  • Disk: 20 GB free
  • GPU: Optional (CPU fallback supported)

4.2 Recommended Hardware

  • CPU: 12+ cores
  • RAM: 32–64 GB
  • GPU: NVIDIA RTX 3090 or better
  • NVMe or SSD storage for models

4.3 GPU Practical Notes (Realistic)

  • NVIDIA strongly preferred for local inference (CUDA support, ecosystem maturity)
  • AMD GPUs may not work for this use case (ROCm/HIP/Vulkan support is inconsistent; inference may fall back to CPU)
  • CPU-only performance acceptable for low-volume reasoning tasks
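
Where an NVIDIA GPU is available, Compose can pass it through to the container. The snippet below is a sketch, not part of the baseline deployment: it assumes the NVIDIA Container Toolkit is installed on the host, uses llama.cpp's CUDA server image tag (server-cuda), and offloads layers with --n-gpu-layers.

services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda   # CUDA build of the server image
    # volumes, ports, and restart policy as in section 7.3 / 8.3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
      --n-gpu-layers 99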

4.4 Supported OS

Component    Windows 11         Linux
Docker       Docker Desktop     Docker Engine
Compose      docker compose     docker compose / Portainer
GPU Accel    NVIDIA preferred   NVIDIA preferred

5. Model Selection Note

This SOP uses:

Llama-3.1-14B-Instruct-Q4_K_M

Reasoning performance is roughly comparable to high-end GPT-4-class cloud models for general tasks; the model is not specialized for coding.

Hereafter referred to as Llama-3.1-14B (Q4).


6. Directory Structure

The following structure standardizes deployment files:

Docker/
  LLM_Inference/
    docker-compose.yml
Models/              (C:\Models on Windows, /opt/Models on Linux)

7. Procedure — Windows 11

7.1 Install Docker Desktop

Download from: https://www.docker.com/products/docker-desktop/

7.2 Prepare Model Storage

mkdir C:\Models

Place .gguf file in C:\Models.

7.3 Create Compose File

Path: Docker\LLM_Inference\docker-compose.yml

services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - C:\Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped

7.4 Start LLM Container

cd Docker\LLM_Inference
docker compose up -d
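
Before testing, container state and startup logs can be checked with standard Compose commands; docker compose ps should list the container as running, and docker compose logs -f llm streams the startup log (the first model load may take a minute or more; Ctrl+C stops following):

docker compose ps
docker compose logs -f llm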

7.5 Test via Terminal (Windows)

Option A — curl (note: inside PowerShell, curl is an alias for Invoke-WebRequest; use curl.exe for the native client):

curl http://localhost:8000/v1/models

Option B — PowerShell Invoke-RestMethod (curl alternative):

Invoke-RestMethod -Uri "http://localhost:8000/v1/models" -Method Get

If either returns JSON model info, the service is running.


8. Procedure — Linux

8.1 Install Docker Engine + Compose

sudo apt update
sudo apt install -y docker.io docker-compose-plugin

If docker-compose-plugin is not available in the distribution's default repositories, install Compose v2 from Docker's official apt repository (or the docker-compose-v2 package on recent Ubuntu releases).
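
Typical post-install steps on a systemd-based distribution (a sketch; whether the non-root group change is appropriate depends on the client's security policy):

sudo systemctl enable --now docker      # start Docker now and on every boot
sudo usermod -aG docker "$USER"         # optional: run docker without sudo (log out and back in)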

8.2 Prepare Model Storage

sudo mkdir -p /opt/Models

Download .gguf into /opt/Models.

8.3 Compose File

Same as Windows, but mount /opt/Models instead of C:\Models (see the minimal variant below).
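
A minimal Linux variant, assuming the same image and flags as section 7.3:

services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - /opt/Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped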

8.4 Deploy Container

cd Docker/LLM_Inference
docker compose up -d

8.5 Test via Terminal (Linux)

Option A — curl:

curl http://localhost:8000/v1/models

Option B — wget (curl alternative):

wget -qO- http://localhost:8000/v1/models

If either returns JSON model info, the service is running.


9. Usage Notes (Terminal-Only Workflow)

9.1 Technician Prompting via HTTP (Example)

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Explain recursion simply."}]
  }'

Alternative on Windows with PowerShell:

Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body '{
  "model": "local",
  "messages": [{"role": "user", "content": "Explain recursion simply."}]
}' -ContentType "application/json"
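
For an interactive session, a small shell loop can wrap the same endpoint. This is a minimal single-turn sketch (no conversation history is kept) and assumes bash, curl, and jq are installed:

#!/usr/bin/env bash
# Minimal terminal chat loop against the local server (single-turn, no history)
while true; do
  read -r -p "You: " PROMPT || break        # Ctrl+D exits
  [ -z "$PROMPT" ] && continue
  # Build the request body with jq, post it, and print only the assistant reply
  jq -n --arg p "$PROMPT" '{model: "local", messages: [{role: "user", content: $p}]}' |
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" -d @- |
    jq -r '.choices[0].message.content'
done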

9.2 Who Chooses This Option (Context)

Terminal-only mode is preferred when:

  • Privacy > convenience
  • Minimal UI overhead is desired
  • The technician is comfortable with CLI tools
  • Future automation (e.g., n8n) will be layered on top


10. Validation / Verification

Technician verifies:

  • Container reachable on localhost:8000
  • Reasoning responses are coherent
  • Restart persistence works:

docker compose restart llm
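
After a restart, a quick re-check confirms the API comes back on its own (bash sketch; the retry count and interval are arbitrary):

for i in $(seq 1 30); do
  curl -sf http://localhost:8000/v1/models >/dev/null && echo "LLM API is back up" && break
  sleep 2
done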

Client verifies:

  • Expected reasoning output
  • Terminal usage acceptable for the workflow


11. Optional Lockdown (High Privacy)

  • Block outbound Docker network via firewall
  • Switch the container to network_mode: none (see the sketch after this list)
  • Manual updates only
  • Offline model storage (baseline behavior)
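
A sketch of the network_mode: none change referenced above. Note that disabling networking also disables the published port, so the HTTP API is no longer reachable from the host; the commented loopback-only binding is an assumed, less restrictive compromise and is not part of the original lockdown list:

services:
  llm:
    # image, volumes, command, and restart as in section 7.3 / 8.3
    network_mode: none          # all container networking disabled; remove the ports: mapping
    # Less restrictive alternative: keep networking but publish only on loopback
    # ports:
    #   - "127.0.0.1:8000:8000"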

12. Maintenance

  • Update models offline if needed
  • Re-run compose after updates
  • Backup model directory for versioning
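
A simple way to snapshot the model directory before changes (assumes the Linux /opt/Models path from section 8.2; adjust for C:\Models on Windows):

sudo tar -czf /opt/Models-backup-$(date +%Y%m%d).tar.gz -C /opt Models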

13. Notes / Warnings

  • AMD support not guaranteed and may fail
  • CPU fallback slower but functional
  • No UI layer included in this SOP

14. Revision Control

  • Version: 1.01.26
  • Editor: Elijah B
  • Next Review: Within 90 Days