SOP: Local LLM Container (Llama-3.1-14B) with Goose UI on Host

Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux


1. Purpose

To deploy a private, offline-capable local Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with Goose installed on the host as the user-facing interface.


2. Scope

This SOP applies to private workstation deployments where:

  • No cloud dependency is desired
  • Reasoning-oriented local inference is needed
  • A graphical or desktop UI is preferred

Not included:

  • Cloud AI services
  • Remote multi-user inference
  • Regulatory compliance configurations
  • Air-gapped deployments (see Optional Lockdown)


3. Responsibilities

Technician Responsibilities

  • Deploy and maintain local model container
  • Validate Goose → LLM connectivity
  • Confirm performance expectations with client
  • Communicate hardware limitations and privacy constraints

Client Responsibilities

  • Provide hardware + OS environment
  • Approve intended use cases and privacy sensitivity
  • Accept performance limitations based on hardware selection

(Optional) IT/Compliance Responsibilities

  • Approve local-only AI usage policies if applicable
  • Validate network and storage isolation per organization policy


4. Requirements

4.1 Minimum Hardware

  • CPU: 8 cores
  • RAM: 16 GB
  • Disk: 20 GB free
  • GPU: Optional (CPU fallback supported)

4.2 Recommended Hardware

  • CPU: 12+ cores
  • RAM: 32–64 GB
  • GPU: NVIDIA RTX 3090 or better
  • SSD/NVMe for model storage

4.3 GPU Practical Notes (NVIDIA vs AMD)

  • NVIDIA is strongly preferred for llama.cpp inference due to the maturity of the CUDA ecosystem (a quick GPU visibility check follows this list)
  • AMD may not work for this use case unless the ROCm/HIP/Vulkan toolchain builds and runs correctly; compatibility varies by model, quantization, driver, and distro
  • AMD setups may fall back to CPU or to significantly degraded Vulkan performance
  • CPU-only operation is viable for light workloads but noticeably slower
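
On NVIDIA hosts, a quick pre-check (a sketch; the CUDA image tag is only an example, and the second command assumes the NVIDIA Container Toolkit is installed, which this SOP does not cover):

# Confirm the NVIDIA driver and GPU are visible on the host
nvidia-smi

# Optional: confirm Docker can pass the GPU through to containers
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi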

4.4 Supported OS

Component   | Windows 11       | Linux
Goose UI    | Supported        | Supported
Docker      | Docker Desktop   | Docker Engine
Compose     | docker compose   | docker compose or Portainer
GPU Accel   | NVIDIA preferred | NVIDIA preferred

5. Model Selection Note

Example model used in this SOP:

Llama-3.1-14B-Instruct-Q4_K_M, a quantized instruction-tuned model positioned as roughly comparable to high-end GPT-4-class cloud models for reasoning (not coding), and widely deployed on consumer hardware.

After this section, referred to as Llama-3.1-14B (Q4).


6. Directory Structure (Standardized)

All deployment resources should be organized as follows:

Docker/
  Portainer_Management/
  LLM_Inference/
Models/

  • Docker/Portainer_Management/ = Compose file for Portainer stack
  • Docker/LLM_Inference/ = Compose file for LLM container
  • Models/ = Offline GGUF models stored on host
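
A minimal sketch for creating this layout on Linux (Windows paths use backslashes instead):

# Create the standardized deployment directories
mkdir -p Docker/Portainer_Management Docker/LLM_Inference Models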

7. Procedure — Windows 11

7.1 Install Docker Desktop

Download from: https://www.docker.com/products/docker-desktop/
Enable WSL2 backend when prompted.

7.2 Prepare Model Storage

mkdir C:\Models
Download the .gguf model into C:\Models; the filename must match the one referenced in the Compose file below.
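
One way to fetch the model is the huggingface-cli tool (a sketch; the <publisher>/<model-repo> value is a placeholder, so confirm the actual repository and filename before use):

# Install the Hugging Face CLI, then download the GGUF file into C:\Models
pip install -U "huggingface_hub[cli]"
huggingface-cli download <publisher>/<model-repo> llama-3.1-14b-instruct-q4_k_m.gguf --local-dir C:\Models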

7.3 Create Compose File

Path: Docker\LLM_Inference\docker-compose.yml

services:
  llm:
    # llama.cpp server image; exposes an OpenAI-compatible API under /v1
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      # Host model directory mounted into the container
      - C:\Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped

7.4 Start Container

cd Docker\LLM_Inference
docker compose up -d
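
A quick health check after startup (the /v1/models route is part of the llama.cpp server's OpenAI-compatible API):

# Confirm the container is running
docker compose ps

# Confirm the API answers; the response should list the loaded model
curl http://localhost:8000/v1/models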

7.5 Install Goose on Host

Option A — Winget:

winget install block.goose

Option B — Direct Installer: Download .exe from: https://block.github.io/goose

7.6 Connect Goose to LLM Endpoint

In Goose's provider settings, configure an OpenAI-compatible endpoint:

http://localhost:8000/v1


8. Procedure — Linux

8.1 Install Docker Engine + Compose

sudo apt update
sudo apt install -y docker.io docker-compose-v2
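
Confirm the engine and Compose plugin before proceeding (the usermod step is optional and takes effect after logging out and back in):

# Verify versions
docker --version
docker compose version

# Optional: allow the current user to run docker without sudo
sudo usermod -aG docker $USER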

8.2 Portainer Deployment (Compose Method)

Path: Docker/Portainer_Management/docker-compose.yml

services:
  portainer:
    image: portainer/portainer-ce:latest
    volumes:
      # Docker socket gives Portainer control of the local Docker engine
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    ports:
      - "9443:9443"
    restart: unless-stopped
volumes:
  portainer_data:

Deploy:

cd Docker/Portainer_Management
docker compose up -d
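
To confirm Portainer is reachable (the -k flag is needed because Portainer serves a self-signed certificate by default):

# Container should show as running
docker compose ps

# Web UI should answer on HTTPS port 9443
curl -k -I https://localhost:9443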

8.3 Model Storage

sudo mkdir -p /opt/Models
sudo chown -R "$USER":"$USER" /opt/Models

Download the .gguf model into /opt/Models.

8.4 LLM Compose File

Path: Docker/LLM_Inference/docker-compose.yml
Same as the Windows file, with the volume path adjusted to /opt/Models (see below).
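
For reference, the volume mapping is the only line that changes from the Windows Compose file:

    volumes:
      # Linux host model directory mounted into the container
      - /opt/Models:/models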

8.5 Deploy LLM

cd Docker/LLM_Inference
docker compose up -d

8.6 Install Goose on Host

Install Goose following the official instructions at https://block.github.io/goose, then point the UI to the same local endpoint:

http://localhost:8000/v1


9. Validation / Verification

Technician verifies:

  • LLM responds at /v1/chat/completions (see the curl sketch below)
  • Goose sends prompts and receives responses
  • Restart persistency works: docker compose restart llm
  • No cloud dependency present
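
A minimal end-to-end check of the chat endpoint (the model field is effectively ignored by the llama.cpp server, since only one model is loaded):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-14b-instruct-q4_k_m", "messages": [{"role": "user", "content": "Reply with the word OK."}]}'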

Client verifies:

  • Reasoning responses meet expectations
  • UI is functional and local


10. Troubleshooting (Common)

Problem          | Cause              | Fix
Slow responses   | CPU fallback       | Confirm GPU capability
No connection    | Port issue         | Verify 8000:8000 mapping
AMD not utilized | Expected           | Use CPU or NVIDIA hardware
Goose errors     | Incorrect endpoint | Reconfigure to http://localhost:8000/v1
No model         | Wrong path         | Check .gguf placement

11. Optional Lockdown (High Privacy)

  • Apply Windows/Linux firewall outbound deny for Goose
  • Remove outbound rules for Docker service
  • Disable updates for Goose + model containers
  • Require client approval for workflow changes
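
A sketch of the Linux outbound-deny step using ufw (this host-wide variant is blunter than a per-application rule; loopback traffic, including Goose → localhost:8000, remains allowed by ufw's default loopback rules, and Windows hosts can apply equivalent outbound block rules in Windows Defender Firewall):

# Default-deny all traffic, then enable the firewall; adjust allowances per policy
sudo ufw default deny outgoing
sudo ufw default deny incoming
sudo ufw enable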

12. Maintenance

  • Update models manually (offline)
  • Restart containers after updates
  • Backup Models/ if versioning matters
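
A simple backup sketch for the Linux model directory (the destination path is an example; adjust to the client's backup target):

# Mirror the model directory to a backup location
rsync -av --delete /opt/Models/ /mnt/backup/Models/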

13. Notes / Warnings

  • AMD support not guaranteed; may not function
  • CPU fallback acceptable for light reasoning
  • Offline-first behavior is standard, not optional

14. Revision Control

  • Version: 1.01.26
  • Editor: Elijah B
  • Next Review: Within 90 Days