SOP: Local LLM Container (Llama-3.1-14B) with Goose UI on Host

Document Type: Standard Operating Procedure (SOP)
Version: 1.01.26
Status: Approved for Use
Audience: Technician + Client
Confidentiality: Internal / Client Delivery
Platforms Supported: Windows 11 + Linux


Purpose

To deploy a private, offline-capable local Large Language Model (LLM) container running Llama-3.1-14B (Q4) using Docker Compose, with Goose installed on the host as the user-facing interface.


Scope

This SOP applies to private workstation deployments where:

  • No cloud dependency is desired
  • Reasoning-oriented local inference is needed
  • A graphical or desktop UI is preferred

Not included:

  • Cloud AI services
  • Remote multi-user inference
  • Regulatory compliance configurations
  • Air-gapped deployments (see Optional Lockdown)

Technician Responsibilities

  • Deploy and maintain local model container
  • Validate Goose → LLM connectivity
  • Confirm performance expectations with client
  • Communicate hardware limitations and privacy constraints

Client Responsibilities

  • Provide hardware + OS environment
  • Approve intended use cases and privacy sensitivity
  • Accept performance limitations based on hardware selection

(Optional) IT/Compliance Responsibilities

  • Approve local-only AI usage policies if applicable
  • Validate network and storage isolation per organization policy

Hardware Requirements

Minimum:

  • CPU: 8 cores
  • RAM: 16 GB
  • Disk: 20 GB free
  • GPU: Optional (CPU fallback supported)

Recommended:

  • CPU: 12+ cores
  • RAM: 32–64 GB
  • GPU: NVIDIA RTX 3090 or better
  • SSD/NVMe for model storage

GPU notes:

  • NVIDIA strongly preferred for llama.cpp inference due to CUDA ecosystem maturity
  • AMD may not work for this use case unless the ROCm/HIP/Vulkan toolchain succeeds; compatibility varies by model, quant, driver, and distro
  • AMD may fall back to CPU or significantly degraded Vulkan performance
  • CPU-only operation is viable for light workloads but slower
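Before committing to a hardware tier, it helps to confirm the GPU is actually visible to Docker. A minimal sketch, assuming the NVIDIA driver and the NVIDIA Container Toolkit (or Docker Desktop's WSL2 GPU support) are already installed; the CUDA image tag is only an example:

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If both commands print the GPU table, GPU-accelerated inference is available; otherwise plan for CPU fallback.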
Platform support:

Component  | Windows 11       | Linux
-----------|------------------|-----------------------------
Goose UI   | Supported        | Supported
Docker     | Docker Desktop   | Docker Engine
Compose    | docker compose   | docker compose or Portainer
GPU Accel  | NVIDIA Preferred | NVIDIA Preferred

Example model used in this SOP:

Llama-3.1-14B-Instruct-Q4_K_M: a 4-bit (Q4_K_M) quantization widely deployed on consumer hardware, roughly comparable to high-end GPT-4-class cloud models for reasoning (though not for coding).

Referred to below as Llama-3.1-14B (Q4).


Directory Structure

All deployment resources should be organized as follows:

Docker/
  Portainer_Management/
  LLM_Inference/
Models/
  • Docker/Portainer_Management/ = Compose file for Portainer stack
  • Docker/LLM_Inference/ = Compose file for LLM container
  • Models/ = Offline GGUF models stored on host
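On Linux the layout can be created in one command (adjust paths for Windows):

mkdir -p Docker/Portainer_Management Docker/LLM_Inference Models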

Windows 11 Deployment

Install Docker Desktop from https://www.docker.com/products/docker-desktop/ and enable the WSL2 backend when prompted.

Create a folder for models:

mkdir C:\Models

Download a .gguf model into C:\Models.
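If the model is pulled from Hugging Face, the download can be scripted. The repository and file names below are placeholders for whichever Q4_K_M GGUF build the client has approved:

curl.exe -L -o C:\Models\llama-3.1-14b-instruct-q4_k_m.gguf "https://huggingface.co/<publisher>/<repo>/resolve/main/<file>.gguf"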

Create the LLM compose file.

Path: Docker\LLM_Inference\docker-compose.yml

services:
  llm:
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - C:\Models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/llama-3.1-14b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped
Start the container:

cd Docker\LLM_Inference
docker compose up -d
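Once the container is up, spot-check the API from the host; the llama.cpp server exposes an OpenAI-compatible /v1/models endpoint, and the logs show model-load progress:

curl.exe http://localhost:8000/v1/models
docker compose logs -f llm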

Install Goose on the host.

Option A — Winget:

winget install block.goose

Option B — Direct Installer: Download .exe from: https://block.github.io/goose

Set endpoint:

http://localhost:8000/v1
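If Goose cannot connect, test the endpoint directly from cmd.exe first. The llama.cpp server accepts OpenAI-style chat requests, and since only one model is loaded the model value below is arbitrary:

curl.exe http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"local\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hello.\"}]}"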

Linux Deployment

Install Docker Engine and the Compose plugin (docker-compose-plugin comes from Docker's apt repository; if it is not available, install Compose v2 per your distribution's documentation):

sudo apt install docker.io docker-compose-plugin -y
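After installation, enable the Docker service and, optionally, let the current user run Docker without sudo (takes effect after logging out and back in):

sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"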

Create the Portainer management stack.

Path: Docker/Portainer_Management/docker-compose.yml

services:
  portainer:
    image: portainer/portainer-ce
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    ports:
      - "9443:9443"

volumes:
  portainer_data:

Deploy:

cd Docker/Portainer_Management
docker compose up -d
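Confirm Portainer is reachable before continuing; it serves HTTPS with a self-signed certificate, so a curl probe needs -k:

docker compose ps
curl -k -s -o /dev/null -w "%{http_code}\n" https://localhost:9443

A 200 response means the UI is up at https://localhost:9443.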
Create the models directory and download a .gguf model into it:

sudo mkdir -p /opt/Models

Path: Docker/LLM_Inference/docker-compose.yml
Same as the Windows file, with the volume path adjusted to /opt/Models (see the snippet below).
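For reference, the only line that differs from the Windows file is the host side of the volume mapping:

    volumes:
      - /opt/Models:/models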

cd Docker/LLM_Inference
docker compose up -d

Install Goose following the official instructions, then point the UI to:

http://localhost:8000/v1

Validation

Technician verifies:

  • LLM responds at /v1/chat/completions
  • Goose sends prompts and receives responses
  • Restart persistence works:

docker compose restart llm

  • No cloud dependency is present

Client verifies:

  • Reasoning responses meet expectations
  • UI is functional and local
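A scripted spot-check covering the technician items above (Linux paths assumed; the model value is arbitrary because only one model is loaded):

cd Docker/LLM_Inference
docker compose restart llm
sleep 15
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Reply with the word OK."}]}'

A well-formed JSON response after the restart demonstrates both connectivity and restart persistence.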

Troubleshooting

Problem          | Cause              | Fix
-----------------|--------------------|------------------------------
Slow responses   | CPU fallback       | Confirm GPU capability
No connection    | Port issue         | Verify the 8000:8000 mapping
AMD not utilized | Expected           | Use CPU or NVIDIA hardware
Goose errors     | Incorrect endpoint | Reconfigure to localhost
No model         | Wrong path         | Check .gguf placement
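When working through this table, a few commands surface the usual causes quickly (run from Docker/LLM_Inference):

docker compose logs llm     # model-load errors, wrong .gguf path
docker compose ps           # container state and port mapping
ss -ltnp | grep 8000        # confirm a listener on port 8000 (Linux)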

Optional Lockdown

  • Apply a Windows/Linux firewall outbound-deny rule for Goose (example rule below)
  • Remove outbound allow rules for the Docker service
  • Disable updates for Goose and the model containers
  • Require client approval for workflow changes
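On Windows, the outbound-deny rule for Goose can be created in PowerShell; the executable path below is a placeholder and must be adjusted to the actual install location:

New-NetFirewallRule -DisplayName "Deny Goose Outbound" -Direction Outbound -Program "C:\Users\<user>\AppData\Local\Goose\goose.exe" -Action Block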

Maintenance

  • Update models manually (offline)
  • Restart containers after updates
  • Back up Models/ if versioning matters (see the example below)
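A dated archive of the model directory is usually sufficient as a backup (Linux path shown; adapt for C:\Models on Windows):

tar -czf "models-backup-$(date +%F).tar.gz" -C /opt Models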

Known Limitations

  • AMD support is not guaranteed and may not function
  • CPU fallback is acceptable for light reasoning workloads
  • Offline-first behavior is standard, not optional

Document Control

  • Version: 1.01.26
  • Editor: Elijah B
  • Next Review: Within 90 Days