# SOP: LM Studio Local Runner with Llama-3.1-14B (Q4)

**Document Type:** Standard Operating Procedure (SOP)
**Version:** 1.01.26
**Status:** Approved for Use
**Audience:** Technician + Client
**Confidentiality:** Internal / Client Delivery
**Platforms Supported:** Windows 11 + Linux
## 1. Purpose
To deploy Llama-3.1-14B (Q4) using LM Studio as a local model runner with an integrated chat interface, providing a “local ChatGPT-style” experience without Docker complexity.
## 2. Scope

Applies when:

- A desktop UI is desired but containerization is not required.
- Single-user local inference is sufficient.
- The client prefers the simplest way to “just talk to the model” on their machine.

Not included:

- Docker-based isolation.
- Agent workflows (file I/O, automation).
- Server exposure beyond localhost (the optional API server remains local-only).
## 3. Responsibilities

**Technician Responsibilities**

- Install LM Studio.
- Configure and test Llama-3.1-14B (Q4).
- Explain performance and hardware limitations to the client.

**Client Responsibilities**

- Provide suitable hardware and OS.
- Approve use on sensitive data only if acceptable.
- Understand that this is a single-user, local runner setup.

**(Optional) IT/Compliance Responsibilities**

- Approve installation if used in managed environments.
- Review privacy expectations for sensitive content.
## 4. Requirements

### 4.1 Minimum Hardware
- CPU: 8 cores
- RAM: 16 GB
- Disk: 20 GB free
- GPU: Optional (CPU-only possible but slower)
### 4.2 Recommended Hardware
- CPU: 12+ cores
- RAM: 32–64 GB
- GPU: NVIDIA RTX 3090 or better
- NVMe/SSD storage
### 4.3 GPU Practical Notes
- NVIDIA strongly preferred; LM Studio supports GPU acceleration best with CUDA-capable GPUs.
- AMD cards may not accelerate models reliably; support depends on drivers and LM Studio’s backend, and CPU fallback may be used instead.
- CPU-only operation is usable for light workloads but noticeably slower on long prompts.
### 4.4 Supported OS
| Component | Windows 11 | Linux |
|---|---|---|
| LM Studio App | Supported | Supported (via AppImage / .deb) |
| GPU Accel | NVIDIA Preferred | NVIDIA Preferred |
## 5. Model Selection Note

Example model: Llama-3.1-14B-Instruct-Q4_K_M. It offers roughly GPT-4-class general reasoning (it is not coding-specialized) and is widely run on consumer hardware. Throughout this SOP it is referred to as Llama-3.1-14B (Q4).
## 6. Procedure — Windows 11

### 6.1 Install LM Studio

- Download the installer from: https://lmstudio.ai
- Run the `.exe` and follow the installation prompts.
### 6.2 Launch LM Studio and Download Model
- Open LM Studio.
- Go to Models tab.
- Search for Llama-3.1-14B-Instruct-Q4_K_M (or nearest equivalent).
- Click Download.
### 6.3 Configure GPU / Performance (If Available)
- Open Settings → Performance.
- Ensure GPU offload is enabled (for NVIDIA).
- Optionally set thread count based on CPU core count.
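Before adjusting the thread setting, it helps to know the machine's logical core count. Quick checks (PowerShell on Windows, any shell on Linux):

```
# Windows (PowerShell): number of logical processors
echo $env:NUMBER_OF_PROCESSORS

# Linux: number of logical processors
nproc
```

A common starting point is a thread count at or slightly below the physical core count, adjusted from there based on observed throughput.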
### 6.4 Use Integrated Chat UI
- Open Chat tab.
- Select Llama-3.1-14B (Q4) as the active model.
- Start chatting (e.g., “Explain topic X…”).
This mode is entirely local by default once the model is downloaded.
### 6.5 (Optional) Start Local API Server
If developers/tools need an endpoint:
- In LM Studio, go to Server / API tab.
- Choose model Llama-3.1-14B (Q4).
- Click Start Server and note the URL (e.g., http://127.0.0.1:1234/v1).
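Once the server is running, a quick smoke test from a terminal confirms the endpoint responds. A minimal sketch using curl, assuming the default port 1234; the model id in the second request is an assumption, so substitute whatever /v1/models actually reports:

```
# List the models the server exposes; note the exact "id" value
curl http://127.0.0.1:1234/v1/models

# Minimal chat completion against the OpenAI-compatible endpoint
# (the "model" value is an assumption; use the id returned by /v1/models)
curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-14b-instruct-q4_k_m", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```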
## 7. Procedure — Linux

### 7.1 Install LM Studio

- Download the AppImage or `.deb` from: https://lmstudio.ai
- AppImage example:

      chmod +x LM_Studio*.AppImage
      ./LM_Studio*.AppImage
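If the `.deb` package is used instead of the AppImage, apt can install it directly and resolve dependencies. A sketch, assuming the downloaded filename matches the pattern below:

```
# Install the downloaded package; apt pulls in any missing dependencies
sudo apt install ./LM_Studio*.deb
```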
### 7.2 Download Model
Same as Windows: use the UI to search for and download Llama-3.1-14B (Q4).
### 7.3 Configure GPU / Performance
- In LM Studio settings, enable GPU acceleration if NVIDIA GPU and drivers are installed.
- For AMD, acceleration may not work reliably; default to CPU if issues occur.
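Before enabling GPU offload, confirm the NVIDIA driver actually sees the card:

```
# Prints driver version, GPU model, and VRAM usage;
# if this fails, fix the driver install before enabling GPU offload
nvidia-smi
```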
### 7.4 Use Integrated Chat UI
Use the Chat tab as on Windows.
### 7.5 Optional Local API Server
Start the server and note the URL (e.g., http://127.0.0.1:1234/v1) if needed. The same curl checks shown in section 6.5 apply here.
## 8. Who and Why This Option Fits
This setup is ideal for:

- Clients wanting a simple local ChatGPT-like app.
- Technicians who don’t need Docker or agents.
- Use cases focused on reasoning, drafting, and Q&A, not heavy automation.

It is not ideal when:

- Multi-container automation (n8n, agents) is required.
- Strong container-level isolation is mandated.
## 9. Validation / Verification

Technician verifies:

- Chat responses are coherent and timely.
- GPU is used if expected (check LM Studio indicators or logs).
- API server (if enabled) responds at /v1/models (see the check below).

Client verifies:

- Model answers are acceptable for their use.
- Performance is adequate on their workloads.
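A minimal check for the API item, assuming the server is running on the default port:

```
# Should return JSON listing the downloaded model's id
curl http://127.0.0.1:1234/v1/models
```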
## 10. Troubleshooting
| Issue | Likely Cause | Resolution |
|---|---|---|
| Slow responses | CPU-only inference | Check GPU config / consider lighter model |
| App crashes | Insufficient RAM/VRAM | Reduce context size or switch to smaller quant |
| AMD GPU unused | Expected limitation | Use CPU or NVIDIA GPU |
| Cannot start server | Port conflict | Change port in LM Studio settings |
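For the port-conflict row, identify what is holding the port before changing it. Quick checks, assuming the default port 1234:

```
# Linux: show the listener (and owning process) on port 1234
ss -ltnp | grep 1234

# Windows: find the owning PID, then look it up in Task Manager
netstat -ano | findstr :1234
```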
## 11. Optional Lockdown (Higher Privacy)

For clients with stronger privacy requirements:

- Use the OS firewall to block LM Studio outbound connections (see the example below).
- Avoid enabling any cloud providers in LM Studio.
- Store model files on local disk only, not in sync folders.
- Disable auto-update if the organization prefers fixed versions.
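For the firewall item, one concrete option on Windows is to block outbound traffic by executable path using the built-in netsh firewall. This is a sketch only; the program path is an assumption and must be adjusted to the actual install location:

```
:: Run from an elevated prompt. The program path is an assumption; verify before use.
netsh advfirewall firewall add rule name="Block LM Studio Outbound" dir=out action=block program="%LOCALAPPDATA%\Programs\LM Studio\LM Studio.exe"
```

On Linux, equivalent per-application outbound blocking depends on the firewall in use and is harder to scope per binary; consult the distribution's firewall documentation.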
## 12. Maintenance
- Periodically review model version and LM Studio updates.
- Back up configuration files if needed (see the sketch below).
- Re-verify performance after any major update.
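A simple backup sketch for the Linux case. The data directory shown is an assumption (the location varies by LM Studio version), so confirm the actual models/config paths in the app's settings first:

```
# Archive LM Studio's data directory
# (~/.lmstudio is an assumption; verify the real path in the app settings)
tar -czf "lmstudio-backup-$(date +%F).tar.gz" -C "$HOME" .lmstudio
```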
## 13. Notes / Warnings
- This is the simplest of the local setups in these SOPs; it is not containerized.
- AMD GPU acceleration is not guaranteed and may not function.
- For file automation or scheduling, see the other SOPs.
## 14. Revision Control
- Version: 1.01.26
- Editor: Elijah B
- Next Review: Within 90 Days