Blog Posts

Graphify + MemMachine: 79× Token Reduction, Zero Vector Database
I help maintain MemMachine, an open-source long-term memory layer for AI agents. It’s a real codebase: 442 source files, 171 docs, a graph database, a SQL store, an MCP server, a REST API, a Python SDK, and integrations with eight different agent frameworks. When a new contributor asks “where does episodic memory actually get written?”, grep, the tool of choice for many AI coding assistants, doesn’t cut it. The answer threads through five files in three folders, plus a docker-compose service definition and a Helm chart. For each question you ask, the assistant has to search all of those files, use the LLM to semantically understand both the question and the files, and then piece together an answer. That burns a lot of tokens and consumes much of the context window.
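To make the contrast concrete, here is a minimal sketch of the kind of lookup a code graph enables, assuming a Neo4j graph of Function nodes connected by CALLS edges; the labels, property names, and the add_episodic_memory function are hypothetical illustrations, not Graphify’s actual schema:

```python
# Hypothetical schema: Function nodes with `name` and `file` properties,
# linked by CALLS relationships. One query replaces a repo-wide grep.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH path = (api:Function {name: 'add_episodic_memory'})
             -[:CALLS*1..5]->(sink:Function)
WHERE sink.name CONTAINS 'write'
RETURN [n IN nodes(path) | n.file + ':' + n.name] AS call_chain
"""

with driver.session() as session:
    for record in session.run(QUERY):
        # Prints the chain of files/functions the write threads through,
        # without feeding any of those files to the LLM.
        print(" -> ".join(record["call_chain"]))
```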
Read More
Is Thinking Mode Affecting Your Agentic Workflows?
I jumped on the trend of running local LLMs and agents and was having a lot of fun until my agents kept failing, timing out, or just stopping for no obvious reason. I tried PaperClip + ZeroClaw, PaperClip + Hermes-Agent, and Hermes-Agent + Hermes-Workspace with Qwen 3.6 and Gemma 4 models (various sizes and quantization levels). All of them failed the same way at some point in the workflow, with almost nothing in the logs to indicate what was happening. Some tasks completed without any problem, but most did not. After many hours of debugging and many forum threads, I finally traced it to a model-serving configuration trap that catches many people the first time they self-host a reasoning model.
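If you are hitting the same wall, here is a minimal reproduction sketch against a local OpenAI-compatible server; the port, model name, and prompt are assumptions from my setup:

```python
from openai import OpenAI

# Any OpenAI-compatible local server (vLLM, llama.cpp, etc.)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6",
    messages=[{"role": "user", "content": "Plan the next step of the task."}],
    # On a reasoning model, thinking tokens count against this budget.
    # A limit sized for the final answer can be spent entirely on the
    # hidden reasoning, so the agent receives an empty or truncated reply.
    max_tokens=256,
)

choice = resp.choices[0]
print("finish_reason:", choice.finish_reason)    # often "length" in this trap
print("content:", repr(choice.message.content))  # may be empty
```

Empty content with a finish_reason of "length" is the telltale: the token budget ran out while the model was still thinking.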
Read More
How To Run ZeroClaw in Docker with local LLMs (Qwen3 on an NVIDIA DGX Spark)
ZeroClaw is an open-source agent runtime. By default it expects an API key for a frontier model such as Claude or OpenAI. This guide shows how to use a local Qwen3.6 model served by vLLM on an NVIDIA DGX Spark, routed through LiteLLM, with ZeroClaw and Firecrawl running in Docker on a separate host.
It also documents the onboarding bug I hit on a fresh install in v0.7.4 — ZeroClaw issue #6123 — and the config-only workaround.
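Before wiring ZeroClaw in, it helps to smoke-test the routing chain on its own. A minimal sketch, assuming LiteLLM’s default proxy port and a qwen3.6 model alias; your hostnames and key will differ:

```python
from openai import OpenAI

# client -> LiteLLM proxy (port 4000 by default) -> vLLM on the DGX Spark
client = OpenAI(
    base_url="http://litellm-host:4000/v1",
    api_key="sk-local",  # whatever master key the proxy is configured with
)

resp = client.chat.completions.create(
    model="qwen3.6",  # the alias the proxy maps to the vLLM backend
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(resp.choices[0].message.content)
```

If this round-trips, ZeroClaw only needs the same base URL, key, and model name.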
Read More
Run Free LLMs at Scale: LiteLLM Gateway with Groq, NVIDIA NIM, OpenRouter, and Local vLLM
Introduction
Running large language models is increasingly affordable — but “affordable” rarely means “free, all the time, for every request.” Cloud providers each come with their own rate limits, daily quotas, and occasional model deprecations. Local hardware is fast and private, but not always available (DGX Spark powered down, model being updated, VRAM needed elsewhere). Somewhere between “I have an API key” and “my agents work reliably at scale” is a configuration problem that most guides skip over entirely.
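The shape of the answer, sketched here with LiteLLM’s Python Router rather than the proxy config: one model alias backed by several deployments, so a rate-limited or offline backend gets cooled down and another takes the request. The model names, keys, and vLLM address below are placeholders:

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "default",  # one alias, multiple deployments
            "litellm_params": {
                "model": "groq/llama-3.3-70b-versatile",  # free hosted tier
                "api_key": "gsk-...",
            },
        },
        {
            "model_name": "default",
            "litellm_params": {
                "model": "openai/qwen3.6",              # local vLLM, OpenAI-compatible
                "api_base": "http://dgx-spark:8000/v1",
                "api_key": "not-needed",
            },
        },
    ],
    num_retries=2,  # retry a deployment before cooling it down and moving on
)

resp = router.completion(
    model="default",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```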
Read More
vLLM Recipe: RedHatAI/Qwen3.6-35B-A3B-NVFP4 on DGX Spark
This is a vLLM Recipe: a production-ready Docker Compose configuration for running open-weight models on local hardware. It documents the exact setup, the configuration rationale, and benchmark results so you can get a model running quickly. You are welcome to change the parameters to suit your workloads. This worked for me, so I hope you find it helpful.
This recipe covers Qwen3.6-35B-A3B-NVFP4, a Mixture-of-Experts model with 35B total parameters but only ~3B active at inference, quantized to NVFP4 by Red Hat AI and running on the NVIDIA DGX Spark (my GigaByte AI Top Atom) with a GB10 Blackwell GPU and 128 GB of unified CPU/GPU memory.
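Once the container reports healthy, here is a quick sanity check I like to run, assuming the compose file publishes vLLM’s OpenAI-compatible endpoint on port 8000:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# vLLM lists the served model under /v1/models
model_id = client.models.list().data[0].id
print("serving:", model_id)

start = time.time()
resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Write one sentence about GPUs."}],
    max_tokens=128,
)
elapsed = time.time() - start
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```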
Read More
Self-Hosting Firecrawl on Ubuntu 25.04 with Docker Compose
Modern AI agents — Claude Code, Codex, OpenClaw, Hermes-Agent, and custom LangChain pipelines — need a way to read the web. Not raw HTML full of navigation debris, cookie banners, and JavaScript noise, but clean structured text that a language model can actually reason about. Firecrawl is the missing piece: an open-source web scraping and crawling API that fetches any URL and returns clean Markdown, ready to drop straight into a context window or a RAG pipeline.
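To give a feel for the API, here is a minimal scrape against a self-hosted instance; port 3002 is the compose default in my setup, and the response shape may vary across Firecrawl versions:

```python
import requests

resp = requests.post(
    "http://localhost:3002/v1/scrape",
    json={"url": "https://example.com", "formats": ["markdown"]},
    timeout=60,
)
resp.raise_for_status()

# The scraped page comes back as clean Markdown under data.markdown,
# ready to drop into a context window or a RAG pipeline.
data = resp.json()["data"]
print(data["markdown"][:500])
```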
Read More
Building an Agentic Team for an Open Source Project with Claude Code
A core engineer on MemMachine — the one who owned the Semantic Memory subsystem — left the project. The codebase didn’t grow any less complex overnight, but the human attention available to maintain it did. That’s a familiar shape of problem in any open source project, and it’s the exact shape where a well-designed Claude Code agent team earns its keep.
This post documents what I built: a 22-agent maintenance team that lives entirely inside MemMachine’s repository, coordinates via Claude Code’s experimental Agent Teams runtime, and operates under a design I can reproduce for any existing repository with real code. The agents don’t push code, don’t sign commits, don’t merge pull requests, and don’t cut releases — humans still gatekeep every consequential action. What the agents do do is the tedious and error-prone middle of software maintenance: triage, spec drafting, implementation, QA, security review, docs, dependency and upstream tracking.
Read More
Using the API to Find Free Hosted Models on NVIDIA Builder
The NVIDIA Developer Program provides access to a wide catalog of AI models through NVIDIA Inference Microservices (NIM), offering an OpenAI-compatible API. You can browse and discover available models at build.nvidia.com/explore/discover.
If you want to find models with free hosted endpoints in the browser, you can enable the “Free Endpoint” filter on the model catalog page. But what if you need that information programmatically, in a script, a CI pipeline, or as part of an automated workflow? The browser filter is not accessible through the API, and the /v1/models endpoint does not distinguish between free hosted models and everything else.
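For reference, this is the baseline the post starts from, using the OpenAI-compatible endpoint with a Developer Program key; it enumerates the catalog but carries no flag marking which models are free to call:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",  # NVIDIA Developer Program API key
)

# Lists every model ID in the catalog; nothing here distinguishes
# free hosted endpoints from paid or self-hosted-only ones.
for model in client.models.list():
    print(model.id)
```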
Categories
- 3D Printing (7)
- AI (11)
- Books (2)
- Cloud Computing (1)
- Conferences (2)
- CXL (15)
- Data Center (2)
- Development (2)
- Events (2)
- Hardware (1)
- How To (35)
- HowTo (1)
- Linux (31)
- Machine Learning (1)
- OrcaSlicer (2)
- Performance (2)
- Persistent Memory (1)
- PMEM (1)
- Product Manager (1)
- Projects (3)
- Servers (1)
- Storage (1)
- System Administration (2)
- Troubleshooting (4)
- Ubuntu (1)
- Vector Databases (1)
Tags
- 3D Printing
- 3MF
- ACPI
- ACPI-CA
- Acpidump
- Active-Memory
- Agent
- Agent Runtime
- Agent Skills
- Agent Teams
- AI
- AI Agents
- AI Engineering
- AI Infrastructure
- AMD
- API
- Apple Silicon
- Arcade
- Artificial Intelligence
- AST Extraction
- AutoGen
- AWS EC2
- Bash
- Benchmark
- Blackwell
- Blister Pack
- Book
- Boot
- Bootable-Usb
- Build From Source
- Buyer's Guide
- C
- C-2
- Chat Completions
- Chat GPT
- ChatGPT
- Claude Code
- Clflushopt
- Cloud
- CMake
- Code Tunnel
- Code-Server
- Codespaces
- Codex
- Compute Express Link
- Cpu
- Crawling
- CrewAI
- Custom GPT
- Custom-Kernel
- CXL
- CXL 1.0
- CXL 1.1
- CXL 2.0
- CXL 3.0
- CXL Devices
- CXL Specification
- Data Center
- DAX
- Daxctl
- Debugging
- DeepSeek-R1
- Dell
- Development
- Device-Mapper
- DGX Spark
- Dm-Writecache
- Docker
- Docker Compose
- DRAM
- Edge
- Enfabrica
- Esxi
- Fastfetch
- Featured
- Fedora
- Firecrawl
- Firmware
- Free AI Models
- Free LLM API
- Frequency
- FSDAX
- G-Code
- GB10
- Gemma3
- Generative Prompt Engineering
- Git
- GLM-4.7
- Governor
- Gpg
- GPT
- Gpt-3
- Gpt-4
- GPU
- Grafana
- Graph Database
- Graphify
- GraphRAG
- Groq
- H3 Platform
- Hermes-Agent
- Home Lab
- HPE
- Iasl
- Intel
- Ipmctl
- Java
- Kernel
- Knowledge Graph
- Kvm
- LangChain
- LangGraph
- Lenovo
- Linux
- Linux Kernel
- Linux-Volume-Manager
- LiteLLM
- Llama.cpp
- LLM
- LLM Fallback
- LLM Gateway
- Local LLM
- Lvm
- Machine Learning
- MacOS
- Mainline
- MAME
- Max_tokens
- MCP
- MCP Server
- MemMachine
- Memory
- Memory Management
- Memory Mapping
- Memory-Tiering
- Micron
- Microsoft
- ML
- Mmap
- Model Serving
- MoE
- Movdir64b
- MTP
- Mysql
- Napkin Math
- NDCTL
- Neo4j
- Neofetch
- NIM
- NUMA
- Nvdimm
- NVFP4
- NVIDIA
- NVIDIA Builder
- NVIDIA Developer Program
- NVIDIA NIM
- Ollama
- Open Source
- Open Source Maintenance
- Open WebUI
- OpenAI-Compatible
- OpenAI-Compatible API
- OpenClaw
- OpenRouter
- OpenWebUI
- Optane
- OrcaSlicer
- Pagemap
- PCIe
- Percona
- Performance
- Performance Tuning
- Persistent Memory
- Personal Branding
- Physical Address
- Physical Memory
- Pmdk
- PMem
- Powersave
- Procfs
- Product Manager
- Programming
- Prometheus
- Prompt Engineering
- Python
- Qdrant
- QEMU
- Qwen3
- Qwen3.6
- RAG
- Rate Limiting
- Reasoning Models
- RedHatAI
- Remote Development
- Retimers
- Retrieval Augmented Generation
- Rust
- Samsung
- Self-Hosting
- Server
- Servers
- SGLang
- SNC
- Spec-Driven Development
- Speculative Decoding
- SSH
- STREAM Benchmark
- Sub-NUMA Cluster
- Sub-NUMA Clustering
- Subagents
- Supermicro
- Switches
- Sysadmin
- Sysfs
- System Administration
- System Information
- System-Ram
- Technical Documentation
- Terminal
- Thinking Mode
- Tiered-Memory
- Token Reduction
- Travel Moves
- Tree-Sitter
- Tutorial
- Ubuntu
- Ubuntu 22.04
- Ubuntu 25.04
- Uv
- Vector Databases
- Virtual Memory
- VLLM
- Vmware
- Vmware-Esxi
- Vpmem
- VS Code
- Vsphere
- Web Scraping
- Website
- Window
- Windows
- Windows-Server
- Working-Set-Size
- Wss
- Xcode
- ZeroClaw