Compare commits

..

6 commits

| SHA1 | Message | Date |
| --- | --- | --- |
| 27d3f85728 | docs(blog): add turboquant and moe cache post | 2026-04-16 10:00:00 -04:00 |
| 89d37eb5a9 | updated | 2026-03-23 00:56:25 -04:00 |
| d01e601dc8 | docs(blog): add local ai rig post | 2026-03-18 10:00:00 -04:00 |
| d451f64428 | overhaul | 2026-03-04 00:02:00 -05:00 |
| 13f4b571c5 | docs(blog): add local-first ai motivation post | 2026-02-10 10:00:00 -05:00 |
| b24921804a | Remove Twitter link from index.html (removed the Twitter link from the contact list) | 2025-11-16 17:23:15 -05:00 |
3 changed files with 308 additions and 0 deletions

@@ -0,0 +1,85 @@
---
layout: post
title: "Why I'm Moving More AI Work Off the Cloud"
date: 2026-02-10
description: "Cloud AI is useful, but I want more of my day-to-day AI workflow on infrastructure I control."
tags: [ai, local-ai, privacy, security]
---
I have been moving more of my AI workflow onto hardware I control.
That does not mean I am done with cloud models. I still use them, and for some tasks they are clearly the best tool available. The frontier models are fast, capable, and convenient in a way that is hard to argue with.
But convenience is not the only thing I care about.
For a growing amount of my work, especially research, security work, and personal tooling, I want fewer external dependencies. I want to decide what context leaves my machine. I want tools that keep working when an API changes, a rate limit shows up, or a provider decides a workflow no longer fits neatly inside its acceptable-use boundaries.
That has pushed me toward a local-first AI setup: local models when they make sense, local search and retrieval, and a workstation built for experimenting without asking permission from someone else's platform.
## The Problem With Renting Every Thought
Cloud AI has a strange gravity to it. It is easy to start with one hosted model, one API key, and one chat window. Then slowly more of the workflow moves there. Notes, code, research questions, logs, documents, debugging sessions, threat models. The model gets better as it sees more context, so the incentive is always to give it more.
At some point the question changes from "is this useful?" to "how much of my working memory am I comfortable routing through a service I do not control?"
Sometimes the answer is: plenty. If I am asking a general programming question, summarizing public docs, or comparing technologies, the privacy concern is low. The cloud model is just a good tool.
Other times the answer is different. If I am working through security research, private notes, closed-source code, unfinished ideas, internal infrastructure details, or anything that would be awkward to paste into a public forum, I would rather keep the default path local.
This is less about paranoia than posture. I do not want to make a sensitive workflow depend on remembering, every single time, which context is safe to send somewhere else.
## Security Work Is Often Awkwardly Shaped
Security research creates a particular kind of friction with hosted models.
A lot of legitimate defensive work looks suspicious when reduced to a prompt. Understanding exploit chains, malware behavior, persistence mechanisms, credential abuse, phishing infrastructure, evasion techniques, and post-compromise behavior is necessary if you want to defend against those things. It is also dual-use by nature.
Cloud models often handle that ambiguity by refusing broadly. I understand why. Providers are operating at huge scale, they have to make conservative policy decisions, and they do not know who is asking or why.
But from the researcher's side, broad refusal can turn a useful assistant into a wall. The model does not need to help someone cause harm to be useful. It can help explain behavior, compare mitigations, reason through detections, review lab code, or identify what a suspicious artifact is trying to do.
For that kind of work, local models matter. Not because "uncensored" should mean irresponsible, but because security work needs room to discuss uncomfortable systems honestly. Running models locally puts the responsibility where it belongs: on the person operating the tool.
## Privacy Is Also About Drafts
People often talk about privacy as if it only matters for secrets: passwords, keys, customer data, proprietary documents. Those matter, obviously.
But drafts matter too.
Half-formed ideas, personal notes, research trails, failed experiments, and weird debugging paths say a lot about how someone thinks. They are not always sensitive in the legal sense, but they are still private. I want the freedom to explore messy ideas without turning every intermediate thought into data exhaust for a remote service.
Local AI makes that easier. I can point a model at notes, logs, repos, or experiments without first filtering everything through "would I be comfortable uploading this?"
That changes the feel of the tool. It becomes less like a web service and more like part of the machine.
## Local Models Are Not Magic
There are tradeoffs.
Local models are often slower. They can be less capable than the best hosted models. Hardware is expensive, loud, hot, and occasionally annoying. Running the stack yourself means you inherit the boring parts: drivers, model formats, disk space, memory pressure, cooling, updates, broken builds, and tools that almost work.
I do not think local AI replaces cloud AI for everything. That is not the point.
The point is to own the workflows where ownership matters. If a task needs the best reasoning model in the world, I may still use a cloud model. If a task needs privacy, repeatability, looser research constraints, or deeper integration with my local environment, I want a path that does not leave my hardware.
## Tools Matter As Much As Models
Running a model locally is only one piece of the problem.
A model sitting on a workstation still needs useful context. It needs current information, documentation, source material, and a way to inspect the web. Otherwise it becomes a very private but very stale assistant.
That is why I started building more of the surrounding tooling too. Local search. Local retrieval. Local reranking. Agent tools that do not need a paid search API every time they need to answer a grounded question.
The goal is not to rebuild the entire internet in my office. It is to make the common path private and inspectable: the model, the search layer, the retrieval pipeline, and the machine they run on.
## What I Want From This Stack
I want an AI setup that feels boring in the right ways.
I want to ask questions against my own notes without thinking about where the text is going. I want to run models that are useful for security research without fighting a policy layer designed for the most abusive possible user. I want web search that is cheap enough to use freely and transparent enough to debug when the results are bad.
Most of all, I want the stack to be mine. Not because every local component is better than every cloud service, but because control changes what you are willing to build.
When the infrastructure is yours, experiments get easier. Weird ideas get cheaper. Private workflows stay private by default. And when something breaks, you can open the box and see why.
That is the direction I want more of my AI work to move in.

@@ -0,0 +1,111 @@
---
layout: post
title: "Building a Local AI Rig"
date: 2026-03-18
description: "The machine I built for local AI work, why I built it, and why good enough hardware changes how I use these tools."
tags: [ai, local-ai, hardware, rocm]
---
After deciding I wanted more of my AI workflow to run locally, the next question was what kind of machine would make that practical.
Not a benchmark trophy. Not a datacenter in a spare room. Just a workstation that could run useful models every day without making the whole thing feel like a science project.
The machine I ended up with is built around an AMD EPYC 7402P, 256 GB of RAM, a 2 TB NVMe drive, and an AMD Radeon AI PRO R9700 with 32 GB of VRAM. The board gives me real x16 PCIe slots too, which leaves room to add more GPUs later instead of rebuilding the whole machine around a dead-end platform. It runs Linux with ROCm, and ROCm has been good enough for the kind of AI work I care about.
That last part matters because a lot of local AI discussion still assumes NVIDIA is the only serious path. NVIDIA is clearly the easier default in many cases. CUDA support is better, more projects test against it first, and you run into fewer weird edges.
But "best supported" and "only workable" are not the same thing. For my use case, this AMD setup has been usable enough that I can stop thinking about the GPU most of the time and focus on the work.
## Buying Around the Hype
AI hardware pricing is strange right now.
The obvious parts are expensive because everyone wants them. New NVIDIA cards, especially the high-end ones, are priced like the market knows exactly how much demand local AI has created. I did not want to build the whole machine around paying hype-cycle prices for every component.
So the rest of the system is deliberately unglamorous: last-generation server-grade hardware that is still very capable. The EPYC platform gives me cores, memory capacity, PCIe lanes, and room to grow without paying boutique workstation prices. It is not the newest thing, but it is exactly the kind of hardware that becomes interesting once it falls out of the datacenter upgrade cycle.
The GPU is the more deliberate bet. The Radeon AI PRO R9700 is a new card, but it was dramatically cheaper than chasing the top NVIDIA consumer cards. At roughly a fifth of the price of the RTX 5090s I was seeing, the question became whether I believed the AMD software ecosystem would keep improving enough to make the tradeoff worth it.
So far, that bet looks reasonable. ROCm is not as polished as CUDA, but I am getting decent performance out of the card today. For my workloads, that matters more than having the most obvious logo on the box.
## What I Wanted From the Machine
The goal was not to beat hosted frontier models. A local workstation is not going to turn into an infinite cloud API just because it has a large GPU in it.
What I wanted was a default place to run private work.
I wanted to be able to test models without uploading notes, code, or research context to a third party. I wanted a box that could sit on my network and be available whenever I wanted to experiment. I wanted enough memory that large models, local search tools, indexing jobs, and normal development work would not constantly fight each other.
I also wanted the machine to be boring. There is a version of local AI where every session starts with debugging drivers, chasing library versions, or trying to remember which environment variables made something work last week. That gets old fast. The rig needed to be powerful enough to be useful, but stable enough to fade into the background.
## The Daily Models
The models I settled on for day-to-day use are Qwen3.6 27B and Qwen3.6 35B-A3B.
That has been a good balance for me. They are capable enough for coding help, research, summarization, and general reasoning without making every prompt feel painfully slow. They are also large enough that the local setup feels meaningfully different from running a tiny model just to prove that local inference works.
In normal use, I see roughly 25 tokens per second from the 27B model and around 60 tokens per second from the 35B-A3B model. Those are not formal benchmark numbers, but they are the numbers that matter for me: fast enough that I reach for the local models during the day instead of treating them like a novelty.
That is the line I care about: not "can I technically run a model?" but "would I actually choose to use this?"
Plenty of local AI setups clear the first bar and fail the second. A model can be private, cheap per token, and fully under your control, but if it is too slow or too weak you eventually stop reaching for it. For local AI to matter, it has to become part of the normal workflow.
This machine gets close enough to that for me.
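Those throughput figures come from normal use rather than a benchmark harness. If you want to reproduce that kind of number, a rough check against a llama.cpp-style server is enough, assuming it exposes the usual OpenAI-compatible `/v1/chat/completions` endpoint; the URL, model name, and prompt below are placeholders, not my actual configuration.

```python
import time
import requests

# Placeholder endpoint for a local OpenAI-compatible server (e.g. llama.cpp's server).
URL = "http://localhost:8080/v1/chat/completions"

def rough_tokens_per_second(prompt: str, max_tokens: int = 256) -> float:
    payload = {
        "model": "local-model",  # placeholder name; local servers often ignore it
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    start = time.monotonic()
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    elapsed = time.monotonic() - start
    # OpenAI-compatible servers report generated token counts in a usage block.
    generated = resp.json()["usage"]["completion_tokens"]
    return generated / elapsed

print(f"{rough_tokens_per_second('Summarize how KV cache quantization works.'):.1f} tok/s")
```

Note that this lumps prompt processing into the timing, so it slightly understates pure generation speed. For judging day-to-day feel, it is close enough.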
## Why So Much System RAM?
The 256 GB of RAM is not there because every model needs it. Most of the time, the GPU is the part people talk about, and for good reason. VRAM decides a lot about what can run comfortably.
But system RAM gives the machine breathing room.
It lets me keep larger models, caches, indexing jobs, containers, build trees, and other tools around without the machine feeling fragile. It also matters for experiments where not everything fits neatly in VRAM. Local AI is not just one process running one model. It tends to become a small pile of services: inference, search, retrieval, development tools, monitoring, and whatever else I am currently testing.
I did not want the box to be useful only when treated delicately.
## ROCm Has Been Good Enough
ROCm still has rough edges. I would not pretend otherwise. Some projects assume CUDA first. Some instructions are written as if AMD users do not exist. Sometimes support depends on exactly which GPU generation, kernel, library, or build flags are involved.
But for this machine, with this GPU, it has been good enough to do real work.
That is an important distinction. I am not trying to make a universal claim that AMD is the right choice for everyone building a local AI box. If someone wants the smoothest possible path and has the budget, NVIDIA is still the safest answer.
I am saying that the AMD path is workable now in a way that matters. It is not just a curiosity. I can run my daily models, build against the stack, and get decent performance. The machine is useful.
For me, that is the threshold.
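One sanity check worth doing before blaming anything else: confirm the ROCm path is actually the one being used. Assuming a ROCm build of PyTorch, the HIP version is reported directly and the GPU still shows up through the familiar `torch.cuda` namespace (the naming is historical):

```python
import torch

# On a ROCm build of PyTorch, torch.version.hip is set (it is None on CUDA/CPU builds)
# and the GPU is exposed through the usual torch.cuda API.
print("HIP version:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                      # small smoke test to confirm kernels actually run
    torch.cuda.synchronize()
    print("Matmul OK:", tuple(y.shape))
```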
## Owning the Box Changes the Workflow
The biggest difference is psychological.
When inference is local, I use it differently. I am more willing to paste rough notes into it. I am more willing to let it chew on something unpolished. I am more willing to test odd workflows, run long experiments, and point tools at local files.
There is no per-token bill in the back of my mind. There is no question about which provider is storing what. There is no need to route every experiment through a hosted interface that was built for a general audience.
The machine is not free. The hardware cost is real. Power and heat are real. Time spent maintaining the setup is real.
But once it exists, the marginal cost of curiosity gets much lower.
That matters more than I expected.
## What It Still Does Not Solve
Local hardware does not remove all the hard parts.
The best cloud models are still better at many tasks. Long-context work still runs into memory pressure. Some software stacks are fragile. Model quality varies wildly. Quantization choices matter. A bad prompt still gives a bad answer, just privately.
There is also a maintenance burden that hosted tools hide. If something breaks, I own it. Driver updates, ROCm changes, model compatibility, build failures, disk usage, and thermal behavior all become my problem.
That is the trade.
I am comfortable with it because the machine gives me something I cannot get from a subscription: a place to experiment freely, privately, and repeatedly.
## The Point
The local rig is not about rejecting the cloud completely. It is about changing the default.
Cloud models are still part of my toolbox. But for everyday AI work, especially the parts involving private notes, security research, local code, and experiments that benefit from being close to the machine, I want a capable local path.
This setup gives me that.
It is not perfect. It does not need to be. It is fast enough, private enough, and flexible enough that I actually use it. That is the thing that matters.

@@ -0,0 +1,112 @@
---
layout: post
title: "Experimenting With TurboQuant and MoE Caching"
date: 2026-04-16
description: "Some notes from maintaining a TurboQuant llama.cpp fork and testing whether hot MoE experts on the GPU could make a huge local model practical."
tags: [ai, llama-cpp, local-ai, inference]
---
After getting the local AI rig into a usable place, I started poking at the next obvious problem: how far could I push it?
The model I was interested in was Qwen3.5 397B-A17B. It is the kind of model that makes local inference feel ridiculous in both directions. On one hand, the fact that it can run at all on a machine in my house is impressive. On the other hand, "can run" and "is pleasant to use" are very different things.
That led me into two related experiments in my llama.cpp fork: maintaining a TurboQuant branch for long-context inference, and testing a Mixture-of-Experts cache that tried to keep the hot experts on the GPU while leaving the rest of the model in system RAM.
TurboQuant was the clear success. The MoE cache was the useful negative result.
## TurboQuant Was the Win
The TurboQuant side was not some grand original implementation from scratch. It was mostly integration and maintenance work.
There was an existing TurboQuant llama.cpp fork, and my work was mainly about rebasing that onto a newer llama.cpp release so I could use it with the rest of my setup. That kind of work is less glamorous than writing a new algorithm, but it is a big part of making local AI experiments real.
llama.cpp moves quickly. Backends change, build systems change, kernel code changes, model support changes, and a fork that worked a few months ago can become stale fast. Rebasing an inference fork is not just "resolve a conflict and move on." You have to make sure the pieces still mean the same thing after upstream moved underneath them.
I have already fallen behind again and need to redo that rebase at some point. That is the cost of carrying an experimental branch on top of a fast-moving project.
But the result was absolutely worth it.
TurboQuant attacks one of the most annoying limits in local inference: the KV cache. Long context is useful, but it is not free. Every token leaves memory behind, and at large context sizes that memory becomes a serious part of whether a model is usable at all.
The Google paper behind TurboQuant, [TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate](https://huggingface.co/papers/2504.19874), argues that KV cache quantization can get extremely aggressive while staying effectively quality neutral at practical bit widths. That lines up with what I saw in practice.
With TurboQuant in my llama.cpp fork, KV cache size dropped from roughly 12 GB to about 1.2 GB while still retaining the full model's 262k token context. In normal use it felt almost entirely lossless.
That is not a small improvement. That changes what kind of long-context work is realistic on local hardware.
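The ratio itself is easy to sanity-check, because KV cache size scales linearly with bits per stored element. The sketch below uses placeholder model dimensions chosen only to echo the sizes above; the real configuration may differ, but the 16-bit versus roughly 1.6-bit comparison is what produces a 10x drop.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_tokens: int, bits_per_elem: float) -> float:
    """Rough KV cache size: keys and values for every layer, head, and position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens  # 2 = K and V
    return elems * bits_per_elem / 8 / (1024 ** 3)

# Placeholder dimensions for illustration only.
layers, kv_heads, head_dim, ctx = 48, 2, 128, 262_144

print(f"fp16 KV cache:     {kv_cache_gib(layers, kv_heads, head_dim, ctx, 16):.1f} GiB")
print(f"~1.6-bit KV cache: {kv_cache_gib(layers, kv_heads, head_dim, ctx, 1.6):.1f} GiB")
```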
I would like to see llama.cpp incorporate this upstream. For now, I maintain the fork because the difference is too useful to give up.
## The MoE Cache Idea Was Different
The more uncertain experiment was the MoE cache.
Mixture-of-Experts models are strange from a systems point of view. The full model can be enormous, but each token only activates part of it. That creates an obvious temptation: if the model is too large to keep entirely on the GPU, maybe you can keep the most-used experts there and leave the rest in CPU memory.
That was the idea I wanted to test with Qwen3.5 397B-A17B.
The rough plan was:
- load the full model into system RAM
- track which experts were getting used
- keep the hot experts resident on the GPU
- fall back to CPU memory for the rest
- see whether the cache hit rate was high enough to beat the simple approach
In theory, that sounds promising. In practice, the machine still has to move data across the system in exactly the wrong places. The cache can help only if the experts it keeps on the GPU are reused often enough to make up for the cost of managing the cache and moving missed experts around.
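To make the bookkeeping concrete, here is a toy version of that policy in Python. The fork itself lives in llama.cpp's C++, so this is the shape of the idea rather than the implementation: a fixed budget of GPU-resident experts, LRU eviction on a miss, and a hit-rate counter, which is the number the whole bet hinges on.

```python
from collections import OrderedDict

class HotExpertCache:
    """Toy model of a GPU-resident expert cache (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity       # how many experts fit in VRAM
        self.resident = OrderedDict()  # expert_id -> True, kept in LRU order
        self.hits = 0
        self.misses = 0

    def access(self, expert_id: int) -> bool:
        """Return True on a GPU hit, False when the expert takes the CPU path."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # refresh LRU position
            self.hits += 1
            return True
        self.misses += 1
        # A miss is where the real cost lives: either copy the expert to VRAM
        # (evicting the least recently used one) or run it from system RAM.
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[expert_id] = True
        return False

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```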
That was the real question: not whether the idea could be implemented, but whether the hardware balance made it worth doing.
## It Worked, But It Was Slower
The cache worked.
That is worth saying clearly. The experiment was not a failure in the sense of "this could not run." The model loaded. The expert cache did what it was supposed to do. The system could keep hot experts on the GPU and run the rest out of RAM.
But the performance was not good enough.
With the MoE cache enabled, I was seeing around 8 tokens per second. With the model fully loaded in RAM and no MoE cache, I was seeing closer to 10 tokens per second.
That is not the result I wanted, but it is the result that matters.
The simpler approach was faster. Not by an enormous amount, but enough that the extra complexity was hard to justify. If a cache makes the system more complicated and still loses to the baseline, the right answer is not to pretend the cache won. The right answer is to ask why.
## The Bottleneck Was the System
This is where local AI gets interesting to me.
A lot of model discussion focuses on the model itself: parameter count, quantization, context length, benchmark scores. Those things matter, but at this scale the system around the model matters just as much.
The MoE cache was betting that GPU residency for hot experts would beat the cost of pulling everything through CPU memory. On my hardware, that bet did not pay off. The transfer costs, cache management, and actual expert access pattern did not line up well enough.
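A back-of-the-envelope version of that tradeoff, with numbers that are assumed for illustration rather than measured on my machine: at the 10 tokens-per-second baseline, a token has roughly a 100 ms budget, and every missed expert has to be paid for out of that budget.

```python
def miss_overhead_ms(experts_per_token: int, expert_mb: float,
                     miss_rate: float, bus_gb_s: float) -> float:
    """Extra per-token time spent moving missed experts to the GPU (rough estimate)."""
    bytes_moved = experts_per_token * miss_rate * expert_mb * 1e6
    return bytes_moved / (bus_gb_s * 1e9) * 1000

# All four numbers are illustrative assumptions, not measurements.
print(miss_overhead_ms(experts_per_token=64, expert_mb=25,
                       miss_rate=0.3, bus_gb_s=25))  # ~19 ms per token
```

Under those assumptions, a 30 percent miss rate alone adds on the order of 20 ms per token, which is most of the gap between 10 and 8 tokens per second before counting any cache-management overhead. The exact values do not matter; the point is how quickly transfer costs eat the budget.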
That does not mean the idea is useless. It means the idea is hardware-sensitive.
On a different machine, the answer could change. More VRAM, multiple GPUs, faster PCIe, different memory bandwidth, a different MoE activation pattern, or a smarter cache policy could move the result. This is exactly why I wanted a local rig in the first place: I can test ideas against real hardware instead of guessing.
## What I Changed in the Fork
My fork ended up with a stack of experimental pieces around this idea:
- TurboQuant rebased onto a newer llama.cpp base
- an MoE expert activation profiler
- cache configuration exposed through normal runtime flags
- hot-expert seeding from profiler output
- fixes for cache correctness issues I ran into while testing
- hysteresis so experts had to show up more than once before being promoted
Some of that was infrastructure more than optimization. Profiling, configuration, and correctness fixes are not the exciting part of an experiment, but they are what make the result believable.
Without them, it is too easy to fool yourself. Maybe the cache is faster. Maybe the workload changed. Maybe the model is silently wrong. Maybe the one prompt you tested happened to hit the right experts. The boring pieces are how you reduce that uncertainty.
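The hysteresis piece is a good example of boring-but-necessary. Without it, a single stray activation can evict an expert that was actually earning its place in VRAM. Here is a sketch of the idea in Python (illustrative, not the fork's actual code): an expert only becomes a promotion candidate after it has been seen at least a threshold number of times in a recent window, and the same counts double as profiler output for pre-seeding the cache on the next run.

```python
from collections import Counter, deque

class PromotionFilter:
    """Promote experts only after repeated activations in a recent window (illustrative)."""

    def __init__(self, window: int = 4096, threshold: int = 2):
        self.recent = deque(maxlen=window)  # sliding window of expert activations
        self.counts = Counter()
        self.threshold = threshold

    def observe(self, expert_id: int) -> bool:
        """Record an activation; return True once the expert is worth promoting."""
        if len(self.recent) == self.recent.maxlen:
            self.counts[self.recent[0]] -= 1  # the oldest activation falls out
        self.recent.append(expert_id)
        self.counts[expert_id] += 1
        return self.counts[expert_id] >= self.threshold

    def hot_experts(self, top_k: int) -> list[int]:
        """Profiler-style output: experts worth pre-seeding into VRAM next run."""
        return [eid for eid, n in self.counts.most_common(top_k) if n > 0]
```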
## The Useful MoE Result
The useful MoE result was not "I made a 397B model fast on one consumer GPU."
I did not.
The useful result from that side of the experiment was learning where the limits were. Qwen3.5 397B-A17B could run locally on my machine. The MoE cache idea could be implemented. But on this hardware, with this setup, the cache was slower than leaving the model in RAM.
That is still progress. A negative result with numbers is better than a vague assumption. Now I know more about where the bottleneck is, what kind of hardware might change the answer, and which parts of the software stack are worth revisiting later.
I also have a fork that is easier to experiment with next time, even if it has already started to fall behind upstream again.
That is the shape of a lot of local AI work right now. The field moves quickly, the tools are uneven, and not every idea survives contact with the machine. But when the hardware is yours and the stack is inspectable, even the failed experiments leave something useful behind.