The GPU Detour

There is a chat on this site called Ask Tungi. You can ask it about my projects, how I work, what I think about product problems. Under the hood, a retrieval pipeline pulls in real project context, and a language model responds in my voice.

Most people would call an API and be done. I did that too, eventually. But first I spent a few weeks running my own model on a dedicated GPU. I learned a lot from that.

Why I wanted to self-host

No per-token billing — you pay for compute, not output. Full control over the model. Open weights, so you can fine-tune later. And your data stays on your infrastructure. After years of shipping under GDPR and KRITIS, that last part matters to me.

I found a GPU platform that wraps dedicated machines behind a standard chat completions API. You pick a model, pick a machine, deploy, and you get an endpoint. I had a working chat streaming responses within an hour.

Then the problems started.

The machine question

Your model size decides your GPU. A 7B parameter model runs fine on an A10G with 24GB VRAM. Go bigger — 13B, 70B — and you need A100s or H100s. Those cost real money per hour.

The platform had serverless GPUs. Scale to zero when nobody is chatting. Sounds great for a portfolio site. But when the first message comes in after idle, the model has to load back into VRAM. That takes ten to thirty seconds.

My site gets single-digit visitors a day. Every first message hits a cold GPU. Cold starts are not an edge case here — they are the normal experience.

Keeping a GPU always on fixes that, but runs fifty to two hundred dollars a month. For a portfolio project, no.

Quality, speed, money

You get two. Not three.

A good model on a fast GPU gives you quality and speed. It also costs more than my domain and hosting combined. A good model on a serverless GPU gives you quality and low cost, but fifteen-second cold starts. The visitor stares at a spinner. Some close the tab. A small model on a serverless GPU is fast and cheap, but the answers get worse — shorter, more generic, bad at following a detailed system prompt.

I sat in the middle one for a while. Quality and cheap. Every session started with a fifteen-second wait. The feature I built to show how I think about products was, itself, a bad product.

Train, fine-tune, or just prompt

Part of the reason I self-hosted was fine-tuning. The plan: take an open-weight model, train it on how I write, and get a chat that sounds like me without needing a giant system prompt. I started putting together the training data. Q&A pairs, tone examples, guardrails.

Then I thought about it more carefully.

Training a model from scratch is not realistic for one person. You need compute, data pipelines, months of work. Fine-tuning is more accessible, but it ages badly. Models get replaced every six to twelve months. When the next generation comes out, your fine-tuned weights sit on old architecture. You retrain, re-evaluate, redeploy. Same effort every cycle.

A good system prompt works on any model. A retrieval pipeline that pulls in the right project context does not care what generates the response. And an eval suite — I built one with over sixty tests for voice, guardrails, and injection resistance — tells you within minutes if a new model holds up.

So I built the eval suite instead of the fine-tune. When I need to swap models, I change an environment variable, run the tests, and ship. I would rather invest in that than in weights I will throw away in a year.

There are cases where fine-tuning is the right call. Specialised reasoning, unusual output formats, edge cases that no prompt covers. For a conversational chat with a clear personality, prompting and retrieval got me there without the overhead.

Security does not care where inference runs

When I moved off the GPU, every security layer came with me unchanged. Prompt injection detection for jailbreak attempts. Rate limiting through Redis, twenty requests per hour per IP. CSRF origin checks. Input sanitisation with length limits. A stateless edge function that never stores personal data.

All of that sits upstream of the model call. It works the same whether the model lives on a GPU I rent or behind a third-party API. Building security into the architecture instead of around a specific provider turned out to be the right bet.

So I switched

I moved to a hosted API. Small, fast model. Sub-second time to first token. No cold starts. And better errors — I could now tell apart a bad key, a rate limit, an upstream outage, and a timeout, instead of getting one generic "sleeping" message for everything.

At my traffic — under a hundred messages a month — API costs are less than a dollar. The GPU setup was free when idle, but ran fifteen to twenty dollars once anyone actually used it. Plus the cold start on every session.

If I had real traffic, thousands of conversations a day, or needed strict data residency, the math would go the other way. Self-hosting wins when token costs pile up and you need control the API does not offer. I was not in that situation.

What I actually built after switching

The infrastructure decision settled fast. The interesting work came after.

The retrieval pipeline grew into something I'm genuinely proud of. I moved from simple keyword matching to TF-IDF vectors with cosine similarity — each project and case study chunk gets vectorised at build time, and incoming queries are matched against them. I added query expansion so synonyms and related terms don't miss relevant chunks. The similarity threshold took real tuning: too aggressive and starter questions like "what was your hardest project?" returned nothing; too loose and unrelated chunks polluted the context.

The voice work was harder. I rewrote the system prompt several times before the tone felt right — confident but not boastful, specific to my actual projects, and able to stay in character across a long conversation without slipping into generic consultant speak. The eval suite caught every regression when I changed the prompt.

The feature I'm most glad I added is document upload. You can drop in a job description or a project brief, and the chat will map Tungi's experience against it — specific projects, relevant skills, a match score if it's a JD. That only works because the retrieval system is grounded in real context. A bare API call with no retrieval would give you a generic response. The vector index makes it specific.

What actually mattered

Start with the API. Ship the feature, not the infrastructure. The hard problems in an AI chat have nothing to do with GPUs. They are about making retrieval good enough that responses are grounded, writing a prompt that holds a consistent voice, building tests that catch when quality drops, and keeping the security boundary tight between user input and model output.

Self-host when you have a concrete reason. Not because it feels like the serious thing to do.

The most useful thing I built was not the deployment on a GPU. It was the retrieval pipeline, the system prompt, and the eval suite. Those work on any backend. The GPU was a rental. The retrieval system is the product.

How I self-hosted an AI chat on my portfolio (and why I stopped)