How I Used Ollama Locally Instead of Paying for the OpenAI API

And why it was one of the best decisions I made while building InterviewAI
There's a moment every indie developer dreads.
You've been building for weeks. The app is finally starting to feel real — the UI is clean, the auth works, the database is humming. And then you open the OpenAI pricing page.
That was me, sometime around midnight, staring at API costs and doing rough math in my head. InterviewAI — the AI-powered interview prep platform I was building — needed a language model at its core. It needed to generate interview questions tailored to a job role and evaluate spoken answers in real time.
The OpenAI API could do all of that beautifully. But at scale, even a modest number of daily users would rack up a bill I couldn't justify as a third-year CS student building a side project between assignments.
So I asked myself: what if I just ran the model myself?
Enter Ollama
I'd heard of Ollama before but always dismissed it as a "hobbyist" tool — something you'd use to chat with a local model for fun, not something you'd build a production feature on.
I was wrong.
Ollama lets you pull and run open-source LLMs locally with a single command. No API key. No rate limits. No usage bill. It spins up a local REST API on http://localhost:11434 that you can call from any backend — which, in my case, was a Next.js app running on Vercel... with one catch I'll get to in a moment.
What I Used It For
InterviewAI has one core AI-powered feature, and Ollama powers it.
1. Generating Interview Questions
When a user enters a job role (say, "Frontend Developer at a fintech startup"), InterviewAI generates a set of tailored interview questions — behavioural, technical, and situational.
I used gpt-oss:120b-cloud — OpenAI's open-weight 120B parameter model available through Ollama. Here's the part that makes it genuinely clever: the -cloud tag means the model doesn't run on your machine at all. Ollama automatically offloads it to their cloud infrastructure, so you skip the 65GB download and the GPU requirement entirely. But from your code's perspective, it still looks like a local call to localhost:11434 — same API, same interface, zero changes to your integration.
The prompt I ended up using was simple but effective:
const response = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-oss:120b-cloud",
    messages: [
      {
        role: "user",
        content: `Generate 5 interview questions for a ${role} position.
Include 2 technical, 2 behavioural, and 1 situational question.
Return as a JSON array.`
      }
    ],
    stream: false
  })
});

const data = await response.json();
const questions = JSON.parse(data.message.content);
One thing that tripped me up early: /api/chat returns the model's reply as a plain string in data.message.content (the older /api/generate endpoint puts it in data.response instead). If you're expecting JSON, you need to parse it yourself — and prompt the model explicitly to return only JSON with no markdown fences. Otherwise you'll get ```json blocks wrapping your output and JSON.parse will throw.
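Even with that instruction, models occasionally wrap the output in fences anyway, so a small guard before parsing is cheap insurance. Here's a defensive sketch (the helper name is mine, not from InterviewAI's code):

```javascript
// Strip optional ```json ... ``` fences before parsing the model's reply.
// Hypothetical helper — adapt to your own error handling.
function parseModelJson(raw) {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "") // leading fence, with or without "json"
    .replace(/```\s*$/, "");          // trailing fence
  return JSON.parse(cleaned);
}
```

Now `parseModelJson(data.message.content)` works whether or not the model behaved.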
The Catch: Ollama Still Needs to Be Running
Here's the part nobody tells you upfront.
Even with cloud models, Ollama acts as a local proxy — your code calls localhost:11434, Ollama handles the request and routes it to their cloud servers behind the scenes. That means Ollama itself still needs to be running on the machine making the call.
When I deployed InterviewAI to Vercel, the calls silently failed. Vercel serverless functions have no localhost:11434 — there's no Ollama process running there. The -cloud tag offloads the heavy compute, but it doesn't remove the need for a local Ollama instance.
My solution was to keep the Ollama-powered features working in a local development / self-hosted mode, while the deployed version handles it differently. For a portfolio project, this is fine — document it clearly and users who want the full AI features run it locally.
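One way to keep a single code path across both modes is to pick the endpoint from the environment. This is a sketch of how that can look — the OLLAMA_API_KEY variable name and the function shape are my assumptions, and it takes the env object as an argument so it's easy to test:

```javascript
// Choose between the local Ollama proxy and Ollama's hosted API.
// If OLLAMA_API_KEY is set (e.g. on Vercel), go straight to the cloud;
// otherwise assume a local Ollama instance is running.
function ollamaConfig(env) {
  if (env.OLLAMA_API_KEY) {
    return {
      url: "https://ollama.com/api/chat",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${env.OLLAMA_API_KEY}`,
      },
      model: "gpt-oss:120b",
    };
  }
  return {
    url: "http://localhost:11434/api/chat",
    headers: { "Content-Type": "application/json" },
    model: "gpt-oss:120b-cloud",
  };
}
```

Call it as `ollamaConfig(process.env)` and pass the result into fetch; local dev and the deployed app then share one request function.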
If you want Ollama cloud models to work in a fully deployed setup, Ollama does offer a direct cloud API at https://ollama.com/api/chat with an API key — no local instance needed:
const response = await fetch("https://ollama.com/api/chat", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.OLLAMA_API_KEY}`
  },
  body: JSON.stringify({
    model: "gpt-oss:120b",
    messages: [{ role: "user", content: prompt }],
    stream: false
  })
});
This is the production path — and it's still free-tier friendly compared to OpenAI.
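Both endpoints also accept "stream": true, in which case the reply arrives as newline-delimited JSON: one object per line, each carrying a partial message.content, with "done": true on the last chunk. A sketch of reassembling that stream once you've collected the raw text (the helper name is mine):

```javascript
// Reassemble a full reply from Ollama's streaming NDJSON output.
// Each line is a JSON object with a partial message.content;
// the final chunk is marked with "done": true.
function assembleStream(ndjson) {
  let text = "";
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue; // skip blank lines between chunks
    const chunk = JSON.parse(line);
    if (chunk.message?.content) text += chunk.message.content;
    if (chunk.done) break; // last chunk
  }
  return text;
}
```

For a chat UI you'd append each chunk to the screen as it arrives instead of buffering; this buffered version is just the simplest shape to show.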
Was It Worth It?
Completely.
For a project where I needed to move fast, experiment freely, and not worry about burning through API credits every time I tested a new prompt, running Ollama locally was the right call. I iterated on prompts dozens of times without a second thought about cost.
gpt-oss isn't a compromise. For structured tasks — generating questions from a template, scoring an answer against a rubric — it's genuinely good enough. And "good enough" that's free beats "perfect" that costs money you don't have.
If you're a student or indie dev building something AI-powered, give Ollama a real shot before reaching for the OpenAI dashboard. You might be surprised how far it gets you.
Quick Setup (if you want to try it)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Sign in (required for cloud models)
ollama signin
# Pull the cloud model (no 65GB download — runs on Ollama's servers)
ollama pull gpt-oss:120b-cloud
# Test it
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b-cloud",
  "messages": [{"role": "user", "content": "What is a closure in JavaScript?"}],
  "stream": false
}'
That's it. You're talking to a 120B model through your local Ollama instance.
I'm Dainwi, a CS student building InterviewAI, Opus, and other projects. If this helped you, follow me on Instagram @iamdainwichoudhary where I post dev content in Hinglish.