Unplugging the Cloud: My Journey Running LLMs Locally with Ollama
When I first heard I could download and run a large language model right on my own Windows machine — no expensive GPU required — I was both skeptical and thrilled. Could this really replace my reliance on cloud APIs? Meet Ollama, a tool that changed how I think about local AI inference!
What Is Ollama?
Ollama is an open-source tool built on top of llama.cpp, specifically designed for local LLM model management. It handles everything from downloading models to serving them on your machine. Think of it as your personal gateway to powerful language models — without having to hand over your data or your wallet to cloud providers.
I first stumbled upon Ollama while taking a GenAI course from FreeCodeCamp (FCC), where we were exploring alternatives to cloud-based solutions. Tools like LangChain and Hugging Face are powerful, but they can feel heavy for local experimentation or come with hidden cloud costs and data privacy concerns. Enter Ollama, which:
- Automates the process of finding, downloading, and updating model files.
- Abstracts away different model formats (GGML, Safetensors, etc.) — just say “use Llama 2,” and it pulls everything needed.
- Provides local inference via the command line or a simple API endpoint on your own infrastructure.
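That last point is what hooked me: once a model has been pulled, the Ollama server exposes a small HTTP API on localhost (port 11434 by default). A minimal sketch of calling it from Python with the requests library might look like this (the model tag and prompt are just examples; use whatever you have pulled locally):

import requests

# Ask the local Ollama server for a single, non-streamed completion.
resp = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'deepseek-r1:1.5b',   # any model you have pulled locally
        'prompt': 'Explain quantization in one sentence.',
        'stream': False                # one JSON object instead of a stream
    },
    timeout=120
)
resp.raise_for_status()
print(resp.json()['response'])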
My Environment Setup
I’m running a Windows machine with:
- 32 GB RAM
- 2.70 GHz AMD Ryzen 7 PRO processor (8 cores, 16 logical processors)
- Integrated graphics (so, practically no fancy GPU for heavy AI tasks)
Because Ollama can run quantized models, I didn’t need a high-end GPU. Here’s how I set things up:
- I installed the Windows version of Ollama (available for macOS & Linux too), then used Windows PowerShell to access it, since Ollama is a console application and doesn’t have a graphical interface. That actually makes it much easier to integrate into scripts and other tools.
VS Code + WSL
- I rely on Visual Studio Code with the Remote — WSL extension. This way, I can edit scripts in VS Code but run them in my Linux environment under WSL. It’s a smooth workflow, the best of both worlds! This setup lets me keep my Windows environment for other tasks while still enjoying the Linux command-line tooling.
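One small wrinkle with this split setup: the Ollama server runs on the Windows side while my scripts run inside WSL, so it’s worth checking that the endpoint is reachable before writing any real code. Depending on your WSL networking mode, localhost may or may not forward to Windows services; a quick sanity check (assuming the default port) looks like this:

import requests

# List the models the local Ollama server knows about.
# If this fails from WSL, you may need the Windows host's IP instead of localhost.
try:
    tags = requests.get('http://localhost:11434/api/tags', timeout=5).json()
    print('Models available locally:', [m['name'] for m in tags.get('models', [])])
except requests.exceptions.ConnectionError:
    print('Could not reach the Ollama server - is it running?')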
Getting Ollama Running
Step 1: Download Ollama
Go to the Ollama website (or GitHub) and grab the setup file for your platform (macOS, Windows, or Linux). Since I’m on Windows, I installed the Windows build. It’s a console-based utility with no graphical interface, which actually keeps it lightweight.
ollama
Step 2: Choose a Model
Ollama supports a range of LLMs. I chose Deepseek r1 (1.5B parameters) — one of the smaller models — to ensure smooth performance on my CPU. A handful of other models I considered (though some are heavier) include Qwen, Llama3.2-vision, Gemma, Mistral, etc.
Personal Note:
“I was excited to see so many open-source models available. But remember, bigger isn’t always better — sometimes you want a model that can run seamlessly on your hardware without crashes or delays.”
ollama run deepseek-r1:1.5b
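For context on why the 1.5B model is CPU-friendly while larger ones struggle, here is a rough back-of-envelope estimate of the quantized weights alone (it ignores the KV cache and runtime overhead, so treat it as a lower bound rather than an exact figure):

# Rough memory estimate for a quantized model: parameters x bits per weight.
def approx_model_size_gb(n_params_billion, bits_per_weight):
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(approx_model_size_gb(1.5, 4))   # ~0.75 GB for a 4-bit 1.5B model
print(approx_model_size_gb(11, 4))    # ~5.5 GB for a 4-bit 11B model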
Additionally, I chose to deploy a vision model, which was quite heavy for my machine but gave me a feel for how to interact with a larger model locally. The model I used was llama3.2-vision (11B parameters), a 7.8 GB download, well above what my hardware handles comfortably. Later, when I used the models from simple scripts, each (simple) query took more than 10 seconds with the DeepSeek model and more than 20 seconds with the llama3.2-vision model.
ollama pull llama3.2-vision
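To put rough numbers on those response times, a quick way to time a single query with the Python client looks like this (a sanity check rather than a proper benchmark; the prompt is just an example):

import time
import ollama

# Time one round trip to the local model (not a rigorous benchmark).
start = time.perf_counter()
ollama.chat(
    model='deepseek-r1:1.5b',  # swap in 'llama3.2-vision' to compare
    messages=[{'role': 'user', 'content': 'Say hello in one word.'}]
)
print(f"Round trip took {time.perf_counter() - start:.1f} s")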
Integration with VS Code + WSL
Since I’m a heavy VS Code user:
- Remote — WSL Extension: I opened my WSL folder in VS Code and wrote Python scripts to query the local Ollama endpoint.
- Debug & Iterate: I could quickly test different prompts or parameter settings (like temperature, top_p, etc.) in my Python code without leaving the editor.
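As a concrete example of that iteration loop, the Python client lets you pass sampling settings through an options dict; a small sketch of a temperature sweep (the prompt and values are just illustrative) might look like:

import ollama

# Try the same prompt at two temperatures and compare the replies.
for temperature in (0.2, 0.8):
    response = ollama.chat(
        model='deepseek-r1:1.5b',
        messages=[{'role': 'user', 'content': 'Give me a tagline for a local-first AI tool.'}],
        options={'temperature': temperature, 'top_p': 0.9}
    )
    print(f"temperature={temperature}:", response['message']['content'][:120])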
Personal Tip:
“Having everything in one place — my code in VS Code, the model running via WSL — felt like a supercharged local dev environment. No more bouncing between a web UI or an external API console.”
Here is the code for chat.py, which queries the DeepSeek model running on the local system.
import os
import ollama
print("Hello, Ollama!")
print("Current directory files:", os.listdir('.'))
try:
    response = ollama.chat(
        model='deepseek-r1:1.5b',
        messages=[{
            'role': 'user',
            'content': 'What is in the name of last Ottoman King?'
        }]
    )
    print("Response received:")
    print(response)
except Exception as e:
    print("Error during ollama.chat:", e)
Here is the code for image.py, which uses the llama3.2-vision model.
import os
import ollama
print("Hello, Ollama!")
print("Current directory files:", os.listdir('.'))
try:
    response = ollama.chat(
        model='llama3.2-vision',
        messages=[{
            'role': 'user',
            'content': 'What is in this image?',
            'images': ['image.jpg']  # Make sure this file exists
        }]
    )
    print("Response received:")
    print(response)
except Exception as e:
    print("Error during ollama.chat:", e)
The “Wow” Factor: Hands-On Interactions
- Performance Surprise: Even on integrated graphics, the quantized model performed decently, generating coherent text at a reasonable speed.
- Data Privacy: I realized mid-prompt how satisfying it is that all my queries stay local. If I want to feed it sensitive text — like internal docs — I can do so without second thoughts.
- No Cloud Costs: No more worrying about racking up bills on paid inference APIs.
Potential Use Cases & Next Steps
Now that I’ve experienced the convenience and security of running an LLM locally, I can’t help but imagine all the practical applications Ollama could unlock. At the simplest level, you could spin up a local chatbot or assistant — perhaps for internal IT troubleshooting or employee FAQs — where every query remains safely on your own infrastructure. If you often deal with large text files, consider building a document summarization tool that ingests PDFs, Word docs, or text dumps and returns concise bullet points, all without sending content to any external server.
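To make the summarization idea concrete, a bare-bones version of that tool could be a few lines of Python. This sketch assumes the document text has already been extracted (report.txt is a placeholder file name), and very long files would still need to be chunked to fit the model's context window:

import ollama

# Summarize a locally stored document without sending it anywhere.
# 'report.txt' is a placeholder; extract text from PDFs/Word docs beforehand.
with open('report.txt', encoding='utf-8') as f:
    document = f.read()

response = ollama.chat(
    model='deepseek-r1:1.5b',
    messages=[{
        'role': 'user',
        'content': 'Summarize the following document as concise bullet points:\n\n' + document
    }]
)
print(response['message']['content'])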
For more technical users, there’s the option of a developer assistant, where Ollama can generate boilerplate code or provide refactoring suggestions directly in your IDE. Meanwhile, if your organization’s knowledge is scattered across wikis, databases, and PDFs, a Knowledge Base Q&A system powered by a local LLM could streamline search and retrieval, keeping proprietary information off the cloud. On the creative side, you could craft an AI-powered writing assistant to help your marketing team draft blog posts, newsletters, or social media copy — again, all behind your own firewall. And lastly, for those in customer-facing roles, a customer support chatbot that integrates with CRM systems locally could handle basic inquiries, schedule appointments, or provide FAQ responses while adhering to strict data-protection regulations.
What excites me most is how all these ideas can be deployed on your own terms — no monthly inference fees, no third-party storing of your data, and the flexibility to tweak or replace the underlying models whenever you want. It’s a refreshing departure from the typical “hosted AI” paradigm, making Ollama an intriguing option for small businesses, hobbyists, or even larger organizations looking to keep critical data strictly on-premises.
Lessons Learned & Tips
Throughout my local LLM journey, I learned a handful of best practices that can make or break your experience. First off, hardware considerations are crucial: even though Ollama leverages quantized models to reduce memory usage, you still want a decent CPU and sufficient RAM to avoid slowdowns or crashes — bigger models demand bigger resources. Next, prompt engineering emerged as a game-changer; how you frame your queries often determines whether the output is mediocre or spot-on. I found it helpful to version-control prompts, adjusting them over time for improved consistency. While Ollama doesn’t natively support fine-tuning, prompt tuning can still achieve specialized results by incorporating domain-specific context into your queries or system prompts. On the security and privacy front, I appreciated that all data stays in-house, but I still took care to secure my local endpoint — especially if exposing it to a local network. Meanwhile, monitoring and logging are key if you’re running experiments at scale; tracking response times, CPU usage, and prompt patterns can reveal bottlenecks or highlight prompt improvements. Finally, for those planning to deploy Ollama beyond a personal workstation, containerization (like Docker or Kubernetes) can simplify scaling and updates, ensuring that your local LLM ecosystem remains stable, consistent, and easy to maintain.
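As a small illustration of the prompt-tuning point, domain context can be baked in as a system message so that every query inherits it; the instructions below are just an example, not something Ollama ships with:

import ollama

# "Prompt tuning" via a system message: domain context rides along with
# every conversation instead of being fine-tuned into the model.
SYSTEM_PROMPT = (
    'You are an internal IT helpdesk assistant. Answer only from company '
    'policy, and say "I do not know" when you are unsure.'
)

response = ollama.chat(
    model='deepseek-r1:1.5b',
    messages=[
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': 'How do I request a new laptop?'}
    ]
)
print(response['message']['content'])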
Conclusion & Invitation
Final Thoughts
- Simplicity: Ollama’s centralized model download and serving make local LLM usage surprisingly smooth.
- Flexibility: Whether it’s Llama 2, Deepseek, or Qwen, you choose your model — no one-size-fits-all approach.
- Data Control & Cost: By running locally, you keep data private and avoid cloud API costs.
Invitation:
“If you’re curious about harnessing AI on your own turf, give Ollama a spin! Feel free to reach out with questions or show off your own local LLM experiments. It’s amazing how far you can go without ever leaving your machine.”
This is the first article in my quest to learn more about GenAI. My primary learning resource is FreeCodeCamp. You can follow me on X and GitHub.