Local LLMs with Ollama: Build AI Agents with Zero API Costs
Run AI agents 100% locally with Ollama. Learn to set up Llama 3.2, Mistral, and DeepSeek, then build production-ready agents that work offline with full privacy.
Moshiour Rahman
AI Agents Mastery Series
This is Part 3 of our comprehensive AI Agents series.
| Part | Topic | Level |
|---|---|---|
| 1 | Fundamentals - Build from Scratch | Beginner |
| 2 | LangGraph Deep Dive | Intermediate |
| 3 | Local LLMs with Ollama | Intermediate |
| 4 | Tool-Using Agents | Intermediate |
| 5 | Multi-Agent Systems | Advanced |
| 6 | Production Deployment | Advanced |
Why Local LLMs?
Cloud APIs are great, but they have downsides:
| Cloud APIs | Local LLMs |
|---|---|
| Pay per token ($$$) | Free after download |
| Data sent to third party | 100% private |
| Requires internet | Works offline |
| Rate limits | Unlimited requests |
| Vendor lock-in | Model freedom |
In 2025, local LLMs have become production-viable. Models like Llama 3.2, Mistral, and DeepSeek run efficiently on consumer hardware.
Ollama: Docker for LLMs
Ollama packages LLMs like Docker packages applications—everything in one command:
ollama run llama3.2
That’s it. Model downloaded, configured, and running.
System Requirements
| Model Size | RAM Required | GPU (Optional) | Example Models |
|---|---|---|---|
| 1-3B | 8GB | Not needed | Llama 3.2 1B, Phi-3 Mini |
| 7-8B | 16GB | RTX 3060+ | Llama 3.1 8B, Mistral 7B |
| 13B+ | 32GB | RTX 3080+ | Llama 2 13B, CodeLlama |
| 70B | 64GB+ | RTX 4090 / A100 | Llama 2 70B |
Installing Ollama
macOS
brew install ollama
Or download from ollama.com
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com or use WSL2.
Verify Installation
ollama --version
# ollama version 0.4.x
Your First Local Model
# Download and run Llama 3.2 (3B - fast, good quality)
ollama run llama3.2
# You're now in an interactive chat!
>>> What is the capital of France?
The capital of France is Paris...
Press Ctrl+D to exit.
Popular Models for Agents
| Model | Size | Best For | Command |
|---|---|---|---|
| llama3.2 | 3B | General, fast | ollama run llama3.2 |
| llama3.1:8b | 8B | Better reasoning | ollama run llama3.1:8b |
| mistral | 7B | Balanced performance | ollama run mistral |
| deepseek-r1 | 7B | Reasoning tasks | ollama run deepseek-r1 |
| codellama | 7B | Code generation | ollama run codellama |
| qwen2.5-coder | 7B | Code + chat | ollama run qwen2.5-coder |
Download Models
# Pull without running
ollama pull llama3.2
ollama pull mistral
ollama pull codellama
# List downloaded models
ollama list
# Remove a model
ollama rm llama3.2
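To inspect a downloaded model's parameters, prompt template, and license:
# Show model details
ollama show llama3.2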
Ollama API: OpenAI Compatible
Ollama exposes a native REST API plus an OpenAI-compatible endpoint (see below). Start the server:
# Ollama runs as a service, but you can also start manually
ollama serve
The API is available at http://localhost:11434.
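You can sanity-check the server from the command line with the same /api/generate endpoint used below:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'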
Direct API Usage
# ollama_api.py
import requests
def chat_with_ollama(prompt: str, model: str = "llama3.2") -> str:
"""Chat with Ollama using the REST API."""
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": False
}
)
return response.json()["response"]
# Test it
print(chat_with_ollama("Explain recursion in programming in 2 sentences."))
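The same endpoint also streams: with "stream": true, Ollama returns one JSON object per line, each carrying a "response" fragment until "done" is true. A minimal streaming sketch:
# ollama_api_stream.py
import json
import requests

def stream_ollama(prompt: str, model: str = "llama3.2") -> None:
    """Stream a response fragment-by-fragment from the REST API."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break

stream_ollama("Explain recursion in one sentence.")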
OpenAI-Compatible Endpoint
Ollama also provides an OpenAI-compatible endpoint at /v1:
# ollama_openai_compat.py
from openai import OpenAI
# Point to Ollama's local server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but not used
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
)
print(response.choices[0].message.content)
This means any code using OpenAI’s SDK works with Ollama just by changing the base URL!
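Streaming works through this endpoint too; a short sketch with the OpenAI SDK:
# ollama_openai_stream.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Chunks arrive as deltas, exactly as they would from OpenAI's hosted API
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Give me three Python tips."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)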
Python Ollama Library
For a native experience, use the official library:
pip install ollama
# ollama_native.py
import ollama
# Simple generation
response = ollama.generate(
model='llama3.2',
prompt='Why is the sky blue?'
)
print(response['response'])
# Chat format
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'user', 'content': 'What is the capital of Japan?'}
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Write a haiku about coding'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
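The library also ships an AsyncClient for asyncio code; a minimal sketch:
# ollama_async.py
import asyncio
from ollama import AsyncClient

async def main():
    # AsyncClient mirrors the sync API with awaitable calls
    response = await AsyncClient().chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Name three uses of asyncio.'}]
    )
    print(response['message']['content'])

asyncio.run(main())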
Ollama + LangChain Integration
LangChain has built-in Ollama support:
pip install langchain-ollama
# ollama_langchain.py
from langchain_ollama import OllamaLLM, ChatOllama
# Basic LLM
llm = OllamaLLM(model="llama3.2")
response = llm.invoke("Explain quantum computing simply")
print(response)
# Chat model (recommended for agents)
chat = ChatOllama(model="llama3.2")
response = chat.invoke([
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "What's the difference between list and tuple?"}
])
print(response.content)
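ChatOllama also composes with the rest of LangChain; here is a minimal prompt-model-parser chain in LCEL style:
# ollama_chain.py
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical writer."),
    ("user", "Summarize {topic} in two sentences."),
])

# prompt -> local model -> plain string
chain = prompt | ChatOllama(model="llama3.2") | StrOutputParser()
print(chain.invoke({"topic": "vector databases"}))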
Build a Local AI Agent
Now let’s build a fully local agent using LangGraph + Ollama:
# local_agent.py
import os
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition
from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from datetime import datetime
import subprocess
# Initialize local LLM
llm = ChatOllama(
model="llama3.2",
temperature=0 # More deterministic for tool calling
)
# Define tools
@tool
def get_current_time() -> str:
"""Get the current date and time."""
return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@tool
def run_python_code(code: str) -> str:
"""Execute Python code and return the output.
Use this for calculations or data processing."""
try:
        # Run in a subprocess with a timeout (not a full sandbox)
result = subprocess.run(
['python', '-c', code],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return result.stdout or "Code executed successfully (no output)"
return f"Error: {result.stderr}"
except subprocess.TimeoutExpired:
return "Error: Code execution timed out"
except Exception as e:
return f"Error: {str(e)}"
@tool
def read_file(filepath: str) -> str:
"""Read the contents of a file."""
try:
with open(filepath, 'r') as f:
content = f.read()
if len(content) > 2000:
return content[:2000] + "\n... (truncated)"
return content
except FileNotFoundError:
return f"Error: File '{filepath}' not found"
except Exception as e:
return f"Error reading file: {str(e)}"
@tool
def write_file(filepath: str, content: str) -> str:
"""Write content to a file."""
try:
with open(filepath, 'w') as f:
f.write(content)
return f"Successfully wrote to {filepath}"
except Exception as e:
return f"Error writing file: {str(e)}"
@tool
def list_directory(path: str = ".") -> str:
"""List files and directories in a path."""
try:
        items = os.listdir(path)
return "\n".join(items) if items else "Directory is empty"
except Exception as e:
return f"Error: {str(e)}"
tools = [get_current_time, run_python_code, read_file, write_file, list_directory]
# Bind tools to LLM
llm_with_tools = llm.bind_tools(tools)
# State definition
class State(TypedDict):
messages: Annotated[list, add_messages]
# Agent node
def agent(state: State) -> State:
"""The agent decides what to do."""
response = llm_with_tools.invoke(state["messages"])
return {"messages": [response]}
# Build the graph
graph_builder = StateGraph(State)
graph_builder.add_node("agent", agent)
graph_builder.add_node("tools", ToolNode(tools=tools))
graph_builder.add_edge(START, "agent")
graph_builder.add_conditional_edges("agent", tools_condition)
graph_builder.add_edge("tools", "agent")
# Compile
local_agent = graph_builder.compile()
def run_local_agent(query: str) -> str:
"""Run a query through the local agent."""
print(f"\n{'='*60}")
print(f"🦙 LOCAL AGENT (Ollama + LangGraph)")
print(f"Query: {query}")
print('='*60)
result = local_agent.invoke({
"messages": [{"role": "user", "content": query}]
})
# Get final response
final_message = result["messages"][-1]
response = final_message.content if hasattr(final_message, 'content') else str(final_message)
print(f"\n📝 Response:\n{response}")
return response
if __name__ == "__main__":
# Test queries
run_local_agent("What time is it right now?")
run_local_agent("Calculate the factorial of 10 using Python code")
run_local_agent("List the files in the current directory")
Handling Tool Calling with Local Models
Not all local models support function/tool calling natively. Here’s a pattern that works with any model:
# universal_tool_agent.py
import json
import re
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
llm = ChatOllama(model="llama3.2")
TOOLS = {
"calculator": {
"description": "Performs math calculations",
"usage": 'calculator(expression="2+2")'
},
"get_time": {
"description": "Gets current time",
"usage": "get_time()"
}
}
def execute_tool(name: str, args: dict) -> str:
if name == "calculator":
try:
expr = args.get("expression", "0")
allowed = set('0123456789+-*/.() ')
if all(c in allowed for c in expr):
return str(eval(expr))
return "Invalid expression"
        except Exception:
return "Calculation error"
elif name == "get_time":
from datetime import datetime
return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
return f"Unknown tool: {name}"
def create_prompt() -> str:
tool_list = "\n".join([
f"- {name}: {info['description']}. Usage: {info['usage']}"
for name, info in TOOLS.items()
])
return f"""You are an AI assistant with access to tools.
Available tools:
{tool_list}
When you need a tool, respond ONLY with this exact format:
TOOL: tool_name
ARGS: {{"param": "value"}}
When you have the final answer, respond normally without TOOL/ARGS.
Think step by step."""
def parse_tool_call(response: str) -> tuple:
"""Parse tool call from response."""
tool_match = re.search(r'TOOL:\s*(\w+)', response)
args_match = re.search(r'ARGS:\s*({.+?})', response, re.DOTALL)
if tool_match:
tool_name = tool_match.group(1)
args = {}
if args_match:
try:
args = json.loads(args_match.group(1))
            except Exception:
pass
return tool_name, args
return None, None
class UniversalToolAgent:
def __init__(self, max_iterations: int = 5):
self.max_iterations = max_iterations
def run(self, query: str) -> str:
messages = [
SystemMessage(content=create_prompt()),
HumanMessage(content=query)
]
for i in range(self.max_iterations):
response = llm.invoke(messages)
response_text = response.content
print(f"\n[Iteration {i+1}] Agent: {response_text[:200]}...")
tool_name, args = parse_tool_call(response_text)
if tool_name:
print(f" → Tool: {tool_name}({args})")
result = execute_tool(tool_name, args)
print(f" → Result: {result}")
messages.append(AIMessage(content=response_text))
messages.append(HumanMessage(content=f"Tool result: {result}"))
else:
# No tool call - this is the final answer
return response_text
return "Max iterations reached"
# Test
if __name__ == "__main__":
agent = UniversalToolAgent()
print(agent.run("What is 15 * 47 + 123?"))
print(agent.run("What time is it?"))
Performance Optimization
GPU Acceleration
If you have an NVIDIA GPU, Ollama uses it automatically. To verify, load a model and check where it is running:
ollama ps
# The PROCESSOR column reads "100% GPU" when the model fits entirely on the GPU
Model Quantization
Smaller quantized models run faster with minimal quality loss:
# 4-bit quantized (fastest, smallest)
ollama run llama3.2:3b-instruct-q4_0
# 8-bit quantized (balanced)
ollama run llama3.2:3b-instruct-q8_0
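The savings follow directly from bits per weight: weight memory is roughly parameters × bits ÷ 8, before adding the KV cache and runtime overhead. A quick back-of-the-envelope check:
# quantization_math.py
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB: parameters * bits per weight / 8."""
    return params_billions * bits / 8

for bits in (4, 8, 16):
    print(f"3B model @ {bits}-bit: ~{weight_memory_gb(3, bits):.1f} GB")
# 4-bit: ~1.5 GB, 8-bit: ~3.0 GB, 16-bit: ~6.0 GB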
Concurrent Requests
Recent Ollama versions serve multiple requests concurrently out of the box:
# concurrent_requests.py
import asyncio
import aiohttp
import time
async def query_ollama(session, prompt, model="llama3.2"):
async with session.post(
"http://localhost:11434/api/generate",
json={"model": model, "prompt": prompt, "stream": False}
) as response:
result = await response.json()
return result["response"]
async def main():
queries = [
"What is Python?",
"Explain JavaScript",
"What is Rust?",
"Describe Go language"
]
start = time.time()
async with aiohttp.ClientSession() as session:
tasks = [query_ollama(session, q) for q in queries]
results = await asyncio.gather(*tasks)
elapsed = time.time() - start
print(f"Processed {len(queries)} queries in {elapsed:.2f}s")
for q, r in zip(queries, results):
print(f"\nQ: {q}\nA: {r[:100]}...")
asyncio.run(main())
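How many requests actually run in parallel is governed by server settings; in recent Ollama versions the OLLAMA_NUM_PARALLEL environment variable controls concurrent requests per loaded model (check the docs for your version):
# Serve up to 4 concurrent requests per loaded model
OLLAMA_NUM_PARALLEL=4 ollama serve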
Custom Models with Modelfiles
Create specialized models with custom system prompts:
# Modelfile
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a senior Python developer. You write clean,
efficient, well-documented code. You always include type hints
and follow PEP 8 style guidelines. When explaining code, you
break it down step by step."""
Build and run:
ollama create python-expert -f Modelfile
ollama run python-expert
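The custom model is addressed by name like any other tag, for example via the Python library:
# use_custom_model.py
import ollama

response = ollama.chat(
    model='python-expert',  # the name passed to `ollama create`
    messages=[{'role': 'user', 'content': 'Write a function to reverse a string.'}]
)
print(response['message']['content'])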
Comparing Models
Here's a rough, qualitative comparison of popular models for agent tasks:
| Model | Tool Calling | Reasoning | Speed | RAM |
|---|---|---|---|---|
| llama3.2 (3B) | Good | Good | Fast | 8GB |
| llama3.1 (8B) | Better | Better | Medium | 16GB |
| mistral (7B) | Good | Good | Fast | 16GB |
| deepseek-r1 (7B) | Excellent | Excellent | Medium | 16GB |
| qwen2.5-coder (7B) | Good | Good (code) | Fast | 16GB |
For most agent tasks, llama3.2 (3B) offers the best balance of speed and capability; if reasoning quality matters more than latency, deepseek-r1 justifies the extra compute.
Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| Slow responses | No GPU / small RAM | Use smaller model, add GPU |
| "Model not found" | Not pulled | ollama pull model-name |
| Connection refused | Ollama not running | ollama serve |
| Out of memory | Model too large | Use quantized version |
| Poor tool calling | Model limitation | Use structured prompts |
Summary
| What You Learned | Key Takeaway |
|---|---|
| Why local LLMs | Privacy, cost savings, offline capability |
| Ollama basics | Pull, run, and manage models |
| API usage | REST API + OpenAI compatibility |
| LangChain integration | ChatOllama for agents |
| Tool calling | Works with proper prompting |
| Optimization | GPU, quantization, concurrency |
What’s Next?
In Part 4, we’ll build agents with real-world tools—web search, code execution, file operations, and API integrations.
Continue to Part 4: Tool-Using Agents →
Full Code Repository
git clone https://github.com/Moshiour027/ai-agents-mastery.git
cd ai-agents-mastery/03-ollama
pip install -r requirements.txt
python local_agent.py