Local LLMs with Ollama: Build AI Agents with Zero API Costs
Run AI agents 100% locally with Ollama. Learn to set up Llama 3.2, Mistral, and DeepSeek, then build production-ready agents that work offline with full privacy.
Moshiour Rahman
AI Agents Mastery Series
This is Part 3 of our comprehensive AI Agents series.
| Part | Topic | Level |
|---|---|---|
| 1 | Fundamentals - Build from Scratch | Beginner |
| 2 | LangGraph Deep Dive | Intermediate |
| 3 | Local LLMs with Ollama | Intermediate |
| 4 | Tool-Using Agents | Intermediate |
| 5 | Multi-Agent Systems | Advanced |
| 6 | Production Deployment | Advanced |
Why Local LLMs?
Cloud APIs are great, but they have downsides:
| Cloud APIs | Local LLMs |
|---|---|
| Pay per token ($$$) | Free after download |
| Data sent to third party | 100% private |
| Requires internet | Works offline |
| Rate limits | Unlimited requests |
| Vendor lock-in | Model freedom |
In 2025, local LLMs have become production-viable. Models like Llama 3.2, Mistral, and DeepSeek run efficiently on consumer hardware.
Ollama: Docker for LLMs
Ollama packages LLMs like Docker packages applications—everything in one command:
ollama run llama3.2
That’s it. Model downloaded, configured, and running.
System Requirements
| Model Size | RAM Required | GPU (Optional) | Example Models |
|---|---|---|---|
| 1-3B | 8GB | Not needed | Llama 3.2 1B, Phi-3 Mini |
| 7-8B | 16GB | RTX 3060+ | Llama 3.1 8B, Mistral 7B |
| 13B+ | 32GB | RTX 3080+ | Llama 2 13B, CodeLlama |
| 70B | 64GB+ | RTX 4090 / A100 | Llama 2 70B |
Installing Ollama
macOS
brew install ollama
Or download from ollama.com
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com or use WSL2.
Verify Installation
ollama --version
# ollama version 0.4.x
Your First Local Model
# Download and run Llama 3.2 (3B - fast, good quality)
ollama run llama3.2
# You're now in an interactive chat!
>>> What is the capital of France?
The capital of France is Paris...
Press Ctrl+D to exit.
Popular Models for Agents
| Model | Size | Best For | Command |
|---|---|---|---|
| llama3.2 | 3B | General, fast | ollama run llama3.2 |
| llama3.1:8b | 8B | Better reasoning | ollama run llama3.1:8b |
| mistral | 7B | Balanced performance | ollama run mistral |
| deepseek-r1 | 7B | Reasoning tasks | ollama run deepseek-r1 |
| codellama | 7B | Code generation | ollama run codellama |
| qwen2.5-coder | 7B | Code + chat | ollama run qwen2.5-coder |
Download Models
# Pull without running
ollama pull llama3.2
ollama pull mistral
ollama pull codellama
# List downloaded models
ollama list
# Remove a model
ollama rm llama3.2
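To inspect a downloaded model's parameters, prompt template, and license:
# Show model details
ollama show llama3.2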
Ollama API: OpenAI Compatible
Ollama exposes a native REST API plus an OpenAI-compatible endpoint (see below). Start the server:
# Ollama runs as a service, but you can also start manually
ollama serve
The API is available at http://localhost:11434.
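You can sanity-check the server from the command line with the same /api/generate endpoint used below:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'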
Direct API Usage
# ollama_api.py
import requests
def chat_with_ollama(prompt: str, model: str = "llama3.2") -> str:
"""Chat with Ollama using the REST API."""
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": False
}
)
return response.json()["response"]
# Test it
print(chat_with_ollama("Explain recursion in programming in 2 sentences."))
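The same endpoint also streams: with "stream": true, Ollama returns one JSON object per line, each carrying a "response" fragment until "done" is true. A minimal streaming sketch:
# ollama_api_stream.py
import json
import requests

def stream_ollama(prompt: str, model: str = "llama3.2") -> None:
    """Stream a response fragment-by-fragment from the REST API."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break

stream_ollama("Explain recursion in one sentence.")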
OpenAI-Compatible Endpoint
Ollama also provides an OpenAI-compatible endpoint at /v1:
# ollama_openai_compat.py
from openai import OpenAI
# Point to Ollama's local server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but not used
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
)
print(response.choices[0].message.content)
This means any code using OpenAI’s SDK works with Ollama just by changing the base URL!
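Streaming works through this endpoint too; a short sketch with the OpenAI SDK:
# ollama_openai_stream.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Chunks arrive as deltas, exactly as they would from OpenAI's hosted API
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Give me three Python tips."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)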
Python Ollama Library
For a native experience, use the official library:
pip install ollama
# ollama_native.py
import ollama
# Simple generation
response = ollama.generate(
model='llama3.2',
prompt='Why is the sky blue?'
)
print(response['response'])
# Chat format
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'user', 'content': 'What is the capital of Japan?'}
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Write a haiku about coding'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
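The library also ships an AsyncClient for asyncio code; a minimal sketch:
# ollama_async.py
import asyncio
from ollama import AsyncClient

async def main():
    # AsyncClient mirrors the sync API with awaitable calls
    response = await AsyncClient().chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Name three uses of asyncio.'}]
    )
    print(response['message']['content'])

asyncio.run(main())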
Ollama + LangChain Integration
LangChain has built-in Ollama support:
pip install langchain-ollama
# ollama_langchain.py
from langchain_ollama import OllamaLLM, ChatOllama
# Basic LLM
llm = OllamaLLM(model="llama3.2")
response = llm.invoke("Explain quantum computing simply")
print(response)
# Chat model (recommended for agents)
chat = ChatOllama(model="llama3.2")
response = chat.invoke([
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "What's the difference between list and tuple?"}
])
print(response.content)
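ChatOllama also composes with the rest of LangChain; here is a minimal prompt-model-parser chain in LCEL style:
# ollama_chain.py
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical writer."),
    ("user", "Summarize {topic} in two sentences."),
])

# prompt -> local model -> plain string
chain = prompt | ChatOllama(model="llama3.2") | StrOutputParser()
print(chain.invoke({"topic": "vector databases"}))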
Build a Local AI Agent
Now let’s build a fully local agent using LangGraph + Ollama:
# local_agent.py
import os
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition
from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from datetime import datetime
import subprocess
# Initialize local LLM
llm = ChatOllama(
model="llama3.2",
temperature=0 # More deterministic for tool calling
)
# Define tools
@tool
def get_current_time() -> str:
"""Get the current date and time."""
return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@tool
def run_python_code(code: str) -> str:
"""Execute Python code and return the output.
Use this for calculations or data processing."""
try:
        # Run in a subprocess with a timeout (not a full sandbox)
result = subprocess.run(
['python', '-c', code],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return result.stdout or "Code executed successfully (no output)"
return f"Error: {result.stderr}"
except subprocess.TimeoutExpired:
return "Error: Code execution timed out"
except Exception as e:
return f"Error: {str(e)}"
@tool
def read_file(filepath: str) -> str:
"""Read the contents of a file."""
try:
with open(filepath, 'r') as f:
content = f.read()
if len(content) > 2000:
return content[:2000] + "\n... (truncated)"
return content
except FileNotFoundError:
return f"Error: File '{filepath}' not found"
except Exception as e:
return f"Error reading file: {str(e)}"
@tool
def write_file(filepath: str, content: str) -> str:
"""Write content to a file."""
try:
with open(filepath, 'w') as f:
f.write(content)
return f"Successfully wrote to {filepath}"
except Exception as e:
return f"Error writing file: {str(e)}"
@tool
def list_directory(path: str = ".") -> str:
"""List files and directories in a path."""
try:
        items = os.listdir(path)
return "\n".join(items) if items else "Directory is empty"
except Exception as e:
return f"Error: {str(e)}"
tools = [get_current_time, run_python_code, read_file, write_file, list_directory]
# Bind tools to LLM
llm_with_tools = llm.bind_tools(tools)
# State definition
class State(TypedDict):
messages: Annotated[list, add_messages]
# Agent node
def agent(state: State) -> State:
"""The agent decides what to do."""
response = llm_with_tools.invoke(state["messages"])
return {"messages": [response]}
# Build the graph
graph_builder = StateGraph(State)
graph_builder.add_node("agent", agent)
graph_builder.add_node("tools", ToolNode(tools=tools))
graph_builder.add_edge(START, "agent")
graph_builder.add_conditional_edges("agent", tools_condition)
graph_builder.add_edge("tools", "agent")
# Compile
local_agent = graph_builder.compile()
def run_local_agent(query: str) -> str:
"""Run a query through the local agent."""
print(f"\n{'='*60}")
print(f"🦙 LOCAL AGENT (Ollama + LangGraph)")
print(f"Query: {query}")
print('='*60)
result = local_agent.invoke({
"messages": [{"role": "user", "content": query}]
})
# Get final response
final_message = result["messages"][-1]
response = final_message.content if hasattr(final_message, 'content') else str(final_message)
print(f"\n📝 Response:\n{response}")
return response
if __name__ == "__main__":
# Test queries
run_local_agent("What time is it right now?")
run_local_agent("Calculate the factorial of 10 using Python code")
run_local_agent("List the files in the current directory")
Handling Tool Calling with Local Models
Not all local models support function/tool calling natively. Here’s a pattern that works with any model:
# universal_tool_agent.py
import json
import re
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
llm = ChatOllama(model="llama3.2")
TOOLS = {
"calculator": {
"description": "Performs math calculations",
"usage": 'calculator(expression="2+2")'
},
"get_time": {
"description": "Gets current time",
"usage": "get_time()"
}
}
def execute_tool(name: str, args: dict) -> str:
if name == "calculator":
try:
expr = args.get("expression", "0")
allowed = set('0123456789+-*/.() ')
if all(c in allowed for c in expr):
return str(eval(expr))
return "Invalid expression"
        except Exception:
return "Calculation error"
elif name == "get_time":
from datetime import datetime
return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
return f"Unknown tool: {name}"
def create_prompt() -> str:
tool_list = "\n".join([
f"- {name}: {info['description']}. Usage: {info['usage']}"
for name, info in TOOLS.items()
])
return f"""You are an AI assistant with access to tools.
Available tools:
{tool_list}
When you need a tool, respond ONLY with this exact format:
TOOL: tool_name
ARGS: {{"param": "value"}}
When you have the final answer, respond normally without TOOL/ARGS.
Think step by step."""
def parse_tool_call(response: str) -> tuple:
"""Parse tool call from response."""
tool_match = re.search(r'TOOL:\s*(\w+)', response)
args_match = re.search(r'ARGS:\s*({.+?})', response, re.DOTALL)
if tool_match:
tool_name = tool_match.group(1)
args = {}
if args_match:
try:
args = json.loads(args_match.group(1))
            except Exception:
pass
return tool_name, args
return None, None
class UniversalToolAgent:
def __init__(self, max_iterations: int = 5):
self.max_iterations = max_iterations
def run(self, query: str) -> str:
messages = [
SystemMessage(content=create_prompt()),
HumanMessage(content=query)
]
for i in range(self.max_iterations):
response = llm.invoke(messages)
response_text = response.content
print(f"\n[Iteration {i+1}] Agent: {response_text[:200]}...")
tool_name, args = parse_tool_call(response_text)
if tool_name:
print(f" → Tool: {tool_name}({args})")
result = execute_tool(tool_name, args)
print(f" → Result: {result}")
messages.append(AIMessage(content=response_text))
messages.append(HumanMessage(content=f"Tool result: {result}"))
else:
# No tool call - this is the final answer
return response_text
return "Max iterations reached"
# Test
if __name__ == "__main__":
agent = UniversalToolAgent()
print(agent.run("What is 15 * 47 + 123?"))
print(agent.run("What time is it?"))
Performance Optimization
GPU Acceleration
If you have an NVIDIA GPU, Ollama uses it automatically. To verify, load a model and check where it is running:
ollama ps
# The PROCESSOR column reads "100% GPU" when the model fits entirely on the GPU
Model Quantization
Smaller quantized models run faster with minimal quality loss:
# 4-bit quantized (fastest, smallest)
ollama run llama3.2:3b-instruct-q4_0
# 8-bit quantized (balanced)
ollama run llama3.2:3b-instruct-q8_0
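The savings follow directly from bits per weight: weight memory is roughly parameters × bits ÷ 8, before adding the KV cache and runtime overhead. A quick back-of-the-envelope check:
# quantization_math.py
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB: parameters * bits per weight / 8."""
    return params_billions * bits / 8

for bits in (4, 8, 16):
    print(f"3B model @ {bits}-bit: ~{weight_memory_gb(3, bits):.1f} GB")
# 4-bit: ~1.5 GB, 8-bit: ~3.0 GB, 16-bit: ~6.0 GB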
Concurrent Requests
Recent Ollama versions serve multiple requests concurrently out of the box:
# concurrent_requests.py
import asyncio
import aiohttp
import time
async def query_ollama(session, prompt, model="llama3.2"):
async with session.post(
"http://localhost:11434/api/generate",
json={"model": model, "prompt": prompt, "stream": False}
) as response:
result = await response.json()
return result["response"]
async def main():
queries = [
"What is Python?",
"Explain JavaScript",
"What is Rust?",
"Describe Go language"
]
start = time.time()
async with aiohttp.ClientSession() as session:
tasks = [query_ollama(session, q) for q in queries]
results = await asyncio.gather(*tasks)
elapsed = time.time() - start
print(f"Processed {len(queries)} queries in {elapsed:.2f}s")
for q, r in zip(queries, results):
print(f"\nQ: {q}\nA: {r[:100]}...")
asyncio.run(main())
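How many requests actually run in parallel is governed by server settings; in recent Ollama versions the OLLAMA_NUM_PARALLEL environment variable controls concurrent requests per loaded model (check the docs for your version):
# Serve up to 4 concurrent requests per loaded model
OLLAMA_NUM_PARALLEL=4 ollama serve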
Custom Models with Modelfiles
Create specialized models with custom system prompts:
# Modelfile
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a senior Python developer. You write clean,
efficient, well-documented code. You always include type hints
and follow PEP 8 style guidelines. When explaining code, you
break it down step by step."""
Build and run:
ollama create python-expert -f Modelfile
ollama run python-expert
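The custom model is addressed by name like any other tag, for example via the Python library:
# use_custom_model.py
import ollama

response = ollama.chat(
    model='python-expert',  # the name passed to `ollama create`
    messages=[{'role': 'user', 'content': 'Write a function to reverse a string.'}]
)
print(response['message']['content'])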
Comparing Models
Here's a rough, qualitative comparison of popular models for agent tasks:
| Model | Tool Calling | Reasoning | Speed | RAM |
|---|---|---|---|---|
| llama3.2 (3B) | Good | Good | Fast | 8GB |
| llama3.1 (8B) | Better | Better | Medium | 16GB |
| mistral (7B) | Good | Good | Fast | 16GB |
| deepseek-r1 (7B) | Excellent | Excellent | Medium | 16GB |
| qwen2.5-coder (7B) | Good | Good (code) | Fast | 16GB |
For most agent tasks, llama3.2 (3B) offers the best balance of speed and capability; if reasoning quality matters more than latency, deepseek-r1 justifies the extra compute.
Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| Slow responses | No GPU / small RAM | Use smaller model, add GPU |
| "Model not found" | Not pulled | ollama pull model-name |
| Connection refused | Ollama not running | ollama serve |
| Out of memory | Model too large | Use quantized version |
| Poor tool calling | Model limitation | Use structured prompts |
Summary
| What You Learned | Key Takeaway |
|---|---|
| Why local LLMs | Privacy, cost savings, offline capability |
| Ollama basics | Pull, run, and manage models |
| API usage | REST API + OpenAI compatibility |
| LangChain integration | ChatOllama for agents |
| Tool calling | Works with proper prompting |
| Optimization | GPU, quantization, concurrency |
What’s Next?
In Part 4, we’ll build agents with real-world tools—web search, code execution, file operations, and API integrations.
Continue to Part 4: Tool-Using Agents →
Full Code Repository
git clone https://github.com/Moshiour027/ai-agents-mastery.git
cd ai-agents-mastery/03-ollama
pip install -r requirements.txt
python local_agent.py