I Spent 6 Months Building AI-Powered Tools. Here’s What Actually Works.
Six months ago I decided to stop watching AI tutorials and start building. Not toy projects. Not hello-world chatbots. I mean actual systems — tools that people at work use every day, pipelines that run unattended at midnight, agents that make decisions without me holding their hand.
What I found was a massive gap between what the demos show and what production reality looks like. The demos make it look like magic. The reality involves token limits, hallucinations, cost management, and the deeply humbling experience of watching your carefully crafted prompt completely fall apart on real data.
This article is everything I learned — the architectural patterns, the libraries, the prompt engineering tricks, and the mistakes I made so you don’t have to.
1. The Mental Model Shift That Changes Everything
Before we touch a single line of code, let me save you weeks of frustration.
Most developers approach AI tools the wrong way. They start with the technology — GPT-4, embeddings, RAG, agents — and then go looking for problems to apply it to. That’s backwards. Every good AI project I’ve built started with a specific, painful problem and then asked: can an LLM meaningfully help here?
The question I now ask before any AI project: what is the human currently doing manually, and what would it take to automate 80% of it?
Not 100%. That 20% where edge cases live is often where the real complexity hides, and trying to automate it perfectly will sink your timeline. Automate the bulk of the work, flag the edge cases for human review, and ship.
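One way to picture the 80/20 split is a confidence gate: results the system is sure about go straight through, and everything else lands in a review queue. The sketch below is only an illustration. The `Prediction` class, the `route` function, and the 0.8 threshold are hypothetical names and values I picked for the example, not part of any library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # assumed to be in the range [0.0, 1.0]


def route(
    predictions: List[Prediction], threshold: float = 0.8
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (automated, needs_review) by confidence."""
    automated = [p for p in predictions if p.confidence >= threshold]
    needs_review = [p for p in predictions if p.confidence < threshold]
    return automated, needs_review


batch = [
    Prediction("a1", "billing", 0.97),
    Prediction("a2", "technical", 0.55),
    Prediction("a3", "account", 0.91),
]
automated, needs_review = route(batch)
print([p.item_id for p in automated])     # ['a1', 'a3']
print([p.item_id for p in needs_review])  # ['a2']
```

The threshold is the knob: lower it and more work is automated, raise it and more lands in front of a human.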
With that framing locked in, here’s the stack I’ve settled on for building AI-powered tools in Python.
2. The OpenAI API — Going Beyond Simple Completions
Most tutorials show you the bare minimum: pass a prompt, get a response. That covers maybe 10% of what the API can actually do for you. The features that have made the biggest difference in my projects are structured outputs, function calling, and batching.
Let me show you structured outputs first, because it’s the one that unlocks everything else.
import openai
from pydantic import BaseModel
from typing import Optional, List
import json

client = openai.OpenAI(api_key="your_api_key")

# Define the exact shape of data you want back
class ExtractedInvoice(BaseModel):
    vendor_name: str
    invoice_number: str
    total_amount: float
    currency: str
    line_items: List[dict]
    due_date: Optional[str]
    is_overdue: bool

def extract_invoice_data(raw_text: str) -> ExtractedInvoice:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """You are a financial data extraction assistant.
                Extract all invoice information from the provided text with precision.
                If a field is not found, use null for optional fields.
                For is_overdue, determine based on due date if present."""
            },
            {
                "role": "user",
                "content": f"Extract the invoice data from this text:\n\n{raw_text}"
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "invoice_extraction",
                "schema": ExtractedInvoice.model_json_schema()
            }
        },
        temperature=0.1  # Low temperature for data extraction tasks
    )
    raw_json = response.choices[0].message.content
    return ExtractedInvoice(**json.loads(raw_json))

# Use it
invoice_text = """
INVOICE #INV-2024-0892
Vendor: Acme Software Solutions
Date: January 15, 2024
Due: February 15, 2024
Services:
- API Development (40hrs @ $150/hr): $6,000.00
- Code Review (8hrs @ $150/hr): $1,200.00
- Documentation: $500.00
Total Due: USD $7,700.00
"""

result = extract_invoice_data(invoice_text)
print(f"Vendor: {result.vendor_name}")
print(f"Total: {result.total_amount} {result.currency}")
print(f"Line items: {len(result.line_items)}")
The difference between this and a raw string response is enormous in production. Once the model's JSON has been validated into a Pydantic object, you can plug it directly into databases, APIs, and downstream logic without a single regex or string parse.
Now let’s look at function calling, which is how you give the model the ability to take actions:
import openai
import json
from datetime import datetime
import sqlite3

client = openai.OpenAI(api_key="your_api_key")

# Define what tools the model can use
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_customer_database",
            "description": "Search for customer information by name, email, or customer ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_term": {
                        "type": "string",
                        "description": "The name, email address, or customer ID to search for"
                    },
                    "search_type": {
                        "type": "string",
                        "enum": ["name", "email", "id"],
                        "description": "What type of search to perform"
                    }
                },
                "required": ["search_term", "search_type"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "create_support_ticket",
            "description": "Create a new customer support ticket in the system",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "issue_summary": {"type": "string"},
                    "priority": {
                        "type": "string",
                        "enum": ["low", "medium", "high", "critical"]
                    },
                    "category": {
                        "type": "string",
                        "enum": ["billing", "technical", "account", "general"]
                    }
                },
                "required": ["customer_id", "issue_summary", "priority", "category"]
            }
        }
    }
]

def search_customer_database(search_term: str, search_type: str) -> dict:
    # In real code this queries your actual database
    mock_customers = {
        "john@example.com": {"id": "C001", "name": "John Smith", "plan": "Pro", "since": "2022-03"},
        "sarah@company.com": {"id": "C002", "name": "Sarah Johnson", "plan": "Enterprise", "since": "2021-11"}
    }
    if search_type == "email" and search_term in mock_customers:
        return mock_customers[search_term]
    return {"error": "Customer not found"}

def create_support_ticket(customer_id, issue_summary, priority, category) -> dict:
    ticket_id = f"TKT-{datetime.now().strftime('%Y%m%d%H%M%S')}"
    return {"ticket_id": ticket_id, "status": "created", "customer_id": customer_id}

def run_support_agent(user_message: str):
    messages = [
        {"role": "system", "content": "You are a helpful customer support agent. Use the available tools to look up customer info and create tickets when needed."},
        {"role": "user", "content": user_message}
    ]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        message = response.choices[0].message

        # If no tool calls, we have the final answer
        if not message.tool_calls:
            return message.content

        # Process each tool call
        messages.append(message)
        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            if fn_name == "search_customer_database":
                result = search_customer_database(**fn_args)
            elif fn_name == "create_support_ticket":
                result = create_support_ticket(**fn_args)
            else:
                result = {"error": "Unknown function"}
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })

# Test it
response = run_support_agent(
    "I have a customer john@example.com who is having billing issues and needs urgent help"
)
print(response)
This is a basic agent loop. The model decides when to call tools, calls them, gets results back, and continues reasoning until it has a complete answer. Note that the while True here has no iteration cap yet; I add one in the research agent in section 6. This pattern alone powers about 70% of the AI tooling I’ve built.
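As the number of tools grows, the if/elif chain in a loop like this gets unwieldy. A common alternative is a small registry that maps tool names to callables. The sketch below is a hedged illustration that runs offline: `TOOL_REGISTRY`, `dispatch_tool`, and the two stub tools are hypothetical names, and the stubs stand in for real implementations.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical stand-ins for real tool implementations.
def lookup_customer(email: str) -> dict:
    return {"id": "C001", "email": email}


def open_ticket(customer_id: str, summary: str) -> dict:
    return {"ticket_id": "TKT-1", "customer_id": customer_id, "summary": summary}


TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "lookup_customer": lookup_customer,
    "open_ticket": open_ticket,
}


def dispatch_tool(name: str, raw_arguments: str) -> str:
    """Run a registered tool and always return a JSON string for the model."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return json.dumps({"error": f"Unknown function: {name}"})
    try:
        args = json.loads(raw_arguments)
        return json.dumps(fn(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names: report it back instead of crashing
        return json.dumps({"error": str(exc)})


print(dispatch_tool("lookup_customer", '{"email": "john@example.com"}'))
print(dispatch_tool("delete_everything", "{}"))
```

Returning errors as JSON rather than raising gives the model a chance to correct a bad call on its next turn.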
3. Building a Real RAG Pipeline From Scratch
Retrieval Augmented Generation is the most requested AI project I’ve seen in the past year, and also the one most people implement poorly. A chatbot that hallucinates answers to questions about your internal docs is worse than no chatbot at all.
Let me show you a RAG system that actually handles the edge cases.
import openai
from sentence_transformers import SentenceTransformer
import numpy as np
import fitz  # PyMuPDF
import pandas as pd
from pathlib import Path
import pickle
import hashlib
from typing import List, Tuple

client = openai.OpenAI(api_key="your_api_key")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# ─── Step 1: Document ingestion with intelligent chunking ───────────────────
def extract_chunks_from_pdf(pdf_path: str, chunk_size: int = 800, overlap: int = 150) -> List[dict]:
    doc = fitz.open(pdf_path)
    chunks = []
    filename = Path(pdf_path).name
    for page_num in range(len(doc)):
        page = doc[page_num]
        page_text = page.get_text("text")

        # Skip pages with very little content (headers, blank pages)
        if len(page_text.strip()) < 100:
            continue

        # Split into chunks with overlap
        start = 0
        chunk_index = 0
        while start < len(page_text):
            end = start + chunk_size
            chunk_text = page_text[start:end].strip()
            if len(chunk_text) > 50:  # Ignore tiny fragments
                chunks.append({
                    "text": chunk_text,
                    "source": filename,
                    "page": page_num + 1,
                    "chunk_index": chunk_index,
                    # Separators keep IDs unique (page 1/chunk 12 vs page 11/chunk 2)
                    "chunk_id": hashlib.md5(f"{filename}:{page_num}:{chunk_index}".encode()).hexdigest()
                })
                chunk_index += 1
            start += chunk_size - overlap
    doc.close()
    return chunks

# ─── Step 2: Embedding and storage ─────────────────────────────────────────
def build_knowledge_base(pdf_directory: str, cache_path: str = "kb_cache.pkl") -> pd.DataFrame:
    cache = Path(cache_path)
    if cache.exists():
        print("Loading cached knowledge base...")
        with open(cache, "rb") as f:
            return pickle.load(f)

    all_chunks = []
    pdf_files = list(Path(pdf_directory).glob("*.pdf"))
    print(f"Processing {len(pdf_files)} PDF files...")
    for pdf_path in pdf_files:
        print(f" → {pdf_path.name}")
        chunks = extract_chunks_from_pdf(str(pdf_path))
        all_chunks.extend(chunks)
    print(f"Total chunks: {len(all_chunks)}")

    print("Generating embeddings (this may take a moment)...")
    texts = [chunk["text"] for chunk in all_chunks]
    embeddings = embedding_model.encode(texts, batch_size=32, show_progress_bar=True)

    df = pd.DataFrame(all_chunks)
    df["embedding"] = list(embeddings)
    with open(cache, "wb") as f:
        pickle.dump(df, f)
    print("Knowledge base built and cached.")
    return df

# ─── Step 3: Semantic search by cosine similarity ──────────────────────────
def search_knowledge_base(query: str, df: pd.DataFrame, top_k: int = 5) -> List[dict]:
    query_embedding = embedding_model.encode([query])[0]

    # Compute cosine similarity
    embeddings_matrix = np.vstack(df["embedding"].values)
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    embeddings_norm = embeddings_matrix / np.linalg.norm(embeddings_matrix, axis=1, keepdims=True)
    similarities = np.dot(embeddings_norm, query_norm)

    # Score a copy so the cached knowledge base is not mutated between queries
    scored = df.assign(similarity=similarities)
    top_results = scored.nlargest(top_k, "similarity")
    return top_results[["text", "source", "page", "similarity"]].to_dict("records")

# ─── Step 4: Generation with citations ────────────────────────────────────
def answer_question(query: str, df: pd.DataFrame) -> dict:
    results = search_knowledge_base(query, df, top_k=5)

    # Filter out low-confidence results
    filtered = [r for r in results if r["similarity"] > 0.35]
    if not filtered:
        return {
            "answer": "I don't have enough relevant information in the knowledge base to answer this question confidently.",
            "sources": [],
            "confidence": "low"
        }

    # Build context with source labels
    context_parts = []
    for i, result in enumerate(filtered):
        context_parts.append(
            f"[Source {i+1}: {result['source']}, Page {result['page']}]\n{result['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a precise document assistant. Answer questions using ONLY
                the provided context. Cite sources as [Source N] inline.
                If the context doesn't contain the answer, say so explicitly -
                do not invent information."""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ],
        temperature=0.2
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [{"file": r["source"], "page": r["page"], "score": round(r["similarity"], 3)} for r in filtered],
        "confidence": "high" if filtered[0]["similarity"] > 0.7 else "medium"
    }

# Run it
kb = build_knowledge_base("./documents")
result = answer_question("What is the refund policy for enterprise customers?", kb)
print(result["answer"])
print("\nSources:", result["sources"])
The two things most RAG tutorials skip: filtering by similarity threshold (so you don’t return garbage results when the query has no good match) and confidence scoring so your UI can communicate uncertainty to the user.
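The overlap arithmetic in the chunking step is easier to see on a tiny input. This is a simplified sketch of the same sliding-window idea, not the pipeline function itself: `chunk_text` is a hypothetical helper, it skips the tiny-fragment filter, and it adds a guard because an overlap equal to or larger than the chunk size would stop the window from advancing.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is what lets a sentence cut at a boundary still appear whole in at least one chunk.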
4. Prompt Engineering That Actually Holds Up in Production
I have a confession. Early on I was terrible at writing prompts. I’d write something that worked perfectly on five test cases and then completely fell apart on real data. The problem was I was writing prompts for the best case instead of the worst case.
Here’s the framework I now use for every production prompt:
import openai
from typing import Optional
import json

client = openai.OpenAI(api_key="your_api_key")

# ─── The prompt template system ────────────────────────────────────────────
SYSTEM_PROMPTS = {
    "classifier": """You are a precise text classification system.
RULES:
- Classify the input into EXACTLY one of the provided categories
- Base your decision solely on the text content
- If the text is ambiguous, choose the most likely category
- If the text clearly belongs to none of the categories, respond with "UNCATEGORIZED"
- Respond with ONLY a JSON object, no explanation
OUTPUT FORMAT:
{"category": "<category_name>", "confidence": <0.0-1.0>, "reasoning": "<one sentence>"}""",

    "extractor": """You are a precise data extraction system.
RULES:
- Extract only what is explicitly stated in the text
- Do not infer or assume information that is not present
- Use null for any field not found in the text
- Dates should be in ISO format (YYYY-MM-DD) when possible
- Respond with ONLY valid JSON, no markdown, no explanation""",

    "summarizer": """You are a concise technical summarizer.
RULES:
- Preserve all technical details, numbers, and specifications
- Do not add interpretation or opinion
- Structure: key findings first, supporting detail second
- Maximum length: 3 paragraphs unless instructed otherwise
- Use the same technical vocabulary as the source material"""
}

def classify_text(
    text: str,
    categories: list,
    examples: Optional[list] = None
) -> dict:
    category_list = "\n".join([f"- {cat}" for cat in categories])

    few_shot = ""
    if examples:
        few_shot = "\n\nEXAMPLES:\n"
        for ex in examples:
            few_shot += f'Input: "{ex["text"]}"\nOutput: {json.dumps(ex["output"])}\n\n'

    user_prompt = f"""Categories:
{category_list}
{few_shot}
Classify this text:
{text}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPTS["classifier"]},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0,  # Zero temperature for classification: keeps output as repeatable as possible
        max_tokens=150
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return {"category": "PARSE_ERROR", "confidence": 0.0, "reasoning": "Failed to parse model output"}

# Test with few-shot examples
examples = [
    {
        "text": "My payment was charged twice this month",
        "output": {"category": "billing", "confidence": 0.99, "reasoning": "Explicit mention of payment issue"}
    },
    {
        "text": "The app crashes when I try to export",
        "output": {"category": "technical", "confidence": 0.97, "reasoning": "Software crash is a technical issue"}
    }
]
result = classify_text(
    text="I can't log into my account after resetting my password",
    categories=["billing", "technical", "account_access", "general_inquiry"],
    examples=examples
)
print(result)
# {"category": "account_access", "confidence": 0.95, "reasoning": "Login failure after password reset is an account access issue"}
The key habits here: separate system prompts by task type, use few-shot examples for anything where tone or format matters, set temperature to zero for tasks like classification or extraction where you want repeatable output (it reduces variation but does not strictly guarantee identical responses), and always wrap JSON parsing in a try/except because the model will occasionally surprise you.
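One surprise I would plan for is JSON wrapped in a markdown code fence despite instructions to the contrary. The helper below is a minimal sketch of a more forgiving parser. `parse_model_json` is a hypothetical name, and it only handles a single wrapping fence, not JSON buried in the middle of prose.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$",
    re.DOTALL,
)


def parse_model_json(raw: str) -> Optional[Any]:
    """Parse JSON from model output, tolerating one wrapping markdown fence.

    Returns None when the text cannot be parsed, so callers can fall back.
    """
    text = raw.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


plain = '{"category": "billing", "confidence": 0.9}'
fenced = FENCE + "json\n" + plain + "\n" + FENCE
print(parse_model_json(plain))
print(parse_model_json(fenced))
print(parse_model_json("Sure! Here is the classification you asked for."))
```

Returning None instead of raising keeps the decision about what a parse failure means (retry, default category, human review) with the caller.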
5. Building an AI Pipeline With LangChain (And When to Skip It)
LangChain gets a lot of mixed feelings in the developer community. After using it in three production projects, my view is nuanced: it’s excellent for specific things and genuinely in the way for others.
Use it for: multi-step chains, memory management in conversations, document loaders. Skip it for: simple one-shot completions, custom agent loops, anything where you need fine-grained control.
Here’s a real use case where it earns its place — a multi-step document processing chain:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Core abstractions now live in langchain_core; the older langchain.* import paths are deprecated
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
import os

os.environ["OPENAI_API_KEY"] = "your_api_key"
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

# ─── Multi-step chain: summarize → extract key points → generate action items ───
summarize_prompt = ChatPromptTemplate.from_template("""
Summarize the following business document in 2-3 sentences.
Focus on the main purpose, key decisions, and outcomes.
Document:
{document}
Summary:""")

extract_prompt = ChatPromptTemplate.from_template("""
Given this document summary, extract 3-5 key business insights as a numbered list.
Each insight should be actionable and specific.
Summary:
{summary}
Key Insights:""")

action_prompt = ChatPromptTemplate.from_template("""
Based on these insights, generate specific action items with owners and deadlines.
Format as: ACTION: [task] | OWNER: [role] | DEADLINE: [timeframe]
Insights:
{insights}
Action Items:""")

# Build the chain - output of one step feeds the next
summarize_chain = summarize_prompt | llm | StrOutputParser()
extract_chain = extract_prompt | llm | StrOutputParser()
action_chain = action_prompt | llm | StrOutputParser()

def process_document(document_text: str) -> dict:
    summary = summarize_chain.invoke({"document": document_text})
    insights = extract_chain.invoke({"summary": summary})
    actions = action_chain.invoke({"insights": insights})
    return {
        "summary": summary,
        "insights": insights,
        "action_items": actions
    }

# ─── Now add retrieval ──────────────────────────────────────────────────────
def build_conversational_qa(pdf_path: str):
    loader = PyMuPDFLoader(pdf_path)
    documents = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)

    vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    qa_prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context below.
If the context doesn't contain the answer, say "I don't have enough information."
Be specific and cite page numbers when available.
Context:
{context}
Question: {question}
Answer:""")

    chain = (
        RunnableParallel({
            "context": retriever,
            "question": RunnablePassthrough()
        })
        | qa_prompt
        | llm
        | StrOutputParser()
    )
    return chain

qa_chain = build_conversational_qa("quarterly_report.pdf")
answer = qa_chain.invoke("What was the revenue growth in Q3?")
print(answer)
The RunnableParallel pattern is where LangChain really shines — running retrieval and passing the question through simultaneously, then combining them before the prompt. Writing this from scratch with plain async Python is doable but significantly more verbose.
6. AI Agents That Actually Complete Multi-Step Tasks
An agent is just a loop: give the model tools, let it decide which tools to use, execute those tools, feed results back, repeat until done. Simple in theory. Chaotic in practice if you don’t build in the right guardrails.
Here’s an agent I built to automate competitive research — it searches the web, reads pages, and synthesizes a structured report:
import openai
import requests
from bs4 import BeautifulSoup
import json
from typing import List
import time

client = openai.OpenAI(api_key="your_api_key")

# ─── Tool definitions ───────────────────────────────────────────────────────
def web_search(query: str, num_results: int = 5) -> List[dict]:
    # Using DuckDuckGo instant answers API (no key required)
    url = "https://api.duckduckgo.com/"
    # Pass the query via params so spaces and special characters are URL-encoded
    params = {"q": query, "format": "json", "no_html": 1}
    try:
        resp = requests.get(url, params=params, timeout=10)
        data = resp.json()
        results = []
        for topic in data.get("RelatedTopics", [])[:num_results]:
            if "Text" in topic:
                results.append({
                    "title": topic.get("Text", "")[:100],
                    "url": topic.get("FirstURL", ""),
                    "snippet": topic.get("Text", "")
                })
        return results if results else [{"error": "No results found"}]
    except Exception as e:
        return [{"error": str(e)}]

def scrape_webpage(url: str, max_chars: int = 3000) -> str:
    try:
        headers = {"User-Agent": "Mozilla/5.0"}
        resp = requests.get(url, headers=headers, timeout=15)
        soup = BeautifulSoup(resp.content, "html.parser")

        # Remove scripts, styles, nav elements
        for tag in soup(["script", "style", "nav", "footer", "header"]):
            tag.decompose()
        text = soup.get_text(separator="\n", strip=True)

        # Clean up excess whitespace
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        clean_text = "\n".join(lines)
        return clean_text[:max_chars]
    except Exception as e:
        return f"Error scraping page: {str(e)}"

def save_report(filename: str, content: str) -> str:
    with open(filename, "w") as f:
        f.write(content)
    return f"Report saved to {filename}"

# ─── Agent loop ─────────────────────────────────────────────────────────────
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for information on a topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query"},
                    "num_results": {"type": "integer", "description": "Number of results (default 5)", "default": 5}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "scrape_webpage",
            "description": "Read and extract text content from a webpage URL",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to scrape"},
                    "max_chars": {"type": "integer", "description": "Maximum characters to return", "default": 3000}
                },
                "required": ["url"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "save_report",
            "description": "Save the final research report to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "filename": {"type": "string"},
                    "content": {"type": "string", "description": "The full report content"}
                },
                "required": ["filename", "content"]
            }
        }
    }
]

def run_research_agent(research_topic: str, max_iterations: int = 10) -> str:
    messages = [
        {
            "role": "system",
            "content": """You are a thorough research agent. When given a research topic:
1. Search for relevant information using web_search
2. Read the most relevant pages using scrape_webpage
3. Synthesize your findings into a structured report
4. Save the report using save_report
Be systematic. Search multiple angles of the topic. Read at least 3 sources before writing.
Structure your final report with: Executive Summary, Key Findings, Sources."""
        },
        {
            "role": "user",
            "content": f"Research this topic and produce a detailed report: {research_topic}"
        }
    ]

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        print(f"\n--- Agent iteration {iteration} ---")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        message = response.choices[0].message
        messages.append(message)

        if not message.tool_calls:
            print("Agent completed task.")
            return message.content

        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            print(f" Calling: {fn_name}({list(fn_args.keys())})")
            if fn_name == "web_search":
                result = web_search(**fn_args)
            elif fn_name == "scrape_webpage":
                result = scrape_webpage(**fn_args)
                time.sleep(1)  # Be polite to servers
            elif fn_name == "save_report":
                result = save_report(**fn_args)
            else:
                result = {"error": "Unknown tool"}
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result) if isinstance(result, (dict, list)) else result
            })

    return "Agent reached maximum iterations without completing task."

# Run it
result = run_research_agent("Python async programming best practices 2024")
print(result)
The max_iterations guard is not optional — without it, a confused agent will happily run forever and burn through your API budget. Set it low during development, raise it once the agent behavior is stable.
7. Managing Costs Without Sacrificing Quality
This is the part nobody talks about in tutorials because tutorials don’t pay API bills. I’ve had months where a single project cost me more than I expected because I was lazy about model selection and prompt efficiency.
Here’s the framework I use now:
import openai
import tiktoken
from functools import lru_cache
import hashlib
import json
import os
from pathlib import Path

client = openai.OpenAI(api_key="your_api_key")

# ─── Token counting before you send ────────────────────────────────────────
def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    # Prices per 1M tokens (a 2024 snapshot - check current pricing)
    pricing = {
        "gpt-4o": {"input": 5.00, "output": 15.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-3.5-turbo": {"input": 0.50, "output": 1.50}
    }
    if model not in pricing:
        return 0.0
    p = pricing[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# ─── Response caching for repeated queries ─────────────────────────────────
CACHE_DIR = Path(".ai_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_cache_key(model: str, messages: list, temperature: float) -> str:
    content = json.dumps({"model": model, "messages": messages, "temperature": temperature}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

def cached_completion(model: str, messages: list, temperature: float = 0.7, max_tokens: int = 1000) -> str:
    # Only cache deterministic calls (temperature = 0)
    if temperature == 0:
        cache_key = get_cache_key(model, messages, temperature)
        cache_file = CACHE_DIR / f"{cache_key}.json"
        if cache_file.exists():
            with open(cache_file) as f:
                cached = json.load(f)
            print(f"[CACHE HIT] Saved ~${cached['estimated_cost']:.4f}")
            return cached["response"]

    # Estimate cost before sending
    full_text = " ".join([m["content"] for m in messages if isinstance(m["content"], str)])
    input_tokens = count_tokens(full_text, model)
    estimated = estimate_cost(input_tokens, max_tokens, model)
    if estimated > 0.10:  # Warn if single call costs more than 10 cents
        print(f"[COST WARNING] This call may cost ~${estimated:.4f}")

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    result = response.choices[0].message.content
    actual_input = response.usage.prompt_tokens
    actual_output = response.usage.completion_tokens
    actual_cost = estimate_cost(actual_input, actual_output, model)

    if temperature == 0:
        with open(cache_file, "w") as f:
            json.dump({"response": result, "estimated_cost": actual_cost}, f)

    print(f"[USAGE] {actual_input} in / {actual_output} out | Cost: ~${actual_cost:.4f}")
    return result

# ─── Model routing based on task complexity ─────────────────────────────────
def smart_completion(prompt: str, task_type: str) -> str:
    # Route simple tasks to cheaper models
    cheap_tasks = ["classification", "extraction", "simple_summary", "yes_no"]
    expensive_tasks = ["complex_reasoning", "code_generation", "nuanced_analysis"]
    if task_type in cheap_tasks:
        model = "gpt-4o-mini"
        max_tokens = 300
    elif task_type in expensive_tasks:
        model = "gpt-4o"
        max_tokens = 2000
    else:
        model = "gpt-4o-mini"
        max_tokens = 800
    messages = [{"role": "user", "content": prompt}]
    return cached_completion(model, messages, temperature=0, max_tokens=max_tokens)
The two habits that have cut my AI costs by roughly 60%: routing simple tasks to gpt-4o-mini instead of gpt-4o, and caching deterministic (temperature=0) responses so repeated queries on the same data don't re-hit the API.
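To make the routing argument concrete, here is the arithmetic for a hypothetical workload, using the per-million-token prices from the pricing table above. The workload numbers (10,000 calls, 2,000 tokens in, 300 tokens out) are made up for illustration, and the prices are a snapshot that should be checked against current pricing.

```python
# Prices per 1M tokens, copied from the pricing table in this section
PRICING = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def cost_per_call(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the listed per-million-token prices."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical workload: 10,000 classification calls, 2,000 tokens in / 300 out each
calls, tokens_in, tokens_out = 10_000, 2_000, 300
totals = {
    model: calls * cost_per_call(model, tokens_in, tokens_out)
    for model in PRICING
}
for model, total in totals.items():
    print(f"{model}: ${total:.2f}")
# gpt-4o: $145.00
# gpt-4o-mini: $4.80
```

At these prices the same workload is roughly 30 times cheaper on the smaller model, which is why routing by task type pays off so quickly.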
8. Putting a Gradio UI on Everything
The fastest way to go from a working Python script to something non-technical stakeholders can actually use is Gradio. Not a React app. Not a Flask server. Just Gradio.
import gradio as gr
import openai
import pandas as pd
from pathlib import Path

client = openai.OpenAI(api_key="your_api_key")

# ─── Multi-tab AI tool dashboard ────────────────────────────────────────────
def analyze_document(file, analysis_type):
    if file is None:
        return "Please upload a document."
    with open(file.name, "r", encoding="utf-8", errors="ignore") as f:
        content = f.read()[:8000]  # Truncate for demo

    prompts = {
        "Summary": "Summarize this document in 3 paragraphs. Include key findings and recommendations.",
        "Action Items": "Extract all action items, tasks, and deadlines from this document as a numbered list.",
        "Risk Analysis": "Identify any risks, concerns, or potential issues mentioned in this document.",
        "Key Metrics": "Extract all numerical data, percentages, and KPIs from this document in a table format."
    }
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a precise document analyst."},
            {"role": "user", "content": f"{prompts[analysis_type]}\n\nDocument:\n{content}"}
        ],
        temperature=0.2
    )
    return response.choices[0].message.content

def chat_with_context(message, history, system_context):
    messages = [{"role": "system", "content": system_context or "You are a helpful assistant."}]
    # Assumes tuple-style history [(user, assistant), ...]; newer Gradio versions
    # can pass a list of message dicts instead, so adjust this loop if needed
    for human, assistant in history:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.7
    )
    return response.choices[0].message.content

def classify_batch(csv_file, text_column, categories):
    if csv_file is None:
        return None, "Please upload a CSV file."
    df = pd.read_csv(csv_file.name)
    if text_column not in df.columns:
        return None, f"Column '{text_column}' not found. Available columns: {list(df.columns)}"

    cat_list = [c.strip() for c in categories.split(",")]
    results = []
    for text in df[text_column].head(20):  # Cap at 20 for demo
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Classify the input into one of: {cat_list}. Respond with only the category name."},
                {"role": "user", "content": str(text)}
            ],
            temperature=0
        )
        results.append(response.choices[0].message.content.strip())

    df = df.head(20).copy()
    df["AI_Category"] = results
    output_path = "/tmp/classified_output.csv"
    df.to_csv(output_path, index=False)
    return output_path, f"Classified {len(results)} rows successfully."

# ─── Build the multi-tab interface ──────────────────────────────────────────
with gr.Blocks(title="AI Toolkit", theme=gr.themes.Soft()) as app:
    gr.Markdown("## AI Productivity Toolkit")

    with gr.Tab("Document Analysis"):
        with gr.Row():
            doc_input = gr.File(label="Upload Document (.txt, .md)")
            analysis_type = gr.Radio(
                ["Summary", "Action Items", "Risk Analysis", "Key Metrics"],
                label="Analysis Type",
                value="Summary"
            )
        doc_output = gr.Textbox(label="Analysis Result", lines=15)
        gr.Button("Analyze").click(analyze_document, [doc_input, analysis_type], doc_output)

    with gr.Tab("Custom Chat"):
        system_ctx = gr.Textbox(
            label="System Context (optional)",
            placeholder="e.g. You are a financial advisor specializing in startup funding...",
            lines=2
        )
        # Pass the textbox as an additional input so its current value is read on
        # every message; system_ctx.value would only give the initial value
        chat = gr.ChatInterface(
            fn=chat_with_context,
            additional_inputs=[system_ctx],
            title="",
        )

    with gr.Tab("Batch Classifier"):
        with gr.Row():
            csv_input = gr.File(label="Upload CSV")
            text_col = gr.Textbox(label="Text Column Name", value="text")
            categories_input = gr.Textbox(
                label="Categories (comma separated)",
                value="positive, negative, neutral"
            )
        with gr.Row():
            csv_output = gr.File(label="Download Results")
            status_output = gr.Textbox(label="Status")
        gr.Button("Classify").click(
            classify_batch,
            [csv_input, text_col, categories_input],
            [csv_output, status_output]
        )

app.launch(share=False, server_port=7860)
Run this and you have a multi-tab AI dashboard in the browser. As written, launch() serves it on localhost port 7860 only; pass server_name="0.0.0.0" (behind whatever access controls your network requires) so teammates can reach it. The document analysis tab, the custom chat with configurable system context, and the batch classifier have saved me hours of back-and-forth with stakeholders who just want to try things without running Python.
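If the dashboard is going to live on a shared machine, one option is to keep the host, port, and login out of the code and read them from the environment. The sketch below only builds the keyword arguments for launch(); the TOOLKIT_* variable names and the launch_settings helper are my own inventions for illustration, not anything Gradio defines.

```python
import os


def launch_settings(env=None):
    """Build keyword arguments for Gradio's Blocks.launch() from environment variables.

    TOOLKIT_HOST, TOOLKIT_PORT, TOOLKIT_USER and TOOLKIT_PASSWORD are
    hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    settings = {
        # 127.0.0.1 keeps the app local; 0.0.0.0 listens on all interfaces
        "server_name": env.get("TOOLKIT_HOST", "127.0.0.1"),
        "server_port": int(env.get("TOOLKIT_PORT", "7860")),
        "share": False,
    }
    user, password = env.get("TOOLKIT_USER"), env.get("TOOLKIT_PASSWORD")
    if user and password:
        # launch() accepts auth as a (username, password) tuple
        settings["auth"] = (user, password)
    return settings


# Usage with the Blocks app defined above:
# app.launch(**launch_settings())
```

With no variables set, this behaves like a local-only launch on port 7860, so nothing is exposed by accident.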
9. The Architecture I Use for Every Production AI Project
After building enough of these systems, I’ve converged on a structure that works. Not because it’s theoretically perfect, but because it’s the minimum that holds up when something goes wrong at 2 AM.
project/
├── config/
│   ├── prompts.py          # All prompts in one place, never hardcoded inline
│   └── settings.py         # Model selection, thresholds, feature flags
├── core/
│   ├── llm_client.py       # Wrapper with retry logic, cost tracking, caching
│   ├── embeddings.py       # Embedding generation and similarity search
│   └── chunker.py          # Document splitting strategies
├── pipelines/
│   ├── ingest.py           # Document ingestion and preprocessing
│   ├── rag.py              # Retrieval augmented generation
│   └── agent.py            # Agent loop with tool registry
├── tools/
│   ├── search.py           # Web search tools
│   ├── filesystem.py       # File read/write tools
│   └── database.py         # Database query tools
├── ui/
│   └── app.py              # Gradio interface
├── utils/
│   ├── logging.py          # Loguru configuration
│   └── validators.py       # Input/output validation
└── tests/
    ├── test_prompts.py     # Prompt regression tests
    └── test_pipelines.py   # End-to-end pipeline tests
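To make the core/llm_client.py line concrete, here is a minimal sketch of a wrapper that combines the three responsibilities named in the tree: caching, retries with exponential backoff, and cost tracking. The class name LLMClient, the call_fn signature, and the flat price_per_1k_tokens figure are assumptions for illustration; real pricing differs per model and between input and output tokens.

```python
import hashlib
import json
import time


class LLMClient:
    """Hypothetical wrapper: caches responses, retries failures, tracks token spend."""

    def __init__(self, call_fn, price_per_1k_tokens=0.0, max_attempts=3,
                 base_delay=1.0, sleep=time.sleep):
        self.call_fn = call_fn  # callable(model, messages) -> (text, total_tokens)
        self.price_per_1k_tokens = price_per_1k_tokens  # placeholder, not a real price
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests don't actually wait
        self.cache = {}
        self.total_tokens = 0

    @property
    def total_cost(self):
        return self.total_tokens / 1000 * self.price_per_1k_tokens

    def _key(self, model, messages):
        # Identical model + messages map to the same cache entry
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, messages):
        key = self._key(model, messages)
        if key in self.cache:
            return self.cache[key]  # cache hit: no API call, no cost
        attempt = 0
        while True:
            try:
                text, tokens = self.call_fn(model, messages)
                break
            except Exception:  # a real client would catch only transient errors
                attempt += 1
                if attempt >= self.max_attempts:
                    raise  # out of attempts: surface the error
                self.sleep(self.base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
        self.total_tokens += tokens
        self.cache[key] = text
        return text


# Wiring it to the OpenAI SDK might look like this:
# def openai_call(model, messages):
#     resp = client.chat.completions.create(model=model, messages=messages)
#     return resp.choices[0].message.content, resp.usage.total_tokens
# llm = LLMClient(openai_call, price_per_1k_tokens=...)
```

Injecting call_fn keeps the wrapper testable without network access. The in-memory dict is only suitable for deterministic calls (temperature 0) within one process.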
The one thing I wish someone had told me early: test your prompts like you test code. When you change a prompt, run it against a fixed set of test inputs and compare outputs. Prompts drift in subtle ways and you won’t notice until something important breaks.
# tests/test_prompts.py
import pytest
from core.llm_client import classify_text

TEST_CASES = [
    {
        "input": "My payment was charged twice",
        "expected_category": "billing",
        "min_confidence": 0.85
    },
    {
        "input": "The app keeps crashing on startup",
        "expected_category": "technical",
        "min_confidence": 0.85
    },
    {
        "input": "I want to cancel my subscription",
        "expected_category": "account",
        "min_confidence": 0.80
    }
]


@pytest.mark.parametrize("case", TEST_CASES)
def test_classifier_accuracy(case):
    result = classify_text(
        text=case["input"],
        categories=["billing", "technical", "account", "general"]
    )
    assert result["category"] == case["expected_category"], \
        f"Expected {case['expected_category']}, got {result['category']}"
    assert result["confidence"] >= case["min_confidence"], \
        f"Confidence too low: {result['confidence']}"
Prompt regression tests catch the silent failures — the category that used to classify correctly but now gets misclassified after you tweaked the system prompt for a different reason.
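The tests above import classify_text and expect a dict with category and confidence keys, but the function itself isn't shown. Here is one way it might look, assuming the model is asked for a JSON reply and that reply is validated before use. The helpers build_messages and parse_classification, the optional client parameter, and the prompt wording are all hypothetical. The confidence is whatever the model reports about itself, so treat it as a rough signal rather than a calibrated probability.

```python
import json


def build_messages(text, categories):
    """Assemble the chat messages; the prompt wording is illustrative only."""
    system = (
        "Classify the user's message into exactly one of these categories: "
        + ", ".join(categories)
        + '. Reply with JSON only, e.g. {"category": "<name>", "confidence": <number from 0 to 1>}.'
    )
    return [{"role": "system", "content": system}, {"role": "user", "content": text}]


def parse_classification(raw, categories):
    """Validate the model's JSON reply; fall back to a zero-confidence result."""
    fallback = {"category": None, "confidence": 0.0}
    try:
        data = json.loads(raw)
        category = data["category"]
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        return fallback  # not JSON, or missing/badly typed fields
    if category not in categories or not 0.0 <= confidence <= 1.0:
        return fallback  # unknown category or out-of-range confidence
    return {"category": category, "confidence": confidence}


def classify_text(text, categories, client=None, model="gpt-4o-mini"):
    if client is None:
        # Imported lazily so the helpers above work without the SDK installed
        from openai import OpenAI
        client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(text, categories),
        response_format={"type": "json_object"},  # JSON mode
        temperature=0,
    )
    return parse_classification(response.choices[0].message.content, categories)
```

Keeping the parsing separate from the API call means the validation logic can be unit tested offline, while the regression tests exercise the full prompt against the live model.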
What’s Next
Six months in, the honest takeaway is this: AI tools are genuinely powerful, and most developers are significantly underusing them — not because the technology is hard, but because the gap between a demo and a production system feels intimidating.
It’s not. Most of what I’ve shown here is Python you already understand, wrapped around API calls. The LLM is doing the heavy lifting. Your job is to give it good inputs, handle its outputs gracefully, and build the guardrails that keep it from going off the rails.
Start with one real problem. Build a thin end-to-end solution. Add the error handling, the logging, the cost controls once the core works.
Everything else is iteration.