LLM & Generative AI

RAG systems, autonomous agents, fine-tuning, and rigorous evaluation — generative AI built for production, not demos.

I design and ship generative-AI systems that hold up under real traffic, real data, and real compliance review.

Where I help

Retrieval-augmented generation (RAG)

Grounded answers over your own knowledge base — with citations, access control, and evaluation. I build the ingestion, chunking, retrieval, and re-ranking stack, then prove quality with offline and online metrics.
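To make the chunk, retrieve, and re-rank stages concrete, here is a minimal sketch in plain Python. It assumes a toy bag-of-words cosine score in place of real embeddings and a simple term-coverage score in place of a learned re-ranker; every name in it (`Chunk`, `chunk`, `retrieve`, `rerank`) is illustrative, not part of any particular stack.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str  # kept alongside the text so answers can cite their source
    text: str


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    pieces = []
    for start in range(0, len(words), size - overlap):
        pieces.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return pieces


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[Chunk], k: int = 3) -> list[Chunk]:
    """First stage: cheap similarity over every chunk, keep the top k."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda c: cosine(q, Counter(tokenize(c.text))), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: list[Chunk]) -> list[Chunk]:
    """Second stage: re-score only the shortlist (here: query-term coverage)."""
    q_terms = set(tokenize(query))

    def coverage(c: Chunk) -> float:
        return len(q_terms & set(tokenize(c.text))) / len(q_terms) if q_terms else 0.0

    return sorted(candidates, key=coverage, reverse=True)


# Hypothetical two-document knowledge base.
docs = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes three to five business days.",
}
index = [Chunk(doc_id, piece) for doc_id, text in docs.items() for piece in chunk(text)]
query = "how long do refunds take"
top = rerank(query, retrieve(query, index))
print(top[0].doc_id, "->", top[0].text)
```

Keeping `doc_id` on every chunk is what makes citations possible later: the answer can point back to the document each retrieved passage came from.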

Agents & workflows

Tool-using agents that automate multi-step work: ticket triage, document processing, internal copilots. Agency is scoped tightly, with guardrails and a human in the loop where it matters.
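One way to picture tightly scoped agency is an allowlist of tools plus an approval hook for anything with side effects. The sketch below is an assumption-laden illustration: `Tool`, `ScopedRuntime`, and the ticket and refund tools are hypothetical names, and the `approve` callback stands in for a real human review step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    needs_approval: bool = False  # set True for side-effecting actions


class ScopedRuntime:
    """Executes only allowlisted tools; gated tools require human sign-off."""

    def __init__(self, tools: list[Tool], approve: Callable[[str, dict], bool]):
        self._tools = {t.name: t for t in tools}
        self._approve = approve  # in practice: a review queue or chat prompt

    def call(self, name: str, args: dict) -> str:
        tool = self._tools.get(name)
        if tool is None:
            return f"refused: '{name}' is not an allowed tool"
        if tool.needs_approval and not self._approve(name, args):
            return f"blocked: '{name}' was not approved by a human reviewer"
        return tool.run(args)


# Hypothetical tools for a ticket-triage agent.
lookup = Tool("lookup_ticket", lambda a: f"ticket {a['id']}: open")
refund = Tool("issue_refund", lambda a: f"refunded {a['amount']}", needs_approval=True)

# Stand-in reviewer that rejects everything; a real one would ask a person.
runtime = ScopedRuntime([lookup, refund], approve=lambda name, args: False)

print(runtime.call("lookup_ticket", {"id": "T-1"}))    # read-only, runs directly
print(runtime.call("issue_refund", {"amount": 20}))    # gated, blocked here
print(runtime.call("delete_database", {}))             # not on the allowlist
```

The model proposes tool calls, but the runtime decides what actually executes, so the guardrail does not depend on the prompt being followed.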

Fine-tuning & adaptation

When prompting is not enough, I fine-tune or adapt open models on your domain data — on infrastructure you control.

Evaluation & guardrails

Every system ships with an eval harness: golden datasets, regression tests, and production monitoring for hallucination, cost, and latency.

Typical outcomes

A support copilot that deflects 40%+ of tier-1 tickets with cited answers.
A document-processing pipeline that cuts manual handling from hours to seconds.
An internal RAG assistant deployed in your EU cloud region, GDPR-clean.

How I build

# A grounded answer is only as good as its evaluation.
# Every RAG engagement ships with a regression eval suite.
from nicojahn.eval import GoldenSet, score

results = score(
    system="support-copilot",
    dataset=GoldenSet.load("tier1-tickets-v3"),
    metrics=["faithfulness", "answer_relevance", "citation_accuracy"],
)
assert results.faithfulness > 0.95  # gate the deploy on quality

I default to the most capable model for each task and keep the architecture provider-flexible, so you are never locked in to a single vendor.
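A minimal sketch of what provider flexibility can look like in code, assuming application logic depends only on a small interface and each vendor gets its own adapter. `ChatModel`, `EchoModel`, and `UppercaseModel` are hypothetical stand-ins; a real adapter would wrap a vendor SDK behind the same method.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the application is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider A (illustrative only)."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """Stand-in provider B (illustrative only)."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def answer(question: str, model: ChatModel) -> str:
    # Application code never imports a vendor SDK directly.
    return model.complete(f"Answer concisely: {question}")


# Swapping providers is a one-argument change at the call site.
print(answer("hi", EchoModel()))
print(answer("hi", UppercaseModel()))
```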
