LLM & Generative AI
RAG systems, autonomous agents, fine-tuning, and rigorous evaluation — generative AI built for production, not demos.
I design and ship generative-AI systems that hold up under real traffic, real data, and real compliance review.
Where I help
Retrieval-augmented generation (RAG)
Grounded answers over your own knowledge base — with citations, access control, and evaluation. I build the ingestion, chunking, retrieval, and re-ranking stack, then prove quality with offline and online metrics.
Agents & workflows
Tool-using agents that automate multi-step work: ticket triage, document processing, internal copilots. Agency is scoped tightly, with guardrails and a human in the loop where it matters.
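One way to read "scoped tightly" is an explicit tool allowlist plus a human approval step for risky actions. The sketch below is a hypothetical illustration, not a production implementation: `ToolPolicy`, `run_tool`, and the two example tools are names assumed for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class ToolPolicy:
    """Hypothetical policy: which tools an agent may call, and which need sign-off."""
    allowed: Set[str]
    needs_approval: Set[str] = field(default_factory=set)


def run_tool(
    name: str,
    args: dict,
    tools: Dict[str, Callable[..., str]],
    policy: ToolPolicy,
    approve: Callable[[str, dict], bool],
) -> str:
    """Run a tool only if the policy allows it and, where required, a human approves."""
    if name not in policy.allowed or name not in tools:
        return f"blocked: {name} is not an allowed tool"
    if name in policy.needs_approval and not approve(name, args):
        return f"rejected: {name} was not approved"
    return tools[name](**args)


# Illustrative tools for a ticket-triage agent (assumed, not from a real system).
tools = {
    "label_ticket": lambda ticket_id, label: f"ticket {ticket_id} labeled {label}",
    "issue_refund": lambda ticket_id, amount: f"refunded {amount} on ticket {ticket_id}",
}

# Labeling is low risk; refunds require a human in the loop.
policy = ToolPolicy(
    allowed={"label_ticket", "issue_refund"},
    needs_approval={"issue_refund"},
)


def deny_all(name: str, args: dict) -> bool:
    """Stand-in for a real approval UI: rejects every request."""
    return False
```

With this policy, a labeling call runs directly, a refund waits on the `approve` callback, and any tool outside the allowlist is refused before it executes.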
Fine-tuning & adaptation
When prompting is not enough, I fine-tune or adapt open models on your domain data — on infrastructure you control.
Evaluation & guardrails
Every system ships with an eval harness: golden datasets, regression tests, and production monitoring for hallucination, cost, and latency.
Typical outcomes
- A support copilot that deflects 40%+ of tier-1 tickets with cited answers.
- A document-processing pipeline that cuts manual handling from hours to seconds.
- An internal RAG assistant deployed in your EU cloud region, GDPR-clean.
How I build
# A grounded answer is only as good as its evaluation.
# Every RAG engagement ships with a regression eval suite.
from nicojahn.eval import GoldenSet, score
results = score(
    system="support-copilot",
    dataset=GoldenSet.load("tier1-tickets-v3"),
    metrics=["faithfulness", "answer_relevance", "citation_accuracy"],
)
assert results.faithfulness > 0.95  # gate the deploy on quality

I default to the most capable models for the task and keep the architecture provider-flexible, so you are never locked to one vendor.
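Provider flexibility can be sketched as application code that depends on a small interface rather than on any vendor SDK. This is a minimal illustration under stated assumptions: `ChatModel`, `EchoModel`, and `answer` are hypothetical names, and `EchoModel` is a stand-in where a real adapter would wrap a vendor client.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider for illustration; a real adapter would call a vendor SDK here."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def complete(self, prompt: str) -> str:
        return f"{self.prefix}: {prompt}"


def answer(question: str, model: ChatModel) -> str:
    """Application code depends only on the ChatModel interface, not on a vendor."""
    return model.complete(question)


# Swapping providers means passing a different adapter; answer() does not change.
reply_a = answer("What is our refund policy?", EchoModel("provider-a"))
reply_b = answer("What is our refund policy?", EchoModel("provider-b"))
```

Because `answer` only sees the interface, changing vendors is a matter of writing one new adapter class instead of touching the application logic.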