AI Engineer — inference, backend, infrastructure
I build production systems for serving and orchestrating large language models — from low-level transformer internals and agentic workflows with MCP to distributed inference platforms and GPU-optimized Kubernetes infrastructure.
I work at the intersection of LLM inference systems, distributed backends, and cloud-native infrastructure. My focus is taking models from research to production — building the serving layers, orchestration pipelines, and GPU infrastructure that make LLM systems work at scale.
Complete transformer from scratch — tokenization, attention, RoPE, ALiBi, beam search, nucleus sampling. Pure NumPy, 148+ passing tests.
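For flavor, a minimal NumPy sketch of nucleus (top-p) sampling, the last item on that list; the function name and signature are illustrative, not the repo's API.

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token id from the smallest set of tokens whose
    cumulative probability reaches p (top-p / nucleus sampling)."""
    rng = rng or np.random.default_rng()
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort descending and find the nucleus cutoff.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    # Renormalize within the nucleus and sample.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```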
Production RAG platform with hybrid retrieval, cross-encoder reranking, multi-LLM routing, and RAGAS evaluation.
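Hybrid retrieval needs a way to merge lexical and dense result lists; reciprocal rank fusion is one standard choice (an assumption here, not necessarily what the platform uses):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document ids (e.g. BM25 results
    and dense-vector results) via reciprocal rank fusion (RRF)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
```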
High-performance Go backend for LLM routing with semantic caching, multi-provider failover, and token-based rate limiting.
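Assuming "token-based rate limiting" means budgeting LLM tokens per client, the core is a token bucket debited by tokens consumed. A sketch of the idea (in Python for consistency with the other examples here; the service itself is Go):

```python
import time

class TokenBudget:
    """Token-bucket limiter where capacity is measured in LLM tokens:
    the bucket refills at a fixed rate and each request debits its
    prompt + completion token count. Illustrative sketch only."""

    def __init__(self, capacity: int, tokens_per_second: float):
        self.capacity = capacity
        self.rate = tokens_per_second
        self.available = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, tokens_requested: int) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.available = min(
            self.capacity,
            self.available + (now - self.last_refill) * self.rate,
        )
        self.last_refill = now
        if tokens_requested <= self.available:
            self.available -= tokens_requested
            return True
        return False
```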
Multi-agent system with ReAct and plan-and-execute patterns, MCP server integration for external tools, multi-tier memory, and sandboxed execution.
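The ReAct pattern alternates model reasoning with tool calls until the model commits to an answer. A schematic loop, with `llm` and `tools` as stand-ins rather than a real API:

```python
def react_loop(llm, tools: dict, task: str, max_steps: int = 8) -> str:
    """Schematic ReAct loop: the model emits either a tool call or a
    final answer; tool observations are appended back into context.
    `llm` and `tools` are placeholders, not the system's actual API."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # assumed to return an action dict
        if step["type"] == "final":
            return step["answer"]
        # Execute the requested tool and feed the observation back.
        observation = tools[step["tool"]](**step["args"])
        transcript += f"Action: {step['tool']}({step['args']})\n"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted."
```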
End-to-end LoRA/QLoRA fine-tuning with experiment tracking via MLflow, evaluation suite, and auto-deployment to vLLM.
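With the Hugging Face `peft` library, attaching LoRA adapters takes a config and a wrapper call; the base model and hyperparameters below are illustrative defaults, not the pipeline's actual settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapters are trainable
```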
GPU-optimized K8s infrastructure with KServe model serving, Ray for distributed workloads, DCGM monitoring, and GitOps.
Systematic comparison of 6 attention mechanisms across 3 rounds of experiments in a zero-dependency GPT implementation (~12K params). Tested gated value attention, relational scoring, salience weighting, context-augmented attention, and difference attention. The gated-context variant consistently outperformed standard attention.
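As a rough illustration of the winning family, here is one plausible shape of a gated, context-augmented attention head in NumPy; the experiments' exact formulation may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_context_attention(x, Wq, Wk, Wv, Wg):
    """One head of gated, context-augmented attention (illustrative:
    a learned sigmoid gate mixes each token's attention output with a
    mean-pooled global context vector). x: (seq, d_model); weights:
    (d_model, d_head). Causal masking omitted for brevity."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attended = softmax(scores) @ v                  # (seq, d_head)
    context = v.mean(axis=0, keepdims=True)         # pooled context
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))          # sigmoid gate
    return gate * attended + (1.0 - gate) * context
```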
More notes on inference optimization, system design, and LLM internals.
Open to discussions of system design and backend architecture, open-source collaboration, and code reviews.