AI Engineer — inference, backend, infrastructure
I build production systems for serving and orchestrating large language models — from low-level transformer internals and agentic workflows with MCP to distributed inference platforms and GPU-optimized Kubernetes infrastructure.
I work at the intersection of LLM inference systems, distributed backends, and cloud-native infrastructure. My focus is taking models from research to production — building the serving layers, orchestration pipelines, and GPU infrastructure that make LLM systems work at scale.
Complete transformer from scratch — tokenization, attention, RoPE, ALiBi, beam search, nucleus sampling. Pure NumPy, 148+ passing tests.
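For flavor, a minimal NumPy sketch of nucleus (top-p) sampling, the last item on that list; the function name and signature are illustrative, not the repo's API.

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token id from the smallest set of tokens whose
    cumulative probability reaches p (top-p / nucleus sampling)."""
    rng = rng or np.random.default_rng()
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort descending and find the nucleus cutoff.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    # Renormalize within the nucleus and sample.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```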
Production RAG platform with hybrid retrieval, cross-encoder reranking, multi-LLM routing, and RAGAS evaluation.
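Hybrid retrieval needs a way to merge lexical and dense result lists; reciprocal rank fusion is one standard choice (an assumption here, not necessarily what the platform uses):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document ids (e.g. BM25 results
    and dense-vector results) via reciprocal rank fusion (RRF)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
```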
High-performance Go backend for LLM routing with semantic caching, multi-provider failover, and token-based rate limiting.
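Assuming "token-based rate limiting" means budgeting LLM tokens per client, the core is a token bucket debited by tokens consumed. A sketch of the idea (in Python for consistency with the other examples here; the service itself is Go):

```python
import time

class TokenBudget:
    """Token-bucket limiter where capacity is measured in LLM tokens:
    the bucket refills at a fixed rate and each request debits its
    prompt + completion token count. Illustrative sketch only."""

    def __init__(self, capacity: int, tokens_per_second: float):
        self.capacity = capacity
        self.rate = tokens_per_second
        self.available = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, tokens_requested: int) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.available = min(
            self.capacity,
            self.available + (now - self.last_refill) * self.rate,
        )
        self.last_refill = now
        if tokens_requested <= self.available:
            self.available -= tokens_requested
            return True
        return False
```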
Multi-agent system with ReAct and plan-and-execute patterns, MCP server integration for external tools, multi-tier memory, and sandboxed execution.
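The ReAct pattern alternates model reasoning with tool calls until the model commits to an answer. A schematic loop, with `llm` and `tools` as stand-ins rather than a real API:

```python
def react_loop(llm, tools: dict, task: str, max_steps: int = 8) -> str:
    """Schematic ReAct loop: the model emits either a tool call or a
    final answer; tool observations are appended back into context.
    `llm` and `tools` are placeholders, not the system's actual API."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # assumed to return an action dict
        if step["type"] == "final":
            return step["answer"]
        # Execute the requested tool and feed the observation back.
        observation = tools[step["tool"]](**step["args"])
        transcript += f"Action: {step['tool']}({step['args']})\n"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted."
```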
End-to-end LoRA/QLoRA fine-tuning with experiment tracking via MLflow, evaluation suite, and auto-deployment to vLLM.
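With the Hugging Face `peft` library, attaching LoRA adapters takes a config and a wrapper call; the base model and hyperparameters below are illustrative defaults, not the pipeline's actual settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapters are trainable
```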
GPU-optimized K8s infrastructure with KServe model serving, Ray for distributed workloads, DCGM monitoring, and GitOps.
Systematic comparison of 6 attention mechanisms across 3 rounds of experiments in a zero-dependency GPT implementation (~12K params). Tested gated value attention, relational scoring, salience weighting, context-augmented attention, and difference attention. The gated-context variant consistently outperformed standard attention.
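As a rough illustration of the winning family, here is one plausible shape of a gated, context-augmented attention head in NumPy; the experiments' exact formulation may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_context_attention(x, Wq, Wk, Wv, Wg):
    """One head of gated, context-augmented attention (illustrative:
    a learned sigmoid gate mixes each token's attention output with a
    mean-pooled global context vector). x: (seq, d_model); weights:
    (d_model, d_head). Causal masking omitted for brevity."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attended = softmax(scores) @ v                  # (seq, d_head)
    context = v.mean(axis=0, keepdims=True)         # pooled context
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))          # sigmoid gate
    return gate * attended + (1.0 - gate) * context
```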
More notes on inference optimization, system design, and LLM internals.
Open to discussions of system design and backend architecture, open-source collaboration, and code reviews.