Sanket Nyayadhish

AI Engineer — inference, backend, infrastructure

I build production systems for serving and orchestrating large language models — from low-level transformer internals to agentic workflows with MCP, distributed inference platforms, and GPU-optimized Kubernetes infrastructure.

About

I work at the intersection of LLM inference systems, distributed backends, and cloud-native infrastructure. My focus is taking models from research to production — building the serving layers, orchestration pipelines, and GPU infrastructure that make LLM systems work at scale.

Currently working on

  • Low-latency backends for LLM and GenAI workloads
  • LLM serving infrastructure (vLLM, TensorRT-LLM, KV cache optimization), sketched below
  • GPU workload orchestration on Kubernetes
  • Agentic AI backends with multi-agent orchestration and tool use
  • MCP (Model Context Protocol) servers and tool integrations
  • Observability for AI systems (latency, drift, GPU utilization)
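
As a small illustration of the serving-layer work above, here is a minimal offline-inference sketch using vLLM's Python API. The model name, memory fraction, and sampling settings are illustrative placeholders, not a specific deployment:

```python
# Minimal vLLM offline-inference sketch; model and settings are placeholders.
from vllm import LLM, SamplingParams

# vLLM's PagedAttention allocates the KV cache in fixed-size blocks,
# letting many concurrent sequences share GPU memory without fragmentation.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,  # fraction of VRAM for weights + KV cache
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = [
    "Explain KV cache paging in one paragraph.",
    "Why does continuous batching improve GPU utilization?",
]

# generate() schedules all prompts with continuous batching, so new
# sequences fill GPU slots as earlier ones finish.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```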

AI / ML

PyTorch, Transformers, LangChain, vLLM, PEFT, LlamaIndex, TGI, TensorRT-LLM, MLflow, MCP, CrewAI, LangGraph

Backend

Go, Python, FastAPI, gRPC, Kafka, Redis, PostgreSQL, OpenTelemetry

Infrastructure

Kubernetes, Terraform, Docker, ArgoCD, KServe, Ray, AWS, Helm

Projects

Research & Notes

Exploring Attention Variants in Minimal GPT

A systematic comparison of 6 attention mechanisms across 3 rounds of experiments in a zero-dependency GPT implementation (~12K parameters). Variants tested include gated value attention, relational scoring, salience weighting, context-augmented attention, and difference attention. The gated-context variant consistently outperformed standard attention.

Code & Results
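
The linked repo has the exact implementations. As a rough sketch of the general idea, a gated-context head might modulate a standard causal-attention output with a gate computed from a cheap running-mean summary of the sequence; the class name, the running-mean context, and the gate placement below are all assumptions for illustration, not the project's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedContextAttention(nn.Module):
    """Illustrative single-head 'gated context' variant (assumed form):
    standard causal attention whose output is modulated by a sigmoid
    gate computed from the token plus a running-mean context vector."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(2 * d_model, d_model)  # (token, context) -> gate
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # standard causal scaled dot-product attention
        scores = (q @ k.transpose(-2, -1)) * self.scale
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        out = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

        # causal running mean of the input as a cheap context summary
        ctx = x.cumsum(dim=1) / torch.arange(1, T + 1, device=x.device).view(1, T, 1)

        # gate the attention output on (token, context)
        g = torch.sigmoid(self.gate(torch.cat([x, ctx], dim=-1)))
        return g * out
```

The gate could equally sit on the values or the attention logits; the experiments in the repo are the source of truth for the exact formulation.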

More notes on inference optimization, system design, and LLM internals.

Connect

Open to discussions on system design and backend architecture, open-source collaboration, and code reviews.