
Production · Days 46-54

Inference Engineering

Inference engineering is where model quality meets infrastructure reality: choosing between hosted APIs and self-hosting, serving stacks such as vLLM, TGI, SGLang, Ollama, and llama.cpp, and the quantization, caching, batching, and cost-performance trade-offs that come with each.

Advanced · 8 subtopics · 9 daily blocks

Outcome

Understand serving choices, quantization, caching, batching, GPU economics, latency, throughput, and provider trade-offs.

Practice builds

Inference benchmark dashboard
Provider latency comparison
Local model playground

What to learn

API-based vs self-hosted inference
Inference servers: vLLM, TGI, SGLang, Ollama, llama.cpp
Quantization: GGUF, AWQ, GPTQ, INT4 and INT8
KV cache optimization, prefix caching, speculative decoding
Continuous batching and throughput tuning
Latency vs throughput trade-offs: TTFT and TPS (see the measurement sketch after this list)
GPU economics, edge inference, on-device models
Inference providers: Together, Fireworks, Groq, Cerebras, Replicate, Bedrock
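
TTFT and tokens per second only really click once you measure them yourself. Below is a minimal probe sketch, assuming an OpenAI-compatible streaming endpoint (which vLLM, TGI, Ollama, and most hosted providers expose) and the openai Python client; the base_url, api_key, and model name are placeholders for whatever you are serving.

# Minimal TTFT / tokens-per-second probe against an OpenAI-compatible endpoint.
# base_url, api_key, and the model name are placeholders; point them at your server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-for-local")

def measure(prompt: str, model: str = "my-served-model") -> None:
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first content chunk arrived
            chunks += 1
    end = time.perf_counter()
    if first_token_at is None:
        print("no tokens returned")
        return
    # Streamed chunks roughly track output tokens; read the server's usage
    # stats if you need exact counts.
    tps = chunks / (end - first_token_at) if end > first_token_at else 0.0
    print(f"TTFT {first_token_at - start:.3f}s | ~{tps:.1f} tok/s | total {end - start:.3f}s")

measure("Explain the difference between TTFT and tokens per second.")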

Daily study plan

Day 46: Compare the constraints of API-based and self-hosted inference.
Day 47: Run a small local model with Ollama or llama.cpp.
Day 48: Learn quantization formats and measure quality/latency trade-offs.
Day 49: Study vLLM or TGI serving architecture.
Day 50: Measure TTFT, tokens per second, and total latency.
Day 51: Test prefix caching or prompt reuse patterns.
Day 52: Explore continuous batching and throughput tuning.
Day 53: Compare provider pricing and latency for one workload.
Day 54: Write an inference decision matrix (a cost and latency comparison sketch follows this plan).
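
For Days 53 and 54, the decision matrix boils down to a small amount of arithmetic once you have measured numbers. The sketch below is one way to lay it out; every price, latency, and throughput figure is a hypothetical placeholder, and the self-hosted estimate ignores idle GPU time, so substitute each provider's published pricing and your own measurements from Days 50-53.

# Cost-and-latency comparison for one workload. All figures below are
# hypothetical placeholders, not real provider quotes.
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    input_per_mtok: float    # USD per 1M input tokens (0 for self-hosted)
    output_per_mtok: float   # USD per 1M output tokens (0 for self-hosted)
    gpu_usd_per_hour: float  # 0 for hosted APIs
    ttft_s: float            # median time-to-first-token you measured
    tps: float               # median output tokens/second you measured

IN_TOK, OUT_TOK, REQS_PER_DAY = 3_000, 500, 50_000  # the workload under test

def daily_cost(o: Option) -> float:
    if o.gpu_usd_per_hour:
        # Self-hosted: GPU-hours needed to generate the day's output tokens at
        # the measured throughput (ignores prefill and idle time, so treat it
        # as a lower bound).
        gen_hours = REQS_PER_DAY * OUT_TOK / o.tps / 3600
        return gen_hours * o.gpu_usd_per_hour
    token_cost = IN_TOK / 1e6 * o.input_per_mtok + OUT_TOK / 1e6 * o.output_per_mtok
    return token_cost * REQS_PER_DAY

options = [
    Option("provider-a", 0.50, 1.50, 0.0, 0.40, 90.0),   # hypothetical figures
    Option("provider-b", 0.20, 0.80, 0.0, 0.90, 45.0),   # hypothetical figures
    Option("self-hosted", 0.0, 0.0, 2.50, 0.25, 60.0),   # hypothetical figures
]

for o in options:
    print(f"{o.name:12s} ${daily_cost(o):>10,.2f}/day  "
          f"TTFT {o.ttft_s:.2f}s  gen {OUT_TOK / o.tps:.1f}s/req")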

Resources