
Production · Days 46-54

Inference Engineering

Inference engineering is where model quality meets infrastructure reality: choosing between hosted APIs and self-hosting, serving stacks such as vLLM, TGI, SGLang, Ollama, and llama.cpp, and the quantization, caching, batching, and cost-performance trade-offs that come with each.

Advanced · 8 subtopics · 9 daily blocks

Outcome

Understand serving choices, quantization, caching, batching, GPU economics, latency, throughput, and provider trade-offs.

Practice builds

Inference benchmark dashboard
Provider latency comparison
Local model playground

What to learn

API-based vs self-hosted inference
Inference servers: vLLM, TGI, SGLang, Ollama, llama.cpp
Quantization: GGUF, AWQ, GPTQ, INT4 and INT8
KV cache optimization, prefix caching, speculative decoding
Continuous batching and throughput tuning
Latency vs throughput trade-offs: TTFT and TPS (see the measurement sketch after this list)
GPU economics, edge inference, on-device models
Inference providers: Together, Fireworks, Groq, Cerebras, Replicate, Bedrock
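
TTFT and tokens per second only really click once you measure them yourself. Below is a minimal probe sketch, assuming an OpenAI-compatible streaming endpoint (which vLLM, TGI, Ollama, and most hosted providers expose) and the openai Python client; the base_url, api_key, and model name are placeholders for whatever you are serving.

# Minimal TTFT / tokens-per-second probe against an OpenAI-compatible endpoint.
# base_url, api_key, and the model name are placeholders; point them at your server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-for-local")

def measure(prompt: str, model: str = "my-served-model") -> None:
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first content chunk arrived
            chunks += 1
    end = time.perf_counter()
    if first_token_at is None:
        print("no tokens returned")
        return
    # Streamed chunks roughly track output tokens; read the server's usage
    # stats if you need exact counts.
    tps = chunks / (end - first_token_at) if end > first_token_at else 0.0
    print(f"TTFT {first_token_at - start:.3f}s | ~{tps:.1f} tok/s | total {end - start:.3f}s")

measure("Explain the difference between TTFT and tokens per second.")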

Daily study plan

Day 46: Compare the constraints of API-based and self-hosted inference.
Day 47: Run a small local model with Ollama or llama.cpp.
Day 48: Learn quantization formats and measure quality/latency trade-offs.
Day 49: Study vLLM or TGI serving architecture.
Day 50: Measure TTFT, tokens per second, and total latency.
Day 51: Test prefix caching or prompt reuse patterns.
Day 52: Explore continuous batching and throughput tuning.
Day 53: Compare provider pricing and latency for one workload.
Day 54: Write an inference decision matrix (a cost and latency comparison sketch follows this plan).
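
For Days 53 and 54, the decision matrix boils down to a small amount of arithmetic once you have measured numbers. The sketch below is one way to lay it out; every price, latency, and throughput figure is a hypothetical placeholder, and the self-hosted estimate ignores idle GPU time, so substitute each provider's published pricing and your own measurements from Days 50-53.

# Cost-and-latency comparison for one workload. All figures below are
# hypothetical placeholders, not real provider quotes.
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    input_per_mtok: float    # USD per 1M input tokens (0 for self-hosted)
    output_per_mtok: float   # USD per 1M output tokens (0 for self-hosted)
    gpu_usd_per_hour: float  # 0 for hosted APIs
    ttft_s: float            # median time-to-first-token you measured
    tps: float               # median output tokens/second you measured

IN_TOK, OUT_TOK, REQS_PER_DAY = 3_000, 500, 50_000  # the workload under test

def daily_cost(o: Option) -> float:
    if o.gpu_usd_per_hour:
        # Self-hosted: GPU-hours needed to generate the day's output tokens at
        # the measured throughput (ignores prefill and idle time, so treat it
        # as a lower bound).
        gen_hours = REQS_PER_DAY * OUT_TOK / o.tps / 3600
        return gen_hours * o.gpu_usd_per_hour
    token_cost = IN_TOK / 1e6 * o.input_per_mtok + OUT_TOK / 1e6 * o.output_per_mtok
    return token_cost * REQS_PER_DAY

options = [
    Option("provider-a", 0.50, 1.50, 0.0, 0.40, 90.0),   # hypothetical figures
    Option("provider-b", 0.20, 0.80, 0.0, 0.90, 45.0),   # hypothetical figures
    Option("self-hosted", 0.0, 0.0, 2.50, 0.25, 60.0),   # hypothetical figures
]

for o in options:
    print(f"{o.name:12s} ${daily_cost(o):>10,.2f}/day  "
          f"TTFT {o.ttft_s:.2f}s  gen {OUT_TOK / o.tps:.1f}s/req")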

Resources