Instruct.KR Summer 2025 Meetup
Open-Source LLM, to Production with vLLM
Hyogeun Oh (Zerohertz)
qMachine Learning Engineer
qWorks with Python and Kubernetes; builds MLOps pipelines
qNeovim user
qWrites and speaks about ML
[Figure: GitHub profile][1]
[1] https://github.com/Zerohertz
0. About the speaker
1. Introduction
2. OpenAI-Compatible Server
3. Architecture
4. Production Deployment
5. Wrap-up
PART 1. Introduction
Why Self-Host LLMs When Powerful Commercial APIs Exist?
qCost: at sustained volume, self-hosting can undercut per-token API pricing
qSecurity/privacy: prompts and data never leave your own infrastructure
qCustomization: fine-tuned weights, custom templates, full control of the serving stack
qAPI constraints: rate limits, latency variance, vendor lock-in
[Figure: commercial LLM comparison chart][2]
[2] https://artificialanalysis.ai/
Why Self-Host LLMs When Powerful Commercial APIs Exist?
qWhy not just use transformers' AutoModelForCausalLM? (see the sketch below)
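For contrast, a minimal sketch of that baseline (the model name is an illustrative assumption): it works, but serves one request at a time with no batching and no paged KV-cache management — exactly the gap a serving engine fills.

```python
# A minimal sketch of the "do it yourself" baseline: plain transformers,
# one request at a time, no batching, no paged KV cache management.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").eval()

inputs = tokenizer("Why self-host LLMs?", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```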
Why vLLM?
qvLLM history
§ Feb 9, 2023: first commit pushed to GitHub under the name CacheFlow[3]
§ Jun 17, 2023: the project is renamed to vLLM[4]
§ Sep 12, 2023: UC Berkeley Sky Computing Lab publishes “Efficient Memory Management for Large Language Model Serving with PagedAttention”[6]
§ Jun 19, 2025: the repository passes 50k GitHub stars
§ Jul 24, 2025: v0.10.0 released
qLicense: Apache-2.0[5]
qThe “v” in vLLM[7]
[3] https://github.com/vllm-project/vllm/commit/e7d9d9c08c79b386f6d0477e87b77a572390317d
[4] https://github.com/vllm-project/vllm/commit/0b98ba15c744f1dfb0ea4f2135e85ca23d572ae1
[5] https://github.com/vllm-project/vllm/blob/main/LICENSE
[6] https://arxiv.org/abs/2309.06180
[7] https://github.com/vllm-project/vllm/issues/835
Why vLLM?
[Figure: GitHub star history of vLLM][8]
[8] https://www.star-history.com/
Why vLLM?
qFast LLM serving
§ State-of-the-art throughput
§ Continuous batching of incoming requests
qMemory and kernel optimizations
§ PagedAttention[6]: efficient management of attention key/value cache memory
§ Optimized CUDA kernels: FlashAttention[9], FlashInfer[10]
§ Chunked prefill[11], speculative decoding[12]
qFlexible and easy to use
§ Seamless integration with HuggingFace models
§ Quantization support: GPTQ, AWQ, INT4/INT8/FP8
§ Parallelism: tensor, pipeline (via Ray), data, and expert parallelism
qEcosystem-ready API
§ OpenAI-Compatible API: plugs into the AI ecosystem (e.g., LangChain, Gemini CLI, …)
§ Streaming output
§ Prefix caching, Multi-LoRA
[6] https://arxiv.org/abs/2309.06180
[9] https://github.com/vllm-project/flash-attention
[10] https://github.com/flashinfer-ai/flashinfer
[11] https://docs.vllm.ai/en/v0.9.2/configuration/optimization.html#chunked-prefill_1
[12] https://docs.vllm.ai/en/v0.9.2/features/spec_decode.html
How to Serve an LLM with vLLM
qInstallation
§ Local: pip install vllm (CPU or GPU build)
§ Official Docker image for GPU deployments[13]
qStart an OpenAI-compatible server with a single command: vllm serve (see the sketch below)
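A minimal sketch of both entry points (the model name is an illustrative choice):

```python
# Online serving — an OpenAI-compatible HTTP server on port 8000:
#   vllm serve Qwen/Qwen3-0.6B
# Offline batch inference via the Python API:
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")
params = SamplingParams(temperature=0.7, max_tokens=64)
for out in llm.generate(["What is PagedAttention?"], params):
    print(out.outputs[0].text)
```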
[13] https://docs.vllm.ai/en/v0.9.2/deployment/docker.html
PART 2. OpenAI-Compatible Server
OpenAI API Spec[14]
q/v1/models[15]: lists the models (and adapters) the server exposes
q/v1/chat/completions[16]: OpenAI-style chat completion requests (see the sketch below)
[14] https://docs.vllm.ai/en/v0.9.2/serving/openai_compatible_server.html
[15] https://platform.openai.com/docs/api-reference/models/list
[16] https://platform.openai.com/docs/api-reference/chat/create
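A sketch of hitting both endpoints with the official openai SDK; the base_url and the dummy API key follow vLLM's conventions.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

print(client.models.list())  # GET /v1/models

resp = client.chat.completions.create(  # POST /v1/chat/completions
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```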
Tool calling
qFollows OpenAI's function definition spec (FunctionDefinition)[17]
[17] https://github.com/openai/openai-python/blob/v1.97.1/src/openai/types/shared_params/function_definition.py#L13-L45
Tool calling
qEnable tool calling with “--enable-auto-tool-choice” and a model-specific “--tool-call-parser”[18, 19] (see the sketch below)
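A sketch of a tool-calling request; the hermes parser choice follows the docs above for Qwen-family models, and the get_weather tool is a hypothetical example.

```python
# Assumes the server was started with:
#   vllm serve Qwen/Qwen3-0.6B --enable-auto-tool-choice --tool-call-parser hermes
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Weather in Seoul?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```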
[18] https://docs.vllm.ai/en/v0.9.2/features/tool_calling.html
[19] https://qwen.readthedocs.io/en/latest/deployment/vllm.html#parsing-tool-calls
Reasoning
qToggle reasoning per request via “enable_thinking” in “chat_template_kwargs” (see the sketch below)
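A sketch of disabling thinking for a single request (a Qwen3-style template is assumed); vLLM forwards "chat_template_kwargs" from extra_body into the chat template.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```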
Reasoning
qExtract reasoning traces with a model-specific “--reasoning-parser”[20, 21] (see the sketch below)
qNote: the older “--enable-reasoning” flag is deprecated[22]
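A sketch of consuming parsed reasoning output; the parser name is an assumption that depends on the model and vLLM version.

```python
# Assumes a matching parser, e.g.:
#   vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
msg = resp.choices[0].message
print(msg.reasoning_content)  # the extracted <think>...</think> content
print(msg.content)            # the final answer
```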
[20] https://docs.vllm.ai/en/v0.9.2/features/reasoning_outputs.html
[21] https://qwen.readthedocs.io/en/latest/deployment/vllm.html#parsing-thinking-content
[22] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/engine/arg_utils.py#L626-L634
Chat Template
qEach model ships its own “chat_template” inside tokenizer_config.json (see the sketch below)
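A sketch of rendering a chat template outside the server — useful for inspecting the exact prompt string the model receives.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn header
)
print(prompt)
```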
[23] https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/tokenizer_config.json#L230
Chat Template
qOverride the default with “--chat-template” and a Jinja template file[24]; vLLM ships example templates[25]
[24] https://docs.vllm.ai/en/v0.9.2/serving/openai_compatible_server.html#chat-template_1
[25] https://github.com/vllm-project/vllm/tree/main/examples
PART 3. Architecture
KV Cache, PagedAttention
qKV Cache[26]
§ Naive autoregressive decoding recomputes attention over the whole token sequence at every step
→ time complexity: O(n²)
§ Caching each step's Key/Value projections (KV cache) removes the recomputation (see the sketch after this list)
→ time complexity: O(n) per generated token
qPagedAttention[6, 27, 28]
§ Borrows the virtual memory and paging idea from operating systems (OS)
§ Contiguous per-request KV cache allocation wastes memory through fragmentation
→ internal/external fragmentation of GPU memory
§ KV cache is split into fixed-size blocks; a per-request block table preserves logical continuity
§ Page-table-style mapping lets blocks live in non-contiguous physical memory
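A minimal sketch of the KV-cache idea using transformers' past_key_values (model name illustrative): after the first step only the newest token is fed, and the cached K/V make each decode step O(n) instead of a full O(n²) recomputation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
past = None
for _ in range(8):
    with torch.no_grad():
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
    past = out.past_key_values                       # cached K/V so far
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```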
[Figure: attention as Q × Kᵀ over per-token Query, Key, and Value vectors][29][6]
[26] https://huggingface.co/blog/not-lain/kv-caching
[6] https://arxiv.org/abs/2309.06180
[27] https://docs.vllm.ai/en/v0.9.2/design/kernel/paged_attention.html
[28] https://blog.vllm.ai/2023/06/20/vllm.html
[29] https://github.com/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/Chapter%203%20-%20Looking%20Inside%20LLMs.ipynb
V0 Engine vs. V1 Engine
qOptimized Execution Loop & API Server: EngineCore and AsyncLLM run in separate processes, so CPU-heavy API-server work (tokenization, streaming) overlaps with GPU model execution
qSimple & Flexible Scheduler: no hard “prefill”/“decode” distinction; a per-step “{request_id: num_tokens}” token budget uniformly covers chunked prefill, prefix caching, and speculative decoding
qZero-Overhead Prefix Caching: hash-based lookup + LRU eviction; even at a 0% cache hit rate throughput drops by less than 1%, so it is enabled by default
qClean Architecture for Tensor-Parallel Inference: workers cache request state and receive only diffs, removing most scheduler-worker IPC overhead
qEfficient Input Preparation: a persistent batch caches input tensors and applies diffs; Numpy operations replace Python loops to cut CPU overhead
qtorch.compile & Piecewise CUDA Graphs: models are optimized via torch.compile with piecewise CUDA graphs, reducing per-model kernel customizing
qEnhanced Support for Multimodal LLMs: offloaded image preprocessing, image-hash-based prefix/KV caching, encoder cache, and chunked prefill for multimodal inputs
qFlashAttention 3: one attention kernel flexible enough for mixed prefill/decode batches
qSet VLLM_USE_V1=1 to opt in to the V1 Engine (see the sketch below)
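A sketch of opting in explicitly, relevant while V0 remained the fallback; the variable must be set before vllm is imported.

```python
import os

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B")  # now runs on the V1 engine
```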
[30] https://docs.vllm.ai/en/v0.9.2/design/arch_overview.html
[31] https://docs.vllm.ai/en/v0.9.2/usage/v1_guide.html
[32] https://github.com/vllm-project/vllm/issues/18571
[33] https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
Chat Completions
qHow vLLM processes a /v1/chat/completions request internally, step by step[34]
[34] https://zerohertz.github.io/vllm-openai-2/#Conclusion
PART 4. Production Deployment
LoRA Adapters
qLoRA (Low-Rank Adaptation)[36]
§ Freezes the base model weights and trains a pair of low-rank matrices (A, B)
§ The update ΔW = BA adapts the model to new data at a fraction of full fine-tuning cost
qStatic serving of LoRA adapters: register adapters at server startup[35]
qDynamic serving of LoRA adapters: load/unload adapters at runtime[35] (see the sketch below)
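A sketch of both modes; the adapter name and path are illustrative assumptions, and the dynamic endpoint requires VLLM_ALLOW_RUNTIME_LORA_UPDATING=True on the server.

```python
# Static — register adapters at startup:
#   vllm serve Qwen/Qwen3-0.6B --enable-lora --lora-modules my-adapter=/path/to/adapter
# Dynamic — load an adapter into a running server:
import requests

requests.post(
    "http://localhost:8000/v1/load_lora_adapter",
    json={"lora_name": "my-adapter", "lora_path": "/path/to/adapter"},
)

# Either way, clients select the adapter by its name in the model field:
#   client.chat.completions.create(model="my-adapter", messages=[...])
```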
[Figure: LoRA reparameterization][36]
[35] https://docs.vllm.ai/en/v0.9.2/features/lora.html
[36] https://arxiv.org/abs/2106.09685
Parallelism Strategies[37]
qTensor Parallelism (TP)
§ Shards each model layer's parameters across GPUs
→ Fits models too large for a single GPU within a single node
→ Also relieves per-GPU KV cache memory pressure
qPipeline Parallelism (PP)
§ Splits the model by layers across GPUs, each holding a contiguous stage
→ Scales the model across nodes
→ Works without sharding individual layer tensors
qExpert Parallelism (EP)
§ For Mixture of Experts (MoE) models
→ “--enable-expert-parallel” replaces tensor parallelism with expert parallelism for MoE layers
→ Distributes experts across GPUs
qData Parallelism (DP)
§ Replicates the whole model on each GPU group and splits request batches across replicas
→ Each replica must still fit on its GPUs
→ Scales throughput with the number of replicas (see the sketch after this list)
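A sketch of combining TP and PP for a 2-node × 8-GPU cluster (previewing the multi-node example below); the model name is an illustrative assumption.

```python
# Shell equivalent:
#   vllm serve meta-llama/Llama-3.1-405B-Instruct \
#     --tensor-parallel-size 8 --pipeline-parallel-size 2
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,    # shard each layer across a node's 8 GPUs
    pipeline_parallel_size=2,  # split layer stages across the 2 nodes
)
```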
[37] https://docs.vllm.ai/en/v0.9.2/configuration/optimization.html#parallelism-strategies
Multi-node Distributed Inference
qStart a Ray cluster across the nodes (e.g., with the run_cluster.sh helper)[38]
§ VLLM_HOST_IP[39]: the IP each vLLM node advertises to the others
§ GLOO_SOCKET_IFNAME[40]: network interface for the PyTorch Gloo backend
§ NCCL_IB_DISABLE[41]: disables InfiniBand (IB) transport in NCCL (see the sketch below)
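A sketch of the per-node environment; the IP and interface name are illustrative assumptions, and each node must be configured before Ray and vLLM start.

```python
import os

os.environ["VLLM_HOST_IP"] = "10.0.0.1"    # this node's reachable IP
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # NIC for the Gloo backend
os.environ["NCCL_IB_DISABLE"] = "0"        # 0 keeps InfiniBand enabled
```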
[Figure: Ray cluster topology — head node and worker nodes]
[38] https://docs.vllm.ai/en/v0.9.2/examples/online_serving/run_cluster.html
[39] https://docs.vllm.ai/en/v0.9.2/usage/security.html
[40] https://docs.pytorch.org/docs/stable/distributed.html#common-environment-variables
[41] https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-disable
Multi-node Distributed Inference
qMulti-Node Multi-GPU (tensor parallel plus pipeline parallel inference)[42, 43]
§ If your model is too large to fit in a single node, you can use tensor
parallel together with pipeline parallelism.
§ The tensor parallel size is the number of GPUs you want to use in
each node, and the pipeline parallel size is the number of nodes
you want to use.
§ E.g., if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set
the tensor parallel size to 8 and the pipeline parallel size to 2.
qTP 16 vs. TP 8 + PP 2
[42] https://docs.vllm.ai/en/v0.9.2/serving/distributed_serving.html
[43] https://blog.vllm.ai/2025/02/17/distributed-inference.html
Multi-node Distributed Inference
qRDMA (Remote Direct Memory Access)[44]
§ Lets one node read/write another node's memory directly, bypassing the remote CPU, caches, and extra data copies
§ E.g., InfiniBand, RoCE, iWARP
qInfiniBand[45]
§ Supports RDMA natively; moves data between node memories with minimal CPU involvement
§ Requires dedicated switches and network adapters (Host Channel Adapters, HCAs), separate from the Ethernet network architecture
qRoCEv2 (RDMA over Converged Ethernet v2)[44]
§ Runs the RDMA protocol over ordinary Ethernet networks
[44] https://zerohertz.github.io/distributed-computing-rdma-roce/
[45] https://developer.nvidia.com/gpudirect
Multi-node Distributed Inference
[Figure: GPUDirect RDMA configuration][46]
[46] https://docs.vllm.ai/en/v0.9.2/serving/distributed_serving.html#gpudirect-rdma
Production Deployment with GPU Cluster (Kubernetes)
qDeploying vLLM on Kubernetes[47]
§ Requires care around model storage, networking, IPC (shared memory), and GPU scheduling
qAIBrix[48, 49, 50, 51]
§ Cloud-native control plane for vLLM
§ LoRA management, LLM gateway routing, autoscaling, distributed KV cache, …
qProduction-stack[52, 53, 54, 55]
§ Reference codebase for running vLLM-based LLM services in production
§ KV cache sharing via LMCache, prefix-aware routing, Helm chart deployment, …
[47] https://docs.vllm.ai/en/v0.9.2/deployment/k8s.html
[48] https://github.com/vllm-project/aibrix
[49] https://arxiv.org/abs/2504.03648
[50] https://blog.vllm.ai/2025/02/21/aibrix-release.html
[51] https://aibrix.readthedocs.io/latest/
[52] https://github.com/vllm-project/production-stack
[53] https://docs.vllm.ai/en/v0.9.2/deployment/integrations/production-stack.html
[54] https://blog.vllm.ai/2025/01/21/stack-release.html
[55] https://blog.vllm.ai/production-stack/
Production Deployment with GPU Cluster (Kubernetes)
qLWS (LeaderWorkerSet)[56, 57]
§ A Kubernetes CRD and controller for deploying leader-worker groups of pods, matching vLLM's multi-node architecture
[56] https://docs.vllm.ai/en/v0.9.2/deployment/frameworks/lws.html
[57] https://github.com/kubernetes-sigs/lws
Observability (Prometheus + Grafana)
qThe vLLM server exposes Prometheus-format metrics at “/metrics”[58, 59] (see the sketch after this list)
§ vllm:request_success_total: count of successfully finished requests (by finish reason: EOS vs. max tokens)
§ vllm:request_queue_time_seconds: time requests spend waiting in the queue
§ vllm:request_prefill_time_seconds: time spent in the prefill phase
§ vllm:request_decode_time_seconds: time spent in the decode phase
§ vllm:request_max_num_generation_tokens: maximum number of requested generation tokens
§ …
qvLLM also provides an example Grafana dashboard[59]
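A sketch of scraping the endpoint by hand; in production, Prometheus scrapes the same URL on a schedule.

```python
import requests

metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith("vllm:request_success_total"):
        print(line)
```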
[58] https://docs.vllm.ai/en/v0.9.2/usage/metrics.html
[59] https://docs.vllm.ai/en/v0.9.2/examples/online_serving/prometheus_grafana.html
Benchmark
qThe “vllm bench” CLI runs latency, throughput, and serving benchmarks[60] (see the sketch below)
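A sketch of driving the benchmark CLI against a server already running on localhost:8000; the dataset and flag choices are illustrative assumptions.

```python
import subprocess

subprocess.run([
    "vllm", "bench", "serve",
    "--model", "Qwen/Qwen3-0.6B",
    "--dataset-name", "random",
    "--num-prompts", "100",
])
```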
qThe vLLM repository also ships standalone benchmark scripts under benchmarks/[61, 62]
qGuidellm: a load-testing and benchmarking tool from the vLLM project[63, 64]
[60] https://docs.vllm.ai/en/v0.9.2/cli/index.html#bench
[61] https://docs.vllm.ai/en/v0.9.2/contributing/profiling.html
[62] https://github.com/vllm-project/vllm/tree/v0.9.2/benchmarks
[63] https://github.com/vllm-project/guidellm
[64] https://arxiv.org/pdf/2502.06494
Production Issues
qGloo connectFullMesh failed with…[65]
§ Multi-node processes bind Gloo to the wrong network interface
→ Pin the correct interface with “GLOO_SOCKET_IFNAME”
qQwen models intermittently produce abnormal token output[66, 67]
§ Traced to cascade inference in the FlashInfer attention kernel
→ Work around it with “--disable-cascade-attn” (see the sketch below)
[65] https://github.com/vllm-project/vllm/discussions/11353
[66] https://github.com/vllm-project/vllm/issues/17652#issuecomment-2867891239
[67] https://flashinfer.ai/2024/02/02/cascade-inference.html
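A sketch combining both workarounds; the interface name and model are illustrative assumptions.

```python
import os

os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # pin Gloo to the correct NIC

# Then launch with cascade attention disabled:
#   vllm serve Qwen/Qwen3-0.6B --disable-cascade-attn
```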
PART 5. Wrap-up
Roadmap Q3 2025[68]
qV1 Engine
§ Complete removal of the V0 Engine code path
§ Core scheduler refactoring
§ Async scheduling, better multi-modal support
qLarge Scale Serving
§ Scale-out serving for Mixture-of-Experts (MoE) models
§ Autoscaling support
qModels
§ Interoperate with training/authoring frameworks for tokenizers, configuration, and processors
§ Sparse attention mechanisms
§ Small (~1B) models
qUse Cases
§ RLHF
→ Fast weight resharding
→ Multi-turn scheduling
§ Evaluation
→ Full determinism of batching order (with/without prefix cache)
§ Batch Inference
→ Prefix-caching-aware scale-out data parallel router
→ CPU KV cache offloading
[68] https://github.com/vllm-project/vllm/issues/20336
Conclusion
qOpenAI-Compatible Server
§ Drop-in with the existing ecosystem (LangChain, Gemini CLI, …)
§ Tool calling, reasoning output parsing
qArchitecture
§ KV cache and PagedAttention
§ V1 Engine improvements
qProduction Deployment
§ TP/PP/DP/EP parallelism strategies
§ Multi-node serving on Kubernetes
§ LoRA adapter serving
§ Prometheus + Grafana observability
§ vllm bench and benchmark scripts
vLLM Meetup in Korea[69, 70]
[69] https://discuss.pytorch.kr/t/8-19-vllm-meetup/7401
[70] https://lu.ma/cgcgprmh
EoD
Coffee Chat: LinkedIn · GitHub