Presentation
Frameworks for Large Language Model Serving in HPC Environments
Description
We introduce three open-source frameworks for deploying and running large language models (LLMs) in high-performance computing (HPC) environments. The first framework targets high-throughput batch inference, enabling users to submit LLM requests in an OpenAI-compatible format as traditional HPC jobs. The second framework, built on Ray Serve, provides dynamic, on-demand allocation of HPC resources for interactive LLM serving via APIs, supporting applications such as chatbots and AI agents. The third framework is a production-grade, always-on platform for real-time interaction that relies on a dedicated GPU server for model inference. These frameworks are designed to abstract away the complexities of the underlying computing system, allowing researchers to request and use GPU resources for model inference without manual environment setup. We describe these systems and report LLM-specific performance metrics. Results demonstrate that the proposed frameworks enable scalable and resource-efficient LLM serving across both batch and interactive workloads, supporting diverse user needs.
Event Type
Workshop
Time
Sunday, 16 November 2025, 12:00pm - 12:30pm CST
Location
241


