HPC Kubernetes Engineer · NorthMark Compute and Cloud · Dallas, Texas
Description
The Position
We are seeking a highly skilled Senior Kubernetes Engineer to join our HPC and Infrastructure function in Dallas. In this role, you will design, implement, and optimize GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will have deep expertise with both NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.

Responsibilities:
Architecting and operating Kubernetes clusters optimized for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
Optimizing GPU utilization and job placement through scheduler extensions, such as kube-scheduler plugins, Slurm and Volcano
Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper
Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps, ArgoCD and FluxCD
Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
Participating in performance tuning, incident response and production readiness reviews
Requirements:
Bachelor's Degree or equivalent experience
Extensive experience with Kubernetes in production-grade environments and working with NVIDIA and Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM
Proficiency in Go or Python for operator development and Kubernetes controller logic
Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions
Experience with GPU-intensive workloads, such as LLMs, training pipelines and scientific computing
Hands-on experience with Helm, Kustomize and GitOps workflows
Familiarity with CNI plugins, especially NVIDIA CNI and Multus
Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter

The following is beneficial:
Knowledge of container runtimes with CRI-O, containerd and NVIDIA Container Toolkit
Contributions to open-source projects in the Kubernetes or NVIDIA ecosystem
Experience working with Cilium or other CNI plugins
Company Description
NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.
Event Type
Job Posting
Time
Monday, 17 November 2025, 4:10pm - 4:10pm CST
Location
Hall 6
Countries
United States of America
Companies
NorthMark Compute and Cloud
In-Person / Remote
In-person
Part Time / Full Time
Full Time
Position Type
Permanent