Senior HPC Cluster Engineer

14 hours ago


Prague, Czech Republic; Remote - Europe | Nebius | Full time | 120,000 - 180,000 per year

Why work at Nebius
Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside some of the most experienced and innovative leaders and engineers in the field.

Where we work
Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 800 employees includes more than 400 highly skilled engineers with deep expertise across hardware and software engineering, as well as an in-house AI R&D team.


The role

We're looking for a Senior HPC Cluster Engineer to join our team and play a key role in the development of our cutting-edge hyperscaler platform. The GPU & InfiniBand team is responsible for enhancing and optimizing the core components of our Cloud platform, with a specific focus on GPU computing, InfiniBand networks, and the KVM/QEMU stack. You'll work closely with hardware virtualization and device emulation technologies, ensuring high performance and security in multi-GPU HPC environments. The role involves analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.

In this position, you will be responsible for:

  • Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
  • Analyzing and troubleshooting the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
  • Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
  • Enhancing automation systems for proactively monitoring, detecting, and resolving issues in GPU and InfiniBand environments.
  • Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.

We expect you to have:

  • 5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming.
  • 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
  • In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
  • Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).

It would be a plus if you have:

  • Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
  • Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
  • Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
  • Background in Software-Defined Networking (SDN) and experience with HPC cluster networking.
  • Understanding of QEMU/KVM virtualization and experience managing virtualized environments.
  • Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
  • Familiarity with collective communication libraries like MPI and NCCL for distributed computing. 

We conduct coding interviews as part of the process.

What we offer 

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

We're growing and expanding our products every day. If you're up to the challenge and as excited about AI and ML as we are, join us.

