
SRE specialized in HPC/AI (M/F)

Groupe iliad, Paris



Job description

The role

Our mission

Founded in 1999, Scaleway, the cloud of choice, helps developers and businesses to build, deploy and scale applications to any infrastructure.

Located in Paris, Amsterdam and Warsaw, Scaleway’s complete cloud ecosystem is used by 25,000+ businesses, including European startups, who choose Scaleway for its multi-AZ redundancy, smooth developer experience, carbon-neutral data centers and native tools for managing multi-cloud architectures.

With fully managed offerings for bare metal, containerization and serverless architectures, Scaleway brings choice to the world of cloud computing, offering customers the ability to choose where their customers' data resides, to choose what architecture works best for their business, and to choose a more responsible way to scale.

Our journey

We want all our actions and decisions to bring us closer to achieving our vision: building and scaling technologies that make sense to us, to our customers and their end users.

Scaleway is the challenger nobody’s expecting. As our business scales, the customers and developers we serve are increasingly diverse and global.

Giving them an unbeatable experience is central to our business strategy and value proposition.

To better understand them, we've found that the best way to deliver the highest value and performance is to build a well-rounded team that draws on diverse perspectives, knowledge, skills, and cross-cultural understanding.

Our values

  • Singularity: We do it our own way.
  • Community: One company, one culture.
  • Adventure: Level up if you dare, never stop innovating.
  • Leadership: Be the leader you want to follow.
  • Excellence: We want to be customers' first choice as a cloud provider.
  • Rock Solid: You can always count on us!

About the job

With teraflops of computing power available for Scaleway customers, we are looking for an SRE to join our new team specialized in HPC (High Performance Computing).

We are deploying several clusters, and a single cluster can rank among the top 15 HPC systems listed in the Top500. Reporting to our Engineering Manager Emerick Mounoury, you will be responsible for ensuring the deployment and health of the components of our multiple HPC clusters, which are built on Nvidia hardware.

We expect you to have a strong background in HPC environments and system administration, along with some DevOps experience and knowledge of SRE best practices. Our systems evolve constantly, and the tools we use to monitor them and ensure their resilience need to evolve accordingly.

Desired profile

  • Experience in system programming using at least one of these languages: Python, Bash, Go, etc.
  • Demonstrated ability to troubleshoot production system failures
  • A positive mindset and desire to work with a team
  • Passion for automation and incremental improvements on tooling
  • Experience with Linux systems based on Debian and CentOS derivatives
  • Experience with batch job schedulers like Slurm, OAR, SGE
  • Good understanding of computer networks: TCP/IP, DNS, load balancing, IPv6, firewalls, InfiniBand, VLANs/partitions, …
  • Storage knowledge: large pools, NAS, S3, …
  • Experience with Nvidia, CUDA, MPI
  • Good command of English

  • Ability to meticulously identify and solve any kind of bug in any codebase.
  • Experience with infrastructure-as-code and continuous deployment
  • Experience dealing with physical hardware automation
  • Experience with monitoring and logging systems
  • Experience handling account management (LDAP)
  • Knowledge of at least one cloud platform and related use-cases
  • Experience as an OSS contributor and/or maintainer
  • Knowledge of AI / LLM / ML / neural networks

Responsibilities

  • Create or optimize existing tools & documentation that will help identify, diagnose, and solve production incidents, automating as much as possible
  • Troubleshoot high-impact issues by working with multiple Engineering teams (Storage, Network, Hardware)
  • Take on-call responsibilities, mitigate issues encountered in production and answer our customers in real time
  • Ensure a high quality of service for our customers by leveraging observability and monitoring technologies
  • Manage the life cycle of HPC clusters in production and take part in escalating hardware and software issues to our suppliers
  • Empower your teammates to swiftly integrate and deploy software components across our systems
  • Help implement best practices for stability, resiliency, scalability, security, and performance across our systems

Technical stack

  • Python/Bash
  • MySQL
  • S3 API, Lustre, NAS
  • Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana
  • Ansible, Salt
  • GitLab, Nexus
  • Ubuntu, Debian, CentOS
  • Nvidia hardware and software
  • MPI, Module, AI software
  • Slurm
  • K8s
  • Jira, Confluence, Slack, GSuite

Location

Based in our offices in Paris or Lille (France); partial remote work is possible.

Recruitment process

  • Screening call with the recruiter - 30 mins
  • Manager interview - 45 mins
  • Technical interviews - 1h 30 mins
  • HR interview - 45 mins
  • Offer sent - 48 hours

On average our process lasts 2-3 weeks, and offers usually follow within 48 hours.

Important note: if you don't see yourself ticking all the boxes, don't hesitate to apply anyway. Don't limit yourself to a job description, you never know!

