2024 - Machine Learning Infrastructure Observability - Expert Software Engineer

Dublin, County Dublin, Ireland

Huawei Ireland

Huawei is a leading global provider of information and communications technology (ICT) infrastructure and smart devices.

Company Overview: Our cutting-edge technology company is at the forefront of the AI revolution, and we’re seeking an Expert to join our talented team. As a global leader in Cloud & ML infrastructure, we operate large fleets with ML accelerators and distributed systems. Our work directly impacts the rapid development and deployment of AI models across various domains.

 

Role Summary: As an Expert, you will be a pivotal force in shaping the efficiency, reliability, and scalability of our ML infrastructure by designing and developing observability solutions and tools. Your role involves close collaboration with technical leaders across multidisciplinary domains, including cloud infrastructure and ML software systems. Together, we aim to build observability that supports operational excellence across our fleet, ensuring seamless ML experiences for our customers.

Responsibilities:

  • Design and develop observability and monitoring for our ML fleet infrastructure, including GPU clusters, distributed storage, and compute nodes.
  • Design and develop observability for AI cluster operations to support proactive maintenance and capacity planning.
  • Drive efficiency improvements and guide the AI/ML operations engineers on observability best practices.
  • Evaluate cutting-edge observability technologies for hardware accelerators and next-generation networking infrastructure.
  • Provide technical leadership and mentorship to junior SREs, SDEs and Data Scientists.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or related field.
  • Minimum 5 years of hands-on experience in SRE or DevOps roles, with a specific focus on ML/AI infrastructure and AI monitoring.
  • Proficiency in Linux, low-level debugging, and system performance analysis.
  • Strong scripting skills (Python, Bash) for automation and monitoring.
  • Experience with Kubernetes, Docker, and container orchestration.
  • Excellent communication skills and ability to collaborate across teams.

Benefits

  • Competitive salary package
  • Room for long-term personal growth
  • Opportunities to work on high-profile initiatives that impact the whole company
  • Opportunities to work with the brightest minds in software engineering (including Huawei Fellows and world-renowned professors)
  • A multi-cultural, international working environment
  • Work for an international world leader, an established yet still rapidly growing Fortune 500 company

 

Check out Life at Huawei Ireland Research Centre: https://www.youtube.com/watch?v=3gR64sYSnOA&feature=youtu.be

 

DUE TO THE HIGH VOLUME OF REPLIES, ONLY CANDIDATES WHO ARE SHORTLISTED FOR INTERVIEW WILL BE CONTACTED.

 

Privacy Statement

Please read our West European Recruitment Privacy Notice before submitting your personal data to Huawei so that you fully understand how we process and manage the personal data we receive.

http://career.huawei.com/reccampportal/portal/hrd/weu_rec_all.html

Tags: Computer Science DevOps Distributed Systems Docker Engineering GPU Kubernetes Linux Machine Learning ML infrastructure Privacy Python Research

Perks/benefits: Career development Competitive pay Startup environment

Region: Europe
Country: Ireland
