Site Reliability Engineer - AI Application Platform

Red Hat

2mo ago

  • Job
    Full-time
    Mid & Senior Level
  • Software Engineering
    IT & Cybersecurity
  • Cork

AI generated summary

  • You should have 3+ years in cloud tech, 1+ year in Kubernetes, 2+ years in monitoring and config management, programming experience, strong troubleshooting skills, and Agile methodology familiarity.
  • You will manage infrastructure, automate cloud services, develop AI solutions, monitor systems, troubleshoot issues, provide support, mentor peers, and enhance SRE practices in an agile team.

Requirements

  • 3+ years of experience using cloud providers and technologies (Google, Azure, Amazon, OpenStack, etc.)
  • 1+ year of experience administering a Kubernetes-based production environment
  • 2+ years of experience with enterprise systems monitoring
  • 2+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
  • 2+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred
  • 2+ years of experience delivering a hosted service
  • Demonstrated ability to quickly and accurately troubleshoot system issues
  • Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
  • Demonstrated comfort with collaboration, open communication and reaching across functional boundaries
  • Passion for understanding users’ needs and delivering outstanding user experiences
  • Independent problem-solving and self-direction
  • Works well alone and as part of a global team
  • Experience working with Agile development methodologies

Responsibilities

  • Build and manage our large-scale infrastructure and platform services across public cloud, private cloud, and datacenter-based environments
  • Automate cloud infrastructure using platform capabilities (e.g., auto scaling and load balancing), scripting (Bash, Python, and Golang), and monitoring and alerting solutions (e.g., Splunk, Splunk IM, Prometheus, Grafana, and Catchpoint)
  • Design, develop, and become an expert in AI capabilities leveraging emerging industry standards
  • Participate in the design and development of software such as Kubernetes operators, webhooks, and CLI tools
  • Implement and maintain intelligent infrastructure and application monitoring designed to enable application engineering teams
  • Ensure the production environment is operating in accordance with established procedures and best practices
  • Provide escalation support for high severity and critical platform-impacting events
  • Provide feedback around bugs and feature improvements to the various Red Hat Product Engineering teams
  • Contribute software tests and participate in peer review to increase the quality of our codebase
  • Help and develop peers’ capabilities through knowledge sharing, mentoring, and collaboration
  • Participate in a regular on-call schedule, supporting the operational needs of our tenants
  • Practice sustainable incident response and blameless postmortems
  • Work within a small agile team to develop and improve SRE methodologies, support your peers, plan and self-improve

FAQs

What is the main role of a Senior Site Reliability Engineer in the IT AI Application Platform team?

The main role is to develop, scale, and operate the AI Application Platform based on Red Hat technologies, contributing to core AI services, enabling customer self-service, and automating processes to eliminate toil.

What technologies will I be working with in this position?

You will work with Red Hat technologies, including Red Hat OpenShift AI (RHOAI) and Red Hat Enterprise Linux AI (RHEL AI), as well as cloud providers such as Google, Azure, and Amazon.

What type of infrastructure does the Site Reliability Engineer manage?

The SRE manages large-scale infrastructure and platform services across public cloud, private cloud, and data center environments.

What scripting languages should I be familiar with for this job?

You should be familiar with scripting languages such as Bash, Python, and Golang.

Is experience with Kubernetes necessary for this role?

Yes, at least 1 year of experience administering a Kubernetes-based production environment is required.

What kind of monitoring solutions will I need to utilize?

You will utilize enterprise systems monitoring solutions such as Splunk, Prometheus, Grafana, and Catchpoint.

Is experience with configuration management software important for this role?

Yes, you need at least 2 years of experience with enterprise configuration management software like Ansible, Puppet, or Chef.

What qualities are sought in candidates regarding teamwork and communication?

Candidates should demonstrate comfort with collaboration, open communication, and the ability to reach across functional boundaries.

What is the approach to incident response within the team?

The team practices sustainable incident response and conducts blameless postmortems to support continuous improvement.

Will I have opportunities for professional growth at Red Hat?

Yes, individual contributions have high visibility, which means there are numerous career opportunities and growth potential within the company.

What is the company culture like at Red Hat?

Red Hat fosters a culture of transparency, collaboration, and inclusion, encouraging contributions from individuals of diverse backgrounds and perspectives.

Is prior experience in agile development methodologies required?

Yes, experience working with Agile development methodologies is required for this position.

The leading provider of enterprise open source solutions.

Industry: Technology
Employees: 10,001+
Founded: 1993

Mission & Purpose

Red Hat is the world’s leading provider of enterprise open source solutions, using a community-powered approach to deliver high-performing Linux, hybrid cloud, edge, and Kubernetes technologies. We hire creative, passionate people who are ready to contribute their ideas, help solve complex problems, and make an impact.