Director of SRE Job Description Template

We are searching for a driven, knowledgeable director for our Site Reliability Engineering (SRE) team. This role leads a multicultural, collaborative team of engineers who contribute to the reliability of our digital products. You will lead the technical strategy and vision for our underlying infrastructure, alerting and monitoring, infrastructure provisioning, networking, and development tooling in collaboration with other engineering teams and leadership. You must be an effective people manager who can inspire and motivate others, fostering a culture of cooperation and innovation that supports the development of our engineers. You should also be a good communicator who can manage resources and projects with many dependencies.

Typical Duties and Responsibilities

  • Direct an SRE team of bright and driven full-stack engineers in the development, administration, and upkeep of reliable infrastructure and easy-to-use tools
  • Create a culture of honesty, openness, and inclusivity in everything we do
  • Lead the technology strategy for our infrastructure and tooling
  • Establish objectives and delivery schedules in close collaboration with the product and process teams, balancing tech debt and outside requests
  • Work with the SRE team to provide high-quality infrastructure that is easy to maintain and deploys reliably
  • Own the Observability Platform and best practices so that future teams can track and support the health of their apps
  • Mentor and guide SRE engineers to realize their full potential by providing honest feedback and setting clear goals
  • Assist in finding and hiring new talent who will strengthen the team and bring a variety of perspectives
  • Continuously enhance our infrastructure and supporting procedures while simultaneously identifying cost reductions
  • Establish baseline requirements and monitoring capabilities in collaboration with application teams
  • Proactively test reliability and resilience across all of our application platforms
  • Provide seamless developer experiences in close collaboration with other engineering teams
  • Help define agreements and objectives for service reliability
  • Collaborate with engineering groups to establish best practices and standards for incident response and management

Education

  • Bachelor’s degree in computer science or a related field

Required Skills and Experience

  • Experience leading teams through transformation programs
  • Experience with software configuration management and release engineering
  • Experience managing database administrators
  • Knowledge of agile methodologies
  • Knowledge of managing infrastructure and associated equipment
  • Experience working with cloud service platforms (AWS/GCP) and knowledge of best practices and methods for resolving issues in those settings
  • Working knowledge of monitoring systems
  • Expertise in Unix or Windows OS
  • Excellent problem-solving skills and the ability to weigh several options and recommend a course of action
  • Experience guiding a team and setting both team and individual goals
  • Effective communication skills and comfort interacting with people of various backgrounds, skill levels, and experiences
  • Passion for fostering the growth and development of SRE engineers