Job Information
Infoblox Senior Director, Site Reliability & Platform Engineering in Tacoma, Washington
Description
It's an exciting time to be at Infoblox. Named a Top 25 Cyber Security Company by The Software Report and one of Inc. magazine's Best Workplaces for 2020, Infoblox is the leader in cloud-first networking and security services. Our solutions empower organizations to take full advantage of the cloud to deliver network experiences that are inherently simple, scalable, and reliable for everyone. Infoblox customers are among the largest enterprises in the world and include 70% of the Fortune 500, and our success depends on bright, energetic, talented people who share a passion for building the next generation of networking technologies and having fun along the way.

We are looking for a Senior Director, Site Reliability Engineering (SRE) and Platform Engineering to lead our SRE and DevOps teams globally, reporting to the Vice President of Engineering. In this role, you will foster a culture of product reliability across all of Engineering, drive and support the SRE team in conducting risk analyses, and work with Engineering leadership to ensure operational excellence of cloud-scale, high-availability systems. You will manage the SRE and DevOps teams, using your abilities to incorporate roadmap objectives from Product Management, Engineering, IT, and Product Security Engineering. This is an essential position in the Engineering organization with executive-level visibility, driving change with other senior leaders to achieve departmental and corporate goals. You are the ideal candidate if you are a visionary who lives and breathes reliability at scale.
What you'll do:
Lead and mentor a team of reliability and platform engineers, championing a culture of reliability, scalability, and continuous improvement across all Infoblox customer products, both on-prem and SaaS
Establish a charter for best-in-class site reliability engineering, and drive Engineering teams toward achieving these best practices
Institute a set of tools and processes that ensure monitoring, observability, capacity planning, disaster recovery, and incident management systems can support 99.999% availability for critical services
Manage large-scale infrastructure and applications across multiple cloud providers using a mix of native cloud, open-source, and commercial off-the-shelf tools
Work with stakeholders, including Engineering, IT, Product Management, and Customer Support, to define and ensure customer-driven SLIs/SLOs exist for both new and existing functionality
Communicate progress by highlighting the accomplishments, risks, mitigation, and other pertinent key performance indicators that feed into Infoblox's overarching business strategy
Facilitate continuous training programs for Engineering that reduce risk, including completion of annual reliability training for Engineering staff
Drive product reliability, operational, and efficiency metrics with automation, allowing management to understand the maturity and risk levels in various product areas

What you'll bring:
15+ years of experience in SRE, platform engineering, or related roles with at least 5 years of this time in a director-level role
10+ years of experience with cloud infrastructure, such as AWS, GCP, and Azure, and DevOps practices
Proven experience managing large-scale, high-availability systems with an emphasis on containers and Kubernetes environments
Experience with CI/CD pipelines, monitoring tools, and incident management processes
Experience with automation and scripting in languages such as Python and Go, and experience with monitoring and observability tools, such as Prometheus, Grafana, etc.
Experience maintaining SOC2, FedRAMP, or ISO 27001 certifications
Experience working within a global team structure
Excellent leadership, communication, and interpersonal skills
Solid business analysis or financial modeling skills to run the analysis for various projects and good understanding of product and software develo