The Director of Cloud Infrastructure Operations will lead the Hosting Operations of our Azure and GCP cloud offerings. The Director is an expert in 24×7 operations of high-performing, scalable systems that meet demanding uptime targets, and in all facets of cloud hosting operations, with the ability to communicate effectively with customers and internal stakeholders. The Director is also a leader in continuous integration and continuous delivery, with automated deployment in an Agile SDLC.
- Manage operations plans, staffing, budget and execution
- Lead the escalated Incident Management team and develop maturity plans for continual improvement.
- Establish automated monitoring tools to track systems’ health, uptime and outages. Lead the Monitoring team to build maturity around event management.
- Ensure compliance with best security practices and constantly assess potential vulnerabilities
- Define, establish and deploy the DR strategy and RPO/RTO metrics
- Optimize operations cost across vendors and service providers
- Partner with Engineering and DevOps on CI/CD and automated deployment
- Identify key procedures that can be automated and either automate them or work with the platform engineering team to develop automation.
- Build maturity plans to help grow and scale the business.
- Establish, report, and improve various metrics associated with the efficiency of operating the Humana foundation environment suite delivering value to our customers.
- Improve incident and problem management functions while working to build a world-class incident response function for our customers.
- Adhere to agreed-upon customer SLAs
- Ability to make solid business decisions in a dynamic and fast-paced environment.
- Strong leadership and people management skills.
What you bring:
- 5-8 years of IT Management experience managing Cloud, NOC or operational teams.
- Understanding of Azure and public cloud; cloud certification preferred
- Experience migrating business-critical applications to the cloud, requiring 24×7 uptime
- Knowledge of corporate IT, data centers, ticketing system implementations, monitoring software implementation, troubleshooting, and continuous improvement approaches.
- Skill and knowledge in ITIL processes related to Incident Management, Service Requests, Event Management, Access Management, Change Management, Knowledge Management and Escalated Incident Management.
- General knowledge of monitoring software for base monitoring as well as application performance monitoring, including KPIs and reporting approaches.
- Solution-oriented leadership and a management-based approach.
- Availability for off-hours work related to 24×7 uptime and availability of the SaaS product suite; willingness to support the team that has on-call coverage expectations.
What you will do:
- Define short, mid, and long-term cloud production operations strategy and roadmap
- Advocate for that strategy with engineers, managers, and executives
- Provide guidance, objectives, metrics, and oversight to help teams maintain 24×7 uptime and availability of mission-critical, customer-facing production services
- Define and roll out the processes, practices, and tooling that teams will use to meet their service level objectives
- Review application and service deployments to ensure that DR requirements have been met
- Improve overall system DR capabilities
- Oversee cloud and datacenter operational processes across all teams, from change planning to incident management
- Define and lead the rollout of new infrastructure initiatives that improve overall system resilience and efficiency
- Work across the organization to identify and resolve operational gaps
- Manage and optimize cloud service and infrastructure spend
- Lead operational rollout, working with internal and third-party resources to achieve growth goals and operational stability
What you will bring to the team:
- Minimum of 5 years of experience managing 24×7 production operations for a high-volume, business-critical cloud service
- Excellent organizational and business planning capabilities
- Experience with and enthusiasm for operating in an agile, DevOps-oriented organization and culture
- Understanding of the key concepts and practices of Observability, coupled with experience implementing robust systems that leverage metrics, logs, and traces to provide understanding of system state
- A creative leadership style that inspires and influences others
- Technical and business acumen that ensures the organization is operating efficiently and effectively in a hybrid environment
- The ability to communicate effectively with executives, engineers and customers
- The ability to build effective relationships with internal business stakeholders and external partners
- Significant experience managing IT vendor relationships
- Bachelor’s Degree in Computer Science, Information Technology, or equivalent experience
- You have managed Kubernetes and VM-based workloads in production on premises and in Google or Amazon clouds
- You have experience with and understand the benefits and trade-offs of different serverless implementations emerging in public cloud
- You have migrated workloads from the datacenter to public cloud
- You have developed, deployed, and operated services or full applications via CI pipelines
- You have a deep understanding of observability – how to apply best practices around monitoring, alerting, and logging, and have implementation experience with one or more monitoring, alerting, and logging systems
- You have built out software platforms to support application suites, and you understand the implementation and operational trade-offs made during the build-out process
- You have experience/exposure to big data storage and processing clusters, e.g. Hadoop, Cassandra, HBase, Spark
- Bachelor’s degree required; Master’s degree preferred
- 8 or more years of transformational experience running cloud at scale
- 5 or more years of management experience
- Must be passionate about contributing to an organization focused on continuously improving consumer experiences
Scheduled Weekly Hours