The Real-Time Communication Site Reliability Engineering (SRE) team supports the Intelligent Conversation and Communication Cloud (IC3) Teams backend services group, which powers billions of real-time customer conversations across Microsoft’s first-party (Teams, Skype) and second-party (Dynamics) solutions. IC3 services enable reliable, high-quality audio/video calling, meeting, and messaging services that work every time, from anywhere, seamlessly across all customer touchpoints. IC3 services make conversations on our platforms more intelligent in real time, empowering best-in-class productivity tools for the modern workplace, where every call, meeting, or chat will make the next one better.
Responsibilities
We are looking for a highly driven Engineering Manager to lead a Site Reliability Engineering (SRE) team focused on critical service environments, both online and complex hybrid environments that combine online and on-premises services. In this role, you will be responsible for managing an SRE team charged with deploying high-quality software and hardware in our data centers. You and your team will deliver results through robust and reliable automation, leveraging existing tools, performing application deployments, and debugging failures. The scope of the role spans areas such as cloud capacity management, machine learning and self-healing, change and configuration management, disaster recovery (DR) exercises, monitoring, reporting, and actively discovering solutions that prevent service outages before they have any customer impact.
We are looking for a talented and passionate Site Reliability Engineer to join the team that manages large-scale worldwide infrastructure.
• Design, write, and deliver software to optimize all aspects of deployments (resources and applications) as ‘infrastructure-as-code’.
• Optimize service releases by improving Azure DevOps release pipelines.
• Drive services towards reliable/predictable deployments achieving better ‘time-to-deploy’ metrics for Services across Microsoft Teams.
• Provide deep technical leadership to a team of highly passionate and skilled engineers.
• Recruit, on-board, and grow a team of Software Engineers focused on Site Reliability.
• Build, run, and improve critical service environments in large-scale data centers.
• Own deployment, availability, reliability, performance, and customer escalation targets for these environments.
• Proactively identify and reduce issues through the design, testing, and implementation of software.
• Uphold high organizational standards for employee and team satisfaction.
• Analyze incidents to determine root cause and mitigation plans. Drive automation into service management tasks and processes.
• Develop safe rollout plans for a portfolio of services to prevent outages.
• Learn and enhance existing tools, and develop new tools to meet new scale and feature requirements, with the aim of reducing manual intervention and improving the prevention, detection, and mitigation of service impacts.
• Manage world-wide capacity for a portfolio of services to meet the usage growth and efficiency requirements.
• Coordinate planning and execution with internal engineering teams, business partners and technical leaders across the division.
• Influence and collaborate across orgs to bring best practices, architectures, standards, and methods for large-scale distributed systems.
• Analyze data and provide operational insights into service reliability and customer experience to Design and Product teams.
• BS/BSE in Computer Science, Management Information Systems, or another technical discipline, or equivalent education.
• 5+ years as a Site Reliability Engineer/Developer working on large-scale distributed systems.
• 3+ years implementing/automating using CI/CD tools.
• 3+ years of technical management experience.
• Good knowledge of basic networking fundamentals & troubleshooting tools.
• Proven experience creating distributed systems tools of moderate to high complexity.
• Ability to manage and deliver multiple project phases at the same time.
• Strong analytical, problem-solving, and organizational skills.
• Excellent written and oral communication skills.
• Ability to deal with the ambiguity associated with working in a fast-paced and changing environment.
• Strong Windows OS / Linux troubleshooting experience.
• Solid debugging, testing, and problem-solving skills.
• Ability to automate routine tasks.
• Software development experience using PowerShell, C#, Java, C++, C or other programming languages.
• 3+ years of Azure development experience (ARM templates, Azure Monitor, PowerShell, Kubernetes, Docker, etc.).
• Experience in a cloud stack and leveraging cloud architecture, applying site reliability principles and/or demonstrating sensitivity to operational concerns.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.