Service Reliability Operations Engineer

Nvidia Corporation
Bengaluru, KA 560002
  • Job Code
    JR1940069

Our technology has no boundaries! NVIDIA is building the world's most groundbreaking, state-of-the-art compute platforms. It's because of our work that scientists, researchers, and engineers can advance their ideas. At its core, our visual computing technology not only enables an amazing computing experience, it is also energy efficient! We pioneered a supercharged form of computing loved by the most demanding computer users in the world: scientists, designers, artists, and gamers. It's not just technology, though! It is our people, some of the brightest in the world, and our company culture that make NVIDIA one of the most fun, innovative, and dynamic places to work. At the center of NVIDIA's culture are our core values, such as innovation, excellence, determination, and team, which guide us to be the best we can be.

NVIDIA's NSV team is looking for highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, state-of-the-art Service Reliability Operations Center (known as Mission Control) to provide extraordinary levels of support for our Cloud products and services. As a key member of the Mission Control team, you will partner with other key members of our organization, including Site Reliability Engineering, the Security Operations Center, DevOps teams, and other datacenter operations partners, to help make our services capable of providing near 100% availability. On the rare occasion that an incident occurs, you will be our front line to decrease the frequency and duration of any issue. Working in partnership with the development community, the Mission Control team will develop monitors, alarms, and alerts to help make the service more reliable and improve our customer experience. Additionally, you will be closely involved in selecting the technologies that we will use in Mission Control to help monitor, run, and measure the effectiveness of the environment.

What you will be doing:

  • The team will provide its services 24/7 in a follow-the-sun model that spans continents.
  • You will directly report to a manager in Bangalore.
  • Each team member will need to work either a Saturday or a Sunday each week. The hours worked may include an early or late start (a schedule of 10 hours per day, 4 days per week) to ensure that the combination of the US and India teams provides 24/7 coverage.
  • The heart of Mission Control will be monitoring and running a growing production compute and storage environment.
  • Every Mission Control team member will utilize alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and execute predictive support or diagnostic routines.
  • Perform systems administration and network administration tasks, and use security incident monitoring to drive your actions.
  • Mission Control team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.
  • Help discover incidents and issues, including initiating the incident management procedure.
  • Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.
  • Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort.
  • You may perform other tasks that will help us provide extraordinary service levels for our customers.

What we need to see:

  • Minimum of 3 years' experience administering open system servers in a Production environment.
  • At least 2 years' experience working in demanding Internet, Cloud, or Telecommunications environments in a Systems Administration, DevOps, SRE, or NOC role.
  • Expertise using monitoring tools and problem ticketing systems.
  • Strong problem-solving, analytical, and troubleshooting abilities.
  • Strong server administration experience, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc. RHCE or an equivalent level of knowledge.
  • Experience scripting in Python preferred, but not required.
  • Prior experience running virtual machines under open source or commercial hypervisors.
  • Experience operating services running on public or private clouds.
  • Knowledge and understanding of application containers and container orchestration systems.
  • Basic understanding of Git.
  • Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.
  • Demonstrated ability to master and maintain complicated environments.

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most forward-thinking and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.





Posted: 2021-03-26 Expires: 2021-04-24