Software Engineer (SRE, Python, Cloud Solutions)
Company: NetApp, Inc.
Location: San Jose
Posted on: March 28, 2025
Job Description:
Job Summary
As a Cloud Infrastructure/Site Reliability Engineer,
you will be operating at the intersection of development and
operations. Your role will involve engaging in and enhancing the
lifecycle of cloud services - from design through deployment,
operation, and refinement. You will be responsible for maintaining
these services by measuring and monitoring their availability,
latency, and overall system health.

You will play a crucial role in
sustainably scaling systems through automation and driving changes
that improve reliability and velocity. As part of your
responsibilities, you will administer cloud-based environments that
support our SaaS/IaaS offerings, which are implemented on a
microservices, container-based architecture (Kubernetes).

In
addition, you will oversee a portfolio of customer-centric cloud
services (SaaS), ensuring their overall availability, performance,
and security. You will work closely with both NetApp and cloud
service provider teams.

Due to the critical nature of the services
we support, this position involves participation in a
rotation-based on-call schedule as part of our global team. This
role offers the opportunity to work in a dynamic, global
environment, ensuring the smooth operation of vital cloud services.
To be successful in this role, you should be a motivated
self-starter and self-learner, possess strong problem-solving
skills, and be someone who embraces challenges.

Job Requirements
- Incident Response and Troubleshooting: Address and perform root
cause analysis (RCA) of complex live production incidents and
cross-platform issues involving OS, Networking, and Database in
cloud-based SaaS environments. Implement SRE best practices for
effective resolution.
- Analysis and Infrastructure Maintenance: Continuously monitor,
analyze, and measure system health, availability, and latency using
tools like Prometheus, ElasticSearch, Grafana, and SolarWinds.
Develop strategies to enhance system and application performance,
availability, and reliability. In addition, maintain and monitor
the deployment and orchestration of servers, docker containers,
databases, and general backend infrastructure.
- Document system knowledge as you acquire it, create runbooks,
and ensure critical system information is readily accessible.
- Security Management: Stay updated with security protocols and
proactively identify, diagnose, and resolve complex security
issues.
- Automation and Efficiency: Identify tasks and areas where
automation can be applied to achieve time efficiencies and risk
reduction. Develop software for deployment automation, packaging,
and monitoring visibility.
- Issue Tracking and Resolution: Use Atlassian Jira to track and
resolve issues based on their priority.
- Team Collaboration and Influence: Work in tandem with other
Cloud Infrastructure Engineers and developers to ensure maximum
performance, reliability, and automation of our deployments and
infrastructure. Additionally, consult and influence developers on
new feature development and software architecture to ensure
scalability.
- Debugging, Troubleshooting, and Advanced Support: Undertake
debugging and troubleshooting of service bottlenecks throughout the
entire software stack.
- Directly influence the decisions and outcomes related to
solution implementation: measure and monitor availability, latency,
and overall system health.
- Proficiency in Linux/Unix OS.
- Demonstrated experience in scripting and infrastructure
automation using tools such as Ansible, Python, and Go.
- Deep working knowledge of Containers, Kubernetes, and
Serverless computing implementation.
- Familiarity with DevOps development methodologies.
- Familiarity with distributed systems design patterns using
tools such as Kubernetes.
- Experience with cloud platforms such as AWS, Azure, or Google
Cloud.

Education
- 8 to 12 years of experience is required.
- A Bachelor of Science degree in Computer Science, a master's
degree, or equivalent experience is required.

At NetApp, we embrace
a hybrid working environment designed to strengthen connection,
collaboration, and culture for all employees. This means that most
roles will have some level of in-office and/or in-person
expectations, which will be shared during the recruitment
process.

Equal Opportunity Employer: NetApp is firmly committed to
Equal Employment Opportunity (EEO) and to compliance with all laws
that prohibit employment discrimination based on age, race, color,
gender, sexual orientation, gender identity, national origin,
religion, disability or genetic information, pregnancy, and any
protected classification.

Did you know... Statistics show women
apply to jobs only when they're 100% qualified. But no one is 100%
qualified. We encourage you to shift the trend and apply anyway! We
look forward to hearing from you.

Why NetApp?
We are all about
helping customers turn challenges into business opportunity. It
starts with bringing new thinking to age-old problems, like how to
use data most effectively to run better - but also to innovate. We
tailor our approach to the customer's unique needs with a
combination of fresh thinking and proven approaches.

We enable a
healthy work-life balance. Our volunteer time off program is best
in class, offering employees 40 hours of paid time off each year to
volunteer with their favourite organizations. We provide
comprehensive benefits, including health care, life and accident
plans, emotional support resources for you and your family, legal
services, and financial savings programs to help you plan for your
future. We support professional and personal growth through
educational assistance and provide access to various discounts and
perks to enhance your overall quality of life.

If you want to help
us build knowledge and solve big problems, let's talk.
Keywords: NetApp, Inc., San Jose, Software Engineer (SRE, Python, Cloud Solutions), IT / Software / Systems, San Jose, California