Sr. System Engineer
Company: Support Revolution
Location: San Jose
Posted on: February 17, 2025
Job Description:
Location: San Jose, California, United States
About Supermicro:
Supermicro is a top-tier provider of advanced server,
storage, and networking solutions for Data Center, Cloud Computing,
Enterprise IT, Hadoop/Big Data, Hyperscale, HPC and IoT/Embedded
customers worldwide. We are the #5 fastest growing company among
the Silicon Valley Top 50 technology firms. Our unprecedented
global expansion has provided us with the opportunity to offer a
large number of new positions to the technology community. We seek
talented, passionate, and committed engineers, technologists, and
business leaders to join us.
Job Summary:
As a Sr. System Engineer, you'll be the go-to person to
roll out and maintain business-critical applications and services
for Supermicro. You are also responsible for resolving escalated
service issues, coaching other engineers toward resolutions, and
engineering and implementing complex projects. You will work
independently, lead technical development, and communicate
clearly.
Essential Duties and Responsibilities:
Includes the following essential duties and responsibilities
(other duties may also be assigned):
--- Execute comprehensive system-level rack tests on the latest NVIDIA
and AMD GPUs and on ARM-based, Intel Xeon, and AMD EPYC processors,
encompassing functionality, compatibility, performance, stress, and
reliability testing, leveraging proprietary in-house tools.
--- Establish expertise in HPC/AI applications and benchmarks,
delivering impactful training sessions to customers and partners,
while addressing complex customer support issues, demonstrating
innovative problem-solving skills and building robust processes and
procedures for HPC/AI solutions.
--- Conduct proof of concept design and testing, providing
optimized benchmarks for HPC/AI applications in a timely manner.
Fine-tune BIOS settings, optimize OS/network configurations, and
develop diverse simulation configurations to enhance efficiency
across various workloads.
--- Deliver on-site deployment services, ensuring customer
acceptance verification and providing post-deployment Level 1 and 2 support.
Create and maintain technical documentation, including technical
notes, blogs, and diagrams, to facilitate knowledge
dissemination.
--- Identify and document hardware and software quality issues and
collaborate with Product Management and other Engineering teams to
integrate customer feedback into future product enhancements.
--- Proactively engage in HPC roadmap development, planning
software and hardware upgrades to sustain exceptional HPC
infrastructure performance.
--- Document and analyze test plans, reports, and logs, and actively
contribute to the development of test utilities and automation
scripts to streamline testing processes.
Qualifications:
--- BS/MS in Electrical Engineering, Computer Engineering or Computer
Science
--- 8+ years of work-related experience in Deep Learning and
Machine Learning
--- 8+ years of Linux/networking debugging/testing or relevant
experience preferred
--- Experience with leading AI/ML frameworks such as PyTorch,
TensorFlow, ONNX, etc.
--- Experience with DevOps or in cloud environments, including but
not limited to Docker/Containers and Kubernetes
--- Hands-on experience with workload managers/schedulers (Slurm)
for racks/clusters
--- Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI,
or RCCL/NCCL
--- Programming experience with Windows and Linux shell
scripting
--- Strong team player with strong
communication skills
--- Familiarity with Intel/AMD/NVIDIA development toolkits such as
CUDA, oneAPI, or ROCm is a plus
--- Experience with server/network hardware debugging and
troubleshooting is a plus
--- CCNA, OpenStack, OpenShift, Azure or AWS is a plus
Salary Range:
$140,000 - $158,000
The salary offered will depend on several
factors, including your location, level, education, training,
specific skills, years of experience, and comparison to other
employees already in this role. In addition to a comprehensive
benefits package, candidates may be eligible for other forms of
compensation, such as participation in bonus and equity award
programs.
EEO Statement:
Supermicro is an Equal Opportunity Employer
and embraces diversity in our employee population. It is the policy
of Supermicro to provide equal opportunity to all qualified
applicants and employees without regard to race, color, religion,
sex, sexual orientation, gender identity, national origin, age,
disability, protected veteran status or special disabled veteran,
marital status, pregnancy, genetic information, or any other
legally protected status.