Software Engineer, AI Networking, Machine Learning Infrastructure
Company: Tesla, Inc.
Location: Palo Alto
Posted on: February 1, 2025
Job Description:
As a Software Engineer within the Autopilot AI
Infrastructure team, you will work on reinforcing, optimizing, and
scaling our infrastructure components supporting AI research
activities for Autopilot and the Tesla Bot.

At the core of our
autonomy capabilities are neural networks that the research team is
designing to train on very large amounts of data, across
large-scale GPU clusters and our supercomputer Dojo. Robustly
training these models at scale and in the shortest amount of time
is critical to our mission.

We are optimizing the communication
collectives used in AI training and inference workloads to ensure
they are robust and performant while improving observability.

What You'll Do
- Identify gaps and optimize the performance of the collective
communication libraries used in the training software stack.
- Build infrastructure to improve observability into the
collective communication libraries to significantly reduce
cognitive load in debugging massively distributed training
jobs.
- Optimize the AI network software stack with respect to the
network topology of our AI supercomputing clusters.
- Develop health checks and integrate them into the fault-tolerant
training infrastructure.
- Collaborate with the supercomputing and research teams to ensure
requirements on network bandwidth and topology for modern AI
workloads are met.

What You'll Bring
- Members of the Autopilot AI Infrastructure team are expected to
be adaptable to the dynamic requirements of AI research and capable
of contributing across all parts of the AI training software
stack.
- Strong work ethic and independence.
- 3+ years of relevant industry experience (HPC, lossless
networks) in a fast-paced environment.
- Strong knowledge of datacenter server systems (PCIe, NUMA, RDMA
NICs and switches).
- Experience in working with, testing and debugging datacenter
RDMA networking fabrics (IB, RoCE) and communication collectives
(e.g. NCCL).
- Experience in debugging issues or bottlenecks in the Linux
kernel.
- Experience in massively parallel programming across multiple
hosts.
- Knowledge of, or interest in, ML training workloads
and how they translate to the relevant collectives.

Compensation and Benefits
Along with competitive pay, as a full-time Tesla employee,
you are eligible for the following benefits from day 1 of hire:
- Aetna PPO and HSA plans > 2 medical plan options with $0
payroll deduction.
- Family-building, fertility, adoption and surrogacy
benefits.
- Dental (including orthodontic coverage) and vision plans, both
of which have options with a $0 paycheck contribution.
- Company-paid Health Savings Account (HSA) contribution when
enrolled in the High Deductible Aetna medical plan with HSA.
- Healthcare and Dependent Care Flexible Spending Accounts
(FSA).
- 401(k) with employer match, Employee Stock Purchase Plans, and
other financial benefits.
- Company-paid Basic Life, AD&D, short-term and long-term
disability insurance.
- Employee Assistance Program.
- Sick and Vacation time (Flex time for salary positions), and
Paid Holidays.
- Back-up childcare and parenting support resources.
- Voluntary benefits to include: critical illness, hospital
indemnity, accident insurance, theft & legal services, and pet
insurance.
- Weight Loss and Tobacco Cessation Programs.
- Tesla Babies program.
- Commuter benefits.
- Employee discounts and perks program.

Expected Compensation
$104,000 - $360,000/annual salary + cash and stock
awards + benefits.

Pay offered may vary depending on multiple
individualized factors, including market location, job-related
knowledge, skills, and experience. The total compensation package
for this position may also include other elements dependent on the
position offered. Details of participation in these benefit plans
will be provided if an employee receives an offer of
employment.

Tesla is an Equal Opportunity / Affirmative Action
employer committed to diversity in the workplace. All qualified
applicants will receive consideration for employment without regard
to race, color, religion, sex, sexual orientation, age, national
origin, disability, protected veteran status, gender identity or
any other factor protected by applicable federal, state or local
laws.

Tesla is also committed to working with and providing
reasonable accommodations to individuals with disabilities. Please
let your recruiter know if you need an accommodation at any point
during the interview process.

Privacy is a top priority for Tesla.
We build it into our products and view it as an essential part of
our business. To understand more about the data we collect and
process as part of your application, please view our Tesla Talent
Privacy Notice.