Software Engineer, Distributed Training, AI Infrastructure
Company: Tesla, Inc.
Location: Palo Alto
Posted on: February 1, 2025
Job Description:
As a
Software Engineer within the Autopilot AI Infrastructure team, you
will work on reinforcing, optimizing, and scaling our
infrastructure components supporting AI research activities for
Autopilot and the Tesla Bot.

At the core of our autonomy
capabilities are neural networks that the research team is
designing to train on very large amounts of data, across
large-scale GPU clusters and our supercomputer Dojo. Robustly
training these models at scale and in the shortest amount of time
is critical to our mission.

We are building and improving the
in-house distributed training framework used by the research team
to train production models, ensuring good ergonomics and
flexibility for experimentation while providing good stability and
performance.

What You'll Do
- Write robust Python code in our machine learning
training repository while applying software best practices to
support the research team
- Increase the reliability of our training jobs by debugging and
root-causing failures across thousands of nodes and implementing
fixes to prevent future failures
- Improve our training framework to support new training
paradigms and experimentation methods
- Build and improve our monitoring/observability infra to quickly
debug cluster and training application issues
- Profile and identify performance bottlenecks of training
software in our training cluster
- Coordinate with the supercomputing team managing the training
cluster to maintain high availability and job throughput

What You'll Bring
- Members of the Autopilot AI Infrastructure team are expected to
be adaptable to the dynamic requirements of AI research and capable
of contributing across all parts of the AI training software
stack
- Practical programming experience in Python and/or C/C++
- Experience working with ML training frameworks (ideally
PyTorch)
- Demonstrated experience scaling neural network training jobs
across many GPUs
- Experience with parallel programming concepts and
primitives
- Experience profiling and optimizing CPU-GPU interactions
(pipelining computation with data transfers, etc.)
- Proficient in system-level software, in particular
hardware-software interactions and resource utilization
- Understanding of state-of-the-art deep learning concepts
- Experience programming in CUDA/Triton and/or NCCL
internals

Compensation and Benefits
Along with competitive pay, as a
full-time Tesla employee, you are eligible for the following
benefits at day 1 of hire:
- Aetna PPO and HSA plans > 2 medical plan options with $0
payroll deduction
- Family-building, fertility, adoption and surrogacy
benefits
- Dental (including orthodontic coverage) and vision plans, both
have options with a $0 paycheck contribution
- Company-paid Health Savings Account (HSA) contribution when
enrolled in the High Deductible Aetna medical plan with HSA
- Healthcare and Dependent Care Flexible Spending Accounts
(FSA)
- 401(k) with employer match, Employee Stock Purchase Plans, and
other financial benefits
- Company-paid Basic Life, AD&D, short-term and long-term
disability insurance
- Employee Assistance Program
- Sick and Vacation time (Flex time for salary positions), and
Paid Holidays
- Back-up childcare and parenting support resources
- Voluntary benefits to include: critical illness, hospital
indemnity, accident insurance, theft & legal services, and pet
insurance
- Weight Loss and Tobacco Cessation Programs
- Tesla Babies program
- Commuter benefits
- Employee discounts and perks program

Expected Compensation
$104,000 - $360,000/annual salary + cash and stock
awards + benefits

Pay offered may vary depending on multiple
individualized factors, including market location, job-related
knowledge, skills, and experience. The total compensation package
for this position may also include other elements dependent on the
position offered. Details of participation in these benefit plans
will be provided if an employee receives an offer of
employment.

Tesla is an Equal Opportunity / Affirmative Action
employer committed to diversity in the workplace. All qualified
applicants will receive consideration for employment without regard
to race, color, religion, sex, sexual orientation, age, national
origin, disability, protected veteran status, gender identity or
any other factor protected by applicable federal, state or local
laws.

Tesla is also committed to working with and providing
reasonable accommodations to individuals with disabilities. Please
let your recruiter know if you need an accommodation at any point
during the interview process.

Privacy is a top priority for Tesla.
We build it into our products and view it as an essential part of
our business. To understand more about the data we collect and
process as part of your application, please view our Tesla Talent
Privacy Notice.