Tesla, Inc. (TSLA)

AI Day 2022

Oct 1, 2022

Elon Musk

CEO, Tesla

All right, welcome everybody. Let's give everyone a moment to get back in the audience. All right. Great. Welcome to Tesla AI Day 2022. We've got some really exciting things to show you. I think you'll be pretty impressed. I do wanna set some expectations with respect to our Optimus robot. As you know, last year it was just a person in a robot suit. We've come a long way, and I think, compared to that, it's gonna be very impressive. We're gonna talk about the advancements in AI for Full Self-Driving, as well as how they apply more generally to real-world AI problems like a humanoid robot and even going beyond that.

I think there's some potential that what we're doing here at Tesla could make a meaningful contribution to AGI. I think actually Tesla's a good entity to do it from a governance standpoint because we're a publicly traded company with one class of stock, and that means the public controls Tesla, and I think that's actually a good thing. If I go crazy, you can fire me. This is important. Maybe I've gone crazy. I don't know. We're gonna talk a lot about our progress in AI, Autopilot, as well as the progress with Dojo. We're gonna bring the team out and do a long Q&A.

You can ask tough questions, whatever you'd like, existential questions, technical questions, but we wanna have as much time for Q&A as possible. Let's see. With that, you guys wanna say anything?

Milan Kovac

Director of Autopilot Software, Tesla

Hey guys, I'm Milan. I work on Autopilot and the Tesla Bot.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

I'm Lizzie, a mechanical engineer on the project as well.

Elon Musk

CEO, Tesla

Okay. Should we bring out the bot?

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

Before we do that.

Elon Musk

CEO, Tesla

All right.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

We have one little bonus tip for the day. This is actually the first time we try this robot without any backup support, cranes, mechanical mechanisms, no cables, nothing.

Elon Musk

CEO, Tesla

Yeah.

Milan Kovac

Director of Autopilot Software, Tesla

We wanna do it with you guys tonight.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

Kinda take a little risk with you guys.

Milan Kovac

Director of Autopilot Software, Tesla

This is the first time.

Elon Musk

CEO, Tesla

We want to.

Milan Kovac

Director of Autopilot Software, Tesla

Let's see.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

You ready? Let's go.

Elon Musk

CEO, Tesla

Go.

Milan Kovac

Director of Autopilot Software, Tesla

I think the bot got some moves too. This is essentially the same Full Self-Driving computer that runs in your Tesla cars, by the way.

Elon Musk

CEO, Tesla

This is literally the first time the robot has operated without a tether. It's on stage tonight. The robot can actually do a lot more than we just showed you. We just didn't want it to fall on its face. We'll show you some videos now of the robot doing a bunch of other things, which are less risky.

Milan Kovac

Director of Autopilot Software, Tesla

We should close this screen, guys.

Elon Musk

CEO, Tesla

Yeah.

Milan Kovac

Director of Autopilot Software, Tesla

Yeah, we wanted to show a little bit more what we've done over the past few months with the bot and just walking around and dancing on stage. Just humble beginnings, but you can see the Autopilot neural networks running as is, just retrained for the bot, directly on that new platform.

Elon Musk

CEO, Tesla

Yeah.

Milan Kovac

Director of Autopilot Software, Tesla

That's my watering can.

Elon Musk

CEO, Tesla

Yeah, when you see a rendered view, that's the robot. That's the world the robot sees. It's very clearly identifying objects. Like, this is the object it should pick up, picking it up. Yeah.

Milan Kovac

Director of Autopilot Software, Tesla

We use the same process as we did for Autopilot to collect data and train neural networks that we then deploy on the robot. That's an example that illustrates the upper body a little bit more.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

And then-

Milan Kovac

Director of Autopilot Software, Tesla

Something that we'll, like, try to nail down in a few months, over the next few months, I would say, to perfection.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

This is really an actual station in the Fremont factory as well that it's working at.

Elon Musk

CEO, Tesla

Yep.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

That's not the only thing we have to show today, right?

Elon Musk

CEO, Tesla

Yeah, absolutely. What you saw was what we call Bumble C. That's our sort of rough development robot using semi off-the-shelf actuators. We actually have gone a step further than that already. The team's done an incredible job, and we actually have an Optimus bot with fully Tesla designed and built actuators, battery pack, control system, everything. It wasn't quite ready to walk, but I think it will walk in a few weeks. We wanted to show you the robot, so something that's actually fairly close to what will go into production and show you all the things it can do. Let's bring it out. Do it. Sorry. Yeah.

Here you're seeing Optimus with these degrees of freedom that we expect to have in Optimus production unit one, which is the ability to move all the fingers independently, to have the thumb have two degrees of freedom. It has opposable thumbs and both left and right hand, so it's able to operate tools and do useful things. Our goal is to make a useful humanoid robot as quickly as possible. We've also designed it using the same discipline that we use in designing the car, which is to say, to design it for manufacturing, such that it's possible to make the robot in high volume at low cost with high reliability. That's incredibly important.

I mean, you've all seen very impressive humanoid robot demonstrations, and that's great, but what are they missing? They're missing a brain. They don't have the intelligence to navigate the world by themselves. They're also very expensive, and made in low volume. Whereas Optimus is designed to be an extremely capable robot, but made in very high volume, probably ultimately millions of units. It is expected to cost much less than a car.

Moderator

I'll just bring it directly to the right here.

Elon Musk

CEO, Tesla

I would say probably less than $20,000 would be my guess. The potential for Optimus is, I think, appreciated by very few people. Hey. As usual, Tesla demos are coming in hot.

Moderator

Okay, that's good. That's good.

Elon Musk

CEO, Tesla

Yeah. The team has put in an incredible amount of work, working days, you know, seven days a week, burning the 3 A.M. oil to get to the demonstration today. Super proud of what they've done. They've really done a great job. I'd just like to give a hand to the whole Optimus team. You know, now there's still a lot of work to be done to refine Optimus and improve it. Obviously, 'cause this is just Optimus version 1.

That's really why we're holding this event, which is to convince some of the most talented people in the world, like you guys, to join Tesla and help make it a reality and bring it to fruition at scale, such that it can help millions of people. The potential, like I said, really boggles the mind, because you have to ask, what is an economy? An economy is sort of productive entities times their productivity, capital times output, productivity per capita. At the point at which there is not a limitation on capital, it's not clear what an economy even means at that point. An economy becomes quasi-infinite.

So, you know, taken to fruition in the hopefully benign scenario, this means a future of abundance. A future where there is no poverty. Where you can have whatever you want in terms of products and services. It really is a fundamental transformation of civilization as we know it. Obviously we wanna make sure that transformation is a positive one and safe. But that's also why I think Tesla as an entity doing this, being a single class of stock publicly traded, owned by the public, is very important and should not be overlooked. I think this is essential because then if the public doesn't like what Tesla's doing, the public can buy shares in Tesla and vote differently. This is a big deal.

Like, it's very important that I can't just do what I want. You know, sometimes people think that, but it's not true. You know, it's very important that the corporate entity that makes this happen is something that the public can properly influence. I think the Tesla structure is ideal for that. Like I said, you know, self-driving cars will certainly have a tremendous impact on the world. I think they will improve the productivity of transport by at least a half order of magnitude, perhaps an order of magnitude, perhaps more. Optimus, I think, has maybe two orders of magnitude potential improvement in economic output.

Like, it's not clear what the limit actually even is. We need to do this in the right way. We need to do it carefully and safely and ensure that the outcome is one that is beneficial to civilization and one that humanity wants. It's extremely important, obviously. I hope you will consider joining Tesla to achieve those goals. At Tesla, we really care about doing the right thing here, or aspire to do the right thing, and really not pave the road to hell with good intentions. I think road to hell is mostly paved with bad intentions, but every now and again, there's a good intention in there.

We wanna do the right thing. You know, consider joining us and helping make it happen. With that, let's move on to the next phase.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

Right on. Thank you, Elon. All right, you've seen a couple robots today. Let's do a quick timeline recap. Last year, we unveiled the Tesla Bot concept, but a concept doesn't get us very far. We knew we needed a real development and integration platform to get real-life learnings as quickly as possible. That robot that came out and did the little routine for you guys, we had that within 6 months, built, working on software integration, hardware upgrades over the months since then. But in parallel, we've also been designing the next generation, this one over here. This guy is rooted in the foundation of sort of the vehicle design process. You know, we're leveraging all of those learnings that we already have. Obviously, there's a lot that's changed since last year, but there's a few things that are still the same. You'll notice.

We still have this really detailed focus on the true human form. We think that matters for a few reasons, but it's fun. We spend a lot of time thinking about how amazing the human body is. We have this incredible range of motion, typically really amazing strength. A fun exercise is if you put your fingertip on the chair in front of you, you'll notice that there's a huge range of motion that you have in your shoulder and your elbow, for example. Without moving your fingertip, you can move those joints all over the place. But the robot, you know, its main function is to do real useful work, and it maybe doesn't necessarily need all of those degrees of freedom right away.

We've stripped it down to a minimum sort of 28 fundamental degrees of freedom, and then of course, our hands in addition to that. Humans are also pretty efficient at some things and not so efficient in other times. For example, we can eat a small amount of food to sustain ourselves for several hours. That's great. But when we're just kind of sitting around, no offense, but we're kind of inefficient. We're just sort of burning energy. On the robot platform, what we're gonna do is we're gonna minimize that idle power consumption, drop it as low as possible, and that way we can just flip a switch, and immediately the robot turns into something that does useful work. Let's talk about this latest generation in some detail, shall we?

On the screen here, you'll see in orange our actuators, which we'll get to in a little bit, and in blue our electrical system. Now that we have our sort of human-based research, and we have our first development platform, we have both research and execution to draw from for this design. Again, we're using that vehicle design foundation, so we're taking it from concept through design and analysis and then build and validation. Along the way, we're gonna optimize for things like cost and efficiency because those are critical metrics to take this product to scale eventually. How are we gonna do that? Well, we're gonna reduce our part count and our power consumption of every element possible. We're gonna do things like reduce the sensing and the wiring at our extremities.

You can imagine a lot of mass in your hands and feet is gonna be quite difficult and power consumptive to move around. We're gonna centralize both our power distribution and our compute to the physical center of the platform. In the middle of our torso, actually it is the torso, we have our battery pack. This is sized at 2.3 kWh, which is perfect for about a full day's worth of work. What's really unique about this battery pack is it has all of the battery electronics integrated into a single PCB within the pack. That means everything from sensing to fusing, charge management, and power distribution is all in one place. We're also leveraging both our vehicle products and our energy products to roll all of those key features into this battery.

That's streamlined manufacturing, really efficient and simple cooling methods, battery management, and also safety. Of course, we can leverage Tesla's existing infrastructure and supply chain to make it. Going on to sort of our brain, it's not in the head, but it's pretty close. Also in our torso, we have our central computer. As you know, Tesla already ships Full Self-Driving computers in every vehicle we produce. We wanna leverage both the Autopilot hardware and the software for the humanoid platform, but because it's different in requirements and in form factor, we're gonna change a few things first. It's gonna do everything that a human brain does, processing vision data, making split-second decisions based on multiple sensory inputs, and also communications. To support communications, it's equipped with wireless connectivity as well as audio support.

It also has hardware-level security features, which are important to protect both the robot and the people around the robot. Now that we have our sort of core, we're gonna need some limbs on this guy, and we'd love to show you a little bit about our actuators and our fully functional hands as well. Before we do that, I'd like to introduce Malcolm, who's gonna speak a little bit about our structural foundation for the robot.

Malcolm Burgess

Manager of Structural Concepts and Vehicle Dynamics, Tesla

Thanks. Thank you, Lizzie. Tesla has the capabilities to analyze highly complex systems. They don't get much more complex than a crash. You can see here a simulated crash from Model 3 superimposed on top of the actual physical crash. It's actually incredible how accurate it is. Just to give you an idea of the complexity of this model, it includes every nut, bolt, and washer, every spot weld, and it has 35 million degrees of freedom. It's quite amazing. It's true to say that if we didn't have models like this, we wouldn't be able to make the safest cars in the world. Can we utilize our capabilities and our methods from the automotive side to influence a robot? Well, we can make a model, and since we have crash software, we're using the same software here. We can make it fall down.

The purpose of this is to make sure that if it falls down, ideally it doesn't, but it's superficial damage. We don't want it to, for example, break its gearbox and its arms. That's the equivalent of a dislocated shoulder of a robot, difficult and expensive to fix. We want it to dust itself off, get on with the job it's been given. We can also take the same model, and we can drive the actuators using the inputs from a previously solved model, bringing it to life. This is producing the motions for the tasks we want the robot to do. These tasks are picking up boxes, turning, squatting, walking up stairs. Whatever the set of tasks are, we can place in the model. This is showing just simple walking. We can create the stresses in all the components.

That helps us optimize the components. These are not dancing robots. These are actually the modal behavior, the first five modes of the robot. Typically, when people make robots, they make sure the first mode is up around the top single figures, up towards 10 hertz. The reason to do this is to make the controls of walking easier. It's very difficult to walk if you can't guarantee where your foot is, it wobbling around. That's okay if you make one robot. We want to make thousands, maybe millions. We haven't got the luxury of making them from carbon fiber and titanium. We want to make them from plastic. Things are not quite so stiff. We can't have these high targets. I call them dumb targets. We've got to make them work at lower targets. Is that going to work?

Well, if you think about it, sorry about this, but we're just bags of soggy jelly and bones thrown in. We're not high frequency. If I stand on my leg, I don't vibrate at 10 Hz. People operate at a lower frequency, so we know the robot actually can. It just makes controls harder. We take the information from this, the modal data and the stiffness, and feed that into the control system. That allows it to walk. Just changing tack slightly, looking at the knee. We can take some inspiration from biology, and we can look to see what the mechanical advantage of the knee is. It turns out it's actually quite similar to a four-bar link, and that's quite nonlinear.

That's not surprising, really, because if you think when you bend your leg down, the torque on your knee is much more when it's bent than it is when it's straight. You'd expect a nonlinear function, and in fact, the biology is nonlinear. This matches it quite accurately. That's the representation. The four-bar link is obviously not physically a four-bar link, but as I said, the characteristics are similar. Me bending down, that's not very scientific. Let's be a bit more scientific. We've played all the tasks through this graph, and this is showing picking things up, walking, squatting, the tasks I said we did on the stress. That's the torque seen at the knee against the knee bend on the horizontal axis. This is showing the requirement for the knee to do all these tasks.

I've then put a curve through it, surfing over the top of the peaks, and that's saying this is what's required to make the robot do these tasks. If we look at the four-bar link, that's actually the green curve, and it's saying that the nonlinearity of the four-bar link has actually linearized the characteristic of the force. What that really says is that's lowered the force. That's what makes the actuator have the lowest possible force, which is the most efficient. We wanna burn energy up slowly. What's the blue curve? Well, the blue curve is actually if we didn't have a four-bar link, we just had an arm sticking out of my leg here with an actuator on it, a simple two-bar link.

That's the best we could do with a simple two-bar link, and it shows that that would create much more force in the actuator, which would not be efficient. What does that look like in practice? Well, as you'll see, it's very tightly packaged in the knee. You'll see it go transparent in a second. You'll see the four-bar link there. It's operating on the actuator. This is determining the force and the displacements on the actuator. I'll now pass you over to Konstantinos to tell you in a lot more detail how these actuators are made and how the designs are optimized. Thank you.
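To make the two-bar comparison concrete, here is a minimal sketch of how actuator force depends on knee angle when a linear actuator spans the joint directly. It models only the simple two-bar case, not the four-bar linkage, and the mount distances, torque value, and function names are illustrative assumptions rather than figures from the talk.

```python
import math

def moment_arm(a, b, knee_angle):
    """Perpendicular distance (m) from the knee pivot to the actuator's line of action.

    a, b: assumed distances (m) from the pivot to the actuator mounts on thigh and shank.
    knee_angle: interior angle (rad) between thigh and shank; pi would be a straight leg.
    """
    # Actuator length from the law of cosines on the triangle pivot-mount-mount.
    length = math.sqrt(a * a + b * b - 2.0 * a * b * math.cos(knee_angle))
    # Triangle area is a*b*sin(angle)/2 and also length*arm/2, so solve for the arm.
    return a * b * math.sin(knee_angle) / length

def actuator_force(knee_torque, a, b, knee_angle):
    """Actuator force (N) needed to supply knee_torque (N*m) at the given angle."""
    return knee_torque / moment_arm(a, b, knee_angle)

if __name__ == "__main__":
    # Same torque demand at every angle, to isolate the effect of geometry.
    for degrees in (60, 90, 120, 150):
        force = actuator_force(100.0, 0.10, 0.10, math.radians(degrees))
        print(f"{degrees:3d} deg -> {force:7.1f} N")
```

With these assumed mounts the moment arm shrinks as the leg straightens, so the same torque costs more actuator force. That angle dependence is the kind of effect a linkage can be shaped to flatten.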

Konstantinos Laskaris

Director and Lead of Optimus program, Tesla

Thank you, Malcolm. I would like to talk to you about the design process and actuator portfolio in our robot. There are many similarities between a car and a robot when it comes to powertrain design. The things that matter most here are energy, mass, and cost. We are carrying over most of our design experience from the car to the robot. In this particular case, you see a car with two drive units, and the drive units are used to accelerate the car from 0 to 60 miles per hour or to drive a city drive cycle. For the robot, which has 28 actuators, it's not obvious what the tasks are at an actuator level.

We have tasks that are higher level, like walking or climbing stairs or carrying a heavy object, which need to be translated into joint specs. Therefore, we use our model that generates the torque speed trajectories for our joints, which subsequently is going to be fed in our optimization model to run through the optimization process. This is one of the scenarios that the robot is capable of doing, which is turning and walking. When we have this torque speed trajectory, we lay it over an efficiency map of an actuator, and we are able, along the trajectory, to generate the power consumption and the energy, cumulative energy for the task versus time. This allows us to define the system cost for the particular actuator and put a single point into the cloud.

We do this for hundreds of thousands of actuators by solving in our cluster. The red line denotes the Pareto front, which is the preferred area where we will look for the optimum. The X denotes the preferred actuator design we have picked for this particular joint. Now we need to do this for every joint. We have 28 joints to optimize, and we parse our cloud. We parse our cloud again for every joint spec, and the red X's this time denote the bespoke actuator designs for every joint. The problem here is that we have too many unique actuator designs, and even if we take advantage of the symmetry, still there are too many. In order to make something mass manufacturable, we need to be able to reduce the number of unique actuator designs.
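The evaluation loop described here can be sketched in a few lines: integrate electrical energy along a torque-speed trajectory using an efficiency lookup, then keep only the non-dominated designs. The talk does not name the axes of the cloud, so the choice of (mass, energy) as the two costs, the constant efficiency in the example, and all function names are assumptions made for illustration.

```python
def task_energy_joules(trajectory, efficiency, dt):
    """Electrical energy (J) used to follow a torque-speed trajectory.

    trajectory: list of (torque_nm, speed_rad_s) samples spaced dt seconds apart.
    efficiency: function (torque, speed) -> value in (0, 1], standing in for an
                actuator efficiency map.
    Regeneration is ignored: negative mechanical power is counted as consumption.
    """
    energy = 0.0
    for torque, speed in trajectory:
        mechanical_power = abs(torque * speed)
        energy += mechanical_power / efficiency(torque, speed) * dt
    return energy

def pareto_front(points):
    """Non-dominated subset of (cost_a, cost_b) points, both costs minimized."""
    front = []
    best_b = float("inf")
    # After sorting by cost_a, a point survives only if it improves on cost_b.
    for a, b in sorted(points):
        if b < best_b:
            front.append((a, b))
            best_b = b
    return front

if __name__ == "__main__":
    samples = [(10.0, 2.0), (20.0, 1.0)]
    print(task_energy_joules(samples, lambda t, w: 0.8, dt=0.5))
    # Each tuple is a hypothetical candidate design: (mass_kg, task_energy_j).
    cloud = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (1.5, 9.5)]
    print(pareto_front(cloud))
```

Each candidate actuator contributes one point to the cloud, and a design for a joint would then be chosen from the front according to whatever weighting of the costs the program prefers.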

Therefore, we run something called a commonality study, in which we parse our cloud again, looking this time for actuators that simultaneously meet the joint performance requirements for more than one joint at the same time. The resulting portfolio is six actuators, and they are shown in a color map in the middle figure, and the actuators can be also viewed in this slide. We have three rotary and three linear actuators, all of which have a great output force or torque per mass. The rotary actuator in particular has a mechanical clutch integrated on the high-speed side, an angular contact ball bearing, and, on the low-speed side, a cross roller bearing, and the gear train is a strain wave gear. There are three integrated sensors here and a bespoke permanent magnet machine.
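One way to picture a commonality study is as a covering problem: find a small set of actuator designs such that every joint is served by at least one design that meets its spec. The sketch below uses a greedy set-cover heuristic, which is a stand-in chosen for brevity and not the method described in the talk. The actuator and joint names are hypothetical.

```python
def common_portfolio(meets):
    """Greedy selection of a small actuator portfolio.

    meets: dict mapping actuator name -> set of joints whose spec it satisfies.
    Returns actuator names in the order they were picked.
    """
    remaining = set().union(*meets.values())
    portfolio = []
    while remaining:
        # Pick the design covering the most still-unserved joints;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(meets), key=lambda name: len(meets[name] & remaining))
        portfolio.append(best)
        remaining -= meets[best]
    return portfolio

if __name__ == "__main__":
    candidates = {
        "rot_small": {"wrist", "neck"},
        "rot_large": {"hip_yaw", "shoulder", "wrist"},
        "lin_large": {"knee", "hip_pitch"},
        "lin_small": {"ankle", "elbow"},
        "bespoke_knee": {"knee"},
    }
    print(common_portfolio(candidates))
```

Greedy covering does not guarantee the smallest possible portfolio, and a real study would also weigh the mass and energy penalty of using a shared design on a joint it is oversized for.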

The linear actuator has planetary rollers and an inverted planetary screw as a gear train, which allows efficiency and compactness and durability. In order to demonstrate the force capability of our linear actuators, we have set up an experiment to test it at its limits. I will let you enjoy the video. Our actuator is able to lift a half-ton, 9-foot concert grand piano. This is a requirement. It's not something nice to have, because our muscles can do the same when they are directly driven. When they are directly driven, our quadriceps muscles can do the same thing. It's just that the knee is an up-gearing linkage system that converts the force into velocity at the end effector of our heels, to give the human body agility.

This is one of the main things that are amazing about the human body. I'm concluding my part at this point, and I would like to welcome my colleague, Mike, who's going to talk to you about hand design. Thank you very much.

Mike Johnson

Mechanical Design Engineer, Tesla

Thanks, Konstantinos. We just saw how powerful a human and a humanoid actuator can be. However, humans are also incredibly dexterous. The human hand has the ability to move at 300 degrees per second. There's tens of thousands of tactile sensors, and it has the ability to grasp and manipulate almost every object in our daily lives. For our robotic hand design, we were inspired by biology. We have five fingers and an opposable thumb. Our fingers are driven by metallic tendons that are both flexible and strong. We have the ability to complete wide aperture power grasps while also being optimized for precision gripping of small, thin, and delicate objects. Why a human-like robotic hand? Well, the main reason is that our factories and the world around us are designed to be ergonomic.

What that means is that it ensures that objects in our factory are graspable, but it also ensures that new objects that we may have never seen before can be grasped by the human hand and by our robotic hand as well. The converse there is pretty interesting because it's saying that these objects are designed to our hand instead of having to make changes to our hand to accommodate a new object. Some basic stats about our hand are that it has six actuators and 11 degrees of freedom. It has an in-hand controller which drives the fingers and receives sensor feedback. Sensor feedback is really important to learn a little bit more about the objects that we're grasping and also for proprioception, and that's the ability for us to recognize where our hand is in space.

One of the important aspects of our hand is that it's adaptive. This adaptability essentially involves complex mechanisms that allow the hand to adapt to the object that's being grasped. Another important part is that we have a non-backdrivable finger drive. This clutching mechanism allows us to hold and transport objects without having to turn on the hand motors. You just heard how we went about designing the Tesla Bot hardware. Now I'll hand it off to Milan and our autonomy team to bring this robot to life.

Milan Kovac

Director of Autopilot Software, Tesla

Thanks, Mike. All right. All those cool things we've shown earlier in the video were possible just in a matter of a few months, thanks to the amazing work that we've done on Autopilot over the past few years. Most of those components ported quite easily over to the bot's environment. If you think about it, we're just moving from a robot on wheels to a robot on legs. Some of the components are pretty similar and some others require more heavy lifting. For example, our computer vision neural networks were ported directly from Autopilot to the bot's situation. It's exactly the same occupancy network, which we'll dive into in a little more detail later with the Autopilot team, that is now running on the bot here in this video.

The only thing that changed really is the training data that we had to recollect. We're also trying to find ways to improve those occupancy networks, using work made on neural radiance fields to get really great volumetric rendering of the bot's environments. For example, here, some machinery that the bot might have to interact with. Another interesting problem to think about is, in indoor environments, mostly without a GPS signal, how do you get the bot to navigate to its destination? Say, for instance, to find its nearest charging station. We've been training more neural networks to identify high-frequency features, key points within the bot's camera streams, and track them across frames over time as the bot navigates through its environment.

We're using those points to get a better estimate of the bot's pose and trajectory within its environment as it's walking. We also did quite some work on the simulation side, and this is literally the Autopilot simulator into which we've integrated the robot's locomotion code. This is a video of the motion control code running in the Autopilot simulator, showing the evolution of the robot's walk over time. As you can see, we started quite slowly in April and started accelerating as we unlocked more joints and deployed more advanced techniques like arms balancing over the past few months. Locomotion is specifically one component that's very different as we're moving from the car to the bot's environment. I think it warrants a little bit more depth, and I'd like my colleagues to start talking about this now.
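The idea of estimating pose from tracked key points can be illustrated with a heavily simplified planar case. The sketch assumes the tracked points have already been converted to 2D floor-plane coordinates in two consecutive frames, and recovers the least-squares rotation and translation between the two sets. The function name and this 2D setup are assumptions for illustration only; the talk describes neural networks operating on camera streams, which this does not attempt to reproduce.

```python
import math

def estimate_motion_2d(prev_pts, curr_pts):
    """Least-squares rigid transform (theta, tx, ty) mapping prev_pts onto curr_pts.

    prev_pts, curr_pts: equal-length lists of matched (x, y) points.
    """
    n = len(prev_pts)
    pcx = sum(p[0] for p in prev_pts) / n
    pcy = sum(p[1] for p in prev_pts) / n
    ccx = sum(c[0] for c in curr_pts) / n
    ccy = sum(c[1] for c in curr_pts) / n
    dot = 0.0
    cross = 0.0
    for (px, py), (cx, cy) in zip(prev_pts, curr_pts):
        ax, ay = px - pcx, py - pcy   # point relative to previous centroid
        bx, by = cx - ccx, cy - ccy   # same point relative to current centroid
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    # Translation carries the rotated previous centroid onto the current one.
    tx = ccx - (math.cos(theta) * pcx - math.sin(theta) * pcy)
    ty = ccy - (math.sin(theta) * pcx + math.cos(theta) * pcy)
    return theta, tx, ty

if __name__ == "__main__":
    before = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    after = [(2.0, -1.0), (2.0, 0.0), (1.0, -1.0)]  # rotated 90 degrees, then shifted
    print(estimate_motion_2d(before, after))
```

If the points are static landmarks seen from the robot, the robot's own motion is the inverse of this transform, and chaining the per-frame estimates gives a trajectory.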

Felix Sygulla

Robotics Engineer, Tesla

Thank you, Milan. Hi, everyone. I'm Felix. I'm a robotics engineer on the project, and I'm gonna talk about walking. Walking seems easy, right? People do it every day. You don't even have to think about it. But there are some aspects of walking which are challenging from an engineering perspective. For example, physical self-awareness. That means having a good representation of yourself. What is the length of your limbs? What is the mass of your limbs? What is the size of your feet? All that matters. Also, having an energy-efficient gait. You can imagine there's different styles of walking, and not all of them are equally efficient. Most important, keep balance, don't fall. Of course, also coordinate the motion of all of your limbs together. Now, humans do all of this naturally, but as engineers or roboticists, we have to think about these problems.

Therefore, I'm gonna show you how we address them in our locomotion planning and control stack. We start with locomotion planning and our representation of the bot. That means a model of the robot's kinematics, dynamics, and the contact properties. Using that model and the desired path for the bot, our locomotion planner generates reference trajectories for the entire system. This means feasible trajectories with respect to the assumptions of our model. The planner currently works in three stages. It starts planning footsteps and ends with the entire motion for the system. Let's dive a little bit deeper in how this works. In this video, we see footsteps being planned over a planning horizon following the desired path. We start from this and add, then, foot trajectories that connect these footsteps using toe-off and heel strike just as humans do.

This gives us larger stride and less knee bend for higher efficiency of the system. The last stage is then finding a center of mass trajectory, which gives us a dynamically feasible motion of the entire system to keep balance. As we all know, plans are good, but we also have to realize them in reality. Let's see how we can do this.
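The first two planner stages described above can be sketched in a toy form: place alternating footsteps along a path, then connect them with a swing-foot height profile. The straight-line path, the parabolic swing shape, and every name and number here are assumptions for illustration. Toe-off, heel strike, and the center of mass stage are not modeled.

```python
def plan_footsteps(path_length, stride, step_width):
    """Stage 1: alternating left/right foot placements along a straight path.

    Returns a list of (x, y, foot) tuples with x along the path and y lateral.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    steps = []
    x = 0.0
    left = True
    while x < path_length:
        # The last step is shortened so the robot stops exactly at the goal.
        x = min(x + stride, path_length)
        y = step_width / 2.0 if left else -step_width / 2.0
        steps.append((x, y, "left" if left else "right"))
        left = not left
    return steps

def swing_height(phase, clearance):
    """Stage 2: foot height during a swing, as a parabola peaking at mid-swing.

    phase: 0.0 at lift-off, 1.0 at touch-down.
    """
    return 4.0 * clearance * phase * (1.0 - phase)

if __name__ == "__main__":
    for step in plan_footsteps(path_length=1.0, stride=0.4, step_width=0.2):
        print(step)
    print([round(swing_height(p / 4, 0.05), 4) for p in range(5)])
```

A third stage would then search for a center of mass trajectory that keeps the whole motion dynamically feasible over these footsteps.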

Anand Tanavade

Staff Software Engineer, Tesla

Thank you, Felix. Hello, everyone. My name is Anand, and I'm gonna talk to you about controls. Let's take the motion plan that Felix just talked about and put it in the real world on a real robot. Let's see what happens. It takes a couple steps and falls down. Well, that's a little disappointing, but we are missing a few key pieces here which will make it walk. Now, as Felix mentioned, the motion planner is using an idealized version of itself and a version of reality around it. This is not exactly correct. It also expresses its intention through trajectories and wrenches of forces and torques, that it wants to exert on the world to locomote. Reality is way more complex than any similar model. Also, the robot is not simplified. It's got vibrations and modes, compliance, sensor noise, and on and on and on.

What does that do to the real world when you put the bot in the real world? Well, the unexpected forces cause unmodeled dynamics, which essentially the planner doesn't know about, and that causes destabilization, especially for a system that is dynamically stable, like bipedal locomotion. What can we do about it? Well, we measure reality. We use sensors and our understanding of the world to do state estimation. Here you can see the attitude and pelvis pose, which is essentially the vestibular system in a human, along with the center of mass trajectory being tracked when the robot's walking in the office environment. Now we have all the pieces we need in order to close the loop.

We use our better bot model, we use the understanding of reality that we've gained through state estimation, and we compare what we want versus what we expect that reality is doing to us in order to add corrections to the behavior of the robot. Here, the robot certainly doesn't appreciate being poked, but it does an admirable job of staying upright. The final point here is a robot that walks is not enough. We need it to use its hands and arms to be useful. Let's talk about manipulation.
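
Closing the loop, as described above, means comparing the planned state with the estimated state and adding a correction. A minimal sketch, assuming a one-dimensional point mass and a proportional-derivative correction (the gains, mass, and the size of the "poke" are made-up values, and the real controller is certainly more involved):

```python
def corrected_command(planned_force, desired_pos, desired_vel, est_pos, est_vel,
                      kp=200.0, kd=20.0):
    """Add a PD correction on top of the planner's feed-forward force."""
    return planned_force + kp * (desired_pos - est_pos) + kd * (desired_vel - est_vel)


def simulate(use_feedback, steps=2000, dt=0.001, mass=1.0):
    """Simulate 2 seconds after a poke that leaves the body with unplanned velocity."""
    pos, vel = 0.0, 0.5
    for _ in range(steps):
        if use_feedback:
            force = corrected_command(0.0, 0.0, 0.0, pos, vel)
        else:
            force = 0.0  # open loop: the plan knows nothing about the poke
        vel += (force / mass) * dt
        pos += vel * dt
    return pos


drift_open_loop = simulate(use_feedback=False)
drift_closed_loop = simulate(use_feedback=True)
```

Without feedback the disturbance is never removed and the position keeps drifting; with the correction term the state returns close to the plan.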

Eric Earley

Robotics Engineer, Tesla

Hi, everyone. My name's Eric, robotics engineer on Tesla Bot, and I wanna talk about how we've made the robot manipulate things in the real world. We want it to manipulate objects while looking as natural as possible, and also get there quickly. What we've done is we've broken this process down into two steps. First is generating a library of natural motion references, or we could call them demonstrations, and then we've adapted these motion references online to the current real-world situation. Let's say we have a human demonstration of picking up an object. We can get a motion capture of that demonstration, which is visualized right here as a bunch of key frames representing the locations of the hands, the elbows, the torso. We can map that to the robot using inverse kinematics.

If we collect a lot of these, now we have a library that we can work with. But a single demonstration is not generalizable to the variation in the real world. For instance, this would only work for a box in a very particular location. What we've also done is run these reference trajectories through a trajectory optimization program, which solves for where the hand should be and how the robot should balance when it needs to adapt the motion to the real world. For instance, if the box is in this location, then our optimizer will create this trajectory instead. Next, Milan is going to talk about what's next for the Optimus, Tesla Bot. Thanks.
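
To make the idea of adapting a reference motion concrete, here is a much simpler stand-in for the trajectory optimization described above: a linear blend that keeps the start of the demonstrated hand path and shifts the end onto a new box location. The function name and the sample key frames are hypothetical, and the blend ignores balance and joint limits, which the real optimizer is said to handle.

```python
def adapt_reference(reference, new_goal):
    """Warp a reference hand path so it ends at new_goal while keeping its start."""
    n = len(reference)
    if n < 2:
        raise ValueError("reference needs at least two key frames")
    # Offset between where the demonstration ended and where the box is now.
    offset = tuple(g - r for g, r in zip(new_goal, reference[-1]))
    adapted = []
    for i, point in enumerate(reference):
        s = i / (n - 1)  # 0 at the first key frame, 1 at the last
        adapted.append(tuple(p + s * o for p, o in zip(point, offset)))
    return adapted


# Hypothetical hand key frames (x, y, z) in meters from one demonstration.
reference = [(0.0, 0.0, 1.0), (0.2, 0.0, 0.9), (0.4, 0.0, 0.8)]
adapted = adapt_reference(reference, new_goal=(0.4, 0.3, 0.8))
```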

Milan Kovac

Director of Autopilot Software, Tesla

Thanks, Eric. Right. Hopefully by now you guys got a good idea of what we've been up to over the past few months. We started with something that's usable, but it's far from being useful. There's still a long and exciting road ahead of us.

I think the first thing within the next few weeks is to get Optimus at least at par with Bumble C, the other bot prototype you saw earlier, and probably beyond. We are also going to start focusing on the real use case at one of our factories and really gonna try to nail this down and iron out all the elements needed to deploy this product in the real world. I was mentioning earlier, you know, indoor navigation, graceful fall management or even servicing, all components needed to scale this product up. I don't know about you, but after seeing what we've shown tonight, I'm pretty sure we can get this done within the next few months or years, and make this product a reality and change the entire economy.

I would like to thank the entire Optimus team for all their hard work over the past few months. I think it's pretty amazing. All of this was done in barely six or eight months. Thank you very much.

Ashok Elluswamy

Director of Autopilot Software, Tesla

Hey, everyone. Hi, I'm Ashok Elluswamy. I lead the Autopilot team alongside Milan. God, it's gonna be so hard to top that Optimus section. We'll try nonetheless. Anyway, every Tesla that has been built over the last several years, we think has the hardware to make the car drive itself. We have been working on the software to add higher and higher levels of autonomy. This time around last year, we had roughly 2,000 cars driving our FSD Beta software. Since then, we have significantly improved the software's robustness and capability, to the point that we have now shipped it to 160,000 customers as of today. Thank you. This does not come for free. It came from the sweat and blood of the engineering team over the last year. For example, we trained 75,000 neural network models just last year.

That's roughly a model every eight minutes, you know, coming out of the team. We evaluate them on our large clusters, and then we ship 281 of those models that actually improve the performance of the car. This pace of innovation is happening throughout the stack. The planning software, the infrastructure, the tools, even hiring, everything is progressing to the next level. The FSD Beta software is quite capable of driving the car. It should be able to navigate from parking lot to parking lot, handling city street driving, stopping for traffic lights and stop signs, negotiating with objects at intersections, making turns, and so on. All of this comes from the camera streams that go through our neural networks that run on the car itself. It's not coming back to the server or anything.
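
As a quick check of the training cadence quoted above, assuming the 75,000 models were spread evenly over a 365-day year (an assumption made only for this arithmetic):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

models_trained = 75_000
models_shipped = 281

minutes_per_model = MINUTES_PER_YEAR / models_trained
ship_rate = models_shipped / models_trained

print(f"one model every {minutes_per_model:.1f} minutes")
print(f"{ship_rate:.2%} of trained models shipped")
```

The result lands at roughly seven minutes per model, the same order as the figure quoted on stage, and well under one percent of trained models end up shipping.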

It runs on the car and produces all the outputs, to form the world model around the car, and the planning software drives the car based on that. Today, we'll go into a lot of the components that make up the system. The occupancy network acts as the base geometry layer of the system. This is a multi-camera video neural network that, from the images, predicts the full physical occupancy of the world around the robot. So anything that's physically present, trees, walls, buildings, cars, balls, what have you, if it's physically present, it predicts them along with their future motion. On top of this base level of geometry, we have more semantic layers. In order to navigate the roadways, we need the lanes, of course. The roadways have lots of different lanes, and they connect in all kinds of ways.

It's actually a really difficult problem for typical computer vision techniques to predict the set of lanes and their connectivities. We reach all the way into language technologies and then pull the state-of-the-art from other domains, and not just computer vision, to make this task possible. For vehicles, we need their full kinematic state to control for them. All of this directly comes from neural networks. Video streams, raw video streams, come into the networks, go through a lot of processing, and then output the full kinematic state, like positions, velocities, acceleration, jerk, all of that directly comes out of the networks with minimal post-processing. That's really fascinating to me because how is this even possible? What world do we live in that this magic is possible, that these networks predict fourth derivatives of these positions, and people thought we couldn't even detect these objects?
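
For readers less familiar with the kinematic quantities listed above: velocity, acceleration, and jerk are the first, second, and third time derivatives of position. The talk says the networks output these directly; the sketch below only illustrates what the quantities mean by differencing a made-up position signal, and is not how the networks compute them.

```python
def finite_difference(samples, dt):
    """Forward difference: approximate the time derivative of evenly spaced samples."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


dt = 0.1
times = [i * dt for i in range(8)]
position = [t ** 3 for t in times]  # hypothetical signal whose jerk is constant (6)

velocity = finite_difference(position, dt)
acceleration = finite_difference(velocity, dt)
jerk = finite_difference(acceleration, dt)
```

Each differencing step loses one sample and amplifies noise, which is one reason estimating higher derivatives from raw detections in post-processing is hard.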

My opinion is that it does not come for free. It requires tons of data, so we had to build sophisticated auto-labeling systems that churn through raw sensor data, run a ton of offline compute on the servers. It can take a few hours, run expensive neural networks, distill the information into labels that train our in-car neural networks. On top of this, we also use our simulation system to synthetically create images, and since it's a simulation, we trivially have all the labels. All of this goes through a well-oiled data engine pipeline, where we first train a baseline model with some data, ship it to the car, see what the failures are. Once we know the failures, we mine the fleet for the cases where it fails, provide the correct labels, and add the data to the training set.

This process systematically fixes the issues, and we do this for every task that runs in the car.

Milan Kovac

Director of Autopilot Software, Tesla

Yeah, to train these new massive neural networks, this year we expanded our training infrastructure by roughly 40%-50%. That sits us at about 14,000 GPUs today across multiple training clusters in the United States.

Ashok Elluswamy

Director of Autopilot Software, Tesla

We also worked on our AI compiler, which now supports new operations needed by those neural networks and maps them to the best of our underlying hardware resources. Our inference engine today is capable of distributing the execution of a single neural network across two independent systems-on-chip, essentially two independent computers interconnected within the same Full Self-Driving computer. To make this possible, we had to keep a tight control on the end-to-end latency of this new system. We deployed more advanced scheduling code across the full FSD platform. All of these neural networks running in the car together produce the vector space, which is again the model of the world around the robot or the car.

The planning system operates on top of this, coming up with trajectories that avoid collisions, are smooth, make progress towards the destination using a combination of model-based optimization, plus neural network that helps optimize it to be really fast. Today, we are really excited to present progress on all of these areas. We have the engineering leads standing by to come in and explain these various blocks. These power not just the car, but the same components also run on the Optimus robot that Milan showed earlier. With that, I welcome Paril to start talking about the planning section.

Paril Jain

Engineering Lead and Manager of AI, Tesla

Hi, all. I'm Paril Jain. Let's use this intersection scenario to dive straight into how we do the planning and decision-making in Autopilot. We are approaching this intersection from a side street, and we have to yield to all the crossing vehicles. Right as we are about to enter the intersection, the pedestrian on the other side of the intersection decides to cross the road without a crosswalk. Now, we need to yield to this pedestrian, yield to the vehicles from the right, and also understand the relation between the pedestrian and the vehicle on the other side of the intersection. It's a lot of these inter-object dependencies that we need to resolve in a quick glance. Humans are really good at this.

We look at a scene, understand all the possible interactions, evaluate the most promising ones, and generally end up choosing a reasonable one. Let's look at a few of these interactions that the Autopilot system evaluated. We could have gone in front of this pedestrian with a very aggressive longitudinal and lateral profile. Now, obviously, we are being a jerk to the pedestrian, and we would spook the pedestrian and execute it. We could have moved forward slowly, sought for a gap between the pedestrian and the vehicle from the right. Again, we are being a jerk to the vehicle coming from the right, but you should not outright reject this interaction in case this is the only safe interaction available. Lastly, the interaction we ended up choosing, stay slow initially, find the reasonable gap, and then finish the maneuver after all the agents pass.

Evaluation of all of these interactions is not trivial, especially when you care about modeling the higher-order derivatives for other agents. For example, what is the longitudinal jerk required by the vehicle coming from the right when you insert in front of it? Relying purely on collision checks with marginal predictions will only get you so far because you will miss out on a lot of valid interactions. This basically boils down to solving a multi-agent joint trajectory planning problem over the trajectories of ego and all the other agents. However much you optimize, there's gonna be a limit to how fast you can run this optimization problem. It will be on the order of 10 milliseconds, even after a lot of incremental approximations.

Now, for a typical crowded unprotected left, say you have more than 20 objects, each object having multiple different future modes, the number of relevant interaction combinations will blow up. The planner needs to make a decision every 50 milliseconds. How do we solve this in real-time? We rely on a framework we call interaction search, which is basically a parallelized search over a bunch of maneuver trajectories. The state space here corresponds to the kinematic state of ego, the kinematic state of other agents, the nominal future multi-modal predictions, and all the static entities in the scene. The action space is where things get interesting. We use a set of maneuver trajectory candidates to branch over a bunch of interaction decisions and also incremental goals for a longer horizon maneuver.
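
The scale of the problem can be put in numbers. The sketch below takes the 20 objects, the roughly 10 millisecond joint optimization, and the 50 millisecond decision period from the talk; the choice of three future modes per object is an assumption, since the talk only says "multiple", and the count is an upper bound because not every combination is a relevant interaction.

```python
def joint_combinations(num_objects, modes_per_object):
    """Upper bound on joint futures if every object independently picks one mode."""
    return modes_per_object ** num_objects


decision_period_ms = 50   # planner must decide this often
joint_solve_ms = 10       # approximate cost of one joint optimization

max_solves_per_cycle = decision_period_ms // joint_solve_ms
combos = joint_combinations(num_objects=20, modes_per_object=3)

print(f"{combos:,} joint futures vs. {max_solves_per_cycle} solves per cycle")
```

With only a handful of full joint solves available per cycle, exhaustive enumeration is out of reach, which motivates a search that branches selectively.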

Let's walk through this search very quickly to get a sense of how it works. We start with a set of vision measurements, namely lanes, occupancy, moving objects. These get represented as sparse abstractions as well as latent features. We use this to create a set of goal candidates, lanes, again, from the lanes network, or unstructured regions which correspond to a probability mask derived from human demonstrations. Once we have a bunch of these goal candidates, we create seed trajectories using a combination of classical optimization approaches as well as our network planner, again, trained on data from the customer fleet. Now, once we get a bunch of these seed trajectories, we use them to start branching on the interactions. We find the most critical interaction.

In our case, this would be the interaction with respect to the pedestrian, whether we assert in front of it or yield to it. Obviously, the option on the left is a high-penalty option. It likely won't get prioritized. So we branch further onto the option on the right, and that's where we bring in more and more complex interactions, building this optimization problem incrementally with more and more constraints. The tree search keeps flowing, branching on more interactions, branching on more goals. Now, a lot of the tricks here lie in the evaluation of each node of the tree search. Inside each node, initially we started with creating trajectories using classical optimization approaches, where the constraints like I described would be added incrementally. This would take close to 1-5 milliseconds per action.

Now, even though this is a fairly good number, when you want to evaluate more than 100 interactions, this does not scale. We ended up building lightweight queryable networks that we can run in the loop of the planner. These networks are trained on human demonstrations from the fleet, as well as offline solvers with relaxed time limits. With this, we were able to bring the runtime down to close to 100 microseconds per action. Now, doing this alone is not enough because you still have this massive search that you need to go through, and you need to efficiently prune the search space. You need to do scoring on each of these trajectories. A few of these are fairly standard. You do a bunch of collision checks, you do a bunch of comfort analysis.
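
The per-action timings quoted above can be checked against the 50 millisecond planning cycle mentioned earlier. The numbers come from the talk; evaluating the actions one after another, with no parallelism, is an assumption made to keep the arithmetic simple.

```python
PLANNING_CYCLE_US = 50_000  # 50 ms decision period, in microseconds


def total_eval_us(num_actions, us_per_action):
    """Total time to evaluate all actions sequentially."""
    return num_actions * us_per_action


classical_best_us = total_eval_us(100, 1_000)   # 1 ms per action
classical_worst_us = total_eval_us(100, 5_000)  # 5 ms per action
learned_us = total_eval_us(100, 100)            # 100 microseconds per action

for name, t in [("classical (best)", classical_best_us),
                ("classical (worst)", classical_worst_us),
                ("lightweight network", learned_us)]:
    verdict = "fits" if t <= PLANNING_CYCLE_US else "exceeds"
    print(f"{name}: {t / 1000:.0f} ms, {verdict} the 50 ms cycle")
```

Even the best classical case takes twice the cycle time for 100 actions, while the lightweight networks leave most of the cycle free for scoring and pruning.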

What is the jerk and accels required for a given maneuver? The customer fleet data plays an important role here again. We run two sets of, again, lightweight queryable networks, both really augmenting each other. One of them trained from interventions from the FSD Beta fleet, which gives a score on how likely is a given maneuver to result in interventions over the next few seconds. Second, which is purely on human demonstrations, human-driven data, giving a score on how close is your given selected action to a human-driven trajectory. The scoring helps us prune the search space, keep branching further on the interactions, and focus the compute on the most promising outcomes.
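
A toy version of branching on interactions with scoring and pruning might look like the following. It is a plain best-first search, not the parallelized interaction search described in the talk, and the agent names, penalty values, and pruning threshold are all invented for illustration. In the real system the scores would come from collision checks, comfort analysis, and the two learned scorers.

```python
import heapq

# Hypothetical penalty for each decision about each agent.
PENALTY = {
    "pedestrian": {"assert": 10.0, "yield": 1.0},
    "car_from_right": {"assert": 4.0, "yield": 1.5},
    "car_from_left": {"assert": 3.0, "yield": 0.5},
}


def interaction_search(agents, penalty, prune_above=8.0):
    """Branch on one agent at a time, most critical first, expanding the cheapest node."""
    frontier = [(0.0, ())]  # (accumulated penalty, decisions made so far)
    expanded = 0
    while frontier:
        cost, decisions = heapq.heappop(frontier)
        if len(decisions) == len(agents):
            return cost, dict(zip(agents, decisions)), expanded
        expanded += 1
        agent = agents[len(decisions)]
        for choice in ("assert", "yield"):
            new_cost = cost + penalty[agent][choice]
            if new_cost <= prune_above:  # drop high-penalty branches early
                heapq.heappush(frontier, (new_cost, decisions + (choice,)))
    return None  # every branch was pruned


agents = ["pedestrian", "car_from_right", "car_from_left"]
best_cost, best_plan, nodes_expanded = interaction_search(agents, PENALTY)
```

Asserting in front of the pedestrian is pruned at the first branch, so the search never spends compute on anything below it, which is the behavior the scoring step is meant to produce.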

The cool part about this architecture is that it allows us to create a cool blend between data-driven approaches, where you don't have to rely on a lot of hand-engineered costs, but also grounded in reality with physics-based checks. Now, a lot of what I described was with respect to the agents we could observe in the scene, but the same framework extends to objects behind occlusions. We use the video feed from eight cameras to generate the 3D occupancy of the world. The blue mask here corresponds to what we call the visibility region. It basically gets blocked at the first occlusion you see in the scene. We consume this visibility mask to generate what we call ghost objects, which you can see on the top left.

Now, if you model the spawn regions and the state transitions of these ghost objects correctly, if you tune your control response as a function of their existence likelihood, you can extract some really nice human-like behaviors. Now, I'll pass it on to Phil to describe more on how we generate these occupancy networks. Thank you.

Phil Duan

Director of Engineering, Tesla

Hey, guys. My name is Phil. I will share the details of the occupancy network we built over the past year. This network is our solution to model the physical world in 3D around our cars, and it is currently not shown in our customer-facing visualization. What you will see here is the raw network output from our internal dev tool. The occupancy network takes video streams of all our eight cameras as input, produces a single unified volumetric occupancy in vector space directly. For every 3D location around our car, it predicts the probability of that location being occupied or not. Since it has video context, it is capable of predicting obstacles that are occluded instantaneously. For each location, it also produces a set of semantics such as curb, car, pedestrian, and road debris, as color-coded here. Occupancy flow is also predicted for motion.

Since the model is a generalized network, it does not distinguish static and dynamic objects explicitly. It is able to produce and model the random motions such as a swerving trailer here. This network is currently running in all Teslas with FSD computers, and it is incredibly efficient, running about every 10 milliseconds with our neural net accelerator. How does this work? Let's take a look at the architecture. First, we rectify each camera image with the camera calibration. The images we're showing here, which we're giving to the network, are actually not the typical 8-bit RGB images. As you can see from the first image on top, we're giving the 12-bit raw photon count imagery to the network.

Since it has four bits more information, it has 16x better dynamic range, as well as reduced latency, since we don't have to run ISP in the loop anymore. We use a set of RegNet and BiFPNs as a backbone to extract image space features. Next, we construct a set of 3D position queries that, along with the image space features as keys and values, are fed into an attention module. The output of the attention module is high-dimensional spatial features. These spatial features are aligned temporally using vehicle odometry to derive motion. Last, these spatial temporal features go through a set of deconvolutions to produce the final occupancy and occupancy flow output. They're formed as a fixed-size voxel grid, which might not be precise enough for planning and control.
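
The attention step described above, where a 3D position query attends over image-space features used as keys and values, can be sketched with plain scaled dot-product attention. This is a generic single-query, single-head version in pure Python; the tiny two-dimensional vectors are made up, and the real module's shapes, heads, and positional encodings are not given in the talk.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of key/value pairs."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# One hypothetical 3D-position query and two image-feature locations.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attend(query, keys, values)
```

The query pulls most of its output from the image location whose key it matches best, which is how each 3D location gathers evidence from the relevant pixels.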

In order to get a higher resolution, we also produce per-voxel feature maps, which we feed into an MLP with 3D spatial point queries to get position and semantics at any arbitrary location. After knowing the model better, let's take a look at another example. Here, we have an articulated bus parked on the right side of the road, highlighted as an L-shaped voxel here. As we approach, the bus starts to move. The front of the bus turns blue first, indicating the model predicts the front of the bus has a non-zero occupancy flow. As the bus keeps moving, the entire bus turns blue, and you can also see that the network predicts the precise curvature of the bus.
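
Querying a voxel grid at an arbitrary continuous location can be illustrated with trilinear interpolation. Note that this is a substitute technique chosen for simplicity: the talk describes feeding per-voxel feature maps and point queries into an MLP, whereas the sketch below just interpolates a scalar occupancy value between the eight surrounding voxel corners. The grid values are made up.

```python
def trilinear_query(grid, x, y, z):
    """Interpolate grid[i][j][k] at continuous voxel coordinates (x, y, z).

    Assumes every grid dimension has at least two samples.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Clamp the query to the grid extent.
    x = min(max(x, 0.0), nx - 1)
    y = min(max(y, 0.0), ny - 1)
    z = min(max(z, 0.0), nz - 1)
    i0, j0, k0 = min(int(x), nx - 2), min(int(y), ny - 2), min(int(z), nz - 2)
    fx, fy, fz = x - i0, y - j0, z - k0
    value = 0.0
    for di, wx in ((0, 1 - fx), (1, fx)):
        for dj, wy in ((0, 1 - fy), (1, fy)):
            for dk, wz in ((0, 1 - fz), (1, fz)):
                value += wx * wy * wz * grid[i0 + di][j0 + dj][k0 + dk]
    return value


# Hypothetical 2x2x2 occupancy grid: free at x = 0, occupied at x = 1.
grid = [[[0.0, 0.0], [0.0, 0.0]],
        [[1.0, 1.0], [1.0, 1.0]]]
quarter = trilinear_query(grid, 0.25, 0.5, 0.5)
corner = trilinear_query(grid, 1.0, 1.0, 1.0)
```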

Well, this is a very complicated problem for a traditional object detection network, as you have to see whether I'm gonna use one cuboid or perhaps two to fit in the curvature. For the occupancy network, since all we care about is the occupancy in the visible space, we'll be able to model the curvature precisely. Besides the voxel grid, the occupancy network also produces a drivable surface. The drivable surface has both 3D geometry and semantics that are very useful for control, especially on hilly and curvy roads. The surface and the voxel grid are not predicted independently. Instead, the voxel grid actually aligns with the surface implicitly. Here, we're at a hill crest where you can see the 3D geometry of the surface being predicted nicely.

Planner can use this information to decide perhaps we need to slow down more for the hill crest. As you can also see, the voxel grid aligns with the surface consistently. Besides the voxels and the surface, we're also very excited about the recent breakthrough in neural radiance field or NeRF. We're looking to both incorporate some of the latest NeRF features into occupancy network training, as well as using our network output as the input state for NeRF. As a matter of fact, Ashok is very excited about this. This has been his personal weekend project for a while.

Ashok Elluswamy

Director of Autopilot Software, Tesla

About these NeRFs, because I think, you know, academia is building a lot of these foundation models for language using like tons of large datasets for language. I think for vision, NeRFs are gonna provide the foundation models for computer vision because they are grounded in geometry, and geometry gives us a nice way to supervise these networks and frees us of the requirement to define an ontology. The supervision is essentially free because we just have to differentiably render these images. I think in the future, this occupancy network idea where, you know, images come in, and then the network produces a consistent volumetric representation of the scene that can then be differentiably rendered into any image that was observed, I personally think is the future of computer vision.

You know, we do some initial work on it right now, but I think in the future, both at Tesla and in academia, we will see that this combination of one-shot prediction of volumetric occupancy will be the future. That's my personal bet.

Phil Duan

Director of Engineering, Tesla

Thanks, Ashok. Here is an example early result of a 3D reconstruction from our fleet data. Instead of focusing on getting perfect RGB reprojection in 2D image space, our primary goal here is to accurately represent the world in 3D space for driving. We wanna do this for all our fleet data all over the world, in all weather and lighting conditions. Obviously, this is a very challenging problem, and we're looking for you guys to help. Finally, the occupancy network is trained with a large auto-labeled dataset without any human in the loop. With that, I'll pass to Tim to talk about what it takes to train this network.

Tim Zaman

Head of AI Infrastructure and Platform, Tesla

Thanks, Phil. All right. Hey, everyone. Let's talk about some training infrastructure. We've seen a couple of videos, you know, four or five. I think we care more and worry more about a lot more clips than that. We've been looking at the occupancy networks just from Phil's videos. It takes 1.4 billion frames to train that network, what you just saw. If you have 100,000 GPUs, it would take one hour. If you have 1 GPU, it would take 100,000 hours. That is not a human time period that you can wait for your training job to run, right? We wanna ship faster than that. That means you're going to need to go parallel. You need more compute for that. That means you're going to need a supercomputer.

This is why we've built in-house three supercomputers comprising 14,000 GPUs, where we use 10,000 GPUs for training and run 4,000 GPUs for auto labeling. All these videos are stored in 30 PB of a distributed managed video cache. You shouldn't think of our datasets as fixed, let's say, as you think of your ImageNet or something, you know, with like 1 million frames. You should think of it as a very fluid thing. We've got 500,000 of these videos flowing in and out of these clusters every single day. We track 400,000 of these kind of Python video instantiations every second. That's a lot of calls. We're gonna need to capture that in order to govern the retention policies of this distributed video cache.
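
The training-time figures above amount to a budget of about 100,000 GPU-hours for 1.4 billion frames. The sketch below works through that arithmetic for different cluster sizes. Perfect linear scaling is an assumption (the `efficiency` parameter is hypothetical and defaults to 1.0); real clusters lose some throughput to communication and input stalls.

```python
TOTAL_FRAMES = 1_400_000_000
GPU_HOURS = 100_000  # 1 GPU for 100,000 hours, or 100,000 GPUs for 1 hour


def wall_clock_hours(num_gpus, efficiency=1.0):
    """Training time if the work splits evenly across GPUs at the given efficiency."""
    return GPU_HOURS / (num_gpus * efficiency)


frames_per_gpu_hour = TOTAL_FRAMES / GPU_HOURS
single_gpu_years = wall_clock_hours(1) / (24 * 365)
training_cluster_hours = wall_clock_hours(10_000)

print(f"{frames_per_gpu_hour:,.0f} frames per GPU-hour")
print(f"1 GPU: about {single_gpu_years:.1f} years")
print(f"10,000 GPUs: about {training_cluster_hours:.0f} hours at perfect scaling")
```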

Underlying all of this is a huge amount of infra, all of which we build and manage in-house. You cannot just buy, you know, 14,000 GPUs and then 30 PB of flash NVMe, and just put it together and let's go train. It actually takes a lot of work, and I'm gonna go into a little bit of that. What you actually typically wanna do is you wanna take an accelerator, so that could be the GPU or Dojo, which we'll talk about later. Because that's the most expensive component, that's where you want to put your bottleneck. That is really complicated.

That means that your storage is going to need to have the size and the bandwidth to deliver all the data down into the nodes. These nodes need to have the right amount of CPU and memory capabilities to feed into your machine learning framework. This machine learning framework then needs to hand it off to your GPU, and then you can start training. You need to do so across hundreds or thousands of GPU in a reliable way, in lockstep, and in a way that's also fast. You're also going to need an interconnect. Extremely complicated. We'll talk more about Dojo in a second. First, I wanna take you through some optimizations that we've done on our cluster.

We're getting in a lot of videos, and video is very much unlike, let's say, training on images or text, which I think is very well-established. Video is quite literally a dimension more complicated. That's why we needed to go end-to-end from the storage layer down to the accelerator and optimize every single piece of that. Because we train on the photon count videos that come directly from our fleet, we train on those directly. We do not post-process those at all. The way it's just done is, we seek exactly to the frames we select for our batch. We load those in, including the frames that they depend on, so these are your I-frames or your key frames.
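
Seeking to exactly the frames in a batch, plus the frames they depend on, can be sketched as follows. The sketch assumes a simple stream in which every frame depends on all frames back to the nearest preceding key frame; real codecs can have more complicated dependency structures, and the function name and frame indices are hypothetical.

```python
import bisect


def frames_to_decode(selected, keyframes):
    """Return the sorted frame indices that must be read to decode `selected`.

    `keyframes` is a sorted list of key-frame (I-frame) indices.
    """
    needed = set()
    for frame in selected:
        pos = bisect.bisect_right(keyframes, frame) - 1
        if pos < 0:
            raise ValueError(f"frame {frame} precedes the first key frame")
        # Everything from the governing key frame up to the selected frame.
        needed.update(range(keyframes[pos], frame + 1))
    return sorted(needed)


keyframes = [0, 30, 60]
to_read = frames_to_decode(selected=[33, 61], keyframes=keyframes)
```

Only six frames are read for this batch instead of the whole clip, which is the point of seeking directly rather than decoding from the start.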

We package those up, move them into shared memory, move them into a double buffer on the GPU, and then use a hardware decoder that's only accelerated to actually decode the video. We do that on the GPU natively, and this is all in a very nice Python PyTorch extension. Doing so unlocked more than 30% training speed increase for the occupancy networks and freed up basically a whole CPU to do any other thing. You cannot just do training with just videos. Of course, you need some kind of a ground truth, and that is actually an interesting problem as well.

The objective for storing your ground truth is that you wanna make sure you get to your ground truth that you need in the minimal amount of file system operations and load in the minimal size of what you need in order to optimize for aggregate cross-cluster throughput because you should see a compute cluster as one big device which has internally fixed constraints and thresholds. For this, we rolled out a format that is native to us that's called Smol. We use this for our ground truth, our feature cache, and any inference outputs, so a lot of tensors that are in there. Just a cartoon here. Let's say this is the table that you wanna store, then that's how it would look if you rolled it out on disk.

What you do is you take anything you'd want to index on, so for example, video timestamps, you put those all in the header so that in your initial header read, you know exactly where to go on disk. Then if you have any tensors, you're going to try to transpose the dimensions to put a different dimension last as the contiguous dimension and then also try different types of compression. Then you check out which one was most optimal and then store that one. This is actually a huge tip if you do feature caching. Output from the machine learning network, rotate around the dimensions a little bit, you can get up to 20% increase in efficiency of storage.
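
The tip about rotating tensor dimensions before compression can be tried with the standard library alone. The sketch below is not the Smol format, whose details are not given in the talk; it simply serializes a small made-up matrix in two layouts, compresses each with zlib, and keeps whichever is smaller. Which layout wins depends entirely on the data.

```python
import struct
import zlib


def pack_float32(rows):
    """Serialize a 2D list row by row as little-endian float32."""
    return b"".join(struct.pack(f"<{len(row)}f", *row) for row in rows)


def best_layout(matrix, level=9):
    """Compress the matrix in both layouts and report the smaller one."""
    candidates = {
        "row_major": matrix,
        "transposed": [list(col) for col in zip(*matrix)],
    }
    sizes = {name: len(zlib.compress(pack_float32(rows), level))
             for name, rows in candidates.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes


# Hypothetical feature-cache tensor: 256 rows by 8 columns.
matrix = [[float((r * (c + 1)) % 17) for c in range(8)] for r in range(256)]
raw_size = len(pack_float32(matrix))
best, sizes = best_layout(matrix)
```

A production format would also record the chosen layout and codec in the header so a reader can undo the transpose after decompression.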

When you store that, we also order the columns by size so that all your small columns and small values are together, so that when you seek for a single value, you're likely to overlap with a read on more values which you'll use later so that you don't need to do another file system operation. I could go on and on. I just went on, touched on two projects that we have internally. This is actually part of a huge continuous effort to optimize the compute that we have in-house. Accumulating and aggregating through all these optimizations, we now train our occupancy networks twice as fast just because it's twice as efficient. Now if we add in bunch more compute and go parallel, we can now train this in hours instead of days.

With that, I'd like to hand it off to the biggest user of compute, John Emmons.

John Emmons

Lead of Autopilot Vision Team, Tesla

Hi, everybody. My name is John Emmons. I lead the Autopilot vision team. I'm gonna cover two topics with you today. The first is how we predict lanes, and the second is how we predict the future behavior of other agents on the road. In the early days of Autopilot, we modeled the lane detection problem as an image-based instance segmentation task. Our network was super simple, though. In fact, it was only capable of predicting lanes from a few different kinds of geometries. Specifically, it would segment the ego lane, it could segment adjacent lanes, and then it had some special casing for forks and merges. This simplistic modeling of the problem worked for highly structured roads like highways. Today we're trying to build a system that's capable of much more complex maneuvers.

Specifically, we wanna make left and right turns at intersections where the road topology can be quite a bit more complex and diverse. When we try to apply this simplistic modeling of the problem here, it just totally breaks down. Taking a step back for a moment, what we're trying to do here is to predict the sparse set of lane instances and their connectivity. What we wanna do is to have a neural network that basically predicts this graph, where the nodes are the lane segments and the edges encode the connectivities between these lanes. What we have is our lane detection neural network. It's made up of three components. In the first component, we have a set of convolutional layers, attention layers, and other neural network layers that encode the video streams from our 8 cameras on the vehicle and produce a rich visual representation.

We enhance this visual representation with a coarse, road-level map data, which we encode with a set of additional neural network layers that we call the lane guidance module. This map is not an HD map, but it provides a lot of useful hints about the topology of lanes inside of intersections, the lane counts on various roads, and a set of other attributes that help us. The first two components here produce a dense tensor that sort of encodes the world, but what we really wanna do is to convert this dense tensor into a sparse set of lanes and their connectivities. We approach this problem like an image captioning task, where the input is this dense tensor, and the output text is predicted into a special language that we developed at Tesla for encoding lanes and their connectivities.

In this language of lanes, the words and tokens are the lane positions in 3D space, and the ordering of the tokens, inferred modifiers in the tokens, encode the connective relationships between these lanes. By modeling the task as a language problem, we can capitalize on recent autoregressive architectures and techniques from the language community for handling the multimodality of the problem. We're not just solving the computer vision problem at Autopilot, we're also applying the state-of-the-art in language modeling and machine learning more generally. I'm now gonna dive into a little bit more detail of this language component. What I have depicted on the screen here is a satellite image which sort of represents the local area around the vehicle. The set of nodes and edges is what we refer to as the lane graph, and it's ultimately what we want to come out of this neural network.

We start with a blank slate. We're gonna wanna make our first prediction here at this green dot. This green dot's position is encoded as an index into a coarse grid which discretizes the 3D world. Now, we don't predict this index directly 'cause it would be too computationally expensive to do so. There's just too many grid points, and predicting a categorical distribution over this has implications at both training time and test time. Instead what we do is we discretize the world coarsely first. We predict a heat map over the possible locations, and then we latch in the most probable location. Conditioned on this, we then refine the prediction and get the precise point. Now we know where the position of this token is, but we don't know its type.
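
The coarse-then-refine position prediction described above can be sketched like this. The heat map values, the cell size, and the `refine_fn` callback are all hypothetical stand-ins: in the real network both the heat map and the refinement are predicted, while here the refinement is a fixed offset so the example is deterministic.

```python
def argmax_2d(heatmap):
    """Return (row, col) of the largest value in a 2D list."""
    best_row, best_col = 0, 0
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best_row][best_col]:
                best_row, best_col = r, c
    return best_row, best_col


def coarse_to_fine(coarse_heatmap, refine_fn, cell_size_m):
    """Latch the most probable coarse cell, then refine within that cell."""
    row, col = argmax_2d(coarse_heatmap)
    dx, dy = refine_fn(row, col)  # fractional offset inside the cell, each in [0, 1)
    return ((col + dx) * cell_size_m, (row + dy) * cell_size_m)


coarse_heatmap = [[0.1, 0.2, 0.1],
                  [0.1, 0.9, 0.3]]
point = coarse_to_fine(coarse_heatmap, refine_fn=lambda r, c: (0.25, 0.5), cell_size_m=4.0)

# Output sizes: one fine categorical vs. coarse heat map plus in-cell refinement,
# assuming a 1,000 x 1,000 fine grid split into 25 x 25 coarse cells.
fine_classes = 1000 * 1000
two_stage_classes = (1000 // 25) ** 2 + 25 * 25
```

Splitting the prediction this way shrinks the categorical outputs from a million classes to a couple of thousand in this example, which is the cost argument made in the talk.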

In this case, though, it's a beginning of a new lane, so we predict it as a start token. Because it's a start token, there's no additional attributes in our language. We then take the predictions from this first forward pass, and we encode them using a learned conditional embedding, which produces a set of tensors that we combine together, which is actually the first word in our language of lanes. We add this to the, you know, first position in our sentence here. We then continue this process by predicting the next lane point in a similar fashion. Now, this lane point is not the beginning of a new lane, it's actually a continuation of the previous lane, so it's a continuation token type. Now, it's not enough just to know that this lane is connected to the previously predicted lane.

We want to encode its precise geometry, which we do by regressing a set of spline coefficients. We then take this lane, we encode it again, and add it as the next word in the sentence. We continue predicting these continuation lanes until we get to the end of the prediction grid. We then move on to a different lane segment. You can see that cyan dot there. Now, it's not topologically connected to that pink point. It's actually forking off of that green point there. It's got a fork type. Fork tokens actually point back to previous tokens from which their fork originates. You can see here the fork point predictor is actually the index zero, so it's actually referencing back to tokens that it's already predicted, like you would in language.

We continue this process over and over again until we've enumerated all of the tokens in the lane graph, and then the network predicts the end of sentence token.
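
A rough sketch of how such a token sentence could be decoded into a lane graph is shown below. The token kinds (start, continuation, fork, end of sentence) and the fork back-reference follow the description above; the `LaneToken` class, its field names, and `decode_lane_graph` are assumptions made for illustration, and the spline coefficients mentioned in the talk are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LaneToken:
    kind: str                                   # "start", "continue", "fork", or "end"
    position: Optional[Tuple[float, float, float]] = None
    fork_from: Optional[int] = None             # index of an earlier token (fork only)


def decode_lane_graph(sentence: List[LaneToken]):
    """Turn a token sentence into (nodes, edges) of a lane graph."""
    nodes, edges = [], []
    previous = None
    for token in sentence:
        if token.kind == "end":                 # end-of-sentence token stops decoding
            break
        index = len(nodes)
        nodes.append(token.position)
        if token.kind == "continue":
            if previous is None:
                raise ValueError("continuation token without a preceding token")
            edges.append((previous, index))     # connected to the previous point
        elif token.kind == "fork":
            if token.fork_from is None or not 0 <= token.fork_from < index:
                raise ValueError("fork must reference an earlier token")
            edges.append((token.fork_from, index))  # points back to its origin
        elif token.kind != "start":
            raise ValueError(f"unknown token kind: {token.kind}")
        previous = index
    return nodes, edges


sentence = [
    LaneToken("start", (0.0, 0.0, 0.0)),
    LaneToken("continue", (0.0, 10.0, 0.0)),
    LaneToken("continue", (0.0, 20.0, 0.0)),
    LaneToken("fork", (3.0, 10.0, 0.0), fork_from=0),
    LaneToken("end"),
]
nodes, edges = decode_lane_graph(sentence)
```

In this example the fork token references index zero, mirroring the case described on stage where the fork originates from the first predicted point.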

Ashok Elluswamy

Director of Autopilot Software, Tesla

Yeah, I just want to note that the reason we do this is not just because we wanna build something complicated. It almost feels like a Turing-complete machine here with neural networks, though. It's that we tried simpler approaches, for example, trying to just segment the lanes along the road or something like that. But then the problem is, when there's uncertainty, say you cannot see the road clearly, and there could be two lanes or three lanes, and you can't tell, a simple segmentation-based approach would just draw both of them. It's kind of a 2.5-lane situation. The post-processing algorithm would hilariously fail when the predictions are such.

John Emmons

Lead of Autopilot Vision Team, Tesla

Yeah, the problems don't end there. I mean, you need to predict these connective lanes inside of intersections, which it's just not possible with the approach that Ashok's mentioning, which is why we had to upgrade to this sort of.

Ashok Elluswamy

Director of Autopilot Software, Tesla

Yeah, I mean, like, overlaps like this, segmentation would just go haywire. Even if you try very hard to, you know, put them on separate layers, it's just a really hard problem. Language offers a really nice framework for getting a sample from a posterior as opposed to, you know, trying to do all of this in post-processing. This doesn't actually stop at just Autopilot, right, John? This can be used for Optimus.

John Emmons

Lead of Autopilot Vision Team, Tesla

Yeah. You know, I guess they wouldn't be called lanes, but you could imagine, you know, sort of in this, you know, stage here, that you might have sort of paths that sort of, you know, encode the possible places that people could walk.

Ashok Elluswamy

Director of Autopilot Software, Tesla

Yeah. Basically, if you're in a factory or in a, you know, home setting, you can just ask the robot, "Okay, please route to the kitchen," or, "Please route to some location in the factory." Then we predict a set of pathways that would, you know, go through the aisles, take the robot, and say, "Okay, this is how you get to the kitchen." It just really gives us a nice framework to model these different paths that simplify the navigation problem for the downstream planner.

John Emmons

Lead of Autopilot Vision Team, Tesla

All right. Ultimately, what we get from this lane detection network is a set of lanes and their connectivities, which comes directly from the network. There's no additional step here for sparsifying these, you know, dense predictions into sparse ones. This is just a direct, unfiltered output of the network. Okay. I talked a little bit about lanes. I'm gonna briefly touch on how we model and predict the future paths and other semantics on objects. I'm just gonna go really quickly through two examples. The video on the right here, we've got a car that's actually running a red light and turning in front of us.

What we do to handle situations like this is we predict a set of short time horizon future trajectories on all objects. We can use these to anticipate the dangerous situation here and apply whatever, you know, braking and steering action is required to avoid a collision. In the video on the right, there's two vehicles in front of us. The one in the left lane is parked. Apparently it's being loaded, unloaded. I don't know why the driver decided to park there. The important thing is that our neural network predicted it, that it was stopped, which is the red color there. The vehicle in the other lane, as you notice, also is stationary, but that one's obviously just waiting for that red light to turn green.

Even though both objects are stationary and have zero velocity, it's the semantics that is really important here so that we don't get stuck behind that awkwardly parked car. Predicting all of these agent attributes presents some practical problems when trying to build a real-time system. We need to maximize the frame rate of our object detection stack so that Autopilot can quickly react to the changing environment. Every millisecond really matters here. To minimize the inference latency, our neural network is split into two phases. In the first phase, we identify the locations in 3D space where agents exist. In the second stage, we then pull out tensors at those 3D locations, append it with additional data that's on the vehicle, and then we, you know, do the rest of the processing.

This sparsification step allows the neural network to focus compute on the areas that matter most, which gives us superior performance for a fraction of the latency cost. Putting it all together, the Autopilot vision stack predicts more than just the geometry and kinematics of the world. It also predicts a rich set of semantics which enable safe and human-like driving. I'm now gonna hand things off to Sri, who will tell us how we run all these cool neural networks on our FSD computer. Thank you.
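
The two-phase split described above can be sketched as follows. This is only an illustration of the idea of detecting locations first and then processing features at those locations alone; the function names, the threshold, and the appended `extra` vehicle data are hypothetical.

```python
def detect_locations(occupancy, threshold=0.5):
    """Phase 1: return the grid cells whose agent score exceeds the threshold."""
    return [
        (r, c)
        for r, row in enumerate(occupancy)
        for c, score in enumerate(row)
        if score > threshold
    ]


def gather_agent_features(feature_map, locations, extra):
    """Phase 2 input: pull the feature vector at each detected cell and
    append additional per-vehicle data to it."""
    return [feature_map[r][c] + extra for r, c in locations]


occupancy = [[0.1, 0.9, 0.0], [0.7, 0.2, 0.05]]
# Toy feature map: each cell's features are just its own coordinates.
feature_map = [[[float(r), float(c)] for c in range(3)] for r in range(2)]

locations = detect_locations(occupancy)
agent_inputs = gather_agent_features(feature_map, locations, extra=[12.5])
```

Only two of the six cells reach the second phase here, which is where the latency saving comes from: the remaining processing scales with the number of agents rather than the size of the grid.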

Speaker 45

Hi, everyone. I'm Sri. Today I'm gonna give a glimpse of what it takes to run these FSD networks in the car and how we optimize for inference latency. Today I'm gonna focus just on the FSD lanes network that John just talked about. When we started this track, we wanted to know if we can run this FSD lanes network natively on the TRIP engine, which is our in-house neural network accelerator that we built in the FSD computer. When we built this hardware, we kept it simple, and we made sure it can do one thing ridiculously fast: dense dot products. This architecture is autoregressive and iterative, where it crunches through multiple attention blocks in the inner loop, producing sparse points directly at every step.

The challenge here was how can we do this sparse point prediction and sparse computation on a dense dot product engine? Let's see how we did this on the TRIP. The network predicts the heat map of most probable spatial locations of the point. Now, we do an argmax and a one-hot operation, which gives the one-hot encoding of the index of the spatial location. Now, we need to select the embedding associated with this index from an embedding table that is learned during training. To do this on TRIP, we actually built a lookup table in SRAM, and we engineered the dimensions of this embedding such that we could achieve all of this with just matrix multiplication.
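
The reason a one-hot vector turns a table lookup into a matrix multiplication can be shown in a few lines. This is a plain-Python sketch of the general identity, not the TRIP implementation; the table values and helper names are made up for the example.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def row_times_matrix(vec, matrix):
    """Row vector times matrix, written as explicit dense dot products."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(cols)]


heat_map = [0.05, 0.10, 0.70, 0.15]          # scores over four spatial locations
embedding_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]

idx = argmax(heat_map)
selected = row_times_matrix(one_hot(idx, len(heat_map)), embedding_table)
```

Multiplying by the one-hot vector zeroes out every row except the selected one, so `selected` equals `embedding_table[idx]` without any indexing operation.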

Not just that, we also wanted to store this embedding into a token cache so that we don't recompute this for every iteration, rather reuse it for future point prediction. Again, we pulled some tricks here where we did all these operations just on the dot product engine. It's actually cool that our team found creative ways to map all these operations on the TRIP engine in ways that were not even imagined when this hardware was designed. That's not the only thing we had to do to make this work. We actually implemented a whole lot of operations and features to make this model compilable, to improve the int8 accuracy, as well as to optimize performance. All of these things helped us run this 75 million parameter model in just under 10 milliseconds of latency, consuming just 8 watts of power.
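
The token cache idea can be sketched as below: each token is embedded once when it is predicted, and later steps reuse the stored embeddings instead of recomputing them. `TokenCache`, `decode`, and the toy `predict_next` and `embed_fn` callables are hypothetical and only illustrate the caching pattern.

```python
class TokenCache:
    """Stores one embedding per predicted token so earlier ones are reused."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._embeddings = []
        self.embed_calls = 0          # counts how often the embedding is computed

    def add(self, token):
        self.embed_calls += 1
        self._embeddings.append(self._embed_fn(token))

    def context(self):
        """Embeddings of all tokens predicted so far."""
        return list(self._embeddings)


def decode(num_steps, predict_next, embed_fn):
    """Autoregressive loop: predict a token from the cached context, then cache it."""
    cache = TokenCache(embed_fn)
    tokens = []
    for _ in range(num_steps):
        token = predict_next(cache.context())
        tokens.append(token)
        cache.add(token)
    return tokens, cache.embed_calls


tokens, calls = decode(
    4,
    predict_next=lambda context: len(context),   # toy predictor
    embed_fn=lambda token: [float(token)],       # toy embedding
)
```

With the cache, the number of embedding computations grows linearly with the number of predicted tokens, since no earlier token is ever embedded twice.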

This is not the only architecture running in the car. There are so many other architectures, modules, and networks we need to run in the car. To give a sense of scale, there are about 1 billion parameters of all the networks combined, producing around 1,000 neural network signals. We need to make sure we optimize them jointly such that we maximize the compute utilization and throughput and minimize the latency. We built a compiler just for neural networks that shares its structure with traditional compilers. As you can see, it takes the massive graph of neural nets with 150K nodes and 370K connections, partitions it into independent subgraphs, and compiles each of those subgraphs natively for the inference devices.
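
The talk does not say how the compiler decides which subgraphs are independent. As one simple interpretation, the sketch below treats two nodes as belonging to the same subgraph whenever a chain of connections links them, and finds those groups with a union-find structure. The function name and the toy graph are assumptions for illustration.

```python
def partition_independent_subgraphs(num_nodes, edges):
    """Group nodes into sets that share no connections with each other."""
    parent = list(range(num_nodes))

    def find(x):
        # Walk up to the representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two groups

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Six operations; 0-1-2 form one chain, 3-4 another, and 5 stands alone.
subgraphs = partition_independent_subgraphs(6, [(0, 1), (1, 2), (3, 4)])
```

Each resulting group could then be compiled separately, since nothing inside one group depends on anything inside another.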

We have a neural network linker, which shares its structure with a traditional linker, where we perform this link-time optimization. There we solve an offline optimization problem with compute, memory, and memory bandwidth constraints so that it comes up with an optimized schedule that gets executed in the car. On the runtime, we designed a hybrid scheduling system which basically does heterogeneous scheduling on one SoC and distributed scheduling across both the SoCs to run these networks in a model parallel fashion.

To get 100 TOPS of compute utilization, we need to optimize across all the layers of software, right from tuning the network architecture and the compiler, all the way to implementing a low-latency, high-bandwidth RDMA link across both the SoCs, and in fact, going even deeper to understanding and optimizing the cache-coherent and non-coherent data paths of the accelerator in the SoC. This is a lot of optimization at every level in order to make sure we get the highest frame rate, as every millisecond counts here. This is just the visualization of the neural networks that are running in the car. This is our digital brain, essentially. As you can see, these operations are nothing but just the matrix multiplication, convolution, to name a few, real operations running in the car.

To train this network with 1 billion parameters, you need a lot of labeled data. Yagen is gonna talk about how we achieve this with the auto labeling pipeline.

Yagen Zhang

Lead of Geometric Vision, Tesla

Thank you, Sri. Hi, everyone. I'm Yagen Zhang, and I'm leading Geometric Vision at Autopilot. Yeah, let's talk about auto labeling. We have several kinds of auto labeling frameworks to support various types of networks. Today, I'd like to focus on the awesome lanes net here. To successfully train and generalize this network to everywhere, we think we need tens of millions of trips from probably 1 million intersections or even more. How do we do that? It is certainly achievable to source a sufficient amount of trips because we already have, as Tim explained earlier, like 500,000 trips per day catch rate. However, converting all that data into a training form is a very challenging technical problem.

To solve this challenge, we've tried various ways of manual and auto labeling. From the first column to the second, from the second to the third, each advance provided us nearly 100x improvement in throughput. Still, we want an even better auto labeling machine that can provide us good quality, diversity, and scalability. To meet all these requirements, despite the huge amount of engineering effort required here, we've developed a new auto labeling machine powered by multi-trip reconstruction. This can replace 5 million hours of manual labeling with just 12 hours on cluster for labeling 10,000 trips. How did we solve it? There are three big steps. The first step is high-precision trajectory and structure recovery by multi-camera visual-inertial odometry.

Here, all the features, including ground surface, are inferred from videos by neural networks, then tracked and reconstructed in the vector space. The typical drift rate of this trajectory in car is like 1.3 centimeter per meter and 0.45 milliradian per meter, which is pretty decent considering its compact compute requirement. The recovered surface and road details are also used as a strong guidance for the later manual verification step. This is also enabled in every FSD vehicle, so we get preprocessed trajectories and structures along with the trip data. The second step is multi-trip reconstruction, which is the big and core piece of this machine. The video shows how the previously shown trip is reconstructed and aligned with other trips. Basically, other trips from different vehicles, not the same vehicle.

This is done by multiple interim steps like coarse alignment, pairwise matching, joint optimization, then further surface refinement. In the end, the human analyst comes in and finalizes the label. These heavy steps are already fully parallelized on the cluster, so the entire process usually takes just a couple of hours. The last step is actually auto labeling the new trips. Here we use the same multi-trip alignment engine, but only between the pre-built reconstruction and each new trip. It's much, much simpler than fully reconstructing all the clips all together. That's why it only takes 30 minutes per trip to auto label instead of several hours of manual labeling. This is also the key to the scalability of this machine. This machine easily scales as long as we have available compute and trip data.
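
The talk does not describe the alignment engine itself. As a textbook stand-in for the idea of aligning a new trip to an existing reconstruction, the sketch below computes the least-squares rigid transform (rotation plus translation) between matched 2D points. It assumes point correspondences are already known, which in practice would be the job of the matching step; `align_rigid_2d` is a hypothetical name.

```python
import math


def align_rigid_2d(source, target):
    """Least-squares rotation angle and translation mapping source onto target.

    source, target: equal-length lists of matched (x, y) points.
    Returns (theta, (tx, ty)) such that target ~= R(theta) * source + t.
    """
    n = len(source)
    sx = sum(p[0] for p in source) / n
    sy = sum(p[1] for p in source) / n
    tx = sum(q[0] for q in target) / n
    ty = sum(q[1] for q in target) / n

    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(source, target):
        ax, ay = px - sx, py - sy      # centered source point
        bx, by = qx - tx, qy - ty      # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    shift = (tx - (c * sx - s * sy), ty - (s * sx + c * sy))
    return theta, shift


# Synthetic check: rotate by 30 degrees and shift by (2, -1), then recover it.
angle = math.radians(30.0)
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
target = [
    (math.cos(angle) * x - math.sin(angle) * y + 2.0,
     math.sin(angle) * x + math.cos(angle) * y - 1.0)
    for x, y in source
]
theta, shift = align_rigid_2d(source, target)
```

Aligning one new trip against a fixed reconstruction is a single problem of this kind, whereas jointly reconstructing many trips couples all of them together, which is consistent with the speaker's point that the last step is much simpler.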

About 50 trips were newly auto labeled from this scene, and some of them are shown here. 50 trips from different vehicles. This is how we capture and transform the space-time slices of the world into the network supervision.
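
The throughput figures quoted in this section can be put side by side with a little arithmetic. The numbers are taken from the talk; note that manual hours are person-hours and cluster hours are wall-clock hours, so the ratio is indicative rather than a like-for-like comparison.

```python
manual_hours = 5_000_000      # manual labeling effort quoted for 10,000 trips
cluster_hours = 12            # cluster time quoted for the same 10,000 trips
trips = 10_000

manual_hours_per_trip = manual_hours / trips               # person-hours per trip
cluster_seconds_per_trip = cluster_hours * 3600 / trips    # amortized cluster time
hours_ratio = manual_hours / cluster_hours                 # person-hours per cluster hour

print(f"manual:  {manual_hours_per_trip:.0f} hours per trip")
print(f"cluster: {cluster_seconds_per_trip:.2f} seconds per trip (amortized)")
print(f"ratio:   about {hours_ratio:,.0f} to 1")
```

The amortized per-trip cluster time is far below the 30 minutes per trip quoted for auto labeling a new trip, which fits the statement that the heavy steps run in parallel across the cluster.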

Speaker 45

Yeah, one thing I'd like to note is that Yagen just talked about how we auto label our lanes, but we have auto labels for almost every task that we do, including our planner, and many of these are fully automatic. There are no humans involved. For example, for objects, all the kinematics, their shapes, their features, everything just comes from auto labeling, and the same is true for our occupancy too. We have really just built a machine around this.

Tim Zaman

Head of AI Infrastructure and Platform, Tesla

Yeah. If you can go back one slide. One more. It says, "Parallelized on cluster." That sounds pretty straightforward, but it really wasn't. Maybe it's fun to share how something like this comes about. A while ago, we didn't have any auto labeling at all, and then someone makes a script, it starts to work. It starts working better until you reach a volume that's pretty high, and we clearly need a solution. There were two other engineers in our team who were like, "You know, that's an interesting, you know, thing." What we needed to do was build a whole graph of essentially Python functions that would need to run one after the other.

First you pull the clip, then you do some cleaning, then you do some network inference, then another network inference until you finally get this. You need to do this at a large scale. I tell them, "We probably need to shoot for, you know, 100,000 clips per day or like 100,000 items. That seems good." The engineer says, "Well, we can do, you know, a bit of Postgres and a bit of elbow grease. We can do it." Meanwhile, we are a bit later, and we're doing 20 million of these functions every single day. Again, we pull in around 500,000 clips, and on those, we run a ton of functions, each of these in a streaming fashion.

That is kind of the back-end infra that's also needed to not just run training but also auto-labeling.
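
A graph of Python functions that run one after the other can be sketched with the standard library's `graphlib`. The step names follow the sequence described above (pull, clean, two inference stages); the functions themselves and the `run_pipeline` helper are hypothetical placeholders, not the actual infrastructure.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps):
    """Run each step after its dependencies, passing their results as inputs.

    steps: mapping of step name -> callable.
    deps:  mapping of step name -> list of step names it depends on.
    """
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        inputs = [results[d] for d in deps.get(name, [])]
        results[name] = steps[name](*inputs)
    return order, results


steps = {
    "pull": lambda: "clip",
    "clean": lambda clip: clip + "|cleaned",
    "infer_a": lambda clip: clip + "|net_a",
    "infer_b": lambda clip: clip + "|net_b",
}
deps = {
    "pull": [],
    "clean": ["pull"],
    "infer_a": ["clean"],
    "infer_b": ["infer_a"],
}

order, results = run_pipeline(steps, deps)
```

Taking the quoted figures at face value, 20 million function runs over roughly 500,000 clips per day works out to about 40 function runs per clip on average, assuming all of those runs belong to that day's clips.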

Ashok Elluswamy

Director of Autopilot Software, Tesla

Yeah. It really is like a factory that produces labels.

Tim Zaman

Head of AI Infrastructure and Platform, Tesla

Yeah.

Ashok Elluswamy

Director of Autopilot Software, Tesla

Like production lines, yield, quality, inventory, like all of the same concepts apply to this label factory that apply to, you know, the factory for our cars.

Tim Zaman

Head of AI Infrastructure and Platform, Tesla

That's right.

Paril Jain

Engineering Lead and Manager of AI, Tesla

Okay. Thanks, Tim, Ashok. Yeah. Concluding this section, I'd like to share a few more challenging and interesting examples, for the network for sure, and even for humans probably. From the top, there are examples like a lack-of-lights case, a foggy night, a roundabout, heavy occlusions by parked cars, and even a rainy night with raindrops on the camera lenses. These are challenging, but once their original scenes are fully reconstructed by other clips, all of them can be auto-labeled so that our cars can drive even better through these challenging scenarios. Now let me pass the mic to David to learn more about how Sim is creating the new world on top of these labels. Thank you.

Speaker 45

Thank you, Yagen. My name is David, and I'm gonna talk about simulation. Simulation plays a critical role in providing data that is difficult to source and/or hard to label. However, 3D scenes are notoriously slow to produce. Take for example, the simulated scene playing behind me, a complex intersection from Market Street in San Francisco. It would take two weeks for artists to complete, and for us, that is painfully slow. However, I'm gonna talk about using Yagen's automated ground truth labels along with some brand-new tooling that allows us to procedurally generate this scene and many like it in just five minutes. That's an amazing 1,000 times faster than before. Let's dive in to how a scene like this is created. We start by piping the automated ground truth labels into our simulated world creator tooling inside the software Houdini.

Starting with road boundary labels, we can generate a solid road mesh and retopologize it with the lane graph labels. This helps inform important road details like crossroad slope and detailed material blending. Next, we can use the line data and sweep geometry across its surface and project it to the road, creating lane paint decals. Next, using median edges, we can spawn island geometry and populate it with randomized foliage. This drastically changes the visibility of the scene. Now, the outside world can be generated through a series of randomized heuristics. Modular building generators create visual obstructions, while randomly placed objects like hydrants can change the color of the curbs, while trees can drop leaves below them, obscuring lines or edges. Next, we can bring in map data to inform positions of things like traffic lights or stop signs.

We can trace along its normal to collect important information like number of lanes and even get accurate street names on the signs themselves. Next, using lane graph, we can determine lane connectivity and spawn directional road markings on the road and their accompanying road signs. Finally, with lane graph itself, we can determine lane adjacency and other useful metrics to spawn randomized traffic permutations inside our simulator. Again, this is all automatic, no artist in the loop, and happens within minutes. Now this sets us up to do some pretty cool things. Since everything is based on data and heuristics, we can start to fuzz parameters to create visual variations of the single ground truth. It can be as subtle as object placement and random material swapping to more drastic changes like entirely new biomes or locations of environment like urban, suburban, or rural.

This allows us to create infinite targeted permutations for specific ground truths that we need more ground truth for. All this happens within a click of a button. We can even take this one step further by altering our ground truth itself. Say John wants his network to pay more attention to the directional road markings to better detect an upcoming captive left turn lane. We can start to procedurally alter our lane graph inside the simulator to help create entirely new flows through this intersection to help focus the network's attention to the road markings to create more accurate predictions. This is a great example of how this tooling allows us to create new data that could never be collected from the real world.
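
The parameter fuzzing described above can be sketched as follows: the ground truth stays fixed while visual parameters are resampled from a seeded random generator. The parameter names (`biome`, `foliage_density`, `hydrant_count`) and the `lane_graph_id` key are hypothetical; only the biome categories come from the talk.

```python
import random


def fuzz_scene(base_scene, seed):
    """Return a visual variation of a scene without touching its ground truth."""
    rng = random.Random(seed)            # seeded so each variation is reproducible
    scene = dict(base_scene)             # copy; the base scene is left unchanged
    scene["biome"] = rng.choice(["urban", "suburban", "rural"])
    scene["foliage_density"] = round(rng.uniform(0.0, 1.0), 3)
    scene["hydrant_count"] = rng.randint(0, 5)
    return scene


base = {"lane_graph_id": "intersection-001", "biome": "urban"}
variations = [fuzz_scene(base, seed) for seed in range(5)]
```

Every variation keeps the same `lane_graph_id`, so the labels remain valid while the appearance changes; altering the lane graph itself, as in the captive turn lane example, would be a separate step.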

The true power of this tool is in its architecture and how we can run all tasks in parallel to infinitely scale. You saw the tile creator tool in action, converting the ground truth labels into their counterparts. Next, we can use our tile extractor tool to divide this data into geohash tiles about 150 meters square in size. We then save out that data into separate geometry and instance files. This gives us a clean source of data that's easy to load and allows us to be rendering engine agnostic for the future. Using a tile loader tool, we can summon any number of those cached tiles using a geohash ID. Currently, we're doing about 5-by-5 or 3-by-3 tile sets, usually centered around fleet hotspots or interesting lane graph locations.
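
The tile extraction and loading steps can be illustrated with a simplified grid. The sketch below uses plain integer grid indices in a local metric frame as tile IDs, which is an assumption made for brevity and is not the geohash encoding mentioned in the talk; only the roughly 150 meter tile size and the 3-by-3 and 5-by-5 set sizes come from the source.

```python
import math

TILE_SIZE_M = 150.0   # approximate tile edge length quoted in the talk


def tile_id(x_m, y_m, tile_size=TILE_SIZE_M):
    """Grid index of the tile containing a point given in local meters."""
    return (math.floor(x_m / tile_size), math.floor(y_m / tile_size))


def tiles_around(center, width):
    """IDs of a width-by-width block of tiles centered on `center`."""
    if width % 2 != 1:
        raise ValueError("width must be odd so the block has a center tile")
    half = width // 2
    cx, cy = center
    return [
        (cx + dx, cy + dy)
        for dy in range(-half, half + 1)
        for dx in range(-half, half + 1)
    ]


hotspot = tile_id(400.0, 160.0)
tile_set = tiles_around(hotspot, 5)
coverage_m = 5 * TILE_SIZE_M          # edge length of the loaded area
```

Under these assumptions a 5-by-5 set covers an area about 750 meters on a side around the hotspot, and each tile can be generated or reloaded independently of the others.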

The tile loader also converts these tile sets into UAssets for consumption by the Unreal Engine and gives you a finished product from what you saw on the first slide. This really sets us up for size and scale. As you can see on the map behind us, we can easily generate most of San Francisco city streets. This didn't take years or even months of work, but rather two weeks by one person. We can continue to manage and grow all this data using our PDG network inside of the tooling. This allows us to throw compute at it and regenerate all these tile sets overnight. This ensures all environments are of consistent quality and features, which is super important for training since new ontologies and signals are constantly released.

Now to come full circle: we generated all these tile sets from ground truth data that contain all the weird intricacies from the real world, and we can combine that with the procedural, visual, and traffic variety to create limitless targeted data for the network to learn from. That concludes the sim section. I'll pass it to Kate to talk about how we can use all this data to improve Autopilot. Thank you.

Kate Park

Manager of Product and Technical Program, Tesla

Thanks, David. Hi, everyone. My name is Kate Park, and I'm here to talk about the data engine, which is the process by which we improve our neural networks via data. We're gonna show you how we deterministically solve interventions via data and walk you through the life of this particular clip. In this scenario, Autopilot is approaching a turn and incorrectly predicts that crossing vehicle is stopped for traffic, and thus a vehicle that we would slow down for. In reality, there's nobody in the car, it's just awkwardly parked. We built this tooling to identify the mispredictions, correct the label, and categorize this clip into an evaluation set. This particular clip happens to be one of 126 that we've diagnosed as challenging parked cars at turns.

Because of this infra, we can curate this evaluation set without any engineering resources custom to this particular challenge case. To actually solve that challenge case requires mining thousands of examples like it, and it's something Tesla can trivially do. We simply use our data sourcing infra, request data, and use the tooling shown previously to correct the labels. By surgically targeting the mispredictions of the current model, we're only adding the most valuable examples to our training set. We surgically fixed 13,900 clips, and because those were examples where the current model struggles, we don't even need to change the model architecture. A simple weight update with this new valuable data is enough to solve the challenge case. You see, we no longer predict that crossing vehicle as stopped, as shown in orange, but parked, as shown in red.
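
One pass of the loop described above (evaluate, mine the mispredictions, retrain on them, evaluate again) can be sketched with toy stand-ins. The clip features, both model functions, and the helper names are hypothetical; in particular, `updated_model` is a hand-written rule standing in for a network after a weight update, not a trained model.

```python
def accuracy(model, clips):
    """Fraction of clips where the model's prediction matches the label."""
    return sum(model(c["features"]) == c["label"] for c in clips) / len(clips)


def mine_mispredictions(model, clips):
    """Keep only the clips the current model gets wrong."""
    return [c for c in clips if model(c["features"]) != c["label"]]


def current_model(features):
    # Toy model: treats every stationary vehicle as stopped for traffic.
    return "stopped" if features["speed"] == 0 else "moving"


def updated_model(features):
    # Stand-in for the model after retraining on the mined examples.
    if features["speed"] != 0:
        return "moving"
    return "stopped" if features["in_traffic_queue"] else "parked"


eval_set = [
    {"features": {"speed": 0, "in_traffic_queue": False}, "label": "parked"},
    {"features": {"speed": 0, "in_traffic_queue": True}, "label": "stopped"},
    {"features": {"speed": 5, "in_traffic_queue": False}, "label": "moving"},
    {"features": {"speed": 0, "in_traffic_queue": False}, "label": "parked"},
]

hard_examples = mine_mispredictions(current_model, eval_set)
before = accuracy(current_model, eval_set)
after = accuracy(updated_model, eval_set)
```

The mined set contains only the awkwardly parked vehicles, which mirrors the point that targeting the current model's mispredictions adds the most valuable examples to the training set.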

In academia, we often see that people keep data constant, but at Tesla, it's very much the opposite. We see time and time again that data is one of the best, if not the most deterministic lever to solving these interventions. We just showed you the data engine loop for one challenge case, namely these parked cars at turns, but there are many challenge cases even for one signal of vehicle movement. We apply this data engine loop to every single challenge case we've diagnosed, whether it's buses, curvy roads, stopped vehicles, parking lots. We don't just add data once, we do this again and again to perfect the semantic. In fact, this year, we updated our vehicle movement signal 5 times, and with every weight update trained on the new data, we push our vehicle movement accuracy up and up.

This data engine framework applies to all our signals, whether they're 3D multicam video, whether the data is human-labeled, auto-labeled, or simulated, whether it's an offline model or an online model. Tesla is able to do this at scale because of the fleet advantage, the infra that our eng team has built, and the labeling resources that feed our networks. To train on all this data, we need a massive amount of compute, so I'll hand it off to Pete and Ganesh to talk about the Dojo supercomputing platform. Thank you.

Pete Bannon

VP of Custom Silicon and Low Voltage Electronics, Tesla

Thank you, Katie. Thanks, everybody. Thanks for hanging in there. We're almost there. My name is Pete Bannon. I run the custom silicon and low voltage teams at Tesla.

Ganesh Venkataramanan

Senior Director of Autopilot Hardware, Tesla

My name is Ganesh Venkataramanan. I run the Dojo program.

Pete Bannon

VP of Custom Silicon and Low Voltage Electronics, Tesla

Thank you. I'm frequently asked, "Why is a car company building a supercomputer for training?" This question fundamentally misunderstands the nature of Tesla. At its heart, Tesla is a hardcore technology company. All across the company, people are working hard in science and engineering to advance the fundamental understanding and methods that we have available to build cars, energy solutions, robots, and anything else that we can do to improve the human condition around the world. It's a super exciting thing to be a part of, and it's a privilege to run a very small piece of it in the semiconductor group.

Tonight, we're gonna talk a little bit about Dojo and give you an update on what we've been able to do over the last year. Before we do that, I wanted to give a little bit of background on the initial design that we started a few years ago. When we got started, the goal was to provide a substantial improvement to the training latency for our Autopilot team. Some of the largest neural networks they train today run for over a month, which inhibits their ability to rapidly explore alternatives and evaluate them. You know, a 30x speed up would be really nice if we could provide it at a cost-competitive and energy-competitive way. To do that, we wanted to build a chip with a lot of arithmetic units that we could utilize at a very high efficiency.

We spent a lot of time studying whether we could do that using DRAM, various packaging ideas, all of which failed. In the end, even though it felt like an unnatural act, we decided to reject DRAM as the primary storage medium for this system and instead focus on SRAM embedded in the chip. SRAM provides, unfortunately, a modest amount of capacity, but extremely high bandwidth and very low latency, and that enables us to achieve high utilization with the arithmetic units. Those choices, that particular choice led to a whole bunch of other choices. For example, if you wanna have virtual memory, you need page tables. They take up a lot of space. We didn't have space, so no virtual memory. We also don't have interrupts.

The accelerator is a bare bones, raw piece of hardware that's presented to a compiler, and the compiler is responsible for scheduling everything that happens in a deterministic way, so there's no need or even desire for interrupts in the system. We also chose to pursue model parallelism as a training methodology, which is not the typical situation. Most machines today use data parallelism, which consumes additional memory capacity, which we obviously don't have. All of those choices led us to build a machine that is pretty radically different from what's available today. We also had a whole bunch of other goals. One of the most important ones was no limits. We wanted to build a compute fabric that would scale in an unbounded way for the most part.

I mean, obviously there's physical limits now and then, but, you know, pretty much if your model was too big for the computer, you just had to go buy a bigger computer. That's what we were looking for. Today, the way packaged machines are packaged, there's a pretty fixed ratio of, for example, GPUs, CPUs and DRAM capacity and network capacity, and we really wanted to disaggregate all that so that as models evolved, we could vary the ratios of those various elements and make the system more flexible to meet the needs of the Autopilot team.

Ganesh Venkataramanan

Senior Director of Autopilot Hardware, Tesla

Yeah. It's so true, Pete, like, no limits philosophy was our guiding star all the way. All of our choices were centered around that, to the point that we didn't want traditional data center infrastructure to limit our capacity to execute these programs at speed. That's why we integrated vertically our data center, the entire data center. Sorry about that. By doing a vertical integration of the data center, we could extract new levels of efficiency. We could optimize power delivery, cooling, as well as system management across the whole data center stack, rather than doing box by box and integrating those boxes into data centers. To do this, we also wanted to integrate early to figure out limits of scale for our software workloads.

We integrated Dojo environment into our Autopilot software very early, and we learned a lot of lessons. Today, Bill Chang will go over our hardware update as well as some of the challenges that we faced along the way. Rajiv Kurian will give you a glimpse of our compiler technology as well as go over some of our cool results.

Pete Bannon

VP of Custom Silicon and Low Voltage Electronics, Tesla

Great. Here we go.

Speaker 45

Thanks, Pete. Thanks, Ganesh. I'll start tonight with a high-level vision of our system that'll help set the stage for the challenges and the problems we're solving, and then also how software will then leverage this for performance. Our vision for Dojo is to build a single unified accelerator, a very large one. Software would see a seamless compute plane with globally addressable, very fast memory, and all connected together with uniform, high bandwidth, and low latency. To realize this, we need to use density to achieve performance. We leverage technology to get this density in order to break levels of hierarchy all the way from the chip to the scale-out systems. Silicon technology has used this, has done this for decades.

Chips have followed Moore's Law for density and integration to get performance scaling. Now, a key step in realizing that vision was our Training Tile. Not only can we integrate 25 dies at extremely high bandwidth, but we can scale that to any number of additional tiles by just connecting them together. Now, last year, we showcased our first functional Training Tile, and at that time, we already had workloads running on it. Since then, the team here has been working hard and diligently to deploy this at scale. Now, we've made amazing progress and had a lot of milestones along the way, and of course, we've had a lot of unexpected challenges. This is where our fail fast philosophy has allowed us to push our boundaries. Now, pushing density for performance presents all new challenges. One area is power delivery.

Here, we need to deliver the power to our compute die, and this directly impacts our top-line compute performance. We need to do this at unprecedented density. We need to be able to match our die pitch with a power density of almost one amp per millimeter squared. Because of the extreme integration, this needs to be a multi-tiered vertical power solution. Because there's a complex heterogeneous material stackup, we have to carefully manage the material transition, especially CTE. Now, why does the coefficient of thermal expansion matter in this case? CTE is a fundamental material property, and if it's not carefully managed, that stackup would literally rip itself apart. We started this effort by working with vendors to develop this power solution. We realized that we actually had to develop this in-house.

To balance schedule and risk, we built quick iterations to support both our system bring-up and software development, and also to find the optimal design and stackup that would meet our final production goals. In the end, we were able to reduce CTE by over 50% and improve our performance by 3x over our initial version. Needless to say, finding this optimal material stackup while maximizing performance at density is extremely difficult.
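To make the CTE concern concrete, here is a rough sketch of the standard linear expansion relation (change in length = CTE x length x temperature change) applied to two bonded layers. The material values (silicon near 2.6 ppm/K, copper near 17 ppm/K), the 25 mm span, and the 60 K temperature swing are illustrative assumptions, not figures from the talk.

```python
def expansion_um(cte_ppm_per_k: float, length_mm: float, delta_t_k: float) -> float:
    """Free linear expansion in micrometres: dL = alpha * L * dT."""
    length_um = length_mm * 1e3
    return cte_ppm_per_k * 1e-6 * length_um * delta_t_k


def mismatch_um(cte_a: float, cte_b: float, length_mm: float, delta_t_k: float) -> float:
    """Difference in free expansion between two bonded layers of equal length."""
    return abs(expansion_um(cte_a, length_mm, delta_t_k)
               - expansion_um(cte_b, length_mm, delta_t_k))


if __name__ == "__main__":
    # Assumed values for illustration only (not from the talk).
    silicon, copper = 2.6, 17.0   # ppm/K
    span_mm, swing_k = 25.0, 60.0
    print(f"mismatch: {mismatch_um(silicon, copper, span_mm, swing_k):.1f} um")
```

The mismatch grows linearly with both the CTE gap and the temperature swing, so a stackup that narrows the gap between adjacent layers reduces the stress at each material transition.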

We did have unexpected challenges along the way. Here's an example where we pushed the boundaries of integration that led to component failures. This started when we scaled up to larger and longer workloads, and then intermittently, a single site on a tile would fail. They started out as recoverable failures, but as we pushed to much higher and higher power, these would become permanent failures.

Now, to understand this failure, you have to understand why and how we build our power modules. Solving density at every level is the cornerstone of actually achieving our system performance. Now, because our XY plane is used for high bandwidth communication, everything else must be stacked vertically. This means all other components other than our die must be integrated into our power modules. Now, that includes our clock and our power supplies, and also our system controllers. Now, in this case, the failures were due to losing clock output from our oscillators. After an extensive debug, we found that the root cause was due to vibrations on the module from piezoelectric effects on nearby capacitors.

Now, singing caps are not a new phenomenon, and in fact are very common in power design, but normally, clock chips are placed in a very quiet area of the board and are often not affected by power circuits. Because we needed to achieve this level of integration, these oscillators need to be placed in very close proximity. Now, due to our switching frequency and the vibration resonance it created, it caused out-of-plane vibration on our MEMS oscillator that caused it to crack. Now, the solution to this problem is a multipronged approach. We can reduce the vibration by using soft terminal caps. We can update our MEMS part with a lower Q factor for the out-of-plane direction. We can also update our switching frequency to push the resonance further away from these sensitive bands.

In addition to the density at the system level, we've been making a lot of progress at the infrastructure level. We knew that we had to reexamine every aspect of the data center infrastructure in order to support our unprecedented power and cooling density. We brought in a fully custom-designed CDU to support Dojo's dense cooling requirements, and the amazing part is we're able to do this at a fraction of the cost versus buying off the shelf and modifying it. Since our Dojo cabinet integrates enough power and cooling to match an entire row of standard IT racks, we need to carefully design our cabinet and infrastructure together. We've already gone through several iterations of this cabinet to optimize this.

Earlier this year, we started load testing our power and cooling infrastructure, and we were able to push it over 2 MW before we tripped our substation and got a call from the city. Yeah. Now, last year, we introduced only a couple of components of our system, the custom D1 die and the training tile, but we teased the ExaPOD as our end goal. We'll walk through the remaining parts of our system that are required to build out this ExaPOD. Now, the system tray is a key part of realizing our vision of a single accelerator. It enables us to seamlessly connect tiles together, not only within the cabinet but between cabinets. We can connect these tiles at very tight spacing across the entire accelerator, and this is how we achieve our uniform communication.

This is a laminated busbar that allows us to integrate very high power, mechanical and thermal support in an extremely dense integration. It's 75 mm in height and supports six tiles at 135 kg. This is the equivalent of 3-4 fully loaded high-performance racks. Next, we need to feed data to the training tiles. This is where we've developed the Dojo Interface Processor. It provides our system with high bandwidth DRAM to stage our training data, and it provides full memory bandwidth to our training tiles using TTP, our custom protocol that we use to communicate across our entire accelerator. It also has high-speed Ethernet that helps us extend this custom protocol over standard Ethernet. We provide native hardware support for this with little to no software overhead. Lastly, we can connect to it through a standard Gen 4 PCIe interface.

Now, we pair 20 of these cards per tray, and that gives us 640 GB of high-bandwidth DRAM. This provides our disaggregated memory layer for our training tiles. These cards are a high-bandwidth ingest path, both through PCIe and Ethernet. They also provide a high-rate XZ connectivity path that allows shortcuts across our large Dojo accelerator. Now, we actually integrate the host directly underneath our system tray. These hosts provide our ingest processing and connect to our interface processors through PCIe. These hosts can provide hardware video decoder support for video-based training. Our user applications land on these hosts so we can provide them with a standard x86 Linux environment.

Now, we can put two of these assemblies into one cabinet and pair it with redundant power supplies that do direct conversion of three-phase 480-volt AC power to 52-volt DC power. Now, by focusing on density at every level, we can realize the vision of a single accelerator. Now, starting with the uniform nodes on our custom D1 die, we can connect them together in our fully integrated Training Tile, and then finally seamlessly connect them across cabinet boundaries to form our Dojo accelerator. Altogether, we can house two full accelerators in our ExaPOD for a combined 1 exaflop of ML compute. Now, altogether, this amount of technology and integration has only ever been done a couple of times in the history of compute. Next, we'll see how software can leverage this to accelerate their performance.
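As a back-of-the-envelope check on the hierarchy described so far, the sketch below multiplies out the figures quoted in the talk: 25 dies per tile, six tiles per tray, two trays per cabinet, 20 interface cards and 640 GB per tray, and the 54 petaflops per tray mentioned later in the presentation. The tray and cabinet counts for one exaflop are derived here from rounded inputs and are an estimate, not numbers stated by the speakers.

```python
import math

DIES_PER_TILE = 25        # Training Tile integrates 25 dies
TILES_PER_TRAY = 6        # system tray supports six tiles
TRAYS_PER_CABINET = 2     # two tray assemblies per cabinet
CARDS_PER_TRAY = 20       # Dojo Interface Processor cards per tray
DRAM_PER_TRAY_GB = 640    # high-bandwidth DRAM per tray
PFLOPS_PER_TRAY = 54      # quoted later in the talk for a six-tile tray
EXAPOD_PFLOPS = 1000      # "1 exaflop", treated as a round number


def summarize() -> dict:
    """Derive per-unit figures from the quoted totals."""
    trays_needed = math.ceil(EXAPOD_PFLOPS / PFLOPS_PER_TRAY)
    return {
        "dram_per_card_gb": DRAM_PER_TRAY_GB / CARDS_PER_TRAY,
        "pflops_per_tile": PFLOPS_PER_TRAY / TILES_PER_TRAY,
        "dies_per_cabinet": DIES_PER_TILE * TILES_PER_TRAY * TRAYS_PER_CABINET,
        "trays_for_one_exaflop": trays_needed,
        "cabinets_for_one_exaflop": math.ceil(trays_needed / TRAYS_PER_CABINET),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

With these inputs each interface card carries 32 GB, each tile contributes 9 petaflops, and a cabinet holds 300 dies.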

Rajiv Kurian

Thanks, Bill. My name is Rajiv, and I'm gonna talk some numbers. Our software stack begins with the PyTorch extension that speaks to our commitment to run standard PyTorch models out of the box. We're gonna talk more about our JIT compiler and the ingest pipeline that feeds the hardware with data. Abstractly, performance is TOPS times utilization times accelerator occupancy. We've seen how the hardware provides peak performance. It's the job of the compiler to extract utilization from the hardware while code is running on it. It's the job of the ingest pipeline to make sure that data can be fed at a throughput high enough for the hardware to not ever starve. Let's talk about why communication-bound models are difficult to scale. Before that, let's look at why ResNet-50-like models are easier to scale.

You start off with a single accelerator, run the forward and backward passes, followed by the optimizer. To scale this up, you run multiple copies of this on multiple accelerators. While the gradients produced by the backward pass do need to be reduced, and this introduces some communication, this can be done pipelined with the backward pass. This setup scales fairly well, almost linearly. For models with much larger activations, we run into a problem as soon as we wanna run the forward pass. The batch size that fits in a single accelerator is often smaller than the batch norm surface. To get around this, researchers typically run the setup on multiple accelerators in sync batch norm mode. This introduces latency-bound communication to the critical path of the forward pass, and we already have a communication bottleneck.
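The data-parallel setup described above can be sketched in a few lines. This is a toy illustration, not Tesla's implementation: a one-parameter linear model, equal-sized shards per replica, and an `all_reduce_mean` helper that stands in for a real gradient all-reduce. All function names are hypothetical.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: list) -> list:
    """Stand-in for a gradient all-reduce: every replica receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_grad(w: float, xs: list, ys: list, replicas: int) -> list:
    """Split the batch into equal shards, compute local gradients, then reduce.

    Assumes len(xs) is divisible by `replicas` so the shards are equal-sized.
    """
    shard = len(xs) // replicas
    local = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(replicas)
    ]
    return all_reduce_mean(local)


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 9)]
    ys = [3.0 * x for x in xs]
    print(grad_mse(0.5, xs, ys), data_parallel_grad(0.5, xs, ys, replicas=4))
```

With equal shards, the reduced gradient matches the full-batch gradient, which is why this setup scales well: the only communication is one reduction that can overlap with the backward pass. Batch norm breaks this pattern because its statistics are needed in the middle of the forward pass.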

While there are ways to get around this, they usually involve tedious manual work best suited for a compiler. Ultimately, there's no skirting around the fact that if your state does not fit in a single accelerator, you can be communication-bound. Even with significant efforts from our ML engineers, we see such models don't scale linearly. The Dojo system was built to make such models work at high utilization. The high-density integration was built to not only accelerate the compute-bound portions of a model, but also the latency-bound portions like a batch norm or the bandwidth-bound portions like a gradient all-reduce or a parameter all-gather. A slice of the Dojo mesh can be carved out to run any model.

The only thing users need to do is to make the slice large enough to fit a batch norm surface for their particular model. After that, the partition presents itself as one large accelerator, freeing the users from having to worry about the internal details of execution. It's the job of the compiler to maintain this abstraction. Fine-grained synchronization primitives and uniform low latency make it easy to accelerate all forms of parallelism across integration boundaries. Tensors are usually stored sharded in SRAM and replicated just in time for a layer's execution. We depend on the high Dojo bandwidth to hide this replication time. Tensor replication and other data transfers are overlapped with compute, and the compiler can also recompute layers when it's profitable to do so. We expect most models to work out of the box.

As an example, we took the recently released Stable Diffusion model and got it running on Dojo in minutes. Out of the box, the compiler was able to map it in a model parallel manner on 25 Dojo dies. Here are some pictures of a Cybertruck on Mars generated by Stable Diffusion running on Dojo. Looks like it still has some ways to go before matching the Tesla design studio team. We've talked about how communication bottlenecks can hamper scalability. Perhaps an acid test of a compiler and the underlying hardware is executing a cross-die batch norm layer. As mentioned before, this can be a serial bottleneck. The communication phase of a batch norm begins with nodes computing their local mean and standard deviations, then coordinating to reduce these values, then broadcasting these values back, and then they resume their work in parallel.

What would an ideal batch norm look like on 25 Dojo dies? Let's say the previous layer's activations are already split across dies. We would expect the 350 nodes on each die to coordinate and produce die-local mean and standard deviation values. Ideally, these would get further reduced with the final value ending somewhere towards the middle of the tile. We would then hope to see a broadcast of this value radiating from the center. Let's see how the compiler actually executes a real batch norm operation across 25 dies. The communication trees were extracted from the compiler, and the timing is from a real hardware run. We're about to see 8,750 nodes on 25 dies coordinating to reduce and then broadcast the batch norm mean and standard deviation values.
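The reduce-then-broadcast pattern described here can be sketched as a two-level reduction. This is a simplified illustration, not the Dojo compiler's communication tree: it assumes each node contributes a count, a sum, and a sum of squares (a common way to make the statistics combinable), and the function names are hypothetical.

```python
import math


def local_stats(values: list) -> tuple:
    """Per-node partial statistics: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def reduce_stats(stats: list) -> tuple:
    """Combine partial statistics by element-wise addition."""
    return (
        sum(s[0] for s in stats),
        sum(s[1] for s in stats),
        sum(s[2] for s in stats),
    )


def finalize(stat: tuple) -> tuple:
    """Turn combined statistics into (mean, standard deviation)."""
    n, total, total_sq = stat
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # guard tiny negative rounding
    return mean, math.sqrt(variance)


def cross_die_batch_norm_stats(dies: list) -> list:
    """dies: list of dies, each a list of per-node activation lists.

    Step 1: nodes on each die reduce to a die-local partial.
    Step 2: die-local partials reduce to one global value.
    Step 3: the global (mean, std) is broadcast back to every node.
    """
    per_die = [reduce_stats([local_stats(node) for node in die]) for die in dies]
    mean_std = finalize(reduce_stats(per_die))
    return [[mean_std for _ in die] for die in dies]


if __name__ == "__main__":
    dies = [[[1.0, 2.0], [3.0]], [[4.0], [5.0, 6.0]]]
    print(cross_die_batch_norm_stats(dies)[0][0])
```

Every node ends up with the same global statistics, which is what makes the reduction a serial step on the critical path: no node can normalize its activations until the broadcast arrives.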

The local reduction followed by global reduction towards the middle of the tile. The reduced value broadcast radiating from the middle, accelerated by the hardware's broadcast facility. This operation takes only five microseconds on 25 Dojo dies. The same operation takes 150 microseconds on 24 GPUs. This is an order of magnitude improvement over GPUs. While we talked about an all-reduce operation in the context of a batch norm, it's important to reiterate that the same advantages apply to all other communication primitives, and these primitives are essential for large-scale training. How about full model performance? While we think that ResNet-50 is not a good representation of real-world Tesla workloads, it is a standard benchmark, so let's start there. We are already able to match the A100 die for die.

However, perhaps a hint of Dojo's capabilities is that we're able to hit this number with just a batch of eight per die. Dojo was really built to tackle larger, complex models. When we set out to tackle real-world workloads, we looked at the usage patterns of our current GPU cluster, and two models stood out. The auto labeling networks, a class of offline models that are used to generate ground truth, and the occupancy networks that you heard about. The auto labeling networks are large models that have high arithmetic intensity, while the occupancy networks can be ingest-bound. We chose these models because together they account for a large chunk of our current GPU cluster usage, and they would challenge the system in different ways. How did we do on these two networks?

The results we're about to see were measured on multi-die systems for both the GPU and Dojo, but normalized to per die numbers. On our auto labeling network, we're already able to surpass the performance of an A100 with our current hardware running on our older generation VRMs. On our production hardware with our newer VRMs, that translates to doubling the throughput of an A100. Our model showed that with some key compiler optimizations, we could get to more than 3x the performance of an A100. We see even bigger leaps on the occupancy network. Almost 3x with our production hardware with room for more. What does that mean for Tesla? With the current level of compiler performance, we could replace the ML compute of one, two, three, four, five, and six GPU boxes with just a single Dojo tile.

This Dojo tile costs less than one of these GPU boxes.

Moderator

Whoa.

Speaker 45

What it really means is that networks that took more than a month to train now take less than a week. Alas, when we measured things, it did not turn out so well. At the PyTorch level, we did not see our expected performance out of the gate, and this timeline chart shows our problem. The teeny-tiny little green bars, that's the compiled code running on the accelerator. The row is mostly white space where the hardware is just waiting for data. With our dense ML compute, Dojo hosts effectively have 10x more ML compute than the GPU hosts. The data loaders running on this one host simply couldn't keep up with all that ML hardware. To solve our data loader scalability issues, we knew we had to get over the limit of this single host. The Tesla Transport Protocol moves data seamlessly across hosts, tiles, and ingest processors.

We extended the Tesla Transport Protocol to work over Ethernet. We then built the Dojo Network Interface Card, the DNIC, to leverage TTP over Ethernet. This allows any host with a DNIC card to be able to DMA to and from other TTP endpoints. We started with the Dojo mesh, then we added a tier of data loading hosts equipped with the DNIC card. We connected these hosts to the mesh via an Ethernet switch. Now, every host in this data loading tier is capable of reaching all TTP endpoints in the Dojo mesh via hardware-accelerated DMA. After these optimizations went in, our occupancy went from 4% to 97%. The data loading sections have reduced drastically, and the ML hardware is kept busy. We actually expect this number to go to 100% pretty soon.
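To see what the occupancy change means for delivered performance, the sketch below applies the relation given earlier in the talk (performance is TOPS times utilization times accelerator occupancy). The peak TOPS and utilization values are placeholders chosen for illustration. Only the 4% and 97% occupancy figures come from the talk.

```python
def effective_tops(peak_tops: float, utilization: float, occupancy: float) -> float:
    """Delivered performance = peak TOPS x utilization x accelerator occupancy."""
    return peak_tops * utilization * occupancy


def occupancy_speedup(before: float, after: float) -> float:
    """Speedup from occupancy alone, with peak and utilization held fixed."""
    return after / before


if __name__ == "__main__":
    peak, util = 1000.0, 0.5  # hypothetical placeholders, not Dojo figures
    before = effective_tops(peak, util, 0.04)
    after = effective_tops(peak, util, 0.97)
    print(f"before: {before:.1f}, after: {after:.1f}")
    print(f"speedup from occupancy: {occupancy_speedup(0.04, 0.97):.2f}x")
```

Because the three factors multiply, fixing the data loading path alone yields roughly a 24x gain in delivered compute, independent of whatever the peak and utilization figures are.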

After these changes went in, we saw the full expected speed up from the PyTorch layer, and we were back in business. We started with hardware design that breaks through traditional integration boundaries in service of our vision of a single giant accelerator. We've seen how the compiler and ingest layers build on top of that hardware. After proving our performance on these complex real-world networks, we knew what our first large-scale deployment would target, our high arithmetic-intensity auto labeling networks. Today, that occupies 4,000 GPUs over 72 GPU racks. With our dense compute and our high performance, we expect to provide the same throughput with just four Dojo cabinets. These four Dojo cabinets will be part of our first ExaPOD that we plan to build by quarter one of 2023. This will more than double Tesla's auto labeling capacity.

The first ExaPOD is part of a total of seven ExaPODs that we plan to build in Palo Alto right here across the wall. We have a display cabinet from one of these ExaPODs for everyone to look at. Six tiles densely packed on a tray, 54 petaflops of compute, 640 GB of high-bandwidth memory with power and host interface. It's a lot of compute. We're building out new versions of all our cluster components and constantly improving our software to hit new limits of scale. We believe that we can get another 10x improvement with our next-generation hardware. To realize our ambitious goals, we need the best software and hardware engineers. Please come talk to us or visit tesla.com/ai. Thank you.

Elon Musk

CEO, Tesla

All right. Let me know when.

Moderator

All right. Okay.

Elon Musk

CEO, Tesla

All right, hopefully that was enough detail. Now we can move to questions. Guys, like I think all the team come out on stage and we really wanted to show the depth and breadth of Tesla in artificial intelligence, compute hardware, robotics actuators and try to really shift the perception of the company away from you know, a lot of people think we're like just a car company or we make cool cars, whatever. Most people have no idea that Tesla is arguably the leader in real-world AI hardware and software and that we're building what is arguably the first, the most radical computer architecture since the Cray-1 supercomputer.

I think if you're interested in developing some of the most advanced technology in the world that's gonna really affect the world in a positive way, Tesla's the place to be. Yeah, let's fire away with some questions. I think there's a mic at the front and a mic at the back.

Moderator

Yes.

Elon Musk

CEO, Tesla

Uh.

Moderator

Chris, on this side.

Elon Musk

CEO, Tesla

Just throw mics at people.

Moderator

Yeah.

Elon Musk

CEO, Tesla

Jump ball for the mic.

Speaker 36

Yeah. Hi. Thank you very much. I was impressed very much by Optimus, but I wonder why tendon-driven, the hand? Why did you choose a tendon-driven approach for the hand? Because tendons are not very durable. Why spring-loaded?

Elon Musk

CEO, Tesla

Yeah.

Speaker 45

Hello. Is this working? Cool. Awesome. Yes, that's a great question. You know, when it comes to any type of actuation scheme, there's trade-offs between, you know, whether or not it's a tendon-driven system or some type of linkage-based system.

Elon Musk

CEO, Tesla

Just keep the mic close to your mouth.

Speaker 45

A little bit closer?

Elon Musk

CEO, Tesla

Yeah.

Speaker 45

Can you hear me? Cool.

Elon Musk

CEO, Tesla

Yeah.

Speaker 45

The main reason why we went for a tendon-based system is that, you know, first we actually investigated some synthetic tendons, but we found that metallic Bowden cables are, you know, a lot stronger. One of the advantages of these cables is that it's very good for part reduction. We do wanna make a lot of these hands, so having a bunch of parts, a bunch of small linkages ends up being, you know, a problem when you're making a lot of something. One of the big reasons that, you know, tendons are better than linkages in a sense is that you can be anti-backlash. Anti-backlash essentially, you know, allows you to not have any gaps or, you know, stuttering motion in your fingers.

Speaker 36

Spring-loaded mainly what spring-loaded allows us to do is allows us to have active opening. Instead of having to have two actuators to drive the fingers closed and then open, we have the ability to, you know, have the tendon drive them closed and then the springs passively extend. This is something that's seen in our hands as well, right? We have the ability to actively flex, and then we also have the ability to extend.

Elon Musk

CEO, Tesla

I mean, our goal with Optimus is to have a robot that is maximally useful as quickly as possible. There's a lot of ways to solve the various problems of a humanoid robot. We're probably not barking up the right tree on all the technical solutions. I should say that we're open to evolving the technical solutions that you see here over time. They're not locked in stone. We have to pick something, and we wanna pick something that's gonna allow us to produce the robot as quickly as possible and have it, like I said, be useful as quickly as possible.

We're trying to follow the goal of fastest path to a useful robot that can be made at volume, and we're gonna test the robot internally at Tesla, in our factory and just see, like, how useful is it. Because you're gonna close the loop on reality to confirm that the robot is in fact useful. We're just gonna use it to build things. We're confident we can do that with the hand that we have currently designed, but for sure there'll be hand version two, version three, and we may change the architecture quite significantly over time.

Speaker 37

Sorry. Hi. The Optimus robot is really impressive. You did a great job. Bipedal robots are really difficult. What I noticed might be missing from your plan is to acknowledge the utility of the human spirit, and I'm wondering if Optimus will ever get a personality and be able to laugh at our jokes while it folds our clothes?

Elon Musk

CEO, Tesla

Yeah, absolutely. I think we wanna have really fun versions of Optimus, and so that Optimus can both be utilitarian and do tasks, but can also be kind of like a friend, and a buddy and hang out with you. I'm sure people will think of all sorts of creative uses for this robot. You know, once you have the core intelligence and actuators figured out, then you can actually, you know, put all sorts of costumes, I guess, on the robot. I mean, you can make the robot look. You can skin the robot in many different ways. I'm sure people will find very interesting ways to yeah, versions of Optimus.

Speaker 38

Thanks for the great presentation. I wanted to know if there was an equivalent to interventions in Optimus. It seems like labeling through moments where humans disagree with what's going on is important, and in a humanoid robot, that might be also a desirable source of information.

Elon Musk

CEO, Tesla

Pass it to Vasudev?

Konstantinos Laskaris

Director and Lead of Optimus program, Tesla

Yeah. I think we will have ways to remotely operate the robot and intervene when it does something bad, especially when we are training the robot and bringing it up. Hopefully, we, you know, design it in a way that we can stop the robot. If it's gonna hit something, we can just, like, hold it and it'll stop. It won't, like, you know, crush your hand or something, and those are all intervention data. Yeah, we can learn a lot from our simulation systems too, where we can check for collisions and supervise that those are bad actions.

Elon Musk

CEO, Tesla

I mean, Optimus, we want over time for it to be, you know, an android, the kind of android that you've seen in sci-fi movies, like Star Trek: The Next Generation, like Data. But obviously, we could program the robot to be less robot-like and more friendly and, you know, it can obviously learn to emulate humans and feel very natural. As AI in general improves, we can add that to the robot and, you know, it should be obviously able to do simple instructions or even intuit what it is that you want. You could give it a high-level instruction, and then it can break that down into a series of actions and take those actions.

Speaker 39

Hi. Yeah, it's exciting to think that with the Optimus, you will think that you can achieve orders of magnitude of improvement in economic output. That's really exciting. When Tesla started, the mission was to accelerate the advent of renewable energy or sustainable transport. With the Optimus, do you still see that mission being the mission statement of Tesla, or is it going to be updated with, you know, mission to accelerate the advent of, I don't know, infinite abundance or?

In limitless economy.

Elon Musk

CEO, Tesla

Yeah, I mean, it is not strictly speaking, Optimus is not strictly speaking, directly in line with accelerating sustainable energy. It, you know, to the degree that it is more efficient at getting things done than a person, it does, I guess, help with, you know, sustainable energy. It. I think the mission effectively does somewhat broaden with the advent of Optimus, to, you know, I don't know, making the future awesome. You know, I think you look at Optimus and, I don't know about you, but I'm excited to see what Optimus will become. You know, this is like, you know, if you could.

I mean, we can tell like any given technology. Do you wanna see what it's like in a year, two years, three years, four years, five years, 10? I'd say for sure. You definitely wanna see what's happening with Optimus. Whereas, you know, a bunch of other technologies are, you know, have sort of plateaued. Won't name names here, but, you know, so I think Optimus is gonna be incredible in like five years. 10 years, like mind-blowing. I'm really interested to see that happen, and I hope you are too.

Speaker 40

Oh, thank you. I have a quick question here. Justin. I was wondering, like, are you planning to extend like conversational capabilities for the robot? My second follow-up question to that is, what's like the end goal? What's the end goal with Optimus?

Elon Musk

CEO, Tesla

Yeah. Optimus will definitely have conversational capabilities. You'd be able to talk to it and have a conversation, and it would feel quite natural. From an end goal standpoint, I don't know, I think it's gonna keep evolving, and I'm not sure where it ends up, but someplace interesting for sure. You know, we always have to be careful about, you know, don't go down the Terminator path. That's a, you know. I thought maybe we should start off with a video of like the Terminator starting off with this, you know, skull crushing, but, well, that might be, I don't know, people might take that too seriously.

You know, we do want Optimus to be safe, so we are designing in safeguards where you can locally stop the robot. With like basically a localized control ROM that you can't update over the internet, which I think that's quite important. Essential, frankly. Like a localized stop button, remote control, something like that cannot be changed. But I mean, it's definitely gonna be interesting. It won't be boring.

Speaker 40

Okay, yeah. I see you today you have very attractive product with Dojo and its applications. So I'm wondering what's the future for the Dojo platform. Will you like provide like infrastructure as service like AWS, or you be like a sell the chip like the NVIDIA? So basically what's the future? Because I see you use 7 nanometer, so the development cost is like easily over $10 million. How do you make the business like business-wise?

Elon Musk

CEO, Tesla

Yeah, I mean, Dojo is a very big computer, and actually will use a lot of power and needs a lot of cooling. I think it's probably gonna make more sense to have Dojo operate in like an Amazon Web Services manner than to try to sell it to someone else. The most efficient way to operate Dojo is just have it be a service that you can use, that's available online and where you can train your models way faster and for less money. As the world transitions to Software 2.0-

Speaker 40

That's on the bingo card.

Elon Musk

CEO, Tesla

That's someone I know and has to now drink five tequilas. Software 2.0

Speaker 41

Yeah.

Elon Musk

CEO, Tesla

Will use a lot of neural net training. You know, it kinda makes sense that over time, as there's more neural net stuff, the people will want to use the fastest, lowest cost neural net training system. I think there's a lot of opportunity in that direction.

Speaker 41

Hi. My name is Ali Jahanian. Thank you for this event. It's very inspirational. My question is, I'm wondering what is your vision for humanoid robots that understand our emotions and art and can contribute to our creativity?

Elon Musk

CEO, Tesla

Well, I think there's this, you're already seeing robots that at least are able to generate very interesting art with like DALL·E and DALL·E 2. I think we'll start seeing AI that can actually generate even movies that have coherence, like interesting movies and tell jokes. It's quite remarkable how fast AI is advancing at many companies besides Tesla. We're headed for a very interesting future. Yeah, any of you guys wanna comment on that?

Felix Sygulla

Robotics Engineer, Tesla

Yeah, I guess, the Optimus robot can come up with physical art, not just digital art. You can, you know, you can ask for some dance moves in text or voice, and then it can, like, produce those in the future. It's a lot of, like, physical art, not just digital art. Yeah.

Elon Musk

CEO, Tesla

Oh, yeah. Computers can absolutely make physical art. Yeah.

Felix Sygulla

Robotics Engineer, Tesla

Yeah.

Elon Musk

CEO, Tesla

100%.

Felix Sygulla

Robotics Engineer, Tesla

Yeah, like.

Elon Musk

CEO, Tesla

Sure

Felix Sygulla

Robotics Engineer, Tesla

play soccer or whatever you

Elon Musk

CEO, Tesla

Yeah

Felix Sygulla

Robotics Engineer, Tesla

I mean, it needs to get more agile, but over time, for sure.

Elon Musk

CEO, Tesla

Mm-hmm.

Speaker 42

Thanks so much for the presentation. For the Tesla Autopilot slides, I noticed that the models that you were using were heavily motivated by language models, and I was wondering what the history of that was and how much of an improvement it gave. I thought that was a really interesting, curious choice to use language models for the lane transitioning.

John Emmons

Lead of Autopilot Vision Team, Tesla

There are sorta two aspects for why we transitioned to language modeling.

Elon Musk

CEO, Tesla

Talk loud and close to the-

John Emmons

Lead of Autopilot Vision Team, Tesla

Okay.

Elon Musk

CEO, Tesla

It's not coming through very clearly.

John Emmons

Lead of Autopilot Vision Team, Tesla

Okay, got it. Yeah, the language models help us in two ways. The first way is that it lets us predict lanes that we couldn't have otherwise. As Ashok mentioned earlier, basically, when we predicted lanes in sort of a dense 3D fashion, you can only model certain kinds of lanes. But if we wanna get those crisscrossing connections inside of intersections, it's just not possible to do that without making it a graph prediction. If you try to do this with dense segmentation, it just doesn't work. Also, the lane prediction is a multimodal problem. Sometimes you just don't have sufficient visual information to know precisely how things look on the other side of the intersection, so you need a method that can generalize and produce, you know, coherent predictions. You don't wanna be predicting two lanes and three lanes at the same time.

You wanna commit to one, and a generative model like these language models provides that.
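The "commit to one" point can be illustrated with a toy example. This is a sketch under assumed numbers, not the lane network itself: a hypothetical distribution over lane counts on the far side of an intersection, where averaging the hypotheses gives an answer that matches neither, while sampling returns one coherent hypothesis.

```python
import random


def expected_value(dist: dict) -> float:
    """Averaging over hypotheses, as a regression-style predictor would."""
    return sum(value * prob for value, prob in dist.items())


def sample_hypothesis(dist: dict, rng: random.Random) -> int:
    """Generative-style prediction: draw one hypothesis and commit to it."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical belief: 60% two lanes, 40% three lanes.
    lane_count_belief = {2: 0.6, 3: 0.4}
    rng = random.Random(0)
    print("averaged:", expected_value(lane_count_belief))
    print("sampled:", sample_hypothesis(lane_count_belief, rng))
```

The averaged prediction is a fractional lane count that corresponds to no real road layout, whereas every sampled prediction is one of the plausible layouts.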

Speaker 43

Hi. My name is Giovanni. Yeah, thanks for the presentation. It's really nice. I have a question for FSD team. For the neural networks, how do you do unit tests, software unit tests on that? Like, do you have, like, a bunch or, I don't know, mid-thousands or yes, cases where the neural network that after you train it, you have to pass it before you release it to as a product, right? Yeah, what's your software unit testing strategies for this basically?

Speaker 45

Yeah, glad you asked. There's, like, a series of tests that we have defined, starting from, you know, unit tests for the software itself. For the neural network models, we have VAP sets defined. If you just have a large test set, that's not enough, we find. We need, like, sophisticated VAP sets for different failure modes, and then we curate them and grow them over the time of the product. Over the years, we have, like, hundreds of thousands of examples where we have been failing in the past that we have curated. For any new model, we test against the entire history of these failures, and then keep adding to this test set.

On top of this, we have shadow modes, where we ship these models silently to the car, and we get data back on where they are failing or succeeding. There's an extensive QA program. It's very hard to ship a regression. There's, like, nine levels of filters before it hits customers, but then we have really good infra to make this all efficient.
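A minimal sketch of the regression gate described in this answer, assuming each curated failure-mode set is a list of (clip, expected output) pairs and a model is a plain callable. The set names, clip IDs, and functions `evaluate_sets` and `regressed_sets` are hypothetical and are not Tesla's tooling.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (hypothetical clip id, expected output)


def evaluate_sets(model: Callable[[str], str],
                  curated_sets: Dict[str, List[Example]]) -> Dict[str, float]:
    """Accuracy of the model on each curated failure-mode set."""
    scores = {}
    for name, examples in curated_sets.items():
        correct = sum(1 for clip, expected in examples if model(clip) == expected)
        scores[name] = correct / len(examples)
    return scores


def regressed_sets(candidate: Dict[str, float],
                   baseline: Dict[str, float],
                   tolerance: float = 0.0) -> List[str]:
    """Names of sets where the candidate scores below the baseline."""
    return [name for name, base in baseline.items()
            if candidate.get(name, 0.0) < base - tolerance]


# Toy data: two failure modes, two clips each.
curated_sets = {
    "stopped_vehicle": [("clip_1", "stop"), ("clip_2", "stop")],
    "cut_in": [("clip_3", "yield"), ("clip_4", "go")],
}
baseline_outputs = {"clip_1": "stop", "clip_2": "go", "clip_3": "yield", "clip_4": "go"}
candidate_outputs = {"clip_1": "stop", "clip_2": "stop", "clip_3": "go", "clip_4": "go"}

baseline_scores = evaluate_sets(baseline_outputs.get, curated_sets)
candidate_scores = evaluate_sets(candidate_outputs.get, curated_sets)
blocked_by = regressed_sets(candidate_scores, baseline_scores)
```

Scoring per set rather than on one pooled test set is the point: in the toy data the candidate fixes `stopped_vehicle` but regresses on `cut_in`, and a single aggregate accuracy would hide that.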

Elon Musk

CEO, Tesla

I'm one of the QA testers, so I QA the car.

Speaker 45

Yeah, me too. QA tester.

Elon Musk

CEO, Tesla

I'm constantly in the car just QA-ing, like, whatever the latest alpha build is that doesn't totally crash.

Speaker 45

Yeah. Finds a lot of bugs.

Elon Musk

CEO, Tesla

Yeah.

Speaker 43

Hi. Great event. I have a question about foundational models for autonomous driving. We have all seen that big models, when you scale up with data and model parameters, right, from GPT-3 to PaLM, can actually now do reasoning. Do you see it as essential to scale up foundational models with data and size, so that at least you can get a teacher model, right, that potentially can solve all the problems, and then you distill to a student model? Is that how you see foundational models being relevant for autonomous driving?

Speaker 45

That's quite similar to our auto-labeling models. We don't just have models that run in the car. We train models that are entirely offline, that are, like, extremely large, that can't run in real time on the car. We just run those offline on our servers, producing really good labels that can then train the online networks. That's one form of distillation of these teacher-student models. In terms of foundation models, we are building some really, really large datasets that, you know, are multiple terabytes, and we are seeing that some of these tasks work really well when we have these large datasets. Like the kinematics, like I mentioned, video in, all the kinematics out of all the objects and up to the fourth derivative. People thought we couldn't do detection with cameras, depth, velocity, acceleration.

Imagine how precise these have to be for these higher order derivatives to be accurate, and this all comes from these kind of large datasets and large models. We are seeing the equivalent of foundation models in our own way for geometry and kinematics and things like those. Do you wanna add anything, John?
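To make the precision point concrete, here is a small worked example, assuming position estimates carry a worst-case alternating error of plus or minus `eps` and are differentiated by simple finite differences at a time step `dt`. This is only an illustration of why higher-order derivatives demand precise inputs; the talk describes networks that output kinematics directly, and the values of `eps` and `dt` are made up.

```python
def finite_difference(samples, dt):
    """First-order forward difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


def derivative_errors(eps, dt, order, n=12):
    """Worst-case error in derivatives 1..order caused by position noise.

    Differencing is linear, so the error in each derivative equals that
    derivative of the noise alone. Alternating +eps/-eps is the worst case.
    """
    current = [eps if i % 2 == 0 else -eps for i in range(n)]
    errors = []
    for _ in range(order):
        current = finite_difference(current, dt)
        errors.append(max(abs(v) for v in current))
    return errors


# 1 cm of position noise sampled every 0.1 s (assumed values).
errors = derivative_errors(eps=0.01, dt=0.1, order=4)
# Roughly 0.2 m/s, 4 m/s^2, 80 m/s^3, 1600 m/s^4.
```

Each additional derivative multiplies the worst-case error by `2 / dt`, so the fourth derivative is usable only if the underlying estimates are far more precise than the first derivative alone would require.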

John Emmons

Lead of Autopilot Vision Team, Tesla

Yeah, I'll keep it brief. Basically, whenever we train on a larger dataset, we see big.

Elon Musk

CEO, Tesla

Talk a little loud.

John Emmons

Lead of Autopilot Vision Team, Tesla

Okay. Basically, whenever we train on a larger dataset, we see big improvements in our model performance. Basically, whenever we initialize our networks with, you know, some pre-training step from some other auxiliary task, we basically see improvements. The self-supervised or supervised with large datasets both help a lot.

Speaker 44

Hi. At the beginning, Elon said that Tesla was potentially interested in building artificial general intelligence systems. Given the potentially transformative impact of technology like that, it seems prudent to invest in technical AGI safety expertise specifically. I know Tesla does a lot of technical narrow AI safety research. I was curious if Tesla was intending to, or would try to, build expertise in technical artificial general intelligence safety specifically.

Elon Musk

CEO, Tesla

Well, I mean, if it starts looking like we're gonna be making a significant contribution to artificial general intelligence, then we'll for sure invest in safety. I'm a big believer in AI safety. I think there should be an AI sort of regulatory authority at the government level, just as there is a regulatory authority for anything that affects public safety. We have a regulatory authority for aircraft and cars and sort of food and drugs because they affect public safety, and AI also affects public safety. I think, and this is not really something that government I think understands yet, but I think there should be a referee that is ensuring or trying to ensure public safety for AGI.

You think of like, well, like what are the elements that are necessary to create AGI? Like, the accessible data set is extremely important, and if you've got a large number of cars and humanoid robots processing, you know, petabytes of video data and audio data from the real world, just like humans, that might be the biggest data set. It probably is the biggest data set. Because in addition to that, you can obviously incrementally scan the internet. But what the internet can't quite do is have millions or hundreds of millions of cameras in the real world and with, like I said, with audio and other sensors as well.

I think we probably will have the most amount of data, and probably the most amount of training power, therefore probably we will make a contribution to AGI.

Speaker 26

Hey, I noticed the Semi was back there, but we haven't talked about it too much. I was just wondering, for the Semi truck, what are the changes you're thinking about from a sensing perspective? I imagine there's very different requirements obviously than just a car. If you don't think that's true, why is that true?

Elon Musk

CEO, Tesla

No, I think basically you can drive a car. I mean, think about it, what drives any vehicle? It's a biological neural net with eyes or cameras essentially. Your primary sensors are two cameras on a slow gimbal, a very slow gimbal. That's your head. You know, if a biological neural net with two cameras on a slow gimbal can drive a Semi truck, then if you've got like eight cameras with continuous 360-degree vision, operating at a higher frame rate and much higher reaction rate, then I think it is obvious that you should be able to drive a Semi or any vehicle much better than a human.

Speaker 27

Hi, my name is Akshay. Thank you for the event. Assuming, you know, Optimus would be used for different use cases and would evolve at different pace for these use cases, would it be possible to sort of develop and deploy different software and hardware components independently and deploy them, you know, in the, in Optimus so that the overall, you know, feature development is faster for Optimus?

Elon Musk

CEO, Tesla

What was the question? Okay. All right. We did not comprehend. Unfortunately, our neural net did not comprehend the question. Yeah, next question.

Speaker 28

Hi. I want to switch gears to the Autopilot. When do you guys plan to roll out the FSD Beta to countries other than the U.S. and Canada? My next question is, what's the biggest bottleneck or technological barrier you think there is in the current Autopilot stack, and how do you envision solving that to make the Autopilot considerably better than a human in terms of, like, performance metrics, like safety assurance and human confidence? I think you also mentioned that for V11, you guys are going to combine the highway and the city into a single stack, with some big architectural improvements. Can you maybe expand a bit on that? Thank you.

Elon Musk

CEO, Tesla

Well, that's a whole bunch of questions. I think from a technical standpoint, it should be possible to roll out FSD Beta worldwide by the end of this year. We, you know, for a lot of countries, we need regulatory approval, and so we are somewhat gated by the regulatory approval in other countries. I think from a technical standpoint, it will be ready to go to a worldwide beta by the end of this year.

There's quite a big improvement that we're expecting to release next month that will be especially good at assessing the velocity of fast-moving cross traffic and a bunch of other things. Anyone wanna elaborate?

John Emmons

Lead of Autopilot Vision Team, Tesla

Yeah, I guess so. There used to be a lot of differences between production Autopilot and the Full Self-Driving Beta, but those differences have been getting smaller and smaller over time. I think as of just a few months ago, we now use the same vision-only object detection stack in both FSD and in the production Autopilot on all vehicles. There's still a few differences, the primary one being the way that we predict lanes right now. We upgraded the modeling of lanes so that it could handle these more complex geometries like I mentioned in the talk. In production Autopilot, we still use a simpler lane model, but we're extending our current FSD Beta models to work in all sorts of highway scenarios as well.

Elon Musk

CEO, Tesla

Yeah, the version of FSD Beta that I drive actually does have the integrated stack. It just uses the FSD stack both in city streets and highway, and it works quite well for me. We need to validate it in all kinds of weather, like heavy rain, snow, dust, and just make sure it's working better than the production stack across a wide range of environments. We're pretty close to that. I mean, I think it's, I don't know, maybe it'll definitely be before the end of the year and maybe November.

Tim Zaman

Head of AI Infrastructure and Platform, Tesla

Yeah, in our personal drives, the FSD stack on highway drives already way better than the production stack we have. We do expect to also include the parking lot stack as a part of the FSD stack before the end of this year. That will basically bring us to the point where you sit in the car in a parking lot and it drives all the way to the end, to a parking spot, before the end of this year.

Elon Musk

CEO, Tesla

Yeah, the fundamental metric to optimize against is how many miles between necessary interventions. Just massively improving how many miles the car can drive in full autonomy before an intervention is required that is safety critical. Yeah, that's the fundamental metric that we're measuring every week, and we're making radical improvements on that.
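The metric described here is a simple ratio, sketched below under the assumption that each drive log records miles driven and a count of safety-critical interventions. The `Drive` record and `miles_per_intervention` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Drive:
    miles: float
    safety_critical_interventions: int


def miles_per_intervention(drives: Iterable[Drive]) -> float:
    """Total miles divided by total safety-critical interventions."""
    drives = list(drives)
    total_miles = sum(d.miles for d in drives)
    total_interventions = sum(d.safety_critical_interventions for d in drives)
    if total_interventions == 0:
        return float("inf")  # no interventions observed in this sample
    return total_miles / total_interventions


weekly_drives = [Drive(120.0, 1), Drive(80.0, 0), Drive(100.0, 1)]
weekly_metric = miles_per_intervention(weekly_drives)  # 300 miles / 2 = 150.0
```

Pooling miles and interventions before dividing, rather than averaging per-drive ratios, keeps short drives with zero interventions from skewing the number.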

Speaker 29

Oh, hi. Thank you. Hi. Thank you so much for the presentation. Very inspiring. My name is Daisy. I actually have a non-technical question for you. I'm curious, if you were back to your twenties, what are some of the things you wish you knew back then? What are some advice you would give to your younger self?

Elon Musk

CEO, Tesla

Well, I'm trying to figure out something useful to say.

John Emmons

Lead of Autopilot Vision Team, Tesla

Join Tesla.

Elon Musk

CEO, Tesla

Yeah. Join Tesla would be one thing. Yeah, I think just trying to expose yourself to as many smart people as possible. I don't know, read a lot of books. You know, I did do that though. I think there's some merit to just also like not being like necessarily too intense and like enjoying the moment a bit more, I would say to 20 or 20-something me. Just, you know, stop and smell the roses occasionally would probably be a good idea.

You know, it's like when we were developing the Falcon 1 rocket on the Kwajalein Atoll, and we had this beautiful little island that we were developing the rocket on, and not once during that entire time did I even have a drink on the beach. I'm like, "Well, I should have had a drink on the beach. That would've been fine."

Speaker 30

Thank you very much. I think you have excited all of the robotics people with Optimus. This feels very much like ten years ago in driving, but driving has proved to be harder than it actually looked ten years ago. What do we know now that we didn't ten years ago that would make, for example, AGI on a humanoid come faster?

Elon Musk

CEO, Tesla

Well, I mean, it seems to me that AGI is advancing very quickly. Hardly a week goes by without some significant announcement and, yeah, I mean, at this point, like AI seems to be able to win at almost any rule-based game. It's able to create extremely impressive art, engage in conversations that are very sophisticated, you know, write essays, and these just keep improving. And there's so much more, so many more talented people working on AI, and the hardware is getting better. I think it's just AI is on a super, like a strong exponential curve of improvement independent of what we do at Tesla. And obviously we'll benefit somewhat from that exponential curve of improvement with AI.

Like, Tesla just also happens to be very good at actuators, at motors, you know, motors, gearboxes, controllers, power electronics, batteries, sensors. And, you know, really, I'd say that the biggest difference between the robot on four wheels and the robot with arms and legs is getting the actuators right. It's an actuators and sensors problem. Obviously, you know, there's how you control those actuators and sensors, but it's, yeah, actuators and sensors and how you control the actuators. I don't know, we have to have, like, the ingredients necessary to create a compelling robot, and we're doing it, so.

Speaker 31

Hi, Elon. You are actually bringing humanity to the next level. Literally, Tesla and you are bringing humanity to the next level. You said Optimus Prime, Optimus will be used in the next Tesla factory. My question is, will a new Tesla factory be fully run by the Optimus program, and when can the general public order a humanoid?

Elon Musk

CEO, Tesla

Yeah. I think it'll, you know, we're gonna start Optimus with very simple tasks in the factory. You know, like, maybe just, like, loading a part, like you saw in the video, loading a part, or, you know, carrying a part from one place to another or loading a part into one of our more conventional robot cells that, you know, welds a body together. We'll start, you know, just trying to figure out how do we make it useful at all, and then gradually expand the number of situations where it's useful. I think that the number of situations where Optimus is useful will grow exponentially, like, really, really fast. In terms of when people can order one, I don't know.

I think it's not that far away. Well, I think you meant when can people receive one? I don't know. I'm like, I'd say probably within three years and not more than five years. Within three to five years, you could probably receive an Optimus.

Speaker 31

I feel the best way to make progress on AGI is to involve as many smart people across the world as possible. Given the size and resources of Tesla compared to robot companies, and given the state of humanoid research at the moment, would it make sense for Tesla to sort of open source some of the simulation and hardware parts? I think Tesla can still be the dominant platform, where it can be something like Android OS or like iOS for the entire humanoid research field. Rather than keeping Optimus to just Tesla researchers or the factory itself, can you open it and let the whole world explore humanoid research?

Elon Musk

CEO, Tesla

I think we have to be careful about Optimus being potentially used in ways that are bad, 'cause that is one of the possible things to do. I think we'd, you know, provide Optimus where you can provide instructions to Optimus, but where those instructions are, you know, governed by some laws of robotics, that you cannot overcome. You know, not doing harm to others and we'd have, I think, probably quite a few safety related things with Optimus. Yeah. All right, we'll just take maybe a few more questions and then thank you all for coming.

Speaker 32

Questions, one deep and one broad. On the deep, for Optimus, what's the current and what's the ideal controller bandwidth? In the broader question, there's this big advertisement for the depth and breadth of the company. What is it uniquely about Tesla that enables that?

Elon Musk

CEO, Tesla

Anyone wanna tackle the bandwidth question?

Speaker 45

The technical bandwidth

Elon Musk

CEO, Tesla

Yep

Speaker 58

... of the-

Elon Musk

CEO, Tesla

Close to your mouth and loud.

Speaker 45

Okay. For the bandwidth question, you have to understand or figure out what is the task that you want it to do, and if you took a frequency transform of that task, what is it that you want your limbs to do? That's where you get your bandwidth from. It's not a number that you can specifically just say. You need to understand your use case, and that's where the bandwidth comes from.

Elon Musk

CEO, Tesla

What is the broad question? I don't quite remember. The breadth and depth thing. I can answer the breadth and depth one. Yeah.

Speaker 45

We can switch.

Elon Musk

CEO, Tesla

... um.

Speaker 58

Oh.

Elon Musk

CEO, Tesla

I mean, it's interesting. On the bandwidth question, I think we probably will just end up increasing the bandwidth or, you know, which translates to the effective dexterity and reaction time of the robot. Like, you can, it's safe to say it's not one hertz, and it's maybe you don't need to go all the way to 100 hertz, but I don't know, maybe 10, 25. I don't know. Over time, I think the bandwidth will increase quite a bit, or translated to dexterity and latency. You'd wanna minimize that over time. Yeah. Minimize latency, maximize dexterity.

I mean, in terms of breadth and depth, I guess, you know, we're a pretty big company at this point, so we've got a lot of different areas of expertise that we necessarily had to develop in order to make electric cars and then in order to make autonomous electric cars. We've just, I mean, Tesla is like a whole series of startups basically, and so far they've like almost all been quite successful. We must be doing something right.

I consider one of my core responsibilities in running the company is to have an environment where great engineers can flourish. I think in a lot of companies, I don't know, maybe most companies, if somebody's a really talented, driven engineer, they're unable to actually, their talents are suppressed at a lot of companies. It's, you know, some of the companies, the engineering talent is suppressed in a way that is maybe not obviously bad, but where it's just so comfortable and you're paid so much money and the output you actually have to produce is so low that it's like a honey trap, you know?

Like, there's a few honey trap places in Silicon Valley where they don't necessarily seem like bad places for engineers, but you have to say, like, a good engineer went in, and what did they get out? The output of that engineering talent seems very low, even though they seem to be enjoying themselves. That's why I call it there's a few honey trap companies in Silicon Valley. Tesla is not a honey trap. We're demanding, and it's like we're gonna get a lot of shit done, and it's gonna be really cool. It's, you know, not gonna be easy. But if you are a super talented engineer, your talents will be used, I think, to a greater degree than anywhere else, you know. SpaceX also that way.

Speaker 33

Hi, Elon. I have two questions, so both to the Autopilot team. The thing is, like, I have been following your progress for the past few years. Today, you have made changes on, like, the lane detection. Like, you said that, like, previously you were doing instance semantic segmentation. Now you guys have built transformer models for, like, building the lanes. What are some other common challenges which you guys are facing right now, like, which you are solving in future as a curious engineer, so that, like, we as a researcher can work on those, start working on those? And the second question is, like, I'm really curious about the data engine. Like, you guys have, like, told a case, like, where the car is stopped.

How are you finding cases which is very much similar to that from the data which you have? Like, so little bit more on the data engine would be great. That's it.

Phil Duan

Director of Engineering, Tesla

Okay. I'll start. I'll answer the first question, using the occupancy network as an example. What you saw in the presentation did not exist a year ago. We only spent one year of our time. We're actually shipping more than 12 occupancy networks. To have one foundation model, actually, to represent the entire physical world, everywhere, in any and all weather conditions, is actually really, really challenging. Only over a year ago, we were kind of like driving in a 2D world. If there's a wall and if there's a curb, we kind of represent them with the same static edge, which is obviously, you know, not ideal, right? There's a big difference between a curb and a wall. When you drive, you make different choices, right? After we realized that, we had to go to 3D.

We have to basically rethink the entire problem and think about how we address that. This will be like one example of challenges we have conquered in the past year.

Lizzie Miskovetz

Senior Manager Mechanical Engineering, Tesla

Yeah. To answer the question about how we actually source examples of the tricky stopped cars, there's a few ways to go about this, but two examples are, one, we can trigger for disagreements within our signals. Let's say that parked bit flickers between parked and driving, we'll trigger that back. The second is we can leverage more of the shadow mode logic. If the customer ignores the car, but we think we should stop for it, we'll get that data back too. These are just different, like, various trigger logic that allows us to get those data campaigns back.
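The two trigger types mentioned in this answer could look roughly like the sketch below, assuming the parked attribute arrives as one boolean per frame and that the planner's intent and the driver's action are available as booleans. The function names and the `max_flips` threshold are hypothetical.

```python
from typing import Sequence


def flicker_count(parked_bits: Sequence[bool]) -> int:
    """Number of frame-to-frame flips in the parked/driving prediction."""
    return sum(1 for a, b in zip(parked_bits, parked_bits[1:]) if a != b)


def signal_disagreement_trigger(parked_bits: Sequence[bool],
                                max_flips: int = 2) -> bool:
    """Fire when the parked bit flickers more than a stable signal would."""
    return flicker_count(parked_bits) > max_flips


def shadow_mode_trigger(model_wants_stop: bool, driver_stopped: bool) -> bool:
    """Fire when the silent model would stop but the driver kept going."""
    return model_wants_stop and not driver_stopped


# A track whose parked bit keeps flipping is worth sending back.
flaky_track = [True, False, True, False, True]
stable_track = [False, False, True, True]
send_flaky = signal_disagreement_trigger(flaky_track)    # True, 4 flips
send_stable = signal_disagreement_trigger(stable_track)  # False, 1 flip
```

Both triggers are cheap checks on signals the car already computes, so in this sketch they only decide which clips to request and do not affect driving behavior.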

Speaker 34

Hi.

Elon Musk

CEO, Tesla

Sorry, go ahead.

Speaker 34

Thank you for the amazing presentation. Thanks so much. There are a lot of companies that are focusing on the AGI problem. One of the reasons why it's such a hard problem is because the problem itself is so hard to define. Several companies have several different definitions. They focus on different things. What is Tesla, how is Tesla defining the AGI problem, and what are you focusing on specifically?

Elon Musk

CEO, Tesla

Well, we're not actually specifically focused on AGI. I'm simply saying that AGI seems likely to be an emergent property of what we're doing, because we're creating all these autonomous cars and autonomous humanoids that are actually within a truly gigantic data stream that's coming in and being processed. It's by far the most amount of real world data and data you can't get by just searching the internet 'cause you have to be out there in the world and interacting with people and interacting with the roads and just, you know, Earth is a big place and reality is messy and complicated.

I think it's sort of like, it just seems likely to be an emergent property of, if you've got, you know, tens or hundreds of millions of autonomous vehicles and maybe even a comparable number of humanoids, maybe more than that on the humanoid front. Well, that's just the most amount of data. If that video is being processed, it just seems likely that, you know, the cars will definitely get way better than human drivers, and the humanoid robots will become increasingly indistinguishable from humans, perhaps. Like you said, you have this emergent property of AGI.

Arguably, you know, humans collectively are sort of a super intelligence as well, especially as we improve the data rate between humans. I mean, think like, it seems to me way back in the early days of the Internet was like, the Internet was like humanity acquiring a nervous system, where now all of a sudden, any one element of humanity could know all of the knowledge of humans by connecting to the Internet, almost all the knowledge, or certainly a huge part of it. Whereas previously, we would exchange information by osmosis, you know, like, in order to transfer data, you would have to write a letter, someone would have to carry the letter by person to another person, and then a whole bunch of things in between.

It was like, yeah, I mean, insanely slow when you think about it. Even if you were in the Library of Congress, you still didn't have access to all the world's information, and you certainly couldn't search it. Obviously, very few people are in the Library of Congress. I mean, one of the great sort of equality elements, like the Internet has been the most, the biggest equalizer in history in terms of access to information and knowledge. Any student of history, I think, would agree with this because, you know, you go back 1,000 years, there were very few books.

Like, and books would be incredibly expensive, and only a few people knew how to read, and only an even smaller number of people even had a book. Now, look at it, like you can access any book instantly. You can learn anything, basically for free. It's pretty incredible. You know, I was asked recently, what period of history would I prefer to be at the most? My answer was right now. This is the most interesting time in history, and I read a lot of history. Let's do our best to keep that going. Yeah.

To go back to one of the other questions that were asked, the thing that's happened over time with respect to Tesla Autopilot is that the neural nets have gradually absorbed more and more software. In the limit, of course, you could simply take the videos as seen by the car and compare those to the steering inputs from the steering wheel and pedals, which are very simple inputs. In principle, you could train with nothing in between, 'cause that's what humans are doing with a biological neural net.

You could train based on video, and what trains the video is the movement of the steering wheel and the pedals, with no other software in between. We're not there yet, but it's gradually going in that direction. All right. Oh, maybe one last question.

Moderator

How you going?

Elon Musk

CEO, Tesla

I think we got a question at the front here. Hello. There, right there. Oh, we'll do two questions. Fine.

Speaker 45

Over there or here?

Elon Musk

CEO, Tesla

Um-

Speaker 60

Hi. Thanks for such a great presentation.

Elon Musk

CEO, Tesla

We'll do your question last.

Speaker 35

Okay, cool. With FSD being used by so many people, how do you evaluate the company's risk tolerance in terms of performance statistics? And do you think there needs to be more transparency or regulation from third parties as to what's good enough, and defining, like, thresholds for performance across so many miles?

Elon Musk

CEO, Tesla

Well, you know, the number one design requirement at Tesla is safety. Like, and that goes across the board. In terms of the mechanical safety of the car, we have the lowest probability of injury of any cars ever tested by the government, for just a passive mechanical safety, essentially crash structure, and airbags and whatnot. We have the highest rating for active safety as well. I'm just gonna get to the point where the active safety is so ridiculously good, it's like just absurdly better than a human.

With respect to Autopilot, we do publish, broadly speaking, the statistics on miles driven with cars that have no autonomy, or Tesla cars with no autonomy, with kind of Hardware 1, Hardware 2, Hardware 3, and then the ones that are in FSD Beta. We see steady improvements all along the way. You know, sometimes there's this dichotomy of, you know, should you wait until the car is like, I don't know, three times safer than a person before deploying any technology. I think that is actually morally wrong. At the point at which you believe that adding autonomy reduces injury and death, I think you have a moral obligation to deploy it.

Even though you're gonna get sued and blamed by a lot of people because the people whose lives you saved don't know that their lives are saved, and the people who do occasionally die or get injured, they definitely know or their estate does, you know, whatever there was a problem with Autopilot. That's why you have to look at the numbers in sort of total miles driven, how many accidents occurred, how many accidents were serious, how many fatalities. You know, we've got well over 3 million cars on the road, so that's a lot of miles driven every day. It's not gonna be perfect, but what matters is that it is very clearly safer than not deploying it. Yeah.

I think, last question.

Moderator

Last.

Elon Musk

CEO, Tesla

I think, yeah, so, thanks. Well, the last question here.

Moderator

You gotta come out here. I got it.

Speaker 61

Okay.

Moderator

Yeah, I got it.

Speaker 37

Okay, hi. I do not work on hardware, so maybe the hardware team and you guys can enlighten me. Why is it required that there be symmetry in the design of Optimus? Because humans, we have handedness, right? We use some set of muscles more than others. Over time, there is wear and tear, right? Maybe you'll start to see some joint failures or some actuator failures more over time. I understand that this is extremely pre-staged. Also, we as humans have based so much fantasy and fiction over superhuman capabilities. Like, all of us don't want to walk right over there. We want to extend our arms, and, like, we have all these.

Elon Musk

CEO, Tesla

Sure.

Speaker 32

You know, a lot of fantasy, fantastical designs. Considering everything else that is going on in terms of batteries and intensity of compute, maybe you can leverage all those aspects into coming up with something, well, I don't know, more interesting in terms of your, the robot that you're building. I'm hoping you're able to explore those directions.

Elon Musk

CEO, Tesla

Yeah. I mean, I think it would be cool to have, like, you know, make Inspector Gadget real. That would be pretty sweet. Yeah, I mean, you know, right now we just wanna make a basic humanoid work well, and our goal is the fastest path to a useful humanoid robot. I think this will ground us in reality, literally, and ensure that we are doing something useful. Like, one of the hardest things to do is to be useful, and then to have high utility under the curve of, like, how many people did you help, you know, and how much help did you, you know, provide to each person on average, and then how many people did you help? The total utility.

Like, trying to actually ship useful product that people like to a large number of people is so insanely hard, it boggles the mind. You know, that's why I can say, like, man, there's a hell of a difference between a company that has shipped product and one that has not shipped product. It's, again, this is night and day. Then even once you ship product, can you make the cost, the value of the output worth more than the cost of the input, which is, again, insanely difficult, especially with hardware. I think over time, I think it'd be cool to do creative things and have, like, eight arms and whatever, and have different versions, and maybe, you know, there'll be some hardware, like, companies that are able to add things to an Optimus.

Like, maybe we'd, you know, add a power port or something like that or attachment. You can add, you know, attachments to your Optimus, like you can add them to your phone. There could be a lot of cool things that could be done over time, and there could be maybe an ecosystem of small companies, or big companies, that make add-ons for Optimus. With that, I'd like to thank the team for their hard work. You guys are awesome. Yeah. Thank you all for coming, and for everyone online, thanks for tuning in. I think this will be one of those great videos where you can, like, you can fast-forward to the bits that you find most interesting.

We try to give you a tremendous amount of detail, literally so that you can look at the video at your leisure, and you can focus on the parts that you find interesting and skip the other parts. Thank you all. We'll try to do this every year. We might do a monthly podcast even. I think it'd be, you know, great to sort of bring you along for the ride and, like, show you what cool things are happening. Yeah. Thank you. All right. Thanks.