
Bank of America Global A.I. Conference 2023

Sep 11, 2023

Operator

Ladies and gentlemen, the program is about to begin. Reminder that you can submit questions at any time via the Ask Questions tab on the webcast page. At this time, it is my pleasure to turn the program over to your host, Vivek Arya.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Thank you so much, and good day, everyone. Glad you could join us for this afternoon's keynote session. Really delighted and honored to have Ian Buck, General Manager and Vice President of NVIDIA's Accelerated Computing Business. Also, importantly, the inventor of CUDA, which is the key operating system underlying every NVIDIA accelerator. So really glad to have some time with him so he can share his perspectives. So, Ian, I'll turn it over to you.

I think you have one opening remark, but what I would really love to do is get your perspective on how requirements for AI hardware have changed throughout your tenure at NVIDIA. And especially, we always talk about hardware, but sometimes we forget that a very key part of that is the software ecosystem. So if you could give us a perspective on how NVIDIA's software capability has really helped to cement your dominance on the hardware side in AI.

Ian Buck
General Manager and VP, NVIDIA

Yeah, thank you, and a pleasure being with you here this morning. And of course, as a reminder, this presentation contains forward-looking statements, and investors are always advised to read our reports filed with the SEC for information related to risks and uncertainties facing our business. Yes, we've been working on accelerated computing for quite some time. In fact, it dates all the way back to 2006, when we first introduced CUDA.

Initially, the goal was to address how to program this new kind of architecture, this new kind of processor, which had reached a level of programmability beyond just playing video games and rendering beautiful pictures and could become a computing platform. We can't accelerate every workload, and to this day we always want to make sure we have the best CPUs matched with our GPUs in the right configurations and the right ratios. But for the portions of the computation that can be accelerated, which are typically either highly data parallel, massively parallel, or just compute intensive, we worked with the community to figure out how to accelerate those workloads, to run them on an architecture that's designed for high-compute, high-throughput needs.

It started, of course, with high-performance computing, a community that is looking to use computers, in some cases supercomputers, to simulate nature, to simulate physics, to simulate problems that can't easily be studied in a wet lab or under a microscope, or that happen at a scale, the scale of the Earth or the cosmos, where we just need a computer, a digital instrument, an instrument of science. All the way from 2006 up until that first AI moment in 2012, we made our platform available as a software platform.

We made CUDA available with every one of our GPUs, including the gaming and graphics GPUs, the ones everyone had in their workstations and their laptops and their PCs. That was, of course, before a lot of the cloud traction. By building a software platform that engaged developers, rather than a strictly hardware platform that defined an ISA, we met the developers where they were. It made it very easy for researchers, Ph.D. students, and engineers at companies to take their NVIDIA GPU, download CUDA for free, along with all the libraries and software that had been developed over time, and figure out how to apply it to their problem, to port their code, whether it be C, Fortran, or today Python, Java, and others, and move that compute-rich portion over.
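To illustrate what moving the compute-rich portion over can look like, here is a minimal Python sketch. It is a hypothetical example, not taken from the session: the open-source CuPy library stands in for the CUDA software stack, and the same element-wise computation runs on the CPU with NumPy or on the GPU simply by swapping the array module.

```python
# Minimal, hypothetical sketch: offloading a data-parallel computation from
# the CPU (NumPy) to a CUDA GPU (CuPy) without rewriting the algorithm.
import numpy as np

try:
    import cupy as cp      # GPU array library built on CUDA
    xp = cp                # use the GPU when one is available
except ImportError:
    xp = np                # fall back to the CPU otherwise

def simulate_step(positions, velocities, dt):
    """One step of a toy particle simulation; runs unchanged on CPU or GPU."""
    positions = positions + velocities * dt
    # Element-wise, massively parallel work maps naturally onto GPU threads.
    velocities = velocities * 0.999 + xp.sin(positions) * dt
    return positions, velocities

n = 10_000_000
pos = xp.zeros(n, dtype=xp.float32)
vel = xp.ones(n, dtype=xp.float32)
for _ in range(100):
    pos, vel = simulate_step(pos, vel, dt=1e-3)
```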

That decision up front to make it a software platform, in combination with a hardware platform, was really important for a couple of reasons. First, it met the developers where they were. We didn't have to wait for others to build a software ecosystem around it. Frankly, it would have been difficult to do so, and it would have taken a long time, given the bootstrapping problem. Second, it expanded the innovation space. We could innovate at the hardware layer, the compiler layer, the system software layer, and the library layer, and of course everyone else had the opportunity to contribute as well, so that the performance delivered over time is the compounding of all of those innovations, at the hardware, the system driver, the developer software, and of course all the libraries on top.

If you track that progress over time, it's quite dramatic. That's the benefit of accelerated computing. It allows compounding value to be delivered. It also allows NVIDIA to innovate at an extreme clip. We are not constrained by the interface at these lower levels, like instruction sets. We're only constrained by the problems we think we can address up here. And if it requires us to change our architecture, to change our instruction set, to build a totally different kind of GPU, or to build a GPU that can talk to other GPUs over NVLink and scale across the GPUs in a system, in a rack, or across the entire data center, we can do that.

Because we define the interface up here, and how we engage, we can do all of that at an extremely rapid clip, which allows our engineers to produce new GPU architectures roughly every two years now, in some cases sooner. It allows us to think differently about how CPUs and GPUs want to be connected, and it also allows us to make the entire data center our canvas for innovation, for making change, for influencing. So that first decision, I think, to engage at a different point up here, has allowed us to really innovate, move quickly, and invite everyone else to participate in that ecosystem. And we've been doing it now for approaching 20 years at NVIDIA.

There you go.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Got it. What part of that software stack, Ian, is substitutable? So, for example, in the early days it made a lot of sense, right, to couple the two, but now you have so many other people who are also involved in the ecosystem, whether it's the hyperscalers or the R&D software teams of many of your hardware competitors. So what part of your software ecosystem is substitutable? Can I take an application written for NVIDIA and find a way to port it over to somebody else's hardware, as an example, using a combination of these third-party tools and other open-source software?

Ian Buck
General Manager and VP, NVIDIA

Yeah. Great question, and I get asked this a lot. Certainly, it is possible to take one workload or one AI model or one specific algorithm and get it working on anyone's hardware and platform. What makes it hard is to make it a platform for continuous optimization and evolution, and to be a platform that can run all the workloads that want to be run inside of a data center. So today, if you look at our software stack, we have, of course, multiple hardware platforms, ranging from PCIe cards that run at 70 watts and fit in any server, like the L4, to larger 300-watt PCIe cards, up to HGX-based boards, which have multiple GPUs talking over NVLink.

We've even shared how we can scale effectively to entire rack-scale, or even row-scale, GPUs. Then on top of that, you have, of course, the system software, all the compilers and all the libraries, which then get integrated into the open ecosystem, and that includes the hyperscalers' software, software like PyTorch, software like PaxML. The wonderful part about AI is it's so open that we can all innovate together in that ecosystem. So it's certainly possible to spike different implementations of different models into those stacks. What makes it hard is that those platforms I've mentioned, the ones in the community, need to run all the different workloads that operate today across the entire data center.

You don't build a data center for one model. You're going to run a data center to do large language models, to do all of generative AI, as well as other data science or other use cases that you may need. You also want to accelerate end to end. Often I'll see someone spike a particular layer or a particular model, but to deploy an AI service, we have to do all of the ingestion, the data prep, run the query, run the model, and produce the output, and in some cases perform multiple other stages of AI. Like, I want to have it talk back to me instead of just replying with text. That's also now being done with AI.

The other part is that it has to be a platform for innovation, because large language models and generative AI are not standing still. A few years ago, I'd still be talking to you about ResNets, or about convolutional neural networks, or about U-Nets and other recommender models. These things are still important, but with so many people innovating in LLMs and generative AI, models are being invented at a clip that's way faster than we're actually producing new architectures.

So in order to be a platform for that, and of course to be investing at data center scale, which is a huge capital investment and takes lots of time, you need to be a platform where people can trust that the innovations happening in generative AI are going to run really well. And again, that comes back to the end-to-end performance and optimizations that we're trying to make. Certainly, you can pipe one model into a stack, but to run all the models and be the innovation platform is a much more challenging task, one that requires a connection, benchmarking, and all those customers giving you input in order to keep improving your platform over time.

And we find optimizations everywhere. One of the benefits and fun parts about working at NVIDIA is that we get to work with all the different AI companies, so we get to optimize the layers of the stack that matter. There isn't just one part of that stack that can simply be replaced in order to port. You really have to get the end-to-end workload. And again, it is possible, but it's challenging to sustain.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

... Now let's talk about generative AI. It has obviously caught everyone by surprise in a good way, right? And demand seems to be exploding. So let's talk first about training, and then about generative AI inference. On the training side, it seems like every day somebody is launching yet another large language model, and NVIDIA dominates the market for training a lot of those models. Do you see a point at which we get to some kind of cliff or maturation in demand for training? And do you think, as people start to optimize the size of these models, that that actually puts pressure on the demand for training hardware?

Like, how sustainable is the demand for AI training when we are already producing so many large language models?

Ian Buck
General Manager and VP, NVIDIA

Yeah, large language models are different. Why are they so large? That's one question you could ask. Unlike computer vision models in the past, or simpler recommender models, large language models are effective because they're typically interacting directly with humans. In order to directly interact with humans, they need to understand human knowledge. One of the reasons GPT is so large is that it's trained on, and trying to represent, the corpus of human understanding so it can interact with us.

They download the Internet, if you will, and teach the model what humans know, so we can have a starting point, a baseline, a foundational model that captures human understanding and knowledge. That is obviously much larger than what you would need for a computer vision model, which is still very important, but which can be trained on a set of images and eventually learn what those images are. So they tend to be very large models. They also tend to be great foundation models for specialization. You can specialize them for different workloads, starting from that foundational model and moving toward perhaps your own data.

So you're starting from something that understands humans, or understands how to interact with humans, and then taking it to your proprietary data, so you can interact with it, ask questions of that data, and of course leverage the general capability. So when you ask about the capacity and how this is going to grow over time, that is it. It is how you interact with computers, with the cloud, with your data. And that's hugely, immensely valuable.

It's immensely valuable for improving how customers interact with companies, and how the people helping those customers can have an assistant sitting right with them, able to ask questions and get prompted information from knowledge bases and other sources, to provide a better experience. It allows large language [audio distortion] enable recommenders, so people who want to provide content can provide the right content as you're on your news feed or in your e-commerce-

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Right.

Ian Buck
General Manager and VP, NVIDIA

... to get the right words and the right context shared with you. So it literally touches every part of e-commerce and of company interactions with customers, and it's sort of the answer to understanding the decades of big data we've been living in. Does this tail off? I think it becomes a continuous space for innovation, across the board. There's not going to be one model to rule them all. There will be a large diversity of different models, based on the innovations that are going to continue in the space, and also specialization across all of these fields. And by the way, we're seeing it in healthcare and science and drug discovery. Large language models don't have to be just about the language of humans.

It could be the language of biology or physics or materials science as well. So what is the growth vector and what does it look like? It's the rate at which innovators are adding and defining and inventing new optimization techniques and new kinds of models. It may start from some of these heroic, amazing models coming from people like OpenAI, with the GPT models and what we're seeing from there. But much of this research is being published, or the models themselves are being published, and they influence and create alternatives or derivatives. So that is the scale we are thinking about for generative AI and large language models.

The scope isn't necessarily the size of the model per se. They're going to remain large in the sense that they have to remain large in order to have a baseline level of foundational intelligence. The scale will really grow as more and more industries, more and more companies, and the rest of the enterprise adopt this technique for how they interact with customers and data and apply it to their businesses. Certainly, the hyperscalers were the first to jump on it. They obviously had the talent, the capital, and the ability to basically invent much of this technology side by side with NVIDIA. I got to experience that; it was a fascinating experience. They continue to do so, and continue to push the limits and figure out how to apply it.

You can see them starting to scale AI across their businesses, and now it's branching out to the rest of the enterprise, the rest of the industry. You're seeing a whole new tier of cloud offerings. We're seeing specialty, regional GPU data centers popping up everywhere to serve the market; they operate differently, a little more agile, perhaps a bit smaller, but they can be more focused. And then there's a large litany of middleware, solutions, and software companies that are trying to help enterprises and other companies deploy this technology across the board. So there's definitely a broadening of the large language model ecosystem.

The adaptation of generative AI and language models to business is really the scaling factor that we experience, and that will continue, for sure.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Now, kind of a similar question, but applied to the generative AI inference side. What is NVIDIA's strategy for generative AI inference? Because the perception is that on the training side the company dominates, but most of the products are very expensive. So when it comes to really scaling generative AI inference, which I think is the way your customers will monetize this at the end of the day, how are you going to help them monetize it? What does your product pipeline look like to help them with gen AI inference? And does the competitive landscape change as you move from training to inference?

Ian Buck
General Manager and VP, NVIDIA

So, thank you for that question, and I think people often get a little bit confused here. Certainly, the starting point for deploying some of these models begins with the training clusters. They'll stand up infrastructure, previously A100s, HGX systems. These systems have eight GPUs, NVLink connected, running at the maximum possible performance, and of course have InfiniBand to scale-

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Right.

Ian Buck
General Manager and VP, NVIDIA

- across an entire data center. Today, it's being deployed with Hopper. What you train on is the natural platform for what to do inference on, since training and inference are highly related: in order to train a model, you have to first infer, then calculate the error, and then apply the error back to the model to make it smarter. The first step of training is inference, repeated over and over. So it is natural that customers are deploying their inference models on their training clusters, on their HGX. But it's not the only place where we see inference.
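As a hypothetical illustration of that relationship (the model and data below are placeholders, not anything discussed in the session), here is a minimal PyTorch training step: it begins with the forward pass, which is exactly the inference computation, then calculates the error and applies it back to the model.

```python
# Hypothetical sketch: every training step starts with inference (the forward
# pass), then calculates the error and applies it back to the model.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 512)                        # placeholder batch
targets = torch.randn(32, 512)

for step in range(100):
    outputs = model(inputs)                          # 1. inference (forward pass)
    loss = loss_fn(outputs, targets)                 # 2. calculate the error
    optimizer.zero_grad()
    loss.backward()                                  # 3. propagate the error back
    optimizer.step()                                 # 4. update the model
```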

We see inference happening across the spectrum, all the way down to the L4 GPU. I should have brought one. It's a 72-watt GPU, half height, half length, about candy bar size. It's smaller than my phone and fits in any server. Any server that has a PCIe slot can now become an accelerated server. We've seen the clouds adopt it, and the OEMs, the rest of the system infrastructure, adopt it, because it's great for inferencing. It's ideal. It has video encode and decode capabilities, so we're seeing it used for smart city applications and image processing.

It can also run small LLMs for recommenders or small tasks, and we also see it used for generative AI, for image generation, for running Stable Diffusion-like models. And it's at a price point that's very comparable to CPUs, so in fact, in many cases, a much better TCO than a CPU running the same model. We've got plenty of material on that. If you need to go up a click, you have the L40, a full-size PCIe card, which is often used for larger inferencing and fine-tuning tasks.

So taking an existing foundational model and fine-tuning it, doing that sort of last-mile specialization for your data and workload, is a much lighter task than the larger training cluster, and it can be done on an L40 or an L40S PCIe-based server, again available in every OEM system. So these provide different price points and different capabilities, and then you can scale all the way up to an NVLink-connected system. For NVLink-connected systems, we often see people running on a single node, where you have a model of a certain size that needs to execute within a certain latency to be interactive, say a half-second latency response for Q&A, for example.

So by connecting them with NVLink, we can basically make eight GPUs act as one GPU, in order to run the model that much faster and provide that real-time latency. So our inference platform consists of many, many choices to optimize for TCO, for workload, and for delivered performance. In the case of inference, it usually is about data center throughput at a certain latency, and that's important. The other part of it, the roadmap, is software. I want to go back to that, because it's easy to look at a benchmark result, see a bar chart, and assume that's the speed of the hardware. What is often underreported or not represented is the investment that NVIDIA makes in the software stack for inference.

You can actually apply even more optimizations than in training, because in inference you're kind of at the last mile, so you can optimize the model further, beyond what is perhaps possible during training. For Hopper, for example, we just released, actually last week, a new piece of software called TensorRT-LLM. TensorRT is our optimizing compiler for inference, and we now have an LLM version. The optimizations we made in that software, just in the last month, doubled Hopper's performance on inference. And that came through a whole bunch of optimizations: optimizing for the Tensor Core that's in H100, using eight-bit floating point, and improving the scheduling and execution software that manages the GPU's resources to increase its effective throughput and computational efficiency.

It's a really hard task. You're trying to optimize by using reduced precision while serving requests of all different sizes, from quick Q&A to summarization tasks, to "write me a long email" or "generate a full PowerPoint." A data center running Hopper, or a data center running inference generally, is going to be asked to do all those things. Getting that to run efficiently, managing all that workload, and keeping GPUs 100% utilized is actually pretty hard mathematical, statistical, AI system software and even hardware-level optimization. So we will continue to do that. Just in the last month, we've doubled our performance on Hopper for inference, and we'll continue to do so, and you'll see that as we continue.
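To make the scheduling challenge concrete, here is a toy, library-free sketch of the batching idea. It is an illustration only, not TensorRT-LLM's actual implementation: requests of very different sizes are continuously packed into steps under a fixed token budget, so the GPU stays busy rather than idling while the longest request finishes.

```python
# Toy sketch (not TensorRT-LLM's implementation): continuously pack mixed-size
# requests, from short Q&A to long summarization, under a per-step token budget
# so the hardware stays fully utilized.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: int
    tokens_remaining: int          # tokens still to be generated

def schedule_step(queue, token_budget):
    """Pick as many pending requests as fit within this step's token budget."""
    batch, used = [], 0
    for _ in range(len(queue)):
        req = queue.popleft()
        if used + 1 > token_budget:     # one decode token per request per step
            queue.append(req)           # does not fit; retry next step
            continue
        used += 1
        batch.append(req)
    return batch

queue = deque(Request(i, n) for i, n in enumerate([8, 512, 32, 2048, 16]))
while queue:
    for req in schedule_step(queue, token_budget=4):
        req.tokens_remaining -= 1       # pretend the GPU generated one token
        if req.tokens_remaining > 0:
            queue.append(req)           # unfinished requests stay "in flight"
```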

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Ian, do you think that the industry has the right cost structure for generative AI inference at scale? I see that, you know, more as a user, when I go to, you know, take your pick of search engines, right? Whether it's Bard or, you know, ChatGPT or what have you. Even when we put in queries today, it takes several seconds to get an answer, right? It's a very different experience than we are used to in traditional search engines. So do you think the industry is there? You know, today it seems like everyone is training a lot of things and trying a lot of things, but do you think the industry actually has the cost structure to take generative AI and scale the inference side?

Because I imagine that's what it will take to really grow this industry in a very sustainable way over the next several years.

Ian Buck
General Manager and VP, NVIDIA

Yeah, it's a great question. Today, most of the live inferencing you're experiencing is, of course, on our previous-generation GPUs. That's just naturally what it was originally developed, optimized, and deployed on. And many of our large customers are actually just now bringing on their Hopper versions, so you'll see that improvement. In terms of performance, from where we were with A100 a year ago to today, it's about an 8X improvement going to H100. And that, again, is a bump from the hardware side and activating those capabilities, and another bump from the software side. So I expect the interaction you're all experiencing to get better and more intelligent.

There's a fixed latency that we all want to experience, and then it becomes a question of the size and capability of the model that can fit in that latency window. So it will be a process of continuous improvement. You asked about search. Can every search I type in take advantage of this, or be fully optimized, if it takes this long? There are aspects of generative AI and language models that are already being used today that you may not know about. When you type in a search, they're not using those words literally to index in. They're actually applying language models to generate a more optimized query string, if you will, to search on, based on your history and other things.

So we are seeing aspects of that. We also see things like transformers and large language model technologies being applied in last-mile recommender systems. As they get down to the last 100 documents or pieces of information that they want to understand or ingest to produce a result, can I run a smaller, more constrained transformer-based model to provide that last-mile recommendation from the tens or hundreds or thousand, whatever I can afford? So you are seeing some of that technology being deployed today, and being deployed on GPUs today. The next click up, of course, will be having a richer experience with search.
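As an aside, here is a toy sketch of that last-mile idea. It is illustrative only, not any production recommender: a small transformer-style scorer reranks only the final hundred or so candidates that cheaper retrieval has already selected, which is far less compute than running a large model over the whole corpus.

```python
# Toy last-mile reranker sketch (illustrative only): a small transformer scores
# ~100 candidate documents against a query after retrieval has narrowed things
# down, and the top few become the final recommendations.
import torch
import torch.nn as nn

EMBED_DIM, NUM_CANDIDATES = 256, 100

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True),
    num_layers=2,
)
score_head = nn.Linear(EMBED_DIM, 1)

# Placeholder embeddings standing in for a real query and candidate documents.
query = torch.randn(1, 1, EMBED_DIM)
candidates = torch.randn(1, NUM_CANDIDATES, EMBED_DIM)

# Let the encoder attend across query and candidates, then score each candidate.
hidden = encoder(torch.cat([query, candidates], dim=1))
scores = score_head(hidden[:, 1:, :]).squeeze(-1)         # one score per candidate
top10 = torch.topk(scores, k=10, dim=-1).indices          # final recommendations
```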

I expect to see more of that with Hopper. It may take a few more clicks, but with every generation of our GPUs, with every invention of new software optimization techniques, and with every invention by the community, whether it be the next Llama 2 or the next GPT, we bring down the cost of inference. Hopper brought it down from A100 by 8X, and the TCO improvement is also on the order of 5X. You compound that with continuous software improvement, and compound it with new model and algorithm techniques, and there's an order of magnitude more capability that's going to be available to everyone. And the best part is, it's on the GPUs they've already purchased. It's already there.

In fact, the performance we're delivering with every one of these new pieces of software, or the performance that's possible with more optimized AI algorithms or models, is essentially free, continual improvement in that TCO, that performance, and that experience. So it's a fascinating time. It's super busy. We're seeing new innovations come in all the time, and it's definitely keeping NVIDIA and the community busy, continuously optimizing the platform.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

... Got it. I wanted to get your perspective now, Ian, on the competitive landscape. When we look at the demand profile for NVIDIA's accelerated products, right? Tens of billions, expected to increase next year. Doesn't that give a lot more incentive to your hyperscale customers to create more custom ASIC solutions? One customer, with the TPU product, has had a custom solution for a long time, and there are a lot of headlines about the others wanting to have internal solutions. So, first of all, what is the right positioning of your product versus their internal solutions?

Do they use one for, you know, one kind of workload and one for the other, or does it become a greater competitive threat for NVIDIA, going forward?

Ian Buck
General Manager and VP, NVIDIA

One way to look at this would be what just happened at the GCP Next conference, Google's conference, I think about two weeks ago. They announced the new variant of their processor that day in their keynote. And in that same keynote, Jensen joined them on stage and talked about all the innovation that we're doing together with Google, with GCP, and not just new instances, they announced general availability of their A3 instance, but also the integration of GPUs into their Vertex AI platform, and many of the research innovations that are happening on GPUs elsewhere inside Google.

It gives you an example of how, while the hyperscalers absolutely have the means to invest and optimize and build something that may be tailored for workloads that are obviously important to their business, they continue to partner deeply with NVIDIA, our GPUs, and our software teams. That's two big companies advancing what we can do together, them helping us, us helping them, partnering on many of the software platforms to continue to innovate. And you see NVIDIA out there as an open platform, of course, available on every cloud, and as an open software ecosystem, to help advance the state of the art in AI, in data science, and in accelerated computing holistically.

That lift comes from now almost 20 years of investing in a software developer ecosystem, and you'll continue to see some of the hyperscalers, of course, building their own silicon, if they have the means to optimize for specific workloads that they can focus on for their businesses. But they still can remain in close connection with NVIDIA because they see the opportunity to not just serve a broader ecosystem, but also innovate and be a platform for accelerating computing across the board. And that is something we're quite comfortable with, and it's been a good partnership, and I think it was really evident in that keynote.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Got it. Do you see that change at all as we move more toward generative AI? Just where the cost of training is so high, and the cost of inference is also going to be quite high. Do you think that increases their desire to bring on more ASIC solutions than they have done in the past?

Ian Buck
General Manager and VP, NVIDIA

You know, that's a choice for them, where they want to optimize and invest. One thing that is true is that NVIDIA is spending and investing billions in R&D to optimize for generative AI, for training and inference at scale. And with every generation of our GPU, with every generation of our interconnect, InfiniBand, and CX networking technology, with every innovation of NVLink, those things improve the TCO, increase performance dramatically, and also bring down the cost of training. Now, they're obviously motivated to scale up what they can possibly do in order to develop something uniquely advanced, or uniquely new or different, that they can capitalize on.

But by working with NVIDIA, they can basically leverage the billions of dollars of investment that we're making in those core workloads of training and deploying inference for large language models, for generative AI in that space. And it's a question for them of where they're going to decide to optimize, and whether to take that step further to do something which may be different and doesn't necessarily take advantage of all the time and energy and investment that NVIDIA is making. So that's a choice they have to consider and make. We're going to continue, regardless, to innovate at a pace that benefits them and the entire community. So we will continue to see those things happen.

I'm sure some of it will make sense for them, but our focus doesn't change. We continue to swarm and innovate, to increase our performance, lower costs, and also increase capability for generative AI and large language models.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Makes sense. The next topic I wanted to approach, Ian, was this emerging class of kind of converged CPU-GPU products, for example Grace Hopper, and your competitor is also announcing, right, some of their own products. So what are the pros and cons of using those? I don't know whether converged CPU-GPU is the right way to refer to them, but how do they stack up against the more discrete solution where I'm just using standard x86 CPUs with one or many GPUs? What are the pros and cons of moving to this kind of converged architecture?

Ian Buck
General Manager and VP, NVIDIA

Yeah, we've been optimizing, and the community has been optimizing, for accelerated computing and AI for 20 years. We've moved a huge amount of the computation to the GPU at this point. So for many workloads, including many of the AI workloads, 95% or 99% of the computing is done on the GPUs, and they are communicating directly with each other, either through NVLink or across InfiniBand, without touching the CPU. And all of the CPU workload is either small, or can be optimized, or can be done in parallel and overlapped with the GPU computation.

Of course, you still have to have a high-performance CPU there to do the other tasks, usually around data prep, scheduling, managing, and coordinating the execution. And every time we increase our GPU performance, we need to make sure that our CPU performance keeps up, so that we don't make it the bottleneck, Amdahl's law. One way to manage that, of course, is to use the best possible CPUs, which we encourage and use ourselves. You can also adjust the ratio of CPUs to GPUs.
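As a back-of-the-envelope illustration of the Amdahl's law point (using the 95% and 99% figures mentioned above; the 50x acceleration factor is an arbitrary placeholder, not a measured number):

```python
# Amdahl's law: the unaccelerated CPU-side fraction caps the overall speedup,
# which is why CPU performance has to keep up with the GPUs.
def amdahl_speedup(parallel_fraction, accel_factor):
    """Overall speedup when only `parallel_fraction` of the work is accelerated."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / accel_factor)

for frac in (0.95, 0.99):
    print(f"{frac:.0%} accelerated 50x -> {amdahl_speedup(frac, 50):.1f}x overall")
# About 14.5x overall at 95% and 33.6x at 99%: the remaining CPU work quickly
# becomes the bottleneck.
```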

Today, if you look at a DGX system, it's two CPUs for eight GPUs, but we can do one to four, we can do one to eight, we can of course do two to one, or, in Grace Hopper, we went all the way to one to one and put them next to each other. That's one angle. The other part, though, is about converging. What happens when you combine CPUs and GPUs and do something different from a traditional x86 architecture, where a CPU is sitting over here and you go over PCIe to a GPU over there? First, by bringing the two together, converged, you can dramatically improve the bandwidth, the communication between those two processors.

Today, that's hundreds of gigabytes a second, versus the 60 or maybe 100 gigabytes a second from a PCIe connection. You can also be much more coherent, so you can bring the two memory systems together. On the GPU, today we ship an 80 GB HBM GPU, and we've announced going up to 144 GB per GPU. But you can then connect it to a Grace, and because the connection is so fast, the 600 GB of memory around the Grace CPU basically becomes a combined fast memory platform, allowing you to run even larger models, effectively making a 600 GB GPU.
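For a rough sense of what that combined memory means for model size, here is a back-of-the-envelope calculation using the figures quoted above. It assumes 16-bit weights and counts weights only; a real deployment also needs memory for the KV cache, activations, and framework overhead.

```python
# Rough, hypothetical sizing using the figures quoted above: how many 16-bit
# parameters fit in the GPU's HBM alone versus the combined "600 GB GPU".
BYTES_PER_PARAM = 2                    # FP16/BF16 weights

def max_params_billions(memory_gb):
    """Parameters (in billions) whose weights alone fit in `memory_gb`."""
    return memory_gb * 1e9 / BYTES_PER_PARAM / 1e9

print(f"144 GB HBM alone:       ~{max_params_billions(144):.0f}B parameters")
print(f"600 GB combined memory: ~{max_params_billions(600):.0f}B parameters")
# Weights alone: roughly 72B versus 300B parameters, before KV cache,
# activations, and framework overhead.
```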

This both allows you to run larger models with a single platform, a single GPU-CPU complex, and opens up new avenues for new kinds of workload acceleration, especially workloads on large data: applications like vector databases, and applications like graph neural networks, which are used a lot in finance, fraud, and e-commerce, and also for recommenders. These are very large datasets that can be run today across many GPUs, but could perhaps be run more optimally, or at a different TCO point, on a much larger GPU like Grace Hopper, 600 GB combined in one, because the two have been tied together. The third thing about convergence is that it's another vector for innovation.

We can add things to a CPU that optimize for the workloads we already know about, or for other opportunities we see in the future, to innovate in the CPU ecosystem, in the CPU space, in addition to the GPU and in addition to networking at data center scale. You see that in the work we're doing with our DGX GH200: connecting even more GPUs together, having that excellent CPU-GPU ratio, having the NVLink, having the large memory, really gives a vision of the future of infrastructure for generative AI. One where you have basically 256 GPUs all connected with NVLink, fully backed by 256 Grace CPUs.

And because it's all NVLink, it effectively acts as one exaflop GPU, which is an amazing generative AI platform for both training and extreme large language model inference, where we might need multiple GPUs connected optimally for the lowest latency. So it's really those three things. It provides larger memory as a starting point for a building block, and it's a great scale-out platform for inference as a result; Grace Hopper, in any server, is a complete complex of CPU, GPU, and memory. It allows us to play with the ratios and explore different CPU-to-GPU ratios for different kinds of workloads. And it's an innovation space.

We've made some innovations in Grace: while we're using an Arm-based core, the SoC architecture of Grace and how those cores talk to each other is quite powerful. It's showing up in many of our benchmarks, and it provides a great companion to those compute-rich workloads that NVIDIA has been focused on for the last two decades.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Got it. I know we only have a few minutes left, but I wanted to get your take on the last two questions, Ian. One of which is: what's the role of the networking stack in an optimized generative AI cluster? How much of an advantage does NVIDIA have because you're able to leverage InfiniBand? And when that InfiniBand changes over to Ethernet, does it mean, conversely, that you lose some of that advantage, because hyperscalers want to move more to Ethernet? So first, what is the role of that networking as part of the cluster, and does anything change when it moves from InfiniBand to Ethernet?

Ian Buck
General Manager and VP, NVIDIA

Yeah, great question. There are basically three interconnects at this point that are a choice in how to design and deploy AI. There's NVLink, which previously was how GPUs talked to each other directly inside of a system and is now going to more of the rack and row scale. You have InfiniBand, which was originally developed for HPC, for the supercomputing industry, for the lowest possible latency at data center scale; it was really designed for that. And then, of course, you have Ethernet: industry established, designed for high manageability and capability, and it comes with a rich ecosystem of all the features that not just enterprises but the clouds need in order to run a managed, software-defined infrastructure.

What you will see, of course, is that NVLink will continue to be very closely tied to the innovations we'll be making inside of our own GPUs. There, it's as fast as we can go, because as GPUs get faster, we want to connect them as quickly as possible and continue to allow them to operate as one. And to get the lowest possible latency for inference on some of these giant models, you need to be using the techniques around model parallelism, which have extremely high intercommunication requirements, so that you can basically split the model this way instead of just that way to decrease latency.
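A simplified sketch of that model-parallel idea, assuming PyTorch's torch.distributed with the NCCL backend (illustrative only; production systems use purpose-built libraries): a layer's weight matrix is split column-wise across GPUs, each GPU computes its shard, and the partial results are gathered over the interconnect, which is why interconnect bandwidth and latency matter so much for latency-bound inference.

```python
# Simplified tensor-parallel sketch (illustrative, not a production system):
# split a layer's weights column-wise across GPUs, compute locally, and
# all-gather the pieces over NCCL (which uses NVLink where available).
# Assumes the script is launched with torchrun, one process per GPU.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device(f"cuda:{rank}")

hidden, out_features = 4096, 4096
shard = out_features // world                       # each GPU owns a slice

weight_shard = torch.randn(hidden, shard, device=device)    # this GPU's columns
x = torch.randn(1, hidden, device=device)                   # same input on every GPU

local_out = x @ weight_shard                                 # partial result
pieces = [torch.empty_like(local_out) for _ in range(world)]
dist.all_gather(pieces, local_out)                           # exchange over the interconnect
full_out = torch.cat(pieces, dim=-1)                         # the layer's full output

dist.destroy_process_group()
```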

InfiniBand also continues to grow. Its design point, of course, is the lowest possible latency, as well as providing the excellent bandwidth that it does. And we see that it does provide a significant performance improvement over leveraging, say, RoCE, a converged Ethernet stack, which still has a lot of the manageability and which can deliver comparable performance. In fact, we support many clusters and many deployments in the cloud at scale with Ethernet, with RoCE, and it works great. For the best possible performance, InfiniBand gets that extra click up, and that basically comes from its HPC heritage of having the lowest latency with high bandwidth. And we can do other optimizations as well, such as in-network computation.

We can actually do some math inside the switch and inside the network fabric with InfiniBand. I fully expect Ethernet to improve, and we are working with the community to improve Ethernet's performance as well, which is great because it comes with all that manageability and software-definedness. All three will exist in the ecosystem for a while, as three layers of performance and scale, with different requirements between reliability, manageability, security, and enterprise deployment versus maximum possible performance. The roadmaps will continue, and performance will go up. I expect it to be staggered, but those technologies will continue to learn and absorb from each other.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Got it. And finally, I would love to get your perspective on where we are in terms of rolling out generative AI. Because when we look at applications, they seem to be in their infancy, right? There are not that many applications. But when we look at just the rate of growth of your data center business, it seems to be a very big proportion of the total spending pie, right? So what gives you pause when thinking about the fact that you are already such a big part of the spending pie? How sustainable is this growth rate for NVIDIA over the next several years?

Ian Buck
General Manager and VP, NVIDIA

So it's a fascinating question to think about. Today, if you think about where we are and the growth we're experiencing right now, it's people taking their existing data centers and optimizing them to incorporate more and more GPUs, more and more LLM and generative AI workloads. That may be coming from the hyperscalers themselves, from enterprises that want to get on board using the clouds, for example, or now from the regional specialty GPU providers also standing up infrastructure. But we're largely going into the data centers that already exist, because you can't just build data centers overnight. It takes two years plus to build out the infrastructure.

What I see is the world looking to pivot how they're building data centers in the future, and we're seeing really exciting growth there. They all realize they need to build out more capacity, and not just the data centers they had before, which were generic in nature, perhaps more CPU-focused, because that was the majority of the servers going in. Now everyone, from hyperscale to regional to on-prem, is basically building out GPU data centers at scale. So if I look at the growth of data center build-out, you can kind of see the opportunity for LLMs continuing to grow beyond what is, in some cases, quite literally being crammed into the data centers they already have.

And that establishes where we are today versus the size of the opportunity, the size of the market, just from a data center footprint and growth-capacity standpoint. We've gone from being a corner of the data center to being what data centers are now being designed for, which is really exciting. It gives me confidence in the continued growth of our business to see how much companies, how much the world, is investing in building out that infrastructure, for all the different demand, all the different needs.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Excellent. So on that exciting note, Ian, thank you so much for taking the time to be with us, sharing your perspective. Really appreciate that. And thanks to everyone who joined this webcast. I got another 45 questions on the chat, so I'll see if I can work with Simona to help answer some of those questions. But really, thank you so much, Ian, for taking the time. It's immensely useful to get your perspective. Thank you so much.

Ian Buck
General Manager and VP, NVIDIA

Always a pleasure, and thank you very much.

Vivek Arya
Managing Director and the Senior Semiconductor Analyst, Bank of America

Pleasure. Take care.
