This event includes forward looking statements about future products and other topics, which are based on our current expectations and subject to risks and uncertainties. Please refer to the press release for this event and our SEC filings at intc.com for more information on the risk factors that could cause actual results to differ materially.
Welcome to Architecture Day 2021. In 2018, Intel highlighted the need for continued innovation in 6 key areas. Intel also identified 4 foundational compute architectures that are necessary for the next era of compute. Since then, Intel has continued to execute this vision, shipping technologies and products like Ice Lake and Tiger Lake CPUs, Intel's first discrete GPUs on the Xe LP architecture, and Agilex FPGAs.
In March, Intel announced our new manufacturing strategy, leveraging both internal and external fab infrastructure. In July, Intel unveiled one of our most detailed roadmaps yet, extending the promise of Moore's Law with new transistor, interconnect and packaging innovations. To accelerate these innovations, Intel created oneAPI, a unified programming model that seamlessly works across domains and is now being used by developers worldwide. Please welcome Intel's Chief Architect and Senior Vice President of Accelerated Computing and Graphics, Raja Koduri.
Hello, everyone. It's great to see you all. Thank you for joining us for our 3rd Architecture Day. My fellow architects and engineers and I are so happy to be with you virtually. However, I'm so, so looking forward to doing this with you all live in the same room, immersed in deep, geeky conversations over some quality ice cream and hot wings.
This is an awesome time to be a computer architect. Technology is changing at a torrid pace. Looking back at just the last year, technology was at the heart of how we all communicated, worked, played and survived through the pandemic. Computing power proved crucial. We witnessed the development of life saving vaccines, advances in computer vision, cryptocurrencies, decentralized finance, augmented reality and space travel.
And it's accelerating. We are seeing metahumans, GitHub Copilot, software creating software and this little thing called AI. Some of these save lives, others alter our lifestyles; some are game changing and some controversial. No matter where you look, our lives are intertwined with digital technology. Every demanding workload we look into and every innovating customer we talk to has one meta performance ask: 1,000x.
They ask, can you make our workloads 1000x faster by 2025? So that's just 4 years away. 1000 times is Moore's Law to the power of 5. And that sounds impossible, right? How will we do that?
To meet this 1000x demand by 2025, we will need to achieve at minimum a Moore's Law improvement, 4x or so, in each of these technology areas: process, packaging, memory and interconnect, and architecture. Architecture is the alchemy that brings them all together with software. And together, they give us the multiplicative factor. So all those 4x improvements could combine, roughly 4 to the power of 5, or about 1,024, to give us the 1,000 times we need for demanding workloads. This is just an illustration to show why it's an exciting time to be an architect.
And speaking of architects, Pat Gelsinger recently rejoined us as CEO, and he's a renowned architect. Pat reminds us that what we do is important because the world is counting on engineers to solve the most difficult problems, to enrich people's lives, to make them happier, healthier, safer, and to architect our silicon at the speed of software, which means a torrid pace. We have a rich selection of compute engines to choose from, several flavors of scalar, vector, matrix and spatial engines to combine and make hybrid computing architectures that deliver nonlinear gains on demanding workloads. When we leverage the best transistor for a given engine, connect them through advanced packaging, integrate high bandwidth, low power caches, and equip them with high capacity memories and low latency interconnects, we have hybrid computing clusters in a package.
Every product I look at in our roadmap looks like a collection of systems on packages, products ranging from watts to kilowatts. Today, my fellow architects will share our advances in accelerated hybrid computing with architectures that establish new foundations in products whose releases are imminent. You'll hear about one of the biggest shifts in x86 architecture in over a decade. We will begin by introducing 2 next generation x86 core microarchitectures. First, we will present the Efficient core, a highly scalable microarchitecture optimized for multi core performance per watt.
Next, we'll present the Performance core, optimized for single threaded performance and AI. Then we will walk you through the architectural magic that combines these two cores to deliver our first performance hybrid architecture, Alder Lake, which will delight billions of PC users. You will hear about the advances we are making in visual computing architectures with Xe HPG and our discrete graphics. Later, we will look at a new accelerated hybrid architecture designed for the data center, Sapphire Rapids, which combines our Performance cores with new accelerators. Next, we'll show you Mount Evans, our IPU, or infrastructure processing unit.
This is the beginning of hybrid infrastructure computing in a package. And we'll close with the tour de force of all the latest silicon technologies, our moonshot. Now, I'd like to welcome the Chief Architect for the Efficient x86 Core, Stephen Robinson.
Hey, Raja.
Stephen, start us off with a bang.
Absolutely. Hey, everybody. I am excited to introduce to you our new microarchitecture, previously code named Gracemont. When we started this journey, we wanted to deliver a scalable microarchitecture that could address computing needs across our entire spectrum of products, from low power mobile applications to many core microservices. Our primary goal was to build the world's most efficient x86 CPU core.
We wanted to do that while still delivering more IPC than Intel's most prolific CPU core, Skylake. We also set an aggressive silicon area target so that multi core workloads could scale out with as many cores as possible. With these architectural anchors in place, we also wanted to deliver a wide frequency range. This allows us to save power by running at low voltage and creates headroom to increase frequency and ramp up performance for more demanding workloads. Finally, we wanted to provide rich ISA features such as advanced vector and AI instructions that accelerate modern workloads. I am pleased to say that we delivered on all of our goals. It's my honor to introduce Intel's newest efficient x86 core microarchitecture.
Thanks to a deep front end, a wide back end and a design optimized to take advantage of Intel 7, this CPU core delivers a breakthrough in multi core performance. Let's now dive deeper into the details, starting with the front end. The first aspect in driving efficient IPC is to make sure we can process instructions as quickly as possible. This starts with accurate branch prediction. Without accurate branch prediction, much of the work ends up being unused, which is wasteful.
We implemented a 5,000 entry branch target cache. We complemented it with a long history based branch prediction. This helps us quickly generate accurate instruction pointers. With accurate branch prediction, things like instruction cache misses can be discovered and remedied early before becoming critical to program execution. Workloads like web browsers, databases, packet processing, these all benefit from these capabilities.
We also have a 64 kilobyte instruction cache that keeps the most useful instructions close without expending power in the memory subsystem. This microarchitecture features Intel's first on demand instruction length decoder, which generates pre decode information that's stored alongside the instruction cache. This gives us the best combination of characteristics: code that has never been seen before is decoded quickly, yet the next time it's executed, we bypass the length decoder and save energy. The new core also features Intel's revolutionary clustered out of order decoder that enables decoding up to 6 instructions per cycle, while maintaining the energy efficiency of a much narrower core.
It also includes hardware driven load balancing, which takes long chains of sequential instructions and automatically inserts toggle points to ensure parallelism. The second main aspect to achieving performance is ensuring you extract any parallelism inherent in the program. With 5 wide allocation, 8 wide retire, a 256 entry out of order window and 17 execution ports, this microarchitecture delivers more general integer IPC than the Intel Skylake core, while consuming a fraction of the power. The execution ports are scaled to the unique requirements of each unit, which maximizes both performance and energy efficiency.
4 general purpose integer execution ports are complemented by dual integer multipliers and dividers. We can also resolve 2 branches per cycle. Now for vector operations, we have 3 SIMD ALUs. The SIMD integer multiplier supports Intel's Vector Neural Network Instructions. 2 symmetric floating point pipelines allow executing 2 independent add or multiply operations.
Thanks to advanced vector extensions, we can also execute 2 floating point multiply add instructions per cycle. Advanced crypto units round out the vector stack, providing AES and SHA acceleration. Now, the final aspect to achieving efficient performance is a fast memory subsystem. 2 load pipelines plus 2 store pipelines enable 32 bytes of read and 32 bytes of write bandwidth at the same time. The L2 cache, which is shared among 4 cores, can be 2 or 4 megabytes depending on product level requirements.
This large L2 provides high performance and power efficiency for single threaded workloads by keeping data close. It also provides enough bandwidth to service all 4 cores. The L2 can provide 64 bytes of read bandwidth per cycle with 17 cycle latency. The memory subsystem has deep buffering, and each 4 core module can have up to 64 outstanding misses to the last level cache and beyond. Advanced prefetchers exist at all cache levels to automatically detect a wide variety of streaming behavior.
Now, Intel Resource Director Technology ensures that software can control resources among the cores. A robust set of security features, along with an ISA that can support a wide range of data types, is important for every new microarchitecture. We support features like Intel Control Flow Enforcement Technology and Intel Virtualization Technology Redirect Protection. We put additional focus on security validation and developed several novel techniques to harden against certain attack vectors to maintain tight security. We also implemented the AVX ISA along with new extensions to support integer AI operations.
This allows software to run with great performance. In addition to choosing what to include, one of the most important aspects of designing a new microarchitecture is deciding what not to include. We balanced this trade off by focusing on those features that were needed and keeping out the rest. This results in area efficiency, which in turn allows products to scale out the number of cores. This also helps reduce energy per instruction.
Now, minimizing power is the biggest design challenge for today's processors. Power is a combination of multiple factors, of which voltage is the most important. This microarchitecture and our focused design effort allow us to run at low voltage to reduce power consumption, while at the same time creating the power headroom to operate at higher frequencies. Okay, now let's take a look at the results of this new design. First, looking at latency, if we compare our core to a single Skylake core for a single logical processor, we deliver 40% more performance with the same power.
Now, we also deliver the same performance while consuming less than 40% of the power. To say it differently, a Skylake core would consume 2.5 times more power to achieve the same performance. This is a tremendous achievement. However, we're even more excited about the throughput results. If we compare 4 of our new CPU cores against 2 Skylake cores running 4 threads, we deliver 80% more performance while still consuming less power.
Alternatively, we deliver the same throughput while consuming 80% less power. Again, this means that Skylake would need to consume 5 times the power for the same performance. As you can imagine, these are very exciting results for us. What makes all this truly incredible is when you consider that we can deliver 4 of our new cores in a similar footprint as a single Skylake core. In conclusion, we are extremely proud of our new highly scalable microarchitecture.
Thanks to our deep front end innovations, our wide back end and design optimizations using Intel 7, we created a microarchitecture that excels at throughput efficiency. We exceed Skylake core performance while consuming less power in a smaller footprint. I want to thank all of the talented engineers on the architecture and design teams, and I also want to acknowledge the support we got from everyone within the company. Thank you for your time. Back to you, Raja.
As we just heard, the new Efficient x86 core provides a highly scalable architecture that will address compute requirements across the entire spectrum of our customer needs, from low power mobile applications to many core microservices. Next, we are going to take a deep dive into our new Performance x86 core architecture. While maintaining efficiency, this core is designed for raw speed, pushing the limits of low latency and single threaded application performance. So without further ado, I'd like to hand things over to the Chief Architect of the Performance Core, Adi Yoaz.
Hey, Raja.
Take it away.
Thank you, Raja. It's an honor to be here. Hello, everyone. I'm really excited for this opportunity to make the first public introduction of the new Performance core architecture, which was previously codenamed Golden Cove. When we started this journey, we did so with an ambitious goal: not only to deliver the highest performing CPU core Intel has ever built, but also to deliver a step function in CPU architecture performance that will drive the next decade of compute.
To do that, we focused on both general purpose compute as well as accelerated compute for emerging workloads. As we looked at the trends in current and future workload patterns, we saw that data sets are massively growing and data bandwidth requirements are becoming increasingly critical. We also see exciting trends in demands for greater AI compute. So we wanted to not only continue on the path of improving the performance of our existing AVX vector acceleration hardware and ISA, but also to expand on this with a new technology to deliver yet another step function in AI performance acceleration. Last but certainly not least, we wanted to deliver a scalable architecture with a wide dynamic range to power the broadest set of devices, from low TDP laptops to desktops to data centers.
Thus, with our new Performance core architecture, we wanted to enable core configuration options to cover the various needs across all these segments. I'm thrilled to announce that we delivered on all of our objectives, and it is my privilege to introduce Intel's new performance x86 core architecture, designed for speed, pushing the limits of low latency and single threaded application performance. To keep driving general purpose performance, we have architected the machine to become wider, deeper and smarter. It has a deeper out of order scheduler and buffers; these expose higher degrees of parallelism, but provide higher performance only if the machine is fed with instructions from the correct path and with data coming in on time for execution. To make this new wider and deeper machine effective, we also made it smarter, with features that improve branch prediction and the instruction supply, collapse dependency chains and bring data closer to the time when it is needed.
On top of the baseline features that speed up most common workloads, we added dedicated features for workloads with particular properties. For example, in order to better support applications with large code footprints, we are now tracking many more branch prediction targets. For emerging workloads with large irregular datasets, the machine can simultaneously service 4 page table walks. And for the evolving trends in machine learning, we added dedicated hardware and new ISA to perform matrix multiplication operations for an order of magnitude performance increase in AI acceleration. This is architected for software ease of use, leveraging the x86 programming model.
Additional performance is achieved through core autonomous fine grain power management technology. The Performance core integrates a new microcontroller that can capture and account for events at a granularity of microseconds instead of milliseconds and tighten the power budget utilization based on actual application behavior. The result is higher average frequency for any given application. This is our largest architectural shift in over a decade. And with that introduction, let me now dive deeper into the details of our Performance core architecture, starting with the machine's front end.
The first step in building a balanced wider core is to widen and enhance the core's front end. Micro operation supply was improved both from the decoder side and from the micro-op cache path. The length decoder is now doubled, running at 32 bytes per cycle, and 2 decoders were added, enabling 6 decoded micro-ops per cycle from the decoders. When delivering micro operations out of the micro-op cache, we can now get 8 micro-ops per cycle, and the micro-op cache itself has increased to hold 4K instead of 2.25K micro operations. This allows us to better feed the out of order engine, deliver higher micro-op bandwidth and do so with a lower latency, shorter pipeline.
To better support software with large code footprints, we doubled the number of 4K pages and large pages stored in the ITLBs. We have a smarter code prefetch mechanism hiding much of the instruction cache miss latency and improved branch prediction accuracy to reduce jump mispredicts. The branch target buffer is more than 2x larger than the one on the previous generation, which greatly improves the performance of workloads that have a lot of code. It uses a machine learning algorithm to dynamically grow and shrink its size. It shuts off excess capacity when it's not needed to save power, and it turns on extra capacity when it's needed to improve performance.
With a wider and smarter front end, we now turn to the out of order part of the machine. The out of order engine is where the magic happens, and this is what separates CPU architectures from all other architectures. We are widening the machine by going from 5 to 6 wide rename allocation and from 10 to 12 execution ports. The machine is also becoming significantly deeper, with a 512 entry reorder buffer, more physical registers and a deeper distributed per operation type scheduling window, all tuned for performance and power efficiency. To further improve both performance and power efficiency, P-core smart features enable collapsing dependency chains by executing some simple instructions at the allocation stage, thereby saving resources down the pipe.
This allows other operations residing on the critical path to run faster, while better utilizing execution bandwidth and saving power. With a wider, deeper and smarter out of order engine, we also wanted to enhance our execution units significantly. Let me start with the integer part. We added a 5th general purpose execution port with a 5th integer ALU and a 5th single cycle LEA. All 5 LEA execution units can also be used for generic arithmetic calculations like additions and subtractions, as well as for fast multiplications by some fixed numbers.
The LEA added on port 10 can also do scaled operations in a single cycle, similar to the ones we have on ports 1 and 5. It was important for us to increase compute resources here, as ALU operations are so common that much of the software will take advantage of it. Similarly, on the floating point vector side, for the many cases where vector code is prevalent, we have added new fast adders with a 2 cycle bypass between back to back floating point add operations. In our previous generation cores, floating point add operations are executed on the FMA units with a 4 cycle latency when executed on ports 0 and 1 and a 6 cycle latency when executed on port 5. The new Performance core supports the execution of new data types with new ISA and associated hardware.
The FP16 data type is now added in AVX-512 mode, where it comes with complex number support and is highly effective in speeding up networking applications. I'm super excited about one more big thing regarding our new matrix execution architecture. But before I get to that, let me finish covering our general purpose compute advancements. So turning now to the memory subsystem. The L1 data cache was opened up to supply more data to the new wide execution machine, with 50% higher throughput for the most common scalar and vector loads, while still supporting 1 kilobit per clock in the case of efficient 512 bit wide vector loads.
A very deep out of order machine with deeper load buffer and store buffer has the potential to expose much more memory parallelism. But unlocking that potential requires a whole lot of smarts. The new Performance core tackles this problem from multiple angles. The memory subsystem has learned how to identify independent loads and stores more effectively than ever before. When conflicts are recognized, functionality has been added to react immediately and recover with minimal disruption.
We have greatly increased opportunities where store data can be directly steered to a load, and latency for such cases has been optimized. The new Performance core was architected recognizing that modern workloads demand more data from all cache levels. To better address mid range working sets, the L1 data TLB has been increased by 50%, and the L1 data cache itself can fetch 25% more misses in parallel. The L1 data prefetcher has been enhanced to confidently lock on to stride patterns even in the face of an aggressively out of order execution architecture and has extended its reach 8x compared to the previous generation. The machine can now simultaneously service 4 page table walks.
This is a 2x capability improvement, beneficial for emerging workloads with large irregular datasets. A hungry compute engine needs feeding, and the L2 cache subsystem is engineered to satisfy that need. The L2 cache itself has been customized for 2 different market segments. The client Performance core gets a latency optimized 1.25 megabyte L2 cache, while the data center Performance core gets a generous 2 megabyte private cache, allowing large code and data workloads to scale to larger core counts. For big data workloads, feeding the core means pulling data into the core from across the system.
To that end, the L2 cache subsystem has more than doubled the number of demand and prefetch operations that can be serviced in parallel. A completely new L2 prefetch engine was developed to leverage a deeper understanding of program behavior. The prefetch engine observes the running program in order to estimate the likelihood of future memory access patterns. It can identify multiple potential future sequences and can prefetch down multiple potential paths, each path at a look-ahead depth individually tailored for its estimated likelihood. This chart shows the performance improvement from our current 11th Gen core architecture to the new Performance core at iso frequency.
As we see, the effect of the microarchitectural enhancements we've discussed thus far on general purpose performance provides an average improvement of 19% across a wide range of workloads. This level of improvement is even larger than what we delivered with the Sunny Cove core over the Skylake core. And this is just to give you a taste of what the performance improvement looks like for existing workloads. Of course, for new workloads that take advantage of our new ISA and architectural advancements, the numbers go up significantly. To dramatically increase the IPC of AI applications, we developed a new technology called AMX.
AMX is our next generation built in AI acceleration advancement for machine learning inference and training, targeted for the data center. Today, we are already industry leaders in CPU AI acceleration in the data center market. With INT8 used for inference, our VNNI technology delivers 256 int8 operations per cycle per core, and that's already over 2x our x86 CPU competition, but we were not satisfied with that. So with AMX, we will expand that by 8x, delivering 2,048 int8 operations per cycle per core. There are 2 components in the AMX architecture today.
The first component is tiles, which is a new state component consisting of 8 two dimensional registers, each 1 kilobyte in size. The programming model of this architecture is straightforward: configure, load, store, clear. The math operations are carried out by co-processors that operate on tiles. Tile state is OS managed, which required a new extension to the XSAVE architecture. The second component is TMUL, which stands for tile matrix multiplication and is the first co-processor attached to the tiles.
It is a systolic array supporting all flavors of int8 with 32 bit accumulation and bfloat16 with single precision accumulation. I also wanted to give you a broader view of the underlying architecture behind AMX. The architecture is highly flexible and can be expanded to implement further co processors down the line to address different types of computation needs. In the current implementation, we have the TMUL engine as a tightly coupled co processor within the P core host. The host is doing all loop and address management as well as the SIMD processing of the data.
The TMUL engine performs matrix multiplication in parallel to the host. The resulting power and performance is much better than simply running these algorithms on the SIMD hardware. The matrix engine is much more efficient, and the remaining work on the host is generally light, so most of the core power budget is given to the matrix engine and the cache subsystem that feeds it, which is exactly what you want. The typical flow of a layer in the deep learning topology is for the data to come in straight to the tiles.
Meanwhile, the host is running ahead dispatching tile loads while TMUL is operating on the ready data. At the end of the multiplication, the tiles are stored to the cache level nearest to the core, thanks to the tight coupling with the core. Then SIMD code is used to post process the output and store it to the location where the next layer will get it from. There are software techniques to fuse and interleave these operations so that both the host and the AMX unit are busy simultaneously, which provides maximum performance. As I said, in our current implementation, AMX peak compute throughput is 2K int8 operations per cycle per core, which is 8x higher compared with VNNI running on 2 FMAs, at less than 3x the execution power.
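To give a concrete flavor of that configure, load, multiply, store flow, here is a minimal sketch using the AMX tile intrinsics exposed by recent compilers (immintrin.h). The tile shapes, buffer names and single-tile scope are illustrative assumptions rather than Intel sample code; a real program would first request AMX permission from the OS, compile with AMX support enabled, and wrap this in a full tiling loop.

```c
#include <immintrin.h>
#include <stdint.h>

/* Illustrative only: multiply a 16x64 int8 A-tile by a 16x64 int8 B-tile
 * (B laid out in the 4-element-pair format AMX expects), accumulating into
 * a 16x16 int32 C-tile. Assumes the OS has enabled AMX tile state. */
static void amx_int8_matmul_tile(const int8_t *A, const int8_t *B, int32_t *C)
{
    /* 1. Configure: describe tile rows and bytes-per-row in a 64-byte palette. */
    struct { uint8_t palette, start_row, rsvd[14];
             uint16_t colsb[16]; uint8_t rows[16]; } cfg = {0};
    cfg.palette = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tmm0: C, 16 x 16 int32 */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tmm1: A, 16 x 64 int8  */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tmm2: B, 16 x 64 int8  */
    _tile_loadconfig(&cfg);

    /* 2. Load tiles from memory (stride = 64 bytes per row). */
    _tile_zero(0);
    _tile_loadd(1, A, 64);
    _tile_loadd(2, B, 64);

    /* 3. TMUL: int8 dot products with 32-bit accumulation into tmm0. */
    _tile_dpbssd(0, 1, 2);

    /* 4. Store the int32 result tile and release tile state. */
    _tile_stored(0, C, 64);
    _tile_release();
}
```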
Beyond int8, with AMX we can also perform 1K bfloat16 operations per cycle to get higher accuracy multiplies. These are the order of magnitude type increases developers and users are looking for. And that is what we have delivered with our new AMX technology. In conclusion, we are very excited about this new Performance core architecture. This P-core is not only the highest performing CPU core Intel has ever built, but also delivers a step function in CPU architecture performance that will drive the next decade of compute.
It is a wider, deeper and smarter machine that delivers substantial improvement for general purpose compute. It is tailored for the increased needs of large datasets and large code footprint applications, and it also delivers an order of magnitude increase in accelerated performance for AI workloads. The new architecture has enhanced power management capabilities that improve frequencies and optimize power budget utilization. And it also supports core configuration options for scalability across different market segments. Finally, I really want to thank the team of talented architects and engineers at Intel that made these advancements possible. Thank you all for your time.
Back to you, Raja.
Thank you, Adi. With our new Performance and Efficient cores, you've seen the details of one of the biggest advancements in x86 architecture in over a decade. So let's put that in context. We've shown you this roadmap of coves and monts in the past. You may have noticed we have changed our nomenclature since.
Our monts were designed for the best area efficient multithreaded performance. Our coves were designed for maximum single threaded performance. Now, while the Efficient core truly excels in throughput efficiency, it's also getting a boost in single threaded performance. And the Performance core is not only pushing the limits of low latency and single threaded performance, it's also getting a boost in multi threaded performance with additional AI acceleration. But where we want to be is here, where we can combine the best of both, best of both in one system, to get the raw performance of the P-core with the scalability of the E-core.
We need a very high performance hybrid. Talking about hybrid, everyone understands the idea of the hybrid car as using hybrid technology to get the most miles out of a tank of gas. And that's a good analogy for the conventional notion of hybrid computing: getting the most hours out of a battery. But there is another type of automotive hybrid. The fastest racing cars in the world, like in Formula 1, use hybrid technology to achieve maximum performance.
In addition to the conventional turbocharged engines that give them top speed and enough range to make it to the finish line on a tank of fuel, they add electric power to blast them out of the corners with acceleration that cannot be achieved with conventional engines. We need a high performance hybrid. The biggest challenge, the magic, is to bring these two cores together, working efficiently with existing software. This is a huge undertaking, and we have been hard at work on this problem for years. We now have the solution.
To share more, I would like to invite our client architect, Rajshree.
Thanks, Raja, and hello, everyone. I'm really excited to share with you the unique solution we have developed to ensure the 2 new cores you just heard about, the Efficient core and the Performance core, work seamlessly together so we can maximize system performance and efficiency. As we all know, performance expectations can vary drastically for different computing tasks, so one of the most important considerations we had while designing our next generation client CPU SoC was ensuring optimal task scheduling across 2 different core types. The challenges given to the team were: first, how do we go beyond traditional hybrid as we know it? And second, how do we get both core types to work together intelligently to maximize performance? As Raja mentioned, we could have taken the conventional approach of simply assigning threads to cores based on static rules, but that leaves a lot of performance on the table and creates overhead with software development.
Our solution needed to be dynamic and autonomous to the software stack running on top. So we decided to help the OS make more intelligent decisions. It needed to handle a wide range of common client activities, such as gaming, gaming with streaming, content creation and productivity, while dynamically adapting to operating conditions such as temperature and power budget. We also wanted to eliminate the need for software developers to rewrite their existing code and remove the overhead of handling scheduling tasks in software. Only a hardware solution could meet all these requirements.
So we developed Intel Thread Director technology. Thread Director is one of the most significant and exciting innovations in our client roadmap. Thread Director allows us to provide smarter assistance to the OS by monitoring the instruction mix, the current state of each core and relevant microarchitecture telemetry at a granular level. This also allows the OS to utilize information it didn't previously have any visibility into when making its scheduling decisions. By implementing Intel Thread Director in hardware, we are able to take advantage of our performance monitoring unit, which provides the best hardware telemetry in the industry. Access to this information allows Intel Thread Director to assist the OS in optimal runtime scheduling.
Traditionally, the OS would have made decisions based on limited information that was available, such as foreground versus background. Thread Director adds a new dimension to the hardware telemetry, so threads with higher performance requirements are assigned to the most performant cores. Let's walk through a scheduling example on a real scenario. Let's say a user starts a performance critical task such as a game or content creation software.
Those threads will first be assigned to our Performance cores. Now, if a background task such as email sync or network drive backup starts, those lower priority, less demanding tasks will go to our Efficient cores. Next, let us assume a case where all the Performance cores are busy, but a thread needing even higher performance becomes ready, such as an AI thread using CPU AI instructions. In this situation, Thread Director provides what we refer to as a hint to the OS, indicating there's a higher performance thread needing attention. Thread Director also identifies a candidate thread that could be moved from Performance cores to Efficient cores based on relative performance ordering, making room for that AI thread. This is where the dynamic nature of our innovation shines. Nothing is static based on any software.
Everything is dynamic based on the current context of whatever is running on the system, all augmented by hardware telemetry. And last, if a thread running on a Performance core enters a spinning stage, waiting for work to show up, then Thread Director reports this situation back to the OS. This thread will be moved over to an E-core, thereby making room for a more demanding thread to be allocated to our Performance cores. That was an animated explanation of the technology. Let's see it in action on Alder Lake running Windows 11, incorporating Thread Director feedback in real time.
In typical cases, we see a combination of scalar and vector instruction mixes from important and background tasks. We are going to see how Thread Director helps the OS with placement of these threads. In the first example, we have a representation of a typical media content creation usage we see with real world software. The green threads that you are seeing here are mostly scalar instructions. The dark blue threads that you see are vector instructions, and the light blue are background tasks. The vector instructions for the most part get prioritized on Performance cores as they require more performance, while some of the green threads and the light blue threads go to Efficient cores. Let's look at one more example, this time with office productivity and a background application. In office productivity and video conferencing, CPU AI instruction usage is increasing consistently. Placement of these AI threads is going to be key to maximize performance.
As we have seen with the previous example, we have green threads which are mostly scalar. Then we have the orange AI threads added to the mix, where you can see Thread Director prioritizes them to Performance cores. As seen before, the light blue tasks are running on Efficient cores. All these threads in this mix go through various phases within their own execution. You might have noticed there are phases once in a while where the dark blue vector threads or the orange AI threads go to E-cores.
These are the phases where they have some scalar instructions in them. This is where the dynamic nature of this technology, prioritizing the right thread to the right core based on current execution context, comes in. Lastly, I do want to show something else: a fully multi threaded synthetic workload running the same instruction mix. Here, all threads use all available cores. To enable this level of fine grain coordination for real performance, Intel worked jointly with Microsoft to incorporate this revolutionary capability into the upcoming Windows 11 release. Speaking of which, I would like to invite Mehmet Iyigun from Microsoft to share more details.
Hello. Throughout the Windows 11 development cycle, my team has been working with our colleagues at Intel to enlighten and optimize our upcoming OS to take full advantage of the performance hybrid architecture and Thread Director in particular. Much of this work centers around the OS thread scheduler, the kernel component that decides which threads to run and where to run them. These decisions have a huge impact on user perceived performance and power consumption, especially on devices built on hybrid processor architectures. To make its decisions, the scheduler considers attributes such as thread priority, the owning application and whether the application is foreground or background.
For example, threads belonging to foreground applications should be scheduled to high performance cores. However, up until now, the scheduler had no visibility into the workload running on a thread, whether it's copying memory, spinning in a loop or performing complex calculations. As such, when demand for high performance cores exceeded supply, it made suboptimal decisions because it couldn't identify the workloads that would benefit most from the performance cores. Thread Director helps close this gap. With Thread Director feedback, the Windows 11 thread scheduler is much smarter about dynamically picking the most appropriate core based on the workload to achieve the best power and performance.
Even when all P-cores are busy, it can preempt a thread running on a P-core to swap it with a thread running on an E-core if the latter can benefit more from the P-core. The scheduler, of course, does this without violating any of its priority based fairness guarantees. Beyond thread scheduling, Windows 11 also uses Thread Director hints when deciding which cores to park and unpark for power savings. For example, if all outstanding work targets E-cores, the system can refrain from unparking a P-core. In addition to automatic workload classification provided by Thread Director, Windows 11 also extends the power throttling API, which allows developers to explicitly specify quality of service attributes for their threads.
The new EcoQoS classification informs the scheduler that the thread prefers power efficiency over performance. Such threads get scheduled on E-cores, reducing power consumption and leaving the P-cores available for performance critical threads. The Edge browser as well as various Windows 11 components now take advantage of the EcoQoS API to boost energy efficiency. This was a short summary of the improvements we put into Windows 11 in collaboration with multiple teams at Microsoft and Intel. There are many more optimizations I did not talk about and many more in planning stages.
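As a concrete illustration of that EcoQoS opt-in, here is a minimal sketch of how a background thread might mark itself through the documented Win32 power throttling API; the function name and usage context are illustrative assumptions rather than Microsoft sample code.

```c
#include <windows.h>

/* Illustrative sketch: mark the current (background) thread as preferring
 * power efficiency over performance, so the scheduler may place it on an
 * E-core. Uses the documented THREAD_POWER_THROTTLING_STATE mechanism. */
static BOOL prefer_power_efficiency(void)
{
    THREAD_POWER_THROTTLING_STATE state = {0};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = THREAD_POWER_THROTTLING_EXECUTION_SPEED; /* opt into EcoQoS */

    return SetThreadInformation(GetCurrentThread(),
                                ThreadPowerThrottling,
                                &state, sizeof(state));
}
```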
What is clear is that we're at an inflection point in heterogeneous computing, and we're going to continue seeing tighter integration and information exchange between hardware and the OS to unleash further performance and battery life improvements. I'm looking forward to continuing our collaboration with Intel. Thank you.
We are very excited about the performance benefits of this technology and the potential it holds for future innovation. With that, I would like to hand it over to my colleague Arik Gihon to give you more details about Alder Lake SoC. Thank you.
Hi everyone, I'm very excited to introduce Alder Lake today, Intel's new client architecture on the Intel 7 process that scales from the highest performance enthusiast desktop to the thinnest, most responsive Evo laptops. Alder Lake is Intel's first performance hybrid core design, introducing 2 core types brought together seamlessly through Intel Thread Director technology. Alder Lake supports the latest state of the art industry standards in memory, IO and connectivity for a no compromise PC experience. Let's get to know Alder Lake.
Alder Lake was built for performance. We started with the desktop architecture and scaled all the way down to ultra mobile. One of our most important goals when designing Alder Lake was to support all client segments through a single highly scalable SoC architecture with 3 design points. First, a maximum performance 2 chip socketed desktop with leadership performance, power efficiency, memory and IO. Second, a high performance mobile BGA package that adds imaging, larger Xe graphics and Thunderbolt 4 connectivity. And finally, a thin, low power, high density package with optimized IO and power delivery.
Supporting this huge range of power and performance is an incredible silicon design challenge. Designing Alder Lake was an amazing technical journey for me, and I would like to share with you how we achieved it. The Alder Lake architecture is built from a selection of world class IPs for die level configurability. We enabled multiple combinations of Performance cores and Efficient cores, a large number of IOs and accelerators, various levels of Xe graphics and 4 different memory types. Both P-cores and E-cores are built as interchangeable slices that include a portion of the last level cache, allowing us to build multiple die topologies spanning Alder Lake's huge design range.
The accelerators and IOs are connected through a hierarchical structure configured to support the required bus width, queue sizes, number of ports and memory access feature set. Alder Lake leverages the Xe LP graphics from Tiger Lake, ported to Intel 7. It supports 1080p gameplay and a 12 bit end to end video pipeline. Finally, the memory subsystem is designed in tiles to cope efficiently with the large required bandwidth range, memory encryption, integrity and multiple memory types. Let's take a deeper look at each of the key technologies. Everything starts with the new performance hybrid architecture.
We use up to 8 high single thread Performance cores and up to 8 Efficient cores, both supporting a high dynamic frequency range and per-core power states. The E-cores are clustered with a shared L2 cache and deliver scalable multi core performance and efficient offload of background tasks. The last level cache has a shared structure between the cores and graphics that provides higher effective cache size on a lightly threaded load. For controlling the operation of this massive compute, we have developed an autonomous power management scheme. This controls the various cores to optimize performance per watt according to software and platform preferences.
As we saw in the demo, all of this is orchestrated by Thread Director for proper workload scheduling. We developed a sophisticated hardware runtime mechanism that identifies the class of each workload. The class, together with the energy and performance core scoring mechanism, guides the OS to schedule threads on the right core for performance or efficiency, per demand. Let me show you the memory capabilities. Alder Lake delivers a broad set of DDR technologies with Intel's unique PHY, supporting DDR4 and DDR5 as well as LPDDR4 and LPDDR5 in a single chip, leading the industry through this major memory transition.
Alder Lake supports high frequency DDR speeds and can alter the speed based on runtime bandwidth requirements while tracking the workload behavior. This allows high speed, high power or low speed, low power operation based on real time heuristics. If you think this is cool, we also upgraded our PCIe capabilities. Alder Lake is leading the transition to PCIe Gen 5, with up to twice the bandwidth of Gen 4. It supports up to 16 lanes, and at 32 gigatransfers per second per lane that works out to roughly 64 gigabytes per second in each direction, ready for the next generation of SSDs and discrete graphics.
So let's see how all of this works together. The challenge of building such a highly scalable architecture is that we need to meet the incredible bandwidth demands of the compute and IO agents without compromising on power. To solve this challenge, we have designed 3 independent fabrics, each with real time demand based heuristics. While the workload is running, the power management unit collects telemetry from the fabric sources, tracks the traffic and selects the most efficient work point. The compute fabric can support up to 1,000 gigabytes per second, which is 100 gigabytes per second per core or per cluster.
It connects the cores and graphics through the last level cache to the memory. It has a high dynamic frequency range, and it's capable of dynamically selecting the data path for latency versus bandwidth optimization based on actual fabric loads. It also dynamically adjusts the last level cache policy, inclusive or non inclusive, based on utilization. The IO fabric supports up to 64 gigabytes per second and connects the different types of IOs as well as internal devices. It can change speed seamlessly without interfering with devices' normal operation, selecting the fabric speed to match the required amount of data transfer.
And last, the memory fabric can deliver up to 204 gigabytes of data per second. It dynamically scales its bus width and speed to support multiple operating points for high bandwidth, low latency or low power. This real time scaling enables Alder Lake to dynamically shift the power budget to where it matters the most. Finally, I'm working with the team on Alder Lake in the lab now, and I can't wait to have one at home. Thank you and back to you, Raja.
Thanks, Arik. We look forward to seeing Alder Lake in customers' hands later this year. Let's change gears from hybrid computing to visual computing. I'm sure you all want to hear a little bit more about our discrete Xe GPU architectures. If you step back and look at what we have been doing recently, you'll see that we have made tremendous progress on integrated GPU hardware and software in less than 3 years.
We have effectively doubled performance year over year, 2 years in a row now, first from Gen9 to Gen11 and then from Gen11 to Xe LP. This was an incredible start, and this effort removed friction for millions of users. We raised the performance bar for ultrathin mobile devices. But as we have already announced, we have bigger plans. Integrated graphics is constrained, but discrete, on the other hand, is unconstrained, well, relatively.
This is a great time to be entering the high performance graphics segment. Why are we doing this? Why now? These are extremely exciting times. Game engines and software teams are producing near real life in-game visuals in real time. And leading edge hardware is enabling everything to come together to deliver incredible experiences that allow gamers to feel more immersed than ever before.
Publishers, developers, gamers and creators are constantly pushing the boundaries of what is possible, always asking more from their hardware, and they are also looking for more innovation and choice. The PC ecosystem is built around innovation and choice. Many different OEMs, many different shapes, sizes and form factors, many different operating systems and a diversity of software. The vast choice of what to buy, or the flexibility to build your own PC, to customize, upgrade and to modify. These are some of the reasons the PC market has been so popular and exciting since the beginning.
These are the reasons the PC ecosystem is great. 1.5 billion people are PC gamers. Our focus today is to deliver a better experience for gamers and creators, to give them innovation and choice in hardware coupled with open and accessible software and tools, to create a great experience across form factors, segments and users. To remove friction and deliver high performance graphics experiences to everyone, we recently unveiled Arc, which is the brand for our visual computing products.
The word arc is used to describe the narrative flow and the various plot inflections of a story. Every gamer, game and creator has a story, and every story has an arc. The Arc brand represents the next chapter of our story and our commitment to removing friction for gamers. Building great GPU hardware is necessary, but nowhere near sufficient. Great software plays a critical role in the user experience.
To discuss the progress we are making on software and user experiences, please welcome the leader of our GPU software team, Lisa Pearce.
Thanks, Raja. With our first high performance gaming GPU, it goes without saying that performance and quality are job one. First, at the heart of our focus is the design of the core driver itself, which covers integrated and discrete graphics products in one unified code base. We've completed a rearchitecture of the core driver components, specifically our memory manager and compiler. As a result, this year we've improved the throughput of CPU bound titles by up to 18% and improved game load times by up to 25%.
This load time reduction was accomplished through enhancements to our staged compilation technology, such as eliminating redundant shader compilation and improving task scheduling for our compiler threads. We also completed a major refactor and 30 large optimizations affecting over 100 gaming workloads, which rolled out to our existing install base this year with our unified graphics driver. Now, we are always excited about new APIs and engines, since new games are always pushing up the visual quality bar. For the last 3 years, we've been co-engineering new features for DX12 Ultimate with Microsoft. We're excited that at launch, we will support hardware based ray tracing, mesh shading and sampler feedback.
Together, these technologies deliver next gen visuals in games like Hitman 3, Chivalry 2 and many others. We've also been working closely with Epic, and I'm excited to tell you that Unreal Engine 5 runs on our discrete GPUs today. We can't wait to see what game developers will do next in their next generation engines. At launch, we'll also enable updates to our user controls to help gamers take advantage of these technologies, including support for AI assisted virtual cameras, game highlights and, of course, capture for streaming that will make use of our high performance and quality hardware encoders. And we will also integrate all of our overclocking and performance interfaces directly into the app.
And finally, in the end, it is all about experiences. One thing that gets us really excited is enabling a range of experiences across product segments from integrated graphics to high performance discrete. And today, I'm excited to share a new feature that will do just that. But first, some background. Given a fixed amount of performance potential in a GPU, gamers are forced to make a choice between high quality and acceptable performance.
There are cases where you have content that already runs close to 60 FPS at 4K, and the frame rate could be further increased by upscaling. Or you can have more recent content, like ray tracing, that needs a performance boost even to achieve playable frame rates. Over the years, various technologies have been developed to reconstruct a high res image from fewer pixels. These technologies use novel algorithms to reconstruct details from neighboring pixels in space or time, but they're often accompanied by issues like blurring or ghosting. And these technologies can often fall short with high quality rendering, like ray traced reflections and shadows, detailed geometry or high res textures.
Additionally, there is computational overhead in doing these operations. Ultimately, we want to target this region of high performance and high quality. Our solution to this problem is XeSS. XeSS is an easy to integrate API, and it fits within today's game engine flow. It uses deep learning to synthesize images that are very close to the quality of native high res rendering.
It works by reconstructing subpixel details from neighboring pixels, as well as motion compensated previous frames. This reconstruction is performed by a neural network trained to deliver high performance and great quality. Now, let's see XeSS in action. This is a demo prepared by Renz using Unreal Engine. We can see the demo rendering in real time in 4K.
But in reality, the engine is rendering at 1080p; upscaling from 1080p to 4K with XeSS gives the same quality image as rendering in native 4K. In this scene, to the right, you can see the actual content rendered by the engine. The left side shows how this 1080p image is upscaled by XeSS to achieve the final high quality result. Rendering to a smaller 1080p render target, a quarter of the pixels of 4K, allows us to significantly reduce the rendering time and achieve higher frame rates. The cost of the upscaling operation remains relatively small compared to the overall render time.
Thanks to the use of AI assisted scaling, we can also achieve up to a 2x performance boost. And now, games that would only be playable at low quality settings can run smoothly at 4K. It may be a challenge to see these details in the live stream, so we've added a high resolution video in the demo area that can be viewed on demand. Our goal with XeSS was to deliver neural network based super sampling for a wide range of GPU hardware across the industry.
The demo you just saw was running on our recently announced Arc Alchemist SoC, leveraging our new XMX hardware acceleration that we'll discuss further in a few minutes. In addition, we also came up with another innovation to enable XeSS on a broad set of hardware, including our competition, with a smart quality and performance trade off. We accomplished this by using the DP4a instruction, which is available on a wide range of shipping hardware, including integrated graphics. This brings the benefits of XeSS' neural super sampling approach to millions of gamers. I'm thrilled that everyone can experience XeSS.
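For readers unfamiliar with DP4a, here is a small scalar model of what a single DP4a operation computes; this shows only the semantics, for illustration, and is not the actual GPU instruction encoding or any XeSS code.

```c
#include <stdint.h>

/* Scalar model of DP4a semantics: a 4-element int8 dot product accumulated
 * into a 32-bit integer. GPUs that expose DP4a perform this whole operation
 * in one instruction, which is what lets a quantized int8 neural network
 * such as an XeSS-style upscaler run efficiently even on hardware without
 * dedicated matrix engines. */
static int32_t dp4a(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```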
We're excited to have several early game developers engaged on XeSS. The SDK for the initial XMX version will be available for ISVs this month, and the DP4a version will be available later this year. At Intel, we believe in open source standards. XeSS APIs will mature as we gain broad support for games and hardware, and we will open up the tools and SDKs for everyone. To learn more about XMX and our Xe HPG GPU microarchitecture, let me bring in our GPU Chief Architect, David Blythe.
Thanks, Lisa. Hi, everyone. It's time for our new GPU microarchitecture to be introduced to the world. This microarchitecture, which we call Xe HPG for high performance gaming, is the convergence of our Xe LP, HP and HPC microarchitectures. Xe HPG is engineered to deliver great scalability and compute efficiency with advanced graphics features.
Today, I'll give you a brief introduction to the new compute, graphics and scalability capabilities of Xe HPG. To deliver scalability beyond Xe LP in Iris Xe MAX and build enthusiast class hardware, we've had to rework the fundamental architecture of our GPU. Starting with the heart of the engine, we defined a new compute building block, which serves as a foundation for the Xe architecture. As part of this architecture change, we're also taking the opportunity to update some of the naming. So you won't hear us talking about execution units or EUs much anymore.
The execution units are getting too large to reason about, and the generational changes make it difficult to do comparisons. So allow me to introduce the Xe core. Xe cores include efficient arithmetic units, caches and load store logic. The arithmetic units include engines for traditional floating point and integer vector operations, along with engines to accelerate the convolution and matrix operations commonly found in AI workloads. The architecture gives us a core ISA with the flexibility to adapt the Xe cores for specific workloads and market segments.
I don't have time to tell you everything about the Xe cores we've built for our Xe HPG microarchitecture today, but I can share some details. For Xe HPG, Xe cores include 16 vector engines and 16 matrix engines, which we refer to as XMX, or Xe Matrix Extensions. And you've already seen XMX in action with the XeSS demo presented by Lisa. With more and more workloads infused with AI, XMX is a key engine for delivering more efficient compute. The Xe HPG microarchitecture is designed to be gaming first and to build much larger GPUs compared to the maximum of 6 Xe cores in Xe LP.
We scale the hardware required for real time rendering in a larger building block we call the render slice. The render slice contains 4 Xe Cores and the rendering fixed function for real time 3D graphics. The rendering fixed function includes geometry, rasterization, samplers and pixel backends, designed for DirectX 12 Ultimate with support for variable rate shading tier 2, mesh shading and sampler feedback. Each slice also includes 4 new ray tracing units architected to accelerate ray traversal, bounding box intersections and triangle intersections. They provide full support for DXR, or DirectX Ray Tracing, as well as Vulkan ray tracing.
Thanks, David. With Xe Cores and their advanced feature set, we now have next generation real time graphics. But we also wanted to scale this to enthusiast class performance. Can you walk us through how we achieved that?
Of course, Raja. To scale our performance to enthusiast class GPUs, we worked on 2 fronts. First, we replicate these slices, and then we connect them to a shared L2 cache through a high bandwidth memory fabric. We have the flexibility to scale different configurations up to 8 slices. To hit our performance goals, not only did we create an architecture for building larger GPUs and add additional features, but we also challenged ourselves to increase the power efficiency and the operating frequency of the design.
Working across architecture and engineering, we performed detailed analysis of power reduction opportunities, which resulted in new methodologies and optimizations at every level of the design from the Xe Core on up. The changes were spread across microarchitecture, logic design and physical design, and often needed complementary changes. It was truly a team effort, and I'm delighted to say that compared to the Xe LP IP in our Iris Xe MAX product, we increased the relative operating frequency and the performance per watt each by roughly 1.5 times. I should also point out that the team effort included close cooperation with process technology, and that too contributed to the great results. I'll hand things back to Raja, and he can elaborate on that a little bit more.
Thanks, David. Last year, I told you that Xe HPG GPUs would be built with an external foundry partner. Today, I'm happy to unveil that our partner is TSMC, and Alchemist GPUs are built on the N6 process. This is the wafer of Alchemist. This is a great example of our IDM 2.0 partnerships.
We have the flexibility to make the right process choice for each architecture. The progress I witness in our labs week by week with Alchemist GPUs and their software is very encouraging. Alchemist GPUs are now sampling to ISVs and partners. I can't wait to get the 1st generation of Arc products in your hands by Q1 of 2022. While our post silicon and software teams are working very hard to get this to you, our design and architecture teams are busy creating the next few generations of our gaming GPUs.
Here, I have the code names: Battlemage with Xe2, Celestial with Xe3 and Druid after that. Can't wait to share more at a future event, hopefully live. Welcome back. In the first half of today's event, we shared details about Intel's new X86 compute cores. We also saw Intel's Alder Lake, which reinvents multi core architecture.
In graphics, we showcased our new XeSS technology and upcoming Alchemist GPU. Now let's shift gears to the data center, where we have even more exciting announcements. You'll hear about Mount Evans, our new infrastructure processing unit, and then Ponte Vecchio, our GPU for exascale. Let's start with Sapphire Rapids. The technology building blocks for this architecture have been years in the making.
This includes a performance X86 core built for the data center, new accelerator cores, new memory architecture, new fabric architecture, new IO architecture and a host of new software and security features. This is a big deal for Intel and a big deal for the entire data center ecosystem. To tell you more, let's bring on Sailesh, our Chief Architect for Data Center. Hey, Raja.
Take us away.
Thanks, Raja. Hello, everybody. It's great to be here today. I'm excited to introduce Sapphire Rapids because I believe Sapphire Rapids will establish a new standard in data center architecture. Sapphire Rapids is our next generation Xeon Scalable processor.
It delivers great out of the box performance with enhanced capabilities for the breadth of workloads and deployment
models in the data center.
Sapphire Rapids delivers a step function in performance across a broad set of scalar and parallel workloads. More importantly, it is fundamentally architected for breakthrough performance in elastic computing models like containerized microservices, and for the rapidly expanding use of AI in all forms of data centric compute. Sapphire Rapids also advances the state of the art in memory and IO technologies. Our overall architecture philosophy for Xeon is to deliver the best infrastructure in the data center.
Xeon spans a wide range from monolithic server node deployments to data center scale elastic solutions. It delivers consistent performance across compute, storage and network usages. Xeon architecture is optimized to deliver great node level performance as well as data center level performance. Sapphire Rapids delivers big improvements at both levels. The new performance core in Sapphire Rapids brings significant scalar performance improvements.
In addition, the multiple integrated accelerator engines and increased core counts provide a massive increase in data parallel performance. Furthermore, these performance cores are paired with the right levels of cache and industry leading system capabilities of DDR5 and PCIe Gen 5 to provide an optimal balance across compute, memory and IO. Finally, all of these are integrated through a modular SoC architecture that provides consistent and efficient performance scaling across the socket, the node and the data center. At data center scale, it is critical to deliver great performance and utilization under multi tenant usages, low jitter performance to meet tight SLAs or service level agreements, as well as elasticity across the entire infrastructure. In contrast, the industry standard benchmarks focus on node level compute throughput and do not reflect the reality of data center scale usages.
We have drawn deep insights from multiple generations of Xeon products deployed at cloud scale to inform Sapphire Rapids' architecture. As a result, we deliver big advances in each of these areas with Sapphire Rapids. For example, it offers several virtualization and telemetry capabilities to improve multi tenant usages. We expand the QoS capabilities and architecture enhancements to reduce jitter for performance consistency under high utilization. In addition, we are introducing several microarchitecture and architecture capabilities to improve performance across a broad set of workloads to deliver better data center elasticity.
Data center deployment models exhibit significant overheads. Sapphire Rapids fundamentally changes the paradigm of handling these overheads through integrated acceleration engines. These accelerators not only speed up the overhead processing multifold, but also significantly offload the cores, enabling them to deliver more application workload performance. As I said, this will be the new standard of data center architecture. Ladies and gentlemen, this is Sapphire Rapids. I would like to call on Chief Engineer Nevine Nassif to introduce the breakthrough SoC architecture that is Sapphire Rapids.
Thank you, Sailesh. At the heart of Sapphire Rapids is a new modular tile architecture that allows us to scale the balanced Xeon architecture beyond the limits of the physical reticle. Sapphire Rapids is the 1st Xeon product built using EMIB, our latest 55 micron bump pitch silicon bridge technology. This innovative new technology enables independent tiles to be integrated in the package to realize a single logical processor. The resulting performance, power and density are comparable to an equivalent monolithic die.
We're now able to increase core counts, cache, memory and IO free from the physical constraints that would otherwise have been imposed on the architecture and would have led to difficult compromises. This base SoC architecture is critical for providing balanced scaling and consistent performance across all workloads. This is key for data center scale elasticity and achieving optimal data center utilization. With this architecture, we're now able to provide software with a single balanced unified memory access with every thread having full access to all resources on all tiles including cache, memory and IO. The result is consistent low latency and high cross sectional bandwidth across the entire SoC.
This is one of the critical ways we achieve low jitter in Sapphire Rapids. While Sapphire Rapids delivers out of the box scalability for existing software and ecosystems, users can enable clustering at sub-NUMA and UMA levels for additional performance and latency improvements. Sapphire Rapids sets a new standard for data center architecture with the seamless integration of cores and acceleration engines providing a heterogeneous compute infrastructure. It delivers the highest levels of compute performance through a combination of a high performance core, increased core counts, increased AI performance and the industry's broadest range of data center relevant accelerators. And Sapphire Rapids delivers leadership IO capabilities through CXL 1.1, PCIe Gen 5 and UPI 2.0 technologies.
All these are provisioned with Intel's highest bandwidth and lowest latency memory solutions through industry leading DDR5, Optane and HBM memory technologies. Now back to Sailesh.
Thanks, Nevine. Now let's get into the details of the 3 major pillars that Nevine outlined, starting with the data center performance core. As mentioned earlier, optimizing exclusively for standard benchmarks would have been the easy path, but that does not reflect the full picture of real data center usages. We use the insights from generations of Xeon large scale deployments to inform our microarchitecture choices for the performance core. Just to provide a flavor of this, data center workloads exhibit large code footprints and are fundamentally bottlenecked by the front end performance of the core.
We fundamentally redesigned the front end to address these bottlenecks in the performance core. Consistent performance under multi tenant usages is critical. The core delivers several improvements like fast VM migration, enhanced cache and new TLB QoS capabilities for multi tenant usages. We introduced autonomous and fine grain power management to improve core performance without jitter. In addition, we added several new architecture capabilities in the core, including instructions and capabilities relevant for the data center.
I want to provide a few examples of new ISA capabilities here. As Adi mentioned, we integrated AMX capabilities to accelerate tensor operations for AI workloads. We are also introducing the accelerator interfacing architecture instruction set, AIA, which supports efficient dispatch, synchronization and signaling to accelerators and devices from user mode, as opposed to high overhead kernel mode. To address the growing demands of signal processing, we introduced half-precision floating point to AVX. Another example is the CLDEMOTE instruction, which helps with the optimal movement of data across the cache hierarchy to improve shared data usage models.
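As a hedged sketch of how software might use CLDEMOTE (the `_mm_cldemote` intrinsic is real; the producer/consumer scenario below is illustrative, not an Intel-recommended recipe):

```cpp
#include <immintrin.h>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Minimal sketch of a producer/consumer hand-off using CLDEMOTE.
// After filling a buffer, the producer demotes the cache lines it just
// wrote toward a more distant, shared cache level so a consumer core can
// read them without snooping the producer's private caches. Requires a
// compiler flag such as -mcldemote; on CPUs without CLDEMOTE the encoding
// is treated as a NOP, so this is purely a performance hint.
constexpr size_t kLine = 64;

void produce(uint8_t* buf, size_t bytes, std::atomic<bool>& ready) {
    for (size_t i = 0; i < bytes; ++i)
        buf[i] = static_cast<uint8_t>(i);

    // Demote each freshly written cache line out of this core's caches.
    for (size_t off = 0; off < bytes; off += kLine)
        _mm_cldemote(buf + off);

    ready.store(true, std::memory_order_release);  // publish to the consumer
}

int main() {
    alignas(64) static uint8_t buf[4 * kLine];
    std::atomic<bool> ready{false};
    produce(buf, sizeof(buf), ready);
    return ready.load() ? 0 : 1;
}
```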
Another major area of focus for Sapphire Rapids compute capability was to improve performance significantly for the common functions and overheads seen in at-scale data center deployments. I would like to invite Arijit, the lead architect on Sapphire Rapids, to tell us more.
Thank you, Sailesh. One of my key focus areas on Sapphire Rapids was to explore breakthrough improvements for the high levels of common mode tasks causing overhead that we see in data center scale deployment models. Instead of traditional approaches, we embarked on a new direction using optimized acceleration engines. We found these engines to vastly improve processing of these overhead tasks and enable greater utilization of the performance cores for higher user workload performance. We addressed the key challenge of seamlessly integrating acceleration engines with performance cores on Sapphire Rapids through a set of novel technologies, such as AIA and advanced virtualization, that enable us to avoid the kernel mode overheads and complex memory management typically associated with such schemes.
Sapphire Rapids supports several critical acceleration engines for processing the most common overheads. I'm excited to introduce a couple of them today. Data center usage models involve significant data movement overhead as part of workload processing. Examples include packet processing, data reductions and fast checkpointing for virtual machine migration. To address this, Sapphire Rapids introduces the Data Streaming Accelerator, or DSA, which serves the cores as well as IO attached devices.
In this graph, we show an Open vSwitch use case in which, with up to 4 instances of DSA, we see a nearly 40% reduction in CPU utilization and a 2.5x improvement in data movement performance. This results in nearly doubling the effective core performance for this workload. Intel QuickAssist Technology is not new to Intel products. Sapphire Rapids provides seamless integration of the next generation QAT engine, greatly increasing its performance and usability. All data in the data center is cryptographically protected during storage, transmission and use.
Furthermore, the ever growing data footprint is increasingly maintained in a compressed format. Our next generation QAT acceleration engine supports the most popular crypto, hash and compression algorithms and can chain these together. Performing these functions using QAT is significantly faster than on the performance core and reduces the number of cores needed for those same functions. Sapphire Rapids QAT achieves up to 400 gigabits per second of crypto and simultaneous compression and decompression at up to 160 gigabits per second each. In this example, with the Zlib L9 compression algorithm, we see a 50x drop in CPU utilization while also speeding up the compression by 22x.
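For a sense of what the QAT engine is offloading, here is a minimal software-only baseline using stock zlib at level 9, the same Zlib L9 setting cited above; the QAT APIs themselves are not shown:

```cpp
#include <zlib.h>
#include <cstdio>
#include <vector>

// Software-only baseline: compress a buffer with zlib at level 9, the same
// setting used in the comparison above. On Sapphire Rapids this work could
// instead be chained through the QAT engine, freeing the cores entirely.
// (Illustrative sketch only; link with -lz.)
int main() {
    std::vector<unsigned char> src(1 << 20, 'A');      // 1 MiB of dummy data
    uLongf dst_len = compressBound(src.size());
    std::vector<unsigned char> dst(dst_len);

    int rc = compress2(dst.data(), &dst_len, src.data(), src.size(),
                       /*level=*/9);                    // Zlib L9
    if (rc != Z_OK) {
        std::fprintf(stderr, "compress2 failed: %d\n", rc);
        return 1;
    }
    std::printf("compressed %zu bytes to %lu bytes\n", src.size(), dst_len);
    return 0;
}
```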
Without QAT, this level of performance would require upwards of 1,000 performance cores to achieve. Thank you. Back to you, Sailesh.
Thanks, Arijit. With growing compute capabilities, a balanced architecture needs to deliver commensurate improvements in IO. Sapphire Rapids delivers breakthrough advancements with its IO interfaces. We introduce the industry standard Compute Express Link technology, CXL, for memory expansion and accelerator usages in the data center. To cater to the growing IO speeds and feeds, we introduce support for PCIe Gen 5, while also enhancing the QoS and DDIO capabilities that go with it.
Sapphire Rapids delivers optimal multi socket performance scaling through advancements to our UPI technology that brings more links at wider width and higher speeds compared to our prior generations. For the data center processor to deliver across all workloads, the compute and IO capabilities need to be augmented with the right balance of cache and memory architecture to deliver sustained bandwidth at low latencies. Sapphire Rapids supports a large shared cache that allows dynamic sharing across the entire socket. We are almost doubling the shared capacity over prior generations and enhancing the critical QoS capabilities to further improve effectiveness. With industry leading DDR5 memory technologies, we are delivering the next big step function in bandwidth, while simultaneously improving power efficiency.
In addition, Sapphire Rapids delivers multifold performance improvements and QoS capabilities with our next generation Intel Optane memory. And we are not done with memory just yet. In addition to support for DDR5 and Optane memory technologies, Sapphire Rapids also offers a product version that integrates HBM technology in package for high performance in dense parallel computing that is prevalent with HPC, AI, machine learning and in-memory data analytics workloads. Typically, CPUs are optimized for capacity, while accelerators and GPUs are optimized for bandwidth. However, with the exponentially growing model sizes, we see constant demand for both capacity and bandwidth without trade offs.
I'm happy to say that Sapphire Rapids does just that by supporting both natively. We further enhance this with support for memory tiering that includes software visible HBM plus DDR and software transparent caching between HBM and DDR. AI usages will become ubiquitous in the data center due to their success relative to traditional methods. In order to deliver data center scale elasticity, great AI performance is required across all tiers of compute. So this was one of the major focus areas for Sapphire Rapids.
We introduced AMX capabilities that provide a massive speedup to the tensor processing that is at the heart of deep learning algorithms. We can perform 2K int8 operations and 1K bfloat16 operations per cycle. This represents a tremendous increase in compute capabilities that are seamlessly accessible through industry standard frameworks and runtimes. We augment this with strong general purpose capabilities, large caches, and high memory bandwidth and capacity to deliver breakthrough performance improvements for CPU based training and inference. Let's take a look at AMX in action in our validation labs.
Thanks, Sailesh.
AMX was designed to enhance performance for a variety of deep learning workloads, for both inference and training. As you mentioned, we can do more matrix multiplies per clock cycle, so we can process data faster. Here in our lab, we have a Sapphire Rapids server running an internally optimized general matrix multiply, or GEMM, kernel. On the left hand side, we are running without AMX, and on the right hand side, we are running with the AMX extensions. With AMX's ability to do more matrix multiplies per clock cycle, you can see that we are executing the GEMM kernel approximately 7.8x faster with Advanced Matrix Extensions.
While this demo highlights a highly efficient GEMM kernel to show the architectural capabilities of this platform, we expect substantial performance gains across AI workloads for both training and inference.
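To put the earlier 2K int8 / 1K bfloat16 ops-per-cycle figure in perspective, here's a back-of-envelope calculation; the core count and clock used below are illustrative assumptions, not Sapphire Rapids specifications:

```cpp
#include <cstdio>

// Back-of-envelope throughput from the per-core AMX figures quoted above:
// 2K int8 ops/cycle and 1K bfloat16 ops/cycle. The core count and clock
// below are hypothetical placeholders, not product specifications.
int main() {
    const double int8_ops_per_cycle = 2048.0;  // "2K" int8 ops per cycle per core
    const double bf16_ops_per_cycle = 1024.0;  // "1K" bfloat16 ops per cycle per core
    const double cores = 48.0;                 // assumed core count (hypothetical)
    const double ghz   = 2.0;                  // assumed sustained clock (hypothetical)

    printf("int8 : %.1f TOPS\n",   int8_ops_per_cycle * cores * ghz * 1e9 / 1e12);
    printf("bf16 : %.1f TFLOPS\n", bf16_ops_per_cycle * cores * ghz * 1e9 / 1e12);
    return 0;
}
```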
We expect the vast majority of new scalable services will be built using elastic compute models like containerized microservices. This trajectory was clear when we started architecting Sapphire Rapids. To address this, we focused on capabilities and architecture choices that improve this computing model for throughput under tight SLAs with low infrastructure overheads. We made architecture enhancements across the product, spanning the core, the accelerators and the SoC capabilities, to really deliver on this. For example, the AIA capabilities we talked about fundamentally reduce microservices startup overhead, and a number of capabilities like QAT and DSA help with reducing the network stack overhead of the microservices service mesh.
We have been using multiple proxy workloads to develop these capabilities and to optimize the open source software stack to benefit from them. This chart shows the speedup we are modeling in our architecture models, along with some early silicon measurements on DeathStarBench and other example proxies, normalized at the per-core level. And as you can see, we are seeing some great improvements in performance with the microservices computing model. In summary, Sapphire Rapids provides a big leap in performance and capabilities to establish a new standard in data center architecture. At the root of Sapphire Rapids is a modular tiled SoC architecture, thanks to the EMIB technology, that enables significant scalability while maintaining a monolithic view.
It delivers substantial performance across scalar usages and massive performance in emerging parallel workloads like AI. It delivers great improvements for monolithic workload deployment models, while being specifically optimized for elastic compute models like microservices. It brings industry leading memory and IO technologies to feed the massive computing capabilities in a balanced way. As one would expect, Sapphire Rapids is a complex undertaking, and I would like to take the opportunity to thank the teams across all of Intel that are bringing Sapphire Rapids to market. Thank you.
Back to you, Raja.
Thanks, Sailesh. The entire data center ecosystem is eagerly awaiting Sapphire Rapids early next year. Just as important as compute itself is how we deliver that compute. We are in the middle of an infrastructure revolution that is transforming not only how we deliver compute from edge to cloud, but how we build data centers themselves. Intel hinted at a new category called the infrastructure processing unit, or IPU, at the recent Six Five Summit.
To tell us more about IPUs and the problems they solve for our customers, here's Guido.
Thanks, Raja. And this is exactly right, we are in the middle of a revolution. When this revolution started, the systems that made up cloud data centers looked pretty much like the systems in a classic enterprise data center. But that has changed.
We're starting to see these 2 architectures diverge. And the reason for this is that in a classic data center, everything is owned by 1 party. In the cloud, the workload and the system are owned by different parties: the tenant and the cloud service provider. So here's an example of a typical server in a classic enterprise data center. The physical infrastructure, the hypervisor and the application are all owned by one entity.
In this case, it's a bank. All the software runs on the CPU. But for servers that are built for cloud infrastructure, a different architecture has emerged. They have a dedicated processor that runs the infrastructure functions in the cloud. And we call this new category of processor an IPU, or infrastructure processing unit.
The cloud service provider's software runs on this IPU, and the revenue generating guest software runs on the CPU. So for example, a bank's financial app running on the CPU would now be cleanly separated from the cloud service provider's infrastructure software running on the IPU. If you want to think about an analogy, this is a little bit like hotels versus single family homes. In my home, I want it to be easy to move around from the living room to the kitchen to the dinner table. In a hotel, it's very different.
The guest rooms, the dining hall and the kitchen are cleanly separated. The areas where the hotel staff works are different from the areas where the hotel guests are. You need to get a badge if you want to move from one to the other in some cases. And essentially this is the same trend that we're seeing in cloud infrastructure today. Now this IPU based architecture has several major advantages.
First, the strong separation of infrastructure functions and tenant workloads allows tenants to take full control of the CPU. Second, the cloud operator can offload infrastructure tasks to the IPU. This helps maximize the utilization of the CPU and, for public clouds, also helps maximize revenue. And third, IPUs allow for a fully diskless server architecture in the cloud data center. So let me explain each of these in more detail.
So in servers with an IPU, infrastructure and tenant workloads are cleanly separated, with the tenant workload running on the CPU and the infrastructure software running on the IPU. The immediate result of this is much better isolation between the 2. So for example, if I have a spike in infrastructure load, it will no longer lead to performance issues for the CPU. That's obviously a very good property. But more importantly, it now allows the tenant to take full control of the CPU.
So for example, a tenant can bring their own hypervisor and run it on the CPU. But at the same time, the IPU can still confine that hypervisor to a virtual network segment or specific storage volumes. That allows for a much, much more flexible architecture. The second advantage of the IPU is about infrastructure function offload. Modern applications today are often structured as microservices that incur substantial communication overhead. In some cases, the majority of all CPU cycles are actually spent on this infrastructure overhead, and the IPU can help reduce this, as you can see in this slide here.
With an IPU, the cloud operator can offload these infrastructure tasks to the IPU. And thanks to the IPU's accelerators, they can process these very, very efficiently. This not only optimizes performance, but if you're a cloud operator, you can now take 100% of the CPU cycles of that CPU and rent them out to guests, which helps to maximize your revenue for the overall system. The third advantage of the IPU is that it can enable the migration to a fully diskless server architecture. This is a big architectural change.
And let me explain why this is a great thing. So traditionally in a cloud data center, you would have disks attached to every single server. As tenant demand for disk space is hard to predict, you have to over provision each of these servers, basically attach more disks than you really need, and you end up with a lot of stranded capacity, capacity that can't be utilized in a good way. With an IPU, you can move to an entirely diskless model, where all storage is on a standalone storage service. When a customer starts a workload on the server, the CSP creates a virtual volume on the storage service; via the management network, the CSP tells the IPU to create a new NVMe SSD backed by that virtual volume. And as this virtual NVMe SSD shows up on the PCIe bus just like a regular SSD, this works with most operating systems and hypervisors out of the box, and we can now boot from that SSD.
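The provisioning flow just described can be summarized as a short control-plane sequence. The sketch below is purely illustrative; every type and function name is hypothetical and no real IPU or CSP API is shown:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Purely illustrative model of the diskless provisioning flow described
// above. All names here are hypothetical, invented for this sketch.
struct VirtualVolume { std::string id; };

VirtualVolume create_volume_on_storage_service(std::size_t gib) {
    // CSP control plane: carve a virtual volume out of the standalone
    // storage service for this tenant.
    std::printf("storage-service: created %zu GiB volume\n", gib);
    return {"vol-1234"};
}

void ipu_expose_nvme(const VirtualVolume& vol) {
    // Over the management network, the CSP asks the IPU to present the
    // volume to the host as a virtual NVMe SSD on the PCIe bus. Data-path
    // traffic then flows on the IPU fast path, with no CPU cores involved.
    std::printf("ipu: exposing %s as a virtual NVMe device\n", vol.id.c_str());
}

int main() {
    VirtualVolume vol = create_volume_on_storage_service(256);
    ipu_expose_nvme(vol);
    std::printf("host: boots from the virtual NVMe SSD like a local disk\n");
    return 0;
}
```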
Now you may wonder what this does for performance, with all this network traffic that is coming from these disks. And the really brilliant thing about the IPU here is that the actual storage traffic between the storage server and the workload on the server happens on the fast path, meaning there's no involvement of any CPU cores on the IPU or the CPU. It's low latency, it's high throughput, with maximum flexibility, a very powerful solution. So with the strong separation of infrastructure and tenant workloads, with accelerators that allow us to efficiently offload infrastructure functions, and with this ability to move to a truly diskless architecture, we think the IPU will be a central component of future data center architectures.
Now if you look at IPUs today, there are basically 2 types of architectures that are commonly used. The first are dedicated ASIC IPUs, and the second are FPGA based IPUs. Each type has its own advantages and disadvantages. FPGA based IPUs give you the ability to implement new protocols quickly. You can react to changing requirements, or you can, for example, implement proprietary protocols that are not publicly known on these FPGAs.
On the other hand, a dedicated ASIC IPU maximizes performance and efficiency. And both of these are actually different from classic SmartNICs, which lack the capability of executing the infrastructure control plane. Because there's no one-size-fits-all for the different types of infrastructure acceleration, Intel will continue to invest in both types of IPUs as well as SmartNICs. We're deeply engaged with the world's leading cloud providers, including Microsoft, Baidu, JD.com and VMware, and we're already the volume leader in the IPU market with our Xeon D, FPGA and Ethernet components. I'm thrilled to announce the arrival of 2 exciting new FPGA based products in our IPU portfolio targeted for the cloud and comms market: Oak Springs Canyon and Arrow Creek. Let's start with Oak Springs Canyon.
Oak Springs Canyon is an FPGA based IPU that uses Intel's Agilex FPGA together with the Xeon D system on a chip. Agilex is the industry's leading FPGA in power, efficiency and performance, working in concert with Xeon based servers to provide the performance necessary to offload 2x 100 gig workloads and a rich software ecosystem optimized around X86. Oak Springs Canyon leverages the Intel Open FPGA Stack, a scalable, source-accessible software and hardware infrastructure stack. Oak Springs Canyon is aligned with the needs of the next wave of CSP deployments at 100 gig. Oak Springs Canyon also features a hardened crypto block that allows you to secure all infrastructure traffic, storage and networking, at line rate performance.
And today, that's an important thing. The second product I want to talk about today is called Arrow Creek. Arrow Creek is an acceleration development platform based on the Agilex FPGA and the E810 100 gig Ethernet controller. It builds upon the success of Intel's PAC N3000, which is deployed today at some of the top comm service providers worldwide. Arrow Creek will help telco providers offer flexible accelerated workloads like Juniper Contrail, OVS and SRv6.
With these 2 FPGA based additions to our portfolio, Intel covers the needs of both cloud and communication service providers. What I'm actually most excited about today is that we're announcing Intel's first dedicated ASIC based IPU, code named Mount Evans. Co-developed with a large CSP, Mount Evans is the foundation of a family of forthcoming ASIC IPUs. So, Naru, one of the key architects on the amazing team that built this technology, tell us more about it.
Thank you, Guido. As Guido just mentioned, Intel is helping to lead this industry transformation by building leadership IPUs based on our FPGA and ASIC assets. I'm here today to introduce you to a product I'm really excited about. That product, code named Mount Evans, is our first 200 gig ASIC IPU, or infrastructure processing unit. We have architected and developed Mt.
Evans hand in hand with a top cloud provider. This has provided tremendous insights into deployment requirements for networks at scale. Intel has been working closely with other cloud providers through our FPGA based solutions, and our learnings with those products informed many of the Mt. Evans architecture and design trade offs. Mt.
Evans has been designed for performance at scale under real world workloads. Finally, in order to be hyperscale ready, we designed in security and isolation from the ground up throughout the chip. On the technology front, Mt. Evans is loaded with innovation. To start with, the focal point of the product is what we believe to be a best in class packet processing engine that supports a large number of existing use cases like vSwitch offload, firewalls and virtual routing, as well as providing significant headroom for future use cases.
Another technology, created by extending Intel's proven high performance Optane NVMe controller, enables Mt. Evans to emulate NVMe devices. A third technology innovation I'm excited about is a next generation reliable transport protocol. We have co-innovated on this technology with our CSP partner to solve the long tail latency problem on lossy networks. Lastly, a 4th enabling technology that can be used across a variety of use cases is our advanced crypto and compression accelerators, leveraging our high performance QuickAssist technology.
Finally, at Intel, we really want to make IPUs a compelling technology across segments beyond cloud. And this, first and foremost, means enabling software developers to do what they do best. We start with innovative, performant hardware designed for flexibility and ease of programmability. We add to this the expertise that came in through our Barefoot acquisition, driving the use of the P4 language in the industry as a standard framework for programming network data planes onto IPUs. We'll extend well known SDKs like DPDK and SPDK to take advantage of IPU capabilities for data and storage processing.
Here, I'm showing a high level block diagram of Mt. Evans. As you can see, Mt. Evans is organized as a networking subsystem on the left and a compute subsystem on the right. I won't go through every block in the short time we have today, but I did want to highlight a few areas.
Mount Evans supports 200 gigabits per second of throughput, connecting up to 4 Xeon hosts together. We recognize that cloud performance needs will drive many applications like storage, messaging and high performance computing to migrate to RDMA based protocols. Mount Evans supports this with implementations of both RoCE v2 and the new reliable transport technology I mentioned earlier. Our Optane derived NVMe engine exposes high performance NVMe devices to the host processors, enabling infrastructure providers to use the IPU to implement their storage protocol of choice, whether it's hardware accelerated NVMe over Fabrics or a custom software backend on the compute subsystem. The programmable packet processor delivers leadership support for use cases like vSwitch offload, firewalls and telemetry functions, all while supporting up to 200 million packets per second on real world implementations.
Finally, Mount Evans provides inline IPsec to secure every packet being sent across the network. On the right hand side, our compute complex is built on the Arm Neoverse architecture using the N1 Ares core. These 16 high frequency cores come with a large system level cache backed by 3 LPDDR4 controllers. The compute complex is tightly coupled with the network subsystem, allowing the network subsystem accelerators to use the system level cache as a last level cache, providing high bandwidth, low latency connections between the 2 and enabling a flexible combination of hardware and software packet processing. Our lookaside crypto and compression engine is derived from Intel's QuickAssist technology that you can see in the Xeon roadmap, but we've adapted it for IPU use models.
This includes support for the Zstandard compression algorithm. Finally, our dual core management processor provides an interface to the platform and orchestration layers, supporting robust system manageability. We designed Mt. Evans with a software first mindset. Enabling applications on IPUs requires a robust software foundation.
I already shared a few details on using the P4 language for programming network data planes and extending well known SDKs like DPDK and SPDK. We'll share more details in the next few months. Thank you. Back to you, Raja.
Fantastic, Naru. There is a larger software ecosystem story to tell here. We look forward to sharing more at Intel Innovation. The first step in making progress is to admit we have a problem. At Intel, we had a problem, almost a decade long problem.
We were behind on throughput compute density and support for high bandwidth memories, which are essential metrics for HPC and AI and the cornerstones of GPU architecture. The first chart is FP64 flops. The blue line is Intel versus the green line, which is the best in the industry. The second is a similar chart for memory bandwidth. As is obvious, the gaps were quite large.
And in 2017, when GPU architectures started adding special engines for matrix math, the gap widened further. So we needed a moonshot. We set for ourselves some very ambitious goals. We started a brand new architecture built for scalability, designed to take advantage of the most advanced silicon technologies, and we leaned in fearlessly. Let me hand it over to Hong to walk you through this brand new architecture, Xe HPC.
I'm here to talk about how we designed the Xe HPC architecture, and how we scale our architecture to realize the vision set out by Raja. We broke this problem down into 4 hierarchical building blocks: core, slice, stack and link. Now let me walk you through each of them. First, I want to introduce the Xe Core, our foundational processing unit from which we scale our architecture.
Xe Cores are highly efficient arithmetic machines. In each Xe Core, there are 8 vector engines. Each vector engine provides floating point and integer operations on 512-bit wide vectors. There are also 8 matrix engines, referred to as XMX or Xe Matrix Extensions. Each XMX engine is built with an 8 deep systolic array.
XMX performs 8 sets of 512-bit wide vector compute operations per clock. Those vector and matrix engines are supported by a wide load/store unit that can fetch 512 bytes per clock. Each Xe Core has a large 512 kilobyte L1 data cache, currently the largest in the industry. We optimized the Xe Core for large datasets, and this huge L1 cache helps tremendously. The L1 cache is also software configurable as a scratchpad, also known as shared local memory.
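Since the L1 can be treated as software-managed shared local memory, here's how a kernel typically exploits that through DPC++/SYCL; a generic tiling sketch, not an Intel-provided kernel, and it runs on whatever default device the runtime finds:

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

// Generic sketch of using shared local memory (the software-managed view of
// the L1/scratchpad) from DPC++/SYCL: each work-group stages a tile of the
// input in local memory, then every work-item reads it back from there.
int main() {
    constexpr size_t N = 1024, TILE = 256;
    std::vector<float> in(N, 1.0f), out(N, 0.0f);

    sycl::queue q;  // picks a default device: a CPU or a GPU
    {
        sycl::buffer<float> bin(in.data(), sycl::range<1>(N));
        sycl::buffer<float> bout(out.data(), sycl::range<1>(N));

        q.submit([&](sycl::handler& h) {
            sycl::accessor a(bin, h, sycl::read_only);
            sycl::accessor b(bout, h, sycl::write_only);
            // Allocation the runtime places in shared local memory.
            sycl::local_accessor<float, 1> tile(sycl::range<1>(TILE), h);

            h.parallel_for(sycl::nd_range<1>({N}, {TILE}),
                           [=](sycl::nd_item<1> it) {
                size_t g = it.get_global_id(0);
                size_t l = it.get_local_id(0);
                tile[l] = a[g];                         // stage into local memory
                sycl::group_barrier(it.get_group());    // wait for the whole tile
                b[g] = tile[l] * 2.0f;                  // read back from the tile
            });
        });
    }  // buffers synchronize back to the host here

    std::printf("out[0] = %f\n", out[0]);
    return 0;
}
```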
Ops per clock for the critical data formats is essential for high performance computing and AI. Here, I'm showcasing those data formats and what the Xe Core can do. But this is not all. We can also co-issue instructions to exceed those single op per clock rates. Our Intel libraries and kernels take full advantage of this for increased performance of the Xe Core.
The next type of building block is the slice. For Xe HPC, a slice has 16 Xe Cores, totaling 8 megabytes of L1 cache, 16 ray tracing units, and provides one hardware context. The ray tracing units provide fixed function computation for ray traversal, bounding box intersection and triangle intersection. This makes Xe HPC very attractive for professional visualization applications. The hardware context feature enables Xe HPC GPUs to execute multiple applications concurrently without expensive software based context switches.
This greatly improves the utilization of GPUs in the cloud. At the top level, we have the stack. This can be a full GPU in itself. A stack contains 4 slices. This adds up to 64 Xe Cores, 64 ray tracing units and 4 hardware contexts.
The stack has a massive L2 cache, 4 HBM2e controllers, a state-of-the-art media engine and 8 XeLinks. The Xe memory fabric connects the copy engines, the media engine, the XeLink blocks, HBM and PCIe. The Xe HPC architecture is also scalable, allowing us to do multi stack designs. This is an industry first. We could only accomplish this because of our EMIB packaging technology.
Here, we connect the Xe memory fabric on each stack directly. This enables unified, coherent memory between the stacks. This is a big deal for software. We can now deliver leadership compute and memory bandwidth density for a wide range of HPC and AI systems with a single design. The 4th dimension of our scaling strategy is Intel XeLink.
XeLink provides a high speed, coherent, unified fabric for GPU to GPU communication. It supports load and store, bulk data transfer and synchronization semantics. It includes an 8 port switch, enabling up to 8 fully connected GPUs in a node without any additional components. This leads to the ability to build very flexible topologies. It is easier to show than tell.
Here, we have an XeLink between 2 Xe HPC GPUs, and we could connect up to 8 of them. Scaling to 4 GPUs for large problems is a popular configuration. 6 GPUs per node may look familiar to you, as this is the topology of the Aurora accelerated node. A popular configuration for AI and large problems is to have 8 GPUs in an OAM form factor on a universal baseboard design, following the Open Compute Project standard. The flexibility of XeLink enables a high number of coherent and unified accelerators in a single node. There's no need for additional components to scale up.
This is a massively scalable architecture, the magnitude of which has never been built before, as far as we know. Now my colleague, Masooma, will take you through how we turn this architecture into an implementation.
Hong talked about the amazing Xe HPC architecture. My team and I, along with the help of our partner IP, test, packaging, process technology and manufacturing teams, have the challenge and privilege of bringing this architecture to life as the Ponte Vecchio chip. It is an understatement to say that Ponte Vecchio is the most complex chip and product that I have worked on in my 30 years of chip building. Actually, I'm not even sure if it is accurate to call it a chip. It is a collection of chips that we call tiles, woven together with high bandwidth interconnects and made to function like 1 monolithic piece of silicon.
Planning Ponte Vecchio execution was a completely different paradigm. I have worked on new SoC architectures, new IP architectures, new memory architectures, new IO architectures, new packaging technologies, new power delivery technologies, new interconnects, new signal integrity techniques, new reliability methodologies, completely new software and new verification methodologies. But never have I dealt with all of this newness in one product. And that was the challenge that was Ponte Vecchio. It is amazing and somewhat unbelievable that the chip is alive and fully kicking with workloads.
The Ponte Vecchio chip, as you see in this picture, is composed of several complex designs that manifest in tiles: the compute tile, Rambo tile, XeLink tile and a base tile with high speed HBM memory, which are then assembled with EMIB tiles that enable a low power, high speed connection between the tiles. These are put together with Foveros packaging that creates the 3D stacking of active silicon for power and interconnect density. And then the high speed MDFI interconnect allows the stack count to scale from 1 to 2. All of this comes together in a manufacturing marvel across several different process technology nodes. Ponte Vecchio was new and novel in many ways, with a myriad of challenges.
While the multi tile approach helped break down the problem into smaller chunks and provided flexibility, execution planning was orders of magnitude more complex. I want to walk you through a few big challenges from the many that we had on Ponte Vecchio. Foveros was critical for Ponte Vecchio 3 d stacking, and we have some key learnings with its implementation, both functional and physical. We had to transfer data at 1.5x speed over our original plan to minimize the number of Foveros connections. We also had to lock the Foveros locations early in the design on all the tiles, which meant the floor plan was locked very early.
Since we pioneered this 3D implementation, we had to innovate continuously on die to die implementation and verification methodology. We developed many tools, methods and scripts in real time and performed validation at multiple levels of hierarchy with new BFMs and test benches to keep the tiles independent and the hierarchies clean and crisp. This facilitated an independent schedule for each of the 4 main tiles and enabled their own debug packages. With this divide and conquer approach, we were able to stage both pre- and post silicon validation such that the chip booted within a few days of the SoC package assembly, with the flashing of Hello World. This was a huge sigh of relief and a cheer for thousands of engineers across Intel.
The staged approach, while essential, meant that the RTL versions of the various tiles had to be in sync for the integrity of the top level model. High power, multi tile package posed its own challenges related to signal integrity, reliability and power delivery as there was no precedent internal or external to Intel. Foveros implementation was complex and time consuming. Just for context, Ponte Vecchio has 2 orders of magnitude more Foveros connections than any previous Intel designs. All the electrical and physical collaterals had to be generated from scratch and verified prior to delivery to our partner teams.
Now let me tell you more about some of the most sophisticated and complex of these Ponte Vecchio tiles. While Ponte Vecchio was a challenge in aggregate, the individual tiles had a level of design complexity of their own. The compute tile is a dense package of Xe Cores and is the heart of Ponte Vecchio. One tile has 8 Xe Cores with a total of 4 megabytes of L1 cache, which is key to delivering power efficient compute. It is built on the most advanced TSMC process technology, called N5.
We paved the way with the design infrastructure setup, tool flows and methodology for this node at Intel. This tile has an extremely tight 36 micron bump pitch for 3D stacking with Foveros. This is just one example of the IDM 2.0 strategy of combining internal and external process nodes that Pat has outlined. The base tile is the connective tissue of Ponte Vecchio. It is a large die built on Intel 7, optimized for Foveros technology.
It is where all the complex IO and high bandwidth components come together with the SoC infrastructure: PCIe Gen 5, HBM2e memory, MDFI links to connect tile to tile, and EMIB bridges that challenge physics. Super high bandwidth 3D connect, dense 2D interconnect and low latency make this an infinite connectivity machine. Implementation of this tile was the hardest design challenge on Ponte Vecchio. We worked closely with the Intel technology development team to match the requirements on bandwidth, bump pitch and signal integrity. The XeLink tile provides the connectivity between GPUs, supporting 8 links per tile.
It is critical for scale up for HPC and AI. We are targeting the fastest SerDes supported at Intel, up to 90 gig. When we won the Aurora exascale supercomputer contract, this was a new tile added to enable the scale-up solution per their requirements. We built this incredible tile in less than 1 year. It is highly gratifying to see Ponte Vecchio powered on and successfully running hundreds of workloads and hitting some industry leading performance numbers on A0 silicon.
Here in my hand is this marvel, Ponte Vecchio. Let me now hand this to Raja.
Thank you, Masooma. You and your team have done a fantastic job.
Thank you, Raja. Highly appreciate it.
This is an incredibly proud moment, to be holding this marvel of engineering in my hand. It began as a moonshot that many said could not be done. And nothing inspires Intel engineers like hearing those four words: it can't be done. Thousands of engineers said we can.
And let me show you what they have already done. That GPU Masooma handed me is A0 silicon, as she noted, which is our first stepping. It already produces greater than 45 teraflops of sustained vector single precision performance, validating that our compute tiles are healthy. We also measured greater than 5 terabytes per second of sustained memory fabric bandwidth, which validates our Foveros 3D packaging technology, and over 2 terabytes per second of aggregate memory and scale-up bandwidth, which proves all our EMIB bridges are very healthy. And there is still more performance to be had.
These are all leadership compute and bandwidth numbers that already erase the huge flops and bandwidth gap problem I mentioned earlier today. Ponte Vecchio will be available in PCIe cards with an XeLink interconnect bridge. The OAM module form factor that I just showed you will be integrated into a carrier baseboard that brings together multiple GPUs with XeLinks. Our OEM partners will provide various accelerated compute systems utilizing these Ponte Vecchio subsystems and Sapphire Rapids. For years, taking advantage of GPU accelerated computing systems like this has been a major headache for software developers.
They had to rewrite the parts they wanted to accelerate in different specialized languages, OpenCL, CUDA, etcetera, etcetera. Otherwise, the GPU did them no good. We already led the industry in CPU software, so we needed another moonshot, a software moonshot. We needed a programming framework that lets software developers transparently program for any mix of CPUs and accelerators. Many said this could not be done.
So we created oneAPI. The oneAPI industry initiative provides an open, standards based, unified software stack that is cross-architecture and cross-vendor. The first version of the industry spec was released in September of last year, which specified a common hardware abstraction layer, a data parallel programming language and a comprehensive collection of performance libraries addressing the math, deep learning, data analytics and video processing domains. oneAPI allows developers to break free from proprietary languages and programming models. It exposes and exploits cutting edge features of the latest hardware.
A comprehensive set of libraries speeds development of frameworks, applications and services. And the language and libraries work seamlessly with other ecosystem languages like Python, C++ and Fortran. Releasing an open specification is one thing. The question I'm sure that's on your mind is whether the industry sees the value and will invest their own effort to adopt it. The answer is a resounding yes.
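To make the programming model concrete, here is a minimal DPC++ (SYCL) sketch of the single-source, cross-architecture idea; it's an illustrative example rather than anything from the oneAPI specification itself:

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

// Minimal DPC++ (SYCL) sketch of "write once, run on any device": the same
// kernel runs on whichever CPU or GPU the default selector finds, including
// non-Intel devices where a oneAPI backend exists.
int main() {
    sycl::queue q;  // default device: could be a CPU, an Xe GPU, etc.
    std::printf("running on: %s\n",
                q.get_device().get_info<sycl::info::device::name>().c_str());

    constexpr size_t N = 1 << 20;
    float* a = sycl::malloc_shared<float>(N, q);
    float* b = sycl::malloc_shared<float>(N, q);
    float* c = sycl::malloc_shared<float>(N, q);
    for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Data-parallel vector add expressed once, compiled for every target.
    q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    std::printf("c[0] = %f\n", c[0]);
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
    return 0;
}
```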
There are now DPC++ and oneAPI library implementations for NVIDIA GPUs, AMD GPUs and Arm CPUs. It's also being adopted broadly by ISVs, operating system vendors, end users and academics. We know that oneAPI version 1.0 is just the beginning of the journey. Key industry leaders are helping to support additional use cases and architectures. The provisional version 1.1 spec was released in May, which adds new graph interfaces for deep learning workloads and advanced ray tracing libraries.
We expect the version 1.1 spec to be finalized by the end of the year. Here is a sampling of key ecosystem players who support and are actively engaged in oneAPI. oneAPI has developed broad momentum across the industry. For example, U. S.
national labs that are developing exascale computers have adopted oneAPI components. This will allow them to use CPU and GPU architectures from different vendors. Beyond the industry spec, Intel released the 1st commercial implementation of the full oneAPI stack. Our oneAPI product offering includes the foundational base toolkit, which adds compilers, analyzers, debuggers and porting tools beyond the spec language and libraries. Over 200,000 developers have installed Intel's oneAPI product since our first production release in December 2020, and that was before they had access to Xe HPC.
We anticipate exponential growth in the developer base when we enable this architecture. There are over 300 applications already deployed in market from ISVs across multiple segments that utilize the unified programming model of oneAPI. And we have over 80 key HPC applications, AI frameworks and middleware functional on Xe HPC that utilize oneAPI to quickly port from either existing CPU only or CUDA based GPU implementations. Let's look at oneAPI in action with the AI Analytics Toolkit.
It's been exciting over the last 4 to 5 years to see the growth in HPC and AI. And there's no better way to see the excitement than to look at the progression in performance on the image recognition benchmark ResNet-50. The gold standard has been set by one architecture over the last several years with record setting performance. Well, we're pleased to announce a new era with Ponte Vecchio, built on the Xe HPC microarchitecture. With an alchemy of technologies and more than 100,000,000,000 transistors, the Ponte Vecchio GPU was designed to take on the most challenging AI and HPC workloads. ResNet-50 inference throughput on Ponte Vecchio with Sapphire Rapids exceeds 43,000 images per second, surpassing the standard you see today in market.
And with training, while we're still in the early stages, initial testing shows the compute, memory and interconnect bandwidth of Xe HPC have unlocked the capacity to train the largest data sets and models. Today, we are already seeing leadership performance on Ponte Vecchio with over 3,400 images per second. And this is only the beginning as we continue with software optimizations and tuning. We're excited about the dawn of a new era where a new architecture can raise the bar to meet the ever growing compute demands of the data center.
The Xe architecture and oneAPI are about more than AI training and inference and HPC flops. Let's take a look at some eye candy with the oneAPI Rendering Toolkit.
Now I'm excited to show you early results of our oneAPI implementation of the advanced ray tracing in the provisional 1.1 oneAPI specification, running on oneAPI based CPU and Xe GPU platforms. The Intel oneAPI Rendering Toolkit has 6 high performance, feature rich, open source software components, including the Academy Award winning Embree ray tracing library. These are already running on Intel and third party CPUs like Apple's M1. And now you'll be the first to see the oneAPI Rendering Toolkit running cross-architecture on CPUs and GPUs. Let's show a typical artist workflow, creating, reviewing, then delivering a movie quality scene, backed by tools using the Intel oneAPI Rendering Toolkit.
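Before the walkthrough, here's a minimal flavor of driving Embree from C++ (the Embree 3 CPU API; the GPU path shown in the demo goes through the oneAPI port and is not reproduced here). This builds one triangle and traces a single ray:

```cpp
#include <embree3/rtcore.h>
#include <cstdio>

// Minimal Embree 3 sketch: one triangle, one primary ray. Real renderers
// like the demo's path tracer build large scenes and shoot millions of rays,
// but the API shape is the same. (Illustrative only; link against Embree 3.)
int main() {
    RTCDevice device = rtcNewDevice(nullptr);
    RTCScene scene = rtcNewScene(device);

    // Create a triangle geometry with 3 vertices and 1 index triple.
    RTCGeometry geom = rtcNewGeometry(device, RTC_GEOMETRY_TYPE_TRIANGLE);
    float* v = (float*)rtcSetNewGeometryBuffer(geom, RTC_BUFFER_TYPE_VERTEX, 0,
                                               RTC_FORMAT_FLOAT3, 3 * sizeof(float), 3);
    unsigned* idx = (unsigned*)rtcSetNewGeometryBuffer(geom, RTC_BUFFER_TYPE_INDEX, 0,
                                                       RTC_FORMAT_UINT3, 3 * sizeof(unsigned), 1);
    v[0] = 0.f; v[1] = 0.f; v[2] = 0.f;
    v[3] = 1.f; v[4] = 0.f; v[5] = 0.f;
    v[6] = 0.f; v[7] = 1.f; v[8] = 0.f;
    idx[0] = 0; idx[1] = 1; idx[2] = 2;
    rtcCommitGeometry(geom);
    rtcAttachGeometry(scene, geom);
    rtcReleaseGeometry(geom);
    rtcCommitScene(scene);

    // A single ray shot at the triangle from in front of it.
    RTCRayHit rh = {};
    rh.ray.org_x = 0.2f; rh.ray.org_y = 0.2f; rh.ray.org_z = -1.f;
    rh.ray.dir_x = 0.f;  rh.ray.dir_y = 0.f;  rh.ray.dir_z = 1.f;
    rh.ray.tnear = 0.f;  rh.ray.tfar = 1e30f;
    rh.ray.mask = 0xFFFFFFFFu;
    rh.hit.geomID = RTC_INVALID_GEOMETRY_ID;

    RTCIntersectContext ctx;
    rtcInitIntersectContext(&ctx);
    rtcIntersect1(scene, &ctx, &rh);

    std::printf("hit: %s, t = %f\n",
                rh.hit.geomID != RTC_INVALID_GEOMETRY_ID ? "yes" : "no", rh.ray.tfar);

    rtcReleaseScene(scene);
    rtcReleaseDevice(device);
    return 0;
}
```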
Everything you'll see is an untouched live computer screen capture using film quality assets at native HD 1080p resolution. First, let's show an artist creating a scene backed by Intel Embree using the tool Houdini from SideFX. The artist creates in HD with interactive path traced rendering on a Xeon workstation without a discrete GPU. For this phase of the design, the CPU provides the interactivity the artist needs. When they pause to review, the path traced rendering converges towards photoreal quality.
Next, it's time for the artist to review the scene with the director. This is where the oneAPI game changer comes in. You're looking at a real time walkthrough of an Intel history inspired path traced scene at the fictitious 4004 Moore Lane. Using the oneAPI software architecture, we show Embree and the AI based Intel Open Image Denoise, which took less than 3 days to port onto a pre-production ray tracing capable Xe GPU. So now the same feature rich Render Kit capabilities artists and app developers crave on CPUs, including ray tracing and AI, are accelerated on GPUs.
The artist and director can review the scene instantly and interactively with full featured, native HD, denoised path traced rendering. Once the scene is ready for final movie-ready 4K rendering, studios can choose an Intel Xeon CPU based render farm or seamlessly add oneAPI capable Xe GPUs to improve their workflow. Here is one 4K full fidelity frame rendered with a ray tracing capable Xe GPU. The full 4K movie is available for viewing in the demo showcase. So in quick summary, 2 years ago we announced oneAPI with the goal of open, cross-platform, cross-architecture development and execution.
Today we've shown that oneAPI has gone from an ambitious goal to a delivered reality for developers and creators.
That was a fantastic demo of oneAPI and the rendering capabilities of Xe. All of this was set in motion with Argonne National Labs and the Aurora project, which combines Sapphire Rapids, Ponte Vecchio, Optane memory and oneAPI to power the next generation of exascale applications. Here is an individual Aurora blade with 2 Sapphire Rapids CPUs and 6 Ponte Vecchio GPUs, addressing the needs of converged HPC and AI workloads. Tens of thousands of these blades connected via high speed fabric will be deployed next year to unleash exascale. Less than 2 years ago, I shared our goals for Ponte Vecchio.
It's an incredible moment for us, seeing this extraordinary silicon engineering effort and ambitious software initiative coming to life in our labs. This is no longer a moonshot for us. We still have a ways to go, and we are not done yet. But we can't wait to take you along on this journey when we bring this architecture to all our customers early next year. Thank you for joining me and my colleagues, and for being part of our architecture journey. Please welcome a famous Intel architect and now Intel CEO, Pat Gelsinger.
I appreciate the opportunity to join you as we bring Architecture Day to a close. I'm extremely proud of what our technology leaders just showed you. This was the result of years of hard work by the most talented team of architects and engineers in the world. You have just seen one of Intel's most significant advances in X86 architecture in over a decade. It's that big. For generations, the primary driver of compute was process, lithography, geometry, getting to the next node.
At our recent Intel Accelerated event, we laid out one of the most detailed process and packaging roadmaps we've ever produced, covering the exciting foundational innovations that will power products through 2025 and beyond. Looking ahead, we face daunting compute challenges that can only be solved through revolutionary architectures and platforms. The good news: we have already developed many of these. Microarchitectures for performance and efficiency.
Heterogeneous computing at every level and in every dimension, from subchip to board to system to the data center, and from edge and endpoint devices to network to cloud. Everything is designed to intelligently use the best compute resource, the optimal architecture for each task. Much of this goodness you have seen at this event. For the billions of PC users in the world, Alder Lake is an entirely new performance hybrid client CPU architecture, reinventing our multi core architecture with 2 different X86 cores and a revolutionary hardware scheduler. The new performance X86 core is the highest performing CPU core we have ever built. Faster, wider, smarter, deeper and with built in AI acceleration, designed for the highest performance general purpose compute.
It pushes the limits of low latency and single threaded applications. Our new efficient X86 core is built for scale and designed to push the limits of multi core performance per watt. We engaged early with developers, ISVs and engine leaders on our new discrete GPU for enthusiast gaming. The new scalable Xe HPG architecture takes a software first design approach to deliver high performance and reduce friction for gamers and creators. Sapphire Rapids sets a new standard for data center architecture.
It is the architectural underpinning of a heterogeneous compute infrastructure. With our highest compute density and highest memory bandwidth, our innovative EMIB packaging technology helps make all this possible. Sapphire Rapids brings new higher performance CPUs, increased core counts, new memory capabilities, new interface standards, increased AI performance and the industry's broadest range of accelerators. Ponte Vecchio is a tour de force of Intel technologies, providing our highest compute density and bandwidth for exascale computing. We co-architected Mount Evans, our newest IPU or infrastructure processing unit, with one of the top cloud providers to offload infrastructure tasks. Our talented architects and engineers made all of this technology magic possible.
It's an exciting time for Intel. Our strategy and execution are accelerating. We are charting the course for a new era of innovation and technological leadership. Our breadth and depth of software, silicon and platforms, our packaging and process technologies and Intel's at-scale manufacturing uniquely position Intel to capitalize on the vast growth opportunity. In addition, our IDM 2.0 approach is the powerful combination of 3 capabilities: Intel's internal factory network, strategic use of foundry capacity and Intel Foundry Services.
It's powered by Intel's leading edge packaging and process technology and our world class IP portfolio. Intel is back. And this story is just beginning. We have even more technical magic saved for our innovation event in October. Let me hand it over to Greg Lavender, our CTO, to give you just a little bit more detail on this incredible technical event.
I can't wait to see you all again.
Thank you, Pat. I'm so excited to host the Intel Innovation event, my first event as the new CTO of Intel. Intel Innovation will be the inaugural flagship event and a tour de force of technology, from the smallest devices at the edge, to cutting edge mobile, laptop and PC clients, through the network of today and tomorrow, into the heart of the data center. We have 2 full days of technical keynotes, breakout sessions, hands on demos and networking events planned on today's hottest topics, from AI to 5G, the edge, the data center, cloud and client solutions.
The agenda is packed and awesome. I hope you will attend. Our teams at Intel can't wait to see you in person or virtually online October 27 through 28.
I can't wait for you to join us at Intel Innovation in October.
So come join us.
October can't get here fast enough.
I really hope you'll come join us in October. See you there.
Thank you for joining us at Architecture Day.