Hello everyone, we are live. Thank you for coming. This is the oneAPI Dev Summit for AI and HPC in 2022. I'm really happy to have you all here. To start us off, I'd like to first introduce myself. My name is Sriram Ramkrishna. I'm the oneAPI community manager. My purpose here is to care for, feed, and grow the oneAPI community. A little bit about myself: I've been working with technical communities for over 25 years. I really enjoy working with technical communities of all types; it's one of my big passions. As for my technical background, I was a Linux systems administrator for many years, maybe over 30 years.
Just to date myself, I started working on my first Unix computer in 1986. I subsequently got in trouble. Don't do what I did and access things you shouldn't. With that, I'd like to introduce my co-host, Susan Kahler. Susan, wanna come up on the virtual stage?
Hey, everyone. Greetings from Raleigh, North Carolina, where it is a cloudy 51 degrees Fahrenheit here. I'm Susan Kahler, and I'm on the AI product marketing team for Intel. I've been in the AI space for several years now. I'm not gonna date myself like Sri, but I started with my research studies on intelligent agents. We know it's Q4, and it's a busy time for all of us, and we really appreciate that you chose to spend time with us today. I hope that you enjoy the Dev Summit. Happy to see everyone in the chat session. Back to you, Sri.
All right. Thank you. Before we start off with our headline speaker, I'd like to run through a few slides to get you comfortable with the user interface so that you can have a problem-free conference experience. All right? I have our initial slides here for the oneAPI Dev Summit hosts. Let me go to the next one. Here you'll see a screenshot of the agenda. We're currently on the Central Time zone, and if you want to change the time zone, there's a little spot here where you can change it.
If there are specific talks that you're particularly interested in and you want to add them to your calendar, there's a link on the side here that says, "Join the presentation and add to calendar." One important part of this conference: once a presentation is complete, you have to exit out of the presentation and then go back to the agenda page so that you can get to the next presentation. That's really important, so keep it in mind; when a session is done, you need to close out and come back in again. All right? Good. If you have any questions, of course, there's chat, and we'll be happy to be online and help you out if you have any problems.
All right. The next one is what you're probably looking at right now as you're viewing this conference. Just to familiarize yourself, you have three active windows that you can resize and move around. The first part here is the speaker video; that's where you see me. Hello. If you go down here, you'll see a window at the bottom left, which has three tabs: the chat, the abstract, and the speaker info. If you click on the speaker info right now, you'll see the information for both Susan and myself, probably a little more extensive than what we've said here. Then there's the chat box. This is the important part.
This is where you are able to comment and ask questions, so when a speaker is talking, you can put your questions there, and we'll monitor them and make sure that the speaker sees those questions and is able to answer them. As well, if you happen to suddenly forget what the talk is about, there is an abstract along with the speaker info in the middle tab. Hopefully all of that is clear to you. At the bottom, there are some quick links that you can look at. If you hover over them, you'll see what each of them does. If you have problems especially, there's a little window with a question mark; that's your tech support.
If you have problems, you can take a look at that. To the far right, there's closed captioning for those of you who need it. I think pretty much everything else is self-explanatory. Of course, if you have any questions, just throw them out there on the chat. We'll try to help you out. All right? Moving on to the next slide: this is a good opportunity to earn badges. If you look here, there are a certain number of badges you can get, and there's the leaderboard. For example, you can get an easy 50 points just by visiting the Resources tab.
There are various other badges, like Collaboration Hero, Session Attendee Champ, and so forth. They're not overly hard to earn, but it's a great way to get points, have a little reward, and compete. Hopefully you have fun with that. We'll be monitoring the badge leaderboard and watching all the fun there as well. Let's see, we've got another seven minutes, so I'll go a little slower here. I mentioned the Resources area. Sometimes we're going to be talking about things like DevCloud or the AI toolkits and whatnot; this is where that information lives.
It's kind of like a library, where you can find out all about things like the Intel Developer Cloud and oneAPI, if you want to know what those are. There is also a community forum for accelerated computing if you want to find out more about that. These are just a sample of the resources available. Moving on: no conference is complete without the social side. This is one of the most important parts of community and of being at conferences, the social aspect. We do want to give out some prizes and build a feeling of community.
What's important, of course, is to show that you're having a great time, and one way to do that is to call it out on Twitter. If you're on Twitter, include the hashtag #oneAPIDevSummit, throw it out there, and you get a chance to win a $20 gift card as a prize. You do have to be 18 or older to take part in the social contest, so make sure you're an adult before you post. Also, if you're a Mastodon fan, feel free to put it on Mastodon. I'm a big Mastodon fan, and I know there are a bunch of folks in HPC who have their own server.
I would love to see oneAPI DevSummit posts on Mastodon; let's see what you've got. All right? Of course, if you want to know more about the social side, there are people out there in Discord, and we encourage you to come over to Discord after the conference ends. Now, we have a really exciting agenda today. Just to call it out, our headliner is Andres Rodriguez, and he's going to be talking about AI software and hardware acceleration. We have Peter Ma, who's going to talk about using oneAPI in a business context. There's a short break. We have a talk from Kurt and Zhen on efficient inference and training, and a host of other things.
We have a quick 30-minute lunch break, and as you can see, there's a whole bunch of great stuff. Make sure you catch Rod Burns's talk on the oneAPI Community Forum, which is going to be an important milestone for oneAPI, especially as it moves toward being an industry standard. After that, there's a hands-on with TensorFlow. Then we have a special guest, Stefanos Sotira, coming from RISC-V to talk about accelerating the future of heterogeneous compute. Finally, we have a conclusion, where Susan and I will wrap up day one of the oneAPI DevSummit, and then Russ is going to host a happy hour with the rest of us. A really fun, packed day today.
Again, as Susan said, you can spend your time however you want today, but you're spending it with us, and we're really grateful that you're willing to share your day with us. Day two: we'll talk a little bit more about day two later, but it's generally the HPC portion of the event. The one thing I really want to call out is the live demo showcase that's going to be there. At a lot of previous DevSummits, everybody's been telling us they want more hands-on content. We've heard you and we're listening, and we would love to have you attend our demo showcase.
We'll talk more about all of this on day two. All right, I've only got three minutes before we introduce the speaker, so I'm going to go through this quickly. Don't worry if you don't catch all of it; during the conclusion, we're going to show these slides again. These point you to the DevCloud. This one is the oneAPI Meetup community; we have our meetup.com group, where I host a meetup every two weeks on various topics around AI, HPC, and oneAPI. And of course, you can always hang out with us in Discord, even after the conference is over or during the conference, however you want to spend your time. And this is the way to get to that.
We have a happy hour, and you can join us for that; we have games and prizes, and you get a chance to win some stuff. Please hang out with us for that. Let me now introduce our headliner, Andres Rodriguez. Andres Rodriguez is an Intel Fellow and Chief AI Architect. He designs deep learning solutions for Intel's data center customers and provides technical leadership across Intel for deep learning hardware and software products. He was the lead instructor of the Coursera course An Introduction to Practical Deep Learning, taught to over 20,000 students, and is the author of the popular Deep Learning Systems book.
Again, to remind you, if you have any questions throughout the presentation, put them in the chat box, and Andres will answer them at the end of the presentation. I'm now going to hand it over to Andres. Please give a warm welcome to Andres Rodriguez, Intel Fellow and Chief AI Architect.
Thank you for the introduction. It's a pleasure to be here with you today. Thanks for making the time. I want to share with you what Intel is doing to accelerate artificial intelligence applications through both hardware and software innovations powered by oneAPI. You can see the agenda of the content. First I'm going to share a quick overview of why AI is important and then talk about the various products that we are bringing to market, both in hardware and software, to accelerate artificial intelligence. Finally, I'll conclude with some of the ecosystem programs that we have. Why invest in AI? As many of you know, AI is rapidly growing across many different sectors. It's used in language applications. It's used to create images, to detect objects, to detect people. It's used for recognition.
It's also used in the medical field, among many other fields. There are multiple articles published about AI advances every single day. What is Intel doing to help developers and data scientists accelerate their AI workloads? Our goal is to provide solutions that are simple to use, that are performant, and that increase the productivity of developers. To start with, we have multiple hardware platforms available for our developers, starting with our Intel Xeon Scalable processors, to which we're adding AI acceleration. We're also bringing to market the Intel Data Center GPUs that can be used for both AI and HPC applications. For dedicated AI training workloads, we're bringing to market the Habana Gaudi2 dedicated AI processor.
At the root of this hardware, we have oneAPI. oneAPI has two components. One is the oneAPI specification, the open specification that is available not just for Intel but also for non-Intel hardware. It provides a standard that many hardware vendors can use: you can take a workload, accelerate it through oneAPI, and run it on various hardware backends. In addition, we have the oneAPI product to accelerate Intel's hardware platforms. I'll talk about more details on how this applies to AI in the next few minutes. To give you an example, here you can see that developers write applications using various AI libraries like TensorFlow, PyTorch, scikit-learn, et cetera. These libraries then leverage our oneAPI products. For example, TensorFlow can leverage oneDNN, and scikit-learn can leverage oneDAL.
These libraries are then optimized to run across Intel's various hardware backends. As a developer, you don't have to worry about knowing all the hardware details of our CPUs or GPUs; you can just leverage the high-level libraries like TensorFlow and scikit-learn and know that they are going to run with high performance across our hardware backends. To give you an idea of the end-to-end software and solutions that we are offering, you can see on the left-hand side some of the popular tools that AI developers use for data analysis. In the center, you can see some of the machine learning libraries, and on the right, you can see tools to increase productivity. Most of these tools leverage oneAPI acceleration and can be executed across most of Intel's hardware platforms.
In artificial intelligence particularly, we see that models are growing at a fast pace, and they're not just growing in size; they're growing in complexity. Models used to be just a stack of sequential layers, but now they are growing much more complex, where the layers are not always sequential. Sometimes you have graph neural networks mixed with the traditional multilayer perceptron, and this complexity can be hard for both software and hardware. What are we doing to meet the demands of this growth in AI and its complexity? At the center, we have the Intel Xeon Scalable processors. The main benefits of this product are that it's widely available, simple to use, simple to program, and simple to debug.
Each core in the Intel Xeon processors is fast; it operates at a high frequency. We have a large memory capacity so that you can store very large models and datasets. The software is robust; we've been working on the software for several years to make sure that a broad set of AI workloads run efficiently on the Intel Xeon Scalable processors. You can do the entire end-to-end pipeline on the same processors, from data preprocessing to the training or inference of your AI models and the post-processing. On Xeon, we have added various hardware acceleration, from AVX-512 to Intel Deep Learning Boost via an instruction called VNNI, and more recently, in the next generation of Intel Xeon processors that we're going to be launching in a month.
We added the Intel Advanced Matrix Extensions, or Intel AMX, and I'll talk about what this is in a little bit. We've worked with the ecosystem community to optimize popular libraries, TensorFlow, PyTorch, ONNX Runtime, XGBoost, et cetera, so that you can take these libraries and run them across our hardware, and you might not even realize that you're using Intel's optimizations. Intel Xeon processors, as I mentioned, can be used for data preprocessing, for model development, for both traditional machine learning and deep learning, as well as for deployment. In fact, the overwhelming majority of deep learning inference happens on Intel Xeon processors. Just to give you an example, we work with several companies.
eBay gave a talk in which they showed the acceleration that they were getting by leveraging Intel's hardware and software. They showed that for their ranking algorithm, which is important for showing their end users more relevant search results, they were getting around a 2.5x improvement in both speed and throughput. Tencent has an application for text-to-speech, and not just Tencent, but many other companies as well. They leveraged the acceleration for their vocoder, which gives higher-quality speech synthesis, and they were getting a 4.7x improvement in performance. For other applications like reinforcement learning, another team at Tencent used this for a very popular game called Honor of Kings, and they were able to do distributed training.
They took a cluster of 16 Intel Xeon Scalable processors and trained their reinforcement learning algorithm on these processors. They showed that they were able to get near-linear scaling across the 16 processors. You can use this processor not just for inference but also for training, and for distributing training workloads across multiple processors to reduce the time to train. In the upcoming generation of Intel Xeon Scalable processors that we're launching next month, we added specialized AI hardware acceleration to every single core. This expands the reach of the Intel Xeon processors into the accelerator space, so that for many applications, you can do both the training and inference on an Intel Xeon CPU. This accelerator is called Intel Advanced Matrix Extensions, and it has two components.
One is tiles, which are essentially two-dimensional registers, and the other is TMUL, the tile matrix multiply unit. Essentially, you're taking a matrix multiplication accelerator and embedding it into every core of an Intel Xeon CPU. As many of you know, matrix multiplication is one of the most compute-intensive and most common operations in artificial intelligence. By accelerating this computation, you end up accelerating the end-to-end workload. To show you some results: if you compare the performance of the upcoming generation of Intel Xeon processors with the previous generation, you can see a 38x improvement by leveraging both oneAPI software acceleration and hardware acceleration.
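As an aside, here is a conceptual NumPy sketch of the kind of tiled, low-precision matrix multiply that the tile registers and TMUL just described accelerate in hardware. It is only an illustration of the idea; the tile size and data types below are arbitrary choices, not the actual AMX register geometry or instruction interface.

```python
import numpy as np

def tiled_int8_matmul(A, B, tile=16):
    """Conceptual model of an AMX-style tiled matmul: int8 inputs, int32 accumulation.
    The hardware TMUL unit computes each tile-by-tile product in one operation;
    here we just emulate that blocking in plain NumPy."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.int32)          # accumulate in higher precision
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each (tile x tile) block plays the role of a 2-D tile register.
                a = A[i:i + tile, k:k + tile].astype(np.int32)
                b = B[k:k + tile, j:j + tile].astype(np.int32)
                C[i:i + tile, j:j + tile] += a @ b
    return C

A = np.random.randint(-128, 128, size=(64, 64), dtype=np.int8)
B = np.random.randint(-128, 128, size=(64, 64), dtype=np.int8)
assert np.array_equal(tiled_int8_matmul(A, B), A.astype(np.int32) @ B.astype(np.int32))
```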
In this slide, you can see that by leveraging Intel TensorFlow with oneDNN, which is one of the oneAPI products, you get a 50% improvement in performance. By leveraging the hardware acceleration offered in that generation, the third generation, you get an additional 3.9x performance boost. By leveraging the Advanced Matrix Extensions in the upcoming generation, you get another 4x improvement. I am very excited that our developers can use their Intel Xeon CPUs for applications that in the past often required a dedicated accelerator. How does this compare to an NVIDIA GPU?
You can see that you can get much faster performance on an Intel CPU than on an NVIDIA GPU by leveraging both the hardware and software acceleration. This is a popular benchmark called ResNet-50, and you can see a significant improvement in performance. We'll have more proof points and examples when we launch the product next month. I won't share many more, but I do want to share one. Before that, what can you use the Intel Xeon processors for? For all inference workloads, for small and medium training models, as well as for fine-tuning or transfer learning of large and small models. In addition, all your traditional machine learning: when you're leveraging XGBoost or scikit-learn, those are well optimized for Intel Xeon processors.
Even if you're training a very large deep learning model, you can leverage multiple Xeon processors, distribute the training workload across them, and reduce the time to train, even for large language models. Here's an example from Alibaba. They did some initial testing and were excited to share that they were getting a 15.9x performance boost by leveraging both the hardware and software acceleration in this upcoming generation. One thing to keep in mind is that this acceleration works at lower precision: for 16-bit precision, specifically a numerical format called bfloat16, as well as for 8-bit precision, specifically int8.
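To make the lower-precision point concrete, here is a small, hedged sketch of opting a model into bfloat16 compute using standard framework APIs; the model and shapes are stand-ins, and whether bf16 actually maps onto AMX depends on the hardware and the library build.

```python
import torch

# Stand-in model and batch; any module would do for this illustration.
model = torch.nn.Linear(1024, 1024)
x = torch.randn(32, 1024)

# autocast runs eligible ops (like this linear layer) in bfloat16 on CPU,
# which is the numeric format the AMX bf16 path is designed to accelerate.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16

# The TensorFlow/Keras equivalent is the mixed-precision policy, e.g.:
#   tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
```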
Often, particularly for int8, developers have struggled in the past with taking a model and knowing which layers they can convert to 8-bit precision, because some models are sensitive to a loss of accuracy when you move them to 8 bits. Some layers in a deep learning model can be quantized to 8 bits, while others should be kept at higher precision, and it often takes a long time to figure out which ones to quantize and which ones to keep at higher precision.
To help our developers, we developed the Intel Neural Compressor, a tool where you take your deep learning model, whether it's a TensorFlow model, a PyTorch model, or an ONNX Runtime model, and you put it through the Intel Neural Compressor, and the output is a compressed model where all the layers, or only some of the layers, may be quantized. It keeps any layer that is sensitive to a loss in accuracy at the higher precision, so that you get the acceleration from lower precision for the layers that can tolerate it and maintain the higher-precision numerical format for the layers that would be susceptible to a loss of accuracy. This tool is super simple to use.
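Here is a minimal, hedged sketch of what post-training quantization with the Intel Neural Compressor can look like. It assumes the 2.x-style Python API; the model path, calibration dataloader, and evaluation callback are placeholders, so check the current documentation for the exact names and signatures.

```python
from neural_compressor import PostTrainingQuantConfig, quantization

def eval_accuracy(candidate_model):
    # Placeholder: run your validation set and return a single accuracy number.
    # The tuner compares this against the allowed accuracy drop and keeps
    # accuracy-sensitive layers at higher precision.
    ...

config = PostTrainingQuantConfig()          # defaults to int8 accuracy-aware tuning
q_model = quantization.fit(
    model="./saved_model",                  # TensorFlow, PyTorch, or ONNX model (placeholder path)
    conf=config,
    calib_dataloader=calib_dataloader,      # small, representative calibration data (placeholder)
    eval_func=eval_accuracy,
)
q_model.save("./quantized_model")
```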
We've worked with CERN to help them adopt it, and they showed that they were getting a 10x productivity gain. Rather than trying to find which layers they could quantize themselves, they just used the Intel Neural Compressor and were able to get the acceleration with very little effort. Again, one of our goals is to increase the productivity of our developers. One of the key oneAPI products that we use in deep learning is the oneAPI Deep Neural Network Library, or oneDNN. We used to call this library MKL-DNN, but it's now renamed oneDNN. It is included in the default versions of TensorFlow and PyTorch. If you were to do a pip install of TensorFlow today, or similarly with PyTorch, the binary includes oneDNN acceleration.
You can take the default versions and run your deep learning workloads on Intel Xeon processors with the oneDNN goodness. You might not even realize you're getting the oneDNN acceleration, because it's included by default and you don't have to change any parameters in your models. Now, that's true if you're using the recent versions. If you're using some of the older versions, then there are some parameters that you have to set. I'm including here the versions where oneDNN becomes the default in these libraries. If you're using an older version, you can come to our documentation page, which I'll link towards the end of the presentation, and set the right parameters as shown in our documentation.
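As a hedged sketch of what that looks like in practice for TensorFlow (the exact switch and the versions where it applies should be confirmed against the documentation page being referred to):

```python
import os

# In recent TensorFlow releases oneDNN is on by default; in some older builds
# this environment variable has to be set *before* TensorFlow is imported.
os.environ.setdefault("TF_ENABLE_ONEDNN_OPTS", "1")

import tensorflow as tf

# oneDNN-enabled builds typically log a line such as
# "oneDNN custom operations are on ..." the first time TensorFlow runs ops.
print("TensorFlow", tf.__version__)
```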
Again, if you're using one of the recent versions, there is nothing you have to do on your end. With TensorFlow, for example, by adding oneDNN acceleration you can get a 3x performance boost for a number of workloads. oneDNN works not just on Intel Xeons, but also on Intel Core products found in your laptops, as well as on our GPUs. Even some of our competitors, such as Arm, are contributing to oneDNN, so for some workloads you can leverage oneDNN on non-Intel platforms. We've worked with the developer community of TensorFlow, which is primarily led by Google, and we have a great relationship with them. They asked Intel to take ownership, for example, of all future Windows builds of TensorFlow, which we're doing.
We continue to support the Linux builds as well, of course. oneAPI doesn't just accelerate deep learning workloads; it also accelerates traditional machine learning workloads, including the data analytics and data preprocessing pipeline, such as with pandas. For pandas, we've introduced a library called Modin that can accelerate pandas applications with one line of code, as shown on the left-hand side. Similarly for scikit-learn, we've introduced the Intel Extension for Scikit-learn: by adding the two lines of code shown in the middle of the slide, you can get a significant performance boost, up to a 38x performance boost in your scikit-learn applications. With TensorFlow, and similarly with PyTorch, you don't have to make any changes in your code, and you get the acceleration that oneDNN provides.
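For reference, the one-line and two-line changes being described look roughly like this; the input file and column names are made up for illustration, and the package names reflect the pip packages as I understand them (modin and scikit-learn-intelex).

```python
# One-line swap: Modin is a drop-in replacement for pandas that parallelizes
# DataFrame operations across cores.
import modin.pandas as pd

# Two-line patch: Intel Extension for Scikit-learn reroutes supported
# estimators to the oneDAL-optimized implementations. Patch before importing
# the estimators you want accelerated.
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.linear_model import LogisticRegression

df = pd.read_csv("visitors.csv")                       # hypothetical input file
X, y = df[["visits", "duration", "clicks"]], df["purchased"]
model = LogisticRegression().fit(X, y)                 # now runs on the optimized backend
```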
Let me tell you a little bit about our Intel discrete GPU accelerators. We've recently introduced what we call the Intel Data Center GPU Max Series. This is a GPU that has AI acceleration built into every core. To show you some of the performance you can get, and the support across various numerical formats: you can see here that if you use the popular 16-bit precision, you can get 839 teraflops, which is significant performance for both training and inference. If you can do inference with 8 bits of precision, that doubles the available compute. Now, to feed all that compute, you need a well-structured memory hierarchy, which our GPUs provide.
You can see the high bandwidth between caches and the large cache sizes that our Intel GPUs have. We're bringing this to market in different form factors, from an OEM card to a subsystem with 4 GPUs to one that's built in with our 4th generation Xeon Scalable processors. It's going to be available through multiple OEMs; I'm showing some of them here in this slide, from HPE, Dell, and others. The last hardware product that I want to briefly mention is the Intel Habana Gaudi. We recently announced the Gaudi2 deep learning accelerator. With this accelerator, we competed in the MLPerf benchmark, which is a benchmark dedicated to deep learning workloads.
You can see on the left-hand side that the blue bars are the Gaudi2 results, and it's beating the A100 for both BERT and ResNet-50, comfortably beating it on ResNet-50. For the NVIDIA H100, the results use different numerical formats, so they are not necessarily apples to apples, but they are with the A100. This is something that we're very proud of, and we're glad to share that product with our developers so they can reduce their time to train significantly by leveraging the Habana Gaudi2 accelerator. In addition, one of the main benefits is not just the raw performance, but the much better cost. I'll show you in a few slides the significant cost advantages that this product has over the A100 from NVIDIA.
Just to give you a high-level architectural view, this is built on a 7-nanometer process. One of the big strengths of this product is not just the performance of a single Gaudi2 accelerator, as shown in the previous slide, but also the ability to do distributed training across multiple Gaudi processors. We have 24 100-gigabit RDMA NICs, so that you can take a very large model and train it across hundreds or even over 1,000 Gaudi processors. That way you can reduce the time to train. The scaling across the processors is nearly linear because of the high bandwidth between processors. This is available through various OEMs like Supermicro, Inspur, and Wiwynn.
Leveraging the acceleration is quite simple. For TensorFlow models, it takes just two lines of code to tell the model to use the Habana acceleration. For PyTorch, it takes only a few lines of code, again just to tell the model to use the Habana accelerator. At Intel, we're very transparent about performance. If you go to the Intel Habana GitHub page, you can see a number of performance results that we've measured and that you can reproduce, across various versions of TensorFlow and PyTorch and across many models. You can see all the models that have been well optimized, and the performance is not limited to the models that we're showing; it covers a much broader set of models.
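For orientation, the hooks being referred to look roughly like the sketch below. The module paths follow Habana's documentation as I recall it, so treat this as a sketch to verify against your SynapseAI release rather than a drop-in recipe.

```python
# TensorFlow: roughly the two lines that route ops onto the Gaudi device (HPU).
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
# ...then build and train your tf.keras model as usual.

# PyTorch: a few lines to move work onto the HPU and mark step boundaries.
import torch
import habana_frameworks.torch.core as htcore

model = torch.nn.Linear(128, 10).to("hpu")
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 128).to("hpu")

loss = model(x).pow(2).mean()
loss.backward()
opt.step()
htcore.mark_step()                 # flushes the accumulated lazy graph to the device
```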
This can give you an idea of the performance for the types of models that you may have. As far as cost, the first generation is already available on AWS. If you leverage the AWS DL1 instances, which are the ones that have the Habana Gaudi accelerator, you can see on the left-hand side that Gaudi is much cheaper than both the A100 high-end GPUs, in both the 80 GB and 40 GB versions, as well as the V100, for a number of models.
Again, in addition to the higher raw performance of Gaudi2, I think the main benefit to our developers is going to be the cost savings, as well as the ability to distribute a model across hundreds or even over 1,000 Gaudi instances. Sundar, the head of machine learning at AWS, highlighted the advantages that developers can get by leveraging these instances, including that you can take very large models and do inference in around 15 milliseconds. We also work with many companies in the community to make sure they can take advantage of Habana Gaudi acceleration. We work with Leidos, which is an R&D company.
They were doing COVID research, and time to train is important for them. They saw significant cost savings, over 60%, by leveraging Amazon EC2 DL1 instances over the GPU instances. We also work with Mobileye; they saw over 40% in cost savings for training their models. Gaudi can be used not just for training but also for inference. We did a fun demo using diffusion models; these are models where you pass in a text description, and they generate an image based on that description. You can see on the right-hand side the performance of Habana Gaudi2, where lower is better. You can see that across various workloads, Habana Gaudi2 comfortably outperforms the NVIDIA A100.
I mentioned that one of the key advantages is this ability to do near-linear scaling across hundreds or even over 1,000 Gaudi instances. In partnership with Microsoft Research, we have done various examples leveraging the Gaudi processors. We did some training with 512 Gaudis for multimodal understanding using some of the latest transformer-based models, as well as some diffusion-based models. We are not just working with companies; we're also working with academia. We want the developers that are pushing the state of the art, developing new models, to be able to leverage the acceleration, because we believe that this is going to advance the state of the art. One of the popular companies at the forefront of transformers is Hugging Face.
We partner with Hugging Face to make sure that the optimizations offered in the Optimum library include acceleration for both Intel Xeon Scalable processors and Habana Gaudi processors. If you go to Hugging Face, to the Optimum library, you can leverage Hugging Face models accelerated for both Xeon and Habana Gaudi processors. You can get access to the latest generation of Intel Xeon Scalable processors, the Gaudi2 processors, and our Intel GPUs through the Intel Developer Cloud. If you do a search for Intel Developer Cloud, it will bring you to the page, and you can request access to run a number of your experiments. We work with many ecosystem partners.
We have two programs that I just want to quickly highlight. One is called the Intel Disruptor Initiative. This is where we partner very closely with a number of companies that are pushing the limits of innovation. We provide them technical support so they can innovate more rapidly on Intel's software and hardware platforms. For the larger ecosystem, we have the Intel AI Builders program. We partner with many, many companies, a few of which I'm listing on the slide, across various verticals like retail, healthcare, FSI, et cetera, as well as horizontal partners. If you are a developer at one of these companies, or at another company, we'd love to partner with you. What's next?
We want to invite you to visit our developer website at developer.intel.com/ai. There you can see all the documentation for oneAPI and how you can leverage oneAPI for AI. You can learn more about the work we've done on TensorFlow, PyTorch, scikit-learn, pandas, et cetera. We are excited to partner with you to accelerate your workloads and give you additional productivity. Thank you for your time today. I'm happy to take some questions.
Hi, Andres, this is Susan. Can you hear me okay?
Yes.
All right. We have three questions that have come in during your presentation. The first one is from Shreyans. Shreyans wants to know, on the slides where you were talking about the 4th-gen Xeon: would this accelerator remove the need for any other hardware accelerator? We are using TPU and GPU.
Yes, that's a great question. Let me see if I can go back to this slide. I don't know if the audience can see it. Essentially.
Yes.
Okay. Thank you. Essentially, the acceleration built into the Xeon Scalable processors does expand into the accelerator space. I would recommend that developers start with Xeon, because I'm confident that Xeon will be able to cover the majority of your AI workloads. All your traditional machine learning should be done on Xeons, and most of your deep learning workloads, certainly all the inference workloads and the small training models. Now, if you are a developer that is constantly training very large models, then it makes more sense to have a dedicated accelerator. That's why we have both the Gaudi2 AI processors and our discrete GPUs. We do see a need for those products, of course.
But we see Xeon being able to meet the majority of your AI needs.
Okay. I have a question. Oh, sorry. Go ahead. Go ahead, Sri.
Yeah. We had another question. I'm sorry, I forgot to copy the name. Give me one second while I pull that up. It's from Stefan Tron. He says, "Hi, the Xeon acceleration seems very promising. Is there any notable hardware or software acceleration for the Core CPU families that would have a significant impact on training and deployment? Any tips and tricks to get the most out of the Core CPU family?"
That's a great question. We are adding hardware acceleration to our Core products. I cannot share all the details at this time, but we have added software acceleration. Even though I highlighted all the goodness of the fourth generation of Intel Xeon Scalable processors, the previous generations of Xeon processors, as well as our Core CPUs used in client systems and some workstations, can also benefit from the software acceleration. If you are using an older version of TensorFlow or PyTorch, I would encourage you to try the newer version, because we're constantly adding more acceleration to all these products. Many developers are using older CPUs, and many developers develop on client machines; they don't just go straight to servers.
Of course, we want developers to take advantage of the performance boost. oneDNN, our deep learning library for software acceleration, is not just targeting the latest hardware; it's also optimized for older CPUs, so that you can get a performance boost there too. Again, I would invite you to use one of the more recent versions of these libraries, whether it's TensorFlow, PyTorch, or XGBoost.
Andres, we have one more question. This is from Yustov who wants to know, is Habana available for Julia via the oneAPI library?
Yeah. Habana is targeting TensorFlow and PyTorch. As long as you're using those two libraries, you can take your workloads and run them on the Habana Gaudi processors. There is no support for other libraries outside of TensorFlow and PyTorch.
Great. Andres, I've just posted in the chat to see if there are any more questions. If you'll stick around for a little bit longer, let me see if anybody comes up with any more. Okay. I don't see any more questions appearing in the chat. Over to you, Sri.
All right. Well, Andres, since there are no more questions, we want to thank you for coming to the oneAPI Summit and presenting for us. If there's any chance you want to hang out on Discord at the end of the conference and answer any questions, that would be fantastic. Other than that, this concludes our presentation. As a matter of course, please close this session and then join us for the next session starting at 10 A.M. We will see you in the next time slot. Thank you for attending this session.
Hello, everybody, and welcome to the talk, Using oneAPI to Predict Anonymous Web Visitor Behavior. Our speaker is Peter Ma, and he will show us how SiteMana benefits from oneAPI and daal4py. Peter is a co-founder at SiteMana, an AI company that predicts anonymous visitor purchasing intent. He's also an Intel Software Innovator, TED speaker, and TechStars alumnus. If you have any questions throughout the presentation, put them in the chat box, and Peter will answer them at the end of his presentation. Over to you, Peter.
All right. Good morning, everyone, or good evening depending on where you are; good afternoon somewhere, probably somewhere in Europe right now. I'm based in San Francisco, and I'll be talking about how we can predict purchasing behavior when someone just visits your website. Again, we're SiteMana. We use AI to identify and retarget anonymous site visitors with high purchasing intent. Currently, I'm a co-founder and CTO, part of the TechStars alumni, an Intel Software Innovator, NVIDIA AI Innovator, and Arm Innovator, and I've won more than 150 hackathons, probably the most in the world as far as I know. I've previously started different projects. My last company was Mixpose; we used AI for yoga poses.
We raised over half a million dollars. Before that there was Clean Water AI, with which I won more than $300,000 just from hackathons. Doctor Hazel was AI for skin cancer, and HiSnooze was AI to detect drowsy driving. I built a lot of projects, and one thing they have in common is that none of them ever made me money. When I started new projects, I started finding out where the pain points of the users are. This is where we came in, finding out that people spend almost $6-$7 on a single click through online advertising platforms.
As you probably heard, there are a lot of these D2C, direct-to-consumer products, such as Casper, the mattress company, and many others across the world, and their sales cycle begins purely through advertising. The thing is, the model people used to go IPO with is no longer viable because of rising advertising costs, and that's where we come in to solve it. Next slide. How are we actually driving a lot more ROI? When a visitor comes to the website, we use AI to predict the likelihood that the user is going to purchase.
How this works is that you train the model on people who have already made a purchase. When a new person comes in, you compare their behavior against the behavior of people who have already purchased. Once we figure out they're likely to make a purchase, you can use real-time communication such as popping up coupons and things like that. This way you don't annoy the user the moment they come in with "Please sign up with your email," which is sort of annoying for the user. From there you have aggregated targeting of the high-intent purchasers.
From this, we can provide their email addresses, their phone numbers, and their postal addresses for retargeting purposes. This is all powered by oneAPI; we're basically using daal4py as well as scikit-learn for our project. We have a live demo for you, so let's see it. This is open source; I'm going to put it on GitHub very soon. As you can see, this is in a Jupyter notebook. From here, in a Python notebook, you can pretty much launch this by importing the libraries. After that, the monotrain CSV is the website data you want to train on.
For privacy purposes, we anonymized all the data for the demo. From here, we basically get their visits, duration, clicks, and the probability of a purchase, as you can see. Once we run this, it loads the data in. After loading the data in, we start training on the dataset using a linear regression based on those three parameters. Training and testing done; here's basically our model, a linear regression batch model saved as a PKL file. Once we load the model, these are the features we're able to get.
Now let's make a prediction. Next, we load all the new users into the prediction step. The three parameters are pretty much visits, duration, and clicks. Let's say the person has, like, 20 visits, 25 minutes of duration, and 30 clicks. Again, this is a simple version. You can see there's a 73% chance the person is going to make a purchase. Let's say we lower the visits and still maintain the duration and clicks: if the visits come down to five, the purchase chance becomes 44%. What if we decrease the duration as well as the number of clicks? Let's do seven.
If you only click seven times on the website, it decreases down to 22%. What if we have, say, 20 visits? That increases the chance again. This is the prediction that lets us take actionable steps on the site as well as use it for retargeting. Let me go back and finish the demo. Okay. This is the very basic version of what we did; obviously it's a lot more complex than that. In our actual model, we use up to 220 different types of parameters, between online and offline parameters. That includes scrolling, hovering, how long they stay on each page, and so on, plus offline parameters like their credit score, their home worth, and how much money they make; these things all help determine whether the person is likely to make a purchase. Because of oneAPI's processing and inference speed, we can pretty much do this for every visitor that comes to the website. One thing is, coming from an innovator and hackathon background, I've always focused on what my ideas are and how the technology can fit in to solve a problem.
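Here is a minimal daal4py sketch in the spirit of the demo just walked through. The CSV file, the column names, and reading the regression output as a purchase probability are assumptions made for illustration; the production SiteMana model uses far more features than these three.

```python
import numpy as np
import pandas as pd
import daal4py as d4p

# Hypothetical anonymized training export: visits, duration, clicks -> purchased (0/1).
df = pd.read_csv("mana_train.csv")
X = df[["visits", "duration", "clicks"]].to_numpy(dtype=np.float64)
y = df[["purchased"]].to_numpy(dtype=np.float64)

# Train a oneDAL-backed linear regression (the "linear regression batch" model in the demo).
train_result = d4p.linear_regression_training().compute(X, y)

# Score a new visitor: 20 visits, 25 minutes of duration, 30 clicks.
new_visitor = np.array([[20.0, 25.0, 30.0]])
pred = d4p.linear_regression_prediction().compute(new_visitor, train_result.model)
score = float(pred.prediction[0, 0])
print(f"Estimated purchase likelihood: {score:.2f}")   # interpreted as a probability in the demo
```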
When you actually launch a startup in the real world, you start finding out what users care about. In the end, users care about retargeting higher-intent visitors, so that they don't send emails to people who are unlikely to convert. The thing they care about most is ROI. When users ask us, they're literally just telling us, "Hey, I'm giving you $100. How much money am I getting back?" All these ideas of "hey, we can use this for all sorts of things" kind of fade away, because when you're actually doing a startup rather than a project, you have to be user-driven.
It's not about our own ideas anymore; every single feature we build is based on what users tell us, on user requests, and on what users are willing to pay for. In this specific case, users specifically want higher-intent, focused people, so that they spend less on advertising and less on emailing. Our target market currently is e-commerce, financial services, and technology. In terms of e-commerce, our customers usually include women's healthcare products, as well as things like olive oil, down to people selling fossils. They generally see a 28% revenue increase.
For e-commerce, it's extremely easy for us to tell what kind of ROI they're receiving. For every $1,000 they put in, they usually get $3,000 back. For financial services, we increase their bottom line, but we can't really measure their ROI because of the tracking method, since direct response means you follow up and call the users. It's more than just what the technology can do; it's about LTV, the lifetime value of a customer, rather than AOV, the average order value.
On average, when you use this technology, you generate at least a 25% increase in revenue per month, after just one month. As for our business model, we usually let you use the first month just to see how much money you're making, and you pay us a fee in the second month based on the first month. Here is a case study for using us, and for oneAPI specifically. When you use the traditional method, you kind of have to use multiple different stacks: you need a stack for training, so with TensorFlow you use TensorFlow to train and TensorFlow Lite to actually do inference. In oneAPI, everything is under one stack.
Normally, every time you change a server, you have to update the server drivers. With oneAPI, it supports multiple hardware, so it doesn't really matter what driver a specific machine has. If you use TensorFlow, you know that with every TensorFlow update, your entire stack, the drivers and everything, kind of gets screwed up. In terms of training benchmarks, oneAPI is actually about 15% faster if you have over 10 million records in your dataset. The learning curve for scikit-learn on oneAPI is such that you can practically do this project in under 24 hours if you have samples.
I will update this in our open source project, so you can train your own version. Most importantly, when you use PyTorch or TensorFlow, it becomes more of a vendor lock-in. oneAPI is completely open, so you can switch to something else pretty quickly. If you want to convert your anonymous web traffic into revenue, feel free to email me at peter@nana.com. I can start taking questions.
Peter, this is Susan. I did not see any questions yet in the chat. Why don't we just give it a few seconds here?
Right. It's pretty simple.
Peter, I'm not seeing any questions. What I think we can do is go ahead and close out the session. I want to say thank you, Peter, so much for showing us how oneAPI and daal4py are used to predict customer purchasing behavior. Wait, before I close it out, there is a question from Ashok, who asks, "What are the parameters used to predict behavior?"
In the demo, we pretty much showed you the number of visits, the number of clicks, and the duration; we use those to predict. That's for the open source demo. In our closed-source, actual product, we obviously use a lot more than that. We use your scrolling, the number of times you hover, and so on, online. I think we use about 50 online parameters and about 150 offline parameters. The offline parameters actually include your name. From the name itself, we use NLP to estimate the likelihood of purchasing.
If one Alex is likely to purchase, when another Alex comes in, does that increase or decrease the chance that that person will make a purchase? From the name itself, you can actually infer ethnicity and things like that; again, we don't go that far. We basically just use part of the NLP to get that. There's also location and things like that.
Okay, wonderful.
What are some-
Thank you. Oh, do you see the question from Satish?
Yeah. What are some of the challenges you're seeing in this domain? Some of the challenges in this domain come from it getting ultra-competitive: every brand is trying to get an edge to actually make a profit, because each channel is getting more and more saturated. The more ads you display, the more money you pay, but it doesn't mean those people are going to convert. What they really care about is conversion. Like, how many ads do I put in? If I give you 20 people, 20 contacts, right?
With 20 hashed emails that you put into Google Ads, if none of these 20 convert, then you've basically paid for all the clicks from those 20 people. But if I give you 10, and let's say five of them are going to convert, that increases your ROI by a whole lot. This is where AI prediction becomes really useful in this industry, as well as being a different retargeting tool. What advice would you give to those who might want to start a company in this space? I have to tell you that if you want to make money, e-commerce is actually a great space to get into.
We originally started this application because we wanted to do AI prediction for user testimonials. Our first idea was actually that when a user comes to a website, we would use that to figure out what the user's social network is. From the social network, we would capture a photo, whether from Instagram, Facebook, or LinkedIn. Then we would have, like, 20 different testimonials on the site, and we would match the face of the visitor to the face of the person giving the testimonial. This way you're kind of seeing yourself giving a testimonial.
In this case, if you have an Asian guy coming to a wine shop, you'll see an Asian guy giving a testimonial, "Hey, you know, this is a great wine." If you see a grandma coming to the website, you will see a grandma giving a testimonial like, "Hey, this is a great wine. I really like this wine because of A, B, C." What we found out was that the friction was in users getting the testimonials. We had the technology pretty much ready, but the user testimonial quality and quantity were both lacking; users could barely get two.
Eventually what the users told us was, "Hey, weren't you guys getting the emails and that information on the users?" We're like, "Yeah, but the thing is you need to give us testimonials for this to work." They're like, "Hey, the thing is, it's hard for us to get testimonials. Can't you just sell us the emails?" This is how the entire business started. We're like, "Yeah, of course we can." We swapped out almost 50% or 60% of the code, and we're like, "This is much easier." We were going to get that information to give to you anyway.
From there, the users were like, "Hey, we're not driving ROI. A lot of these emails are turning into spam." That's where we're like, "Okay, look, what can we do with this?" This is the point where we use oneAPI to identify the people who are not likely to respond to your email. If they have no intent of purchasing, then we shouldn't even put them into TikTok or Google Ads. That's how this business started. Is the 15% faster on GPU or Xeon? Currently, we're actually just running this on Xeon.
The GPU, unfortunately, is pretty expensive. Because we're doing machine learning and not really deep learning, the Xeon itself can process almost 20,000 within a minute, so speed doesn't really matter so far. I have not tried running this on FPGA because we don't need FPGA yet; we're merely running this on the server. All my previous projects were AI on the edge, but this is actually AI on the server, because every visitor has to go through it, like Google Analytics.
Just imagine Google Analytics, where every single one of your visitors needs to run through the entire process. Now imagine every one of our customers and every one of their visitors coming in. We process almost 1 million hits a day.
All right.
If you have any more questions for Peter, head over to Discord after the event to ask them and continue the conversation. This concludes the presentation. We're going to have a 10-minute break; get a beverage or a snack, and come back at 10:35 A.M. Central. Please close this window and go back to the agenda page to join the next presentation, Efficient Inference and Training of Large Neural Network Models. See you there.
All right, everyone. Welcome back. I hope you had a great break and have a beverage or a snack in hand. We're now going to start our session, Efficient Inference and Training of Large Neural Network Models. Our speakers are Zhen Dong and Kurt Keutzer. A brief introduction for both of them: Zhen Dong received his BS from Peking University in 2018 and a PhD from the University of California, Berkeley in 2022, and he's currently a postdoc at UC Berkeley working with Professor Keutzer. His research interests include efficient deep learning, quantization and model compression, and hardware-software co-design. Our other speaker is Kurt Keutzer, a professor of EECS at the University of California, Berkeley, where he is a member of the BAIR Lab and co-director of the Berkeley Deep Drive research consortium.
His research covers all aspects of deep learning. I'll give the floor to you both.
Great. Thanks, Ray. Are we all good with the audio? I hope that's a yes. I'm just here to do a brief introduction at Intel's request, and thanks for this opportunity to share our work. Since 2008 or so, my group has been working on diverse application areas in computer vision, audio, multimedia, and natural language processing, and our approach has been initially machine learning and then, more recently, deep learning. That transition occurred around 2012 or so, when we saw one by one, beginning with computer vision, that deep neural nets were supplying the most efficient and most accurate approaches to all these problems. That's quite surprising, isn't it?
We have a single algorithmic paradigm which is able to cover these very, very broad research areas, again, from computer vision through speech, recommendation systems, and most recently natural language. To take that even a step further, not only is a single algorithmic family of deep neural nets solving the problem, but increasingly transformers are providing those solutions. Our approach to these applications has been to focus on efficiency, and I think we're best known for our work on efficient inference in the edge with our squeeze family of deep neural nets, beginning with SqueezeNet, and most recently, we presented Squeezeformer at NeurIPS just this last week.
I think what a lot of people don't realize is that we've only been able to do this experimental work in deep neural nets because we became adept pretty early at efficient and scalable training of deep neural nets in the cloud, and then went on, of course, to attack inference. This work on efficient training and inference in the cloud is what Zhen will be talking about. Zhen, if you want to take it from here.
As we know, model size and computation are increasing. In this figure, the y-axis is the parameter count. As we can see, compared to BERT-Large in 2018, the parameter count of Megatron-Turing has increased by over 1,000x, to a formidable 530 billion parameters. As a result, performing inference and training of these models becomes very hard. Here I will talk about our efforts to achieve efficient inference and training on distributed systems. First, I'll introduce LTP, a fast post-training pruning framework for transformers. As shown in the figure, we propose a three-stage pruning pipeline to obtain high accuracy without retraining.
Based on Fisher-based mask search, Fisher-based mask rearrangement, and a final mask tuning stage, we can achieve an accuracy-FLOPs trade-off comparable to other state-of-the-art transformer pruning methods. In the figure, we show results on four different NLP downstream tasks. From the table, we can see that the end-to-end pruning time of our LTP method is two to three orders of magnitude less than other methods: LTP generally takes a few seconds, while previous methods can take several hours. LTP was published in KDD 2022, while a more advanced version of the method was just published in NeurIPS 2022. Besides inference, we studied efficient training in our work on staged training for transformer language models.
In order to train a large transformer-based model, the main idea of this work is to first train a small model, then apply an expansion to the small model to create a large model, and finally continue training the large model. As we can see in the figure, this expansion operation in the training process can actually preserve the final accuracy while converging faster than the baseline. For example, it can save up to 20% of total compute for training GPT-2 Large. In the following work, TASC, we also explored general efficient distributed training that works for both CNNs and transformer-based models. To accelerate standard data-parallel distributed training, there are two major problems with previous works.
The first is that previous methods need a very high sparsity level to amortize the overhead of supporting sparsity, and as illustrated here, the effective sparsity level goes down as the number of machines in a node goes up. For example, in this figure, each local machine has a very high sparsity level, but after gradient averaging on the main machine, the aggregated gradient becomes about 3x denser. The second problem is the top-K selection, where K means we only communicate K values instead of all the gradients during distributed training. The current top-K selection is based solely on parameter size, meaning that if a layer has fewer parameters, it rarely gets updated during training. We know that different layers in a neural network have different roles, and therefore different sensitivities.
As such, we should consider the importance of different parameters when conducting the top-K selection. For a better understanding of this sensitivity, in these figures we show that the importance of a gradient value is relative to the topology of the loss landscape: a small value in a sharp loss landscape can be important, while a large value in a flat loss landscape may not be. This topological property of the loss landscape can be captured by leveraging first- and second-order information, and based on this we propose our solution, topology-aware structured communication. Briefly, we use the granularity of channels to avoid the extra cost of supporting sparse communication.
That is to say, we only communicate the top-K channels in the neural network. The tensor is dense inside each communicated channel, so our communication can be more efficient without the need to support unstructured communication or unstructured sparsity. To find the sensitive channels, we look at the Taylor expansion listed here and use Hessian-based methods to efficiently collect first- and second-order information for the different channels. I saw some questions in the chat; we will have those answered during the Q&A session. A rough sketch of this kind of channel-wise top-K selection follows.
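As a very rough illustration of the idea only (not the actual TASC implementation; the sensitivity score here is a made-up magnitude-based stand-in for the Hessian-based metric described in the talk), channel-wise top-K selection might look like this:

```python
import torch

def topk_channel_selection(grad: torch.Tensor, k: int, sensitivity: torch.Tensor):
    """Keep only the k most 'sensitive' output channels of a conv gradient.

    grad:        gradient tensor of shape [C_out, C_in, kH, kW]
    sensitivity: per-channel importance scores of shape [C_out]
                 (in the talk this would come from first/second-order
                 information; here we just assume it is given)
    Returns the dense values of the selected channels plus their indices,
    which is all that would need to be communicated.
    """
    topk = torch.topk(sensitivity, k)
    selected = grad[topk.indices]          # dense block: [k, C_in, kH, kW]
    return selected, topk.indices

# Toy usage with a magnitude-based stand-in for the real sensitivity metric.
grad = torch.randn(64, 32, 3, 3)
sensitivity = grad.flatten(1).abs().mean(dim=1)
values, idx = topk_channel_selection(grad, k=8, sensitivity=sensitivity)
print(values.shape, idx.shape)  # torch.Size([8, 32, 3, 3]) torch.Size([8])
```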
Moving on: in this table, we show the speedup achieved when training VGG-19 on ImageNet. The Ring-AllReduce-plus-hiding entry refers to the PyTorch solution, where communications are partially hidden behind the corresponding computations. We compare the original Ring-AllReduce, Ring-AllReduce plus hiding, our method TASC, and no syncing at all, and list the speedups here. We can see that TASC achieves a better speedup than PyTorch on a single node. Furthermore, as we show in this table, the speedup achieved by TASC using multiple nodes is even more significant. This is due to the low bandwidth of the InfiniBand between different nodes.
Compared to the previous slide, where the TASC speedup is around 2.6x, here we can actually achieve a speedup of over 10x. That's because we aggressively compress the communications, so the communication workload between different nodes becomes much smaller and we get a better speedup number. Although, as I mentioned, TASC generally takes care of CNNs and transformers, more issues arise for the training and inference of even larger models, such as recommendation systems, which we try to address in our work on DQRM, which stands for Deep Quantized Recommendation Model. As demonstrated in previous works, training DLRM models leads to fast overfitting; we can clearly see it in the orange and blue lines of the figures.
In our work on DQRM, we found that 4-bit quantization of the embedding tables can actually alleviate this overfitting, as illustrated by the green lines. This observation is consistent across all our experiments on both the Kaggle dataset and the Terabyte dataset for DLRM. Consequently, we are able to take advantage of a smaller, quantized version of the recommendation model while matching or even outperforming the baseline accuracies. In order to achieve this 4-bit quantization efficiently, we propose two techniques which significantly improve the overall performance of the system. First, as illustrated here, the embedding tables of recommendation models generally have formidable sizes and are the bottleneck of the system, so we propose a better quantization-aware training pipeline with no extra copy of the embedding table.
The details are illustrated below: we take advantage of the fact that only a few rows of the embedding table are actually active during each iteration, so we only copy or save the data for those specific rows instead of copying the whole embedding table. This significantly reduces both the time cost and the memory cost. Second, as shown in the bar graph of inference latency, the process of calculating the quantization scales can actually cost half of the time during quantization-aware training, which we refer to as QAT. Going further, we also tested QAT without calculating the quantization scales and without activation quantization.
We also compared training without the scales, without activation quantization, and without dequantization. You can see that the speedup from all the other components, such as activation quantization, dequantization, and rounding, is actually trivial compared to the time cost of calculating the quantization scales. The reason is straightforward: the embedding table is so large that it is very costly to recompute the scales in every iteration. As a result, we propose to periodically update the quantization scales during QAT of the DLRM models. The resulting training logs are shown in this slide, and we find that recalculating the min and max once every 200 iterations doesn't hurt the accuracy. A minimal sketch of this periodic-update idea is shown below.
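For illustration only, here is a minimal sketch of the periodic-scale idea, with a simple min/max scale and a hypothetical update interval; the actual DQRM training loop and quantizer are more involved than this:

```python
import torch

class PeriodicScaleQuantizer:
    """Toy 4-bit uniform fake-quantizer whose scale is refreshed only every
    `update_every` steps instead of at every iteration (a hypothetical sketch
    of the periodic-update idea, not the DQRM implementation)."""

    def __init__(self, num_bits: int = 4, update_every: int = 200):
        self.qmin, self.qmax = 0, 2 ** num_bits - 1
        self.update_every = update_every
        self.step = 0
        self.scale = None
        self.zero_point = None

    def __call__(self, weight: torch.Tensor) -> torch.Tensor:
        # Recompute the min/max-based scale only periodically (the expensive part).
        if self.scale is None or self.step % self.update_every == 0:
            w_min, w_max = weight.min(), weight.max()
            self.scale = (w_max - w_min).clamp(min=1e-8) / (self.qmax - self.qmin)
            self.zero_point = self.qmin - w_min / self.scale
        self.step += 1
        q = torch.clamp(torch.round(weight / self.scale + self.zero_point),
                        self.qmin, self.qmax)
        return (q - self.zero_point) * self.scale  # fake-quantized weights for QAT

quantize = PeriodicScaleQuantizer()
emb = torch.nn.Embedding(1000, 16)       # stand-in for a DLRM embedding table
fake_quant_weight = quantize(emb.weight)
```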
The periodic update is actually beneficial: convergence is smoother, with less fluctuation. Finally, in order to get the best performance, we want to support sparsification and quantization simultaneously in our system. However, on GPUs the NCCL backend currently doesn't support sparsification well, and although the Gloo backend is available, it has many restrictions. In contrast, on CPUs we have more flexibility: the Gloo, MPI, and oneAPI oneCCL backends are all available. In our DQRM work, we use oneCCL for the best support and optimization; a minimal sketch of selecting that backend is shown below.
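As a minimal sketch (assuming the oneCCL bindings for PyTorch are installed; the import name has varied across releases), choosing the oneCCL backend for distributed CPU training looks roughly like this:

```python
import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend (older releases used torch_ccl)

# A launcher such as mpirun would normally set these; shown here for clarity.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="ccl",                                  # oneAPI oneCCL collectives on CPU
    rank=int(os.environ.get("PMI_RANK", 0)),
    world_size=int(os.environ.get("PMI_SIZE", 1)),
)

model = torch.nn.Linear(16, 1)
ddp_model = torch.nn.parallel.DistributedDataParallel(model)  # gradient sync runs over oneCCL
```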
Here's a brief conclusion. In our work, we systematically studied efficient inference and training of large neural network models. First, with LTP, we accelerate the inference of transformers. With staged training and TASC, we accelerate the training of CNNs and transformers. For even larger models, such as DLRM, we propose DQRM to alleviate the cost of communication during training on distributed systems. We find that CPUs with oneAPI oneCCL are actually well suited to running the training and inference of large models such as DLRM and other recommendation models. Our code is open source in the following repos, and I think the slides will be made public; thanks for your interest. I think we can just move on to the Q&A session, and I'll give it back to Sri.
All right. Thank you, Zhen, for your talk. We have a couple of questions, both of them from Ashok. Let me give you the first one. On slide 15, Ashok asks: the top-K selection, is it over the model parameters?
Uh, let me-
Wanna go back to slide 15 there? Ashok, you can type in the chat if you wanna go further, if that's where you were looking. I recorded it as slide 15 when you asked.
Sure. I think the top-K selection is on another slide, but it's in the list.
Yeah.
Yeah. It refers to.
Zhen, why don't you go to the top K selection slide?
Yeah.
Yeah. It is related to the parameters, specifically to the gradients. We know that the number of gradients is the same as the number of parameters. During each iteration of communication, we only want to communicate the top K values out of all the gradients, so that we can save bandwidth and communication. Does that answer your question?
While we wait for Ashok's answer, he has another question, around slide 19. Ashok asks: why does the time increase with multiple nodes? Either slide 18 or 19.
Yeah. Oh, I think that's.
Why is the time increasing, I think, is the question.
Yeah, yeah. With multiple nodes, we actually have an extra time cost for the communication between nodes; the communication between the machines inside a single node is much faster than the interconnect between nodes. That's where the extra time cost comes from. Maybe the question is about why the time increases when we have more machines. In the previous slides, some of the numbers are around 1K, and in this multi-node slide the times increase. The reason is the overhead, just as I mentioned.
Since we have more machines, and since we are using the distributed data parallel support from PyTorch, more machines means more communication, because we need to synchronize across all the machines. Yeah, I hope that helps.
All right.
Excuse me. Zhen, do you want to explain the difference between the time column and the communication time column?
Oh, yes. The time column is the total time cost of each iteration, and the communication time is just a portion of that total: it's the time used to communicate the gradients. Yeah.
I think what may be counterintuitive here is that we're doing distributed training to accelerate the wall clock time, right? What you're showcasing here is essentially the communication time because that's what you're actually going to try to reduce, right?
Yeah.
Okay. Andrew Downs had a follow-up question about the interconnects. What kind of networking is really ideal for multi-node training? Is gigabit sufficient or do you really want something like InfiniBand?
From my side, I'd say the better the networking, the faster the system. Our work mainly focuses on reducing the communication cost, so from our side the networking can be less critical, but for normal distributed training, the networking is very crucial. That's what I can say from my side. Kurt, do you want to comment on that?
Yeah. I don't think we have the slide in the deck, but if you look at arithmetic intensity, which is essentially the amount of computation per unit of data movement, then as you move from something like computer vision and the VGG-19 net to natural language and then to recommendation systems, the amount of communication required increases dramatically. Basically, for both natural language and recommendation systems, communication is the bottleneck, and for many of your customers, such as Meta, it already is.
80% of Meta's, quote, AI computation, or AI workloads, are in recommendation systems, and a growing percentage are in natural language. As long as those are your workloads, it's probably worth really investing in the fastest communication you can get if you wanna run those workloads as quickly as possible.
Andrew had a follow-up to that, and then we'll move on to another person. When designing an AI system, would you say that the network traffic generated warrants a substantial networking investment?
Yeah. You know, on data center economics, I can't say that I have enough information about the additional quality of service, you know, to what extent that will be a revenue generator. If we look at this strictly technically, I think it's not very mysterious. If you have a bandwidth-limited problem, and here, as you do distributed training, the bandwidth limitation remains no matter the processor, because, as you can see, there's only so much you can hide in terms of the latency. When you have a bandwidth problem, then that investment in the interconnect is worthwhile from a technical standpoint.
In terms of the overall business of the data center, I don't have all the facts.
Mm-hmm.
to say that investment will be justified by, say, the increased click-through rate and so forth. From a technical standpoint, it's clear. I hope.
All right. I have two more questions, and we have two minutes left, I believe. One is from Vlad-Andrei Negru, and he says, "Hello. In one of the first slides you showed that you worked on machine learning and deep learning-based audio enhancements. What kind of audio enhancements was that slide referring to?"
Yeah. Okay. The audio enhancement is essentially noise elimination. Just to be completely transparent, initially we did work on this using various machine learning approaches. Most of the deep learning work that we've done was really, to be honest, more me looking over the shoulder of engineers at a company called BabbleLabs that I was advising, which was acquired by Cisco. Basically, we're living proof at this very moment that during lockdown, and in the move to more virtual work, the quality of the audio that I'm speaking over right now became permanently important, and the elimination of noise and other artifacts became really crucial.
All right.
Yeah. Let me just say that our latest work is actually on speech recognition; I can point you to something at NeurIPS for that. The audio enhancement work was many years ago.
Excellent. Let's see, I believe we're at time. Thank you, Zhen and Professor Kurt. If you have any more questions for them, you can head over to Discord after the event and ask away; we'll post the link later. That concludes this presentation. Thank you both for presenting, really appreciate it. Please go ahead and close this session. Our next one is Accelerating PyTorch Deep Learning Models on Intel XPUs. See you there.
Hello, everybody, and welcome to the first hands-on training of the day, Accelerating PyTorch Deep Learning Models on Intel XPUs. Our speakers are Pramod Pai and Ashok Emani. They will take us through computer vision models using the IPEX optimize API and NLP models with quantization. Pramod is an AI software solutions engineer at Intel who enables customers to optimize their machine learning workflows using solutions from Intel. His areas of focus include the Intel oneAPI AI Analytics Toolkit and the Intel Extension for PyTorch. Ashok is an AI frameworks engineer working on enabling Intel optimizations for PyTorch. If you have any questions throughout the presentation, please put them in the chat box, and they will answer them during the presentation and also at the end. Over to you, Pramod.
Thanks for that introduction, Susan. Hi, everyone. Welcome to this session, Accelerating PyTorch Deep Learning Models on Intel XPUs. Today we will begin with a short presentation and after the presentation there's gonna be a hands-on lab. For the hands-on lab, all the attendees will be receiving a link to a JupyterLab session with the notebook that has all of the code that we're going to be using today. We'll drop a spreadsheet with all of the links for you to use, and we'll get onto that once we begin the hands-on part of today's presentation. With that, let's begin with the presentation for today. Here's the agenda for today. There's a bit of an overview about what we're going to be talking about.
We'll learn what Intel optimizations for PyTorch are, what the Intel Extension for PyTorch is, and how they are different. There's also going to be a performance showcase; we'll talk a bit about the performance numbers in the slides. Firstly, I would like us to go through this diagram of the Intel optimizations for PyTorch. This is the structure, as you can see. At the bottom, you can see that the recommended hardware is Intel Xeon Scalable processors and Intel Xe platforms like the GPU. Based on this backend hardware, we utilize oneDNN and oneCCL as acceleration libraries to speed up model execution. On top of these two libraries, we have optimized PyTorch and the Intel Extension for PyTorch.
The term Intel optimization for PyTorch actually involves two components: one is stock PyTorch, which is released officially by Meta, and the other is the Intel Extension for PyTorch. On top of this combination, we have optimized workloads belonging to the PyTorch ecosystem, like TorchVision and TorchServe, as well as topologies in Hugging Face and PyTorch Lightning. You can also optimize generic PyTorch workloads as well. Moving on to the next slide: there are several major optimization methodologies that we've utilized to accelerate PyTorch performance. The first is that we do general performance optimizations and enable new Intel features in PyTorch upstream; that means we contribute code directly to stock PyTorch.
At the same time, we deliver additional performance boosts and early adoption of aggressive optimizations through the Intel Extension for PyTorch. There are three major aspects to the optimizations we do in the extension: operator, graph, and runtime optimizations, as you can see on this slide. To speak briefly about each of these: the first is operator optimization, which involves vectorization and parallelization to maximize CPU efficiency and utilization. We've optimized memory-related operations as well as low-precision data types. The second major methodology is graph optimization.
As you know, PyTorch by default runs in eager mode, but if you convert to TorchScript mode, you can turn the whole topology into a graph and apply fusions to that graph to improve performance. As you can see on the slide, operator fusion and constant folding are the two major components here. Finally, the third one is the runtime extension. This can further avoid overhead by using thread affinity and tweaking memory allocation; in essence, you can maximize inference throughput. That was an overview of the different optimization methodologies. Here we show how Intel has been contributing source code to PyTorch upstream. We started these optimizations in 2018; at the time, we targeted Skylake servers with the float32 data type.
Again, we utilized oneDNN as the computation backend to accelerate the operations. We utilized the VNNI instruction set on the Cascade Lake Xeon Scalable processors, bfloat16 on Cooper Lake, and now AMX and again bfloat16 on Sapphire Rapids. We have contributed more and more features into stock PyTorch. At the same time, we're also implementing more aggressive optimizations in the Intel Extension for PyTorch, which we'll look at during the hands-on session. All right, this is the architecture of the Intel Extension for PyTorch. The extension extends PyTorch with optimizations for an extra performance boost on Intel hardware. It supports both eager mode and graph mode, as you can see, and also the runtime extension. The extension is open source on GitHub.
You can read and access the code, and you can even contribute to the extension. The optimized operators you see here are registered through the PyTorch dispatch mechanism, so from PyTorch's perspective they are just normal operators. The extension can be loaded dynamically in a Python script, which you will see during the hands-on lab, and you can also choose not to import it. It doesn't only work for Python: we also provide a C++ interface, so you can dynamically link the extension into your C++ application. On the next slide, we have some example usage to take a look at.
We set the model to eval mode for inference, and ipex.optimize is the API that accelerates the workload. On the right, you can see how bfloat16 is utilized: if the hardware supports instruction sets such as AVX-512 bfloat16 or AMX bfloat16, you can use it for both training and inference. At the bottom right, torch.cpu.amp.autocast is used. This applies automatic mixed precision, where some layers are converted to bfloat16 depending on their type, so that the whole execution is sped up. A minimal version of that usage pattern is sketched below.
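As a rough sketch of the usage pattern on the slide (the ResNet model and input shape here are just placeholders, and torch.cpu.amp.autocast has been renamed in newer PyTorch releases):

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights=None).eval()   # inference mode
data = torch.rand(1, 3, 224, 224)

# FP32 path: wrap the stock model with the IPEX optimize API.
optimized_model = ipex.optimize(model)

# bfloat16 path: let IPEX prepare the model for bf16, then run under autocast.
bf16_model = ipex.optimize(model, dtype=torch.bfloat16)
with torch.no_grad(), torch.cpu.amp.autocast():
    output = bf16_model(data)
```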
Now I have some performance snapshots here; we've chosen models from the TorchVision and Hugging Face ecosystems. You can see the speedup you get using the Intel Extension for PyTorch compared to stock PyTorch. The Faster R-CNN you see here is one of the models we're going to work with today during the hands-on session. These performance numbers are for batch inference. The next slide is for online inference, that is, real-time inference where you're running just one sample. Here also you can see the performance comparisons between stock PyTorch and the Intel Extension for PyTorch, and you can take a look at the versions we've used for the comparison. All right, moving on.
Here I'd like to highlight that Intel works with common ecosystem projects, as I mentioned, like TorchServe and Hugging Face. This particular slide is actually from Hugging Face, from when they scaled up BERT-like models for inference. This performance was achieved by enabling features of the Intel Extension for PyTorch, or IPEX, thereby enabling Intel's optimizations. You can use this link to learn more about how Hugging Face scaled up BERT-like models and how Intel's technologies were used. That has been a brief introduction to the extension. Now we're going to move on to the hands-on lab. Before that, there's a small call-to-action slide here.
We recommend using the latest PyTorch release and the latest Intel Extension for PyTorch release, which happens to be 1.13.0 for the extension. There's also a Model Zoo, which has a lot of training, inference, and benchmarking scripts that we would really like you to try out. Right, now I'm going to share my screen so that we can begin the hands-on portion of the session. As I said, we have prepared links for you to use; these links point to JupyterLab instances preloaded with all of the code. Let me share the links in the chat box before I share my screen. All right. There's a section in the spreadsheet to write your name before you claim a link.
I would request you to please do that before you claim a link. There are two spreadsheets here, and there's a token as well to authenticate these links, and I've pasted that in as well. Please go ahead and access these links with this authentication token, and we'll wait for a few minutes before everybody is set up with their link so that we can begin the hands-on section. All right. I'm sharing my screen now. There we go. I see that the spreadsheet is being filled out. We'll begin in a couple more minutes. All right. I think we can get started with the hands-on section. I hope everybody has their notebooks set up. I'm just going to restart the kernel, like so, before I get started. I think it should be fine for everyone else's notebooks.
All right. In this hands-on lab, we're going to explore the features of the Intel Extension for PyTorch, also known as IPEX, and see practical examples of everything we spoke about during the presentation. Some key takeaways from this lab: you'll learn how to get started with IPEX for drop-in acceleration just by changing a few lines of code in your stock PyTorch code. You'll learn how to use the optimize method, which simply wraps around your PyTorch model and provides the optimizations. You'll learn how to use the quantization features provided by IPEX; quantization converts models that use the FP32 data type into lower precision so that they run faster. We'll look at more details about quantization.
Finally, you'll learn how to use the IPEX launch script module. This is another feature of IPEX which, on top of the optimizations we discussed, allows us to tune things further in order to achieve maximum throughput. We'll focus on two types of workloads: a computer vision workload and an NLP workload. For computer vision, we're going to work with the Faster R-CNN network with a ResNet-50 backbone; this is a ConvNet that does object detection. Here we'll use the optimize API, and we'll also see how TorchScript can be used to speed up execution for deployment. We'll start by importing all of the packages. You can just use Shift + Enter to run each cell if you don't want to press Run each time.
We're just going to prepare some data to view the performance gains on; here it's just a randomized tensor of this shape, as you can see. There are some helper functions here that we're going to use throughout the hands-on, which will just make things easier. I'd suggest taking a few seconds to look at the code and see what each function does. First, we have load model in eval mode: all we do here is load the Faster R-CNN and return it in eval mode for us to perform inference with. The next function is get average inference time.
We do a warm-up so that the model reaches a steady state before we begin timing the execution, to compare stock PyTorch against the optimized version using IPEX. Finally, plot speedup just plots a bar chart comparing the performance differences. All right, here's the baseline PyTorch model. The baseline model is the simplest version of the model that can be loaded from the PyTorch hub. These are the weights for the Faster R-CNN network, obtained by training with the recipe mentioned in the paper. Right, this is the first important bit: input image memory format.
There are different ways to represent image data as inputs to a CNN model. In this case, we focus on channels first and channels last. Simply put, with channels first the channel dimension comes first when storing these tensors, as you can see: NCHW, that is batch, channels, height, and width. With channels last, you have NHWC: height, width, and then the channel. PyTorch uses channels first by default, but there are some performance advantages to using channels last, and that's the first thing we'll look at. Here we're loading the model using the functions I described before. We load the stock PyTorch model and run an inference on it. Sorry, my bad, I forgot to run this cell.
Here, we're loading the model, running inference, and then getting an average inference time to compare against as we move forward. As I was saying, PyTorch uses channels first by default, and this has taken about 441 milliseconds. Channels last is a different ordering, and with it we can use the vectorization I mentioned during the presentation as part of the operator optimizations in IPEX; the layout lends itself very well to vectorization. Certain layers in the network benefit from this type of optimization, like the Conv2D and transposed Conv2D layers. Let's see how using channels last helps.
For channels last, all you have to do is set the memory format to channels last on your model and also on your input data; you just change the memory format like so. Once you've done that, you're ready to go again. There, as you can see, there is definitely a difference in the time that this forward pass took: about 362 milliseconds versus 441 milliseconds. Now comes the IPEX part. On top of channels last, we'll see how the optimize API provided by IPEX can further speed up model execution. I've highlighted the code changes here, and it's just two lines of code: you import IPEX, and then you wrap your stock PyTorch model with the optimize API. The combined changes look roughly like the sketch below.
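Putting the two steps together, the changes look roughly like this (a sketch; the model and input stand in for the notebook's Faster R-CNN and sample tensor):

```python
import torch
import torchvision
import intel_extension_for_pytorch as ipex

# Stand-in model and input (the notebook uses Faster R-CNN; ResNet-50 keeps the sketch simple).
model = torchvision.models.resnet50(weights=None).eval()
data = torch.rand(1, 3, 224, 224)

# Step 1: switch model and input to the channels-last memory format.
model = model.to(memory_format=torch.channels_last)
data = data.to(memory_format=torch.channels_last)

# Step 2: wrap the model with the IPEX optimize API (the two extra lines of code).
model = ipex.optimize(model)

with torch.no_grad():
    output = model(data)
```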
Once I run that, this warning just says that for this particular model, a certain folding, which is a type of optimization, is not possible; for other types of models it will be, so we can ignore it. Right, now that we have run optimize, let's see how long the model takes to run. I'm just going to plot a bar chart. I'm getting a speedup of 1.68x compared to stock PyTorch with channels last, and I would say that's definitely a considerable improvement since this is real-time inference. That was the optimize API. Now moving forward, I'm just going to check if there's anything in the chat. Oh, perfect. Let's go back. All right, now TorchScript.
As I said, PyTorch works in eager mode and in graph mode, and for both of these there are different ways to optimize. TorchScript helps us convert our model from eager mode into graph mode. Once it's converted to graph mode, there are many optimizations that can be applied, like running it in a multi-threaded way and so on. There are two APIs here that help us convert to script mode. The first is torch.jit.trace: this trace function takes the model and an example input, records a trace through the model, and uses that recording to convert it to a TorchScript module. After that, there is freeze.
Under the hood, with freeze, IPEX applies constant folding, as I spoke about during the presentation. In a deployment scenario, it's very important that model execution is as fast as possible, and this provides a huge advantage over eager mode. Let's convert the model to TorchScript mode and again compare performance to see how much speedup can be achieved. Right, with TorchScript we're seeing a 2.46x improvement over stock PyTorch. This can really come in handy during inference or in deployment scenarios. That was TorchScript; in code, the conversion we just ran looks roughly like the sketch below.
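A minimal sketch of that trace-and-freeze step (again with a stand-in model and input in place of the notebook's Faster R-CNN):

```python
import torch
import torchvision
import intel_extension_for_pytorch as ipex

model = ipex.optimize(torchvision.models.resnet50(weights=None).eval())  # stand-in for the notebook's model
data = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    traced_model = torch.jit.trace(model, data)    # record a trace through the model
    traced_model = torch.jit.freeze(traced_model)  # inline weights and enable constant folding
    for _ in range(3):                             # warm-up runs so the JIT can apply its fusions
        traced_model(data)
```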
Next, we'll move on to the NLP workload. This is DistilBERT, a smaller version of the BERT model. With DistilBERT, we'll focus on how to use the quantization features that IPEX provides. As I mentioned before, quantization changes the data type of a model from, say, FP32 to something of lower precision, like int8. Once we do that, the size of the model reduces considerably (it's a kind of compression technique), so inference can be much faster compared to the original FP32 model. For this, we'll use the transformers library and import the DistilBERT tokenizer and the DistilBERT model. Here we also have very similar helper functions to load the model in eval mode and to get the average inference time, just like what we saw with the Faster R-CNN. Right.
This is a sample example input for DistilBERT; I'm just going to load it. It's a question-answering type of task. We just tokenize this, and then, similarly, the cell I'm running now is for stock PyTorch. We're going to repeat very similar experiments as before: stock PyTorch first, and you can see it's taking about 4.83 milliseconds. Now, for quantization, we import the quantization modules provided by IPEX: prepare, convert, and IPEX itself. We're going to look at two types of quantization, static and dynamic. In static quantization, there are a few offline steps involved, such as calibrating the model.
Ideally, that calibration is done with the full dataset. In dynamic quantization, there are no such offline steps, so the performance can be expected to be slightly lower than static quantization. There are a few steps here: first, you set the configuration, that is, the qconfig; then you prepare your model; once your model is prepared, you convert it; and after converting, we use the TorchScript module that we saw earlier. Let me run this cell. These four steps can be perfect for a deployment scenario. Right, I've run these cells, and we've got a converted int8 model. These last two steps here are warm-up runs.
I'd like to point out that these are very similar warning messages: some of the optimizations are not possible because of the type of the model, so it's just letting us know which ones don't apply. Now that the model has been converted from FP32 to int8, let's try running inference and comparing the time. As you can see, we have a speedup of around 2x with static quantization. Similarly, let's also try dynamic quantization. Here, again, you have prepare and convert, but no calibration step; then convert to TorchScript and do a few warm-up runs, and the model is converted from FP32 to int8. Let's print it out here. As I said, it's slightly lower, but we're still seeing about 1.78x. In code, the two flows look roughly like the sketch below.
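For reference, a condensed sketch of the static and dynamic flows; the helper names below match the IPEX 1.13 quantization API as I understand it, but do check the release you're on, and the tiny model and calibration data here are placeholders for the DistilBERT workload:

```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

# Tiny stand-in model; the notebook uses DistilBERT from the transformers library.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2)
).eval()
example_inputs = (torch.randn(1, 128),)
calibration_batches = [(torch.randn(8, 128),) for _ in range(10)]  # placeholder calibration data

# --- Static quantization: calibrate offline, then convert ---
static_qconfig = ipex.quantization.default_static_qconfig
prepared = prepare(model, static_qconfig, example_inputs=example_inputs, inplace=False)
with torch.no_grad():
    for batch in calibration_batches:
        prepared(*batch)
static_int8 = convert(prepared)

# --- Dynamic quantization: same prepare/convert, no calibration step ---
dynamic_qconfig = ipex.quantization.default_dynamic_qconfig
dynamic_int8 = convert(prepare(model, dynamic_qconfig, example_inputs=example_inputs, inplace=False))

# Either converted model can then be traced and frozen for deployment.
with torch.no_grad():
    traced = torch.jit.trace(static_int8, example_inputs)
    traced = torch.jit.freeze(traced)
```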
That was the quantization part of IPEX. Now, this is the final feature that we're going to look at. Before that, let me just check if there are any messages. Oh, yeah. Perfect. All right, the launch script. We spoke about all the optimizations; on top of those, you can tune these models further at runtime to get the maximum throughput out of them. Today we're going to focus on how to do this for a single instance for inference. For this particular exercise, I'd like us to view the output from htop.
htop shows us how many processor cores are being used while a certain process is running. I've listed the steps here: you just go to File, click New, and open a new terminal, like so. This terminal has to be a shell, so we'll just go into Bash, and once you're in Bash, you can just type htop. There we go, this is what the htop output looks like. We'll come back to it to see how the cores are being utilized while using the launch script. As you can see here, the launch script can be used as a Python module, and in this particular example I'm showing how you can pin the process to a certain number of cores.
First, let's see how all the physical cores can be used to run the same Faster R-CNN network we saw earlier. It's just in script form; you can see it in the scripts folder, and it's the same model. I'm going to run this first. Right, it prints a log, and you can see cores 0 to 23, meaning 24 cores are being utilized. If I go to my htop output, I can see that 24 cores are being used for this execution. For the launch script, it really depends on the topology that you're using; different topologies can benefit from different types of tuning. I would definitely suggest taking a look at the documentation.
It is very detailed, and there are a lot of knobs you can turn to achieve the best performance at runtime with the IPEX launch script. This is just an example of how it allows us to pin to 24 cores at runtime while running the same Faster R-CNN workload. Once this is done, we can look at how to pin the process to all logical cores, not just the physical cores. That way, instead of using just 24, we'd be asking the model to use all of the available cores, including the logical ones. I'm just going to wait for this to finish. All right, I'm going to run this next to use all of the logical cores. This machine has 48 cores in total, including the logical cores.
We've now tried to pin it to all of them, and if I go back to htop, we can see all of the cores in action. It might not utilize them fully because the workload doesn't demand it, but in theory we've pinned it to all 48 cores, as you can see. Similarly, what we're doing here is just specifying how many cores we want to pin to. We saw 24 and then 48, and now I'm just going to pin it to 10 cores and see how htop reacts as soon as this execution finishes. Yep, there we go, finally pinning it to 10 cores. You can see cores zero to nine, so 10 cores, and if you look at htop, only those 10 cores are being used.
As I said, there are a lot of knobs you can use; these examples were just about pinning to cores, but I would definitely suggest you check the documentation and see how you can tune the model at runtime for the best performance. As an exercise, you could also change the number of cores per instance here from 10 and try different numbers to see which setting gives you the best result. With that, we've ended the hands-on session. What I'm gonna do first is share... I saw a message about the code, so I'm gonna share that first. Let me stop sharing my screen for a second. Just give me a second. All right, here we go.
I'm posting in chat the link to this code, which is part of the oneAPI samples. It's open source. You can take a look at this code and try it again later once this session is complete. Yeah. I suppose that was the end of the hands-on session, and I'd like to hand it over to Ashok for the Q&A. If you have any questions.
I think this will be a good time. Thank you everyone. That's it from me.
Hi, everyone. Please post any questions in the chat window; we collected some questions that were asked previously, and we will address them one by one. Just give me a minute. I think one of the questions was: Is PyTorch beginner friendly, and what is the difference from other frameworks such as TensorFlow? As you know, both PyTorch and TensorFlow can be used for deep learning model building, optimization, deployment, and so on. They are, you know, complementary in a way, actually. PyTorch tends to be more Python friendly.
There are also differences in the way you build models. For example, I believe in TensorFlow you have the graph mode, and they later introduced eager mode, which lets you execute the model step by step, like in PyTorch. As you saw in the hands-on portion, building and defining models in PyTorch is done in Python, and it's easy to build, debug, and experiment. It tends to find more users in the research community, where they have to build dynamic graphs or dynamic models, and so on. I hope that addressed your question. The next question is: What is the time taken for the backward pass? I believe this was asked by Ashok Jayaram.
I think we haven't timed the backward pass, but usually the backward pass tends to have more operations, or at least heavier compute, since you have to compute the gradients and so on. You can easily time it yourself: in the Python script, you can add timing statements, just like you would time any Python code, and you should be able to get that. In the notebook that Pramod shared, you can experiment and modify things; it's just Python, so please do add the timing measurements for the backward pass, and also for the optimizer step if you're interested in that, roughly like the sketch below.
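For example, a minimal sketch of timing the forward pass, backward pass, and optimizer step with plain Python timers (the tiny model and data here are placeholders):

```python
import time
import torch

model = torch.nn.Linear(128, 10)              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
x, y = torch.randn(64, 128), torch.randint(0, 10, (64,))

t0 = time.perf_counter()
loss = criterion(model(x), y)                 # forward
t1 = time.perf_counter()
loss.backward()                               # backward
t2 = time.perf_counter()
optimizer.step()                              # optimizer update
optimizer.zero_grad()
t3 = time.perf_counter()

print(f"forward {t1 - t0:.4f}s, backward {t2 - t1:.4f}s, step {t3 - t2:.4f}s")
```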
The performance improvements, by the way, should apply to the backward pass as well, not just inference. I believe the next question... Let's see: Can we have a link to this notebook? Thanks, Pramod, you already shared that. The next question: Does the optimized performance change across heterogeneous hardware? Today you executed this lab on an AWS instance and you saw the performance improvements; these improvements should carry over, in general, to any CPU. It depends on the kind of CPU you have; some newer Intel hardware has better performance and newer instruction sets.
For example, the new Sapphire Rapids, which was announced recently, has AMX, a newer instruction set for deep learning, and it can help you even more, right? As for heterogeneous hardware, if you mean combining CPU and GPU, you can do that in PyTorch: you can execute some operations on the CPU and dispatch some of the operators or parts of the model to the GPU. For example, you can do data processing or handling on the CPU while the GPU is computing the kernels. There are different use cases, even distributed training and parallel model deployment. We also, as Pramod mentioned, work closely with ecosystem projects such as TorchServe and Hugging Face.
TorchServe, for example, will let you run multiple models at the same time, basically deploying your inference together. There is also a similar thing for TensorFlow called TensorFlow Serving, I believe. If you have a diverse set of hardware, these performance optimizations do carry over, and you can use TorchServe out of the box; we've already optimized some of this in TorchServe, so you can leverage it there. I think the next question is: Is IPEX available via conda? Yes, it is; you can do a pip install of IPEX. In the call-to-action slide, Pramod shared the link to the Intel Extension for PyTorch, and if you go to that link, it will show you exactly how to install it using pip.
Usually the Intel Extension for PyTorch mirrors the PyTorch releases. For example, if the latest PyTorch release is 1.13, the Intel Extension for PyTorch will also have a 1.13 at the same time, so you can match the versions and use the latest in both cases. Okay: Does this work with the new Sapphire Rapids HBM? Yes, it does. As I mentioned earlier, Sapphire Rapids brings quite a few features: HBM, AMX, larger core counts, better memory bandwidth, and so on. IPEX and even out-of-the-box PyTorch both use our oneDNN library, which is optimized for deep learning kernels, and that library automatically leverages the latest instruction sets behind the scenes.
As a PyTorch user, you don't have to worry about which platform or hardware you're using; it's transparent and done automatically at runtime, and it will show the better performance. Yes, we did optimize for Sapphire Rapids and HBM, so you should definitely see those improvements. Okay, I'm going to browse through the other questions here. When using this with Sapphire Rapids HBM, does it pre-fetch weights into HBM or the L3 cache? I think HBM is a special SKU, and depending on how the pre-fetches are configured, it does leverage HBM. As I mentioned, the oneDNN kernel library does cache and pre-fetch weights and also leverages the L2 cache, the LLC, and so on.
It really depends on the size of the weights, the model, and the kernel you're executing; we heavily tune this depending on the hardware. The short answer is yes, it does leverage HBM, L3, L2, and even L1 as appropriate; it really depends on the tensor sizes and the kernel computation. Each kernel is different: convolutions and GEMMs are heavily optimized and hand-tuned for different hardware configurations. It's automatic, so as a user you shouldn't need to worry about whether a particular kernel is efficiently using HBM on Sapphire Rapids. As long as you use the latest IPEX and the latest PyTorch version, it should
already be optimized for you. If you do notice a performance issue, depending on your model or a special kernel that we may not have optimized, please file an issue on the IPEX repository or even directly on PyTorch; we regularly monitor those issues and address them. We also constantly look at benchmarking suites such as TorchBench in upstream PyTorch, and we have our own internal benchmarking covering TorchVision, Hugging Face, and, basically, the popular models. When we optimize for Sapphire Rapids or Sapphire Rapids HBM, we do broad model coverage, so to speak: we sweep models across different domains and make sure the performance improvements are seen.
We have our own roofline, or expected performance, depending on the hardware, and based on that we ensure the optimizations show up in the latest PyTorch. If you want the cutting edge, it's always recommended to use IPEX, meaning the Intel Extension for PyTorch, because it tends to have the latest optimizations. You can think of the Intel Extension for PyTorch as a staging ground, so to speak; the bulk of these optimizations will be upstreamed into open source PyTorch. If you're using the latest hardware, we strongly recommend using the Intel Extension for PyTorch.
I think the next question is: Are there any precautionary measures if we embed pandas, scikit-learn, SymPy, et cetera? PyTorch should interoperate with all of these libraries. PyTorch has its own tensor library, and tensor memory and layout are very critical for performance. NumPy, which pandas is based on, is the common ground here, and upstream PyTorch has utilities and APIs to interoperate with NumPy. For example, if you do pandas data frame computations and you want to
avoid extra copies and things like that, you can bring that data in and use PyTorch tensors directly. IPEX, the Intel Extension for PyTorch, doesn't have its own tensor library or anything like that; it just builds on PyTorch, so it should all be seamless with NumPy, which I believe all these libraries interface with: pandas, SciPy, and so on. The only precautionary thing I would call out is that with deep learning and tensor computations, it is critical to avoid unnecessary copies; as much as possible, you should reuse the existing tensor memory allocations, for example as in the sketch below.
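As a small illustrative sketch of the zero-copy interop being described (the DataFrame here is made up):

```python
import numpy as np
import pandas as pd
import torch

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})  # made-up data

# torch.from_numpy shares memory with the NumPy array: no extra copy is made.
features = torch.from_numpy(df.to_numpy())

# Going the other way, Tensor.numpy() also shares memory with the CPU tensor.
back_to_numpy = features.numpy()

# In-place edits are visible on both sides, confirming the shared buffer.
features[0, 0] = 42.0
print(back_to_numpy[0, 0])  # 42.0
```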
Those are the kinds of NumPy and PyTorch interoperation APIs to explore. Everything else, the Intel Extension for PyTorch and all the Intel optimizations, should just work, because they're all based on the existing PyTorch tensor APIs. I think the next question is: It isn't available as a pre-compiled conda package; it must be installed with pip. I believe we uploaded it to the PyPI repository, and you should be able to use it in a conda environment as well. I'm not sure whether we uploaded it to a conda channel, but if you have an Anaconda environment, you should be able to use it; you don't have to build IPEX, you can just install it with pip.
My expectation is that a conda install should also work, but we can double-check that; pip should work inside a conda environment as well. I think the next question is: The "configure your system" link takes you to a marketing page. Okay, we will need to double-check that link. "Get started with AI": if it's taking you to a marketing page, it's obviously not about configuring your system, so there might be a broken link; we can definitely follow up on that. Will the notebooks and scripts be available later or shared? Yeah, I believe we shared that. Does it run on the Intel Flex dGPU, ATS-M? That's a great question.
We do. If you go to the Intel Extension for PyTorch, we have support for ATS-M, basically dGPU and even iGPU; there's an Intel Extension for PyTorch build for that. Right now, upstream PyTorch does not work with the dGPU, so you have to use the Intel Extension for PyTorch, and that should let you run your models on ATS-M. If you have an ATS-M, please do try it and let us know about any issues on the Intel Extension for PyTorch GitHub. If you look at the branches, click on the branch in GitHub, there should be a GPU branch there. I'm quickly browsing through the other questions here. Looks like, thanks, Susan, for helping out and following up with some of the questions already.
Until when would the... Okay, I think Pramod Pai already answered that. I think the last question was from Alan Ghaffari, I hope I'm saying the name right. He mentioned he's not seeing the config steps on the Getting Started page. Looks like we'll have to follow up; thanks, Susan Kahler, for following up on that. If you have any other questions from either the hands-on lab or the first 10 minutes of the presentation, please post them here and I can answer. I think the next question is: What's the future plan and roadmap? For the roadmap, it depends on the hardware release schedule: PyTorch and the Intel Extension for PyTorch will enable those features according to the hardware release schedule.
As far as PyTorch itself is concerned, we work closely with the PyTorch community and ecosystem to support all the latest PyTorch features. For example, there was recently a PyTorch developer conference, and we have been enabling new optimizations in PyTorch such as Dynamo and Inductor, which are newer compilation technologies in PyTorch. New data types are always something we are interested in enabling, like bfloat16 or FP16 and so on, and also quantization. Anything related to data types, hardware features, and instruction sets is part of the roadmap plan, and it tends to map to PyTorch's own release plan.
I'm going to paste the ATS-M link that somebody asked for; please go to this link for ATS-M. With that, I can hand it over to Susan, I guess, for the conclusion.
All righty. Thank you both so much.
All right.
-for the hands-on. If you have any more questions for them, head over to Discord after the event to ask your questions and go ahead and continue on with this conversation. There were some really good questions, and I know people were really following along and trying to replicate what they were seeing. Go ahead and close the window. Go back to the agenda page. We are gonna be taking a 30-minute break for lunch, and then you can rejoin the next presentation, which is Hacking the Hackathon using Fastai and IPEX. See you again.
All right. Welcome back. I hope everyone had a good lunch hour, or I guess 30 minutes. I'm here to introduce our next session, Hacking the Hackathon using fastai and IPEX. Our speakers are Sai Ramaraju Penmatsa and Ankur Singh. Sai Ramaraju Penmatsa, whom I'll call Raju for short, is currently pursuing his master's in software engineering with a data science specialization at SJSU and working as a graduate research assistant in computer vision. Ankur, after graduating from undergrad, started his own company in India called AI Adventures, which provided AI/ML solutions to businesses. After three years, he joined Zootone as the ML team lead. Currently, he is pursuing his master's in software engineering at SJSU.
I'm going to hand it over to them. If you have any questions during this session, please put them in the chat box, and we'll answer them during the Q&A session. Over to you, Raju and Ankur.
Hello. Good afternoon, everybody. Hope you're all doing great, and thank you for joining me and Ankur today. Today we're gonna walk you through how we hacked our way to winning the Intel hackathon using fastai and the Intel Extension for PyTorch. Before that, I would like my teammate Ankur to introduce himself.
Hi. Good afternoon, everyone. I'm Ankur, and I'm currently pursuing my master's in software engineering at San José State University. As Sri already introduced us, I'll keep it brief: before joining SJSU, I was leading the ML team at Zootone, and prior to that, I was the co-founder and CEO of my own company, called AI Adventures.
Hello. I am Raju, previously a software engineer at Accenture and currently doing my master's at San José State University. Next, I would like to share what the hackathon experience was like. It was the first-ever hackathon that either of us had attended. It was an eight-hour-long session, and one or two energy drinks and coffees helped us iterate on a lot of models and experiments. It was also a great learning experience, as we got to develop models for real-world problems and also deploy them. During this process, we got to connect with a lot of great people at Intel.
People like Eduardo, Ben, Scott, and Paula, to name a few. That led to opportunities like picking their brains about what it's like there. It was also a special experience for us, and it always will be, because it created a lot of ripple effects. For example, as the winners, we got a chance to interview with Intel CTO Greg Lavender, and he presented us with an Intel NUC. This success was also shared in our college newspaper. Then I wrote a Medium article which got picked up by Intel and published on the Intel blog, and that, luckily, got retweeted by the Intel CEO, Pat Gelsinger.
And now we're here talking about our journey with all of you. Okay. What exactly did we do, and how did we do it? That's something my teammate will share with you. Over to you, Ankur.
It was a great experience for both of us: it was our first hackathon, we ended up winning, and it had a lot of ripple effects, so it's a special one for us. Moving on to the problem statement. First, we'll look at it from a business perspective. The task was targeted pesticide spraying of weeds: instead of spraying pesticides on all the crops, we wanted to identify the weeds and then spray pesticides only on them. Basically, the hackathon team wanted us to build a deep learning model which can separate weeds from plants. The model was supposed to be deployed on a drone, so it had to be computationally cheap and have a fast inference time.
The drone would, in one flight, scan the complete field, identify the weeds, and spray them with pesticide. The drone has a small battery, and there are compute resource limitations; these were the constraints under which the model had to perform. Looking at the problem from a deep learning perspective, it's basically an image classification problem: the drone camera captures quite a few images, and in each image we want to identify whether it's a weed or not. We could actually take it a step further and do object detection or segmentation to get more fine-grained detail, to work at a more granular level.
With the dataset we had, though, it was limited to image classification. Since there are only two classes, weeds and plants, it's a binary classification problem, so we can get away with binary cross-entropy loss. Our metric was accuracy, because the dataset was quite balanced; it wasn't an exact 50-50 distribution, but it was balanced enough that we used accuracy as the metric. We wanted to walk through this because oftentimes it's very difficult to take a business problem and turn it into a deep learning problem. Here it's pretty straightforward, but in most scenarios the loss function is not clear, the metrics are not clear, and there are multiple possible ways of framing the problem.
One of the most important steps when you're trying to provide a deep learning solution is to identify how to translate the business problem into something that can be framed as a deep learning problem. Here it's pretty straightforward, but in business and real-world settings it can be quite challenging. It's good practice to spend more time on this, and also to think about the data that you have, because it will define most of what you can do. Moving on: after understanding all of this, our first approach. Since it was a hackathon, and a hackathon is like building a car and using it to cross the finish line, you have to do all of it in just 8 hours.
So it's about trying a lot of things, failing fast, experimenting. Our initial approach was to skip exploratory data analysis and jump straight to building a model. For building the model, we used fastai. fastai is a Python library built on top of PyTorch, and it has a lot of really good helper functions and utilities and is very application-oriented: you get the training loop, validation loop, logging, callbacks, mixed-precision training, fine-tuning, and a lot of extra stuff for free. You can spend more time experimenting rather than fiddling around with code. We went with fastai because we had a lot of ideas we wanted to try, and fastai would enable us to do that.
We started with very basic data augmentation; we didn't do anything complex or heavy. We started with the ResNet-18 model. ResNet-18 and ResNet-34 are the kind of default models that generally tend to work really well. Our first round was basically fine-tuning the head: we took the ResNet, replaced the ImageNet head with a binary classification head for our problem, fine-tuned the head for one epoch, and then fine-tuned the complete model for five epochs. This was our initial approach. After the first round of training, we found quite a few insights that changed our approach. The first insight was that the Intel Extension for PyTorch is amazingly fast.
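As a rough illustration of that first round, here is a minimal fastai sketch of the head-then-full fine-tuning described above. The dataset path, image size, and batch size are assumptions for illustration; fastai's `fine_tune` runs exactly this pattern of frozen-head epochs followed by full-model epochs.

```python
from fastai.vision.all import *

# Hypothetical dataset layout: weed/ and plant/ subfolders (assumption).
path = Path("data/weeds_vs_plants")

dls = ImageDataLoaders.from_folder(
    path,
    valid_pct=0.2,                # hold out 20% for validation
    item_tfms=Resize(224),        # basic resize, no heavy augmentation
    batch_tfms=aug_transforms(),  # fastai's default light augmentations
    bs=32,
)

# ResNet-18 with an ImageNet-pretrained backbone; fastai swaps in a new
# 2-class head automatically.
learn = vision_learner(dls, resnet18, metrics=accuracy)

# fine_tune(5, freeze_epochs=1): 1 epoch training only the new head,
# then 5 epochs training the whole network.
learn.fine_tune(5, freeze_epochs=1)
```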
We were training on about 1,300 images, and each epoch was taking just 6 seconds. There was no GPU, only CPUs; it was jaw-dropping performance to see. We realized that we didn't have to limit ourselves to one or two experiments; we could conduct a lot more. That was the first insight. Second, the training and inference speed was so good that we could list down all our ideas and try out a lot of different things. And because we were using fastai from the very beginning, it was very easy for us to implement and test our ideas. Next we started trying a lot of different CNN architectures.
There's this Kaggle notebook by Jeremy Howard. Jeremy Howard is a renowned figure in the deep learning domain, and he's also the author of the fastai library that we were using. He created a Kaggle notebook called Best Vision Models for Fine-Tuning. We used the findings from that notebook to narrow things down to a few architectures that we would test. We tested ConvNeXt, we tested vision transformer models, and we tested quite a few other models, also with different pretrained weights.
Roughly, we experimented with around 12 different models. We were able to train all of them because, with the Intel Extension for PyTorch, each epoch took just six seconds, so why not? This is the plot from the notebook I referred to in the previous slide. Each dot is a model, and the color represents which model family it belongs to. On the x-axis you have fit time, i.e., how much time it takes to fine-tune that model. On the y-axis you have error rate, i.e., the error rate after fine-tuning.
Generally, we want models in the lower left: low error rate and low fit time. That is the region we picked our candidate models from. We ended up using ConvNeXt-Tiny, pretrained on the ImageNet-22k dataset. It's a different version of ImageNet which, instead of having 1,000 classes, has 22,000 classes, and models pretrained on it tend to perform much better. It's a really good notebook, and the link is attached, so I highly recommend you go and check it out. Okay. There are many other things that we tried.
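To give a flavor of that sweep, here is a hedged sketch of how one might loop over a handful of timm backbones with fastai, in the spirit of the notebook referenced above. The model names are common timm identifiers, and the specific list is an assumption, not the team's exact set.

```python
from fastai.vision.all import *

# Assumed shortlist of timm backbones; 'convnext_tiny_in22k' is the
# ImageNet-22k pretrained ConvNeXt-Tiny mentioned in the talk.
candidates = [
    "convnext_tiny_in22k",
    "vit_small_patch16_224",
    "resnet34",
]

results = {}
for arch in candidates:
    # dls is the DataLoaders object from the earlier sketch.
    learn = vision_learner(dls, arch, metrics=accuracy)
    learn.fine_tune(5, freeze_epochs=1)
    # Record final validation accuracy for comparison.
    results[arch] = float(learn.validate()[1])

print(sorted(results.items(), key=lambda kv: -kv[1]))
```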
One thing was TTA, test-time augmentation. The idea is that during inference, you take an image and perform some augmentations on it. As you can see in the image, you have one original image and six augmented images. You run inference on these seven images, get the predictions for each of them, and then aggregate them to come up with the final class. It's called test-time augmentation because we are performing augmentation at test time, during inference. We do data augmentation during training, but the idea with test-time augmentation is that, say at a certain angle, the model might end up misclassifying.
If you augment the image, the model sees many more views of the same object. If, out of seven views of one input image, the model says dog for one but cat for the other six, then when you aggregate, the effect of that misclassification is reduced. That is the idea behind test-time augmentation. Generally, it's not good practice to use it when deploying on an edge device; it was just an idea we wanted to try, and since during the hackathon the model wouldn't actually be deployed on the drone, we gave it a shot. Another idea that we tried was progressive resizing.
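fastai exposes test-time augmentation directly; here is a minimal sketch of what trying it might have looked like, assuming the learner from the earlier snippets. Treating TTA as a drop-in replacement for plain `get_preds` on the validation set is the point of the comparison.

```python
# Plain validation predictions for comparison.
preds, targets = learn.get_preds()

# Test-time augmentation: fastai runs several augmented passes over the
# validation set plus one un-augmented pass and averages the predictions.
tta_preds, tta_targets = learn.tta()

print("plain accuracy:", float(accuracy(preds, targets)))
print("TTA accuracy:  ", float(accuracy(tta_preds, tta_targets)))
```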
The idea with progressive resizing is that in the first round of training, you train your model on a very small image size. For example, the first round can be done on 64 by 64 images. You train for a few epochs, then increase the image size to 128 by 128, and again, after a few epochs, increase it to 256 by 256. This allows you to train your model for more epochs: the initial epochs take very little time because the image dimensions are very small, so while you're training on 64 by 64 images, the training time is very low.
The idea is that you don't have to train your model on the full-size image from the start. If you plan to train for 20 epochs, the first 5 epochs can be done at one-fourth the size, the next 5 at half the size, and so on. This can save a lot of compute and reduce the training time drastically. These ideas didn't work out particularly well for us; both gave results very comparable to the default training loop, so we didn't end up using them, but we were able to run all these experiments within the 8 hours.
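A hedged sketch of progressive resizing with fastai is below: rebuild the DataLoaders at each resolution and keep training the same learner. The epoch counts and sizes follow the illustrative schedule above, not the team's exact settings.

```python
from fastai.vision.all import *

def make_dls(size, bs=32):
    """Rebuild the DataLoaders at a given image resolution (assumed folder layout)."""
    return ImageDataLoaders.from_folder(
        path, valid_pct=0.2, item_tfms=Resize(size),
        batch_tfms=aug_transforms(), bs=bs,
    )

learn = vision_learner(make_dls(64), resnet18, metrics=accuracy)
learn.fine_tune(5, freeze_epochs=1)   # cheap epochs at 64x64

for size, epochs in [(128, 5), (256, 5)]:
    learn.dls = make_dls(size)        # swap in higher-resolution data
    learn.fit_one_cycle(epochs)       # continue training the same weights
```

Swapping `learn.dls` rather than resizing inside the transforms keeps the learner's weights and optimizer state while each stage sees progressively larger images.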
We just wanted to make sure that we could run as many experiments as possible and give all of those ideas a try, because you never know which idea or technique will end up working for you, given the dataset and model in hand. That was our approach. There were many other things we tried: mixed-precision training, gradient accumulation for larger effective batch sizes, and more. Overall, most of them gave us similar results. In the end, we trained the model with the fewest tweaks, mostly defaults and minimal hyperparameter tuning; only the elements that gave us a significant boost in performance were included.
That was our final submission. Now, coming to the deployment part, we used Intel's computer vision reference kit. These reference kits are basically sample or baseline code that you can start from and build your model on top of. This reference kit has all the code for connecting to the MLflow server, packaging your model as an MLflow model, registering it with the MLflow server, and so on. That helped us a lot. The key insight from this was that deployment is actually hard.
The hackathon team did a great job making it easy for us and the other participants to deploy our models. Generally, it takes more than eight hours just to deploy a model, but here, given a completely new dataset, we were able to do some analysis, train models, experiment, and then deploy, all within the time limit. That was possible because the MLflow server was already up and running and the reference kit had most of the code for interacting with it and registering the model. During our training process, we also found that there were some wrong labels in the dataset. We were able to spot them after the first round of training.
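The exact reference-kit code isn't reproduced here, but the MLflow flow it wraps looks roughly like the sketch below. The tracking URI, experiment name, and registered model name are placeholders, and logging the underlying PyTorch module with `mlflow.pytorch` is an assumption about how one might package the fastai-trained network.

```python
import mlflow
import mlflow.pytorch

# Placeholder tracking server and names; assumptions for illustration,
# not the hackathon's actual configuration.
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("weed-vs-plant")

with mlflow.start_run():
    mlflow.log_params({"arch": "convnext_tiny_in22k", "epochs": 5, "img_size": 224})

    # Log the validation accuracy from the fastai learner of the earlier sketches.
    val_acc = float(learn.validate()[1])
    mlflow.log_metric("val_accuracy", val_acc)

    # Log the underlying PyTorch module and register it in the model registry,
    # so downstream code can pull it by name and version.
    mlflow.pytorch.log_model(
        learn.model,
        artifact_path="model",
        registered_model_name="weed-classifier",
    )
```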
After the first round of training, we calculated our top losses. For each training image, we computed the prediction and the loss, and then sorted them in descending order of loss. Looking at the top losses gave us a glimpse of some mislabeled samples in our dataset. We planned to remove them, but we didn't, because we didn't want to disturb the distribution of the data. In most real-world scenarios, if you have the liberty, you should go ahead and delete wrong labels, because they act as noise and can hamper the model's performance. Now, you can find all the code here.
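fastai's interpretation utilities make this top-losses check a couple of lines; here is a minimal sketch, assuming the learner from the earlier snippets. The number of samples shown is arbitrary.

```python
from fastai.vision.all import *

# Build an interpretation object over the validation set.
interp = ClassificationInterpretation.from_learner(learn)

# Highest-loss samples are the most likely mislabeled or confusing images.
interp.plot_top_losses(9, figsize=(10, 10))

# Or get the raw losses and indices, sorted descending, for manual review.
losses, idxs = interp.top_losses(20)
print(idxs)
```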
Our solution notebook is posted on Kaggle, and you can find all the code there. The notebook won't run as-is because you don't have access to the data, but you can still refer to the code and look at the output of each cell. Anyone who is interested can go and explore it. Yeah, that is all we had from our side. I think we are good for Q&A.
I have not seen any Q&A as of now, but plenty of support. Mohammed Bhat said, "This is an amazing summarization. Helps identify the mindset in a hackathon." But Mohammed, I see no questions.
Okay.
Perhaps you can give some follow-up on some other observations you've made during your hackathon that may be outside of-
Yes.
the slides.
Sure. I can discuss more about this part. This was our initial approach: we started without doing any EDA and just jumped into building the model. We used a very basic pipeline, with very basic data augmentation and a ResNet, and all of this is very easy if you're using fastai; it's just 5-6 lines of code. In my experience, it helped us a lot because it gave us an idea of the training speed, so we could plan our next experiments. It also gave us a glimpse of some misclassifications and wrong labels in our data.
Generally what people do is spend a lot of time cleaning the data, doing the right augmentation, and a lot of other things, planning that if they get all of those things right, they can then train the model well. In my experience so far, starting by building the model helps a lot, because you get the pipeline built end-to-end and up and running. That is very important, in business, in experimentation, and in scenarios like hackathons. When you build the complete end-to-end pipeline, you get a picture of the whole thing.
If you spend a lot of time preparing your data and only then go to train, you may realize the model or data you selected is too big to fit in GPU or CPU memory, or you'll come across other problems. The general idea is: keep each module of the pipeline as bare-minimum as possible, and first complete the pipeline. That should be the first and most important step. Once you have the pipeline ready, you can update or fine-tune each of its parts. For example, you can replace the model; instead of training for 5 epochs, train for 10; instead of using 10% of the data, use the complete data.
Instead of using basic augmentation, you can now use some complex augmentation. It's important that you get your end-to-end pipeline up and running as quickly as possible. Once we completed this initial round of training, we were able to calculate top losses. That gave us the insight that model training is not only for making predictions; you can also use the trained model to analyze and understand your data and your computer vision problem much better. There are also libraries like ELI5 that are meant for interpreting models.
You can use them to understand where your model is struggling and then either add more samples or, if it's noise, get rid of it. You can do a lot more analysis if you focus your efforts on the parts, samples, or distributions where your model is struggling. That is true not only for CV, but for NLP and conventional machine learning as well. That is something that has helped us a lot; it allowed us to identify some wrong labels and other issues in the data. If you take care of those things, you can easily get a significant bump in performance. That was one finding this hackathon was a good use case for.
So, uh-
Raju, do you want to add something?
Uh-huh. Oh, I was gonna say, there was a follow-up question to what we were just talking about.
Mm-hmm.
Again, Guffeth, and he says, "Why no EDA? Isn't that typically dangerous, not to check data quality first?
Mm-hmm.
Did you know something about the data that let you skip this step?"
Okay. No, we were looking at the data for the first time. As I already said, setting up the complete pipeline has been the first and foremost thing for me. The reason is that even if there are issues in the dataset, for example an image that is corrupted, you'll get to know it as soon as you set up the pipeline and try to train. Oftentimes, most of the bugs become apparent, come to the surface, when you are building or training your pipeline. You don't have to spend extra time figuring it out beforehand, because no matter how good an EDA you do, you will miss a few things that will come up later when you're building the pipeline.
And as you said, I think building the model is part of EDA. We use means, standard deviations, and other statistics; we can also use, say, feature extraction from the model for our EDA. So building the model is again part of EDA. Your first round of training will never give you the expected output, so it's kind of part of EDA: you are just checking whether everything is correct.
Yeah, you can look at it that way, because we didn't use or deploy the predictions from the first round of training. It was just to make sure everything was working fine.
Okay. We have only two more minutes. Do we have any more questions? Okay, it looks like we have no more questions from the audience. Thank you, Ankur and Raju, for your really interesting talk. For the rest of you, if you think of any new questions, please visit us on Discord after the event is over and ask them there. We can monitor and forward them, or if Ankur and Raju want to show up, that'd be great as well. This concludes the presentation. Again, please close the window and then go back to the agenda page, and you can join our next presentation, Spatial Single-Cell Analysis using oneAPI AI Analytics Toolkit. We'll see you in the next one.
Thanks for attending this one.
Hi, everybody. Welcome to the talk, Spatial Single-Cell Analysis using the oneAPI AI Analytics Toolkit. Our speaker is Abhishek Nandy. He will talk to us about using oneAPI in medical imaging. Abhishek is co-founder of DinoPy. He has a BTech degree and a curious mind. He is also an Intel Black Belt developer. He has been an invited educator at several leading premier education institutes in India. Abhishek has also authored books on reinforcement learning, Unity machine learning, Leap Motion, and game engines. If you have any questions throughout the presentation, put them in the chat box and Abhishek will answer them at the end of his presentation. Over to you, Abhishek.
Hello all. In this session, we'll talk about spatial single-cell analysis using the oneAPI AI Analytics Toolkit. My name is Abhishek Nandy. I am an Intel Software Innovator as well as a oneAPI Certified Instructor. Let's start. What's the agenda for today? We'll use this project to showcase that we can apply oneAPI to medical imaging and that the approach is future-ready. Using the oneAPI AI Analytics Toolkit, we are able to write and develop performant code quickly and correctly. Our target is to bring the medical imaging use case to a point where we can work on it with the oneAPI AI Analytics Toolkit. Now, before getting in, there are some important terminologies that we need to know before going into the topic itself. The first is transcriptional profiling.
It is a process used to identify rare disease conditions, find ways to diagnose them, and see how the body responds to treatment. Transcriptional profiling is a very important topic that we'll keep touching on as we go through the entire use case and study the subject in depth; the rare diseases might be, say, cancers, and there are different diseases we can relate it to. RNA sequencing is also very important. It's a sequencing technique, used in next-generation sequencing, to reveal the quality of the RNA in a biological sample at a given moment, analyzing the continuously changing cellular structure. When we are dealing with RNA sequencing for a particular person, keep in mind that we age on a daily basis.
There are rapid changes in the cellular structure, and if we study these changes over time, we get more in-depth knowledge of how the person's cells are changing. Then we have gene mutation. Gene mutation is very important because mutations lead to changes in the structure of the RNA that can lead to different kinds of diseases, so studying a particular gene structure is very important for us. These three topics are the key points we'll relate to when studying the use case. Before delving deep into what exactly we are doing, we have to know what single-cell RNA sequencing is. Single-cell RNA sequencing provides transcriptional profiling.
We are touching on the points that we have already covered: transcriptional profiling of thousands of individual cells. When we bring different cells together and combine them, we are doing a study across them. This study lets us relate more directly to gene expression, and gene expression gives us varied results from which we can extract more information. Single-cell RNA also allows us to study gene mutation in rare diseases, which is a very important aspect of studying single-cell RNA structure in biological or medical imaging. Why use Squidpy? Squidpy is one of the important toolkits available, which I've used with the Intel AI Analytics Toolkit, and it's very useful when you're studying medical imaging. We have to bring in a lot of data that might be high dimensional.
To bring much more order into that, we use Squidpy. We'll be porting Squidpy to the Intel AI Analytics Toolkit, and we'll be running it within Intel DevCloud; that is an important thing we'll cover. These are the important pieces: the AI toolkit and the DevCloud. Now, exploring Squidpy. As we can see from this image, which is taken directly from the DevCloud, we have taken a particular image and analyzed it using some of the techniques available in Squidpy. Using the AnnData structures, we are able to see the image organized into clusters. These are the image cluster data we have used with spatial methods.
When we are studying, we can draw different patterns from these images, and that's very useful for us. It's an image of very high dimensional data, and each color has been assigned so that the relevant information in the image can be tracked. Let's move on. Exploring Squidpy further: just as in deep learning we have to study different aspects of a particular image, with Squidpy we can also segregate the image in different ways. As you can see, this has been generated as image 001, 002, for the different kinds of patterns we were looking at in the 10x Genomics dataset; that's the dataset used here.
The most important thing is the flow. First of all, we bring in the spatial dataset, that is, the dataset consisting of different images. After that is in place, we bring it directly into the DevCloud. Then we use the AI Analytics Toolkit, such as Intel-optimized TensorFlow or Intel-optimized Python, and run inference on it, so we can derive different patterns or look into the cell structure; we have utilized these frameworks as we see fit. Similarly, we are using medical image analysis methods such as nuclei segmentation and the ligand-receptor method, which we'll touch on later. Let's take a look at Squidpy as a complete flow.
You can see that, first of all, we are bringing in data, and the data can come in different formats. Medical imaging data does not follow commonly used patterns, but you can see the different kinds here: it may be Visium data, seqFISH data, MERFISH, IMC, 4i, or CyCIF, to be more precise. These are handled as AnnData, the structured dataset format that we are dealing with. After that, we bring it into a particular container where it is held, and we use the DevCloud directly to view these images. Then, using the toolkits that are available, like Intel-optimized TensorFlow and Intel-optimized Python, we do analysis to find the spatial neighborhood.
We derive interactive visualizations with the help of the Intel AI Analytics Toolkit as well as Squidpy, which gives an in-depth analysis of the patterns we are working on. What we will do is a case study of Visium data. Visium data is very high dimensional; we will bring the dataset into the DevCloud and try to bring out different patterns in it, so that we get results like the ones you can see in this image. This is taken directly from the DevCloud, and you can see varied patterns for a particular size of cells, and we can run a cluster analysis on it. Now, the steps.
First of all, as we are doing the experiment in Intel Developer Cloud, you have to be logged in to the Intel Developer Cloud itself. We will be using a Jupyter notebook powered by the Intel oneAPI AI Analytics Toolkit. The next step is to install the Squidpy library using a pip command. We use pip install squidpy with the interactive extra so that it installs all the necessary requirements for Squidpy. You can install Squidpy either way: the plain squidpy package, or, with the interactive extra, everything that is available gets installed for you. This is the first thing we need to do, and then we move along. The next step is to make sure all the supporting libraries are imported.
Once we have imported all the libraries, we start by loading a preprocessed dataset. When we install Squidpy, there are 10x Genomics datasets already bundled with the library; we can use them directly and derive patterns from them. From this preprocessed dataset, we derive certain ways to analyze the data. We then use the spatial functions that are in Squidpy to see how the analysis of the data works out. After that, we focus on studying linkages in the dataset and finding spatial patterns in it. It's very useful.
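A minimal sketch of those first steps is below, assuming a DevCloud notebook with the AI Analytics Toolkit kernel. The preprocessed Visium loader and the spatial-neighbors call are standard Squidpy APIs, but the exact dataset used in the talk is an assumption.

```python
# One-time setup in a notebook cell:
#   %pip install "squidpy[interactive]"

import squidpy as sq

# A preprocessed 10x Genomics Visium dataset bundled with Squidpy
# (assumed here to be the mouse-brain H&E sample shown in the talk).
adata = sq.datasets.visium_hne_adata()

# Build the spatial neighborhood graph from the spot coordinates,
# then plot the precomputed clusters over the tissue.
sq.gr.spatial_neighbors(adata)
sq.pl.spatial_scatter(adata, color="cluster")
```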
If we follow these steps, as shown on the previous slides, we are well placed to do different kinds of analysis in the field of medical imaging. Okay. After these steps have been followed, the next challenge is to visualize the data. We are able to study different kinds of datasets; as I mentioned on the flow slide for the Squidpy library, there are different kinds of supported formats that feed into the visualization. We bring that data in and run our analysis on it. It is very useful for studying medical imaging data, and we are able to analyze curated datasets from 10x Genomics.
We are also doing spatial statistics, that is, different kinds of graph analysis on the dataset, using the Intel AI Analytics Toolkit as well as Squidpy. So what we are using is Squidpy plus the Intel AI Analytics Toolkit plus Intel DevCloud. What we get from this experiment, or this case study, is something optimized for different devices and very useful for studying medical imaging data. We are able to analyze curated datasets from 10x Genomics, which are among the best curated datasets for studying biology, whether we are trying to get information about single-cell RNA structure or, say, cancer; all of those datasets are easily available. We'll go through this when we are dealing...
...with the demo itself. We are able to do spatial statistics and graph analysis, and we are working with huge medical imaging datasets, which is a very big plus point. When we are dealing with huge medical imaging datasets, we are trying to gather data that can be studied and applied to a cause, so that the final inference we run gives us better results; that is very important. In terms of results, we are able to analyze single-cell RNA data seamlessly. We bring the data into the pipeline and do analysis on it; it's a very easy and seamless process. We are able to import high-order gene data and work with gene expression data from 10x Genomics.
We are able to find clusterings of genes. As you can see from the images I've already shown, when we are dealing with this, we draw clusters with pinpoint colors to associate them with different kinds of statistics in the images. Let's take a look at the demo. The demo I will be showing runs on Intel DevCloud itself. For the entire demo to work, you have to have a DevCloud account; similarly, you can replicate the environment locally. Either way, the most important part is that you need to have the Intel AI Analytics Toolkit or the oneAPI-based toolkit on your system, or you have to use the DevCloud that's available. Let's take a look at the demo right now. We are in the demo.
Here you can see that we have opened up Intel DevCloud, and we are in the JupyterLab notebook server launched from DevCloud. A few key things are available: if you click on the plus button, you'll see the Intel oneAPI notebook, a PyTorch notebook, and a TensorFlow notebook. Apart from that, if you want to clone a particular GitHub repo, you can do it from the terminal. Let's take a look at the demo I have created for this use case. First of all, we have brought in all the necessary files needed to get started. Next, we have used the preprocessed dataset that's in the Squidpy library itself. After that, we have created the cluster annotation using a function called spatial_scatter.
As you can see, there are different images, with a color assigned to each cluster in the image. We are using a mouse brain here, and you can see the different patterns being created and the options available. Apart from that, if you want to look at the dataset we are working with, when we inspect an image, it can be split into three options: zero, one, and two. These are the three channels available. If I set channel-wise to false, you'll see only one channel. That's how you can segregate an image and get started with it. After that, we have used image segmentation: there is a watershed segmentation method that allows us to differentiate the cell structures.
After that, we compute segmentation features. From these segmentation features you can see, per channel, which one is more predominant; the per-channel intensity is plotted in the second row, and cortex one has a higher intensity in channel one. We can derive the features from this. After that, we extract the cluster features. That's how you get started. If you want to install Squidpy yourself, you will need the pip install squidpy command that I've already mentioned, and you can get started. I hope you had a good time going through the demo. Let's move on.
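For reference, the Squidpy calls behind those demo steps look roughly like the sketch below. The image container, layer name, channel, and feature choices are assumptions modelled on Squidpy's standard image-analysis tutorial rather than the exact notebook shown.

```python
import squidpy as sq

# Fluorescence Visium sample with its paired high-resolution tissue image
# (assumed dataset; the talk used a mouse-brain example).
adata = sq.datasets.visium_fluo_adata_crop()
img = sq.datasets.visium_fluo_image_crop()

# Watershed segmentation on channel 0 (assumed nuclei/DAPI channel); the result
# is stored as a new "segmented_watershed" layer inside the image container.
sq.im.segment(img=img, layer="image", channel=0, method="watershed")

# Summarise the segmentation per Visium spot (cell counts, areas, ...)
# and store the resulting feature table in adata.obsm.
sq.im.calculate_image_features(
    adata,
    img,
    layer="image",
    features="segmentation",
    features_kwargs={"segmentation": {"label_layer": "segmented_watershed"}},
    key_added="segmentation_features",
)

print(adata.obsm["segmentation_features"].head())
```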
Now, the key takeaways. Using the AI toolkit, we are able to work with terabytes of data, because the images that come in can be of varied sizes, from huge databases or datasets, and we are curating those datasets for our purpose; that's a very important part. We can study bulk genes in the genome structure. As you can see from the demo, we brought in different types of analysis patterns and utility functions to study the genome structure, and that was very useful for us. Now, key takeaways in terms of Squidpy: we focus the analysis on Squidpy itself because it's a very useful library.
We are able to build and analyze the neighborhood graph, that is, the patterns on which our analysis is based. We are also able to perform recognition on the spatial coordinates themselves. We can compute spatial statistics for cell types and genes, that is, we can segregate different kinds of patterns or, say, anomalies in the cells using these spatial statistics. We can efficiently store, analyze, and visualize large tissue images. When we scale those tissue images, we are able to go deep into the image structure and study it. For this, another library is used that leverages scikit-image itself.
It is a different kind of library that's being used for the image analysis itself. We can also interactively explore AnnData. AnnData is a particular data structure meant for single-cell RNA data and for working with tissue images, and it lets us study large tissue images as well. In terms of usefulness: we are able to understand how ligand-receptor interactions work, which is very important for single-cell RNA structure and for the different kinds of mutations that might form. That's a very important thing we are able to study from it. We are able to find structural anomalies in the single-cell RNA.
We are effectively studying parts of the image datasets, which is essential for studying rare diseases such as cancer, or any disease that's very new to us; we can definitely analyze it from this. These are the resources that I would recommend. That's my contact information: you can find me on LinkedIn. If you want to take advantage of the Intel oneAPI AI Analytics Toolkit, here is a link for it. My project on DevMesh is available, and I am also on YouTube, so you can connect accordingly. That's about it. Thank you for staying connected, and I hope this topic has enlightened your mind. Stay safe. Thank you.
Okay. Thank you, Abhishek, for that. Are there any questions? If you have any questions, please post them in the chat. Okay, I am not seeing any questions coming through, so we'll go ahead and wrap up the session for today. Thank you so much, Abhishek, for sharing your genomics work. Oh, just as I start to close it down, we get a question. Abhishek, what are some challenges in porting that you wanna talk about, please?
Yeah. What's the question all about?
What are some challenges in porting?
As you can see, porting these datasets, like the 10x Genomics data, is not free at all. We have to find out which libraries are essential for us to work with; I chose Squidpy because many important datasets from 10x Genomics were already in it. In terms of porting, when you start porting, you have to have knowledge of AnnData, because it's a different kind of data structure you are dealing with. Maybe you are trying to study hg19 data, that is, the human genome reference sequence; for that, you need different kinds of utilities to get into it. Most importantly, when you extract the data, it's pandas that you use.
Otherwise, the supporting libraries from Squidpy are there.
Any more questions?
Okay. We'll go ahead and conclude this presentation. Thank you so much, Abhishek, for presenting today. Go ahead and close this window and go back to the agenda page to join the next presentation, Introducing the oneAPI Community Forum. Thank you, everyone. Hi, everyone, and welcome to the talk, Introducing the oneAPI Community Forum. Our speaker is Rod Burns of Codeplay, and he will share with us how oneAPI started and how it continues to evolve. Rod has been helping developers to build complex software for well over a decade. While working at Codeplay Software, Rod is involved in providing, supporting, and building educational materials for developers using our SYCL product. Most recently, Rod helped to create SYCL Academy, a set of materials for teaching SYCL that have already been adopted by some of the top universities in the world.
If you have any questions throughout the presentation, put them in the chat box, and Rod will answer them at the end of his presentation. Over to you, Rod.
Thank you, Susan. Thanks so much for the opportunity to come and talk a bit about the oneAPI Community Forum. I appreciate it a lot. It's great to be here. I'm based in Edinburgh in Scotland, so it's great to be able to talk to you virtually with people all across different time zones. I'm glad to be here. Right. My name is Rod Burns. I'm the VP Ecosystem at Codeplay Software. This year, Codeplay has announced that we'll be leading what's called the oneAPI Community Forum. I've accepted the nomination to become the chairperson for that forum.
As Susan mentioned, I've been working with the developer community for the past six years or so, helping them learn how to use SYCL, supporting them, and trying to bring the whole community together to help each other. Increasing my focus on helping the community make the most of oneAPI is a natural extension of the work I've been doing with SYCL. My goal is to ensure that the whole developer community, all of you, can be involved in defining the direction of the oneAPI specification and its implementations. Let's set the scene a bit and try to understand what we're trying to achieve by bringing together this oneAPI Community Forum.
Working from the bottom of these points: it's clear that heterogeneous architectures are becoming increasingly multi-vendor, and I think this is a good thing. It brings competition and a lot of choice for developers. There is, though, a historical challenge for anyone who wants to write software for heterogeneous architectures: so far, we've not seen a common language or set of APIs that can really be used for different parallel workloads across different architectures. What this means is that there's a large investment for developers to migrate their software to new hardware platforms from one vendor to another. We need to find a solution to this problem, and I think the solution is a standards-based environment.
At Codeplay, we're motivated to take oneAPI to the next phase, and with that, an increasingly open governance model with broader participation from the overall community. This slide is really about the overall mission of oneAPI. The mission for oneAPI is to deliver an open, standards-based set of APIs for all different types of accelerators, whether that's CPU, GPU, FPGA, or maybe some of the new accelerators being designed for things like RISC-V. With oneAPI, there's already a specification that exists, and what it does is define the interfaces for a set of different libraries that provide common operations for various things.
Alongside that sits the SYCL programming model, which I like to think sits at the heart of oneAPI, and it brings that fully featured and production-ready SYCL compiler and runtime that you can use to deploy your software. What the oneAPI Community Forum exists for then is to set up the mechanisms to help us define this specification using the feedback that we get from the community, and alongside that, to bring that feedback to the implementations of that specification as well, to make sure that they work the best for developers that are using them.
What we're doing is looking to bring experts and leaders from the industry into these different groups, to bring that crucial feedback into the specification, so that we can ensure it's able to support a diverse set of processors, and also that it covers all of the crucial things that developers need when they're writing their software. Let me explain a bit about what the Community Forum is. The first thing is that it's a cross-industry group of hardware and software experts, people like you, and we would like your input on the specification.
It defines a standard specification, based on industry standards, for interfaces that can be used with a range of different accelerators. It consists of multiple groups that anyone can join for free, and in these groups you can provide feedback, raise discussions, and even present proposals for changes to the specification. Ultimately, we want to drive the future of open, standards-based accelerator computing, and we want you to be part of that. Let me talk a bit about why you should join and help us shape the oneAPI specification and its implementations. If you're a software developer, having this specification and implementations of oneAPI means that you can develop using open standards rather than proprietary interfaces.
It means you'll be able to develop one code base for all of your targets, whether that's, as I talked about, a CPU, a GPU, an FPGA or some other accelerator type. The standard libraries mean that the common operations that you use for things like math, like BLAS or neural networks, where you're running linear algebra, for example, these are already built. The interfaces exist based on open industry standards, and they're optimized for the processors that you need to use. You'll also have access to an existing set of software, open source projects and learning materials, a lot of which has been built using the SYCL open standard over the past number of years. Ultimately, you'll future-proof your software for the next generation of processors. It doesn't matter what vendor that they might be made by.
Perhaps you're a hardware developer, and by that I mean you design processors; there are still a lot of reasons for you to join and give us your feedback and input. For those of you who need to deliver a complete set of software with a tool chain for your processor, you can fast-track this by harnessing the existing specification and the implementations for your own processor, meaning your development efforts are shared with other organizations working on these open source projects as a collective community effort. The last thing I would say is that developers using your processors will be using a language and a programming model that they already understand, i.e. C++.
Here's a QR code and web URL that you can follow, and here's how you can contribute. You can use the QR code to get to the website, and you can get in touch with us to find out how to join the different groups. Once you join, you can submit, debate, and vote on proposals and changes to the oneAPI specification. The organization consists of a few different groups. There's a steering committee, which I'm the chairperson of, and there are special interest groups and working groups, and those groups discuss and vote on proposals and changes to the specification.
Here's a snapshot of the various groups that exist, and we're also hoping that new groups will also be formed on the back of discussions that are had for specialist areas that require that level of discussion. The language group covers discussions and early proposals around specifically ISO C++ and SYCL proposals at the moment. We hope to expand that to different languages, such as Julia and others beyond that. The AI group is covering the interfaces for the oneDNN library, which is interfaces for neural networks. The math group is covering interfaces for math operations, as you would imagine, but that's things like BLAS and LAPACK. The hardware abstraction group is crucial for enabling a wide set of processors. That group is helping us to understand how to make it easy to integrate new accelerator processors into oneAPI.
These are all groups that you can join and get involved with. This is the feedback loop that we are creating. The special interest groups, as we call them, for AI, math, language, and hardware abstraction, are providing input into multiple parts of the specification. The open source projects implement that specification, but we're also looking at how we can ensure that teams working on open source implementations can give and receive feedback that includes the community. I would ask you to join us and find out how you can influence the future of oneAPI with these groups. Here are some of the organizations that have been involved in oneAPI. There's lots that you will recognize there.
Really what I'm hoping to do is to be able to add your organization's logo to this page and get you to participate in the community forum and the groups we have discussing this crucial future. I'll spend a little bit of time on case studies of ways that companies and organizations have been involved with the oneAPI specification and the project. Fujitsu Laboratories adapted the oneDNN library for their Arm CPU as part of the Fugaku supercomputer, which helped them achieve some pretty significant performance improvements. They made changes to the oneDNN implementation and contributed those changes back into the open source project, and some of that work has helped us understand how to evolve the oneDNN specification too.
This is really about a sort of cross-national laboratory effort that's happening between Argonne National Laboratory, Lawrence Berkeley, and Oak Ridge, where they're trying to find common ways to program all the different supercomputers that they're building, whether they're pre-exascale or exascale machines. The partnerships that they're building with organizations, including Codeplay, where I work, is to enable SYCL, DPC++, but also some of the libraries such as oneDNN and oneMKL across all of these different hardware architectures. Some of these are using NVIDIA GPUs, some of them are using AMD GPUs, but also the Aurora exascale machine which is using Intel GPUs too.
They are part of the hardware abstraction group and some of the other groups, including the oneDNN group, to help us understand, again, how to evolve the interfaces and how to adapt the implementations to fit their needs. Please get involved. You can get in touch with me directly; I'm happy to receive emails at any time. My email is rod@codeplay.com, which is quite easy to remember. Equally, if you email oneapi@codeplay.com, you can get in touch with our team and find out how to join some of those groups. And yeah, please get in touch, give us your questions, and help us understand how to shape the future of heterogeneous computing. Thank you.
All righty. Thank you so much, Rod. A question I have for you is what's a good starting point? Go to oneapi.io, then maybe explore the spec?
I think so, yeah. I think a good place to start is to go to oneapi.io. There are quite a lot of articles there that talk about some of the case studies I mentioned, and the different people getting involved in the specification. The website links to the specification, but it also links to the open source project, so you can get an idea of the scale of the development that's going on to enable some of these quite complex libraries and the tool chain being brought to developers.
Oh, we have one question: Is SYCL similar to DPC++?
SYCL is similar to DPC++ in the sense that DPC++ is an implementation of the SYCL standard. The SYCL standard is managed by The Khronos Group, which is an open standards body. They define a set of APIs and interfaces for the standard, which allow you to write, I guess, parallel programs. It also allows a processor company, or anyone, to write a compiler which can turn those instructions into something that the processor can understand. DPC++ is an implementation of the SYCL standard, and it's part of oneAPI in the sense that the SYCL standard is connected to the oneAPI specification; it's required for the other parts of the specification to work. Yeah.
I think hopefully that explains.
Great. Thank you. We are getting ready to start our next session. Thank you so much, Rod. Appreciate it.
Thank you.
I hope that you in the audience will consider getting involved in the oneAPI Community Forum. If you have any more questions for Rod, you can head over to Discord after the event to ask your questions and continue the conversation. This concludes this presentation. Close the window, go back to the agenda page to join the next presentation, RISC-V Vectors and oneAPI: Accelerating the Future of Heterogeneous Compute. Thank you again, Rod. Bye.
Hello, everyone. Thanks again for continuing to stay with us during the Dev Summit. We'd like to start the next session. The topic here is RISC-V Vectors and oneAPI: Accelerating the Future of Heterogeneous Compute. Our speaker today is Stefano Suttora. A little bit about Stefano: he is the director of technical programs for RISC-V International, and he has developed and managed numerous open source initiatives in software and hardware over the course of his 20-year career in technology. I would like to add that if you have any questions throughout this presentation, feel free to put them in the chat box for Stefano to answer.
It gives me a little extra pleasure because Stefano and I have worked together for quite some time, so it's great to see you here. I yield the floor to Stefano.
Great. Thank you very much, Sri. It's great to be here. Thank you for having me. Yeah, my name is Stefano, and I'm the Director of Technical Programs for RISC-V International. This is a pretty brief presentation, so I'm gonna speed through some of these slides; there are plenty of links in there for you to follow, and you can learn more by following them. First, I'll start with a quick overview of what RISC-V is, for those of you who haven't heard of us or of the ISA. I'll go over some of the details of where we're at today, specifically with regard to vector and AI and ML. Then we'll talk a little bit about where we're hoping to be in 2023 and what the future looks like for RISC-V and for oneAPI.
First off, who is RISC-V? Who is RISC-V International? I like to think of us as the support team for the ISA. We not only support the existing base ISA and extensions, but we help to ratify specifications, extensions to that ISA. It's more than that, though. We actually work with communities, so with the software ecosystem, the firmware ecosystem, and with hardware implementers to try and make RISC-V a community effort rather than just one organization. How do we go ahead and do that? We do that by using the RISC-V ISA and extensions. Let me go over briefly what those look like. The base ISA for RISC-V is essentially the foundation. It tells you how many bits your core has, 32 or 64 bit. It'll describe the op codes and things like CSRs.
What then happens is you layer on top of those extensions, and those extensions do the things that compute needs to do for the specific workload. You might think of things like multiplication, vector processing. These are all extensions that you add to the base ISA. We realized that in order for folks to create SoCs that are standardized or across a group, we need a way to group those extensions together. We've done that with RISC-V profiles. We've frozen the first RISC-V profiles, and they'll be ratified next year. What they do is think about something like Linux. If you're gonna boot Linux, you wanna be sure you have extensions like integer math, multiplication and division, floating point, double precision, and compressed instructions.
Rather than checking to see if every SoC that you're using has those, you can simply ensure that they implement the RISC-V profile. By implementing that specific application level profile, you know that Linux will run and that your microarchitecture supports it. We realize it's more than just the microarchitecture when it comes to supporting something like Linux. In 2023, we're gonna be working on defining different RISC-V platforms. These are ways for the community of implementers of RISC-V to align around common platform types. Staying on our Linux example, a platform that would support Linux would need to talk about interrupts, how memory works, what the security model looks like around that memory, and most likely, some discussion around bootloaders and ABIs. All of that stuff will happen in the platform for RISC-V.
By creating a compatible platform, implementers know that they're handing their customers something that will work the same across different implementations. That's how profiles and platforms work together: as software engineers, you don't have to worry about whose hardware you're buying, as long as it implements and is compatible with that platform. Now, we realize there are gonna be extensions that don't fall into those categories, and I just wanted to assure folks that we do consider those. Part of the due diligence we do in ratification is ensuring that any standard RISC-V extension has a proof of concept and the software support that's needed. Whether it's a cryptography extension or any other extension standardized in RISC-V, it will have software support.
What this does is it opens up the options for implementers to meld the best of open source with whatever commercial interest they have. While you can create a completely open source implementation of RISC-V from the base ISA all the way up through software, through Linux, the key to RISC-V success is that we allow proprietary implementations to live alongside open source implementations. That way, customers can choose which parts of the RISC-V ISA they wanna implement as a standard and open source, and which parts they wanna create custom extensions to, and create an implementation that speaks to their value add. One of the benefits to this is that we can learn from in-the-field testing and bring a lot of that work that gets done as a custom extension into the standard if it's applicable.
The way that these groups work together is around these compliant implementations, and I just wanted to touch briefly on what it means to be compliant. We have three open source software tools. The Spike Simulator and the Sail Golden Model work together to create a model that any implementer can use as the standard that they need to attain. They can run compatibility tests, which are written in Python, against that model and their implementation. By using these 3 open source tools, implementers are able to take whatever implementation they've done on a base ISA profile and platform, and even optional extensions that they've implemented and ensure that they're compatible across different RISC-V architectures. We've talked a little about how you can have a compatible implementation of RISC-V and what the benefits are of the combination of custom and standard extensions.
We'll talk a little bit about that later. First, let's go over who's using RISC-V today and sort of what the current state of RISC-V is. I joined RISC-V in 2019, and I can say from personal experience, it's been an amazing period of growth over these past few years. The projection for growth keeps getting steeper the longer I've been here, and it's exciting to see this kind of growth that crosses industry. We'll talk a little bit about these different industries and why RISC-V crosses them, but what's really important to me is the growth that I've seen in members. The membership growth since I've been here has been astonishing. The first technical meetings that I attended were leadership of maybe 12 people.
Today, when I go to our technical steering committee meeting, there are 30 member organizations represented and four different elected roles in that body. The leadership that we have in RISC-V today has grown immensely, and that's reflected in our membership base. The membership base is also completely global. I work every week with folks from Chinese Academy of Sciences, IIT Madras, over to the European Processor Initiative, Barcelona Supercomputing Center, and countless organizations here in the States. It's truly a global organization, and one of the benefits to that is that, as you notice, there are 2,000 plus individual members. That individual membership is completely free, and that opens up the door to modifying the RISC-V ISA and providing new extensions to more than just commercial entities, but to individuals as well.
That gives us a diversity of thought and the ability to group together around a wide and diverse community. Let's touch a little bit on those industries that I mentioned earlier. Obviously, I'll talk more about AI and ML, one of the things to note about RISC-V is that we've got a ton of traction already in IoT and edge compute. I think in 2023, you're gonna see a lot more hardware produced. We did as a smaller organization and a smaller ISA, we did suffer a lot from the chip shortage and the supply chain issues. As those clear up, I think you're gonna see more innovative use cases at the edge that take advantage of RISC-V. Let's talk about some of the stuff that I see coming that I think this group will find particularly interesting.
From the data center, specifically in HPC, there are several uses of RISC-V that have already been implemented. Let's talk about those first. No, sorry. Automotive first. In automotive, we have several different efforts that are currently underway. Mobileye has switched their computer vision efforts over from MIPS to RISC-V, and groups like Andes and Renesas and Imagination Technologies have come together to discuss functional safety, ISO 26262 and ASIL B and ASIL D certification. The way we combine this effort is through what we call a special interest group or a SIG. The Automotive SIG just got started this year. It's headed up by Imagination Technologies, and it's gonna look at the possibilities that can happen in automotive and what we need to do to further those efforts.
I'll also mention that Renesas has a development board that they'll be shipping in 2023 that actually uses RISC-V as the application-level processor, rather than as an accelerator-type processor. Obviously, we'll talk a lot more about AI and ML in the coming slides, but I just wanted to briefly mention some of the work being done, specifically by companies like Esperanto, which is shipping a 1,000-core tensor processing unit, and StarFive, which is doing things in AI and visual processing. Again, all of these companies are able to come together at RISC-V around a special interest group that just spun up around AI and ML. I'll provide links later in the deck, but the idea is that this group acts as a think tank, distributing work among the different areas it identifies through gap analysis as priorities.
Those priorities then spin up different task groups that can go off and do the work that needs to be done. Lastly, certainly not least, is high performance computing. HPC, we work pretty regularly with groups in Europe like E4, the European Processor Initiative, and I mentioned Barcelona Supercomputing Center, Technical University of Munich. HPC is really a global effort, and we see folks like Tactical Computing Laboratories here in the States that are tackling different interesting problems with HPC. One of the benefits to RISC-V is that we're able to bring these folks together and achieve a common set of goals together. Tactical Computing Laboratories is heading up a software test infrastructure. As many of you will know, in high performance computing, you're dealing with software stacks that range from Fortran all the way up to modern-day stacks.
Those stacks are customized depending on the workload. That customization means the build process can often be cumbersome and the stack itself can be complex. Well, Tactical Computing Labs has created a continuous integration environment where we can test all these different stacks on RISC-V to ensure that your project can get up to speed as quickly as possible. Those are just a few of the industries that are taking advantage of the flexibility of RISC-V to further what we can do with compute. Let's talk a little bit about vector and a little bit about floating point, which I'm guessing is what most of the folks at this event are gonna be interested in. Vector on RISC-V is often referred to as RVV.
It was ratified at the end of last year at our RISC-V Summit in San Francisco. It's a single RISC-V extension with variable vector lengths. Regardless of the processor, whether it's 32-bit or 64-bit, and regardless of the length of the vector registers, it's the same vector extension, so the code doesn't change. This allows RISC-V vectors to map onto existing neural network models and other kinds of models. Regardless of 32-bit or 64-bit, the data types being used, or the width of the registers, RISC-V's vector extension can map onto all of that. It also allows for smaller code size and lower power consumption, which we'll talk about in the next slide, depending on implementation. A little bit later, we'll talk about how this enables open source extensions.
Obviously, in Vector, we're gonna talk a little bit about some proprietary extensions that already exist to take advantage of Vector. RISC-V is all about taking the community effort and rolling it back into the standard, and Vector is a great example of that. Let's talk a little bit about the difference between RISC-V vectors and traditional SIMD. I mentioned that there are variable vector register lengths. What this means is that at runtime, the width of those vectors can be configured to be a multiple of the base size of the register, and we'll talk about that in detail in a couple of slides. This is all up to a maximum of whatever the hardware length is set to, and the implementer gets to choose that hardware length.
VLEN, the maximum length of these registers, is chosen by the implementer. The vector ISA is agnostic to this VLEN. Regardless of how large these registers are, the ISA itself, the vector extension, doesn't change. This allows us to use a unified code base. RISC-V often takes advantage of the second mover principle: we can look at what other architectures have done historically and try to do things with that in mind. There's also no requirement for dedicated vector memory, as it uses system memory for its computation. I've put this slide together. I'm not gonna go into it in detail, but it gives you a picture of some of the things we were thinking about when we were making the RISC-V Vector extension.
We're not trying to reinvent the wheel, but we are trying to look historically at what's been done in Vector and where can we improve things. One of the keys is we actually work with compiler folks. We work with the LLVM community and the GCC community to ensure that as we build these standards, we're not making their lives harder. I think it was someone at Intel that once said that, "Hardware without software is just heat." Really what we'd like to do is take advantage of the fact that we can work with these open source communities to build a standard that enables software to get as much work done as possible without reinventing the wheel and making them recreate a bunch of work that they've already done. I mentioned the variable vector widths a couple of times.
You can see here that we've called that out: narrowing and widening and mixed-data-type vectorization are all supported. How we support that is a little complex, but I'll try to go over it quickly. Part of what we use is a CSR called LMUL, which you can think of as grouping the vector registers in a specific way. Let's say you have a traditional 32-register setup, and let's say each register is 32 bits wide. If LMUL is set to 1, then you're essentially gonna use those registers as you would any other. You have 32 of them, and they are each 32 bits wide. However, you can scale that either up or down.
You can set LMUL to multiply up by two, four, or eight, or go down in the other direction if you're looking for smaller widths. Let's say you set LMUL equal to two. What you've essentially done is cut the number of logical registers you have access to in half. Now you have half the number of registers, but those registers have doubled from 32 to 64 bits. By addressing v0, the first of the vector registers, rather than a single 32-bit register, you now have both of those registers, v0 and v1, acting as one 64-bit register. You can think of instances where this might be helpful.
If you're trying to multiply two 32-bit registers, you need a 64-bit register to store the result. Rather than having to change how the system works or use a different system, the RISC-V vector system itself is flexible enough to double the vector width. That keeps going: if you set LMUL equal to four, you now have 4x the width and a quarter the number of registers to use. This keeps RISC-V's vector implementation flexible enough that, at runtime, this variability allows a flexible implementation that can be mapped onto many different kinds of models. The idea here is that we want fewer instructions. The small sketch below just works through that register arithmetic.
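To make the trade-off concrete, here is a minimal illustrative sketch (plain Python, not RISC-V code; it assumes the example numbers from above, 32 vector registers of 32 bits each, and real hardware expresses this through the LMUL setting rather than a helper function like this):

```python
# Illustrative only: LMUL trades the number of logical vector registers
# for the width of each logical register.

def effective_layout(num_regs: int, vlen_bits: int, lmul: int):
    """Return (logical register count, logical register width in bits) for a given LMUL."""
    return num_regs // lmul, vlen_bits * lmul

for lmul in (1, 2, 4, 8):
    regs, width = effective_layout(32, 32, lmul)
    print(f"LMUL={lmul}: {regs} logical registers, each {width} bits wide")

# LMUL=1: 32 logical registers, each 32 bits wide
# LMUL=2: 16 logical registers, each 64 bits wide  (v0 and v1 act as one register)
# LMUL=4:  8 logical registers, each 128 bits wide
# LMUL=8:  4 logical registers, each 256 bits wide
```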
We want things to be simpler to use in the field: efficient parallel computation whether it's scalar or vector, regardless of the data type you're using, floats, doubles, or integers, and regardless of the size. These registers can also be broken down into different effective working sets. While your register might be 256 bits wide, you can split it up into working elements of 4, 8, or 16 bits, allowing you to address many different elements inside that one vector register. Companies are already starting to take advantage of this. SiFive is one example. You can see how, by applying RISC-V Vector, they saw a 24x speedup, and then they took their value-add, right?
They went out and wrote a custom extension to layer on top of this vector extension and find even more speedups that they could get. The idea that we've reserved opcode space for custom extensions allows companies like SiFive to go off and do innovative work and build their value-add into their implementations. It also allows folks who are interested in doing open source work at the edge to do the same thing. Folks can go out and create custom extensions to try out their ideas. As the RISC-V community sees those custom extensions being worked on, we can work with those groups to see which of those ideas are worth bringing into the standard so that everybody can benefit from that work. It really is the best of both worlds.
Folks can go off and implement custom extensions that add value to their commercial products, and open source groups can go off and create custom extensions that may someday work their way into the standard itself. I've just got two minutes left. I want to go over a little bit about what the future is for Vector and for RISC-V in AI, and about how you can contribute to this future. I mentioned before that we have special interest groups. Obviously, we have a special interest group for Vector that we're spinning up. We just finished the first Vector extension. We're going to spin up a Vector SIG, and that SIG will be able to create task groups to go tackle the next problems that we're going to solve in Vector. I've listed a few of them there.
We're looking at different half-precision extensions to add on to Vector. We're also looking at how packed SIMD can work together with the current vector extension. We're looking at matrix multiplication as a possible extension. I highly recommend folks get involved in this group if they're interested in participating in those discussions. We also have a floating point SIG that just got spun up. Kenneth from Imagination Technologies is gonna be running that. I'll mention briefly that you can see a lot of links here to mailing lists. Those mailing lists are publicly readable. To contribute on these mailing lists, you do need to become a RISC-V member. I would just note that it is free for individuals and very cost-effective for corporations.
All of our work is done on GitHub, so you can see here the floating point SIG is already gonna start publishing their strategy, gap analysis, and priorities on GitHub, and anyone can comment and make suggestions on GitHub. Likewise, we're spinning up task groups to actually go off and do the work to actually write the specifications. bfloat16, which is probably familiar to most of you, that is already being worked on. That's being worked on by Ken from a company called Rivos. You can see a link here to the mailing list and of course the charter and documentation, but the specification itself is also already being worked on and in GitHub. We actually encourage folks to comment early. We have a 45-day review period before we ratify a specification, but we encourage that folks get involved as early as possible.
Read through these specifications and give us your feedback. It's easier for open source communities to do their work together if we start off early on in the process, giving feedback and making suggestions. There's also a lot more instructions to come with bfloat16. As you look at the specification, keep in mind, we're trying not to boil the ocean on our first try. We're trying to take a part of bfloat16 and implement it, and then do more in the future. We also have active software committees that are currently working, and we've actually divided up the work in an effort to get more done in 2023. We've sort of separated our software world out into two parts. The first part, which we call the Privileged Software Committee, is concerning itself with the low-level operating system firmware software.
That committee is gonna focus on how we can shape our ABIs and our firmware services to comply with what's already out there, so that folks don't have to reinvent the wheel. Along with that kind of risk reduction in adopting RISC-V, we also want to attract more applications. We have an Applications and Tools Software Committee that's gonna look at the higher-level applications, everything from compilers all the way up the stack. We're gonna try to prioritize the gaps that the community sees. If the community needs to see more work done in security libraries like OpenSSL or in specific compilers like LLVM, we're gonna try to put our efforts there.
We highly recommend that folks get involved in this, and this is one of the places where oneAPI and RISC-V will be working together quite a bit. I just wanted to mention briefly some of the leading organizations that have joined RISC-V that are working on AI and ML. I'm sure many of these are obvious and familiar to many of you, but we're excited to work with these companies not only because they're RISC-V members, but because they're active in the oneAPI community. I think that this cooperation between these two communities amongst these organizations is where we'll find the most success. Just in summary, I wanted to talk about some of the things that are coming in 2023. We have RISC-V Vector C intrinsics that are gonna be worked on.
There's actually a task group that's currently spun up to do that. I mentioned AI and ML special interest group, and we also have the Vector SIG working on matrix multiply and most likely some second phase of the Vector specification. I mentioned LLVM improvements. Those are things that are going on both inside and external to the RISC-V international organization. Of course, interactions with oneAPI and that engagement, we're really hoping to see that pick up in 2023. I know specifically efforts around oneDNN and SYCL are places where the RISC-V members are contributing, and I'm looking to spread that out through more of the oneAPI community. How can you get involved? I've left some links up here.
These are the groups that you can join, or feel free to just participate on our wiki, or over on our mailing lists. Thank you very much, and I appreciate your attention. Please feel free to send me any questions. My email should be up on your screen.
Fantastic. Thank you, Stefano, for your talk. It's really amazing to see the progress RISC-V is making and the great growth that's happening. I don't have any questions other than: what are CSRs? Somebody did answer it in the chat, but maybe it would be nice to go into it a little bit. I assume it's control and status register?
Sure. Yeah, it's a control and status register. Part of what you can do is take a look at the RISC-V ISA, at the base ISA, because that's where a lot of the registers are called out. Then as you look at extensions to that base ISA, you'll find that those extensions may add CSRs or perhaps use the CSRs somewhat differently. As you read the RISC-V ISA, you really want to pair the unprivileged and privileged documents with whatever the latest extension is that you're interested in. Between those documents, you'll get a good idea of what's going on.
Wonderful. Are there any other questions for Stefano out there? I do have one. I think you have a PSA for the RISC-V Summit next week.
Oh, yeah, absolutely. Next week is the RISC-V Summit. It's gonna be Tuesday through Thursday in the Bay Area, in San José, and it's gonna be both virtual and in person. Feel free to go over to riscv.org, and right at the top you'll see the notification for the summit. We have a lot of great talks, and we'll also likely have many great announcements at that event. If nothing else, keep an eye on our X feed, as I'm sure that'll be quite busy next week.
I put the link for you in the chat. That's about all the time we have for this session. Thanks to all of you for attending, and if you have any questions for Stefano, please mail him at stefano@riscv.org. Feel free to close the session, and we'll start our next session, Leveraging Default Intel Optimizations for TensorFlow, which is gonna be a hands-on session. Hope you all enjoy that. See you there.
Hi, everybody. Welcome to the hands-on training, Leveraging Default Intel Optimizations for TensorFlow. Our speaker is Sachin Muradi. He will show us how to fine-tune a pre-trained model for image classification using the Flowers dataset. Sachin has been at Intel for five years and is part of the team that works with the TensorFlow oneDNN direct optimizations, which helps integrate Intel oneDNN optimizations for CPU into Google's open source machine learning framework, TensorFlow. If you have any questions throughout the presentation, put them in the chat box and Sachin will answer them at the end of his presentation. Over to you, Sachin.
Thank you, Susan. Hello, everyone. I hope you're not too tired after today's sessions. My name is Sachin Muradi. We're gonna do a hands-on session with TensorFlow, leveraging Intel optimizations in TensorFlow. Let's get started. One second. Hopefully you can see my slides on your screen. Mainly we're gonna talk about TensorFlow. I think most of us already know what TensorFlow is, but if you don't, it's Google's machine learning framework, designed to let researchers and developers push machine learning boundaries.
Even if you're not using TensorFlow actively, if you use Google Search or a Google Home Mini at home, you are already using applications that are developed on top of TensorFlow. Since we're gonna talk about Intel optimizations in TensorFlow: Intel optimization in TensorFlow comes through the oneDNN library, which is Intel's deep neural network library. It's a cross-platform, open source library that supports a bunch of data types like float32, FP16, bfloat16, and quantized int8 formats. It supports all the compute-intensive operations used in deep learning and neural networks, like convolutions, GEMM operations, and pooling, as well as the memory-bandwidth-limited operations listed here below.
It does all the optimizations for you. As a data scientist or an application engineer, when you develop and run with TensorFlow, all the parallelization and optimizations are taken care of for you by Intel oneDNN. What happens when you combine TensorFlow with Intel oneDNN? This is just a small graph I wanna show you of the kind of benefit you get when you use oneDNN with TensorFlow. The first graph here shows the comparison between TensorFlow 2.8 and 2.9. 2.9 is the TensorFlow where Intel optimizations are enabled by default, so the speedup you get with 2.9 already has Intel oneDNN enabled in it.
You can see on the side that the performance shoots up to around 3x compared to TensorFlow 2.8. The first diagram shows the throughput case, and the second diagram shows the latency case, which is more real-time. How does TensorFlow achieve it? With oneDNN: aggressive operation fusions and use of the advanced instruction sets in your hardware, like AVX-512 and the upcoming AMX, the Advanced Matrix Extensions instructions. With oneDNN, you get better TensorFlow performance when you run it.
I just talked briefly about the bfloat16 data type. Today I'm gonna show you how you can use it, how easy it actually is, and what the benefits are. For people who don't know what the bfloat16 data type is: we already use the single-precision float data type, which has an 8-bit exponent for dynamic range and a 23-bit mantissa for precision. bfloat16 lies somewhere between that and FP16, which you might have heard of, since a lot of GPUs use the FP16 data type. bfloat16 uses the same dynamic range as float32, but it has less precision.
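As a quick illustration of that trade-off (this is not part of the lab notebook, just a small sketch using a standard TensorFlow cast):

```python
import tensorflow as tf

# bfloat16 keeps float32's 8-bit exponent, so the dynamic range survives,
# but it has a much shorter mantissa, so values are rounded more coarsely.
x = tf.constant([3.141592653589793, 1e-30, 65504.0], dtype=tf.float32)
roundtrip = tf.cast(tf.cast(x, tf.bfloat16), tf.float32)

print(x.numpy())          # original float32 values
print(roundtrip.numpy())  # the same values after a round trip through bfloat16
```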
Since you have fewer bits, you'll see a smaller memory footprint and more compute operations per second. When you use bfloat16 with Intel oneDNN, there is very minimal accuracy loss, and the performance is much, much better than FP32. On Intel platforms, it is supported on third-generation Xeon, code-named Cooper Lake, and the upcoming fourth-generation Xeon Scalable, code-named Sapphire Rapids. How do you enable all these bfloat16 optimizations? You don't have to train your model again in bfloat16, and you don't have to go into each layer and change the weights or biases to bfloat16.
All you need is just one line of code change, which I'll show you in our hands-on demo. Now that I've talked about the oneDNN advantage in TensorFlow, how do we enable oneDNN? If you are using TensorFlow 2.5 through 2.8, we have a runtime variable, as mentioned on the slide. If you set it to one, you will have oneDNN optimizations enabled in TensorFlow, and you will see better performance. If you're using TensorFlow 2.9 on an Intel platform newer than second-generation Intel Xeon, you will see the performance out of the box. You don't have to do anything; it's included by default.
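For the TensorFlow 2.5 through 2.8 case, the runtime variable on the slide is the TF_ENABLE_ONEDNN_OPTS environment variable. A minimal sketch of setting it from Python (it has to be set before TensorFlow is imported to take effect):

```python
import os

# Enable Intel oneDNN optimizations in TensorFlow 2.5-2.8.
# In 2.9 and later on recent Xeons this is already the default.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf
print(tf.__version__)
```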
I would encourage everyone to use oneDNN in TensorFlow to see better performance. With that introduction to Intel optimizations in TensorFlow, let's move to the hands-on session. The goal of our hands-on session is really to see the ease of use of Intel optimizations in TensorFlow, and we'll see that with a very popular example: transfer learning. We'll train a model and then we'll deploy it. For the setup of this entire hands-on session, we already have a cloud instance set up for you.
If you have a browser with you right now, we'll just go on and do it ourselves. We are using TensorFlow 2.10, and that's it. No additional library installation or driver dependencies. In our end-to-end example, we'll see a transfer learning use case where we'll train a pre-trained model that's available in the TensorFlow Hub repository, then optimize the trained model for inference, and then deploy it using TensorFlow Serving. With that, let's move on to the hands-on session. Before that, I have just one instruction for you. We have cloud instances set up for you.
There will be a bunch of IPs listed in a Google Sheet. I would ask every participant to write their name down in front of an IP. Let me first share the link to the Google Sheet. Just give me one second here. I have the link now; I'm gonna paste it in the chat box. I hope everyone has the link. Why don't you open it? I'm gonna start sharing my screen now to show you the Google Sheet and how you can fill it in. Just one second. Let's start. You should see the Google Sheet in front of you.
Take any IP address on the Google Sheet and just write your name down next to it. This is just to make sure everyone has a unique IP and no one is using someone else's notebook to run their application. Let's just wait 30 seconds so that enough people get their Google Sheet assignment. Okay, I'm seeing people writing their names down here. Just repeating it one more time: you have the Google Sheet link in the chat box. Once you open that link, you will see a bunch of IP addresses, and you can write your name down in front of an IP address so that you'll have a unique cloud instance. Okay.
I think we have enough people signed up for IP addresses. Why don't we all open the IP addresses that we have? You should see something like this. This is our notebook, oneAPI-dev-summit-tf_lab.ipynb. Let's click on that so it opens the notebook. At the beginning of the notebook, you should see it says Leveraging Default Intel Optimizations for TensorFlow. I hope everyone is with me up to this point. Okay. Now that we are here, the notebook is divided into roughly four sections.
First we are going to do transfer learning, then we are going to export the trained model in the SavedModel format. We'll optimize the SavedModel to run faster inference, and then we'll deploy that model using TensorFlow Serving. For people who are not familiar with Jupyter Notebooks, it's a really simple, interactive way to run your Python code. When you are on a particular cell, all you have to do is click the Run button, and it will execute those Python commands for you. We are gonna do that with our first cell, where we are importing all the Python libraries.
Once you're on this cell, just click Run, and then you should see that we are using TensorFlow version 2.10. I hope everyone is able to do that. Now, let's get to our application. In our transfer learning use case, we are going to use the very popular TensorFlow flowers dataset. The first step is to download the dataset, and then we are going to prepare the data for our training. With that, let's execute the next cell here.
At the end of the cell, you should see how many files there are, divided into five classes, and you should see the classes of flowers on your screen: daisy, dandelion, roses, sunflowers, and tulips. Now that we have downloaded the dataset: our data can have randomness in it, so in order to have unified data to train our model, we do normalization. This is the preprocessing that you do on your data. In the next cell, what we are going to do is actually normalize the data. There is also prefetching we are going to do on the data.
Prefetching is nothing but this: when you're training the model, you fetch images before running each step of the training. When you prefetch, that data is stored in a buffer so that you don't have to go back to memory each time you're fetching images. TensorFlow provides this prefetching mechanism with tf.data. Okay. With that, let's execute the couple of cells that we see on the screen now. I'm gonna execute this one, then the normalization and prefetching, and then I'm gonna execute the next one. Okay.
Once you are at this point, you should see that you have the data and that its shape is 512 by 224 by 224 by 3. 512 is the batch size that we have set for our training. You can play with it after the session if you wanna change it to a different batch size. Now that we have the data prepared, let's move on to the transfer learning part.
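The exact cells are in the lab notebook, but a minimal sketch of that data preparation (download, normalize, prefetch) could look like the following; the dataset URL, image size, validation split, and variable names here are assumptions based on what was described above:

```python
import tensorflow as tf

# Download the TensorFlow flowers dataset (assumed URL for the public copy).
data_dir = tf.keras.utils.get_file(
    "flower_photos",
    "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True,
)

IMG_SIZE = (224, 224)
BATCH_SIZE = 512  # the batch size used in the session

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=42,
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
)

# Normalize pixel values to [0, 1], then prefetch batches so the input
# pipeline overlaps with training instead of going back to storage each step.
normalize = tf.keras.layers.Rescaling(1.0 / 255)
train_ds = train_ds.map(lambda x, y: (normalize(x), y))
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
```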
For the transfer learning, we are going to use a ResNet-50 pre-trained model, which is available in TensorFlow Hub, a repository of pre-trained models that you can use. There are models for recommendation, natural language processing, and computer vision. We are going to use the ResNet-50 model that is pre-trained on the ImageNet dataset, a different dataset consisting of thousands of images across 1,000 classes. We are going to load that model. Okay, let's first take a look.
Yeah, let's first execute this cell, so that we have the link to the TensorFlow Hub model. The model we download from TensorFlow Hub is already pre-trained. What we are going to do is take that model and wrap it in a Keras layer in TensorFlow so that we can add our own classification layer. One thing to mention: the models that are available in TensorFlow Hub are headless models. What do I mean by headless model? It doesn't have the classification layer.
With a headless model, it won't have its own classification layer. When you're using it for transfer learning, you can bring your own set of data, add your own classification layer, and then train it on your own data. That's what's meant by a headless model. We are going to wrap that headless model in a Keras layer, and then we'll add our own classification layer, which will classify only the five classes of flowers. The next cells are basically doing that. Let's first wrap it up in a Keras layer. It should take a couple of seconds.
Once you have it in the Keras layer, we are going to write our own classification layer to classify into only the five classes, so that when we run inference, we'll see one of those five flowers. Let's run the next cell. Okay. Now it should print out the model summary. You can see we have the feature extractor from the ResNet-50 that we downloaded from TensorFlow Hub, and then we have our own classification layer. For transfer learning, we are only going to train the last layer, because we wanna classify only five classes.
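A rough sketch of that model-building step, assuming the standard headless ResNet-50 feature-vector module from TensorFlow Hub (the exact Hub handle used in the lab may differ):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Headless (no classification head) ResNet-50 feature extractor; assumed handle.
FEATURE_EXTRACTOR_URL = "https://tfhub.dev/google/imagenet/resnet_v1_50/feature_vector/5"

feature_extractor = hub.KerasLayer(
    FEATURE_EXTRACTOR_URL,
    input_shape=(224, 224, 3),
    trainable=False,  # freeze the pre-trained backbone; only the new head trains
)

num_classes = 5  # daisy, dandelion, roses, sunflowers, tulips
model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(num_classes),  # our own classification layer (logits)
])

model.summary()
```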
Since we are only training the last layer, when we wrapped the feature extractor from ResNet-50, we specified the trainable parameter as false, so that it won't train the entire ResNet-50, only the final classification layer. Now that we have the Keras model, we are going to move to the training part. Before the training part: TensorFlow, through the Keras APIs, provides a callback mechanism where, while training or running inference, you can have your own callback functions to report additional data if you need to. Here, what we are going to do is print out the throughput while we are running training.
We have our own callbacks for that, so let's just execute this cell, and the next one. Here, remember I talked about the bfloat16 data type. I wanna show you one fun thing right here. In the next step, we are going to compile the model and train it for 10 epochs. Now that we have this notebook on everyone's browser, I have another notebook running on a next-generation Intel Xeon processor, which has support for bfloat16. What I'm gonna do is enable the bfloat16 data type on it.
Only I have access to that notebook; the one that we are all using is on the cloud instance, which is an Ice Lake processor. To enable bfloat16, remember I mentioned there is only a one-line code change that you have to make. This is the code that is going to enable bfloat16 support. Before I compile and train the model, which we are going to do in our notebooks, I'm just gonna set this configuration for bfloat16 and run this cell. Now we are both at the compile-and-train step.
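The transcript doesn't show the exact line used in the lab, so as an assumption, one commonly used single-line way to turn on bfloat16 mixed precision in Keras is the global policy below; the point the speaker is making is that the rest of the compile-and-fit code stays unchanged:

```python
import tensorflow as tf

# Assumed one-line bfloat16 switch: the Keras mixed-precision policy.
# (The lab may use a different but equivalent config knob; either way,
#  nothing else in the training code has to change.)
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

# `model` and `train_ds` are the objects built in the earlier sketches.
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_ds, epochs=10)  # 10 epochs, as in the session
```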
What I'm gonna do is start training on the bfloat16 side, and at the same time we are all going to start training on our cloud instances. Why don't you click Run on the next two cells so it starts training. You should see it go through each epoch. I hope everyone is with me up to this point; you should all be seeing training started on your cloud instance. Okay, this may take a few seconds. While it's running, you can take some time to look at the notebook. I'll also show you one fun thing that is happening with bfloat16.
Okay, now on your screen, I'm assuming everyone is around epoch 3 or 4. What happened with bfloat16? You can see the bfloat16 run is already on its ninth epoch. It's training almost twice as fast as FP32. With bfloat16, you can actually converge your model much faster than you would have with FP32. You see I'm still on epoch 4, about to jump to epoch 5, while on the bfloat16 side I am already on the thirteenth epoch. All I did was one config change, just a one-line change in our script to enable bfloat16, and it's the same script for training the model. Okay.
I hope everyone is with me up to this point with no issues. Even if you hit any issues, no problem; we'll have a question-and-answer session later. You will have access to this notebook even after the session, so you can explore and experiment with your own data, your own batch size, or your own use case. I'm just gonna check how my bfloat16 training is doing. It's almost at the 22nd epoch; we are on the eighth epoch right here. You can also try these bfloat16 optimizations on the next-generation Intel Xeon processor, which I think is available right now as a preview on AWS or Google Cloud instances.
You can do amazing things, faster, with bfloat16 on the next-generation Xeon processor. Now we are almost at the last epoch. These 10 epochs aren't really enough to train the model fully, but since we have limited time today, we are just going to train for 10 epochs, and we see the validation accuracy is around 82%. You can add more epochs to converge it to something like 98% or 99%. For the time constraint, we are stopping at 10 epochs.
Now that we have done the training part, what we are going to do is export this trained model into the SavedModel format. For that, let's just run the next cell. It should export the trained model into the SavedModel format. Okay, now that it has exported the model, we are going to run inference on this SavedModel, which is already trained and which we just exported. For that, we have a benchmark script which runs on dummy data for multiple iterations. Right now we are doing five warm-up runs and then 50 iterations.
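A hedged sketch of those two steps, exporting to SavedModel and then timing inference on dummy data. The lab's actual benchmark script isn't reproduced in the talk, so the path, warm-up count, and timing loop below are just assumptions that match the description:

```python
import time
import numpy as np
import tensorflow as tf

EXPORT_PATH = "models/my_saved_model"   # assumed path
model.save(EXPORT_PATH)                 # `model` is the trained Keras model from the sketches above

# Reload the exported model and benchmark it on dummy data.
reloaded = tf.keras.models.load_model(EXPORT_PATH)
dummy = np.random.rand(1, 224, 224, 3).astype(np.float32)

for _ in range(5):                      # 5 warm-up runs, as described in the session
    reloaded.predict(dummy, verbose=0)

iters = 50                              # 50 timed iterations
start = time.time()
for _ in range(iters):
    reloaded.predict(dummy, verbose=0)
elapsed = time.time() - start
print(f"Throughput: {iters / elapsed:.2f} inferences/sec")
```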
With that, let's just run this first benchmark script on the SavedModel. It should start running. What we are doing is: we just trained our model, and now we are running inference on dummy data. It takes a few seconds to run. You should see something like this on your screen for every iteration, along with the time each iteration took. At the end, you should see a throughput number like this. So what we did is train the model and export it for inference. Next, we are going to optimize the SavedModel for even faster inference.
For that, we are providing you a script called the freeze-optimize script. What it's going to do is convert variables into constants so they can be cached, and then remove training nodes and any dead parts of the graph that control never actually reaches. A bunch of those optimizations happen in this script. Intel also has a tool called Intel Neural Compressor, which can reduce the precision even further, from float down to int8, making it even more optimized and even faster. Since we don't have that much time today to demonstrate the INC tool, we are just gonna freeze...
We are just gonna use the freeze-optimize tool on the SavedModel, and then we are going to run inference on the new optimized model to see how fast it gets. For that, let's first optimize our SavedModel. I'm running this tool, providing as input the location where I saved my SavedModel, and then dumping the result to a new folder called my optimized model. Let's run this cell. It's going to take a few seconds; it will go through the SavedModel and freeze and optimize it.
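The lab ships its own freeze-optimize script, which isn't reproduced in the talk. As a rough, hedged illustration, the core variables-to-constants step can be done with TensorFlow's own converter roughly like this (the lab script also applies extra graph cleanups on top of this, and the paths here are assumptions):

```python
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2,
)

# Load the exported SavedModel and grab its serving function.
saved = tf.saved_model.load("models/my_saved_model")   # assumed path
concrete_fn = saved.signatures["serving_default"]

# Fold the variables into constants so the weights become part of a frozen graph.
frozen_fn = convert_variables_to_constants_v2(concrete_fn)
graph_def = frozen_fn.graph.as_graph_def()

# Write the frozen graph out; the lab's script additionally strips training
# nodes and dead subgraphs before saving its optimized model.
tf.io.write_graph(graph_def, "models/my_optimized_model",
                  "frozen_graph.pb", as_text=False)
```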
We'll see what the actual benefit of using this tool is when running inference. It takes a few seconds to run. I hope everyone is following along with the notebook. Now it has finished running the tool, and we have an optimized model in this folder. By the way, if you look at the folder listing under models, you should see both the SavedModel and the optimized model. Now that we have the optimized model, we are going to run the same benchmark script as before.
That's the one that ran on the dummy input, and now we'll see what performance difference we get. For that, let's run this cell. You can see it's already running faster than before. Now that it has finished, let's plot the data and see the gain. For that, let's run the next cell, which just plots the data. You see, we exported the trained model as a SavedModel and then optimized it using the freeze-optimize tool, and there is about a 46% jump in performance. That's the benefit you get from using the freeze-optimize script.
After the session, feel free to explore that script as well. One more thing: we have the trained model, and we optimized it using the freeze-optimize tool. We have seen the bfloat16 case for training; now we are going to see the same bfloat16 case for inference as well. I'm gonna switch to my other notebook where I'm running the bfloat16 part and do the same thing there. To save time, I already have the models placed in my directory: the SavedModel and the optimized model.
I already used the freeze-optimize script and saved the result. What I'm gonna do first is run the float32 model, which is the SavedModel that is not optimized and is in float32 format. Let me run that. Just to make sure everyone is still at this point in their notebook; I hope there are no issues following along, and you'll still have notebook access after this. It's running the SavedModel, the one exported from the trained model.
After the SavedModel, we will run the optimized model, the one produced by the freeze-optimize tool, and then we'll run the bfloat16 model. I'll also show you the change we made for bfloat16, which is just a single line. Okay, now that I've run the SavedModel in float32 format, I'm gonna run the optimized model. That runs a bit faster than the float model. Okay, running it for 50 iterations, just like we did on our cloud instances. Okay, it's slightly better than float32. For bfloat16, you can see I have specified the precision as bfloat16 for our benchmark script.
To show what that particular argument does, I'm just gonna open the TF benchmark script here. In the benchmark script, if my argument is bfloat16, all I'm doing is changing the config with that one-line change, which is exactly the same as what we did for training. For either training or inference, all you need is that one-line change to run in bfloat16. Now that I have added it, let's see how fast the inference is. Running the same script as before, just with bfloat16. You should see it's already running much faster. Okay. Before, we plotted the SavedModel against the optimized model.
Now we are going to plot the SavedModel, the optimized SavedModel, and the bfloat16 model. Let's see how that graph looks. You see, it's much, much higher than your optimized float32 model. All that with just one line of code change, nothing else. That's the benefit of bfloat16. Okay, now I'm gonna jump back to our original notebook. I hope it's now clear that with bfloat16 optimizations you can see much better performance in your training as well as your inference. With the next-generation Xeon, you should be able to use bfloat16 with just that one-line change. Okay, back to our notebook.
Now that we have trained our model, saved it for inference, and also optimized the inference model, we are going to start TensorFlow Serving locally on our cloud instances so that we can deploy our trained model. Let's just run the next couple of cells for that. Okay. We have this cell just in case you don't have TensorFlow Serving; you should already have it on your cloud instances, so you can skip this cell for now. When you try it on your own, you can use this command to install TensorFlow Serving.
Now we are going to start TensorFlow Serving on our local port 8501. Let's execute the next couple of cells to make sure we have a server started on our local port. Okay. For the inference, we are just gonna prepare our data. The next couple of cells do that, so why don't we go ahead and run them? Then we are going to plot a random image from our inference data, just to make sure we have the images of our flowers. You can see we already have a dandelion image here. Okay.
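The request we are about to send follows TensorFlow Serving's standard REST predict API. A hedged sketch of what the next cells do (the model name, image preparation, and variable names here are assumptions; only the port and the general flow come from the talk):

```python
import json
import numpy as np
import requests

# Three preprocessed flower images, shaped (3, 224, 224, 3) and scaled to [0, 1];
# random stand-ins here for the real test images in the notebook.
images = np.random.rand(3, 224, 224, 3).astype(np.float32)

payload = json.dumps({"instances": images.tolist()})

# TensorFlow Serving REST endpoint started earlier on port 8501;
# "flowers_model" is an assumed model name.
url = "http://localhost:8501/v1/models/flowers_model:predict"
response = requests.post(url, data=payload,
                         headers={"content-type": "application/json"})

predictions = np.array(response.json()["predictions"])
class_names = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]
for probs in predictions:
    print(class_names[int(np.argmax(probs))])
```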
Now we are going to send the server that we started a request with three of the images, and we'll see what our serving instance classifies them as. First we'll prepare a JSON object to make three inference requests. Let's run that cell quickly. Now, with the REST APIs, we are going to ask our server to run inference. The next cell does that, so let's run it. You should see something like this. You see there is one wrong classification here: the model thought it was a tulip, but it was actually a daisy. There are two right classifications.
Why is that? Because we did not run enough epochs. You can run more epochs for your training. I'm just quickly gonna go back to the training part. Sorry about that. Going to the training part to show the accuracy again. Right here: we trained it only to about 82% accuracy, which is fine given the time constraint. When you try it on your own, you can train it for more epochs, maybe 30 or 40, and then you'll definitely have much better accuracy and see all of the images classified correctly. If everyone is still with me at this part, that's great.
I think we are done with our hands-on session, so I'm just gonna switch back to the presentation and stop sharing my screen. Okay. We just saw that with bfloat16 optimizations we get much, much better performance, and there are more optimizations beyond that. With Intel, you can use Intel Neural Compressor to convert the model into int8 format, and with the next-generation Xeon, you will see even better performance. Just to give a glimpse: for one of the models that we ran on Intel fourth-generation Xeon, you can see up to a 30x gain.
That's going from FP32 to oneDNN-enabled, then to int8, to using the new AMX instructions. And it's not just Linux: with oneDNN, you also get benefits on Windows systems. This is just one data point we wanna show you, comparing stock TensorFlow with TensorFlow using oneDNN. You can see here that you get almost 3.2x with the FP32 data type and about 4.7x with int8 performance. That was it for the hands-on session. The key takeaway here is that with oneDNN enabled, you get faster training and inference.
For bfloat16, all it took was one line of code change, and you saw better performance. With the upcoming fourth-generation Intel Xeon processor, you should see even better performance. You'll have access to this notebook after the session, so you can experiment on your own and explore more things with it. There are more repositories for education: we have TensorFlow Hub, and there's the Intel Model Zoo, where we already have optimized models and scripts for them, so you can go explore that. We also have support for Intel GPUs; that is supported in the Intel Extension for TensorFlow.
If you get time, feel free to explore it and try things out. With that, I think that sums up our entire session. I hope everyone was able to follow along. I guess we can take questions from here. Okay, I'm just gonna scroll to the first question that is not answered. "Is it possible to have a TF build that runs on many ISAs?" One good thing about oneDNN is that, since it's a cross-platform deep learning library, you can use it on any Intel CPU.
If you're using different CPUs like Arm or AMD, I think they will have different TensorFlow packages, and you can also build from source on your systems. I hope that answers your question. The next question: "For your script that you were running on the Sapphire Rapids instance, was it the HBM version or just a simple non-HBM version?" The one I was using is a non-HBM version. Okay. I don't see other questions; just making sure I have answered the ones which are not addressed. Okay, I see one more question: "Do you need different code for HBM versus non-HBM?" No, we don't.
You don't need any change in the script to run it on non-HBM Sapphire Rapids. There's one more question: "Can one TF build work on AVX2 and AVX-512 and AMX?" Yes. If you're using an Intel platform with AVX2, AVX-512, or AMX, one TensorFlow build supports them all, but you have to make sure the Linux kernel version is compatible. There was a comment: "I would think that HBM would be abstracted as any memory would, roughly." Right, there is no change at the framework or library level; the TensorFlow builds will be the same for both of those SKUs. The fourth generation is available on an AWS instance for preview, I believe.
Feel free to try it and see some wonderful things with bfloat16. Okay, I don't see any more questions. Thank you, everyone, for joining. I hope that was helpful. The presentation will be shared with you, and you'll have a link to the notebook. Feel free to try it and explore more. Thank you. Susan, back to you.
All righty. Thank you so much, Sachin. I really enjoyed learning about fine-tuning computer vision models and especially seeing the performance improvements that you showed. If you have any more questions for Sachin, you can head over to Discord to ask your questions and continue the conversation. Now, I am going to go ahead and wrap us up for the day. Get out your phones, because there are a lot of QR codes. Before I go into this, there was a question about how long the notebook will be available. Sachin and Gauri, if you all could take a look and answer that question, that would be super, and I'll continue on with this. Go ahead and bring your phones out, because you're going to see a number of QR codes.
My favorite one is Notices and Disclaimers, and I'm sure that's your favorite one too. If you wanna learn more about exactly what we mean when we talk about performance, how it's measured, the parameters, et cetera, please go ahead and scan the QR code for Notices and Disclaimers. Where are my slides at? Hang on a second. Okay. I wanna mention that we did have some folks from the oneAPI Innovator program present today, and we'll have some presenting tomorrow as well. You can think about participating in the Intel Software Innovator program, with special access to content, hardware and, of course, our Intel engineering expertise. Now, why might you wanna join the Innovator program?
I think I gave a few highlights, but what's really important, since this is the oneAPI event, is that you can have access to Intel expertise and the content and support that you need to be successful in the Innovator program, as well as early code testing and research and spotlight speakerships; this Dev Summit is one example, along with other Dev Summits. You can write articles, promote your demos, projects, et cetera, and you can connect with the oneAPI community. Again, today is one example of where we've had a lot of success connecting researchers, developers, and students; we have talked with all of these types of folks today, and you'll hear more from researchers, developers, and students tomorrow as well.
To join this program, again, just go ahead, take your phone, scan the QR code, and you'll get more information that way. We also have a oneAPI Meetup community, and I know that a few of you have already joined today. If you're interested, go ahead, scan the code, and you can read more about it and consider whether you want to join or not. We also have the DevMesh project and we have the DevCloud. If you want to learn more about either DevMesh or DevCloud, feel free to scan those QR codes as well. All righty. What's next is go ahead and try things out on DevCloud. There's more information about how much space you get, how long you can have it, et cetera. Just realize there are no software downloads for that, no configuration steps, no installations.
You can get started very quickly on the Intel Developer Cloud. There's also the ability to take step-by-step trainings that are in the Developer Cloud and continue to enhance your understanding of some of the things that you saw earlier today, especially the DPC++, and I know we talked about that earlier in terms of the difference between SYCL and DPC++ as well. We really want to hear your feedback. This is the second time that we have done the oneAPI Dev Summit for AI, and we took your feedback from the last one that was held in July. We'd love your feedback today. This is day one. I want to remind you that this is day one of the summit. We had so many content submissions that we added a second date. Tomorrow is going to be focused on High-Performance Computing.
You will still hear some things about AI tomorrow as well. If you enjoy today, please come back and join us tomorrow for the focus on high performance computing or what we call HPC. All right. Well, you know, we like social earlier today we talked about posting on Twitter, there's the opportunity to win some cool prizes. There might be some country specific parameters or limitations on whether or not you can enter or not. Go ahead and post on Twitter using the hashtag #oneAPIDevSummit. You can continue to do that today as well as tomorrow as well. Happy hour will be starting soon. I think we will need to open up happy hour early, we can go ahead and do that. What we will need to do is click the link.
We'll need to close out of this and then click the link that's in the agenda to get to happy hour. After the event, either tonight or tomorrow, go ahead and join us on Discord. This QR code will take you right to the group you need on Discord. We look forward to your conversations there with us as well. I will see you tomorrow for day 2, which is focused on high performance computing. Again, close this window and head over to the agenda to join happy hour. Thanks a lot. I hope you have a wonderful rest of your day. Bye.
Yes. Yes. Welcome, folks. Everyone who's joining. We're just waiting a few minutes for everyone to join up, and we'll get started in just a little bit.
Yay. More people turning on their camera.
All right.
Hi, Abhishek. Hi, Andrew.
Hello.
How are you guys doing?
Good. How are you?
Good, good. Did anyone wanna talk about their thoughts on the event before we got started? Any opinions, questions, suggestions for next time? If it was anyone's first time attending the Dev Summit as well, we'd love to hear your thoughts.
Oh, it was my first time. It was very informative. I come from a very strictly high-performance computing background, where I spent a lot of time interacting with compilers. I didn't do a lot with Jupyter Notebooks most of the time, so seeing coding done in that context was somewhat new to me. Interesting all the same. I'm just used to seeing screens full of C code that you pump into a compiler, and then a computer runs it really fast. I'm really here for the statistics on HBM performance, the Sapphire Rapids types of things, and the new fast stuff that's coming out that I'm increasingly curious about. I guess tomorrow is really my day to shine.
The AI stuff is interesting as well.
Nice. Yeah, I'm glad to hear that you enjoyed our event. If you ever wanna look back at our presentations, we do record them and post them online, so it'll be in your email in about a week. They should all be posted by the 19th. If you ever need to look back, we also have our slides uploaded as well from all the presentations. Let me send you the link just in case. It'll be the same place where the agenda was.
Okay.
I'll send that into the chat, and you can just keep an eye out. In about a week, it'll be uploaded. For those of you who are just joining, go ahead and head to Discord. We'll be using that as our main chatting platform just so we can continue the conversation on there. In a moment, Russ will get started with some games for happy hour.
Oh, so there. All right. Yeah, before I jump into Jeopardy, let's see. Are we ready to go, Gabri? What do you think? Gabrielle?
Yeah, we can get started. Let's get started.
Okay, awesome. Yeah, let me just share. Let's see here. Yeah, appreciate everyone joining. Looks like we have a pretty good group of folks that have joined from the event. Really appreciate that. Let's see here. Just wanted to remind people that we do have a get social part of this event. We actually have some $20 gift cards that we're giving away. It's really a random giveaway, but anyone who tweets on X using this #oneAPIDevSummit hashtag. Just give your impressions, thoughts about the event.
We started the period from 9:00 A.M. Central Time this morning, and it goes through tomorrow at 3:30 P.M. Central Time or 1:30 P.M. Pacific Time, or I guess 4:30 P.M. Eastern Time. Then we'll be choosing winners randomly throughout the event, and within 2 weeks you'll be notified if you're a winner. There's a pretty high chance of you being a winner if you do tweet. I think we have 100 gift cards that are available. Here is just a sampling of some of the tweets we have so far. Really good to see some excitement and people talking about some of the events, some of the sessions, things like that.
All you have to do is just tweet with the hashtag #oneAPIDevSummit, and that will make you eligible to win. Now, if you tweet once or if you tweet 100 times, your chances of winning are the same. We ask that, you know, please, if you wanna tweet multiple times, that's great. Just know that, for the purposes of the event or the purposes of the contest, it'll count as one entry. Also there are some rules. You have to be 18 or over, and I think there's maybe 10 or 11 countries that are included as far as eligible countries, as far as you have to be a resident of one of these eligible countries to win.
The other thing I would encourage people to do: if you go to X, it's great that a lot of folks have posted and used the oneAPI hashtag, but we don't really know how to contact you. The only way we'll be able to send you a direct message is if you are a follower of Intel Software. That's the other thing: this is the Intel Software page here. Just click on Follow. I'm gonna unfollow and then follow again to show you. Just click on Follow; that way we can send you a direct message.
If you follow the Intel Software account, we'll be able to send you a direct message, and that will only be used to fulfill the prize. Of course, we keep all of our privacy and other policies. We definitely have an opt-in policy, and we wanna protect your data and your privacy. If you follow that, we'll be able to contact you. Any questions? I'm really excited to see a lot of tweets and everything around the Dev Summit today. With that, what we'll do is go into the Jeopardy game and have some fun. Let me jump into that real quick.
I'm gonna go to Discord now, I'm going to put a URL into this event. Here. What you'll do is you'll just click on this URL here. I'm putting it into Discord right now. It's the Jeopardy, the Training Arcade. Go to Discord and click on that link, and that will actually get you into the game. Yeah, thank you, Gabrielle, for posting that on the Teams.
Yeah, no worries. Just for those who joined late, they might not have seen the link.
Yes. Perfect. I'm gonna share my screen, but actually what I'm gonna be doing is we're gonna be playing this game. The nice part about this is you just click on this, you enter your initials or the username, and then you can start playing the game. This is a fun game just to get to know each other, but also, this is around oneAPI, some tech terms, some different trivia. We think you'll have a good time playing this game. I'm gonna choose from the audience, people who wanna choose. I don't know if anyone's played Jeopardy in the past, but what you do is just click on the Discord link, and then you'll get to jump in and play this game.
What we'll do is choose some of the first categories here. I think I have three players so far. The Intel folks can play as well. We're not gonna be winning prizes with this game, but if you do end up with the high score, you can brag about it on X using the #oneAPIDevSummit hashtag, and you will likely be chosen for that $20 gift card. Let's jump into this. Let's see. We have quite a few people here. Maybe, Gabrielle, I'm gonna start out with you. Which category would you like to start with, and what dollar amount? Are you in the game yet?
Jeopardy.
I'm sorry, Gabrielle Feldman, I think you're on mute.
There we go. I wasn't sure which one you were asking. We can do tech terms for 200.
Tech terms for 200. Awesome. The year of the earliest known use of the term API, application programming interface, according to the dictionary. What year was API first used? The nice part about this is you just select the answer you think is right on your screen. We'll have some scoreboards here. All right. The year this term was first coined was 1968. All right. Let's see. Sri, maybe you could pick a category and a dollar amount.
Oh, okay. Hold on. I've actually had to get... Let me put my initials in.
No worries.
I was reading Discord and, okay. Let's do tech terms for 400.
400. Okay. FPGA stands for what?
Oh, gosh. I should know this, but I don't.
All right. Well, see.
You have a chance all the time. Okay. It's floating. It's field programmable gate array.
You think? Okay. Well, push the button. Let's just see everyone. Yeah, three people that haven't responded quite yet. Is it flashpoint gas argon, elicitive paleonomic glyco accelerometer?
I thought it would maybe be the palenomic glyco accelerometer. My dad. Oh, God. Look at that.
It erased the right answer.
The funny part was I was thinking floating point at one time. Some time ago, it was floating point gate array.
Just a quick tutorial on this. To answer the question, you're in that browser; just click on the button and it'll record your answer. I think it tells you whether you're right or wrong.
No, it's correct.
Let's see. I might just pick on some folks here. Maybe Andrew Downs. Would that be okay if you picked one here?
Sure. I'm not on Discord at the moment due to a very convoluted scenario with my work laptop.
Oh, sorry. Let's see. Maybe we can give you a link directly to the game so you can play.
Yeah.
Here, I'm gonna link directly to the game, and then you don't have to get on Discord to get to the game.
Okay. Okay.
Let me put this in the chat right here, just in the Teams chat. If you click there.
Let's take a gander. Let's see.
You'll be able to get in.
Oh, look at that.
What I'll do is I'll pick on someone else for the moment, okay? Let you get in.
Okay.
Do you wanna pick a dollar amount and a category?
Yes. Let's do oneAPI 600.
600. Okay. oneAPI simplifies development and deployment of what? Pick out your best guess and just click it. It looks like we have nine players total now. Let's see, four have yet to respond, but we're going down. We're down to the last 15 seconds, so respond if you're playing. Click on one of these buttons. All right.
I really wanna say 6502 and 8085.
Yes, exactly.
I was always a Motorola fan until I joined.
Yes. All right, Andrew, can I pick on you now?
Yes.
Which one would you like?
I just got in. I will do. Oh, boy. Let's see. I'll start easy. I'll go oneAPI for 200, if that's still available.
Perfect. Yep. If it's there, it's available. oneAPI addresses this with ease of use while eliminating the need to maintain what?
Ooh. Ooh. I'm gonna go with this.
Mm.
Although I think some of these other options are technically also accurate.
I do-
Probably extensive reports. I don't know. We don't do many of those anymore.
But-
I think, legacy code written in JavaScript and or Perl is very compelling.
Exactly.
Very compelling reason.
All right. The best answer, I think we got that. Okay, I'm gonna show the leaderboard here. Look at your screen. We have GNF at 1,400.
Ooh.
Why am I... What? Why am I... Oh, okay.
All right. Let's keep going. I'm trying to dismiss the leaderboard. Okay. If anyone else wants to choose the next one, either put it in the chat or just yell it out.
Well, apparently nobody is. Are people scared of compute architectures? I kinda am, so I'll do compute architectures for $200. Kinda ease my way into it. Don't ask me-
That's a good strategy.
Don't ask me about x86. Oh, I'm supposed to click it, right?
Yep, click that.
Okay.
Okay. Let's see. Whoa.
What?
Oh.
Are you strong with scalar?
Oh, okay. All right.
Oh, I forgot. I forgot.
Okay. CPU.
I was wrong anyway.
All right. Okay, who wants to choose next? I might pick on Emma Mi. Emma, what would you like to choose?
Looks like we need some more points, so I'll go with 600 for compute architectures.
Okay, great.
Let's go big, everyone.
Yes. This one is vector.
Good.
So we know the ten.
If you're strong with scalar, what is strong with vector?
That's another question for scalar, vector, matrix.
Oh, here's where I lose it.
There might be some.
You changed my mind. You changed my mind.
Some varsity in this one. I don't know. Let's see.
I might have negative points after this question.
I'm headed-
That's okay.
I wanted to change my answer.
Me too.
Sorry.
No, no tasties vaxies.
Yes. All right. Okay. Well, any others wanna wanna choose one? I might go on the list. Maybe Vlad. I don't know if you're on. If you want to either tell us which one you'd like to choose.
Yeah, sure. The computer architecture one.
Awesome. Okay.
I'm close right now because of those questions. Now there's another one. Good luck, everyone.
Yes.
Peace.
The good news is everyone's done with this category, so.
Oh, no, I got it wrong again. I should have...
All right.
I'd like to say I was right, but I...
All right, thank you, Vlad.
I'm last again.
See you, bud.
Oh, I didn't click it fast enough.
Yeah.
I had it right. Aw.
I'm like, I really should be clicking.
You got to click fast.
Oh, no.
I gotta... Yeah. Gotta click fast.
I should have purchased more Intel Xe processors to click faster with.
All right, let's see. Anyone else wanna choose? I'm looking at Ricky Rodriguez. Let's see.
Tech terms, 600. $600 question, tech terms.
Tech terms or oneAPI? Which one?
Tech terms. Tech terms.
He said tech terms.
Awesome. Okay, let's do tech terms. Okay, this is double jeopardy.
Oh, great.
This is where you catch up if you're behind.
Okay.
Oh, what happened?
This is what you do.
Oh, I wrapped up.
You can wager everything you've won; if you wager it all and you win, you double your score.
I'm going for negative.
You either get rich or more negative.
Well, if you're negative-
I'm gonna lose money.
I don't think it'll penalize you if you're negative.
Oh, okay, good then.
Has everyone made their wager?
Oh, that's proper math. Zero times zero.
Yeah, I think everyone's made their wager, so let's do it.
Wager made.
Um.
Wager made.
The type of machine learning where structured data sets with input labels are used to train and develop an algorithm. Let's jump into that. Structured data sets with input labels. Is it unsupervised learning, hypervised learning, transfer learning, or supervised learning?
What does hypervised mean? I don't know.
Yeah, Intel's had the hypervisor, right?
Hypervisor, but not even hypervisor, I don't understand that.
Does Intel have a hypervisor?
Well-
Oh, my God.
I think they used to.
I didn't even have enough time to select.
Sorry. Sorry. Okay.
More -1.
Yes. I won it all back.
Look at that. Okay.
I went from 0.
Yes, indeed.
All right.
Yes, indeed. That's good times.
Okay.
I have redeemed myself or sorta, kinda, sorta.
Yeah. We've got one more at this level of Jeopardy, so let's jump into that, and then I'm gonna show the leaderboard again. One of these libraries is not associated with oneAPI, and you have to guess which one is not a oneAPI-related library.
I think I'm getting some score back for this one.
Good.
I think I have, like, negative 1,000 or something.
A lot of people are getting this one right. There's only two more people that need to respond, and it is time to show the correct answer.
Yay.
Good job, everyone. Okay, I'm gonna show the leaderboard, then we'll go into double jeopardy.
Whoo.
Double Jeopardy, where everything's worth twice as much. Okay. All right. Vlad is in the lead. Everyone else is doing pretty respectably. Emma started late, so we're gonna be okay with her scores. That's great.
I'm a bad decision. I love knowledge.
All right, let's see. Let's go to Gabrielle Feldman, if you wanna choose one.
Let's do words 400.
Okay. The year that CPU was first used to refer to a central processing unit in a computer, according to the Merriam-Webster dictionary. No cheating and googling this.
Can I use ChatGPT? I can't use AI?
Yeah, exactly. Can you use AI? All right. That was 1962.
All the good stuff happened in 1962.
Yes.
Pretty much all of computer science came about.
Yeah.
Well, it was done in the 1960s.
Awesome. Let's see, I'm gonna choose Fabien Aichour. Maybe you could choose the next one if you'd like.
Yeah. Intel Trivia.
What dollar amount?
For $800.
$800.
Oh.
Awesome.
Ah.
Okay. This is again, everyone gets to bet.
Oh, no.
Everyone bet. This is where you can catch up, guys.
Oof.
I think...
All right. Okay, looks like everyone has wagered. Let's go to this question. In 1969, Intel opened its first non-U.S. sales office in this city.
Oh, good golly. I am no...
Oops. Yeah, some of these might be hard, so you know.
Could be, could be.
I wagered all of my money, so it's been a pleasure playing with all of you. This isn't real money. We won't send anyone a bill, so. Okay, the right-
Argh.
was Geneva, Switzerland.
I was sure it was-
Look at that. Andrew Downs got it. Awesome.
Yes. My vague historical knowledge finally paid off.
I had to put that one in there. I actually lived in Geneva, working for Intel a while ago, so that was a trivia question that interested me for some reason. All right. Well, let's see, Andrew, since you got that one right, let's have you choose the next one.
Okay, let's take a look here. I will do SYCL for $400.
Right. Numbering system for SYCL specifications.
SYCL.
No worries, I pronounce it wrong all the time myself. Yeah, let's see. Choose the right numbering system for SYCL specifications.
Oh, Lord, I have no idea. Oh, everyone's doing theirs now. Let's see.
Hmm.
I like in reverse order of issue. That's a fun one.
I went with the top right. The last time a SYCL specification was released was in 2020.
Dang.
Oh, I should have gone with the top right. That makes more sense. Well, shoot.
Could be wrong.
Oh.
Oh, what?
It was top left.
If only there was.
All right.
Wah, wah.
Argh.
Hey, I think my...
We lose your video? Uh-oh.
I think maybe we got some...
Uh-oh. Russ has been bit by the same Teams bug. He gets... There's something about his... We have a daily thing where his Teams blows up.
Time for a break.
Something out Nima. Oh.
Oh, hey, Russ.
There we go. He's returned.
Object.
Looks like Andrew has a pet.
Here we go. We have a special guest until Russell gets back. My camera decides to focus.
I did not let my cats in. Next time-
Oh, that's a cute cat. Here he is. Oh, my camera doesn't like this.
Mine likes to sit in front of the monitor and attack the mice, like the mouse pointer. I say, "No." Oh, hello kitty.
All right. Does everyone see my... That's awesome. Look at that cat. Can everyone hear me again now?
Yes. Yes.
Everyone see the screen?
I don't want to alarm you, but there's a volcano behind you that seems to be-
Oh, okay. Yeah. All right. Everyone see the Jeopardy game now? Is it still up?
Oh, shoot. Yeah.
Okay.
Great. Oh.
Oh, wow. That's just amazing. I cannot believe-
Okay, I think for the next question.
Cloud of a negative zone.
Oh, no. Well, for the next question I might ask Massimiliano Pozzetto. I don't know if Massimiliano is on and wants to choose the next one. If not, shoot, I'll have Gabrielle choose if she would like.
Sure. Which Gabrielle? Me Gabrielle?
Either one.
Okay. Let's see. Words, 1,200.
All right, 1,200.
1,200. Cyclonic.
Okay, this is one you have to know James Reinders for. Cyclonic Owls is what?
Huh.
Famously friendly to dolphins and horses. Let's see. Mascot of the University of Bristol, an anagram of IWOCL or SYCLcon, or a twisted banded.
Yeah.
I don't know if anyone's been to IWOCL or SYCLcon, but that's an event really geared for people who like to do parallel programming and have been working with SYCL for a while. That happens in May of every year, I think. April or May. All right, what's next? Sri, what should we do next, buddy?
Let's do SYCL for 800.
Awesome. Okay. We only have four more questions left, and then we'll be done with the Jeopardy and can see who won. The definition of SYCL queues: what is a SYCL queue? Does a SYCL queue maintain order in the universe? Is it the mechanism through which host code submits work to a device for future execution? Is it how synchronous requests are made of FPGAs? Or is it device code that can be sent to a host or device but will not execute?
That's a really useful thing to have.
Yes. Very useful. All right, everyone responded.
Makes debugging so much easier.
Awesome. Okay, we just have one more person who needs to respond, and then I'm gonna show the correct answer. It's the mechanism through which host code submits work-
I wanted to pick that one.
to a device for future execution.
Yeah.
I didn't.
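For anyone who wants to see what that answer means in practice, here is a minimal sketch of a SYCL queue submitting work to a device, written against the SYCL 2020 API as used in oneAPI DPC++. The vector size and the doubling kernel are purely illustrative and not from the quiz.

    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      // A queue ties host code to one device; work submitted to it
      // runs on that device at some point in the future.
      sycl::queue q;

      std::vector<int> data(16, 1);
      {
        // A buffer wraps the host memory so the runtime can move it to the device.
        sycl::buffer<int> buf{data.data(), sycl::range<1>{data.size()}};

        // submit() enqueues a command group; the parallel_for body is the device code.
        q.submit([&](sycl::handler& h) {
          sycl::accessor acc{buf, h, sycl::read_write};
          h.parallel_for(sycl::range<1>{data.size()},
                         [=](sycl::id<1> i) { acc[i] *= 2; });
        });
      } // Destroying the buffer waits for the work and copies the results back.

      std::cout << "data[0] = " << data[0] << "\n"; // prints 2
      return 0;
    }

The point behind that quiz answer is that submit() returns right away; the kernel runs whenever the device gets to it, which is the "future execution" part.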
Look at that. All right. Let's see. Fabien, would you like to choose the next one?
Intel Trivia.
Intel Trivia?
400 or 1,200?
$400.
Yeah.
$400. Awesome.
I think-
Okay, what was the name of Intel's first product?
Oh, no.
I apologize. This particular category is all about Intel, so sorry about that. Was it the 4004? The Schottky bipolar random-access memory? The 1101 metal-oxide static RAM? Or the 1103 DRAM? This one might be hard.
It is hard.
All right.
Ah.
This was one year after its founding.
I knew this was
This was the Schottky Bipolar Random Access Memory. Pretty interesting. Pretty interesting product.
What is this Schottky Bipolar Random Access Memory? I mean, it sounds like the most, like, you know, unstable, well not unstable, but bipolar? Don't get... Like, negative, positive. I mean-
'Cause it has two poles. This is 1969. You know, I think in 1971, Intel actually made a digital watch at one point that was $400, which would be something like $4,000 in today's money.
You're not still going, are you?
Crazy stuff. All right. Let's see, I'll pick on the Gabrielle who did not choose last time.
Let's do words for 800.
All right. Just a moment.
Nobody wants to pick Intel trivia now.
Yeah, exactly. All right, here's the daily double, guys.
Oh, sure. Why not?
Oh, yeah.
Spend it all.
It's all or nothing.
I'm good. That way I end with a negative when I hit the Intel trivia item.
Here, just a moment. Yeah, take a look at that. Hello? Sorry, my daughter was coming home from school and just got locked out. Let's see.
I, my cats are outside crying.
Well, yeah, we got this fancy digital lock that opens it digitally, now the key doesn't work. The only way you can open it is digitally.
Oh, man.
So...
Don't use NFC.
SYCLcon is?
A roundabout Ponzi scheme?
This is my favorite one so far. A roundabout Ponzi scheme. Let me tell you, we've got an investment for you. Okay. That was a gathering of great people to discuss heterogeneous programming.
Last but not least. Gosh, what should we pick? Maybe Intel trivia?
Yeah. Decisions, decisions.
Decisions.
Okay.
Decisions, 1 million revisions.
Does everyone see my screen now? This leader received the National Medal of Science from the U.S. President, Jimmy Carter. This was a long time ago. Which one of these
Oh, my God!
received that?
This is hard. Between two of them, but...
I don't know if anyone got it right, but it might be because I'm not connected very well. Let's see.
Really? It was Noyce?
Oh, no.
I was sure it was Gordon Moore. I thought it was either Moore or Grove.
Yeah, it was Noyce. Okay.
Who is Noyce?
All right. Well, Vlad is in the lead.
Well done.
There's still a chance. We have one more question. We have Final Jeopardy.
I don't know.
I don't think many of us can close the gap with Vlad here.
We have GNF and SR. Emma's actually up there now. Look at that. Makko. I'm gonna dismiss the leaderboard, and we're gonna play the Final Jeopardy.
Yikes.
You get to wager with this final question.
Why not? Do it all. Go for broke. I don't know anything anyway. I'm at the stage of "what would Vlad do?" Now I'm at 2,000.
All right, are we all done wagering?
Yes.
Okay.
Indeed.
The definition of the oneAPI acronym. This might be pretty easy or it might be hard. Is this one ring to rule them all? That would be sort of cool. One application programming interface, one academic performance index, or one application platform interface. Let's see. All right. Okay.
Oh, yeah.
I'm down.
Good job.
It was. It was a great job.
Let's... It looks like Vlad-
Wait, my score didn't update. Oh, there it goes.
Oh, it's up. Oh, GNF.
GNF wins.
Look at that. Is that Gabrielle Feldman who won? Awesome.
Yep.
God.
I went from 3 to 8.
Nice.
Great.
Holy moly. Good job. That's a great score.
Yeah, that's great. Great job pulling it out. Is it okay if I post these scores on Discord? We won't share any personal information or anything; we'll just post the leaderboard with your names on it, if that's okay with everyone, Andrew and Vlad and everyone. What I will do is post a link to this game so you can play it on your own if you want. Just a reminder that you can post on social media as well. Here are some sample posts.
Just post using the #oneAPIDevSummit hashtag, and you'll be entered to win a $20 gift card. I thoroughly encourage anyone, especially some of the winners here, to brag about your Jeopardy score. I really appreciate everyone.
Aw.
Yeah, great scores for everyone. Yeah, please do this, and then just keep in touch with us using this Discord here. We'll be posting news as time goes by. Really appreciate everyone's time today. Hope you had some fun. We're gonna have some new games; tomorrow's day of the Dev Summit will have a similar game like this. Hope everyone has a great afternoon. Thanks for joining us. Let's see, Sri or Gabrielle Amaranto, anything else we wanna say?
I have nothing at this time. I'm looking forward to seeing all of you tomorrow for day two on the HPC portion. Also, there is a great hands-on session, the last one, tomorrow that I think all of you would really appreciate. Looking forward to seeing all of you there for that.
Awesome. All right. Well, hey, we appreciate it, everyone. Have a good one.
Take care.
Thanks, everyone.
Thanks, everyone.
Bye.
Bye.
Bye. Take care.
Bye.
Bye.
Yeah, take our survey. That's the other thing.
Oh, yeah. Yes, please take our survey.
We love to hear your feedback.
Yes. Yeah, please take our survey, guys.
Also, if you like me more than you like Susan, that's an important survey question. No, I'm kidding, guys.