To really deliver on the promise of AI, we have a very exciting program today. We will start with a few introductory remarks and discuss the work we're doing on scaling AI capabilities, but also provide you with concrete examples of the work being done between InstaDeep and BioNTech. Some of this work is being presented for the very first time, so we're quite excited about that. But before anything, I'm very excited to welcome Ugur Sahin, the CEO of BioNTech, to present. Ugur?
Okay. Can you hear me? Yes. Karim, thanks for the introduction, and welcome, everyone. So let's get started. What we want to do today is not to give you an advertisement for AI; we really want to accomplish two things. First, we want to make clear why we are doing what we are doing. And second, why we really need these AI capabilities to be able to do what we want to do. So let's start with the goal we want to accomplish. BioNTech was founded in 2008 with the goal of changing the way cancer patients are treated. And the motivation for that is a scientific, biological one.
You all know that the whole industry spends billions and billions of dollars every year on cancer treatment. And the problem is, of course, that while we are making progress, a real cure is still the exception for cancer patients. The core reason, the root cause, is that every patient has a different cancer. This is depicted in a simple slide here, showing how cancer develops. It is based on normal cells gaining mutations, and these are random mutations. Healthy cells acquire mutations, there are three billion locations in the genome where these mutations can happen, and then, sequentially, the cancer cells acquire more and more mutations. So we have two fundamental challenges in cancer.
One is that every patient has a different type of cancer. This is made even more complicated by our transplantation antigens: recognition of cancer by T cells means that every patient also has a different immune system. So that is one source of variation. The second is that every tumor cell within a patient is different. In the scientific community, we call the first inter-individual variability and the second intra-tumor heterogeneity. This has been known for twenty years, but it is not really addressed by today's treatments, and if you want to address it, it becomes clear that this is an extremely complicated situation. That means every cancer treatment, for every patient, is a new battle, driven by understanding the complexity of the disease.
That means one question is: even though every cancer cell is different, how can we develop treatments that address as many of these tumor cells as possible? And the second question is: cancer is evolving, cancer is adaptable. Can we somehow predict how this evolution will continue, so that we understand not only whether a treatment works, but how the tumor is going to react to it? And given that cancer affects more than twenty million people worldwide every year, this becomes a high-level computational question. That is what we want to accomplish: to create solutions that address this. Our pharmaceutical strategy for that is combination therapies. Combination therapies, very simply. One element is immune modulators.
We know that the immune system is able to recognize cancer, and we are developing next-generation immunomodulators; we have powerful molecules that can activate and modulate the immune system. We are not going to talk about these today. The second element is targeted treatments, molecules like antibody-drug conjugates, where the drug, the chemotherapy, is delivered to the tumor, so that not only the target-positive tumor cells die, but also the target-negative tumor cells. But there is one additional element, and this is our mRNA vaccines. Our mRNA vaccines give us a real opportunity to customize, to tailor the treatment according to the genetic profile of the patient. And this lets us ask the question: we have 20 different clones; can we develop a vaccine that addresses these 20 different clones?
So this is, fast-forwarding, the way we see cancer treatment in the future. It will start at the top, with clinical samples from the patient and clinical omics, that is, analyzing the genetic changes in tumor cells. The data generated here come to about four terabytes for each patient, and that really requires AI and machine learning algorithms to reach the right conclusions. Of course, once we understand the situation, we need to make decisions, and we need treatments. This is our drug toolbox. We have our mRNA therapeutics, including the mRNA vaccines, but also mRNA-encoded antibodies, cytokines, and so on. We have engineered cell therapies, meaning engineering the patient's own cells to attack the cancer. We have antibodies and antibody conjugates against new targets.
We have T cell receptors, the receptors that T cells in the body use to recognize mutations or tumor antigens. And we have small molecules, immune modulators. Many of these molecules are invariant, meaning they can be applied to many, many different patients. This is the way cancer treatment works today: a certain antibody is applied to the 20% of patients who have its target. But some of our treatments are absolutely personalized; these are our personalized vaccines. So we are combining off-the-shelf drugs with personalized treatments, and that is how we treat our patients. And in the future, we will not only treat patients but also monitor them and see how they react. That means we have to combine a few skill sets.
Deep genomics and immunology expertise to analyze the patient data; that is what we are already doing today, but AI gives us the opportunity to do it at a much deeper and faster scale. Individualized treatment platforms to address the inter-individual variability; we spent 30 years developing our mRNA pharmaceuticals. We are not going to talk about this today, but it really means we need this cutting-edge technology to address different types of targets in patients. Then, AI and digitally integrated drug discovery and development. When we started BioNTech, we had a background in machine learning. We did computational medicine, but we did it the way biologists would, and we wanted to do it the way AI researchers would.
That's the reason we partnered with InstaDeep: to have not just some AI capabilities, but cutting-edge AI capabilities developed for the specific purposes we need to address. And of course, in-house manufacturing is another skill set. So that was my intro, and I would now like to call up Ryan Richardson.
Thank you, everybody. Thank you, Ugur, and welcome, everybody. Thank you for coming. I'm gonna be very brief, but I just want to give a little bit of context to how these two companies came together, BioNTech and InstaDeep. Some of you may know BioNTech, some of you may know InstaDeep, but how did our paths cross? So just a little bit of a historical perspective on that. So, as Ugur mentioned, BioNTech was founded in 2008, and just a couple of years later, in 2011, we introduced our first computationally designed mRNA cancer vaccine. And just a couple of years after that, we took our first mRNA personalized cancer vaccine into the clinic in 2014. Incidentally, the same year that InstaDeep was founded.
In 2017, we transitioned our personalized cancer vaccine platform to a fully in silico process, meaning that we use algorithms to select the neoantigens, removing human intervention on a patient-by-patient basis. And so far, since that introduction, we have used AI to select thousands of neoantigens across hundreds of patients that have been treated with our vaccine. Our paths crossed in 2019, when we started project work with InstaDeep, and at that point it really was project by project, but it quickly escalated from there. And in 2020, we formed a joint AI lab, where we underwrote a sort of long, multi-year commitment in the bio AI field to establish infrastructure, dedicated personnel, and a joint vision.
That then quickly escalated, and already in 2022, we, alongside Google and other technology investors, invested in InstaDeep's Series B. That was soon followed by a broadening of the AI work across BioNTech platforms, and it culminated in the 2023 acquisition of InstaDeep by BioNTech. Today, we operate InstaDeep as a wholly owned subsidiary based here in London. So, two companies, one mission. BioNTech: over 6,000 employees, headquartered in Mainz, Germany, with a mission, as Ugur mentioned, to harness the immune system to fight cancer and other serious diseases. InstaDeep: over 370 employees based here in London, with a mission to productize disruptive AI innovation.
I think what's important here is these two companies, these two forces, have really joined together now under the rubric of one common mission, which is to build a leading AI-first, personalized immunotherapy platform and to leverage the breakthroughs that we obtain in the process across the full value chain. For that value chain deep dive, I'm gonna turn it over to Karim.
Thanks, Ryan.
So, thank you so much for the introduction. Today we're gonna be, as I mentioned earlier, showcasing the capabilities we have, but also showing you concrete examples of how we are applying them to BioNTech's immunotherapy pipeline: from labeling of medical samples, RNA and DNA sequencing, and proteomics, to identifying targets, protein design, and so on, and also lab operations. So we're gonna dive straight into it, and the first part of the presentation is about AI capabilities. As you know, to deliver world-class AI, you need lots of capabilities. You need compute and world-class innovation. You also need platforms that allow people to use these powerful tools easily.
And so we've done a lot to develop those capabilities at InstaDeep and BioNTech, and this starts with compute. So today, I'm very happy to introduce you to our new supercomputing cluster, which is coming online in Paris, France, and which we call Kyber.
At InstaDeep and BioNTech, we're heavy users of computational power to accelerate innovation, from creating algorithms to launching products in the real world. However, compute resources are becoming increasingly limited.
Our mission is to drive innovation at the highest level and to have real impact. If you want to drive sustainable progress, you also need hardware expertise. The compute resources available to engineers are critical to our productivity.
InstaDeep made the decision to invest in their own infrastructure.
Today, we're taking this new step with our new supercomputing cluster.
We've been able to design the cluster that we need to continue to drive innovation forward without being worried about the availability of cloud compute, for example.
Each Dell NVIDIA HGX node has eight H100 graphics processing units, and each of our specifically designed racks is composed of two HGX H100 nodes. These are connected to 24 CPU nodes with 256 CPU cores each, and one storage node that brings 122 terabytes of fast NVMe storage. This new infrastructure brings 10 times more compute power to our engineers, creating a near-exascale supercomputer. BioNTech and InstaDeep are working together to develop next-generation innovations in the life sciences. This new cluster will power future models and expand our capabilities in artificial intelligence as we aim to become a leader in digital biology.
We're now able to take all the work we have built up over the last several years and scale it. InstaDeep is in this for the long run, and over the course of the next five, six, seven, ten years, we are going to see benefits that really allow us to invest further in the research.
But it's not just about customers and products for commercial use. Our mission is to accelerate the transition to an AI-first world that benefits everyone, that can address large-scale challenges, and there is much more to come.
Cool. And, really, congrats to the team who worked super hard to bring Kyber online, which is the case today. For this supercomputing cluster, we spent a tremendous amount of time designing and optimizing the cluster. Most companies don't do that, but we have expertise in this. And so, to tell you more about this work, and about the engineering and software aspects as well, I'm happy to invite Nasef and Alex to come up here.
Thank you, Karim. Hi, I'm Nasef Labidi, Head of Infrastructure at InstaDeep. Let's dive into the specs of this new computing infrastructure. Kyber is composed of 14 identical racks, bringing together more than 200 NVIDIA H100 GPUs, more than 86,000 CPU cores, and 1.7 petabytes of fast NVMe storage. The design we created is repeatable and expandable. And this new computing infrastructure is near-exascale: it provides half an exaflop of computing power, which puts us in the top 100 computing infrastructures worldwide, and specifically in the top 20 of H100 GPU clusters.
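As a quick back-of-the-envelope check, these cluster-level totals are consistent with the per-rack figures quoted in the video. The per-GPU FLOP rate below is an assumed round number for a low-precision H100, not an official spec:

```python
# Sanity-check the Kyber totals from the stated per-rack configuration.
RACKS = 14
HGX_PER_RACK = 2
GPUS_PER_HGX = 8
CPU_NODES_PER_RACK = 24
CORES_PER_CPU_NODE = 256
NVME_TB_PER_RACK = 122

gpus = RACKS * HGX_PER_RACK * GPUS_PER_HGX                    # 224 ("more than 200")
cpu_cores = RACKS * CPU_NODES_PER_RACK * CORES_PER_CPU_NODE   # 86,016 ("more than 86,000")
storage_pb = RACKS * NVME_TB_PER_RACK / 1000                  # ~1.7 PB

# Assumed figure: roughly 2 petaFLOPS per H100 at low precision (hypothetical).
ASSUMED_PFLOPS_PER_GPU = 2.0
total_exaflops = gpus * ASSUMED_PFLOPS_PER_GPU / 1000         # ~0.45, i.e. about half an exaflop

print(gpus, cpu_cores, round(storage_pb, 2), round(total_exaflops, 2))
```

Under that assumption, the numbers line up with the "half an exaflop" figure.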
So we put our expertise to work, and it's actually our third iteration of creating in-house infrastructure. We designed this cluster as an in-house design that is repeatable and predictable. All the racks you saw in the video are identical, which gives us lots of advantages in terms of maintenance, and in terms of predictability of the cost, the power usage, and the cooling needs in the data centers. This paves the way for a huge cluster, a huge ecosystem, and possibilities for expansion in the future. It's a consistent design that we created internally, validated by NVIDIA as well. It's optimized for large AI workloads, and, as I said, it simplifies maintenance and future expansion.
We built on our expertise in hardware, but also in software. In particular, we built our own internal platform for orchestrating large, massively distributed AI workloads, called AI Core. It allows us to manage our day-to-day business of training machine learning models, from the compute infrastructure to the projects, the users, the security, and the workloads themselves. It's fully tailored to our own usage and to our hardware infrastructure, and it's built on open standards. Bringing together hardware expertise and software development expertise gives us lots of benefits. We built this supercomputing infrastructure for our engineers, so that it's available whenever it's needed.
As you know, it's very hard nowadays to get hold of GPUs, even in the cloud, with all the demand driven by LLMs. So we have this huge amount of power on our premises, and we'll be exploiting it. This design, combining hardware and software, also gives us lots of flexibility. As I said, it's fully tailored to our needs. There is no vendor lock-in; it's an in-house design, and we built it in a way that we can repeat it, expand it at any time, and also provide it to others. It gives us predictable costs and predictable energy consumption.
Whenever we want to expand in the data center, we know exactly where we are going, and it solves lots of capacity-management issues. It's also cost-efficient: we calculated that it saves us 50% compared to equivalent cloud spend. So thanks, everyone, and I'm happy to pass it to Alex.
Thank you, Nasef. I'm Alexandre Leterre, Head of AI Research at InstaDeep. Now that we've seen how we're growing our computing power at InstaDeep, I want to dive a bit more into how we intend to use that cluster, and for what purpose, but also, more importantly, why this is such an exciting moment for InstaDeep and BioNTech. You might have noticed that the recent advances in artificial intelligence have mostly been driven by what we call the scaling laws. The scaling laws are empirical observations stating that the performance of modern AI systems, such as LLMs, large language models, grows as we increase the amount of data they are trained on, the amount of compute, and the size of the model, that is, the number of parameters.
In practice, what this means is that if we scale an existing system, here OpenAI's GPT-3, at an unprecedented level, with more compute, more data, and more parameters, we get more intelligence more or less for free. We don't necessarily need to innovate from an algorithmic point of view; simply by scaling, you get better performance out of an existing system. However, the scaling laws do not come for free, in a way. It's no small feat to leverage them, because the scale at which you operate is tremendous. You will scale very large neural networks that are terabytes in size, so to train them efficiently, you have to split the model into small pieces and shard it, spreading it across a pool of hardware accelerators, a very large cluster.
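The scaling-law idea can be sketched as a simple power law relating loss to parameter count. The constants below roughly follow published fits for language models and are for intuition only, not InstaDeep's numbers:

```python
# Illustrative power law: predicted loss vs. parameter count,
# L(N) = (Nc / N) ** alpha. Constants roughly follow published
# language-model fits and should be read as illustrative.
NC = 8.8e13     # scale constant (illustrative)
ALPHA = 0.076   # parameter-scaling exponent (illustrative)

def loss(n_params: float) -> float:
    """Predicted training loss for a model with n_params parameters."""
    return (NC / n_params) ** ALPHA

# Scaling a GPT-3-sized model (175B parameters) up by 10x lowers the
# predicted loss with no algorithmic change: performance from scale alone.
assert loss(1.75e12) < loss(1.75e11) < loss(1.75e10)
```

The assertion is exactly the "intelligence for free" point: holding the algorithm fixed, more parameters alone moves the predicted loss down the curve.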
Because the model is split across different hardware accelerators, networking issues arise. Terabytes of information are shared across these accelerators, so you have to make sure you're not bottlenecked by this communication, and that you spend your time performing useful operations rather than waiting for information. Also, these neural networks are so large that they perform billions and billions of operations per second, so you have to squeeze the most performance out of each piece of hardware. You have to code at a very low level, close to the compiler. All of this calls for very advanced engineering solutions, and at InstaDeep, we've applied them to two very concrete applications, which I want to dive into now. The first one concerns reinforcement learning.
Reinforcement learning is the part of artificial intelligence that focuses on learning from trial and error to solve an optimization problem. As opposed to conventional machine learning, which uses a predefined dataset, reinforcement learning leverages a simulation engine to turn compute into data. And here, scaling laws apply as well, meaning that a key ingredient of the success of reinforcement learning is how quickly you can simulate your system of interest and generate data to learn from. At InstaDeep, we developed sophisticated solutions for that. Here, you can see an example diagram showing how we split the parts that simulate the system of interest and learn from it very efficiently across hardware accelerators, which also allows us to scale horizontally. When applied to a concrete product we developed at InstaDeep, the results are better, cheaper, and faster.
Better, because as you scale the compute, the performance of the system goes up, by up to 50%. Cheaper, because we really squeeze the most performance out of each piece of hardware, which translates into linear scaling of the system: as we scale the amount of hardware, the amount of data we can generate grows linearly, without being bottlenecked by communication. And finally, it's much faster. Here is a comparison against a baseline legacy system: as we scale the hardware, we can cut the time an experiment takes by more than 200 times, which means an experiment that was taking maybe 16 hours now takes 5 minutes. You can imagine how empowering that is for scientists and engineers to have in-house.
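The two quantitative claims here, the 200x speedup and the linear scaling, can be checked with quick arithmetic (the linear-scaling function is an idealized sketch, assuming no communication bottleneck):

```python
# The quoted speedup: a 200x reduction turns a 16-hour experiment
# into roughly 5 minutes.
baseline_minutes = 16 * 60                        # 960 minutes
speedup = 200
accelerated_minutes = baseline_minutes / speedup  # 4.8, i.e. about 5 minutes

# Idealized linear scaling: doubling the hardware doubles data
# throughput when communication is not the bottleneck.
def throughput(n_devices: int, per_device_rate: float = 1.0) -> float:
    """Data generated per unit time across n_devices accelerators."""
    return n_devices * per_device_rate

assert throughput(64) == 2 * throughput(32)
print(accelerated_minutes)
```

So the "16 hours to 5 minutes" figure is consistent with a 200x speedup (4.8 minutes, rounded up).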
It really accelerates the scientific discovery process. Actually, I have another use case, which concerns generative AI. This is our latest innovation in protein language models. It's going to be the topic of the next session, so I'm not going to dive into it, but I want to show you that the scaling laws are in action here as well. Using our in-house software stack, we can see that by scaling a model from 150 million parameters to 15 billion, the training loss decreases as we scale the size of the network, and that translates into better performance on downstream tasks of interest. The software stack we've been developing in-house achieves hardware utilization on par with the latest Meta Llama 3.1 model.
I hope this gives you a sense of what can be achieved if you combine, let's say, an advanced software stack with a growing computing capacity at InstaDeep. We really think that's gonna accelerate how quickly we can do scientific discovery at BioNTech and InstaDeep. Without further ado, Karim, if you want to join.
Thanks, Alex. And really, I want to emphasize this: everybody's talking about large language models, and the difficulty is actually being able to deploy these kinds of computational workflows at very large scale. As you have seen, we have the hardware capabilities, but we also have the software integration and a wealth of expertise from having trained these models for many years. But it doesn't stop here. At InstaDeep and BioNTech, we're also doing lots of fundamental AI research, and this is an exciting area. So today, we're very happy to introduce you to our latest GenAI model, which is a genuine innovation. This is not a GPT-style autoregressive model. This is not a diffusion-style model.
This is an entirely new concept, Bayesian Flow Networks, developed at InstaDeep, and I'm very happy to introduce Alex and Bora to give us a description. Thank you, guys.
Thank you very much, Karim. My name is Alex Graves. I'm a research scientist at InstaDeep, and I'm gonna be talking about Bayesian Flow Networks. Okay, so a lot of you have probably seen this video before. It was generated by the Sora generative AI model, a caption-to-video model: it takes a text caption, like the one at the bottom of the screen here, and uses it to generate a high-resolution video. And it's really quite remarkable just how much this model has learned about the real world, which you can see from this video. It's learned, more or less, the way people walk, the way light reflects off wet pavements, the way cities look, and so forth. And it's done all of this purely by crunching data.
It's looked at millions and millions of images with associated text descriptions and somehow learned to build a bridge between them. And it's important to remember that this kind of thing was unthinkable just a few years ago. These really are recent breakthroughs. We've all got quite used to seeing them in the news, but it's important to keep in mind just how game-changing these technologies are. Really, the thrust of this talk is to ask: how can we take this technology and apply it to scientific data? The kind of data that we care about here.
And in thinking about scientific data, it's important to demystify these models a little. They can look like magic: you type in a prompt, you get back an image, you get back a poem; it seems like the machine is doing something really incredible. But under the hood, they're basically very large, complicated parametric probability distributions. So in principle, they're not actually that different from the kinds of statistical models scientists have been using to analyze data for many decades now. The question you might ask, then, is: if these are just statistical models, why is it only recently that we've had these big breakthroughs?
The key point is that they are models of not just one or two variables, but millions of variables simultaneously. Basically, they're modeling a joint distribution over very many variables at once. For example, a generative model of faces, like the one that generated this face here, is trained on lots and lots of images, and it's essentially learning the joint distribution of all of the million-plus pixels in each of those images. When you run it as a generative model, in statistical terms you're simply picking a sample from that joint distribution. The reason this is difficult is that all of those variables, all of the pixels in this case, are interrelated. A face is roughly symmetrical.
The color of one eye generally matches the color, and the shape, of the other eye, and so forth. If we imagine translating this to the scientific realm, there as well we have extremely rich, complicated datasets with a very complex underlying system, where all of the variables influence one another in complex ways. And that is the power we want to bring from the world of generative AI into the world of scientific data. Okay. In general, though, we don't just want to pick samples from a joint distribution; we want to control what the model does.
In the same way that you control ChatGPT by typing in a prompt, with a joint model of images and text you can get this kind of control in quite a straightforward way. It basically comes down to what a statistician would call conditional sampling. Essentially, you fix one of the modalities and generate the other. So if you have this joint model of images and text and you fix the images and generate the text, you have an image-captioning model that will tell you, in text, what's in the image. If you do it the other way around, fix the text and generate the image, then you have a generative image model, a caption-to-image model.
Those seem like two different tasks, but underlying both of them is the same shared joint distribution. And that's really the key paradigm we're aiming for here: whatever data we've got, whatever complicated scientific system we're attempting to model, we will just try to think of it as a huge collection of interdependent variables and learn a joint distribution over all of them. And once we've got that joint distribution, pretty much anything we want to do with it later will boil down to conditional sampling.
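The idea can be made concrete with a toy joint distribution over two variables, standing in for "caption" and "image" (the probabilities here are invented for illustration):

```python
import random

# Toy joint distribution over a label and an attribute. Both directions
# of generation fall out of the same table as conditional sampling.
joint = {
    ("cat", "furry"): 0.4,
    ("cat", "sleek"): 0.1,
    ("dog", "furry"): 0.3,
    ("dog", "sleek"): 0.2,
}

def sample_conditional(fixed_index: int, fixed_value: str) -> str:
    """Fix one variable of the joint and sample the other from p(other | fixed)."""
    matching = {k: p for k, p in joint.items() if k[fixed_index] == fixed_value}
    total = sum(matching.values())
    r = random.uniform(0, total)
    acc = 0.0
    for key, p in matching.items():
        acc += p
        if r <= acc:
            return key[1 - fixed_index]
    return key[1 - fixed_index]  # guard against float rounding

# "Captioning": fix the attribute, generate the label ...
label = sample_conditional(1, "furry")
# ... and "generation": fix the label, generate the attribute.
attr = sample_conditional(0, "cat")
assert label in {"cat", "dog"} and attr in {"furry", "sleek"}
```

One table, two tasks: the same joint serves captioning and generation, which is exactly the paradigm described above, just at toy scale.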
We'll say, "Let's take some part of the data and fix it, because we know what that is, and generate another part." That might sound a little abstract right now, but we'll see later in this talk how it can be applied concretely to protein modeling. Okay, so then the question becomes: which model should we choose? There are several contenders out there. We've already heard diffusion models mentioned; diffusion was what was used to generate the video at the start. There are autoregressive models, which are basically the large language models, the GPT family. They all follow the autoregressive principle, which is quite simple.
It really just says: predict the next token, given the previous ones. And then there are masked-prediction-type approaches, often known as BERT-style models, where you hide part of the data and attempt to predict the rest. Now, I won't go into the details of these models, but the key point for our purposes is that they all have pros and cons; they're all good for some things and not so good for others. In particular, there isn't really a model out there that is good for discrete, continuous, and discretized data. We feel this is really a problem in scientific fields, because unlike image generation, where you have quite homogeneous datasets, in science everything is very heterogeneous.
You have text, you have labels, you have charts, you have time series, you have all sorts of measurements, and if you want to follow this joint-modeling paradigm, you need something that can handle all of those at once. So, enter Bayesian Flow Networks. This is a new type of generative model. The original paper was published last year by myself and three colleagues, one of whom, Tim Atkinson, is also a research scientist at InstaDeep. Over the past few months, we've been really pushing the development of this model for biological data, specifically for protein data.
It's maybe somewhat of a technical point, but one of the key advantages of Bayesian Flow Networks, and what differentiates them from diffusion-type models, is that when they generate discrete data, they do it in a continuous way. Conceptually, this is because they operate not on the data itself, but rather on a set of beliefs about the data. Even for discrete data, the beliefs the model holds about the data can be encoded in a continuous way. In practical terms, what that means is that we can use advanced gradient-based sampling techniques to do conditional sampling across all of these modalities: discrete, continuous, and discretized.
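As a loose intuition for this point, here is a toy Bayesian belief update over a discrete variable (this is an illustration of "continuous beliefs about discrete data", not the actual BFN algorithm):

```python
# Even for a discrete variable, the model's *belief* about it is a
# continuous probability vector, updated by Bayes' rule as noisy
# evidence arrives, so it can move through a continuous space.
def bayes_update(belief: list[float], likelihood: list[float]) -> list[float]:
    """One Bayesian update of a categorical belief given a likelihood."""
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

belief = [1 / 3, 1 / 3, 1 / 3]   # uniform prior over three discrete classes
evidence = [0.7, 0.2, 0.1]       # noisy observation favoring class 0
for _ in range(3):
    belief = bayes_update(belief, evidence)

# The belief stays a continuous, normalized vector while concentrating
# on class 0; the underlying variable was discrete all along.
assert max(belief) == belief[0] and abs(sum(belief) - 1.0) < 1e-9
```

Because the belief vector lives in a continuous space, gradient-based techniques can act on it, which is the practical payoff described above.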
Taken together, we think this is a really exciting opportunity: it's a good model for this task of jointly modeling a wide variety of heterogeneous scientific data. So that is a very high-level overview of what Bayesian Flow Networks are. I'm now gonna hand back to Alex, who's gonna talk you through how we can apply them specifically to protein data. Thanks, Alex.
So indeed, I want to share with you how we see, and how we have started, applying Bayesian Flow Networks to scientific data. Here, our vision is very clear: we want to propose a unifying framework across modalities and data types, meaning that at training time, we would collect all the data we can get our hands on and train a joint distribution across these data types and modalities. And at inference time, we would prompt the model: we would query it, conditionally sampling from it to solve a specific task. As an example, I'm showing proteomics on the slide, and some modalities that are relevant in that context.
Obviously, the sequence of a protein: a protein is well defined by its sequence of amino acids, which is simply a sequence of discrete variables. You might also care about the structure, because that's highly correlated with the function of the protein. But you might also have other metadata you care about predicting or conditioning your model on: for example, the GO term annotations, if you care about the function of proteins, or the species or organism the protein comes from. Perhaps, one step further, you have access to experimental data from the lab, and you know, for example, which antigen an antibody binds to. So you have all this information, and we want to unify it and train a single model that can then be conditionally sampled at test time to solve specific tasks.
Talking about these tasks: if we take, for instance, a sequence and want to predict the structure, that's the protein folding problem, which motivated AlphaFold and completely revolutionized the field of artificial intelligence for biology. Perhaps you care, as I mentioned, about the function, and you want to predict the GO terms based on the structure and the sequence. Or, one step further, you know an antigen, and you want to generate the structure and sequence of an antibody that binds tightly to it. So that's our proposal. Each of these tasks could be solved with a task-specific model, and that's generally the approach people take nowadays.
But we want to propose this unifying framework that's gonna train a single model across all those modalities, and then you can prompt it differently at test time, based on your task. But if we just take one task in particular, sequence generation, it's not even clear, as Alex mentioned, which types of approaches are the most suitable. You might want to use autoregressive models if you want to do de novo generation, generating one amino acid at a time in a sequence. However, they are of limited usability for conditional generation. Perhaps you want to generate the CDR loops based on the framework region of an antibody; there, autoregressive models are not amazing. You might want to use a masked prediction model instead, but that's not good at de novo generation.
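To make the conditional-generation point concrete, here is a minimal Python sketch, not the InstaDeep API, and with an invented framework fragment and loop length, of the masking-style conditioning being described: fixed framework positions stay as given, and only the span to be redesigned is left open for the model.

```python
# Illustrative sketch of conditioning a sequence model by fixing some
# positions and masking the span to be redesigned, as in CDR-loop infilling
# given fixed antibody framework regions. All residues here are hypothetical.

FRAMEWORK = "EVQLVESGGGLVQPGGSLRLSCAAS"   # hypothetical framework fragment
CDR_LEN = 8                               # span we want the model to redesign

def build_conditioning_template(prefix: str, span_len: int, suffix: str) -> list:
    """Fixed residues stay as-is; positions to generate are None (masked)."""
    return list(prefix) + [None] * span_len + list(suffix)

template = build_conditioning_template(FRAMEWORK, CDR_LEN, "WGQGTLVTVSS")
masked_positions = [i for i, aa in enumerate(template) if aa is None]
# A masked-prediction or BFN-style model would now sample only these
# positions, keeping the framework fixed -- something a purely left-to-right
# autoregressive model cannot do directly.
```

The key point is that the conditioning set is arbitrary: any subset of positions can be fixed, which is what makes infilling tasks natural for this family of models.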
Finally, you have diffusion models, which are great in the continuous-variable case, but for discrete variables, as Alex mentioned, they require much more engineering and are very difficult to apply. So as a first proof of principle, we wanted to apply Bayesian Flow Networks to this problem of sequence generation for proteins. Cutting straight to the results, here I'm gonna present what we get with our ProtBFN model, which has simply been trained on a very wide variety and large database of proteins. We compare here to alternative approaches: ProtGPT2 is an autoregressive model, EvoDiff is a discrete diffusion model, and we see that the performance is much improved, for example in terms of naturalness, meaning the proteins generated by ProtBFN are much more likely to have been found in nature.
They are also much more diverse: the proteins generated cover nicely the realm of known proteins. And finally, they are highly novel. We don't memorize the dataset; we can generate novel sequences. And that's tremendously important, because as a scientist, when you do antibody design or protein design, you want to make sure you focus your attention on the regions of interest. These AI tools should help you focus on the key regions that might be relevant for you. And relevant means novel and diverse, yet viable and natural. That is the goal for the ProtBFN algorithm. Now, we went one step further. We started folding the sequences and looking at the 3D structures of the generated sequences, because structure is, again, highly correlated with the function of a protein.
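As an aside, the novelty idea can be sketched in a few lines. This is a toy check, not the metric used in the ProtBFN paper, which relies on proper sequence search and alignment tools; it just reports the closest match between a generated sequence and a small reference set.

```python
# Toy novelty check: a generated sequence is "novel" if its best match in a
# reference set is far from 100% identity. The naive position-wise identity
# below is only to convey the idea; real evaluations use alignment tools.

def identity(a: str, b: str) -> float:
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def max_identity_to_set(seq: str, reference: list) -> float:
    return max(identity(seq, r) for r in reference)

reference = ["MKTAYIAK", "MKLVINGE"]   # invented reference sequences
generated = "MKTAYQQK"                 # invented generated sequence
score = max_identity_to_set(generated, reference)  # 0.75 vs first reference
```

A low maximum identity across the whole training set is what "we don't memorize the dataset" would mean under this kind of check.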
What we observed there is really interesting as well. We can see that some of the structures generated have one domain, but the vast majority have two, three, or even four domains, and that tells us that ProtBFN generates proteins that have coherent interactions between regions that are far apart in sequence space but close in 3D structure space. So ProtBFN can generate proteins as a whole, and not just local structures that make sense. We can also look at plenty of different annotations, such as where a protein comes from in the tree of life, and there again we see that we nicely cover the species, the types of structural motifs, and so on. We summarize all these findings in a paper we actually just released for this event, so please have a look.
We also looked at how to do zero-shot conditioning of the CDR regions based on the framework region for antibodies. So please have a look; it's online out there. But we actually haven't stopped there. We carry on. Like I mentioned, our goal is not just to model the sequence of amino acids; we want to model all these modalities and data types, right? We want to move away from the concept where you have one task, for example protein folding, you create a dataset to train a model, and then have scientists access the model and prompt it. We want to put the model first: gather all the data you can find and put your hands on, train a single model, and then, based on the task of interest, prompt the model differently, right?
So that's really the objective. That's what we then started doing: we started from the ProtBFN model and trained an antibody-specific model that is highly multimodal, meaning that we also model a lot of the biophysical properties and so on. And that brings us to the next step, our next-generation model. So as opposed to showing you more baselines and metrics, we thought it would be more useful and interesting for you to see how we can use this model in practice, and that's why I will be calling Bora on stage to show you how to use our next generation of protein language models.
Fantastic. Thank you, Alex. So yeah, I'm Bora, I'm a research scientist here at InstaDeep, and I will be introducing AbBFNX. AbBFNX is our first multimodal model for antibodies, and what we've done here with AbBFNX is really model all of the attributes that we think are important about an antibody. So we're actually modeling 36 different data modalities, or data modes, covering sequence, genetic, and biophysical properties of an antibody. Our aim, to hark back to what Alex was just saying, is really to create a model that is flexible in the tasks it can achieve, so that we can empower scientists with tunable, flexible generation depending on the scenario at hand. And the way we do this is, essentially, we don't think about the antibody as just one big entity.
We actually look under the hood. While we normally have the Fv region composed of the VH and the VL with the CDR loops, there is actually a lot more going on here, and we essentially unpick this. We undo the stitching and model all of this in one big joint distribution. We break the sequence down into 14 different attributes. We cover genetic lineages that contribute to the actual sequence of the antibody, including species, but also specific genes. We cover biophysical attributes that we know correlate with the developability of an antibody as a therapeutic, and we are also interested in the lengths of the different regions, so we cover those as well. What this then allows us to do is really get this flexible approach to modeling the antibody sequence space.
So what we can do with this, for example, is, say we have this antibody here. We're happy with it, but we think that the H3 loop there, the green one, is too long, and we want to shorten and redesign it while keeping the rest of the antibody. We can do that. We can condition the model, fixing all the data modes that correspond to the regions we want to keep. We can also request a specific CDRH3 length, and then ask the model to generate samples, and we get exactly what we hoped for. Everything stays the same, but we have a redesigned, shorter CDRH3 loop. That's obviously a very simple toy example.
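A hedged sketch of what this kind of conditioning looks like in code. The data-mode names and values below are hypothetical, not the AbBFNX interface, but the pattern matches the description: fix some modes, request a loop length, and leave the rest free for the model to generate.

```python
# Hypothetical multimodal record for an antibody: each key is one "data mode".
# Modes set to None are left free; a conditional sampler would draw them from
# the joint distribution given the fixed modes.

antibody = {
    "vh_framework": "EVQLVESGGG",   # hypothetical fragment; kept fixed
    "cdrh1": "GFTFSSYA",            # kept fixed
    "cdrh2": "ISYDGSNK",            # kept fixed
    "cdrh3": None,                  # to be regenerated by the model
    "cdrh3_length": 10,             # requested shorter loop length
    "species": "human",             # kept fixed
}

fixed = {k: v for k, v in antibody.items() if v is not None}
to_generate = [k for k, v in antibody.items() if v is None]
# Conceptually: sample cdrh3 ~ p(cdrh3 | fixed modes), honouring the
# requested length of 10.
```

The same record structure covers the later examples too: for heavy/light pairing, the light-chain modes would be the `None` entries instead.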
If we actually look at a scenario that is more akin to what we might see in the lab, we have here the generation of a library of anti-HIV antibodies. These are antibodies that recapitulate the properties that we find in a common class of anti-HIV broadly neutralizing antibodies. If we were to design this library in the lab, we would have to go through a multi-step design process. But with AbBFNX, we can essentially compress this into one step. What we're interested in specifically is the HV genetic lineage, the light-chain locus, the species, the L3 length, and also developability. So we condition the model on all of this, and we just search through the space within these confines. And what we find is that AbBFNX is really good at effectively searching through this space.
So when we compare the rate at which we find full hits, we are actually more than five thousand times more effective than just looking through natural repertoires. But we also know that these antibodies are still diverse. When we're searching through this space, we've effectively fixed the H1 and H2 loop sequences and structures and the L3 loop length, so we can see that those all look very similar. But we've not conditioned the model at all on the H3 loop. So here we see remarkable diversity in sequence, length, and also structure, and that's precisely what we want when we design a library. You might have a completely unrelated task: heavy- and light-chain pairing.
In this case, we might be starting with a heavy chain that we are really happy with, but we want to find other light chains that are able to pair with it. Essentially, this time the task is different, but we can use the same model just by changing how we query it and how we interact with it. This time, we would condition on the heavy chain sequence. Maybe again, we're interested in the light chain locus and the species, and also the biophysical attributes, so we can condition on this as well. In effect, what we're doing is we're taking the heavy chain, and we're looking for solutions. We're looking for light chains that are able to pair with this heavy chain. What we find again is this diversity.
We find light chains that have different lengths and sequences and shapes, but we know that the model still respects the conditioning information we provided: the heavy-chain sequence always looks the same, and the light chains actually show sequence and length biases that are consistent with that heavy chain. And that is exactly the same model. We've not changed anything about the model; it's just the way we interact with it, because the model has learned this rich distribution of the underlying data. And with that, I will ask Alex back on stage to summarize what we've talked about. Thank you.
Thank you. Thank you, Bora. So yes, just to very briefly summarize: I've introduced this new class of models, Bayesian Flow Networks, and hopefully motivated why we think they're a really good choice for the kind of data we're looking at here. We've talked about some of the early results we've already got with protein modeling and how promising they are. And what you've just heard from Bora really gives you a flavor of just how tunable and steerable these types of models are. There are so many things we can do with them. And of course, we're really excited about where we go from here; there's lots of data we still haven't tapped.
And so with that, I will hand back to Karim.
Thanks so much, Alex, and really congrats on the amazing results. This is fantastic. And yes, Ugur? Yes.
What we are doing here is amazing also because, with our models, we are much closer to how nature works, in the way nature independently optimizes domains, keeping some domains constant and changing others. So I believe that with this type of optimization in our models, we are more closely mimicking what is happening in nature.
Absolutely, Ugur, and what's exciting is that with this model we can condition any variable on any other variable. Obviously, here we've shown it in a few examples with sequence and structure, but this is really, we believe, a profound breakthrough, and we look forward to working on it more and, importantly, deploying it in the lab with our biotech colleagues. But for me, what is also refreshing is that this is not all about scale and LLMs. There is still room for fundamental research, and at InstaDeep and BioNTech, we have 45 AI researchers inventing new algorithms. Algorithms need compute, that's what we've seen, but importantly, we need to deploy and make those breakthroughs, those powerful models, accessible to all our colleagues internally but also externally.
For that, I'm very happy to invite Arnaud and Julia, who are gonna introduce us to the DeepChain platform.
Thank you so much, Karim, for the great introduction. So hello, everyone. I am Julia, and I am a product manager at InstaDeep; I've been leading product development for the DeepChain team. So you've just heard about super advanced innovation. These BFN models are incredibly exciting, as Ugur just mentioned. And what I think is super unique about InstaDeep is that we're able to combine state-of-the-art research with the most advanced engineering to deliver AI tools that can directly integrate into the R&D pipeline that you see here. So I'm gonna run you through a couple of examples of these tools, in particular BFN. You've just heard about BFN; these kinds of models can be used for de novo antibody design, but they can also integrate into the optimization part of the pipeline.
Beyond BFN, we also have tools in the world of genomics. In particular, we have created models called the Nucleotide Transformers, which can be helpful for prediction tasks such as splicing or gene expression, and which today are supporting BioNTech in the identification of new targets. Finally, one super exciting application has been the development of assistants, which can be used as standalone AI tools that support scientists with natural language: you can directly interact with them, and they can run tools in the background. But they can also be integrated directly into the labs, like Karim was mentioning, and we'll see a demo about this later today. So today, we're super excited to be upgrading DeepChain and launching the next generation of models on the platform.
DeepChain, in particular, is a single platform that combines our AI expertise with our life-sciences expertise in order to empower scientists to deliver the next generation of novel therapeutics and other biotechnologies. As a first instance, we are releasing our flagship models, the ones you've just heard about: the state-of-the-art generative models are going to be on DeepChain. These are super exciting models because they enable the generation of sequences that are very natural-like and very structurally coherent, and, as you saw, they can generate sequences under conditions and parameters that the user can decide. So that opens up a lot of opportunities. On the other hand, we are also releasing our Nucleotide Transformers and SegmentNT.
These are our DNA foundation models, and specifically, SegmentNT can be used for predictions at single-nucleotide resolution. That can be extremely valuable, especially for applications such as splicing, where, for example, a single point mutation in a DNA sequence can actually have a lot of impact on a downstream process such as transcription, so it is vital to get that kind of resolution in our predictions. Beyond that, we've also been training our models with context lengths of up to 50 kbp, without performance drop. And finally, even though some of these models have been trained with human genomic sequences only, they can also generalize across different species in a zero-shot manner.
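To illustrate why single-nucleotide resolution matters for splicing, here is a small self-contained sketch, with a toy sequence rather than a real gene, of applying a point mutation before re-scoring both the reference and the variant with a model such as SegmentNT.

```python
# A single-nucleotide change is enough to disrupt splicing, which is why
# per-nucleotide resolution matters. This helper applies a point mutation;
# the sequence and coordinates below are invented for illustration.

def apply_mutation(seq: str, position: int, alt: str) -> str:
    assert seq[position] != alt, "not a mutation"
    return seq[:position] + alt + seq[position + 1:]

ref = "ACGGTAAGTC"                   # toy sequence with a GT donor-like motif
mut = apply_mutation(ref, 4, "A")    # GT -> GA would abolish a real donor site
# One would then score `ref` and `mut` per nucleotide with a model like
# SegmentNT and compare the splice-donor probabilities at each position.
```

The comparison step itself needs the model; the point here is just that the two inputs differ at exactly one position, so any difference in the per-nucleotide predictions is attributable to that mutation.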
Diving a little bit deeper into these NT models: we've consistently shown, both through our own studies that have been peer-reviewed and published in major journals, and through independent studies that have taken our models and benchmarked them, that we consistently outperform other models in the space. But we're not only state-of-the-art; we also have an enormous amount of traction on Hugging Face. In particular, ours are among the most downloaded genomics AI models in the world, and as of this morning, there have been more than seven hundred thousand downloads across model sizes. That opens up a wide range of opportunities, and specifically the opportunity to go out there, speak to researchers, and ask them which applications they're using our models for, but also understand more about the challenges and pain points they're facing.
This is why today, inspired by all this, we are not only releasing our new AI models but also releasing capabilities: capabilities for our users to take these models, build on top of them, and then scale them for their own applications. In particular, our first capability is around an optimized setup. We've heard from Alex earlier today that our models, and models in the LLM space generally, are becoming bigger, both in number of parameters and in the amount of compute that you need not only to access but also to orchestrate. On DeepChain, we're doing that for you. In particular, you can now access our highly optimized workflows with a few simple lines of code.
We have already shown how this kind of setup is delivering value in a specific application for in silico design of regulatory sequences, where we have been able to increase inference speed up to seven times and cut the cost by half. This has been super valuable for the team we've been working with, because previously they had been trying to integrate these kinds of models in production and had been struggling due to running-time requirements; today they'll be able to do that with DeepChain. A second capability we're releasing is the ability for our users to customize our models, so now they'll be able to take one of our models and customize it with their own data for their own application.
We also have an example of this that we've been testing, around the fine-tuning of a model in a splicing use case. In this case, we have seen how, compared to an external implementation, we're able to improve performance by at least one point five times. Finally, one last capability, which Arnaud will be introducing in a second through a demo, has been the development and release of assistants. These assistants are tools that you can interact with through natural language, and they can support you in connecting the multiple tools we're developing here at InstaDeep. So now I think we'll be jumping onto the demo in a second. I don't know if you're able to switch the screen. Amazing. Okay. So allow me to introduce you to the upgraded DeepChain platform. We have here the models page.
We can get access to different models. For example, we've got the DNA models here, like the SegmentNT model we've been discussing. Here we have the specific information about its datasets and different training parameters. We've also got our protein models that have been introduced, like ProtBFN and AbBFNX. And finally, we can jump into the fine-tuning capabilities we've been discussing. In particular, you can start a new run by clicking on this button. You can now select which model you'd like to use for fine-tuning. We're gonna add a name here, AI Day, for this run. We'll select the downstream task, gene expression, and now we'll upload our fine-tuning data. So in particular, for fine-tuning, you need a file that contains your sequences, and then you have your training labels.
In addition, you can also upload your validation data, both your sequences and your labels, and then down here you've got some parameters that you can tune. This is actually the part where scientists can experiment and test different parameters and different types of data. Biologists are really the experts in the data, while we handle the compute and the orchestration on our side, so here is where they can actually start experimenting. So now we're gonna start the fine-tuning run. The data has been uploaded, and now we can go to the runs page and keep track of this run we just started for AI Day.
What we see here are the different parameters, and the plot is being updated live with the results as they are processed. But because fine-tuning usually takes quite a few hours to achieve good results, we're gonna use a run that I actually started earlier this morning and that succeeded, along with a model for AI Day. And here, we see that the plots have been completed, and we're able to use this model now for inference, to show how it works in practice. So we just added the model to our list; it's the one here. We're gonna jump onto the CLI for you to see how we can run these models. So here I'm gonna start by typing DeepChain models to see the list.
We just got the list of DeepChain models. In particular, this is the one here we'll be using today. Then, to run our models, we type DeepChain run, select the model that we just moved to our list, reference the sequences that we want to use and the output file that the predictions will be written to, and ask it to wait so we can see what's happening. What we did is just send some sequences to our systems, and what's happening in the background is that, because we're using a model fine-tuned for gene expression, for each sequence that we've sent through, we're gonna see a few predicted values for gene expression in the different tissues. We just got back the results.
Here's a snippet of those results, where we can see the gene ID associated with the different predicted gene-expression values, and this is where we can see the full results. Now we'll use this file to evaluate the results. We have a function here on DeepChain where you can evaluate: you type DeepChain evaluate, then reference the results that were just computed and a list of labels; sequences_labels is where our labels are. And now our results pop up here: we see the performance improvement of the fine-tuned model we've just showcased, compared to a baseline.
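For intuition, an evaluation like this typically reduces to comparing predicted and measured expression values, for example with a correlation coefficient. The sketch below is a standalone illustration with invented numbers, not the DeepChain implementation.

```python
# Pearson correlation between predicted and measured gene-expression values,
# computed from scratch. The data points are invented for illustration.

import math

def pearson(xs: list, ys: list) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

predicted = [1.2, 0.4, 3.1, 2.0]   # hypothetical model outputs
measured  = [1.0, 0.5, 2.8, 2.2]   # hypothetical lab measurements
r = pearson(predicted, measured)   # close to 1.0 for a good fine-tune
```

Comparing this value between the fine-tuned model and a baseline, on held-out data, is the kind of "performance improvement" a report like the one shown would summarize.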
As you can see, if you really play around with those parameters and your datasets, you're able to achieve improved results with our models and customize the results to your application. Now I'm gonna pass it on to Arnaud, on dark mode. Go ahead.
Thanks so much. Thanks, Julia. So, we're gonna move on to our AI assistant. My name is Arnaud Pretorius, I'm a research scientist at InstaDeep, and I'm delighted to introduce to you today our AI agent called Layla. Layla is deeply integrated into all aspects of the DeepChain platform, including running models, performing analyses, calling internal as well as external tools, and much, much more. To showcase some of Layla's capabilities, we're gonna run through a hypothetical scenario of a user who's new to the DeepChain platform and interested in analyzing DNA. So I'm just gonna clear this here at the side, and we can go to Layla.
Being new to the platform, we might begin by simply asking Layla, "Which models can I use for DNA?" Layla then knows to call the internal API associated with the DeepChain platform and gives us a list of available models a user could use to analyze DNA. We see SegmentNT, a multi-species version of SegmentNT, and a whole host of other fine-tuned models based on the Nucleotide Transformer series built by InstaDeep. Now, being new, this user might not be familiar with these models and might want to know what they can actually do. We can ask Layla, "What is SegmentNT used for?" Again, Layla calls the internal API, but now fetching information specific to SegmentNT, and replies: SegmentNT is used for detailed genomic analysis, offering single-nucleotide-resolution predictions for various genomic elements.
This might sound interesting to us, so we can go one step further and try some analyses. If I just pull up a DNA sequence file here, I can upload it to the platform, and you'll see it shown under the attached files over here. Now, I can tell Layla that I have uploaded a sequence and tag it using the @ symbol. Once this is done, I can then ask Layla, "Can you please segment it for me?" Layla now knows that it can call the custom SegmentNT model with the DNA sequence we've just uploaded as input. It provides us with a segmentation text output, saying that at specific indices within the sequence we can find exons, introns, splice donors, splice acceptors, as well as UTR elements.
But not only this. That gives a scientist a quick overview and a feel for what's going on, but they might want to build a much more custom pipeline in Python, some sort of larger-scale batched workflow. And as Julia showed, we have the CLI for this, which can be really useful. To get started, Layla provides us with a command we can directly copy and paste to run this analysis through the CLI. So I'm just gonna pull this across, and here, if we just quickly look, we have the DNA file that we uploaded, and we can simply paste the command, knowing that this is exactly the DNA file we want to analyze, with the output file specified.
When we run this, the CLI fetches the job, does the analysis by calling the model, and provides us with the output; this takes just a few seconds. Here we can take a quick look at the output file, and what we can see is a whole host of probabilities of certain regulatory elements being in certain places. Even though this is not nice to look at in the terminal, in a Python workflow you can easily use it for further analysis. But if we just want an overview of what's going on, we can now go back to Layla and simply ask her to visualize the results for us: "Can you please plot it for me?"
Layla knows it has a plotting script associated with SegmentNT and can run it to give us an interactive plot of what's going on. On the y-axis here, we see certain regulatory elements, and on the x-axis, the nucleotide positions in the sequence. What we're seeing are the probabilities of certain elements being present at certain positions. We can actually zoom into specific regions and see, for example, that the nucleotide A at position one four nine oh has a probability of 92% of being associated with a tissue-invariant promoter. This can be a nice tool to really improve the productivity of biologists and help them quickly get going with the DeepChain platform. I hope that shows you some of the capabilities that Layla provides, but we'll see much, much more later on.
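The kind of per-position lookup being demonstrated can be sketched as follows. The element names and probabilities here are invented toy values; real SegmentNT output covers many more element types and thousands of positions.

```python
# Toy probability table: one row per genomic element, one column per
# nucleotide position. All numbers are invented for illustration.

probs = {
    "exon":     [0.05, 0.10, 0.80],
    "intron":   [0.90, 0.05, 0.05],
    "promoter": [0.05, 0.85, 0.15],
}

def top_element(position: int) -> tuple:
    """Return the most probable element at a given position, with its score."""
    best = max(probs, key=lambda e: probs[e][position])
    return best, probs[best][position]

element, p = top_element(1)   # -> ("promoter", 0.85)
```

Zooming into a position in the interactive plot is effectively this lookup: find which element's probability track dominates at that nucleotide.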
So just to conclude: we have a whole series of Layla models built on top of Meta's Llama 3.1, and they come in different sizes, including the 70-billion and 405-billion-parameter models, all fine-tuned internally by InstaDeep. And we want to stress that Layla is more than a chatbot. It has expert knowledge of biology, integrated with powerful tools, and the capability to really reason and make decisions, as well as to learn through constant feedback. Thank you very much, and I hand back over to you, Karim.
Thank you, guys. Thank you so much for the live demo. As you can see, those tools are super powerful, because if you as a scientist or biologist spend your time coding those models yourself, that is maybe not the most optimal use of your time. Everything is available on the DeepChain platform, and so today we are actually releasing the upgraded version of DeepChain with our most powerful models, including AbBFNX, and including the functionalities that you've seen with Layla. These are all available on the DeepChain platform, which is accessible both internally within the BioNTech Group and externally, so we are really happy to partner with you. This concludes the first part of the presentation.
We've seen what we've done in building our supercomputing capabilities, the AI innovation that is coming from our research teams, and DeepChain as a platform to make all these innovations available and quickly accessible. We're gonna jump into the second part of the presentation, which is really about looking at concrete use cases, showing you how we work within the BioNTech Group, with InstaDeep and BioNTech colleagues working together to make this progress. That is actually important, because this is really where the rubber meets the road. This is where we drive innovation that ultimately can translate into saving lives. That is the spirit. With that, we're gonna start with histology, and I'm very happy to introduce Youssef to tell us about the work he's doing.
Thanks, Karim. Hi, everyone. I'm Youssef Bendib, and I work as a senior machine learning engineer at InstaDeep. Today, I'm going to walk you through some of the cool stuff we are developing with the histology department at BioNTech. Histology is a core component of the immunotherapy pipeline, and one of the critical tasks that pathologists need to do at this stage is labeling digital slides of tissues. And as we scale up, pathologists are facing a heavy and actually growing workload. To understand this challenge better, let's take a look at a typical histology image. Notice how we can zoom in from the broad overview to the cellular details, where we can see each individual cell.
These images are actually very big and hold a lot of detail: you would need a full tennis court filled with 4K screens to visualize a single image with all its details. Labeling these images manually is very time-consuming and requires a lot of attention at every magnification scale. How do we solve this challenge? The idea here is to harness the power of AI and develop AI tools that will allow pathologists to become much more efficient and faster at labeling. Allow me to introduce the first tool that we developed, which is the AI-assisted tissue annotation tool.
This tool is actually a collaboration between the AI and the pathologist, where both the precision and the speed of the pathologist are enhanced. Let's take a look at how this tool works. First, this is an annotation done manually by a pathologist for just two simple red cells. When you do it manually, it's quite precise, but it's a bit slow. Now, when you use AI for that with our tool, you just draw a box around these cells, and it will automatically segment them in no time. We can take another region, for example the white region here, the background, and just by drawing a box, it will segment it.
Now, if you take a more complex region or area and the model doesn't recognize it, you can quickly click on the areas you want to exclude or include, and the AI will understand that and correct itself automatically. By deploying this tool to our pathologists, we were able to achieve a five-fold increase in the speed of the pathologist, and at the same time, we didn't lose quality; quite the opposite, we actually got better quality, because pathologists were able to refine the annotations at the different levels of magnification. Thank you. So, based on the success of this tool, we developed a second tool that does the segmentation of the whole slide image, and it does it in just one click. It segments the whole slide with all its details.
How did we achieve that? We actually used a state-of-the-art vision foundation model, decomposed the whole slide image into small patches, and transformed the problem from segmentation into classification of patches. Each image is decomposed into millions of patches, around five million patches per slide, and we process them in parallel, hundreds of thousands at a time. That way, we can annotate the full slide very quickly, and when you group the patches back together, you see a segmentation instead of a classification. You can see here how we zoom in on a region and classify the patches, patch by patch. Then, when you zoom out, you actually see how the tool is progressing and segmenting the whole slide in no time. And actually...
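The tiling step described here can be sketched with NumPy. The sizes are toy values (real whole-slide images yield millions of patches), but the reshape trick is a standard way to decompose an image into independent patches that can then be classified in parallel.

```python
# Decompose an image into non-overlapping fixed-size patches. Each patch can
# then be classified independently; reassembling the per-patch labels on the
# original grid yields a coarse segmentation of the whole slide.

import numpy as np

def to_patches(image: np.ndarray, patch: int) -> np.ndarray:
    h, w = image.shape[:2]
    rows, cols = h // patch, w // patch
    image = image[: rows * patch, : cols * patch]   # drop ragged border
    return (
        image.reshape(rows, patch, cols, patch, -1)
        .swapaxes(1, 2)                              # group row/col indices
        .reshape(rows * cols, patch, patch, -1)
    )

slide = np.zeros((512, 768, 3), dtype=np.uint8)      # toy "slide"
patches = to_patches(slide, 256)                     # 2 x 3 grid -> 6 patches
```

Turning segmentation into patch classification is what makes the massive parallelism possible: each patch is an independent unit of work for the model.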
Thank you. Actually, with this tool, we were able to achieve a 100x speedup in annotating the whole slide, compared to the first tool's 5x relative to manual annotation by a pathologist. Yeah, and here my colleagues will show you more cool stuff on the other stages of the pipeline.
Thanks, Youssef.
Thank you.
Thank you so much, and I hope this shows you the power of AI vision tools. This is increasing the speed of data accumulation, which is useful in many ways within BioNTech and beyond. After handling the medical tissues and classifying the different areas, you naturally move on to DNA and RNA sequencing. As hinted before, we've developed very advanced models in genomics, and I'm very happy to have Thomas, Maren, and Marie present the work we're doing together on that.
Great. Thank you so much, Karim. My name is Thomas Pierrot. I'm a research scientist at InstaDeep. Very excited to be here. I'm leading the team behind the SegmentNT and Nucleotide Transformer models that have already been presented, and what we'd like to do right now is deep dive into what's happening under the hood and also explain how we actually leverage these models in the immunotherapy pipeline. As we mentioned, we are pretty excited that these models are getting a lot of traction and are now pretty popular in that sphere. Probably the best way of thinking about these models is to think about ChatGPT, Llama, Gemini, any of these big language models, but instead of training them on English or any other language, we actually train them on DNA.
Here we leverage the exact same technique, called self-supervised learning, which is an amazing technique because it allows you to learn from any type of data without needing any labels. In this case, what we do is collect lots of genomes from lots of different individuals and lots of different species, and train these models almost in a never-ending fashion on these genomes. As Alex mentioned, scale is key. We first scale the data by going through lots of individuals, lots of species, as many genomes as we can, but we also scale the models to a few billion parameters.
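The self-supervised setup just described can be illustrated with a small sketch: mask part of a tokenized DNA sequence and ask the model to recover it from context, so no labels are ever needed. The k-mer tokenization, mask rate, and function name here are illustrative assumptions; the actual Nucleotide Transformer training details may differ.

```python
import random

def mask_dna_tokens(sequence, k=6, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Create one self-supervised training example from a raw DNA sequence.

    Tokenize into non-overlapping k-mers, hide a fraction of them, and return
    (inputs, targets): the model is trained to recover the hidden tokens from
    context alone. (Illustrative sketch, not the exact NT tokenization.)
    """
    rng = random.Random(seed)
    tokens = [sequence[i:i+k] for i in range(0, len(sequence) - k + 1, k)]
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)   # hidden from the model
            targets.append(tok)         # what it must predict
        else:
            inputs.append(tok)
            targets.append(None)        # no loss on visible tokens
    return inputs, targets
```

Because the targets come from the sequence itself, any genome from any species can be turned into training data this way.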
And probably the most impressive result is that even though these models have been trained without seeing any labels about DNA, they actually acquired some genomics knowledge during training. What I'm showing here on this slide is a deep dive into the activations of the layers of one of our biggest models, the 2.5-billion-parameter one, trained on lots of species. We see that the model acquired some basic genomics knowledge during training: in the first layer, it can already tell the difference between coding and non-coding regions.
But what's even more impressive is that as you go through the layers, this representation becomes more granular, and you end up in the final layer with a very granular representation with lots of different elements, where you can capture, for instance, UTRs. That also explains why this model can then be fine-tuned, as was explained, to solve lots of different tasks in genomics with very high precision. But we didn't stop there with the team. We said, "Okay, so far we took inspiration from NLP to build this first generation of models, so why not build the second generation taking inspiration from computer vision?" If you look at computer vision, people are now training very impressive foundation models for segmentation. I'm thinking about the Segment Anything Models from Facebook.
The way they work is they simply take all the images they can find and train the model to segment everything it can find in the image, to understand the different elements. In that case, it can be the cars, the people, the road, and so forth. We do the same on DNA, but instead of working in 2D, we work in 1D: we train our model to segment a DNA sequence and find in the sequence all the genomic elements it can. To do that, we worked hard with the team to gather a very high-quality dataset of millions of annotations over the human genome, but also lots of other species. We looked at many elements; I have a few examples on this slide.
We have, for instance, regulatory elements such as promoters and enhancers, which regulate the expression of genes, but we also look at lots of genetic elements. And we fine-tuned the Nucleotide Transformer models we presented on these annotations to bring them to a new level. Marie Lopez is going to give you more information about the performance of these models.
Thank you, Thomas. I'm Marie Lopez, the geneticist in charge of the applied AI team here at InstaDeep. I think Layla has already done a great job explaining the performance of the SegmentNT model, but I'm going to give a fuller picture here. What we see is that for each nucleotide, the model is able to predict the probability of belonging to each of these genomic element classes. Of course, we have some annotations that relate to critical functional elements, such as protein-coding genes, but the model is also able to predict elements that are usually more difficult to map, such as regulatory elements like enhancers and promoters.
To give you a visual explanation of what this model is doing, you can see here on the top a DNA sequence of 50,000 base pairs, with three different genes encoded in it. What the model is doing, as you can see in the first line of the track, is accurately predicting the protein-coding genes underneath. And if you look at the other tracks, you can see that the model is able to differentiate introns from exons, getting better accuracy at detecting gene architecture, for example detecting splice sites, shown in white, and also detecting enhancers, as you would expect in this DNA sequence.
What is even more impressive is that all of those predictions together represent 700,000 different probabilities, and the model is able to output them in less than a second, which is incredible speed and precision. This is an incredibly powerful tool for genomic annotation and research. Now Maren is going to explain a bit more about how this is used inside the pipeline.
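To make concrete how such per-nucleotide output might be consumed, here is a small sketch that turns an (L, C) probability array (for example, 50,000 positions by 14 classes, roughly the 700,000 probabilities mentioned) into per-class annotation tracks; the function name and threshold are assumptions, not SegmentNT's actual interface.

```python
import numpy as np

def probabilities_to_tracks(probs, class_names, threshold=0.5):
    """Turn per-nucleotide class probabilities into genomic annotation tracks.

    `probs` has shape (L, C): for each of L nucleotides, the probability of
    belonging to each of C genomic-element classes. Returns, per class, the
    half-open (start, end) intervals where probability exceeds `threshold`.
    """
    tracks = {}
    for c, name in enumerate(class_names):
        mask = probs[:, c] > threshold
        # Find the positions where the above-threshold mask switches on or off.
        padded = np.concatenate(([0], mask.astype(np.int8), [0]))
        edges = np.flatnonzero(np.diff(padded))
        tracks[name] = list(zip(edges[::2], edges[1::2]))
    return tracks
```

Thresholding each class independently is what produces the parallel gene, exon, intron, and enhancer tracks shown on the slide.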
Thank you.
Yeah. Hi, I'm Maren Lang, Senior Director of Bioinformatics Research and Development at BioNTech Mainz, and I would like to show you one example application of SegmentNT, which is about alternative splicing. Splicing is the event where the introns, the non-coding parts, are spliced out of the pre-mRNA, and the exons, the coding parts, are joined together. But not all exons are always part of the final mRNA, and therefore the protein; that is the process called alternative splicing, and it is a normal process taking place in every healthy cell.
Here we showed on healthy data whether we can detect those events with SegmentNT. Here we see the splice donor and acceptor sites, which are simply the transitions from exon to intron or intron to exon; these are the parts of the splicing events addressed here. We show that SegmentNT performs much better than SpliceAI, which is the state-of-the-art tool. A related task, simply exon and intron detection, is also addressed by SegmentNT; as you just learned, it can predict many different tasks. SpliceAI performs worse here because it's an indirect task for it, but it can be addressed there as well.
We are much better for splice site detection, but also for exon and intron detection. This was done on healthy cells, but we know that alternative splicing is a very complex process that can be easily disrupted, which is why it is associated with cancer and many other diseases. We wanted to see whether we can also detect these events on cancer data. We checked: we fine-tuned SegmentNT to detect tumor antigen candidates, which represent possible targets for immunotherapy. That is what we did here.
And we can see here (I have no pointer, sorry) that SegmentNT also performed much better than any of the other tools on this task, in every event here. This is great news, because we can use SegmentNT for this task as well. It shows that we can use it for alternative splicing prediction and also combine it with other methods, which will be shown now by Nicolas, Daniel, and Mike, who will present AI-enhanced proteomics.
Thanks, Maren. A really exciting application of SegmentNT. And as Maren said, we are deploying AI end-to-end on the pipeline. We've seen the visual AI modality, learning from pixels. Here, we are learning from nucleotide sequences, but we're also working on proteins and proteomics, bringing multiple modalities together. So we're gonna see an example of mass spectrometry and how we can use AI to identify potential targets, with Daniel, Mike, and Nicolas.
Wonderful. Thank you, Karim. Hi, my name is Daniel Rothenberg, and I have the pleasure of representing the proteomics team from BioNTech. Today, myself, along with my colleagues Mike and Nico, will be talking to you about how we can use AI to supercharge our target discovery efforts. We'll start with some basic immunology first. Intracellular proteins are processed and presented on MHC complexes, and at BioNTech, this is important for a couple of different applications. First are the T-cell-targeting RNA vaccines, where the RNA enters the cell, is translated into a protein, and then that protein can be processed into epitopes presented on the MHC complex. Second, and importantly in the context of oncology, is looking for tumor-associated antigens, or TAAs, which are proteins that are expressed specifically in cancer cells, but not in healthy cells.
Just like all proteins, these proteins are also processed into epitopes presented on MHCs. Now, the field has largely coalesced around the same set of TAAs: think PRAME, the MAGEs, KRAS mutants, and HPV-derived proteins. These are all important, but the problem is that if you focus on just these targets, it limits the breadth of the treatable patient population as well as the number of disease indications. So discovering new targets is of the utmost importance to broaden the range of the population that we can treat. Epitope presentation is so important because MHC-presented epitopes are the immune system's window into the intracellular proteome.
In order to get a therapeutic response, these presented epitopes must be recognized by T cells; those T cells are then activated, and that gives you your therapeutic benefit. How can we validate which epitopes are presented at the cell surface? Being from the proteomics team, I love mass spectrometry. I'm biased, so of course I think it's mass spec. Indeed, mass spec is the current state of the art for detecting, identifying, and quantifying MHC-presented epitopes. However, the challenge is that mass spectrometers don't just spit out peptide sequences. Rather, they give you a mass spectrum, which is a biophysical fingerprint associated with that peptide, but not the peptide sequence.
At BioNTech, we have a massive database of mass spec-validated, MHC-bound epitope peptides from studies we've performed internally, as well as from publicly available external datasets. This database has over 200 million spectra in it, but these spectra need to be decoded into peptides. Using commonly available heuristics, we're able to run them through search algorithms, and that leads to about 1.8 billion peptides in our database. These peptides can further be mapped onto specific genes, and that tells us whether they're useful or not. A lot of these hits are not particularly interesting because they're not tumor-specific, but we also have tumor-specific TAAs in our database, such as PRAME and MAGE, some of the ones I talked about before.
However, you can see that we still have a lot of spectra that have gone unmatched using these basic heuristics. This is where we can turn to AI: to supercharge our search and find new peptides using novel search algorithms, which would find novel targets that expand the population we could possibly treat. For more details on that, I'll turn it over to my colleague, Mike.
Hi, I'm Mike Rooney. I lead the computational biology team at BioNTech's Cambridge, Massachusetts site, and I've been working closely with Daniel for the past few years to get the most out of this dataset and use it as a tool for target discovery. One thing we realized early on is that we need to bring in AI-based methods, and the key application here is to use AI to help us validate whether our peptide identifications are correct or not. There are two key examples of how we do this. The first is in the upper left, where we're looking at the retention time of our peptides. We can get very high-accuracy predictions of these retention times.
So if we see any peptide that deviates from our expectation, we know it's pretty likely a false-positive identification. In a similar spirit, we can look at the way the peptides fragment in the mass spectrometer: each peptide, as it goes in, is hit with high energy and breaks into pieces. We can predict the intensities of these fragments, and if the observed fingerprint matches the predicted fingerprint strongly, we know we likely have a good ID; otherwise, it's probably an incorrect ID. We can do this one by one, but in practice we run it across the entire dataset and use it as a way of getting deeper and more confident identifications. Running this on our dataset, we see up to a 200% increase in the number of peptides we can recover per sample. So how do we use this data? Well, what we're really interested in is which genes are producing more peptides in tumor samples than in normal samples.
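The two validation checks just described, retention-time deviation and fragment-fingerprint similarity, can be sketched roughly like this; the function name, thresholds, and cosine-similarity scoring are illustrative assumptions, not BioNTech's actual pipeline.

```python
import math

def validate_identification(obs_rt, pred_rt, obs_frag, pred_frag,
                            rt_tolerance=2.0, min_similarity=0.8):
    """Flag a peptide-spectrum match as likely true or likely false.

    Check 1: retention time. A large deviation from the AI-predicted
    retention time suggests a false positive.
    Check 2: fragment fingerprint. Compare observed fragment intensities
    against predicted ones, scored here with cosine similarity.
    """
    if abs(obs_rt - pred_rt) > rt_tolerance:
        return False  # eluted far from where the model expected it
    dot = sum(a * b for a, b in zip(obs_frag, pred_frag))
    norm = (math.sqrt(sum(a * a for a in obs_frag))
            * math.sqrt(sum(b * b for b in pred_frag)))
    # A strong fingerprint match means the ID is likely correct.
    return norm > 0 and dot / norm >= min_similarity
```

Run over a whole dataset, this kind of filter is what lets marginal identifications be rescued with confidence rather than discarded.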
We've evaluated this systematically across all 20,000 genes in the genome, and on the plot on the right, you can see the count of how many times each gene is seen in normal samples versus in tumor samples. We have this really interesting population of genes in the upper left, which are seen many, many times in tumor samples and never in normal samples. Of course we want to know what those are. We do see genes like MAGE in that circle, but there are others that are not widely known as tumor-associated antigens. We are currently doing follow-up experiments to try to zero in on these.
We use a different workflow that's lower throughput but much higher sensitivity, so we can be sure about these hits. We don't want there to be any low-level expression in normal tissues, so we can test that directly, and we're getting some hits here that are looking really promising. These can go into vaccines, or they can go into TCR-based therapies. There we also have some really interesting computational developments: ways to discover TCRs de novo using computational approaches, as well as rational, AI-guided optimization of TCRs to make them more sensitive to antigen. That's ongoing; we'll hopefully present on it soon.
But today I'm gonna come back to this target question, and one thing I glossed over is that even with these great AI-based methods, there's still a huge fraction of the spectra that cannot be confidently identified. There are various numbers floating around in the field, but in the best case, we still have 55% that we cannot identify. What could these be? Are they non-coding RNAs? Circular RNAs, endogenous retroviruses, post-translational modifications, unusual splice junctions, like Maren was talking about? There's no consensus in the field. But what there is, is a need for tools to identify these, 'cause these could be good targets. These could be cancer-specific.
Nicolas is gonna take us to the next section, where InstaDeep has created a new tool, InstaNovo, that really zeros in on these spectra in an effort to figure out what they are.
Thanks, Mike. My name is Nicolas Lopez-Carranza, and I lead the BioAI team at InstaDeep. It's a pleasure to be here. As Mike said, between 55% and 75% of the peptides in a mass spectrometry database cannot be identified. The main issue here is how traditional mass spectrometry target-decoy search works. It relies on a target database, here in green, as well as a decoy database, which is derived from the target database by scrambling the target peptides or reversing them. The algorithm then scores all of the peptides and keeps the best matches to call those peptides, using the decoy database as a way to control for false positives. But what happens if we develop an algorithm that does not rely on a database to call those peptides?
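A toy version of the target-decoy search just described might look like this; `score` is a hypothetical spectrum-versus-peptide scoring function, and real tools add calibration and far more sophisticated false-discovery-rate control.

```python
def target_decoy_search(spectra, targets, score):
    """Classic target-decoy search, the database-bound baseline.

    Each observed spectrum is scored against every candidate peptide from the
    target database plus decoys made by reversing the targets; the fraction
    of decoy wins estimates the false-positive rate.
    """
    decoys = [p[::-1] for p in targets]  # reversed target peptides
    candidates = ([(p, "target") for p in targets]
                  + [(p, "decoy") for p in decoys])
    matches = []
    for spectrum in spectra:
        # Keep the best-scoring candidate for this spectrum.
        best = max(candidates, key=lambda c: score(spectrum, c[0]))
        matches.append(best)
    decoy_hits = sum(1 for _, kind in matches if kind == "decoy")
    fdr_estimate = decoy_hits / max(len(matches), 1)
    return matches, fdr_estimate
```

The limitation is visible in the code: a peptide absent from `targets` can never be called, which is exactly what de novo sequencing removes.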
Here, we are talking about de novo peptide sequencing, and you see how a sequence-to-sequence AI model translates an MS2 spectrum into a peptide, as you see in the picture. The great advantage is that we do not need to rely on a database; it's simply a translation model, as if we were translating from English to German. That's why we partnered with DTU, the Technical University of Denmark, to develop InstaNovo, de novo peptide sequencing with deep learning. We trained this model on 33 million peptides from the ProteomeTools database, and we actually developed two models. The one on the top right is the autoregressive decoding of the peptide, where, given the input spectrum, we decode one token at a time, the way ChatGPT works.
The issue with an autoregressive model is that once we make an error at the beginning of the sequence, we cannot recover from it. That's why we also developed InstaNovo+, where we use a diffusion decoder to avoid this issue and improve the performance. Regarding the results, we did manage to increase an immunopeptidomics dataset by 40%, as you see in the picture. We also made this model available for the community to build on top of, and you can find the publication and the code on GitHub. Without further ado, I hand back to Karim to continue with the innovations.
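The autoregressive decoding scheme, and the early-error limitation that motivates the diffusion decoder, can be sketched like this; `next_token_probs` is a hypothetical stand-in for the trained model, not InstaNovo's actual API.

```python
def autoregressive_decode(spectrum, next_token_probs, vocab,
                          max_len=30, eos="<eos>"):
    """Greedy autoregressive de novo peptide decoding.

    One amino-acid token is emitted at a time, each conditioned on the
    spectrum and the tokens decoded so far (the same scheme ChatGPT uses).
    `next_token_probs(spectrum, prefix)` returns a probability per token.
    """
    peptide = []
    for _ in range(max_len):
        probs = next_token_probs(spectrum, peptide)
        token = max(vocab, key=lambda t: probs[t])  # greedy: no going back
        if token == eos:
            break
        peptide.append(token)
    return "".join(peptide)
```

Because each greedy choice is final, a mistake in the first residues propagates through the whole peptide, which is the failure mode a diffusion decoder, refining the full sequence jointly over several steps, is designed to avoid.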
Thank you, Nico. And congratulations on the great work on InstaNovo, which we have open-sourced in partnership with DTU. This is another example of enhancing the quality of the data we have through AI-assisted labeling, which I think is very exciting, but things don't stop there. We also work a lot on protein design, and we're gonna speak a bit about that. I actually want to share the first joint press release we did with BioNTech back in November 2020, when we announced our strategic collaboration, like Ryan Richardson said. Importantly, we mentioned that InstaDeep's DeepChain platform would be deployed on multiple tasks, including working on BioNTech's RiboMab platform.
I'm delighted that after a productive collaboration between BioNTech and InstaDeep, we have results to share, and I'm pleased to welcome Hippolyte, who's gonna tell us more about this.
... Hi, everyone. I am Hippolyte Jacomet. I'm a research engineer and team leader at InstaDeep, and it's certainly satisfying to see that what we announced four years ago has since become a reality. Let me remind you that RiboMab is BioNTech's platform for mRNA-encoded therapeutic antibodies for cancer and infectious diseases. But what problem are we actually trying to solve here? Among therapeutic antibodies, co-expressed and bispecific antibodies hold special interest. However, these require precise pairing of their constituent heavy and light chains. Let me illustrate that with the example of co-expressed antibodies. Say we would like to provide a patient with antibodies A and B simultaneously. Each of their constituent heavy and light chains would be translated from their respective mRNAs into proteins separately, and only later would they assemble to form the antibodies.
Now, if this assembly process runs with antibodies that differ only by their variable domains, in most cases it will result in mispaired constructs, and in the end, only 12.5% of the correct antibodies will be obtained. This is why being able to control the pairing of heavy to heavy and heavy to light chains is critical, and this is what we set out to address, focusing first on heavy-to-light chain pairing. Our approach has been to engineer the interface between the constant heavy 1 (CH1) and constant light (CL) domains of antibodies. We set out to introduce mutations that would yield so-called neo-CH1 and neo-CL domains that one of the antibodies could be equipped with.
Such mutations are called orthogonal mutations because they seek both to enhance the affinity between the neo-CH1 and neo-CL domains and to abrogate the binding of neo-CH1 to wild-type CL and of wild-type CH1 to neo-CL. Now, this protein engineering problem is actually a multi-objective combinatorial optimization problem, where we search a gigantic solution space to find the optimal set of orthogonal mutations. This is the type of problem that InstaDeep has lots of experience with, yet here it came with its own set of challenges. We had to properly estimate the binding energies of the correctly paired and mispaired domain assemblies. We had to estimate the impact of each mutation on the stability of the heavy and light chains. We had to structurally model the mutations and gain a deep understanding of the key interface interactions to help steer our models through the gigantic solution space.
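The multi-objective setup just described might be scored roughly like this; `binding_energy` is a hypothetical estimator (lower energy meaning tighter binding), and the scalarization is illustrative, not the actual DeepChain objective.

```python
def orthogonality_score(mutations, binding_energy):
    """Score one candidate set of interface mutations for chain pairing.

    Reward strong neo-CH1/neo-CL binding while penalizing binding of each
    neo domain to its wild-type partner, mirroring the two objectives of
    orthogonal mutations.
    """
    on_target = binding_energy("neo-CH1", "neo-CL", mutations)
    off_target_1 = binding_energy("neo-CH1", "wt-CL", mutations)
    off_target_2 = binding_energy("wt-CH1", "neo-CL", mutations)
    # Minimize on-target energy (tight binding) and maximize the worst
    # off-target energy (abrogated mispairing); one scalar ranks candidates.
    return -on_target + min(off_target_1, off_target_2)

def search_mutation_sets(candidates, binding_energy, top_k=5):
    """Rank candidate mutation sets over the combinatorial solution space."""
    return sorted(candidates,
                  key=lambda m: orthogonality_score(m, binding_energy),
                  reverse=True)[:top_k]
```

In practice the candidate space is far too large to enumerate, which is why a guided search with learned energy and stability estimators, as described in the talk, is needed.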
All of which we could achieve thanks to our DeepChain platform and to an efficient in silico-in vitro collaboration, working hand in hand with our BioNTech colleagues, who did an amazing job on the in vitro part. And now for the results. We were able to achieve more than 90% correct pairing, matching the best patented designs on the market. We validated that antibodies equipped with our domains retained their full functional activity, and this is how InstaDeep helped BioNTech acquire the technology required to develop next-generation bispecific and co-expressed antibodies. Thank you very much.
Thank you so much, Hippolyte. And congratulations again on these amazing results obtained with the DeepChain platform and the collaboration between BioNTech and InstaDeep. We are now at the last presentation of this AI Day, but last but not least, I'm pleased to have Sven come up and present with me.
Yeah. Thank you, Karim. First, about myself: I'm Sven Coutandin, Director for Global R&D Automation at BioNTech, located in Mainz, and I would like to give a very short introduction to the connection between AI and automation. With automation and AI, we have great potential to really revolutionize the way we work in R&D. We have in-lab runs that, together with in silico runs, can really accelerate scientific discovery in a closed optimization circle. But along this way, we also have different challenges. R&D is constantly changing, so we have to react to these changes with our automation as well; we have to solve the contradiction between automation and flexibility. There is high complexity in the combination of automation and science, and this needs to be solved.
We also need to keep transparency for our scientists, who need to have control over their experiments and their results, and we need to give that transparency back to them. With the assistance of artificial intelligence, we see opportunities to overcome these challenges, to create transparency, and also to change quickly, even with an automation system. This really helps us unlock the full potential of laboratory automation. With AI, we see several capabilities: information discovery from different resources, not only from the machine; fast protocol development and fast changes to the automation protocols that need to be set up on the machine; machine error diagnosis, supporting troubleshooting; and implementing all the cross-team interaction between engineers, scientists, and AI experts.
In total, this also supports the whole change management needed to create transparency and full control of what we are doing in our labs. Handing back to Karim.
Thank you, Sven. So obviously, AI, being highly capable, offers an opportunity to improve automation in the lab. To give you a flavor of that, we're gonna go live into the Tech Lab of BioNTech in Mainz. We have David with us, and we are very excited to show you a first: the DeepChain platform with the Layla agent, but in the lab. The idea here is that we want to inject all the intelligence that AI is capable of into practically useful workflows like Sven described. This is a challenge, and we are actually tackling it. Over to you, David. Please introduce yourself and go ahead with the demo. You have five minutes.
Hi, everybody. Can you hear me okay?
Very good.
Hi, everybody. As Karim said, I'm David, and I lead the RNA Optimization Group here at InstaDeep. Right now, I'm in the Tech Lab, which is BioNTech's center for laboratory research and innovation. I'm very excited to show you how we've fully integrated our Layla AI agent with BioNTech's laboratory machines to provide any help that our lab scientists need, which we call Layla in the Lab. This page shows you an overview of all the machines in the Tech Lab. Layla in the Lab allows scientists to see exactly what's going on around the organization and find out what they need to know immediately, without interrupting their colleagues. Here you can see our AI giving us a live, real-time update of what's happening on all the machines in an easy-to-understand and relevant way. Let me start by showing you how that works.
Our AI system is connected directly to live feeds from each of the lab machines. Let me show you a couple of these. Here is the live feed for one of our machines, the Opentrons machine, and here is the live feed for another one, the Tecan. As you can see, the data is extremely technical. It's also generated constantly, building up to thousands of lines. It's impossible for a non-expert to quickly find the information they're looking for in these kinds of logs. To understand everything that's going on, you would need to look up and understand technical manuals like these, and many, many more.
Layla is able to use contextual information such as these manuals, information from within the company, and data from laboratory information management systems to convert the raw log streams you've just seen into concise, relevant, understandable information, which you can see on screen now. But equally importantly, it can do this at high throughput and in real time. Let's have a look now at how this can help lab staff during their day-to-day work. What we're seeing on the screen at the moment is a summary page for the Tecan machine. At the bottom is a summary of the job as a whole. It presents the most relevant information: for example, we can see which protocol is currently running, who requires this job, and other information about the job.
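The log-summarization step just described can be sketched as a grounded prompt to a language model; `summarize_machine_log` and `llm` are hypothetical placeholders, not Layla's actual interface.

```python
def summarize_machine_log(raw_log_lines, manuals, lims_context, llm):
    """Convert a raw machine log stream into a concise status summary.

    The key idea from the demo: supply contextual material (technical
    manuals, laboratory information management system records) alongside
    the raw log, so the summary is grounded rather than guessed. `llm` is
    any callable mapping a prompt string to generated text.
    """
    prompt = "\n\n".join([
        "You are a lab assistant. Summarize the machine state for a scientist.",
        "Reference manuals:\n" + "\n".join(manuals),
        "LIMS context:\n" + "\n".join(lims_context),
        # Keep only the most recent lines: logs build up to thousands of lines.
        "Raw log (most recent last):\n" + "\n".join(raw_log_lines[-200:]),
    ])
    return llm(prompt)
```

Truncating to the most recent lines per call is one simple way such a system could keep up with constantly generated logs in real time.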
In the middle, we've got the status summary, which is a real-time feed of information in an easy-to-understand format about the job as it's actually executing, and then at the top, we have the next action information. This shows the next manual action the scientist will need to perform on the machine, allowing scientists to quickly and easily plan their day and be prepared to carry out each action efficiently. This summary page presents what we think is the most commonly needed information, but the power of AI is that it can answer any question a scientist has. Layla has a chat interface designed for lab use with speech recognition. Let's have a look at that right now. Layla, what is the current machine, and what's it doing?
The current machine is the lab's Tecan 1080. It is currently experiencing an error while running the protocol ID AL073. The error occurred during the step, "Pick up plate with stained cells from incubation." The machine is unable to grip the Labware 96-well round Costar one at the specified source location, suggesting a mechanical failure or misalignment of the gripper.
With AI, you don't always know exactly what you're going to get back, and in this case, I think Layla gave us slightly more information than we needed. I'm going to jump on and look at how we can actually solve this error that Layla's been telling us about. To help us, you can see that the summary screen has now changed to a view that's specialized for the error analysis. At the top of the page, we can see a concise summary of what the error actually is. The key piece of information is that the gripper didn't find what it was looking for in the right place. Underneath this, we've got the impact assessment. In this case, it's telling us who needs the job and what the potential issues are if it's not solved quickly.
David, can you actually hear me? Yeah. David?
Yes, Karim. Yes.
Absolutely. And here you're showcasing Layla as an agent that can read, but in the interest of time, perhaps you can show us the writing capabilities of Layla, basically the agent acting in the lab. If you can demonstrate that, maybe you can also say thank you to Layla. Yeah.
Okay, absolutely, Karim. So if I go back to the homepage, I can show you a view of all of the different machines that are currently operating in the lab. I'll just zoom out so we have a little bit of a better view of them. And I'd just like to say, Layla, congratulations on an excellent job done today. I think it's time to disco.
Well done. As you can see, Layla, our AI agent, working directly from DeepChain into the lab, is actually controlling the machines. We've been working very, very hard to integrate those capabilities: not only to read from the machines, but to write to them. You saw the robotic arm waving and the lights flashing. This shows you the potential that intelligent AI agents can bring into the lab. This is a very exciting time, and importantly, this leads to real operational time savings and efficiencies, and we're gonna have Michael tell us more about this.
Hi, all. My name is Michael Dahms. I'm Director for Digitalization of Scientific Labs at BioNTech, and since we are running long on time, I'll keep it short and sweet. Just to give you one metric on where this can actually save time in the lab: for common errors, where the lab technicians know what to do because the errors happen every day, there is not much efficiency gain to talk about. But for uncommon errors, where you need to consult the technical manuals, like the ones David showed in his presentation, you can easily use up an hour analyzing the error, and this technology can definitely speed up that error analysis a lot.
Talking about next steps and where we should go with this application: what makes this technology demo unique is the level of abstraction it provides above your normal lab automation procedure, in terms of semantic information. What is the system actually doing? What is it for? Who is it doing it for? It contains a lot of interesting use cases for different kinds of users: the lab operator, the scientist, maybe the lab manager. We should carve out the different capabilities we have here into the best-fit use case for each user. So the next steps are, as I said, to carve out the different user group requirements.
Secondly, we should connect this application to the digitalization backbone we already have running at BioNTech, which is the laboratory information system, the electronic lab notebook, and applications like that. That would enhance the level of detail it can provide about, for example, the actual samples it's processing. And last but not least, we should scale up to different lab devices. Currently it's pretty much focused on liquid handling devices, but there are also mass spectrometry devices and a whole bunch of other analytical devices that would make total sense to include here. Thank you. Back to Karim.
Thank you. Thank you, Michael. As you've seen, we're gonna be working very hard with our BioNTech colleagues in the lab, and hopefully, in subsequent editions of AI Day, we'll be able to show you our progress. But this is a very exciting time, and as you can see, Layla is already in the lab, which is super exciting. And so this takes us to the end of the presentation. We're very happy to have with us Ryan Richardson, and we're gonna move into a Q&A session, a fireside chat. So we're happy to take your questions. Maybe, Ryan, you wanna come on board? Yeah. Thanks. Awesome. So hope you enjoyed this session.
We've covered a lot, as you can see, but this is actually a small snapshot of all the work that's going on between InstaDeep and BioNTech: productizing AI, not only innovating, but deploying with the colleagues in the lab, in the different R&D projects, and in the future also in industrialization and production. So we're very happy to take questions, and thank you, Ryan, for taking the time.
Absolutely. Thank you. So questions?
Thank you for sharing. Really interesting, actually. So, my question is gonna be around RiboMab, and I just wonder, you know, how close are we now to getting therapeutic levels of antibodies delivered as an mRNA? Is the AI or machine learning actually providing features of an antibody, or imparting greater stability, that would make it a much more realistic possibility? And also from the perspective of cost of goods, because obviously that's also one of the other drivers about bringing not RNA but mAbs to the marketplace for therapeutics.
Thank you for the question. Perhaps I can say a few words on AI, and you can
-discuss the different programs. So if you look at where we are from an AI standpoint, I think the best comparison is to look at where we were in natural language processing, basically AI understanding language, in roughly 2020, when GPT-3 was coming out. Those models were starting to become seriously powerful. They are not perfect yet, but that gap is being bridged. So I believe we're gonna see tremendous progress in the coming years, and this is a very exciting time to be working at the intersection of AI and biology. Now, on top of that, your question is on antibodies, and we've actually shown very exciting results for the first time today.
First, on our AFNX antibody generation system, which is currently state-of-the-art, in particular with the ability to condition on different chemical and biochemical properties. That's exciting. That's happening on the variable regions. But on the structure of the antibody itself, and having it mRNA-encoded, this is exactly what we've shared with our results on the RiboMab platform using DeepChain. So I would say the progress is happening now, and the slope at which things are accelerating is definitely gonna increase in the coming years. So yeah, it's like GPT-3 in 2020, and hopefully many exciting things to share in the future. So that's AI in general, and perhaps, Ryan, you wanna share on the specific programs?
Yeah, yeah. It's exciting that we can already see these kinds of results with the application of AI. But in terms of where the therapeutic platforms are, we are actually already in human testing in the clinic with both our RNA-encoded antibodies and our RNA-encoded cytokines. So to your question about whether we get therapeutic levels of translation: the answer is yes. And what we're doing is using a liver-targeting LNP to deliver the RNA and effectively turn the liver into the manufacturing engine of the therapeutic. So yes, I think we're seeing very encouraging results, and there's more work to do, but this could open up a whole new space of therapies across a wide range of potential targets.
Again, RNA-encoded cytokines, multi-specific antibodies, and T-cell engagers we've already taken into human testing.
We have a question here. Yep.
Thank you. It's Sam Fazeli from Bloomberg Intelligence. I'm not quite sure where to begin, but I'll limit it to two questions. Half an exaflop is a pretty significant product, right, in terms of computing power? You clearly have very big ambitions for this. This can't be just for BioNTech to use, in terms of its scale. So what is the ambition? What is the business strategy here going forward? Should analysts be thinking about modeling InstaDeep as a significant revenue line for BioNTech? And then, just to that last point, with regards to your pipeline: clearly, the pipeline currently is populated with a lot of products, some of which you've in-licensed, some of which are internal.
Clearly, the personalized cancer vaccine, or whichever way you wanna call it, individualized neoantigen therapies, they benefit from AI today, I'm pretty sure. When would we see data to start getting excited by from work that's emanated from this collaboration?
Yeah, let me start, and Karim, you can add. So, you know, when we did the InstaDeep acquisition, we identified a couple of key value drivers, and one of them was, of course, the cost efficiencies associated with effectively internalizing what was, at the time, our largest AI technology and service provider. More than a service provider, really: a technology and solution provider. We also saw advantages, or synergies, in terms of integration and capability building. But I think the fundamental value driver was really in our core business of developing novel vaccines and therapeutics, right?
That's our business model at BioNTech, and we saw the potential application of InstaDeep's technology solutions across our platforms, across the whole, let's say, value chain, especially in drug discovery, as the primary reason for this acquisition, and that's what we think is so powerful about the combination, and also unique in the industry, right? Because fundamentally, BioNTech is about discovering and developing new therapeutics and vaccines. As you've seen, maybe you got a glimpse today, we think there are multiple applications of InstaDeep's technologies and capabilities in that discovery arena, across platforms. You know, obviously that's a long-term value creation engine.
We don't currently split out InstaDeep in terms of financial reporting, but I think it is worth noting that while we haven't really talked about it today much, InstaDeep also has a third-party business outside of the BioNTech relationship, and maybe, Karim, you wanna say a few words about that?
Absolutely, Ryan. And the idea is that we want InstaDeep to be a leader in AI, and the way to do that is actually to develop core innovation in AI. That obviously applies to the biological pipeline and strategic objectives we have at BioNTech, but also that can have applications outside biology. So we do both, and we found that this is actually a viable way to operate because the same technology can apply to multiple use cases, and we've seen multiple proofs of that. For example, if you look at, like, designing a new protein, this is a combinatorially explosive kind of problem. We are leaders in industrial optimization within biology and outside biology, and these add up together.
So really, the objective is to continue to be a leading power in the world of AI, continue to invest, continue to derive new innovation, and from that point of view, our new supercomputing cluster is a must: it allows us more flexibility in the types of workflows we can run. So we will use it for biological compute-intensive applications, of which there are actually many, but also for other types of AI research work we do, which ultimately benefits the progress we make as a BioNTech Group company.
So that's a bit of the spirit of what we do, and I think also, in the idea of sustaining a dream team of AI talent, we found that this AI identity of InstaDeep is the right approach, and as you can see, we're constantly pushing the frontiers of innovation. We publish many papers as well. Last year, we had more than 25 research papers in AI published at major conferences, and Nature journals are also somewhere we publish. So it's an exciting time, and we are continuing to invest in those capabilities, but importantly, making sure all the work we do actually benefits our BioNTech colleagues in the lab and in the industrial processes.
And in a sense, this is kind of like having the full vertical from pure AI, compute, algorithmic innovation, to in-the-lab testing, in vivo, clinical trials, and the others. That's the goal.
Hey, Karim and Ryan. A question from the webcast here from Yaron Werber at Cowen. So two-part question: How does DeepChain take external input from academics and other groups to fine-tune the model? And then secondly, you know, do you foresee other biotech companies using this open source model, and how can we at BioNTech keep some things proprietary to maintain a competitive edge? Thank you.
Sure, absolutely. So we're very excited to have DeepChain available both for internal stakeholders within BioNTech Group and outside BioNTech for external parties. The culture of collaboration and scientific work with universities is very strongly established at InstaDeep and BioNTech. As you've seen, for example, our InstaNovo protocol that was presented today was developed with DTU, the Technical University of Denmark. So that's the kind of spirit, and there is room for joint innovation there using public datasets. Things become proprietary when you're using specific proprietary data, for example, within BioNTech. That's the spirit with which we operate: we develop core technologies on open source, potentially in partnerships.
If you wanna go a level beyond, which is to use your specific data, with all the privacy guarantees that come with that, this is something that we offer, and we have extensive experience doing that. So that's a little bit of the spirit behind the DeepChain platform.
Yeah, and to address the second part of the question in terms of other biotech companies and how we balance that. You know, it's interesting, when we started working with InstaDeep in 2019, we were really the one of the first major biotech clients, right? Karim and team had built a-
Actually.
An extensive client list in other domains, in the tech industry and industrial sectors. But we were the first biotech, or one of the first, and I think it's interesting. Initially, we were the main biotech customer, and I think the question there hits on a very interesting point. It's obvious that InstaDeep has amassed a wealth of expertise very quickly, and we've seen with our own eyes at BioNTech the pace of learning of the models over the last couple of years. It's just been extraordinary.
In fact, it's one of the reasons that we decided to make the acquisition because we reasoned that as these models, effectively the brain of new drug discovery and design, as they progressed and got smarter and smarter, that needed to be a core competency inside BioNTech, right? Because we're not, unlike, you know, many big pharma companies, we're not just acquisition entities and commercialization engines alone. We're, you know, our fundamental business is in drug discovery and in innovation, and so we felt that needed to be in-house. So I think that leaves the door open to also expanding an ex-BioNTech biotech business in the future for InstaDeep, and I think that's something that we certainly could do.
And I think, you know, if we do decide to pursue that, I think it won't be a problem to balance our internal domains with external because of the breadth of application that some of the technologies you've seen on display here today can-
Yeah, and from a technological standpoint, this is absolutely feasible. Think about it like this: you have large-scale language models, or innovations like BFN, that are trained on public data at scale, but then you can fine-tune them, and Julia earlier showed an example of fine-tuning, with the specific data of a particular company. And so this creates a model that is more advanced for a very specific use case. So as a platform, we can definitely engage with multiple stakeholders while making sure everybody's data stays completely protected and useful only to the party who brought the data in. And so this is a model that works very well.
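The fine-tuning pattern described here, a model pretrained on public data and then refined on a customer's private data without that data flowing back into the shared base, can be sketched roughly as follows. This is a minimal illustrative sketch, not InstaDeep's or any provider's actual implementation: the toy `base_embed` feature extractor and the tiny logistic-regression head are assumptions standing in for a real pretrained model and its fine-tuned layers.

```python
import math
import random

def base_embed(seq: str) -> list:
    """Stand-in for a frozen pretrained model: maps a sequence to features."""
    rng = random.Random(sum(ord(c) * 31 ** i for i, c in enumerate(seq)))
    return [rng.uniform(-1, 1) for _ in range(8)]

def train_head(private_data, epochs=500, lr=0.5):
    """Fit a tiny logistic-regression head on private data; the base stays frozen."""
    w, b = [0.0] * 8, 0.0
    for _ in range(epochs):
        for seq, label in private_data:
            x = base_embed(seq)                    # frozen features, never updated
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label                          # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(head, seq):
    w, b = head
    z = sum(wi * xi for wi, xi in zip(w, base_embed(seq))) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy "proprietary" dataset: sequences and labels are made up for illustration.
private = [("AAAA", 1), ("CCCC", 0), ("AAAC", 1), ("CCCA", 0)]
head = train_head(private)
```

The key property is that training only ever updates the small head, so the frozen base, the part shared across customers, never absorbs anyone's private data.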
It's exactly the same as if you go to one of the large LLM providers as an enterprise customer and ask: "Hey, can I fine-tune or refine your model on my private data? But I do not want you to learn from my private data and give it to somebody else." And the large providers, whether it's Google with Gemini or Microsoft and OpenAI with GPT-4 and 4.5, offer this service. So it's exactly the same but applied to biology, where we are a leading force, and working together with BioNTech, we're actually capable of providing experience on use cases that can be used by others on different tasks in biology. And biology is quite vast, so that's totally not a problem. Do we have a question here? Yeah.
My name is Harry, reporting for Time Magazine. You've described Layla as an agentic system, with, you know, expert-level biological knowledge and tool use ability. My question is: What measures do you have in place right now to ensure that this tool isn't used by non-expert actors for malicious use cases?
So, it's a very good question. Until now, Layla is actually only available internally and in testing. But importantly, the main use case for Layla is improved interaction with the system and improved tool use. So, for example, you can ask Layla to run a routine on SegmentNT, as has been shared. Those models are open source, and we are restricting Layla's users to tool use, database queries, and the like. So we have a very stringent process, making sure that the system cannot be used for other things before we release it. But as a conversational interface to very powerful tools such as BFN and SegmentNT, as we've seen, there is a great use case for Layla.
Even in the lab, where we've added a text-to-speech capability for people working in the lab, which we believe democratizes AI. Because if you look at where we are, with biologists on one side and AI and machine learning experts on the other, those rarely intersect. And so Layla is democratizing access to powerful tools, but tools which are tested and open source, and we are constantly exchanging with the online community to make them better. So we're very careful about that. But to answer your question: specific tool use, and increased conversational capabilities to lower the entry bar for using these systems in a safe environment.
Just to clarify, the tool uses, are they only to improve?
Yes. The tool use here is restricted: we do not allow Layla to go on the internet, for example, and use arbitrary tools or models. It's limited to the models actually developed by InstaDeep, in this case, or models that have been open source for a while and validated by multiple parties. So we do indeed restrict the list of tools to only validated ones.
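The restriction described here, limiting an agent to a fixed list of validated tools, is essentially an allowlist dispatcher. A minimal sketch of that pattern, with placeholder tool names that are assumptions rather than Layla's real interfaces:

```python
# Validated tools only; names and behaviors are illustrative placeholders.
ALLOWED_TOOLS = {
    "segment_nt": lambda seq: f"segments({seq})",   # stand-in for a SegmentNT call
    "db_lookup":  lambda key: f"record({key})",     # stand-in for a database query
}

def dispatch(tool_name: str, arg: str) -> str:
    """Run a tool only if it is on the validated allowlist; refuse anything else."""
    tool = ALLOWED_TOOLS.get(tool_name)
    if tool is None:
        raise PermissionError(f"tool {tool_name!r} is not on the allowlist")
    return tool(arg)
```

Anything the agent requests outside the allowlist, web access included, is refused before it can run.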
Hey, another question from the webcast here, this time from Elliot Bosco of UBS. This is about prioritization. So the question relates to how BioNTech prioritizes the incorporation of InstaDeep's technologies throughout our development pipeline.
Yeah. So I'll try to take that. You know, we don't actually differentiate directly between InstaDeep-derived molecular structures and non-InstaDeep- or non-AI-derived structures. The goal that we're striving for is to embed AI where it makes sense to do so. And as you've seen today, at least in glimpses, we actually think that the applications across our platforms are quite broad. So we've talked about personalized cancer vaccines being an obvious use case, where we're using AI to design each and every vaccine in terms of the neoantigens, or mutations, that we target per patient. But you've also seen use cases for off-the-shelf drugs. You know, Ugur talked about our model being focused on both off-the-shelf drugs, more traditional in that sense but still using novel technology, and individualized therapies. We see applications in both.
So it really comes down to performance, right? We oftentimes will take AI-derived molecules into the wet lab, into lab testing, versus non-AI-derived molecules, and it's a head-to-head battle on performance.
Yeah. Actually, there have been many cases where we have AI-generated constructs coming from InstaDeep and expert-designed constructs from the BioNTech colleagues. These are anonymized, and then we take them to the lab and we see what works. The feedback from that is that the best approach is actually mixing the domain expertise of the biologists at BioNTech with the AI experts. When those collaborate, that's when you get the best results. But obviously, AI capabilities are increasing constantly, and this is something to keep in mind. I think we had a follow-on question from a gentleman there. Yeah, before. Yes.
Where am I going? Oh, there you are.
Yes, it was me, Ian Johnston, Financial Times. Just a question on Layla. How do you tackle potential hallucinations within the model, and what impact that could have on its use in the lab? And how broadly within the lab is it likely to be used? Is it across all-
Yeah, it's-
- applications?
It's a very good question. I would say, you know, hallucination is a well-known problem. But if you look at the evolution of the latest models, this is increasingly less of a problem. And how we tackle it is by having very solid guardrails in terms of prompt engineering and other techniques. To give you an idea, Layla can accommodate a context window of a hundred and twenty-eight thousand tokens, roughly a hundred thousand words. So you can use these to give a very powerful context. We use it, for example, to give the context of every machine. In the live demo from the tech lab in Mainz, we showed you an entire manual.
This has been uploaded into Layla in its entirety, and Layla can then provide very accurate context. So where we are today is that we see this technology being very productive, and as we've seen with Sven and Michael, who are experts in lab automation at BioNTech, this is already useful today as a window into what the systems can do. As for what's coming next, we believe it's about continuing to integrate the technology into the lab, but really keeping that loop with the lab experts to see where this is most useful. The first feedback we got is that this is useful for creating radical transparency into the state of every machine, and for troubleshooting, but obviously, as time passes, we're gonna see more and more use cases.
But I think what's exciting is to show that this is actually possible today. Very often we think of large language models as only smart Q&A partners. But as you've seen today, including the action capabilities of Layla, there is a lot more that is gonna come soon, and we believe the people who are gonna be the best at this are the people who iterate with real lab technicians and experts to make sure this is deployed the right way.
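The grounding approach described in this answer, loading an entire device manual into a large context window so the model answers from the manual rather than inventing details, comes down to simple prompt assembly. Everything in this sketch is illustrative: the word-based token count is a crude approximation of a real tokenizer, and the machine and manual names are made up.

```python
MAX_CONTEXT_TOKENS = 128_000   # the context size quoted above

def build_prompt(manuals: dict, question: str) -> str:
    """Pack device manuals into the context, then append the question."""
    budget = MAX_CONTEXT_TOKENS - len(question.split())
    parts = []
    for machine, manual in manuals.items():
        words = manual.split()[:budget]        # truncate to the remaining budget
        budget -= len(words)
        parts.append(f"## Manual: {machine}\n" + " ".join(words))
        if budget <= 0:
            break
    return ("\n\n".join(parts)
            + f"\n\nQuestion: {question}\nAnswer only from the manuals above.")
```

The final instruction line is the guardrail: the model is steered to answer from the supplied manuals instead of its own guesses, which is one practical way to reduce hallucination.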
One more question.
Hi, Manos Mastorakis from Deutsche Bank. So we heard a lot of fantastic stuff, most of which I believe is in preclinical, in terms of the preclinical kind of processes and how you discover drugs. Some of the things that we didn't hear is how BioNTech is using AI across the, you know, operations of the company. So when it comes to manufacturing or how the company is run in general, or how clinical trial data is analyzed. Could you give a bit of color on how you use AI with or without InstaDeep's help in those domains? It would be helpful.
Yeah, it's a great question, Manos. So, you know, the primary use case is, and has been, to embed AI in drug discovery, right? That's where we see, again, an ability to combine our therapeutic platforms on one hand, which are largely very novel, with the AI capabilities that InstaDeep brings to bear. And that's where we see truly profound, disruptive potential in terms of discovering and developing new drugs. But you're absolutely right that we see broader potential across different domains of the business. And there are examples, there are projects that we didn't highlight today, that are happening in other areas. For example, we're looking very closely at how we can make clinical development more efficient, right?
How we can select patients more efficiently, how we can write protocols more efficiently, just to name a few. Of course, manufacturing, which is also a core competency for us: both personalized RNA and bulk RNA, and even cell therapy, we do in-house. We think there are multiple applications there, along with supply chain. In those cases, though, we at BioNTech are also aware of the fact that there are other external service providers or technology providers that might have certain domain expertise or a certain focus. So, where we think we can competitively differentiate through an internal solution, that's something where I think InstaDeep is very well placed, or where there's a particular problem where InstaDeep has deep expertise.
I mean, there are actually quite a few of those areas, especially across the industrial automation arena. But we're also gonna use external providers, too. So I think we're in the fortunate position to be able to choose where we invest in InstaDeep to build capability and where we might rely on a third party that has an existing business, and it's a mix across the company.
Exactly. The goal is not to have InstaDeep deployed on every potential AI use case within BioNTech. To give a concrete example, if it's a system to better manage HR, it's probably better to have an off-the-shelf solution rather than have the limited capacity that InstaDeep has, in terms of personnel and number of projects, deployed on that. So we will aim to really move the needle: look at what is strategic for BioNTech, where we can bring that extra edge in terms of innovation, compute at scale, and deployment that makes a difference. And so we constantly weigh the value of having the InstaDeepers on a specific project versus taking an off-the-shelf solution, and we're very pragmatic about that.
Maybe we'll take one final question from the webcast. This is from Daina Graybosch of Leerink: "On cancer vaccines, can you help us understand how the InstaDeep platform could assist in not just identifying neoantigen mRNA sequences, but also in predicting neoepitope immunogenicity?"
I'll take that. Well, so how can we use it to predict neoepitope immunogenicity? That's a fundamental application of AI that we've had ever since we went to an in silico process: trying to predict MHC Class I and II binding affinities and immunogenicity for a variety of antigens, and being able to apply that on a per-patient basis, right? So that's a very fundamental use case that we're already using AI for. In terms of how InstaDeep can help us do that better, maybe, Karim, you can talk about some of the latest advances in models that might be applied.
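The per-patient selection step described here, predicting MHC binding for candidate neoepitopes and choosing which to include, boils down to scoring and ranking. A heavily simplified sketch, where the scoring function is a toy placeholder and not a real binding or immunogenicity model:

```python
def toy_affinity_nm(peptide: str) -> float:
    """Placeholder 'predictor': favors hydrophobic anchor residues.
    Returns a pretend binding affinity in nM (lower = stronger binder)."""
    anchors = peptide[1] + peptide[-1]         # position 2 and the C-terminus
    score = sum(2.0 if aa in "LMIVF" else 0.2 for aa in anchors)
    return 5000.0 / (1.0 + score)              # stronger anchors -> lower nM

def select_neoepitopes(candidates, top_n=2):
    """Keep the top_n candidates with the strongest predicted binding."""
    return sorted(candidates, key=toy_affinity_nm)[:top_n]
```

In a real pipeline the scoring function would be a trained model evaluated against the patient's own HLA alleles; the ranking-and-selection structure around it stays the same.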
Yeah, absolutely. I think this is a very rich topic, and there are multiple ways you can bring AI-led improvements into the current pipeline. So while I can't disclose specifics here, you definitely have to look at AI as a capability that you can apply across the pipeline. Today, we've shown you several examples, and there are also other use cases we're working on. If you look at the capabilities that AI offers, they're increasing extraordinarily fast. For example, the lab demo that we've seen is something that would have been strictly impossible a couple of years ago. Now it's possible. So we constantly reevaluate, but yes, could we do something about personalized cancer vaccines? Absolutely, indeed, from a technical standpoint.
Yeah, and maybe I just want to add one point to that. If we look at what the holy grail is in the personalized cancer vaccine space as it relates to AI, one of the unique aspects of a personalized cancer vaccine that's powered by AI is, again, the ability to harness and create data assets, right? So far, the industry has largely built and trained these models on hundreds, maybe a couple thousand, patients of data. There's a lot of overlap in the datasets that different companies have used, and a lot of the datasets are actually shared across university and public-private consortiums.
I think if we fast-forward, if we as an industry are able to harness the power of data as we treat more patients, that could truly unlock a pace of learning that could be exponential, to an extent. Again, we've never had in the industry a model where, for a given drug modality, the more patients you treat, the better the modality gets, right? That's not been the paradigm in the past, and I think with personalized cancer vaccines, that could be the model in the future, and of course, AI is gonna be an important ingredient to help us get there.
Absolutely. Like Ryan said, the limiting factor in biology today, and this includes personalized cancer vaccines, is really data. The more data you have, and the more you deploy AI to extract more data or make better use of the data you have, as in the few examples we've seen, the better things get. So it's all about going through this virtuous loop faster and faster, which is what we are doing between InstaDeep and BioNTech. Awesome. So-
Thank you.
I think we covered everything. Thank you, guys!