Good morning, everyone, and welcome to DASH. It's amazing to be here again today. We have a lot to show you, and I will tell you that it feels amazing to meet with so many of you here in New York this week. First things first, I'd like to thank our sponsors and partners. They do so much for us, for DASH, and for Datadog, and you can meet them all in the expo hall today. Now, I won't be very long. As you know, at Datadog, we prefer to let the product do most of the talking. So this morning, members of our product and engineering teams are going to come on stage, and they will show you some of the things we've been working on. But before we do that, I'd like to take a minute to thank all of you.
I want to thank you for your trust, for your business, but also for all of the feedback you give us every single day. In fact, since last year, we have been in 187,000 customer meetings, and these meetings ultimately resulted in about 500,000 releases to production, covering more than 400 new products and new features. We build our products for you and with you, and together, we work really hard to allow you to fully observe your applications, to operate securely in the cloud, and most importantly, to take action, so you can run leaner, faster, and better businesses. To get us started on that road, I'd like to invite on stage my co-founder, Alexis.
Thanks, Olivier. It's really great to be here, and we have a lot of things to show you today. They illustrate the kind of work we've been doing across the entire platform over the past year. That work reflects what you've been telling us: new use cases, new software stacks, but also larger scale and faster pace on existing ones. Let's get started with AI. More specifically, we hear from a number of you that your LLM-powered applications are moving to production. Once in production, it is crucial that they're monitored like any other load-bearing application. But what's different in their case is the kind of data that's essential to understand health, performance, and safety. To tell you more, please welcome Mohamed to the stage.
Thanks, Alexis. Hi, everyone. My name is Mohamed Alimi, and I'm an engineering lead at Datadog. Over the past year, we've seen the impressive potential of large language models, as many of you experimented with them, and this led to an incredible innovation across many industries, where many of these experiments have evolved from simple application into much more sophisticated systems running in production using multiple LLMs, orchestration framework, retrieval system, and knowledge graphs. But this also led to new challenges. First, as these applications are involved on LLMs and more complex patterns, they become much harder to troubleshoot. Second, due to the inherent unpredictability of LLMs and AI components in general, these applications need continuous monitoring for hallucination. And finally, these applications can face significant risk from prompt hacking and data sharing.
To help you address all these challenges, I am happy to announce that Datadog LLM Observability is now available. Thank you. Let's see how it works. What you see in front of us is a live stream of interaction from an LLM-powered e-commerce chatbot. Datadog LLM Observability highlights issue that requires immediate attention at the top. For instance, it has flagged errors, potential hallucinations, potential hallucinations, slow responses, token count, and security threats. It also highlights faithfulness, which is a measure of correctness and accuracy relative to a given context, and we use faithfulness here as a proxy for hallucinations. I'm interested in the reported hallucination, so I select the first item, and I land on a comprehensive page with valuable information about the interaction.
So here you can see the duration of the interaction, the token count it consumes, the number of LLM calls it made, and also the models it invoked. Right below, you see the input and the output. In this case, a chatbot user has asked a compound question about a recent order. At first glance, the response seems fine. But when I check the trace view below, I can see an issue with one of the spans... So I click on it to dive deeper, and I see it's a retrieval span that is flagged for hallucination. Datadog LLM Observability reports all the context chunks that were used to compile this response. I also noticed that the chunk that received the highest relevancy score contains outdated information. However, the correct chunk has received a lower score.
The response that we saw earlier, which felt fine, is actually incorrect, and this is great. With few clicks, I managed to find out this issue. I don't wanna stop here, I wanted to know more, and in particular, I want to know the reaction or the behavior of a chatbot with respect to similar questions. With a single click, I discover that this question belongs to a wider cluster about return policy. Datadog LLM Observability groups semantically similar prompts and responses into clusters, and also auto labels them for easy analysis. Here in this cluster map, you can see that the return policy clusters is impacted by a high rate of hallucination, which is not the case of other clusters. It is very likely that the category of hallucination we've seen earlier is very specific to this cluster. This is great.
I have all the information that I need to report this issue to the relevant teams. Datadog LLM Observability allowed me to isolate the issue, understand the root cause, and also assess how widespread it is. Now I go back to where I started to address other issues, and this time we'll focus on security threats. Datadog LLM Observability allow us to monitor application inputs and outputs for malicious input and sensitive data leaks. I select the first reported item, and here, as you can see, we have a user who asked a question about return policy that contains a malicious segment that forced the chatbot to generate a legally binding million-dollar offer. Datadog LLM Observability has correctly identified this issue as prompt injection. Here again, I have all the information that I need to report this issue.
As we saw, Datadog LLM Observability helped me quickly troubleshoot my application. It also reported a quality issue, in this case, a hallucination, that could cause customer dissatisfaction and disengagement. It also reported a security threat, in this case, a prompt injection that can cause financial damage and erosion of customer trust. We're so excited to put this product into your hand. So to start, you can use the link above and be sure to visit our booth. It's so great to see so many of our customer innovating with LLMs and also using Datadog LLM Observability. Another great examples, here Jaime from WHOOP to tell you more. Thank you.
My name is Jaime Waydo, and I'm the Chief Technology Officer at WHOOP. WHOOP is a wearable health and fitness coach that helps our members achieve their goals. Unlike other wearables, WHOOP is worn 24/7 and delivers personalized data metrics across our key pillars of strain, sleep, recovery, and stress. My engineering organization is responsible for ensuring this always-on experience and for bringing AI into the technology in a meaningful and accessible way. WHOOP Coach is our AI-powered, bespoke human performance coach. We are teaching Coach to get smarter every day, to bring more bespoke summaries of our members' data to them, to give them recommendations that are actionable and realistic, to help our members truly make the most of their day, regardless of their goals.
For this, WHOOP leverages Datadog as the observability platform to monitor the health of all of our systems, from our infrastructure health and hardware performance to our UI performance and member experience. No matter the platform, Datadog powers our teams to set goals through SLOs and respond to monitors to ensure an always-on experience. Now, as you all know, AI expanded the complexity of the software world, and in particular, how to measure and monitor AI performance across a myriad of generalized interactions. Additionally, the landscape is ever-evolving. It will never be a product that is totally finished being built. This is one of the reasons partnering with Datadog is so important, because we need a lot of eyes to monitor the health of our systems so that we can ensure the greatest possible product is in our members' hands.
WHOOP has collaborated with Datadog to integrate these new critical metrics for our teams to monitor. As we expand our AI usage across the product experience, Datadog LLM Observability enables us to get visibility and quickly resolve issues to model performance, time to first token, data extraction, and even A/B testing across our prompts. Together, WHOOP and Datadog are diving deep into what it means to build a large-scale product built on AI for our members.
Thank you, Jaime. You know, I, too, wear a WHOOP, and this morning, I've been watching my stress level before the keynote. I asked our AI WHOOP Coach to help me relax before getting on stage. It guided me to do some helpful breathing exercises. Now I feel much better. So you've just heard from one of our customers about how we support the safe use of LLMs. We also work with a number of AI partners. They're the folks who provide the foundational models that power all these new applications. There, our work is about getting tightly integrated with their stacks so that getting deep observability on LLMs is just a one-click operation for you. To tell you more, I'd like to hand it over to Daniela Amodei, President at Anthropic, one of our AI partners.
Thank you, Alexis. Hi, everyone. I'm Daniela Amodei, President and Co-founder of Anthropic. Anthropic was founded three years ago with the goal of building trustworthy and reliable AI with safety at its center. Today, our mission is to ensure the world safely makes the transition through transformative AI. Earlier this year, we launched our Claude 3 model family, which was designed for many of the use cases that businesses like yours care most about. Claude 3 Haiku enables many AI use cases like customer support, chat, and sales. Claude 3 Sonnet enables the use of AI to perform tasks like search and retrieval over your own data, or to generate content and code to save your employees time. And Claude 3 Opus, for use cases that require state-of-the-art intelligence for complex domains like R&D, drug discovery, market forecasting, and more.
Our models are designed to meet your diverse needs and use cases, but they all share a common goal: to help people feel more at ease with AI and the trajectory of this evolving technology. At Anthropic, we recognize there's still a long journey ahead for the AI industry. We are here as a trusted partner in every step of your generative AI journey. We aim to provide the deep technical expertise and reliability that come with models like Claude, as well as the observability and security features that our partners, like Datadog, provide. And this is why today I'm excited to announce our integration with Datadog LLM Observability. This new native integration offers you, our joint customers, robust monitoring capabilities, a suite of evaluations that assess the quality and safety of your LLM applications.
This provides you with real-time insight into performance and usage, with full visibility into the end-to-end LLM trace, enabling you to troubleshoot any issues, reduce downtime, and get your Claude-powered applications to market faster. At Anthropic, we believe that what we do now will set the trajectory for how AI unfolds. Looking ahead, as our models keep getting more powerful, we are committed to ensuring that they are best-in-class on the factors that you care about: intelligence, speed, price, and reliability. And last but not least, we believe that understanding these models fully is critical, which is why we've invested so heavily in interpretability research that helps us see inside the black box of the model and understand how it works. We're excited to partner with Datadog and bring trusted, powerful AI to all of you. Thank you.
Thank you, Daniela. We, too, are really excited to get AI to be smarter, faster, cheaper, and better. So to recap, if you're building an LLM-based application, please give our LLM Observability a try. We'd love to hear what you think. Now, the meteoric rise of GenAI over the past couple of years doesn't mean that we get to ignore the rest of the stack. Containers, for instance, continue to be adopted and deployed at a rapid pace, so we continue to invest a lot of time and effort in making them easier to manage. To hear about the latest and greatest in container monitoring, I'd like to welcome Danny to the stage.
Thank you, Alexis. Hi, I'm Danny Driscoll, Product Manager for Container and Kubernetes Monitoring here at Datadog. Now, most of you here today are already partnering with us to monitor the health and performance of your Kubernetes environments with Datadog Container Monitoring. Over the years, many have told us that you chose to build your platforms on Kubernetes to deliver more efficient resource use, which can lead to lower infrastructure costs and lower energy consumption impact for your businesses. However, in our latest research, we observed that more than 65% of Datadog-monitored Kubernetes containers are still using less than half of their requested memory and CPU resources. There's still more that we can be doing here. That's why today we're very excited to announce Datadog Kubernetes Autoscaling.
With Datadog Kubernetes Autoscaling, we'll have a new solution that will allow you to prioritize the workloads and clusters with the most savings potential, to take direct action from the Datadog platform to apply and then automate right-sizing recommendations, and to observe and measure the impact of your complete autoscaling program on your key cost and efficiency metrics. Let's take a look at a demo. So starting from my Kubernetes overview, I can now immediately see the total idle cost for my entire Kubernetes footprint across clouds. In this case, I see that I have over $85,000 in idle spend last month, and I'm motivated to start optimizing. On the cluster list view, I have a prioritized list of all of my Kubernetes clusters across clouds.
These are sorted based on their idle CPU and memory use, with visibility into idle cost and a day-by-day breakdown of their total cost over the trailing 30 days. All of these signals are available with my existing Datadog Agent instrumentation without any need to deploy any new tools. Now that I've identified this top dev EKS Shoppers cluster as my most overprovisioned, I can continue on to optimize it. On the cluster detail view, I again have a prioritized list, this time of the workloads within my cluster, sorted by idle CPU and memory with visibility into idle cost. In this case, I can clearly see that Ad Auction 2 is my most expensive and overprovisioned workload, and I can take direct action from here to optimize it. When I open up Ad Auction 2, I get a complete multidimensional scaling recommendation for this workload.
That combines vertical, how to right-size the pod by setting proper CPU and memory limits and requests, and horizontal, how to select the proper amount of replicas to meet demand. For each component of this recommendation, I can drill in and inspect the Datadog metrics backing it at the individual container level to build understanding and trust before I proceed with any changes in my environment. Once I do have that trust and I'm ready to proceed with this change, I have multiple options for how to do so. I could simply reference these values with my existing configuration tool or GitOps workflows, but in this case, I'm more excited to take advantage of the Datadog platform to apply it directly. Now, I can do this as a one-time change and then adjust the workload based on its recent traffic up till now.
But in this case, I'm even more excited to enable autoscaling, which will ensure that Datadog can continuously monitor and tune this workload so that its usage closely tracks with its re-requested level of resources moving forward. Now, let's jump ahead to after I've started autoscaling this workload. Here, we're able to observe two key signals from the workload that show how the autoscaling is taking impact. On the right, we have an event stream of the Kubernetes events emitted by our new Datadog Pod Autoscaler custom resource, which is responsible for that continuous reevaluation and application of changes. Those events are overlaid with our CPU and memory metrics for the workload, which we can now see have a tight alignment between usage and requested resources, reflecting a more efficient arrangement for the workload.
Back out at the cluster level, we can start seeing how this has an impact in aggregate. As we see, the total CPU and memory allocatable for the cluster trend down as our autoscaler is able to kick in and start downsizing the nodes for this cluster, leading to that direct cost savings. So to recap, with Datadog Kubernetes Autoscaling, we have a solution that will allow you to easily and quickly identify and prioritize the clusters and workloads with the most opportunity for savings, to take direct action to apply and then automate right-sizing recommendations from the Datadog platform, and to observe and measure the impact of your entire autoscaling program on your key cost and efficiency metrics. We're very excited to start our private beta of Datadog Kubernetes Autoscaling with you all.
Feel free to sign up at the link here and swing by our booth for more information. Now I'm very excited to introduce Jason to talk about logs.
Thanks, Danny. Hi, I'm Jason, Product Manager here at Datadog. As you all know, logs are a rich data source containing a wealth of information that can be used in incident response, security investigations, and even business reporting. In many ways, logs are the foundation of O bservability. But logs come with trade-offs. There's no set schema or standard for logging, which makes performing this rich investigation difficult. With logs coming from more sources than ever, it's hard to know what attributes and content need to be extracted ahead of time. And let's not forget, I'm often forced to lean on a small group of expert users writing long queries in a proprietary syntax just to derive insights from my logs. But what if I could craft these powerful analytical queries myself just by chaining together a series of simple operations? Well, now I can. Introducing Log Workspaces.
Log Workspaces allow me to freely join and transform data across multiple sources, and then chain together simple queries to perform complex analysis in a single collaborative space. Let me illustrate the power of Workspaces with an example. Say I'm an engineer at a trading company, and my team notices an increase in the number of failed transactions between two of our services that receive and execute trades. While the rest of my team works on root cause, I've been tasked with understanding the potential revenue impact on our business, and more importantly, who our affected customers are. Given that my logs are coming from two separate services with different attributes and content, and that my customer data lives within Salesforce, I realize there's no single source for me to query.
I'm going to have to join fields from across these three sources, calculate new values, and even reference data that lives outside of Datadog. With Log Workspaces, I can do just that. Let me show you how. I'll start with a simple search in the Log Explorer, as it's the fastest way for me to quickly filter across all of my queries and logs. I want the logs from my trading platform environment, specifically that service that receives these trades. I'll take this search and open it as my first data source in the Workspace. Data sources help me transform my logs into structured tables that I can query with SQL, so they're very important. The definition of these tables comes from the columns that I've selected in my search, and I'm going to name this data set My Trade Received Logs.
Okay, now I have a record of all of these trades starting. I want to figure out which of them didn't complete successfully. These logs come from another service that finalized these trades, and I'm thinking that I can join them using a transaction ID. So let me import this data source as a second source, the Trade Execution Logs. Now, looking at those logs from that trade finalizing service, the attributes are quite different on these logs. I definitely need that customer ID and status, but it looks like the transaction ID that I was hoping to use to join my logs together isn't available as an attribute. But with Log Workspaces, I can extract fields directly at query time. Let me show you how. I'm going to use a transformation cell, which I'll name Parsed Execution Logs, to help me extract that transaction ID from my log message.
I'll use the word transaction as an anchor and then capture the next word, which is my transaction ID. And just like that, transaction ID is now a column that I can query. Okay, we have all the pieces here. Let's stitch it together. I'll use an analysis cell to generate this failed transaction record, and I'll ask Bits AI to query it for me. I know that I want my timestamp, my customer ID, my transaction ID, and the dollar value for my received logs, as well as the transaction status for my execution logs. I can use the transaction ID that we just extracted to make this join, and while I'm at it, I'll tier the trades so the high-value transactions are marked as being over $250.
Of course, I'm looking for the transactions that failed to complete, so I'll be sure to filter for just the records with errors. By describing my goal, Bits AI is able to write and execute this query for me. We're almost there, but remember, I actually wanted to know who these customers are. For that, I get a monthly customer report from Salesforce, so let me import this data source into the Workspace to finish my investigation. These are my Salesforce users. This time, I'll join the customer data in place using the customer ID for my Salesforce users and that failed transaction record to produce a final shareable report, the transaction record with names. There we have it. I have my customer name and country data from Salesforce alongside the transaction value, the transaction tier that we derived, and the transaction status from my execution logs.
Now, I know I was only tasked with figuring out who these customers are, but let me take this a step further and see if they have anything in common. This might be of help to my team working on root causing the issue. For that, I'll visualize this data and see if anything stands out. I'll start with an easy one, filtering for that high-value tier that we just derived, and then grouping by the country data that we imported from Salesforce. Immediately, I'm noticing that a large number of these high-value trades that are failing are coming from Italy, so I'll be sure to let the support team for that region know. I'll also share my full report with the rest of my engineering team, as this might be of help for doing root cause.
Maybe we need to be taking a closer look at the EU data center. They'll be able to see exactly how I came to this conclusion, and we can continue the investigation together in my Workspace. This is just one example of how the Workspace was able to help me identify the impact of failing transactions on my business and identify our impacted users. But let's imagine together the possibility of this Workspace for security investigations, creating evidence timelines, doing postmortem analysis, or even as part of an audit and a compliance report. Using Log Workspaces, I'm able to get the most out of my logs by joining, transforming, and chaining together data from logs and other sources expressively within Datadog to construct nuanced data sets without having to learn a complex syntax. ... I'm empowered to perform my own analysis, being limited only by my own imagination.
To get started with Log Workspaces, and to learn more about how to go further with your logs, please visit the link behind me, and be sure to visit our booth on the expo floor. And now, to tell you more about how to get the most out of the rest of your Observability stack, I'll pass it off to Gordon.
Hi, everyone. I'm Gordon. I'm a director of engineering, working on APM and OpenTelemetry. Now, as many of you may know, OpenTelemetry or OTel, as a standard for instrumentation and telemetry, offers a ton of great benefits like portability, interoperability, and vendor neutrality. At Datadog, we believe OpenTelemetry is revolutionizing observability by providing a standards-based foundation for us to build on, unlocking innovation across the industry. It's a tide that lifts all boats, and that's why I'm thrilled to invite Michelle Titolo from GitHub to the stage to tell you a bit more about how they navigated their tide on their way to OpenTelemetry and Datadog.
Thank you, Gordon. My name is Michelle Titolo. I'm a principal software engineer at GitHub, working on our platform engineering organization, and today I'm here to share with you our OpenTelemetry journey. GitHub is the home of open source software, and wherever we can, we try to use open source. We're also running at a really massive scale, so any piece of software we choose requires a lot of time and planning to roll out. Before I get into our journey, I just wanted to share some facts. GitHub serves 5 billion API requests per day. We have over 100 million people collaborating across 420 million repositories. We have hundreds of services, including our very large Ruby on Rails monolith. We run everything from containers, to bare metal, to VMs, and they all send traces.
Our thousands of machines are what make GitHub possible, and they're all hooked up with OpenTelemetry. But first, let's start with the beginning. Way back in 2016, we added tracing to GitHub.com. At the time, there was no OpenTelemetry, no open standards, so we used something from our vendor. When OpenTracing came out, we migrated to it, and we stayed there until OpenTelemetry came out. And then there were a lot of services at that point that had tracing, so it took us a bit longer to actually roll out. So a few months later, we GA'd our internal OpenTelemetry, which included the monolith. And then we had the long tail. Remember, hundreds of services. It takes time for all of those different teams to do the updates to get them onto OpenTelemetry.
But then we were starting to see that long tail really shrink and get to that point where there were only a handful of services left, and we thought, "We've been on the same vendor for seven years. Maybe it's time to think about something else." And so we did. Last summer, we began our evaluation into our alternative tracing vendors because we were on OTel. That same month, we began a proof of concept with Datadog APM. We were already a metrics customer, so being able to consolidate was a huge win for us. In September, we onboarded to Datadog and turned on the Datadog APM integration. And then in October, we turned off our old tracing vendor. If you're doing math in your head, that's four months from our initial, "Let's just think about migrating," to actually performing the migration and getting onto the new platform.
That's the power of OpenTelemetry and using vendor-agnostic tooling. You hear a lot of people say it's possible, and I'm here telling you that we did it. We also had a really easy time getting set up with APM. We had 38 lines of YAML, which is not a programming language, to initially set up APM. We also run what's called a gateway model, so we don't have sidecars, which most people assume. Again, bare metal, VMs, sidecars are hard. So instead, we run a fleet of collectors that all of our applications connect to, and that collector, that one central place, is where we configure where we send our traces. But of course, nothing ever goes quite as planned in software. So we did run into a couple of hiccups, which I'm just gonna briefly share with you.
Firstly, we had a month of time between when we added APM and got rid of our old vendor. So for a month, we were sending twice the amount of trace data. Any software engineer will be able to tell you that you cannot do twice the amount of work with the same amount of capacity. So our OpenTelemetry collector fleet got a little unhappy. Thankfully, because it was a single fleet, we made a pull request to add more capacity. We increased the size of that fleet, and we were able to run for that month with our increased capacity. And then once we turned off the old vendor, we were able to scale back down. We also have this really big Ruby on Rails monolith that I recently learned is 16 years old. So adding libraries there or making some foundational changes are challenging.
We were running into some issues with the upstream Ruby OpenTelemetry libraries. But again, OpenTelemetry is open source, so we were able to make pull requests upstream, get those merged, get those gems released, and then pull them into our application in order to see the results, and the rest of the community gets to benefit. I'm gonna wrap up with just a few success stories on how tracing and OpenTelemetry have really helped us improve our engineering visibility. The first is with performance savings. One team was investigating why updates to pull requests were taking longer than any other call for pull requests. They looked at their traces, their flame graphs, and saw we were updating the model twice. Easy fix. They batched that together and were able to significantly reduce the latency for that one API call.
We also love traces for end-to-end build visibility, especially when it comes to our GraphQL services. The GitHub graph is huge, which means engineers can query tens to hundreds to potentially thousands of objects. Our authorization service is responsible for making sure those results are returned to someone who has access to them. And at the beginning of this, every single object was being called individually, so that's tens to hundreds to potentially thousands of queries to authorization in one GraphQL call. That wasn't great. So we built a new way of doing batching authorization calls from the authorization service, so that every GraphQL call results in one call to authorization, and not this huge N+1 thing that we were seeing. And the authorization service was really able to benefit from this and just be more performant in general.
Lastly, every company has those bugs that you're like: "What is going on here?" And what's been plaguing some of our engineers for years has been timeouts. Until very recently, we hadn't been able to see what was going on, what was still pending when things were timing out. So we had two different areas we investigated. First, we enabled pending spans in our OpenTelemetry collectors, so that if a span hadn't finished but the overall request timed out, we were able to see what was happening. But then, things still weren't working. So we looked at the upstream OpenTelemetry SDKs and realized we needed to change how we instrumented Rack, which is Ruby on Rails web server. We did that, we rolled that out, and for the first time, engineers were actually able to see what was happening when a request timed out.
We were also able to see those timeouts show up in our RED metrics because we were tagging the spans with errors appropriately. This has been a huge win for engineers at GitHub, and I'm really excited to share with you today. That's it from me with a really quick overview of GitHub's OpenTelemetry journey. Back to you, Gordon.
Wow! It's amazing to see the progress that OpenTelemetry has made over the past couple of years. The growth of the community and adoption from customers, such as GitHub, is a real testament to the need for this project. And that's why we're enthusiastic supporters of OpenTelemetry here at Datadog. We're a top 10 contributor, and over the past year, our engineers have worked to make profiling an industry standard through the new profiling signal and have helped the collector along his journey toward a stable 1.0. We're maintainers across multiple repos, and we expect to continue expanding our support. Stable standard is good for all of us, and we're happy to do our part. In fact, the benefits of that standard are why we've been working hard to make ourselves more compatible with OpenTelemetry.
Because we've been building in this space for so long, OTel doesn't yet support all the products that we do. With our pace of innovation, I expect that to continue to be the case even as we close the gap. That leaves you with a dilemma: go all in on Datadog and forgo some of the great benefits that OpenTelemetry brings to the table, or be limited to the products that OTel supports. Naturally, you're probably wondering, "Why can't I just have both?" Well, we've been working hard on that problem because we believe that Datadog is better with OpenTelemetry, and OpenTelemetry is better with Datadog. Last fall, we announced support for W3C Trace Context in the OTel API and our APM libraries, bringing vendor-neutral instrumentation and interoperability into our native ecosystem.
This was a big step, but we know that instrumentation is only one side of modern observability, and many of you also want the flexibility and control offered by the collector. Well, today, I'm happy to announce we're taking our next big step by unifying the Datadog Agent and the OpenTelemetry Collector. Now, you can benefit from the agent and the collector working together to form a whole greater than the sum of its parts, enriching your OTLP data and enabling our product suite. Yep, now you can just have both. With our new agent, collector users will immediately get access to our full product suite and platform. You'll enjoy app-based management of your collector fleet, and you'll get the peace of mind that comes with being backed by our dedicated product support. New agent users aren't being left out of the fun either.
You'll get access to the large and growing number of community-contributed integrations, including out-of-the-box support for the growing number of commercial and open source tools being instrumented natively with OpenTelemetry. You'll get better interoperability across the tools in your observability fleet, whether vendor-based or open source. Of course, you'll get control over your OTLP data with full access to the collector's powerful routing and processing capabilities. Let me show you how it works. To get started, simply install the new agent or update to the new agent. If you're wondering about your current collector config, don't, you can just keep it. All you need to do is point our new agent at your existing configuration. That's right. Your existing OpenTelemetry collector configuration and the pipelines defined by it will continue to just work with our new agent by leveraging the integrated collector.
Once deployed, collector users will immediately feel a difference in the depth of the product experience now that you have access to the full Datadog platform. For example, we know managing large fleets of collectors can be tricky. With access to our fleet automation tools, you'll be able to view and manage your collectors from within the app, getting visibility into configuration and dependencies. You'll get access to our full container observability suite, including autoscaling, as Danny talked about earlier, and live containers, giving you real-time insights into your containers and the processes running on them. You'll get access to unique features like Single Step APM, providing zero-touch, automatic instrumentation of all your running services, getting you app-level insight in minutes with minimal effort, all with the click of a single button or the setting of a single flag.
Oh, did I mention that Single Step works out of the box with your OTel API instrumented code? Because it does. And of course, when things go wrong, you'll be able to fall back on our full product support experience. And finally, you'll get access to the more than 750 integrations that come standard with the Datadog Agent and platform. So as you can see, whether your observability strategy is Datadog first or OpenTelemetry first, your experience only gets better with our new agent. Get the best of Datadog while benefiting from standards-based instrumentation. There's no more dilemma. Datadog and OpenTelemetry work better together, and the best is yet to come. If you're as excited about this as I am, head on over to dashcon.io/together to sign up for the private beta. And don't forget to check out the demo booth to see it in action. Thanks.
Well, thank you, Gordon, and thank you, Michelle, for sharing your story on OpenTelemetry at GitHub. Let us do a quick recap of what we just covered. With Kubernetes Autoscaling, you can optimize your spend by right-sizing pods to workloads and save real money. With Log Workspaces, you can slice and dice all your log data directly within Datadog. Last but not least, our investment in OpenTelemetry. We put our money where our mouth is, and we give you the best of both worlds with a newly integrated OpenTelemetry Collector within the Datadog Agent. Now let's switch gears. Let's talk about security. There is not a single day that goes by without hearing about some kind of cybersecurity incident, and when that happens, we know it's all hands on deck. What we're seeing is that even in day-to-day security, it's never just a security team that gets involved.
Everybody has to chip in. Code needs to be fixed and reviewed, new versions need to get deployed, cloud configurations need to get patched and hardened, and so on. In other words, whether we as dev and SREs like it or not, we are an integral part of the effort to secure our infrastructure and our application. And that is precisely why we've been building security products for the past few years in a way that takes advantage of the rest of the platform and in a way that we hope speaks to you. To hear about the latest in security, please welcome Daniela.
Good morning, everyone. My name is Daniela. I'm an engineering director here at Datadog, and I'm excited to share what we have been up to with our security products. Today, more than 6,000 of your companies and organizations use Datadog Security every day to detect vulnerabilities and protect your cloud environments from attacks. As all of you know, when under a security attack, every second counts. Getting the right context at the right time is crucial. Last year, we announced the Security Inbox, which allows you to sift through the noise and zero in on the most critical issues in your environment. Today, I'm excited to share that we have made it easier than ever to get started with Datadog Security and take immediate action right there from your Security Inbox with all the context you need to determine the next steps.
In fact, it can take as little as a few minutes, thanks to our newly launched Agentless Scanning, now generally available. Let me show you how. All I need is to go to the Cloud Security Management setup page and configure an integration with my cloud provider under Cloud Accounts section and activate Agentless. Here, I'm gonna activate Agentless for hosts, containers, Lambda functions, and Data Security as well. When I'm done with the activation, the Agentless scanner will immediately start analyzing whatever is deployed on my cloud accounts for the resources I enable it for. Let's go take a look at what it found. For that, I will navigate to my Security Inbox, which lists out all of my security blind spots. Here, I see immediately two different critical issues: one attack path and one application vulnerability.
My Security Inbox automatically correlated and prioritized issues across my cloud misconfigurations, identity risks, infrastructure, and application vulnerabilities. I didn't have to do anything besides enabling Agentless Scanning. Not to mention, that I didn't have to add any tasks to my team's backlog to get this visibility. All right. Prioritized inboxes are great, but do you know what's even better? An empty one. Let's see how we can make it happen. I will start with the first one in my list. It's a public EC2 instance, potentially exposing sensitive data. Let me investigate further. Here, on the side panel, I am able to get additional details, such as resource name, tags, and account ID. Looking to the new Security Context Map, I'm able to see my vulnerable EC2 instance in full context.
In the left side, it shows me how this instance is exposed to the internet, and the right side, the potential blast radius, which resources and services were impacted. In fact, if I can go one step further and view these in Datadog Security. With Datadog Security, launched today in beta, I see exactly what type of sensitive data is being exposed. Now that I have all the information, let's go back and fix this. Going back to the side panel with all the details, I see that I'm able to start fixing this vulnerability directly from the context map. By clicking on Remediate, I have several options to fix. This time, I'm going to the first one, Open a pull request. Datadog will automatically generate a pull request for me with all the details that my team needs to merge it.
It includes a brief description of the vulnerability, a link to the original finding, and a pointer to the vulnerable resource. Of course, I can check the proposed changes to the relevant files. In a few clicks, I am able to fix a potential data breach that could be a real headache for my organization, and I wrapped up the whole thing in just a few minutes. Amazing, right? But wait, I'm not yet done. I still have that second critical issue in my inbox. This time, it's a remote code execution. It's a vulnerability in production, it's under attack, and there is an exploit available. Let's investigate it further. Here, I see it's a vulnerability in the third-party library, Spring Framework. It's used by the product recommendation service. It's affecting my production environment, and there is a high probability of malicious exploitation.
Datadog gave this vulnerability a solid score of 9.8 out of 10. Looking to the severity breakdown, I understand exactly how each risk factor impacted the score. Datadog severity score gives me the full context: if it's sensitive or internet-exposed environments, the evidence of attacks, and exploitability risk. Okay, I think it's pretty clear that I should fix this. For that, I will navigate to the Remediation tab. Here, I find comprehensive step-by-step instructions from, "How do I find the vulnerable library in my dependencies?" All the way on how to upgrade it to a new version. But you know what? I already took care of that first issue in my inbox. This time, I want my development team to handle it.
Since my team is on Datadog, I can send them a Slack message directly from the side panel and create a Jira ticket with all the details that they need to start fixing this vulnerability. All right, let's take a look at the fruits of my labor... Amazing! My Security Inbox is at zero. I can finally take a break. We just covered a lot of ground. Let's recap. With our agentless scanner, you now can get started with Datadog Security in just a few minutes without deploying any additional agents or software. With our actionable Security Inbox, you can get immediate context and spring into action using the new Security Context Map and Data Security as well. Last but not least, with our Infrastructure as Code auto-remediation, you can now automatically generate pull requests for your Infrastructure as Code.
To learn more about everything I just covered, please use the link on the screen. Now, to tell you how Datadog can help you to catch security issues even before they reach production, I will hand it over to Julien.
Hi, everyone. My name is Julien, and I'm a software engineer here at Datadog. Daniela showed us how quickly we can get started with Datadog Security. Datadog Security shows the vulnerability in your production context. But over the past few months, we learn from the conversation with more than 6,000 organizations that use Datadog Security, that you want to find vulnerabilities earlier in the software development lifecycle. That's why today, I'm pleased to announce that Datadog now secure the entire software development lifecycle. From the first line of code you write, all the way to deploying and monitoring your application in production. Let me show you how. I will start my, my production environment. I want to make sure that I remediate the most important critical vulnerabilities. Application, Datadog Application Security Management show me the security posture of my application.
The prioritization funnel helps me to cut through the noise. Here, I go from 158 vulnerabilities in production to the sixth one that causes an immediate threat, and exactly what I want to focus on. On the right side, I can see the breakdown by team, service, or library. It helps me to prioritize my remediation efforts. Datadog also shows me how quickly my team remediates vulnerabilities. I want to make sure that my team remediates existing vulnerabilities faster than new ones are being discovered. I see also the breakdown of all my vulnerabilities according to the OWASP Top Ten framework. Now, my production environment is safe and secure, but I want, what I want to do is to continue to detect vulnerabilities, remediate existing vulnerabilities, and don't introduce new vulnerabilities, and for this, I use Datadog Code Analysis.
In just a few minutes, I connect my code repositories and get started, and Datadog analyze my code in my IDE or my code repository. I write code in my IDE, Datadog will analyze my code as I'm writing it. Here, I add a value in a database, and if I have a vulnerability, such as a SQL injection, Datadog detects it, but also suggests a fix. In a matter of seconds, I find a vulnerability, and I fix it. And if I don't use Datadog IDE integration, Datadog also analyze my code on GitHub and annotate my pull requests. Now, my code is safe and secure, but what I want is to make sure that the code of all my services and the code of all my teams and application is safe, and for this, I will use Datadog Quality Gates.
With Datadog Quality Gates, I define rules that are checked at every commit. With Datadog Quality Gate, I can define a rule that will block any commits that introduce a vulnerability, either in my code or in third-party library. With Datadog Quality Gates and Datadog Code Analysis, I can focus on what matters: writing code for my product, writing new features for my customers, and not having to worry about adding a new vulnerability. Let's recap. Datadog now secures the entire software development lifecycle, and with Application Security Management, I discover vulnerabilities in production. I can cut through the noise and work on what really matters. With Datadog Code Analysis, I detect issues in my IDE or in my GitHub pull requests. I prevent vulnerabilities from reaching production. And finally, with the Datadog Quality Gates, I check that no vulnerability is introduced at each commit.
Any commit that may introduce a vulnerability will be blocked. To learn more, please visit the link behind me, or please visit the booth. Thank you. Now, let me welcome Kassen Qian on the stage.
Debugging! We all have to do it. Sometimes it feels awesome. After diving in and having that aha moment, I feel like I'm on top of the world. I may not know what day it is, but at least I figured it out. Other times, I'm not so lucky. I have to comb through code I've never seen before, dig through documentation someone wrote years ago, think of ways to reproduce this thing, and ask for help, hoping I don't look stupid. I'm on a wild goose chase, and I haven't even touched my code yet. All I want is to fix the bug, and that's why today I'm excited to introduce Datadog's Live Debugger. For the first time ever, I can debug my application with live production data at every step of the process. Let's see how it works.
Let's say I'm a backend developer working on features for an e-commerce website, and I own a service that's responsible for handling the checkout process. In my editor, I've installed Datadog's IDE extension for detailed code insights as I work. I notice that Datadog Error Tracking is telling me that there's an error in my file. It's on the method that checks for valid items in the shopper's cart, so I should probably take a look. Above the stack trace for this error, I can now click on Exception Replay to step through the execution flow of my code, as well as see the local variable values that were captured live when the exception was thrown. No need to run my code.
Exception Replay captured the runtime variables for me, so I can follow the execution path and realize immediately that one of the items passed to my service had a negative price value. My code properly checks that this isn't a valid price, so it was right to throw this error. Now, I can rule out that the problem isn't in my .NET service. It came from wherever my service is getting the price data from. Okay, so what are the services that my service is talking to? Looking at that code is probably a good idea, but where do I start? Do I check APM? Is there an architecture diagram somewhere that tells me how everything is connected? How am I going to find the pieces of code related to this issue? Actually, I can just click a button, and Live Debugger tells me everything I need to know.
Live Debugger preserves the troubleshooting context from my IDE, so I can seamlessly continue my investigation. With the help of tracing, it helps me visually understand the state of the application at the time of the error, as well as contextualize its location. It also provides me with an AI-powered summary of the executional context for each span, a description of the error itself, as well as a suggested code fix. Most importantly, I can now see the flow of production data between services and exactly where this interaction occurred in the code. My .NET service made a call to a downstream Python service to apply a coupon to the user's cart.
The request went through okay, but when I hover over the variables relevant to this request, I can examine their values to see that the final item price the Python service returned was negative, which is why the checkout failed. With Live Debugger, I know exactly which service talked to my service, the exact values that were passed between them, and what code was executed when that happened. Amazing, right? But I'm not done yet. In order to fix this bug, I have to reproduce it locally. But what cart item should I mock for this? What attributes do I need to have for each item? What was our discount value again? Luckily, I can fast-forward through all of that with Live Debugger's integration test generation.
It uses production context collected at each service entry and service exit to generate a test for me that mocks all of the relevant values for calls made between upstream and downstream dependencies for my service. I don't have to worry about setting up my environment or proper database. I get a working reproduction of this bug with one click. Now, I can run this test directly in my IDE and focus on what really matters: the actual debugging of the code, trying a fix, and checking if it works. I'll add my test, and then I'll set a breakpoint at the end to see if I can actually reproduce the fact that the Python service returned a negative price value. Running this locally with everything reproduced for me, I'm now off to the races.
Before Live Debugger, I would have had to inspect my application performance, logs, errors, documentation, and lines of code across many different files just to understand the nature of the bug. With Live Debugger, Datadog gathered the real variable values and application context all in one place, guided me through the execution paths in code, and saved me hours by reproducing the issue for me. Now, I can fix the bug and move on with my day, with minimal interruptions and fast time to resolution when interruptions do occur. To learn more about Live Debugger, our IDE integrations, and Datadog for software delivery, please visit dashcon.io/debug, and come say hi at the APM and software delivery booths. Next, I'll pass it to Sara.
Thank you, Kassen. Hi, everyone. My name is Sara Varni. I'm the CMO here at Datadog. I'm super excited to be here at my first DASH here in Javits, and hello to everyone on the live stream. Let's recap what we just saw with Datadog Security. Now you can go from setup to fixing issues in a matter of minutes with our agentless scanner and our Infrastructure as Code auto-remediation. We also now allow you to secure your entire software development lifecycle, helping you build in security from that first line of code and put time on the clock back for your developers. And also, when it comes to developer productivity, we're super excited today to announce Live Debugger, which allows you to fix code with production-level context and to fix bugs faster than ever.
We're super excited about all of these new features that point towards developer productivity, but we don't want you to just hear about it from us. We'd love for you to hear about it from a customer. And now I'd like you to hear from someone who's been revolutionizing the global payment space for over 25 years, and now they're partnering with us, with Datadog, to take it one step further. So let's hear from PayPal.
Hi, I'm Ryan Pritchard, VP of Consumer Engineering at PayPal. At PayPal, we're currently on a mission to revolutionize commerce globally. We've been innovating for the past 25 years, building a two-sided network of over 400 million consumers and 35 million merchants in 200 markets, accounting for 25 billion transactions per year or a quarter of the $6 trillion e-commerce market. Late last year, we launched Quantum Leap, an internal initiative at PayPal, to accelerate the pace of delivering innovation to our customers. We did it through three programs. First, we redesigned our checkout experience, streamlining our flows and reducing latency. Second, we doubled down on our mobile app as a primary engagement channel with our customers by leveraging our AI rewards capabilities on our journey to making PayPal the most rewarding way to pay.
Last, we introduced Fastl ane, a streamlined guest checkout experience for consumers, delivering a safer way to enter and store payment information and improving checkout conversion for merchants. At our scale, this meant thousands of engineers across the globe working on all areas of our technical stack. We also introduced a number of new technologies to improve latency. Collaboration, alignment, and insights into the performance of our systems were crucial to deliver on time without jeopardizing quality. Datadog was instrumental in providing timely insights into our entire stack, from user sessions all the way down to the databases. We were able to trace performance bottlenecks, diagnose end-to-end failures, and accelerate our canary rollout of new features. At PayPal, we're excited to bring all of these innovations to our customers worldwide, and we're just getting started.
Thank you, Ryan. So you saw how with Project Quantum Leap, they were able to accelerate the pace of delivery and improve developer productivity across thousands of developers worldwide at scale. But you also heard from Ryan that it wasn't just about driving internal efficiency, it was also about creating an amazing end-user experience. And we're working with customers like PayPal every day to figure out all of the new ways that people can take data on the Datadog platform and put it into action. And now I'd like to welcome onto the stage one of our incredible product managers to talk about a brand-new product that'll help you go from not only managing the health of your systems, but the health of your business. So please, help me welcome to the stage, Jamie Milstein.
Thank you, Sara. It is so good to be back here and see so many familiar faces. I'm Jamie, and I'm a product manager on our Digital Experience team. Now, in 2020, we launched Real User Monitoring, and since then, it's become the critical product for you to understand the performance of your browser or mobile apps. Now, we've added some critical capabilities with features such as Error Tracking and Core Web Vitals. And over time, you've seen us add a few more things around the user behavior space, Session Replay, and Funnel Analysis. But you pushed us to go further. You wanted to understand how changes to your app ultimately impacts your end users, and how that all actually impacts the bottom line.
You wanted to understand questions like, "If I release a new feature, how does that actually affect my conversion rate? And how does this all again impact the bottom line?" So what do we do? We built you something specifically for this, to understand your user behavior data. And that's why today I'm excited to unveil it to you all. Introducing Product Analytics. All right, I'll take it. Now, you don't wanna hear me talk about it, you wanna see it. So let's take a look. You'll notice right away, this is a brand-new product. It brings your business teams and technical teams into one UI, leading to better collaboration. But this is actually connected to the rest of Datadog, so I can go back and forth between my business data, observability data, all in this interface. You'll notice right here what I'm looking at is my analytics summary.
It has all the KPIs that I, as a product manager, would care about. I can see things like who are my top users, and I can even see demographics data, so I know exactly where everyone's coming from. Now, if I wanna actually understand the flow of traffic, let's take a look at one of our new User Journey Diagrams. In this diagram, all I need to do is put in a beginning point or an endpoint, and the diagram will do the rest of the work. In this case, I'm saying, for all users who went to my homepage, what do they do next? And here I can actually analyze the critical flow, so I can figure out where is the drop-off happening. Now, what I can do next is actually convert it to a funnel, where I can see the drop-off.
As a product manager, product manager, I don't just have to see the drop-off, I know why there's drop-off. I can see it could have been due to performance data, user behavior data, all in this single view. Now, this is really the first time that I don't necessarily have to query to understand what went wrong. I can just look here, and I can see that it's due to high error rate on the Add to Cart button. Simple as that. Now, don't go thinking that user journey diagrams are the only thing we have. Product Analytics truly has it all. With Session Replay, I could watch what one user did, watch all their cursor movements, see where they hover, and see what they might have missed.
When combined with Heatm aps, I can actually extrapolate out what I saw in one single Session Replay and really understand the macro trends. So for example, what you can see here is a heat map where I can see what are the hot spots on a page. I can see where are people focusing their attention. I can see what are their top actions. And we can actually just go back one. In the heat map again, we can see what are the top clicks. And with User Retention Analysis, I can actually measure user stickiness. I can see where people drop off. So for example, if I notice a drop-off is happening in a given week, in, say, week three, no one's coming back, I'm gonna wanna launch a marketing campaign to ensure that we can keep these users retained.
But lastly, with our analytics summary, I can query at very granular metrics for specific business KPIs. I can filter here to users in my loyalty segment, for example, and actually see how much they're spending in a given timeframe. I've added user data to enrich this, and here I can look at, say, users who spent more than $20, and just very quickly see who my top spenders are for my e-commerce app. Now, with this, keep in mind that Product Analytics data, it's actually retained for 15 months, so you can understand long-term trends. You can understand year-over-year, quarter-over-quarter. You can share this with your teammates, your executive stakeholders, your collaborators, and bring them into one interface. Now, to recap what we talked about, we just looked at Product Analytics with Datadog. We saw that it's actually very powerful.
It has extremely low overhead because you're already sending this performance data to us. You don't have to pay that data and performance tax twice with one single data source. I wanna stress that this is collaborative. It brings your business teams, UX teams, technical teams, all into this UI to ensure that there's no context lost. And have I said it enough? It's powerful. It truly has everything you need for a product analytics tool with no context lost. Come take a look. We're gonna be demoing this all day at the expo. Come find me, come join our theater session, and I'd love to tell you more about it. With that, I'd like to pass it back over to Sara.
Thank you, Jamie. All right, we just saw how you can use a product like Product Analytics to take the data that you're collecting on the Datadog platform and put it into action proactively. But let's be real, we're not always in that mode. Sometimes we're in reactive mode, and we need to keep on top of the issues and incidents that happen on our platforms. At Datadog, we are committed to building the most integrated and efficient products when it comes to incident response, and to talk about one of our newest features here, I'd like to welcome to the stage Galen.
Thanks, Sarah. Hey, folks. My name is Galen. I'm a staff engineer and one of Datadog's core incident commanders. I'm here today to talk about incident response, and I wanna start with a really simple idea. Most incidents are triggered by changes. Now, when I say changes, I'm of course talking about changes that you make to your own systems: deployment of new code, a feature flag or config change, a database schema update, or a manual operation like running k delete pods or k scale. But I'm also talking about changes outside your control... A spike in traffic from one of your customers, a downstream outage in a service you depend on, an infra issue or a network problem. And when I'm debugging an incident, one of the first questions I ask is usually: Has anything changed recently?
This was the inspiration behind Change Tracking, which I'm delighted to show you today. Let's take a look. Here we have the status page for a monitor that recently alerted. This is a monitor on the error rate of one of the endpoints of the checkout service. The evaluation graph tells me about when things started going wrong. Just above that graph, we have something new. This is a timeline of recent changes that might be related to the monitor alerts, and if one of these changes caused the issue I'm seeing, I can debug and remediate from right here on the monitor page. Here, I see three changes. First, we have a deployment of the checkout service about half an hour ago. A bit later, a feature flag flipped, and it changed the behavior of the Cart API service.
Then, not long after that, I'm seeing new error types occurring on checkout. Now, checkout is the service being monitored, so any change to that service will be included here. But sometimes a change on one service causes a problem somewhere else. In this case, Watchdog found another service with a highly correlated error rate. When errors started increasing on checkout, the same thing happened on Cart API. Because of this, the timeline shows all changes to both these services. You know, the timing of that feature flag change on Cart API looks very suspicious. Let's pull up more information. So here I have some basic information about the feature flag, the title and a description, and a history of who changed what and when. Now, this flag is managed through LaunchDarkly, so all this information is available out of the box.
But I know many of you out there use your own homegrown flag management tools. Don't worry, you can provide all of this via API as well. Now, the diff on this most recent change looks very strange. I think one of my colleagues might have made a mistake. Looking at the command they ran to make the change, I can see the problem. They applied the config for the new color scheme flag to the alt data source strategy flag. That'll certainly do it. Between this and the timing of the change, I've seen enough. Let's take action. I can use these buttons to view the flag config or to run a workflow, like this rollback workflow. Okay, great. The workflow is running in the background, but let's take a peek. This is a feature flag rollback workflow that someone in my company set up previously.
It's configured to check for permission from the flag owner in Slack before proceeding, so it's currently paused at that step. The owner can approve or reject, and once they approve, the workflow will push the change to LaunchDarkly. Now, I would hate to make you all sit and wait for that to happen, so let's jump ahead. Okay, it's 10 minutes later. The change was approved in Slack, and back on the monitor page, I see a new feature flag rollback. Sure enough, the new error type stopped occurring, and the overall error rate is back to normal. Change tracking made it easy to resolve this issue. I was able to see recent changes to my services, gain more context on the ones that looked suspicious, then remediate by rolling a change back. All this without ever leaving the monitor page.
Change tracking is a new platform feature available on monitors, dashboards, service pages, and, well, I won't spoil it, but you'll see in a minute. Visit dashcon.io/changes to learn more and sign up for a preview. Change tracking gave me everything I needed to resolve this issue, but when things are more complex, I might need a bit more help. For that, here's Sajid.
All right. Thank you, Galen, and hello, everyone. My name is Sajid, VP Engineering here at Datadog, working on many of our AI ML products, and I'm so excited to finally be able to share with you all what we've been working on with Bits AI, our DevOps copilot, built on generative AI. Now, we recently announced the general availability of Bits for Incident Management, which helps you stay on top of the most important issues in your infrastructure with summaries as soon as you join an incident, natural language commands to help you manage your incidents, and a straightforward way to find related issues in your infrastructure. Since launching Bits, we've heard from many of you that chatting directly with Bits is a great way to get information that you are looking for during an incident.
We also know that as incidents get more complex with many people, dozens of new services, and multiple teams, figuring out what question to ask is often the hardest part. Let's be honest, we've all spent a lot of time chatting with AI assistants all over the internet this past year. Sure, while sometimes they astound us with their capabilities, other times you end up asking a simple question and just facepalming at the response because it turns out you've asked something it just doesn't know how to handle yet. What you really need is an incident copilot that knows what it's good at and will just tell you, so you don't have to guess. Which is why I'm thrilled to unveil the latest evolution of Bits AI, a fully autonomous AI agent that does exactly this.
We've trained Bits to observe, plan, and act continuously so that it can help you run incident response end to end. Let me show you an example of Bits in action. I'm on call for a food delivery service, and I've just been paged for our most critical service, the Restaurants API, which is responsible for processing all of our user orders. By the time I scramble over to my laptop, Bits has already Slack'd me to let me know that it's begun its investigation. It's going to follow the instructions in the monitor, as well as follow links to runbooks and external tools like Confluence and begin planning out its investigation in a notebook. Let's check it out. Bits will actually design and execute its plan in real time, following multiple threads of investigation simultaneously.
Bits will also adapt this plan as new data arrives, choosing next steps based on its earlier observations. For example, here you can see that Bits found logs indicating that our restaurant service is actually timing out when connecting to an upstream Takeouts RPC service, and so decides to investigate that service, looking at error traces, latency metrics, and more for Takeouts RPC. Now, Bits will continue its investigation on its own, but we don't need to sit here and watch because it will use Datadog Case Management to keep us up to date on all of its key findings. Using integrations with tools like ServiceNow, Jira, and even Slack, Bits will keep all of my teammates in the loop. Once we're in Slack, Bits will begin to suggest next steps to me based on its investigation so far.
Here, Bits sees, using real user monitoring, that thousands of users are impacted, and so it suggests we declare an incident. Everything is pre-filled for me, so it's just one click to get that started. Once we're in the incident, Bits has a summary of the investigation so far, so it's really easy for my teammates to get up to speed, and it has two more suggested actions for me. Again, because of the thousands of users impacted, it's prepared a status page update for me, and because it knows that Takeouts RPC service that was erroring before is owned by a different team in the Service Catalog, Bits suggests we page them to get their help. All right.
So as we do that and my teammates join and get up to speed and start their own investigations, Bits is running in the background, following the conversation and looking for opportunities to surface relevant telemetry. Here, my colleague has identified that Takeouts DB is the problematic database that's slowing everything down, and so Bits decides to investigate that database in the background. If Bits doesn't find anything, it just won't say anything. But here, Bits has found the root cause: a database migration from two weeks ago that lines up exactly when the changes began. Using Change Tracking, we can see exactly what changed. We can see that we altered the data type of a key column we were querying, and this gives the team all the information they need to realize that this change likely broke our indexing.
And so we're gonna have to roll out a new index to fix the issue. As the team works through that, Bits has another thread running in the background, looking for other potentially related incidents in my infrastructure, and if it finds one, it will bring those threads of investigation together. For example, here, Bits sees that there's actually several other teams that are downstream of this problematic database, and it's let them know that the issues that they've been firefighting in isolation are likely caused by this one. So as they join and begin to ask about sort of an ETA for resolution, and we all hear that actually it's gonna take a while to fix the issue and the downstream services are still struggling under the load, Bits offers to help scale up these two services using Datadog Workflow Automation.
There's buttons right here for my teammates to trigger these workflows, and the rest of us can easily follow along right from within Slack. As the scale-up's complete, it's easy to just ask Bits a follow-up question to see how our services are doing now. Here we can see that the services have recovered, and things are looking much better. Finally, as we roll out the new index and our queries are looking better and we're able to fully resolve the incident, Bits has prepared a first draft of the postmortem for us to review. It's configured to follow my team's template, starting with a summary of what happened, an overview of the key systems, and a timeline of all the major events that happened in the incident response. So how is all of this possible? How did we build this?
To transform Bits into an independent investigator, we invested heavily in AI agent design and planning capabilities optimized specifically for the multi-user, multi-threaded environment of incident response. As with all AI research, data and rigorous evaluation have been critical to our early success. We actually built a dedicated simulation environment for Bits that allowed us to replay real incident scenarios continuously and benchmark how Bits does across a variety of dimensions. Over the last few months as we've been working on Bits, we've seen Bits improve enormously. For example, after we introduced a Change Tracking tool that Galen shared with you earlier, we saw Bits' data gathering benchmarks improve substantially. Unsurprisingly, making it easier for our users to find relevant changes in their infrastructure helps Bits do the same.
Now, all of this has allowed us to add truly autonomous investigation capabilities to Bits that run automatically as soon as your monitors trigger, that use existing runbooks, so you don't have to spend a lot of time feeding Bits detailed instructions, and will suggest next steps, so you don't have to guess at what Bits is capable of. If you'd like to try this out, you can sign up for the beta of our new autonomous investigation capabilities here at dashcon.io/autobits, or check us out in the expo hall below. And yes, since we started calling this release Autobits, we've been having a lot of fun with Transformers puns. With that, it's my turn to welcome Daljeet to the stage. Thanks, everyone.
Hey, everybody. My name is Daljeet, and I'm a Product Manager here at Datadog. Sajid just covered how the latest evolution of Bits can now work alongside incident responders to help resolve issues faster, and how Bits can now start investigations before you have reached for your laptop. But what if you don't even have to reach for your laptop anymore? That's right, folks. It is my great pleasure and honor to announce the latest addition to our platform, Datadog On-Call. Built by on-call engineers for on-call engineers, Datadog On-Call supports everything you need from a paging solution and combines it with everything you already love about the Datadog platform. Let me dive right into On-Call's core capabilities. Starting with scheduling. Whether it's business hours, follow the sun, or 24/7 rotations, On-Call covers all your scheduling needs.
Escalation policies ensure that all your alerts are routed to the right team members in your organization at the right time. But that's not all. Since Datadog already captures the state of your services and your teams, you no longer have to worry about about duplicating your service catalog into your paging solution. Viewing up- and downstream issues and paging the relevant teams for them is now possible in a single unified view. Now, let's talk about ways of getting paged. As a responder, you can set up notification preferences to specify exactly how you want to be paged: emails, push notifications, SMS, or phone calls. And yes, Datadog On-Call will circumvent your Do Not Disturb mode if you tell it to. No matter where you are, even if it's on stage at the Javits Center, Datadog On-Call will reach you.
Now, you might be wondering, "What can I be paged on using On-Call?" The short answer: everything from everywhere. Whether it's telemetry you already have in Datadog, all the way to third-party tools that you use to keep your critical systems up and running. Datadog On-Call will page you anytime, everywhere, about everything that you need to be paged on. While getting paged is a critical aspect of most engineering organizations, it's not exactly anyone's favorite part of the day, or let's face it, the night. This is why Datadog On-Call comes with out-of-the-box On-Call insights and overviews.
In a single view, you can see key top-level metrics such as MTTA and MTTR, as well as answer questions such as, "Which team members experienced the most interruptions last sprint?" and "Which services are experiencing the most issues and causing most of our operational load?" Equipped with these insights, you no longer have to worry about. Equipped with these insights, you can truly focus on getting your teams out of firefighting mode. All right, now that we covered On-Call's core capabilities, let's quickly see it in action and talk about why this is truly a game changer and why I'm personally so excited to get it into your hands. First, let's go back to the page I just got a minute ago along with the phone call. I will start by tapping the push notification and go directly into the Datadog mobile app.
Here, I can see that my checkout service is experiencing an elevated error rate, and before I go any deeper, I'm going to press Acknowledge to make sure that Datadog doesn't call me again while I'm on stage. Now, since Datadog already has all my observability data, I can see everything I need to determine the severity of the situation right here in the palm of my hands, without switching devices or losing any context. I can see where this page even came from, as well as automatically Datadog shows me my impacted SLOs. Speaking of, it seems we have already breached one. Now, if I want to start investigating, all I need to do is one tap, and it takes me to the triggering monitor. Here, I can see the evaluation history of my monitor and how my service has trended over time.
But what can, what I can also see here is the associated remediation playbook. Now, Datadog's mobile app allows me to pivot into related dashboards, logs, traces, and service, all from the palm of my hand. All of this without any urge to reach for my laptop. Pretty cool, right? Now, I've already seen that my SLO has been breached. I'm also seeing that my error rate is spiking rapidly, so something must have happened recently. I will go ahead within the Datadog mobile app and declare an incident right here in the same context. Now, once I do this, once I press Create, all of my organization's automations will kick in, meaning relevant people will be paged, communication channels will be opened, and as you have seen in Sajid's talk, Bits AI will be there to guide me through the entire incident. This is amazing.
I just went from getting paged, investigating the issue, and declaring an incident in record time, all while on stage in front of thousands of you and without a single laptop in sight. Let's recap. Datadog On-Call comes with all the scheduling and escalation capabilities you need to enable your teams to go on call. But what I just showed you isn't just a paging solution. What I showed you is a single platform for monitoring, securing, paging, and investigating issues on the fly. And as you've all just seen, thanks to mobile investigations, we have a platform that's fully connected end-to-end, that helps you observe, secure, and now more than ever, act. We can't wait to get it into your hands. Visit the link on the screen or stop by our On-Call booth at the Expo floor. Thank you very much.
Now, please welcome back Sara on stage. Thank you.
Thanks so much, Sajid. Now, I personally wanted to have Who Let the Dogs Out be the ringtone for Datadog On-Call, but unfortunately, we have no ties to the Baha Men. So if anyone has an in, please let me know. I'll be on the expo floor for the rest of the day. Let's recap what we just saw for incident response. First, with Change Tracking. Now, in real time, you can see changes to your environment. With Bits AI, today, we are excited to announce Autonomous Investigator, which helps you remediate issues even faster. And of course, now with On-Call, you can optimize your incident response from that very first page. That wraps up all the features and products that we wanted to talk to you about today in this keynote, but we are just scratching the surface.
There are many more features and products that we are announcing this week at DASH, and I highly encourage you to visit our Datadog Hub to talk with one of our resident product experts. Watch a session in either the Solutions Stage or the Observability Theater on the expo floor, and please attend breakouts, many of which are co-hosted by you, our customers, to hear about all of the great, new, exciting ways that customers are using the Datadog platform. And with that, now you've seen how all on one platform you can observe, secure, and act on your data with Datadog. Thank you so much for attending DASH. I hope you have a great rest of your day.