DASH Conference 2025

Jun 10, 2025

Olivier Pomel

Co-Founder and CEO, Datadog

Good morning. I'm Olivier Pomel, co-founder and CEO at Datadog, and I'm really excited to welcome all of you to Dash this morning. Now, I won't be very long. If you've been with us before, you know that we prefer to do more showing and less talking. First things first, I'd like to thank our sponsors and our partners, and you can meet them on the expo floor. I also want to tip my hat to our Datadog ambassadors for the great work they're doing with our community. Most importantly, I want to thank all of you, our users and our customers. I want to thank you for your trust and for building with us. Many of you are here today from some of the largest companies in the world, as well as the top teams that are building the future of AI.

It is a truly inspirational peer group and a great opportunity for all of us to exchange and to learn from each other. Many of these stories will be shared on stage today and tomorrow, by the way. Now, as the CEO of a publicly traded software company, job number one for me personally is to make sure that we keep investing enough in R&D. The world is being reinvented every single day, and I think we can all agree that change is happening much faster today with AI than ever before. Of course, this creates incredible opportunities for all of us, but these come hand in hand with an explosion of complexity and with a whole new category of risks.

Our job at Datadog is to make sure that you can tame that complexity, that you can get those risks out of the way so that you can happily and productively ride those technology waves all the way to success. That is why we are so focused on building with you. We have a lot to show you today to help you observe and understand your applications, to help you build and run them securely, and of course, to help you take action, or even better, to do it ourselves so you do not have to. To start us on that path, I would like to invite on stage my co-founder, Alexis.

Alexis Lê-Quôc

Co- Founder, Datadog

Thanks, Olivier. Thank you all for joining us today at Dash. I am really excited to show you what we have been working on. It has been almost three years since AI entered the world stage.

It may feel like an eternity to you. That's because we're all on the cutting edge of adoption, whether it's using coding agents, whether it's weaving inference into applications, or building infrastructure with lots of GPU. Another reason why it feels like we've been at this for a long time is that the state of the art is moving so very fast. Right now, there's a lot of focus on building better reasoning and general purpose intelligence. As good as the general purpose models get, I think there's still a lot of room for industry-specific, specialized ones. Coding models are a great example. They power the coding agents you probably use every day. Obsidian models are another great example, or even security models. We have been contributing to the field. Our AI lab recently published a state-of-the-art time series foundational model.

It's called Toto, and it comes with an associated benchmark called BOOM. What makes them special is that both are designed specifically for observability. In the spirit of open science, we're making all this work available for free, open weight on Hugging Face so that it can benefit you and others. I personally find a lot of promise in this work. I'm really excited. With any breakthrough in the field, I think the bar to clear to make all this AI truly useful keeps rising. That's at least how we think about it. We ask ourselves, how can we apply these new techniques to make a difference in your daily work? What does it mean for AI and agents to help you observe and understand, optimize and troubleshoot, secure and remediate? Not just in theory, but also in practice.

To find out, let me hand it over to Tristan.

Tristan Ratchford

Engineering Manager, Datadog

Thanks, Alexis. Hi there. My name is Tristan Ratchford, and I'm an engineering manager here at Datadog. Last year at Dash, we showed you that Bits is capable of operating like an SRE by helping you troubleshoot and resolve your production issues. When your monitor triggers, Bits will proactively launch an investigation, look across your entire Datadog environment for signal, and find the root cause in minutes. For the past year, we've been hard at work making Bits even better. Let's take a look at some of the big changes that we've introduced. Firstly, Bits is now looking at even more of your data, like things like dashboards and deployment changes, and it's able to correlate issues across various levels of your stack using our in-house data science models. Next, Bits is now able to perform deeper root cause analysis by continually refining its investigation.

Just like the Five Whys framework, Bits is continually able to ask why to reason about the root cause. As a result, Bits can now handle more complex tasks that span multiple services, tasks that could take several hours or several engineers to resolve. Finally, we've given Bits memory. You can now teach Bits to remember steps that were useful and correct ones that were not. We've also built a data set with a massive number of real-world production alerts that we've been using to evaluate Bits' performance against and to hill climb on accuracy. Today, I'm excited to announce that you can enable Bits AI SRE in your account. Using Bits is like instantly adding an engineer to your team who is already familiar with your system and is on call 24/7. Enough talk. Let's see Bits in action.

Let me show you how Bits resolves an issue from start to finish. The moment you're paged, Bits jumps right into action. In this case, we were paged because an endpoint on our Flight Query API is experiencing high latency. Bits will start its investigation by gathering context about the alert from your Datadog environment, your runbooks, and from lessons learned from previous investigations, all in under a minute. Like you or me, Bits is pulling related telemetry from your logs, metrics, traces, and more. All right, check this out. This is the really cool part. Now, based on its initial findings, Bits will then generate a variety of hypotheses to what it thinks the problem could be and then go verify each one of them concurrently.

With our latency issue, Bits is considering the problem is due to database query timeouts, a faulty deployment in the endpoint code, slowness in a downstream service, or a spike in query traffic. Bits will then go evaluate each hypothesis using your telemetry to determine if it found the root cause, if it needs to move on, or if it needs to dig deeper. For example, let's take a look at this branch. Here, Bits is hypothesizing that the latency is due to database query timeouts. Why? High DB load. Why? Increased API traffic. As you can see, unlike other agents, Bits is not a black box. You can follow its reasoning every step of the way. Bits will then continue to drill down until it finds the root cause.

With our latency issue, Bits has determined that the root cause was due to the database query timeouts from that branch we looked at earlier. Right there, you can see the power of the hypothesis tree. Bits is able to simultaneously investigate multiple chains of reasoning in minutes. You get a thorough investigation every time. Every step of the way, you can dive in and look at the evidence and reasoning that went behind it. You can also make Bits better by teaching it steps that were useful and correcting ones that were not. It is continuously learning. Just like a teammate, you can ask Bits questions about its findings or get help taking next steps. For example, here, I am asking, who owns the flight database? I can page that team and get help.

Finally, Bits will pipe all of its high confidence findings back to Slack or a ticketing tool of your choice. There you have it. Bits AI SRE, a fully autonomous AI agent that's able to help you troubleshoot and resolve your production issues 24/7 so your engineers can focus on what's important. Some of you have been using Bits in the past year and have seen great results. For example, Thomson Reuters is using Bits to accelerate how they triage issues across their global operations team. Fanatics is using Bits to stay on top of their alerts when it matters the most, like during the Super Bowl or tomorrow's NBA Final Game. Now you too can add Bits to your team. Check out this link to find out more.

I'm going to kick it over to Shah, who's going to tell us about incident response. Thank you.

Shah Ahmed

Product Manager, Datadog

Thanks, Tristan. Hey, everyone. I'm Shah, a product manager here at Datadog. Last year, we announced Datadog On- Call, and your response has been incredible as it quickly became our most in-demand preview to date. On- Call was pushed to general availability a few months ago, and there are over 1,000 companies using it to improve On-Call processes. We are not stopping there. We want to go beyond sending simple pages and making robotic one-way phone calls. Let us take a look at how On- Call and incident response is evolving.

Speaker 26

Hey, Shah. Just calling to inform you about a critical latency alert on checkouts. I can acknowledge it and share more details with you.

Shah Ahmed

Product Manager, Datadog

Yes.

Speaker 26

Want me to do that?

Shah Ahmed

Product Manager, Datadog

Let's go ahead and act that.

Speaker 26

Sure thing. The P99 latency on checkouts has increased from 900 milliseconds to 5.5 seconds, starting about five minutes ago. I've acknowledged the alert, so it won't escalate any further. I also have the linked runbook handy. Would you like me to share the suggested next step?

Shah Ahmed

Product Manager, Datadog

Yeah, let's get the first step. Could we also post to Slack in the checkout channel this is going on?

Speaker 26

Got it. I've posted to Slack in #checkout-ups. The suggested next step is to look at the upstream payments API for any latency or errors. Want me to pull that information for you?

Shah Ahmed

Product Manager, Datadog

Yeah, let's grab that.

Speaker 26

The error rate for the payments API has climbed to 8%. It seems there was a deploy to the payments API shortly before our monitor went off. I can pull the next few steps from the runbook.

Shah Ahmed

Product Manager, Datadog

Think I know what we have to move forward.

Speaker 26

No worries. Feel free to reach out if you need another pair of eyes. Let me know if there is anything else I can assist with.

Shah Ahmed

Product Manager, Datadog

Cool. Thank you. Bye. That's a preview of our voice interface for incident response. In real time, you can get details of the underlying monitor, get next steps, and take action even before jumping onto your computer. OK, so now that I've gotten all of that context from my voice AI, I'm ready to jump onto my computer and take action. I already have a tab open to Datadog, and I see this handoff notification on the bottom left. This is for the page the voice AI just told me about. This is new for On-Call and incidents. This handoff notification lets me jump in right where I left off on the call. No more digging around for the page, the alerting monitor. It's right there when I need it. Let's fast forward a little bit.

I've gone ahead and declared a SEV2 incident and kicked off a coordinated response with my team members. I've docked my incident, and I can see all the messages and graphs my teammates are posting. What you're looking at here is not a Datadog chat feature. This is a real-time sync with Slack and soon Microsoft Teams and Google Chat of what my teammates are already posting. Shared links and screenshots of graphs are rendered as live graphs that can be compared with anything else in Datadog. While I'm doing this, the doc sticks with me no matter what page I'm on. It's like turning Datadog into incident mode. With handoff notifications and our doc experience, you can collaborate in the same space that you investigate incidents. In the chat, a teammate highlighted that there is customer impact.

I'm going to go ahead and update my company's status page. To help do that, I'm happy to announce today we're launching Datadog Status Pages. I don't have to sign into another tool, and we already have a lot of the context to pull from your incident response. We basically can prefill almost everything for you. Basically, you will never forget to update your company's status page. We support templates, custom domains, and have several customization options to help keep all of your customers in the loop. With Datadog incident response, you can now co-locate everything you need to dive into the issue, work through it with your teammates, and update your customers. You can run your end-to-end process in Datadog. The voice interface is in preview, and you can try it out today on the expo floor after the keynote.

Handoff notifications and the doc experience are available now. To sign up for the status page preview, you can do that today. To learn more and sign up for previews, check out the link here, and I'll hand it back to Alexis. Thanks.

Alexis Lê-Quôc

Co- Founder, Datadog

Thanks, Shah. Wow, next time I get a page with that much energy at 2:00 A.M., I'm going to wake up really fast. You have just seen our new On-Call, and it's a real step up from the old static messages that we've all been receiving for the past 15 years. You know what? What else can we improve with the judicious application of AI to help you cut the daily toil? Security. To talk about making life easier if you're involved in security, here's Ron.

Ron Feldman

Product Manager, Datadog

Thanks, Alexis. Hi, I'm Ron, a product manager here at Datadog. Today, I'm excited to tell you how we're going to bring AI to Datadog's Cloud SIEM. Datadog Cloud SIEM helps you triage all of your security threat indicators. It's unique because it brings together security and observability, allowing for more thorough threat investigations. Now, Cloud SIEM is growing rapidly. This past year alone, Cloud SIEM has processed more than 230 trillion of your log events. That's more than 2x the year before. Now, as these event volumes continue to grow, how do we help overburdened SOC teams manage alert fatigue and high false positive rates? Our newest feature is Bits AI Security Analyst, launching today in preview. Bits AI Security Analyst vastly reduces the time that SOC teams need to spend triaging SIEM signals.

Bits autonomously investigates SIEM signals, recommends a triage resolution, showing its investigative steps with accompanying data queries, and allows for immediate remediation right in Datadog. Now, let's take a look at the workflow of a security engineer. I start my day, and I open Slack. I see dozens of new SIEM signal notifications. Today, I notice that some have threaded comments. Let's look at one. I see that Bits has investigated for me overnight. While I see a conclusion, let's click through to see the full investigation. This is the Bits Security Analyst investigation for an AWS CloudTrail signal. Bits presents a clear and reasoned conclusion. The signal is benign because it's legitimate administrative activity by a verified employee in a sandbox environment. The insights derived from the detailed investigative steps are summarized clearly and succinctly. Bits investigated all the key IOCs and analyzed all of the log results.

I can scroll down and expand each step to see Bits' agentic reasoning. This specific step shows that while the suspicious activity was irregular and low frequency, it's suggesting administrative tasks, Bits suggests further investigation. Bits then proceeds to investigate historical signals, IP addresses, user agents, and user behavior. Using the MITRE ATT&CK framework, Bits decides which steps to include and which entities to investigate, pivoting intentionally along the way, just like an expert security analyst would. Now, reviewing that just took me a few seconds, way shorter than the 30 minutes it would take me to do that investigation manually. Let's take a look at a suspicious signal investigation. This is another CloudTrail signal, but it could indicate enumeration of AWS services. We'll definitely want to investigate further, but speed matters. This could be an attacker probing our system. Let's look at how Bits uses actions.

I click on Take Action. I could use a pre-configured SOAR workflow, but I'm going to use Bits AI. Now, Bits AI Action Interface allows me to type in any prompt, but it uses the context of the current investigation to recommend three different prompts: quarantining the user, completely disabling the user, or creating a case. I choose the quarantine prompt and press Enter. Now, Bits is searching for the right action to take and suggests that I use the attach user policy. I click in. It prefills all the fields it can. I simply select the right connection, and I hit Run. Bits has now confirmed that the user has been quarantined and also tells me that it automatically created a case in Datadog's case management system.

I click into the case, and I see that Bits has prefilled all the relevant information, including the security agent conclusion and the quarantine action that Bits and I took together, along with the original SIEM signal. Now, taking easy action with Bits was not just fast and easy. It was safe because it used only my team's integrations. In short, I had the right permissions, and it even asked for manual approval, given the sensitive nature of the action itself. Next, let's navigate back to my SIEM signal list. Once I trust Bits AI's investigative capabilities, I can simply filter to the benign signals, click, and archive them in bulk. Now, I can get through to the rest of the items on my giant to-do list, like writing SOC reports and threat hunting.

Bits AI Security Analyst truly augments your SOC team, automating SIEM signal investigations and conclusions, reducing triage time from 30 minutes to 30 seconds, and accelerating remediation right in Datadog. You can try Bits AI Security Analyst today in preview by going to this link. Now, I'm going to hand it over to Mike, who's going to tell you even more about Bits AI.

Mike Leach

Product Manager, Datadog

Thanks, Ron. Hey, everyone. My name is Mike Leach. I'm a Product Manager here at Datadog. Let's continue this thread around autonomous agents that can proactively address problems within your applications. You just saw how Bits can help you triage SIEM signals and automate on-call alert investigation. Now, to extend that idea into your daily development workflow, I'm excited to announce the Bits AI Dev Agent. Much like many of you here, we've been trying out all the coding agents on the market, and we saw a huge opportunity to create a unique AI agent. Our new Dev Agent is deeply integrated within the Datadog platform. So it has complete knowledge of your observability data and uses live production context to autonomously detect high-impact issues, diagnose their root cause, and create context-aware pull requests. No other agent combines full-stack observability insights with true end-to-end remediation.

The Dev Agent can deliver faster, more reliable fixes, dramatically accelerating your dev process and issue resolution time. Actually, it looks like I'm getting a ping from the Dev Agent now. Let's see what's going on. It looks like the agent found a high-impact error. It's a slice-bound out-of-range panic in my CodeGen API service. It's been causing crashes for the last 10 minutes. The Dev Agent has already generated a fix and linked its PR here. It's even CC'd me since I'm on call. Let's take a closer look at the fix. Here on GitHub, the Dev Agent has automatically written a PR description summarizing what went wrong and the fix that it's proposing. It's clear, it's concise, and it follows my team's PR template. It even links to the error that triggered the agent. Now, let's take a quick look at the code changes.

In this bug fix, we see a common Go problem of accessing out-of-bounds slice indexes. The Dev Agent proposes a fix that sanitizes these inputs. Additionally, it adds some tests to validate the correctness of this logic. While this is a valid approach that will definitely prevent crashes, I'd also like the UI to reflect when it's sending invalid parameters. Let's ask the agent to update the commit. I'll just add a comment here asking for the change. Look at that. The Dev Agent has already responded and updated the PR. This is great. I'm going to go ahead and merge this PR. In just a few clicks, I've accepted a fix that's been proposed, tested, and documented by the Bits AI Dev Agent. Remember, I didn't even have to go looking for this error.

The Dev Agent proactively found it, fixed it, and sent me a Slack message completely autonomously. That's the unique power of this agent, and it honestly feels like having another full-fledged developer here on my team. Now, you might be wondering, how do I keep track of everything the Dev Agent is working on? We have built a dedicated page for that. Here, I have complete visibility into every PR generated by my AI-powered teammate. Whether it's tackling runtime errors, fixing security vulnerabilities in your code, or resolving issues serviced by the Bits SRE Agent, I can easily track the status of each PR, knowing whether it's been merged, is awaiting human review, or is in the process of iterating based on feedback or failed tests, which helps keep my team informed and in control.

Today, the Dev Agent is autonomously sending over 1,000 PRs per month across many teams at Datadog, even more if you count the PRs that are manually created from agent-generated code. We calculated that the Dev Agent is saving us thousands of engineering hours per week, and that's time that we can reinvest in shipping features and not sifting through noise. We're embedding the Dev Agent everywhere: error tracking, traces, profiling, code security, real user monitoring, database monitoring, test optimization, and more. You can diagnose and fix problems across all of Datadog. We're excited for you to try out the Dev Agent for yourself. Go to this link to sign up and learn more. Now, I'd like to pass it over to George so he can show you how the Dev Agent is helping our users in APM. Thank you.

George Sequeira

Staff Engineer, Datadog

Hi, everyone. I'm George, Staff Engineer in APM. I'm excited to share with you how we're embedding the Bits Dev Agent to help you solve some of your toughest problems, starting with latency. As an engineer debugging latency degradation, I'm looking at tens of services, hundreds of dependencies, all while coordinating with many teams. On a good day, this can take me an hour. Debugging latency is hard, and we've heard this from you too. That's why I'm excited to announce APM Investigator. Now, in preview. Let's take a look. I'm debugging a latency issue on my checkout endpoint. I see the p90 latency is elevated, but the p75 and p50 seem normal. Just above the graph, I see something new. Let's investigate. This is a latency investigation. Usually, this would have been a headache. I'd start searching traces, metrics, logs, and pulling in different folks to help.

Here, I have all the details of what happened and what I can do to resolve it. Up top, I see the slowdown is limited to a subset of requests. I can see the method causing the slowdown and a PR for the fix by the Dev Agent. To give me confidence in the findings, I look at the supporting section. Here, I see a comparison between a normal and a slow trace, showing me that this process premium users method is the problem. Below that, I see a correlation between the abnormal behavior and request attributes. Requests tagged with premium appear more often in highly in C cases in comparison to those tagged with basic or standard. All right, it's clear which requests are affected and where in the code I should look. Let's solve this issue.

Scrolling up, I can go to the PR the Dev Agent generated for me. Here in GitHub, the agent tells me the cause of the latency issue is an inefficient method. It shows me the proposed fix along with a test case validating the new behavior. In minutes, I'm able to root cause and fix a latency degradation, which could have taken me hours. That's not all. The investigator can help you root cause many other issues, like app inefficiencies, faulty deployments, traffic changes, and more. Now, let's take this one step further. What if I could fix issues before it alerted me? I'm stoked to announce proactive app recommendations. Now, in preview. Let's take a look. This is the recommendations page, where Datadog gives me performance and reliability improvements for the services, applications, and databases my team owns and operates. Each recommendation is prioritized by impact.

Sticking with the latency theme, let's look at this opportunity to reduce the latency on a service I own. This side panel replaces hours of investigation that I would have done. I get a clear explanation of the problem, a suggested change, and the impact. In this case, the get card items method is calling a downstream API sequentially. If I can parallelize or batch these calls, I can cut down my execution time over 75%. Wow. I see what the flow of execution will look like if I make that change. I can see the current latency right below that is around 6 seconds. To help implement the fix, I just scroll down, and the Bits AI Dev Agent gives me a suggested change. I can work with it to refine and apply these changes right here. Datadog does not stop at the service layer.

I get recommendations across my stack. For example, here's an opportunity to improve my product page experience. Users are having trouble adding a cart. I can see when the issue started and the page that it's happening on. To get a better sense of what's going on, I dive into the example session replay. Here, a user is repeatedly trying to add their cart, and nothing is happening. Scrolling down, I can see the impact. Over 45% of views on the page and 400 of our users are affected. I can see the source of the issue by scrolling down, where the Dev Agent tells me that there's a component on the page trying to use internal state that hasn't been properly exposed. I get a suggested fix ready for me to merge in.

Just like that, I've addressed two issues that could have paged my teams in the future. By analyzing the data that you're already sending through APM, DBM, RUM, and profiling, Datadog delivers recommendations to improve your application and services. Ones you've told us matter, like resolving N+1 queries and excessive lock contention. Let's recap. You've just seen APM Investigator and proactive recommendations. They represent a shift in how you operate through observability. With the Investigator, you resolve your issues in record time. With recommendations, you can address issues before they impact your business. Join us in the Expo Hall and get access to these features by signing up on the link behind me. Now, I'll hand it back over to Alexis. Thank you.

Alexis Lê-Quôc

Co- Founder, Datadog

Thank you, Ron, Mike, and George. You just saw how the Bits AI Security Analyst cuts the toll for security teams. You also saw how the Bits AI Dev Agent is starting to pop up wherever it can use observability data from production, like actual errors, to save you time and write pull requests for your review. Last, in APM, you saw how an agent can help you troubleshoot and optimize application performance with a lot less effort. This is great help with running software. How about helping you build better software? We have something new here as well. I'd like to hand it over to Muhan to share more.

Muhan Guclu

Engineering Manager, Datadog

Thanks, Alexis. Hi, I'm Muhan, an Engineering Manager at Datadog. As software engineers, it feels like we're slowed down constantly. Whether I'm responding to incidents, creating new infrastructure, or even just deploying, I hit friction every step of the way. That's why today, I'm thrilled to introduce a fully managed internal developer portal. The IDP to help engineers ship quickly and confidently using what you already have in Datadog. Working with unfamiliar services is a part of daily life. I'll never forget when I first responded to an incident caused by a dependency going down, how hard it was to fill in the blanks in the middle of the night. Let's see how Datadog helps with this. This is IDP's software catalog. Here, I can see my services as part of a greater whole. To set this up, I can start fresh or import my existing topology from Backstage.

A clean start includes all of the individual pieces of my system architecture using what's already in Datadog. Using AI, these pieces are composed into context-rich systems with titles and descriptions telling me how they relate. Right at the top, I quickly find out where code lives and what documentation I can read. I see detailed information about the services my team is already monitoring with Datadog. When coming from Backstage, Datadog completes the picture, filling in the gaps and overlaying real-time telemetry onto each component. Understanding my system in relation to best practices also slows me down. As an engineer, I really only hear about this from the occasional migration email. I really only care when my own builds start failing. The feedback loop is slow, and information is scattered across tons of spreadsheets. With Scorecards, I can see a list of best practices measured against my services.

Scorecards keep me in the loop about any ongoing platform work. I can quickly see where we're at with deployment, security, and alerting best practices, and know that any required checks are passing before I start a build. Speaking of migrations, they honestly tend to usually be pretty simple. Even if it's as easy as changing a couple of lines of YAML, I end up getting slowed down by all the back and forth between my infra and platform teams. Using self-service actions, I can find templates that let me manage infrastructure quickly and safely. I can perform actions on components like data stores and queues or spin up new ones. This create S3 bucket with Terraform action was made by my infra team for me. I can fill in any required information like the bucket, the region, and a justification, and then just hit Create PR.

A new pull request is automatically created and assigned to the infrastructure team for approval. I can view it right in GitHub. Now that the bucket's been created, I see it automatically reflected as a dependency in the system overview page. It complies with everything we need it to, covering regulation, internal processes, best practices, and any permissions. As a platform engineer, I love the idea of building templates like this. I don't want to have to learn yet another product-specific system. The S3 creation flow we just looked at was powered by App Builder, the way to build low-code apps within Datadog. With AI, I can make these templates for developers fast. Because App Builder runs in a low-code controlled environment, I can get the joy of vibe coding with the safety of predefined components.

Let's say I want to make a template for creating new RDS instances. I can start a new app from scratch and then start from AI. I'll tell Bits what I need, and it generates the template for me. Bits explains what it did and confirms any sensitive details around environment or policies. I'll just follow up with any tweaks. I'll make sure it looks good. When I'm satisfied, I'll publish this app for others to use. Because Bits uses policies that I've already configured and vetted elsewhere, I can share this template confidently and know that it'll run safely every time. All right, we just saw a lot of stuff. Let's recap. Datadog IDP is the only developer portal that knows your system and stays up to date automatically. You can understand your services without overhead, track best practices with scorecards, and manage your infrastructure with AI.

Alexis Lê-Quôc

Co- Founder, Datadog

Thank you, Muhan. What's great about the IDP is that it can work directly alongside Backstage. And it's always getting live data from the rest of the Datadog platform. Now, on the same theme of building software, let's hear from one of our customers who is also working on helping you build software faster. Here's Cursor.

Sualeh Asif

Co-Founder, Anysphere

Hi, I'm Sualeh. I'm one of the Co-Founders of the company that makes Cursor. Cursor is probably most people's favorite way to code with AI. We really, really want to help automate some of the most tedious parts of programming so that you can just go from your idea to code on the page as fast as possible. We want it to be fast and fun and also extremely powerful. Cursor, at least the company, very quickly in the last six months, has scaled by over a factor of 100 in terms of infrastructure. Our tab model inference does almost a billion calls per day. Over the lifetime of it, we have handled many hundreds of billions of files. Datadog has really helped us scale observability in a way that we never have to worry about it going down.

Probably Cursor would have grown much slower and would have had many more crashes if Datadog wasn't as good as it is today. I think probably the most exciting part is that the models right now can't see all the real errors that are happening, all the weird edge cases the code might be hitting. There's no better place to see that than Datadog right now. I think it would be really helpful to be able to aggregate this all in another tool like Datadog and then pass it to Cursor and ask it to fix the things using the ability to both understand the code base and to be able to understand what you have been working on. I think that will be a big boost to productivity for many, many people. I have a lot of respect for Datadog.

It's probably one of the most important tools to help us scale in the last six to twelve months.

Ala Shiban

Product Manager, Datadog

Hi, everyone. Hello, hello. My name is Ala Shiban. I'm a product manager here at Datadog. I'm also a Cursor user. Like many of us at Datadog, we both use and love Cursor. Like Sualeh mentioned, we're really excited about the possibilities of having agents access both Datadog tools, capabilities, and data. That's why we're introducing the Datadog MCP Server. The Datadog MCP Server allows agents to both access Datadog data. It allows us to add live instrumentation and use the breadth of Datadog capabilities to both find and fix issues for you. Let me show you what that looks like in an example. I'm a developer, and my users are complaining that the checkout flow is broken. They add items to the cart, they click checkout, and nothing happens. Let's try and ask Cursor for help.

In Cursor, I open a new chat, and I type in, "I'm seeing an issue on Coupon Django where clicking the checkout button doesn't do anything. Can you help me debug this?" Now, the agent tries to figure out what the problem is, looks at the code, but it needs more context. Because of the MCP integration, it can now choose which Datadog capability to use to help debug this issue. In this case, it chooses to use Datadog's live Lock Points. Now, let me explain what that is. Lock Points are like break points, only they don't actually break the execution or pause the execution, and they work on live services. Once you add the Lock Points to your code, they start streaming back debug data from those live services. You can see things like variable values and method arguments and execution paths without redeploying the app.

They're pretty cool. The agent is asking us to now reproduce the issue because it's using the lock points. We go back to the website. We click on that checkout button, and the agent starts collecting that data back from the Lock Points. It notices something interesting. The accents on the names of the cities are being stripped out. They're being removed. In this example, it's São Paulo. They're being expected in the code, and that's a good lead for us. We click through. We can see the code, the Lock Points, and the live data coming from those live services. I can see the same information on the Datadog portal in the UI. Both share this information with my team. I can make updates and changes to the Lock Points. I can also now generate unit tests.

Only this time, they're grounded in the production data coming from those Lock Points, so they're more accurate. The agent now writes the test. We run it. It fails. It's supposed to because we haven't fixed the issue yet. Now the agent proposes a fix. We rerun it, and the test passes. We fixed the bug. Let me kind of recap quickly. With the MCP Server, you can first use Datadog in any AI agent that supports the MCP standard. Second, you can now use the kind of reproduce the issues in production in the local environment using the breadth of Datadog capabilities, even ones that you might not be as familiar with. Third, you can generate fixes and tests that are grounded in real production data, which makes them more accurate. The MCP Server for IDEs is now in preview.

It's really important for us that you can use Datadog in any AI agent. We're happy to share that we've partnered with OpenAI to bring that operational context to their new Codex CLI. Let's take a look.

Michael Bolin

Lead Engineer for Codex CLI, OpenAI

Hi, I'm Michael Bolin, Lead Engineer for Codex CLI at OpenAI. Together with Datadog, we're imagining a future where on-call engineers work hand in hand with AI agents right in the terminal. Here's a sneak preview of some of the work and ideas we've been exploring together. Codex is a lightweight agent that can run directly from your command line, the primary environment, especially for SREs. It can follow your instructions while troubleshooting issues, read and edit files, generate code, and run commands securely, all supported through multimodal reasoning. Now let's see it in action. As an SRE, I might be wondering if something is wrong with my Redis service. I ask, are there any Redis errors? The Codex CLI interacts with the Datadog MCP Server to select the right tool, execute it, and provide concise findings.

Next, to check if someone else is already looking at the problem, I can ask, has anyone declared an incident yet? Again, the relevant information is retrieved on the fly from Datadog into Codex and shown in my terminal without having to navigate between apps. Because it retrieved all relevant incident details, I know immediately who is in charge and who I can follow up with. Next, I want to know if the issue is still happening and want to confirm it using real-time metrics. I ask, are the latency spikes still happening? Because the MCP Server gives the Codex CLI real-time access to Datadog tools and context, it can retrieve the relevant latency metrics on the fly and generate an interactive graph in my terminal. It also keeps all the results and context I've had so far, so it builds up more and more knowledge throughout the conversation.

Finally, I say, update the Redis latency monitor so we can catch this sooner next time. Codex edits the Terraform for me. This is how we imagine on-call engineers will work in the future. You'll no longer need to switch between apps when troubleshooting issues manually using different sources. You can collaborate with agents using natural language in your preferred work environment and use powerful tools through seamless MCP integrations. We can't wait to see Codex help you ship software faster and resolve issues even quicker.

Ala Shiban

Product Manager, Datadog

Thanks, Michael and the team, for putting that together. Both the standalone MCP Server and the MCP Server for IDEs are now in preview. You can learn more on our website, sign up. I will hand things over to our CMO, Sarah Varni. Thank you.

Sarah Varni

CMO, Datadog

Thanks, Ala. I'm Sarah Varni, Datadog's Chief Marketing Officer. It's been so exciting to partner with leaders like OpenAI and Cursor to reimagine what we're doing with SREs and developers and to meet our customers where they are. Honestly, that's one of the best parts of my job, hearing how all of you are using the Datadog platform in entirely new ways to power new experiences. We're super fortunate today to have one of those customers here with us. I'd like to welcome Dave Tsai, the CTO of Toyota Connected, to the dash stage. Please help me in welcoming Dave.

Dave Tsai

CTO, Toyota Connected

Thanks, Sarah. It's great to be here. I'm Dave Tsai, CTO at Toyota Connected, and we're building the future of connected mobility. Akio Toyoda started Toyota Connected to pursue the ultimate customer satisfaction. In 2018, he announced that Toyota would transform into a mobility company. Since then, the possibilities have been endless. Toyota Connected is delivering a key part of that mobility mission. To support this vision, Toyota Connected North America was established in 2016. Our goal was clear: bring the connected vehicle foundation in-house and drive the innovation from within. Now, let me talk a little bit about our company strategy. At Toyota Connected, we focus on delivering connected vehicle services, both in-vehicle experiences and out-of-vehicle services. We have built foundational products to power this vision. Let me walk you through them.

Our core products include DriveLink, delivering safety and convenience to our customers, Mobility, providing connected data services, the virtual agent, Hey Toyota, our in-vehicle AI virtual assistant, and finally, Multimedia, our in-vehicle infotainment system that powers our cockpit experience. These products operate at real scale. So far, we have over 12.5 million vehicles connected through these systems. Let's take a closer look at DriveLink. DriveLink provides a human-assisted service through an SOS button built into the vehicle. For example, if you're in a collision, pressing the SOS button connects you to immediate human support. We also offer enhanced roadside, automatic collision notification, and stolen vehicle locator, all designed to keep our drivers safe and supported. To show the real impact that we're making with 12.5 million vehicles on the platform and over 5.5 million calls handled, of these, 600,000 were critical safety calls.

We have helped track over 35,000 stolen vehicles. When we talk about vehicle tracking, it is not just about recovery. It supports civil service responses, too. Beyond individual vehicles, we operate a fully connected fleet. Our systems run at four nines uptime, and that reliability is possible because of observability and tooling provided by Datadog. We achieve our four nines uptime by driving our mean time to identification from minutes to seconds. We built workflows and a software catalog that quickly connect the right people to the right incidents when they happen. Please come meet our amazing DriveLink team at the expo hall to learn more about how we achieve our operational excellence. Our partnership with Datadog goes beyond DriveLink. Mobility and the virtual agent also rely on Datadog's full observability suite to help us build better and more reliable vehicles.

Here's a glimpse of logs and stats we monitor through Datadog. We currently oversee roughly 1,000 hosts, tracking about 8 million container hours. We're excited to continue to grow our partnership with Datadog as we scale even further. With a suite of tools Datadog provides, we have the opportunity to build even better cars, in the words of Akio Toyoda. Thank you. Now back to you, Sarah.

Sarah Varni

CMO, Datadog

Thank you, Dave. We're so excited to see how Toyota is using Infrastructure Monitoring, APM, and our synthetics products across the entire Datadog platform to power this new connected driver experience for over 12.5 million vehicles worldwide. As Dave mentioned, they're also going to be on the expo floor demoing their connected car experience live on the expo floor. I got to get a sneak peek of this this morning. It's incredible. I highly encourage you to check it out. As you heard from Dave, observability has been key to helping Toyota build this new connected car experience. Now we want to go deeper on one of the core pillars of observability, and that's logs. Last year, we launched Flex Logs with the idea to help you manage your storage costs more effectively. Today, we want to build on that vision.

To share what's new with logs, I want to welcome Kelly Kong to the Dash stage.

Kelly Kong

Product Manager, Datadog

Hi. I'm Kelly, product manager here at Datadog. Last year, we launched Flex Logs, decoupling storage from compute so that you could bring in more logs to solve new use cases, all while staying within budget. Just the takeaway, an online food ordering company uses Flex Logs to achieve full visibility across their stack, cutting MTTR and reducing revenue loss on missed orders. They're just one great example among many. In less than a year since launch, teams are now storing over 100 petabytes of data per month, making Flex Logs Datadog's fastest growing product in history. We're just getting started. You told us you need logs for years to comply with audits, investigate zero-day security breaches, and perform compliance reviews. When you're being pinged by three different teams for hourly updates, efficiency matters, and context switching only slows you down.

That's why I'm thrilled to introduce Flex Frozen, a new long-term storage tier designed for historical reporting and regulatory requests. Keep your logs fully managed in Datadog for up to seven years, where you have one platform for DevOps, security, and compliance use cases. That's not all. We're also simplifying how you discover and analyze these logs. I'm excited to announce Datadog Archive Search, a powerful new way for you to find log insights regardless of where that data lives. Let's play it out. My compliance team just asked me to pull a user activity report spanning back three years. Whether I'm leveraging Datadog storage, such as our new Frozen tier, or my own S3 bucket, where I already have years of archived data, I now have the same consistent search experience, where I can easily find relevant logs over any historical time frame.

Within seconds, I'm getting data back from my external archives without having to write the perfect query upfront or wait for a lengthy rehydration job. Once I'm happy with the results, I can set up a full CSV report to land right in my auditor's inbox. Archive Search makes it easy to produce reliable reports when you're under time pressure or scrutiny. For those inevitable follow-up questions, you now have Datadog Sheets. Eliminate the endless emails and exports with a native spreadsheet solution built right inside Datadog. Opening my results in Sheets, I don't have to worry about syncing my data or managing multiple CSV files. Pivot tables allow analysts and auditors to quickly summarize or drill down into data. For example, I can break down my earlier audit logs by different dimensions, such as team, user, or country.

Sheets is great for this kind of slicing and dicing or building real-time reports. For deeper analysis and multi-step investigations, I need a different kind of tool, one that supports storytelling. Last year, we introduced Log Workspaces to transform log data and build multi-step analyses on the fly. We are extending these same capabilities to Notebooks, your home for interactive graphing and collaborative analysis. The key is that I can now bring together all my different telemetry and context: logs, APM spans, metrics, and more into one unified canvas. Transforming these different data sets is easy, with intuitive one-click operations that allow me to parse, aggregate, or filter. Tasks that used to mean exporting data outside of Datadog or reinstrumenting upstream apps are now as simple as applying a formula. Best of all, Notebooks help you collaborate better with your team.

Whether you're reviewing, leaving a comment, or starting a discussion, you can do it all just like you would with your favorite real-time editor. Working together, your team can get to insights faster. Actually, I have one more teammate in here, one who knows this data inside and out. Bits AI is now integrated right into Notebooks. When I ask for help with reviewing user access patterns, watch as Bits jumps into data analyst mode, adding in relevant metadata, writing SQL queries, and visualizing the final result in a logical, easy-to-follow chain. I can take this final output and save it to my favorite dashboard or continue working hand in hand with Bits and my team. Notebooks offers a new paradigm for advanced analytics with full context, powerful computational abilities, and real-time collaboration. One more thing.

For those of you with existing queries in tools like Splunk, where you rely on piped query syntax, check this out. If I copy and paste an SQL query into a notebook, Datadog automatically understands and translates it for me, recreating the same time series graph in seconds. Welcome to the future. Everything we've covered today stems from a simple belief that more data should never mean more complexity or work. We're reimagining the way you interact with logs, from retention all the way to resolution. Visit the link on screen to learn more or to sign up for early access. Thank you, and I'll pass it back to Sarah.

Sarah Varni

CMO, Datadog

Thank you, Kelly. You just saw a ton of new features for log management. Let's do a quick recap of what you saw. First, with Flex Frozen, we're delivering a new storage tier, extending your log retention to over seven years. With Archive Search, we allow you to query your logs from cold storage without requiring reindexing. With Sheets and Notebooks, now also together with BitsAI, we help you analyze your log data in entirely new ways. Of course, last but not least, bring your own query, which makes your migrations seamless. No matter what your Datadog log management use case is, we want to make sure we have you covered. We also do not want you to just hear about it from us. Now I'd like you to hear from one of our customers around how they're using Flex Logs to deliver superior uptime and performance, all at scale.

Let's hear the story of Okta.

Speaker 23

Okta is the leading independent neutral identity company. Auth0 is Okta's developer-friendly platform for customer identity.

Speaker 24

If Auth0 goes down, this could potentially disrupt businesses by having them lose revenue, having customers upset because they can't log in to see the information and data that they rely on on a daily basis.

Speaker 25

With such a complex tech stack and having an uptime SLA of 99.99%, you can imagine that every second counts.

Flex Logs has been a game changer for Auth0.

Speaker 24

Now that we have Flex Logs, we have our logs in one single view. This has enabled us to have faster root cause analysis and incident resolution.

Speaker 23

We have significant cost savings by consolidating metrics, logging, and tracing. We reduce median time to mitigation with a great observability tool. With GenAI technology, the security landscape will become even more sophisticated. Together with Datadog, we can address the evolving challenges and keep our customers safe.

Sarah Varni

CMO, Datadog

Auth0 and Okta together are a great example of a software company evolving in the age of AI. We are super lucky today to have the CTO of Okta, Bhawna Singh, here on stage to tell us more about how they are rethinking the identity landscape for GenAI applications. Please help me in welcoming Bhawna.

Bhawna Singh

CTO, Okta

Wow. It is exciting to be at this high-energy Dash conference, right? I'm Bhawna Singh, CTO at Okta, the leading identity company with a vision to free everyone to safely use any technology. As we see the tech industry evolving with AI, we are also working to make agent development and use of user identity by agents safe. The reason we need to talk about securing AI agents becomes more important as we look at these stats. 82% of organizations are experimenting to deploy these agents in production environments in the next one to three years. If you look at the stats on customer expectation, more than 60% of customers have stressed the importance of trust in AI agents. Personally, I believe that as more people understand the power of agentic technology, this number will only grow.

As these AI agents begin to act on behalf of users, answering questions, automating tasks, and making decisions on our behalf, establishing trust in these agents will be essential for their adoption and effectiveness. Who should build this trust? You. If you are building AI agent applications, you are accountable. AI security starts with identity. As developers are focused on getting the agents to work, connecting them to data sources, and integrating with APIs, a strong, secure identity platform can ensure that they are running in secure environments. Agents must be built securely right from the start and need to run securely from the first deployment. At Okta, we have identified four critical requirements where securing AI agent development is crucial to building GenAI applications. Number one, starting with authentication.

For AI agents to operate securely, they must be able to authenticate users just like any other application. It needs to confirm who the user is before providing access or making decisions, just as verifying a customer's identity before making a purchase or a patient's credentials before giving them access to medical records. Number two is API to API calls. AI agents will interact with different applications on behalf of users and will need API access to call these applications. Without strong identity controls, AI agents could access APIs they should not, or leak sensitive data to unauthorized agents, or be completely unable to perform tasks on behalf of users. This means access tokens should not be hard-coded. They need to be stored in Secure Vault. Number three, another common use case we see is asynchronous workflows. Many agent use cases need them to work asynchronously.

For example, actions such as data processing or transaction approvals can take minutes, hours, or sometimes even days. Security systems today are not built for long-running asynchronous workflows. An AI agent might need to perform a task long after a session has already ended. There is a need to authenticate just in time when agents have to act without leaving the door open for attackers. Lastly, authorization. The need to fine-tune data access is a more understood use case in AI agent development space today. AI agents should only get the permission that they need and nothing more.

We identified these requirements after partnering and speaking with companies of all sizes and growth levels and built these capabilities out of the box in our Auth for GenAI platform, which is Okta's platform that makes it easy for developers to solve these requirements with built-in identity security for AI agents. As these agents are running, how will we ensure the agents are doing what you built them to do? Monitoring and tracking their behavior is the full circle we need to build this trust. Because if they access the wrong data, take unauthorized actions, or if someone hijacks your agent and changes their behavior, the impact can be immediate and irreversible. Secure identity and observability have always been important in our software stack, but it's even more so in today's AI agent landscape.

That's why in the age of AI agents, we need to treat identity and observability not as optional layers, but as foundational technologies and practices. Datadog and Okta are well-positioned to enable customers to tackle these challenges that AI agents pose. To highlight the innovative work Datadog is doing in this space, I'm excited to invite my dear friend Yanbing, Chief Product Officer of Datadog, to the stage. Thank you all.

Yanbing Li

Chief Product Officer, Datadog

Thank you, Bhawna. It's been a real pleasure working closely with the Okta team as their observability partner. Hi, I'm Yanbing Li, Chief Product Officer at Datadog. To just share just how excited I am to be here, I actually accepted the offer to join Datadog after watching the dash keynote on video a year ago.

I can't think of a better way to attend my first Dash in person by showing you how Datadog is driving innovation in security and observability for your AI applications and agents. As Bhawna just said, security is even more critical in the age of AI agents with all these new attack surfaces that's possible. Our security team has been busy at work. Since last Dash, we launched more than 400 new features and detection. Today, 7,500 customers, including one in every five Fortune 500 companies, use Datadog security to protect their infrastructure and applications. Now, as you build and deploy your AI agents, we're evolving Datadog security to meet the unique challenges of AI at every single layer. At the data layer, where training begins. At the model layer, where reasoning happens. At the application layer, where you integrate AI into real-world application.

To dive deeper into how we're helping you secure every layer of your AI stack, let's welcome Vijay.

Vijay George

Product Manager, Datadog

Thanks, Yanbing. Hey, everyone. I'm Vijay George. I'm a Product Manager here on the Security Team at Datadog. Now, let's dive right in to see how we can secure our AI stack from these new attack vectors. We'll take a look at a few examples at each layer, starting with data. At the data layer, we need to prevent sensitive data leakage in training data sets and prompt response pairs. Let's take a look at how this works while I'm training my new AI app with sensitive data scanning enabled. In Datadog, I can see a 3D map of my entire cloud infrastructure, which gives me context into how everything's organized and connected within my cloud environment. Here, this S3 bucket has some training data to fine-tune my custom model.

With sensitive data scanning enabled, every file in this bucket is automatically scanned for sensitive PII, which I can now investigate further and jump straight into the AWS console to eliminate that PII. At runtime, I can also quickly switch to identifying PII data leaks in every LLM interaction. Here, I can see my attacker is trying to get a social security number from my model. Datadog automatically flags the input and sets alerts to catch sensitive data leaks when it has been detected. That is a quick look at how Datadog helps detect and prevent sensitive data leaks. To help you go further, we are expanding support to detect sensitive data in API response payloads and other data poisoning attacks coming later in 2025. Next up is the model layer, where we need to make sure our AI model is safe and is not being manipulated.

We'll first start by looking at a supply chain attack, where an attacker is targeting the supply chain of an open-source model. Now, I've been testing a lot of different models from Hugging Face, and I've accidentally downloaded a malicious DeepSeek model that can run code and give a threat actor remote access to my app. Luckily, with Datadog, I can see that my DeepSeek model was loaded with PyTorch and triggered an unknown process running shell commands. Datadog automatically detected the malicious model, killed the process, and stopped the supply chain attack directly at the source. Now, let's look at a second example of a model hijacking attempt. Here, I'm using a tool we've open-sourced called Stratus Red Team that's going to help me simulate a real-world attack in my own environment.

The attack you're seeing here is an LLM jacking attempt, where the attacker is using a stolen access key to hijack my model and use my LLM compute for themselves. This could mean I'm left with a huge bill costing me millions of dollars if I don't catch it quickly. Now, when I get to Datadog, I can quickly respond to this threat in real time. Here, I can see Lucia Silva is my attacker trying to access my custom model deployed on Bedrock. From here, I can jump straight into the related signal to triage and investigate more. We're continuing to add more support for attack vectors at the model layer, including model drift, model extraction, and jailbreaking coming soon in the near future. Finally, at the application layer, we need to protect our environment from code to cloud.

Let's first take a look at a prompt injection attack in my production app. Now, here, I've built my app and added some bad code. Now, when I open a PR in GitHub, I can see that Datadog prevented a prompt injection attack and blocked the merge automatically. Now, if I override the block and the code makes it into production anyway, Datadog can also detect when an attacker exploits that vulnerability. Here, I can see the line of code that an attacker could exploit to trick my LLM and run commands to gain access to my entire system, which I can remediate now directly in Datadog. Pivoting to my cloud environment, let's look at a data poisoning attack at runtime. Datadog shows my app is training from a public S3 bucket, meaning an attacker could poison the data and maliciously change the model's behavior.

I can now remediate the vulnerability directly in Datadog and meet AI security standards with our out-of-the-box AI compliance frameworks. We are continuing to build more detections, including agentic tool misuse, novel identity attacks, and denial of service coming later in 2025. These were just a few examples of how Datadog security can help you protect your AI stack from these new attack vectors. We have partnered directly with AWS to build out our Bedrock detection library, and we are continuing to invest heavily in novel security research, building a comprehensive set of AI detections across cloud providers to make Datadog security the product to secure your AI apps. AI is changing how software gets built today, and we are evolving Datadog security to help you build and ship these apps securely end to end. We are so excited to see what you build next.

If you want to learn more about securing your apps in the age of AI, come see us on the demo floor today. Now, I'll pass it back to Yanbing.

Yanbing Li

Chief Product Officer, Datadog

Great job. Thank you, Vijay. You just saw how Datadog security offers security for each layer of your AI, from data that powers training to the models that drive inference to the agents that learn real-world impact, all through an integrated security platform. Now that we have secured your AI stack, let's talk about observing it. As you integrate AI into your product and workflows, how would you know their behaviors and the interaction between them, and also whether they are delivering the user and business outcome that you've intended? To explore how we deliver end-to-end AI observability, please welcome Anjali.

Anjali Thatte

Product Manager, Datadog

Hey, everyone. My name is Anjali, and I'm a Product Manager here at Datadog. As AI workloads move from R&D to production, GPUs become more and more critical. We've heard from you. 30% of model training failures are because of GPUs, and these clusters are often running idle. Yet, even as GPU sales skyrocket, SREs and ML engineers are left without end-to-end visibility in how GPUs impact their AI workloads. That is why I'm excited to introduce Datadog's GPU Monitoring. Let's see it in action. GPU Monitoring provides full visibility into your GPU fleet across all major cloud providers, on-prem setups, and GPU-as-a-service platforms. You can view your fleet at the cluster level, then drill down to hosts, GPU devices, and even MIG slices. It does not stop there. Datadog GPU Monitoring solves for various issues. Let's start with contention.

Here, my ML team says that their race services are failing recently. In GPU monitoring within the resource contention section, I see this spike in unmet requests, specifically in my cluster named Yanmega. I filter down to this cluster. Immediately, I see that there are no A100 GPU devices available. Not only are we maxing out our current capacity, Datadog has forecasted that demand will continue to max out capacity in the next four hours. Datadog GPU monitoring just helped me identify the type and number of GPUs to solve this contention issue with confidence. GPU monitoring also helps me solve congestion between my GPU nodes. Let's say my ML team says that their training times are taking 12 hours longer than usual. With Datadog, I can inspect RDMA and EFA network congestion between GPU nodes and NVLink congestion between GPU devices. This issue sounds like a data starvation issue.

Let's investigate our first node. Clicking in, I see that switch one, port one, experienced a failure that caused a throughput drop in data transfer across my GPUs, impacting overall model training times. I can reroute RDMA traffic to a working port to improve my ML team's workload speed and resolve this congestion issue. Lastly, GPUs are a precious commodity, and idle capacity can be the biggest drain on our budget. GPU monitoring helps you stay on top of your total GPU spend. Let's see this in action. Here, I see that within GPU monitoring, we've highlighted and identified the key cost optimization opportunities. Looks like our cluster named Nidorino is our most expensive cluster with over $157,000 in total spent. Clicking into this cluster, Datadog GPU monitoring shows me my total devices allocated, active, and effectively used GPUs.

I see that only 40% of my GPU devices are using their cores effectively, leading to over $28,000 in inefficient spend. I can also see this cost in the context of my entire cluster within CCM. Now, GPU Monitoring breaks down GPU consumption by pods, processes, and jobs, so I can identify non-critical and inefficient workloads. I see here that there's a pod hogging eight GPUs with less than 50% core utilization. I'll ask my ML team to consolidate this pod onto a fewer number of GPUs so that we can reduce our total spend. With GPU Monitoring, I've connected wasted cost in my cluster to inefficient workloads, so I can optimize my cluster's GPU usage. To recap, Datadog's GPU Monitoring helps us solve for resource contention, data transfer congestion, and wasted cost across our GPU fleets. I'm so excited for you to try this new product.

You can sign up at the link for the preview. Now, I'll hand it over to Victor to talk about LLM observability.

Victor Vong

Engineering Manager, Datadog

Thanks, Angeli. Hey, everyone. My name is Victor Vong, and I'm an engineering manager here at Datadog. Today, I want to tell you about the latest innovations in LLM observability. For the past couple of years, we've seen our customers begin to explore using AI in their workloads. In 2023, we saw mostly experiments, and at that time, we launched LLM observability to help our customers better observe their AI workflows. As customers started building on top of these LLMs, they needed to go beyond simple monitoring to ensure the outputs from their AI applications were reliable. That is why last year, we added new capabilities like hallucination detection to help our customers trust their LLMs. Now, in 2025, as our customers have gone even deeper into using LLMs, we've seen them begin to deploy their own custom AI agents.

While these agents have been very powerful, they also present a new set of unique challenges. For example, agent-based applications are a lot more complex than regular workflows. It's hard to see how these agents make decisions or pick tools so they're not always reliable. Most tools out there aren't ready to handle these fast changes. To help you build better custom agents and observe their performance, we're excited to introduce AI agent monitoring. Let's see how it works. Let's say I'm building a personal finance app called Budget Guru. Budget Guru tracks my spending, manages my personal budget, and gives financial advice all using AI. Now, let's take a look at how I could observe the agents powering Budget Guru in LLM observability. Here, I can see the user's input and the LLM's response.

What my agent did here was it used multiple LLM calls and different tool integrations, which means to figure out the final answer, I normally have to scroll through a bunch of complex traces. Now, with the new agent execution flow graph, with one click, I see a clear view of how my agents work together to create the final response. There's a lot my agents are doing here. I can see the triage agent calling the investment and education agents, and the investment agent is calling the budget agent for more information. All of that is being summarized and sent back to the user. Thanks to the new agent execution flow graph, all that noise is being filtered out, and I can just focus on what matters most. I can also see how each agent was configured using the new agent manifest.

When I click on the triage agent, I can quickly see its instructions, tools, guardrails, agent framework, and model information, making it easy to understand the agent's behavior at a glance. When my agent breaks, I need to do more than just understand a high-level view. I need to drill down and see what's going on inside each agent. To help do this, we're excited to introduce AI agent troubleshooting and LLM observability. Let's see it in action. Here, the user asked about their latest dining spend but got a vague answer that missed all the important details. I'm going to open the agent execution flow graph to see what's going on. After taking a quick look, I notice an error flag on my triage agent. Clicking into it, I see there is a tool selection error being highlighted, and notice it's an irrelevant tool call.

It looks like the web search tool was prematurely picked because of a vague prompt. To fix this problem, I want to try testing with a few different prompts to see which one gives me the best results. I can do this using experiments. It's a new way to quickly test and validate changes you make to your LLM applications. Let's use the same example. I'm going to pick a data set and add this trace to it. Now that I have a data set with that problematic trace, I've decided to test out three different models with three different prompts. I would normally run these experiments, dump everything into a CSV, and analyze all that data by hand to figure out the best prompt and model setup. Doing this was always so messy and a lot of work.

Thanks to Datadog's new experiments SDK, I can run all these experiments in parallel and very easily analyze and pick the best prompt and model. Let's see this in action. Here, on the experiments page, each line is an experiment and its setup, and I can compare things like duration and tool selection accuracy. With one click, I can filter for the highest tool selection accuracy using the cards on the left, and I'll also filter for low duration using our brush filtering. In two clicks, I went from nine experiments to just two, and it looks like it's coming down to prompts V1 and V2 using two different models. Let's compare them using the new experiment comparison page. Here, I can easily compare the experiments, see all the details side by side, and a quick summary at the top.

After taking a quick look, I can see that the GPT-4.1 model with prompt V2 has the highest tool selection accuracy and roughly the same duration. I'll choose that combo to deploy into production. I've now gone all the way from troubleshooting my custom AI agent to improving it through experiments. To recap, we've just seen how Datadog's LLM Observability can help us monitor how our agents interact, run experiments to test our changes, and debug and troubleshoot errors all in one single platform. We support all popular agentic frameworks, such as OpenAI Agents, CrewAI, LangGraph, Pydantic AI, Mistral Agents, Google's ADK, Amazon Bedrock, and more. We're excited to get this into your hands. Sign up today at dashcon.io/agents. We look forward to working with you towards an agentic future. Thank you.

Now, I'd like to introduce Kathy, who will talk about how to monitor your evolving enterprise stack that will soon include agents built by others.

Kathy Lin

Senior Product Manager, Datadog

Hi, everyone. My name is Kathy Lin, and I'm a Senior Product Manager here at Datadog. Victor just walked us through how Datadog helps teams evaluate the performance of the custom AI agents those teams are building. There's now an ever-growing number of external agents integral to the business that these teams don't build in-house. Understanding these third-party agents' behavior is equally as important to achieve new efficiencies and accelerate innovation. We've heard from you that keeping track of what each agent is doing and how they're interacting with each other is extremely challenging, especially when you're worried about security breaches or wasted investments. The good thing is Datadog is all about providing visibility to help teams scale safely.

To help solve the challenges that come with integrating AI agents, I'm excited to introduce Datadog's AI Agents Console. With AI Agents Console, you can now monitor the behavior and interactions of any AI agent that's a part of your enterprise stack, whether that's a computer use agent like OpenAI's Operator, IDE agent like Cursor, DevOps agent like GitHub Copilot, or enterprise business agent like Agentforce, all in addition to your internally built agents. With this visibility into both custom and external agents, Datadog helps you understand which agents are supporting your business and what actions are they executing. Are they doing so securely and with the proper permissions? Do they deliver measurable business value? Lastly, how are your end users engaging with your agent-powered business? Let's jump into Datadog, observe these agents, and get some answers.

With a few simple setup clicks, I can instantly see a comprehensive summary of every agent that's powering my business. For each of these agents, I get key insights out of the box. For example, I can see the total monthly costs of using these agents, as well as the error rate across each of my agents to easily detect the most ineffective ones for further investigation. Now let's deep dive into one of these agents, Anthropic's computer use agent powered by Claude Sonnet 3.7. Here, Claude Sonnet powers my Slack-based AI agent, which creates personalized spreadsheets for each of my customer success managers of their churn risk customers and the respective product features that have blocked implementation, requiring Sonnet to access multiple systems like Salesforce, Jira, and Google Drive.

I can also see more granular insights about this agent, such as the task completion status, which actions Sonnet took and which ones failed. When I want to dive deeper into this agent's performance, security, or business value, I can do so by using the tabs on the left. If there's an increase in the number of task failures, I'm alerted instantly. Let me click into this tile to see why this spike is happening. This brings me to the Activity Insights tab, where I can see user engagement insights like daily active users, who my power users are, and quickly filter to those failed sessions without needing to write a single query. It looks like we have quite a few failed interactions here. I'm going to click into one to see what's going on.

By doing so, I get a replay of every action this agent has taken, which is amazing because I've just gone from not knowing what this agent is doing at all to seeing exactly where it's clicking and what it's entering into the browser, like signing into Salesforce or navigating into the Analytics tab to pull that list of churn risk customers. I can also click anywhere on the corresponding events timeline on the right to jump to that exact moment in the replay. Now let's see if we can figure out why this interaction failed with this error. I can click into this detailed side panel, which quickly reveals why the agent has failed to generate that requested spreadsheet. It looks like it lacked proper permissions in Salesforce to view customers' churn risk states. This led to the agent repeatedly trying to query on an unavailable field.

By simply granting Sonnet proper permissions, I can restore user engagement and boost the business value that Sonnet provides. To summarize, Datadog's AI Agents Console allows you to innovate safely and with confidence. You'll get full visibility into every agent's actions, insights into the security and performance of every agent, quantifiable business value for all of your agents, and ultimately proof that your agentic AI investments are paying off with your end users. We can't wait to get this to all of you. Sign up to become one of our design partners by following the link above. Thank you. Now back to you, Yanbing.

Yanbing Li

Chief Product Officer, Datadog

Thank you, Anjali, Victor, and Kathy. You just saw how Datadog provides end-to-end observability across your AI stack. GPU Monitoring monitors and troubleshoots your GPU's congestion and contention and cost so you get the best out of your GPU investment.

LLM observability helps you build and operate your LLM applications, including agents, and with AI agent monitoring, with agent troubleshooting, and experiments. Last but not least, AI Agents Console gives you full visibility and control across your entire sphere of agents running your business, whether they're developed in-house or by third party. That was AI observability end-to-end. Now, switching gear, we also know AI is only as good as the data powering it. How can you gain deep insight into the quality and lineage of the data that's powering your AI? I'd like to invite Kevin to tell you more.

Kevin Hu

Staff Product Manager, Datadog

Hi, everyone. I'm Kevin Hu, a staff PM at Datadog and formerly the CEO at Metaplane. Together with my friend Ian, who leads data at Ramp, I'll be talking to you about a new topic for Datadog, data observability.

As we just heard, companies are increasingly using AI to provide better experiences to their customers and build more efficient businesses. Underneath these AI systems is proprietary data, which is data only your company has and serves as your durable differentiated advantage. In other words, your AI is only as good as your data. Now I'll hand it over to Ian, who will talk about how Ramp uses data as a competitive advantage.

Ian Macomber

Head of Data, Ramp

Thanks, Kevin. Hey, everyone. I'm Ian, the head of data at Ramp. Ramp helps over 35,000 companies control spend, automate accounting, and manage vendors all in one place. We help the average customer save 5% per year on expenses, and we're headquartered right here in New York City. Across the company, we collect unstructured data like receipts, invoices, and bank statements, as well as structured data from systems you're familiar with.

Our data team transforms this data. Would you mind going back a slide? Thank you. As well as structured data from systems you're familiar with. Our data team transforms this data to power critical use cases across the company. I'll start with one example. Capital Markets Operations. Ramp's business is extremely cash intensive. We work closely with banking partners like Goldman Sachs, Citi, and Barclays to maintain big lines of credit that we borrow against our receivables. That means we need to know which businesses owe us money at any point in time, down to the second, down to the cent. That is hard. Credit card transactions can be reversed. Authorizations can be held and removed, and an Uber ride may be multiple transactions once you include a tip. We also depend on many third parties who may send us data with duplicate rows, missing entries, and incorrect numbers.

When that happens, it breaks the trust that we have with lenders. By flagging when data doesn't pass the smell test, data observability helps our capital markets team sleep better at night, and in turn, helps us extend customers the credit they need to run their businesses. Moving from operations to product, one of the most exciting products we've built combines data and AI. It's called Price Intelligence. Over time, we've collected millions of PDFs, receipts, and statements across customers and vendors. Traditional OCR and rules-based systems didn't scale. With large language models, we convert this massive and messy set of documents into structured data. Then we surface pricing trends, outliers, and benchmarks across billions of anonymized transactions. When you're looking at a contract, you can see what might be overpriced, how it compares to peers, and whether you can negotiate it down. Invoices change.

Pricing models shift. LLMs aren't perfect. By catching these issues, data observability helps customers trust what they're seeing. We know foundational models will keep improving, but we believe there are really only two moats: customer context and data. Thanks to Ramp's product engineering and design teams, we're in a position to be that system of record. Now it's the data team's job to capitalize on that opportunity, and we can't do it without trust. Data observability helps us get there. With that, I'll pass it back to Kevin.

Kevin Hu

Staff Product Manager, Datadog

Thanks, Ian. Ramp shows us what's possible when data goes right. What about when data goes wrong? This diagram probably looks familiar to you. Data flows from sources through a warehouse to downstream AI and BI tools. Everything looks fine until a customer flags a data issue.

You start troubleshooting, but the context is either fragmented, messy, or missing entirely. Meanwhile, the problem compounds, and you start to lose the things that are easy to lose but hard to regain: time and trust. We do not think working with data should be this way. To help you shift from reactive firefighting to proactive action, we are introducing Datadog Data Observability, now available in preview. Let's say I am a data engineer at a financial operations company like Ramp, call it Pully. There is an issue where the quoted prices are incorrect. Instead of the issue going unnoticed and then eventually impacting customers, I get a Slack alert saying that the quoted prices are lower than expected, based not on manual checks, but on machine learning models trained on historical data that takes trends and seasonality into account. To learn more, I enter Datadog.

Within Datadog, I ask myself three questions. Number one, is this real? Number two, does it matter? Number three, what can I do about it? To answer that first question of, is this real, I look at the most recent data points that failed. It looks like, yes, there are several occurrences of data below the expectation, so there is probably a real issue occurring. Number two, does it matter? Instead of trawling through query logs to try and find the downstream dependencies, Datadog automatically parses them for me. That is how I know an executive reporting BI dashboard and a table in a vector database storing embeddings are affected. Clearly, this issue does have a real impact. Finally, I ask myself the third question, what can I do about it? Usually, that is when I am out of luck.

I don't know where data comes from because my visibility is limited to the data warehouse. Datadog helps me map all the way upstream, integrating lineage and context across products. I see this Snowflake metric is materialized by a Spark transformation, which is erroring out. When I look into the details of the most recent job run, the code exception indicates to me that the job expects to see a file in S3 that's no longer present. To figure out what's wrong with the processes generating that file, I zoom out to the full end-to-end lineage view. It looks like the S3 bucket is the destination of a Kafka pipeline fed by a microservice. I inspect the microservice that's producing those Kafka messages and see that there is a recent schema change, which corresponds to a recent feature push.

To resolve this, I ping the on-call engineer to roll back the relevant PR. What you just saw is a combination of deep data quality checks and machine learning models that are tailored to the enterprise data quality domain, overlaid on end-to-end data lineage. What do I mean by end-to-end? Existing data observability products typically start from the warehouse, then shift one step to the left or one step to the right. By starting towards the end and with a limited view, the damage is often already done. Datadog Data Observability is the only product that spans the entire data lifecycle, starting all the way from the services and applications that produce data to the streams and ingestions that move data to the jobs that transform data through the warehouse to the BI and AI systems that consume data.

Now you finally have the visibility across the full data lifecycle to detect issues sooner, resolve them faster, and ideally prevent them from happening in the first place. Datadog Data Observability helps companies like Ramp, Justworks, and Glassdoor trust the data that powers their businesses. If you want that same level of confidence in your data, you can sign up for the preview today or visit us at our booth to learn more. Thank you. Now I'll hand it back to Yanbing.

Yanbing Li

Chief Product Officer, Datadog

Thank you, Kevin and Ian. It's really exciting to see how Datadog Data Observability can help you not only deeply understand your data sets, but also the entire lineage so that you can understand your data lifecycle. Today, we've covered a lot.

We started with introducing a fleet of fully autonomous Bits AI agents, including SRE, Security Analyst, and Dev Agent to help you boost your team's productivity and reduce time to resolution. We talk about the On Call real voice interface that lets you jump-start an incident response. We then introduced Datadog IDP that can help your development team build better software faster with more confidence. The Datadog MCP Server allows you to build your own agent with rich observability context from Datadog. We're also reimagining observability from APM to logs and more. Datadog Security helps you protect your DI stack at every layer. Last but not least, end-to-end AI and observability so that you have full visibility across your entire AI application and data. What do you think? Is it a lot? Wait, we actually launched so much more.

Yes, what you're looking at are all the features we're launching today at this Dash. This may be the most visually non-exciting slide you've ever seen. As a product person, it really makes my heart sing because this truly represents the hard work from thousands of engineers so that we can help you observe, secure, and act better on your data and applications. Now, I don't expect this Dash keynote to have the same effect on you as it did to me personally last year. Seriously, please visit Datadog hub so you get to see more live demos together with our product and engineering experts. Also attend the breakout session where we not only talk about product and technology, but most importantly, real customer stories from many of you in the audience. That's a wrap. Thank you, and have a fantastic Dash.