Hi, everyone. Thanks for joining us today. We're really excited to have you here as we're talking about the story of finance data at Snowflake. Just a quick introduction to get started. I'm Sandra, and I lead the analytics engineer team for finance data at Snowflake.
I'm Jack. I am a BI analyst supporting the product finance team at Snowflake. So today we're gonna talk about the evolution of dbt at Snowflake, specifically how it's helped drive quick and strategic decision-making within our finance team. We're gonna start with an overview of our finance support model, and then we're gonna talk about how ELT within our team has evolved over the last 10 years or so. We're gonna talk about how dbt really helps us develop quickly and cost effectively, how we hope to continue to use it into the future, and then at the end, we'll have some time for Q&A as well.
So to start off, we'll talk through our BI team's support model for finance as a whole, and then we'll dive specifically into cloud finance support. The finance data team is made up of three distinct sections: the analytics engineers, the BI analysts, and the data scientists. The analytics engineers are in charge of building and maintaining our core pipelines, and they're also our team's experts on data architecture and data model design. Each BI analyst supports a specific financial function, for example, sales, finance, investor relations, or FP&A, and then we also provide ad hoc support for any function not covered by an analyst. Finally, our data scientists are in charge of building and maintaining our financial forecasts, including revenue, bookings, cloud spend, and credits.
They help support financial decision-making with data science and machine learning solutions, with things like scenario analysis, impact analysis, anomaly detection, and much more. Today, we'll focus specifically on our support of cloud finance as they largely function as the cost-governing arm across our company and also maintain some of our largest data sets. This is also the area that Jack's aligned with as an analyst on our data team.
So outside of myself, the product finance team is composed of four financial analysts, and the main responsibility of this team is to manage and forecast all expenses that go into delivering Snowflake to customers. And we also work really closely with the engineering teams at Snowflake to optimize these costs. And, you know, at its core, Snowflake is a company that is rooted in data-driven decision making. So my role as the BI analyst in support of this team is to build and maintain all the data products that go into making the most informed decision possible. So, you know, whether that be insights or recommendations based on customer usage, pricing, new product rollouts, really anything under the umbrella of improving gross margin and operational efficiency, these actions derive from a deep understanding of the Snowflake product and are backed by timely, accurate, and reliable data models.
So now we're gonna go into a little bit about how our ELT has evolved over the last 10 years or so. So in the early days of Snowflake, we really tried to leverage the product internally as much as possible, but a lot of the business functions were pretty nascent. You know, the finance team didn't have any sort of data support, and so all of our finance data was housed in Excel, and this worked for a while, but as Snowflake continued to grow and scale, we received more and more cost data from AWS, Azure, and GCP, and Excel really couldn't handle that volume. And so that's when the finance team started hiring data people, and we made the switch over to Looker.
And within Looker, we were able to build out basic revenue and cost pipelines, but we soon outgrew the capabilities it afforded us as a data team. You know, it wasn't meeting our needs in terms of volume, complexity, or code testing, and so that's when we started looking for the next best thing. That's how we ended up on dbt. And, you know, dbt is incredibly performant inside of Snowflake. We're able to use the tables we create in places beyond just Looker dashboards, for example. And ever since we made the switch back in 2019, it's enabled us to do so much more than we initially even expected.
So our technical architecture has landed here with Snowflake and dbt Core at the center of it all. What we do is we take our raw data from a bunch of host systems, everything from Adaptive to Workday, and we ingest that with a variety of methods. We do data sharing, we use Snowpipes and other types of connectors. And once that data lands in Snowhouse, which is our internal instance of Snowflake, a bulk of the transformation is done by Airflow and dbt Core. We also have a few machine learning models that run in Snowpark and some transformations not fit for SQL, running in custom environments. Once all of that's processed, we share data to downstream teams, we write back into host systems, and we do visualization through a variety of tools, including Streamlit and Snowsight.
Ever since we migrated to dbt, we've experienced development velocity like never before. So I know you guys have seen a lot of these at this point, but we had to throw our own lineage graph in here, too. For some perspective, this is just part of the lineage graph of one of our revenue pipelines. It was so big I couldn't even find a screenshot that would fit in here, so I picked part of it. But I think this really illustrates how we've developed out support for finance using dbt over the past five years or so. We now have over 20 pipelines across 14 dbt projects and thousands of tables, and the tables are used every single day by our finance stakeholders in reports, dashboards, and investigating any sort of trends they're interested in at the moment.
So to summarize, dbt has been incredibly impactful for our team in three main ways, and the first is development velocity. You know, by bringing software engineering best practices to data, we've gained incredible oversight over our BI development, and this really helps our team iterate quickly. And the second is that dbt enables real-time, data-driven decisions. And so we have real-time anomaly detection, cloud spend forecasts, insights and recommendations based on customer usage trends, and much more. And the last is cost governance, and we'll dive into more detail here later. But for us, and I'm sure for many of you, you know, 2023 is really the year of cost optimization. And so fortunately, dbt offers a number of tools that really help us be the most cost-efficient data team possible.
So there are endless dbt features that we love and have aided in our development velocity, but we wanted to highlight four of our favorites here. To start with, we use dbt Core in conjunction with GitHub to allow for version control. This allows our team of over 25 to work collaboratively on large pipelines together without stepping on each other's toes. We also have thousands of tests validating our assumptions and giving us immediate feedback on any data quality issues, so we can remedy them immediately. We also use dbt docs as implicit documentation of our platform. Each project has a schema.yml file, and it's filled with table and column descriptions that we programmatically read into our data cataloging platform, Alation. Then my personal favorite is dbt targets. This allows me to easily switch between dev and prod environments or different warehouse sizes. One big thing is we're always trying to size our workloads appropriately in different warehouses, and this feature has made it easier than ever before for us to do so.
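For anyone who hasn't used targets this way, a minimal sketch of the pattern looks roughly like the following; it's not our actual configuration, and the target names, warehouse names, and model name are made up for illustration.

```sql
-- models/example_model.sql
-- Minimal sketch (not our real config): route a model to a different
-- warehouse depending on the active dbt target. All names are made up.
{{
    config(
        materialized='table',
        snowflake_warehouse=('FINANCE_WH_XS' if target.name == 'dev' else 'FINANCE_WH_L')
    )
}}

select *
from {{ ref('some_upstream_model') }}
```

Switching environments is then just a matter of running dbt run --target dev versus dbt run --target prod.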
One big example of cost governance and data-driven decision making is our COGS anomaly detection modeling. This project was done by our data science team last year, and they built it out to support cloud finance. What they did was they took our COGS data, and they identified outliers in our cloud spend. Then they used a macro in dbt to create Slack alerts for our finance and engineering teams. These run daily, alert on anomalous cloud spend, and allow the engineering team to investigate and remedy any overrunning warehouses that we have. Over the past year, we've saved over $5 million just from these alerts.
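We can't share the actual implementation, but the rough shape of that macro is something like the sketch below. The anomalies model name is made up, and notify_slack stands in for a hypothetical Snowflake external function wired to a Slack webhook; the real plumbing belongs to our data science team.

```sql
-- macros/alert_cogs_anomalies.sql
-- Sketch only. Assumes a made-up anomalies model and a hypothetical
-- external function NOTIFY_SLACK() that posts a message to Slack.
{% macro alert_cogs_anomalies() %}
    {% if execute %}
        {% set anomalies = run_query(
            "select usage_date, anomaly_reason from "
            ~ ref('cogs_daily_anomalies')
            ~ " where usage_date = current_date - 1"
        ) %}
        {% for row in anomalies.rows %}
            {% do run_query(
                "select notify_slack('COGS anomaly on " ~ row[0] ~ ": " ~ row[1] ~ "')"
            ) %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```

A macro like this can then be invoked from an on-run-end hook in dbt_project.yml so it fires after the daily build.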
So if all this wasn't enough, we're doing a lot more to keep our finance team happy. And Sandra and I, we find ourselves in an interesting position where we are, you know, the analytics engineers, the BI team, incurring costs on Snowflake, while at the same time supporting the product finance team that manages our internal cloud spend. And so the reason why we think this is a unique position is because it really puts us at the forefront as data stewards at Snowflake. You know, it's one thing to bring quality data and information to your stakeholders, but it's another to do that in the most cost-efficient way possible.
So in the last year, like many of you, probably, our team was tasked with finding ways to reduce our costs. Specifically, could we reduce our internal cloud spend? While evaluating our pipelines, we realized that our largest tables were actually the ones coming from our cloud service providers. Ironically, Cloud Finance, the team responsible for managing our internal cloud spend, actually maintained our largest pipelines. So for some context, our COGS pipelines look like this. First, we bring in our raw data from our three cloud service providers and stage that data. And these staging tables only have very minor transformations from the original cost and usage reports that we're getting. So what we do is we merge some payer accounts together, we rename some columns and add in some metadata, and after this data is staged, we run it through two reporting pipelines.
One brings in the business logic and allows us to report on COGS, cost, and revenue, and another is an allocation of our total cloud costs to each customer to try and get at margin-level data. Back to data size here. Just that staging portion at the beginning that I was talking about already has about 445 TB of data, and that number continues to grow significantly every single day. So you can imagine the massive cost of the entire COGS pipeline as a whole when our starting data size is already this big.
So when we dove deeper into the cloud inventory tables that ingest this raw data, we realized that this pipeline was both far more expensive and far less performant than the rest of our finance pipelines. And so these models, to recap, ingest raw cost and usage data from AWS, Azure, and GCP, and, you know, some of them individually take about an hour to build each day, and collectively, this pipeline takes a little bit more than 6 hours to run each day. So if we take a standard Snowflake credit, which is priced at $2 per credit, the compute cost associated with building this pipeline each month is about $6,000. And so clearly, pipelines of this volume are both a cost issue and a performance bottleneck.
From a performance standpoint, you know, the slow query runtimes mean that our stakeholders can't access data quickly. It also means it's very difficult to develop on this pipeline because, you know, any changes or updates to these models take so long to productionalize. From a cost perspective, we really wanna highlight two main areas, and the first is Job Credits. Job Credits is a metric we use internally at Snowflake to assess workloads. We can think of a job as any query run inside of Snowflake. We take all the credits consumed by a warehouse, and we attribute them proportionately to jobs run on that warehouse each hour.
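It's not the exact production logic, but a simplified sketch of that attribution against Snowflake's standard ACCOUNT_USAGE views might look something like this; the hour-level join and the use of execution time as the weighting are simplifying assumptions.

```sql
-- Simplified sketch of per-job credit attribution, not the production query.
-- Credits for a warehouse-hour are split across jobs by their share of
-- execution time on that warehouse during that hour.
with job_time as (
    select
        warehouse_name,
        date_trunc('hour', start_time) as usage_hour,
        query_id,
        execution_time                 as execution_ms
    from snowflake.account_usage.query_history
    where warehouse_name is not null
),
warehouse_credits as (
    select
        warehouse_name,
        start_time as usage_hour,
        credits_used
    from snowflake.account_usage.warehouse_metering_history
)
select
    j.query_id,
    w.credits_used * j.execution_ms
        / sum(j.execution_ms) over (partition by j.warehouse_name, j.usage_hour)
        as job_credits
from job_time j
join warehouse_credits w
  on j.warehouse_name = w.warehouse_name
 and j.usage_hour = w.usage_hour;
```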
So, you know, the longer it takes a query to run, or in this case, you know, the longer it takes for our pipeline to build, the more Job Credits we're consuming and the more compute costs we're incurring. And so even though there's, you know, a really high Job Credits cost associated with building this pipeline, this cloud spend data is crucial to our product finance team, and so we need to rebuild and rerun this pipeline each day. The second area that we wanted to highlight is data transfer cost. And so in the original version of this pipeline, you know, all of our models were materialized as tables, which means they were rebuilt from scratch each day. Something that we do as a finance data team is we replicate a copy of all of our data models to a backup deployment in case of emergency.
Since our, you know, largest pipeline by volume was being rebuilt each day, we were also transferring our largest pipeline by volume each day and incurring, you know, a very large amount of data transfer costs. For these reasons, you know, our problem statement was pretty clear: How can we make this pipeline faster and less expensive at the same time?
Originally, we approached this problem through the lens of query optimization. Our first instinct was to consult the Snowsight query profile to try and identify any bad joins or slow-running CTEs, and we ended up not finding much here. Our joins were pretty optimized. Our CTEs actually ran pretty fast. We looked into replacing our computationally expensive macros with mapping tables. This helped a little bit, but also wasn't incredibly significant either. The last thing we tried was auditing columns and removing anything not being used downstream. This had also very little impact on speed, and so it was time to attack this pipeline from a different perspective. At its core, it was clear that this was largely a data volume problem, and so our conversation changed to how we could reduce that data volume.
We asked ourselves questions like: Can we limit the amount of data ingested? Maybe only keep three years of data instead of all time. Or can we reduce our granularity, maybe aggregate at a daily level instead of hourly? We consulted with our stakeholders and ultimately decided that we didn't want to lose any of that data because we wanted to track very granular changes and trends over time, and so we chose to convert our tables to the remaining solution, incremental models. So to start with, let's make sure we're all on the same page about what an incremental model is. In comparison to a regular table that's fully reprocessed every single day, an incremental model only transforms new data and updates the relevant existing data in the table already.
Once we've designated a model as incremental in dbt, the first run will still rebuild the entire table from scratch, but subsequent runs will be much faster. Based on the incremental filter you add to the model, dbt only processes the new data you've designated and merges it back into the table you originally built. This is great with a high-volume data pipeline because you're only updating a fraction of the data rather than the whole thing. The output table remains the same, and you don't lose any information, but you incur less cost and spend less time constantly rebuilding your models.
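In dbt terms, a minimal sketch of the pattern looks like this; the model, source, and column names are simplified stand-ins rather than our real staging model.

```sql
-- models/staging/stg_cloud_costs.sql
-- Minimal incremental sketch with simplified, assumed names.
{{
    config(
        materialized='incremental',
        unique_key='cost_line_id'
    )
}}

select
    cost_line_id,
    usage_date,
    account_id,
    cost_usd
from {{ source('cloud_billing', 'raw_cost_and_usage') }}

{% if is_incremental() %}
  -- On incremental runs, only pull rows newer than what's already loaded.
  where usage_date >= (select max(usage_date) from {{ this }})
{% endif %}
```

The first dbt run builds the whole table; after that, only rows passing the filter get processed and merged in, and dbt run --full-refresh rebuilds from scratch whenever we need to.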
Yeah. So dbt incremental models led to significant improvements from both a cost and performance perspective. You know, by using an incremental filter and a unique key, we're able to limit the amount of data that is transformed on a daily basis. So, you know, rather than rebuild from scratch, we're updating the smallest window of data possible. You know, as we mentioned, you know, we replicated backup copies of all of our data models to a separate deployment. So now, rather than transferring the entire pipeline to a separate deployment, we're really only transferring the updated portion of this model. And so that saves us a lot of money from a data transfer perspective, and then we're also saving a lot of money from a job credits perspective because there's less compute required to, you know, run this pipeline each day.
And then, you know, this improved runtime also increased performance because we're able to get these models to our stakeholders much more quickly, and we're able to develop on this pipeline much more quickly as well. And so another thing that we really like about this incremental configuration is the on_schema_change parameter. And so, you know, as new columns are added to your tables or the schema itself changes over time, this parameter ensures that our models continue to run successfully. And so in summary, our big data problem became much more manageable, but also very important is the fact that the scope of our data remained the same this entire time. And so despite all these improvements, we do want to discuss some challenges we ran into along the way.
The first was choosing the best incremental filter, you know, one that optimizes performance and aligns with our business logic. So our goal is to update the smallest window of data possible while prioritizing data recency and data accuracy. So we'll dive into an example with our AWS data to provide a little more context. Each day, AWS updates their cost and usage report, and they update data for the entire billing month to date. So that means any data that already exists for that billing month is supplanted by the new data. Now, at the conclusion of every month, we expect a bill invoice from AWS with finalized data for the month that just completed. We typically receive this bill invoice in the first couple of days of the next month.
And so in these instances, we not only need to update the current month's worth of data, but the previous month's worth of data as well. So the problem became: How do we define an incremental filter that updates just the current month when applicable, and then the current month and previous month whenever we receive a bill invoice? And now the second challenge that we ran into was defining a unique key that aligns with this business logic. And so within our AWS pipeline, we have multiple independent payer accounts, and each of these payer accounts merges into one consolidated cost and usage report. And these payer accounts refresh at different times, and we receive bill invoices at different times for each of these payer accounts.
And so we needed to define some sort of unique key that would apply the incremental logic that we just discussed at the account level. And so we were able to achieve this by using a compound key of billing month and account ID, and this ensures that we have the most recent, most accurate data for each account, for each month, all in the same model.
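In spirit, the resulting configuration looks something like the sketch below. The column names are assumptions, the grain is simplified to one row per account per billing month, and the filter is shown as a simple trailing-two-month window rather than the real invoice-aware condition.

```sql
-- models/staging/stg_aws_cost_and_usage.sql
-- Sketch of the logic described above, not the production model.
-- Grain is simplified to one row per (billing_month, account_id).
{{
    config(
        materialized='incremental',
        unique_key=['billing_month', 'account_id'],
        on_schema_change='append_new_columns'
    )
}}

select
    date_trunc('month', usage_start_date) as billing_month,
    payer_account_id                      as account_id,
    sum(unblended_cost)                   as cost_usd
from {{ source('aws', 'cost_and_usage_report') }}

{% if is_incremental() %}
  -- Reprocess the current billing month plus the prior one, so a
  -- late-arriving bill invoice still gets picked up; the compound key
  -- then replaces each affected account-month in place.
  where usage_start_date >= dateadd('month', -1, date_trunc('month', current_date))
{% endif %}

group by 1, 2
```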
And so, you know, this incremental migration resulted in a lot of improvements in our cloud inventory staging pipeline. The original daily runtime was a little bit more than six hours; that has now been reduced to just one hour. We've also seen an 88% reduction in our monthly data transfer costs and a 67% reduction in the amount of Snowflake credits it takes to build this pipeline each month. And so perhaps less quantifiable is the fact that our overall sentiment toward this pipeline is much improved as well... you know, not only are we saving a lot of money, but we're saving a lot of time and headache from a developer perspective, too.
Finally, we'll discuss where we intend to go next. We look forward to leveraging dbt features more across the finance pipeline to continue to reduce our cloud spend. Our next goal is to explore the relationship between cluster keys and incremental filters. Our hypothesis here is that aligning the two could lead to enhanced query performance for certain models. We'd also like to experiment with workload-specific warehouse sizing. For example, for those known periods of diminished performance in the cloud finance pipeline that Jack was speaking about earlier, where you have to update both the current month and the previous month, if we increase our computing power and reduce our overall runtime, could that actually reduce our spend?
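To make that cluster-key hypothesis concrete, the experiment is roughly the sketch below, clustering on the same assumed billing-month column the incremental filter prunes on; whether it actually helps is exactly what we want to measure, and the model shown here just reuses the earlier staging sketch as its input.

```sql
-- Sketch of the experiment: cluster the table on the same column the
-- incremental filter uses, so both loads and reads can prune partitions.
{{
    config(
        materialized='incremental',
        unique_key=['billing_month', 'account_id'],
        cluster_by=['billing_month']
    )
}}

select
    billing_month,
    account_id,
    cost_usd
from {{ ref('stg_aws_cost_and_usage') }}

{% if is_incremental() %}
  where billing_month >= dateadd('month', -1, date_trunc('month', current_date))
{% endif %}
```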
We also have a few pipelines that aren't easily handled by SQL that we originally wrote in Python, and now that we have dbt Python models in partnership with Snowpark, we're hoping to migrate those models over. We're also excited to leverage Snowpark to facilitate new machine learning workloads and continue to champion data-driven decision making alongside our stakeholders. If you have any questions about anything we talked about today, feel free to reach out to us via email or connect with us on LinkedIn, and we're looking forward to continuing the conversation. Also, I think we have some time for Q&A if anyone has any questions. Yeah.
Yeah, do you have a strategy for implementing full refreshes on your incremental models?
Basically, only when we need to.
Yeah.
We have, like I said, thousands of tests. Every time we make something incremental, we make sure we're including a test that will alert us if the data is not fresh for some reason or like some historical thing has changed somewhere. And then that will tell us that we need to do a full refresh.
Mm.
Can you shed some light on your Slack alerts through the macro? How do you actually do that?
Do you want, like, the whole-
Like, what's the, what's the API like? Do you have a macro? I see that you have a macro to send the Slack alerts for some data change, right?
Yeah. Yeah. So and you've done some Slack stuff. Would you like to?
Yeah, I mean, I didn't necessarily build the back end of it, but what we do is you basically have this query set up or, you know, this system set up where we've defined what we want to detect as an anomaly. So maybe, you know, we see, we take a trailing 28-day average, and we see something that exceeds that past a certain barrier. And then anytime that happens, we'll ping, you know, a Slack channel that we've defined as a team.
You know, maybe it's called COGS Anomaly Alerts, and that'll pretty much give us the date that we see it, you know, the reason why it's been detected as an anomaly, and give us a message so that we can then go into that model or wherever, you know, that data is housed and see, is it an issue with our data models, or is it perhaps an issue with something that we need to go address with a separate team?
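Purely for illustration, and not the data science team's actual model, a check like the one Jack describes might look roughly like this in SQL; the table, columns, and the 1.5x threshold are all made up.

```sql
-- Illustration of a trailing-28-day-average check, not the production model.
-- Table name, columns, and the 1.5x threshold are assumptions.
with daily_spend as (
    select usage_date, service, sum(cost_usd) as daily_cost
    from cogs_daily_costs
    group by 1, 2
),
with_baseline as (
    select
        usage_date,
        service,
        daily_cost,
        avg(daily_cost) over (
            partition by service
            order by usage_date
            rows between 28 preceding and 1 preceding
        ) as trailing_28d_avg
    from daily_spend
)
select
    usage_date,
    service,
    daily_cost,
    trailing_28d_avg,
    'daily cost exceeded 1.5x the trailing 28-day average' as anomaly_reason
from with_baseline
where daily_cost > 1.5 * trailing_28d_avg;
```

Each row that comes back is roughly what ends up in the Slack message, with the date and the reason, as described above.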
I think you could probably combine that in a user-defined function; they're probably doing something similar per model.
Right.
Yeah, exactly.
When do you use an incremental table, and when do you decide to just use simple tables?
So we, we typically tend to use an incremental table when there's a significant amount of volume. So on tables that have a little bit less volume and are pretty performant when they're materialized as tables, we don't see a need to materialize them incrementally. We actually have seen kind of a reduction in performance if we try to incrementalize small tables. And so if our pipeline is pretty performant, you know, it completes in minutes rather than hours, we don't really see a need to use incremental models. But otherwise, you know, in this case, where we have this really expensive pipeline, this really slow pipeline, we see it as a really good candidate to try incremental models. And if we see a significant performance improvement, we'll stick with it.
You know, there are times where we've experimented with incremental models and not seen the kind of performance improvement or cost improvement that we wanted, and we've gone back to table materializations.
Do you have a kind of guideline or criteria?
We probably won't look at it if it's under, like, 1,000, 1,500 seconds for runtime for the table, just based on how long our pipelines usually take. But if it's pushing above that, we'll consider it.
Mm-hmm.
What are the core tests that you guys do for incremental models? You mentioned that you do multiple tests for incremental models, so just wanting to know if you can share what are, like, the main tests that you focus on.
I think a lot of the tests that we do are ensuring... I mean, they're pretty uniform, I guess, across all types of our models, so, A, we don't want to have any sort of unnecessary duplicates. The second is we really want to confirm that our unique keys are working the way we want them to. Another thing that we're checking for, and, you know, one of the reasons why defining, you know, the best incremental filter was one of our challenges, is ensuring that if there is an update to, say, the data from a year ago, you know, outside the scope of our sort of incremental build, are we capturing that data?
We have tests that attempt to capture, you know, that sort of look back, and make sure that in our AWS data, say, we have the correct assembly in each of our billing months and stuff like that. Some of it is pretty unique to the business logic that we're using in that model, and then some of it is pretty uniform in terms of, like, you know, duplicate observations, stuff like that.
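As one concrete, hedged example of the duplicate-style check, a singular dbt test along these lines does the job; the model name and grain are carried over from the earlier staging sketch, not the real pipeline.

```sql
-- tests/assert_no_duplicate_account_months.sql
-- Singular test sketch: fails if any (billing_month, account_id) pair
-- appears more than once in the incremental model. Names are assumptions.
select
    billing_month,
    account_id,
    count(*) as n_rows
from {{ ref('stg_aws_cost_and_usage') }}
group by 1, 2
having count(*) > 1
```

The look-back checks are similar in spirit, comparing what's in the model for each billing month against the raw source to make sure nothing outside the incremental window was missed.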
How do you load the AWS cost data?
So our team actually isn't in charge of that. Our IT team handles it, but I believe it's put into S3 or I guess it's in S3, and we bring that in. I would guess either Snowpipe, Fivetran, something like that. I don't know. I know that's incremental too, for sure. It's huge.
... Yeah, I guess that's kind of related to my question. So like, what is the organizational structure of the team? So you have, like, analytics engineers embedded in each domain, and then do you have data engineers actually populating, you know, Snowhouse?
Yeah. So, Snowflake has data teams all across the company, so we're on finance data, there's IT, sales, marketing, et cetera. A lot of, like, the ingestion stuff or, like, financial systems tends to be owned by other teams. There's a FinSys team that handles anything from, like, Workday, stuff like that. We'll sometimes own the Snowpipes to bring that in, and sometimes they do it. It just depends. But anything that already is in a raw table, we'll handle everything after that. Anything that's a core pipeline, revenue, contract information, stuff like that is the analytic engineers, and then each analyst owns all of their models for their specific area. Yeah?
I saw that the models are implementing the functionality within the merge. How is the team addressing the data integrity aspect? How do you guys manage it?
Sorry, I don't know if I heard that completely.
I think the question was, you are implementing the merge functionality, which is deleting records based on the records being inserted.
Mm-hmm.
How is the team addressing the data immutability aspect, like ensuring that the financial record doesn't change over time?
Yeah. I mean, there's definitely a difference between, like, reports internally versus what we report to shareholders, right? Like, our earnings report and stuff. So anything that is going to investors, like... or for us, for a release on our earnings call, we have a whole data model structured for that as well, where we snap out everything. So we make sure we're only reporting what we reported last quarter, stuff like that. But then in terms of internal reporting, oftentimes, like, you want the most up-to-date data, regardless of if it's GAAP revenue or not, right? Stuff like that. Yeah?
Following up on the last question: are you using dbt tests and any additional packages for data quality, especially for finance and accounting data? And how are you balancing that with your optimization and performance, and those things that sometimes come with a trade-off in the process?
Yeah. So we just use dbt tests, but we write them all ourselves. I don't think we're using any external packages. But we also have built a lot of dashboards on top of tests and like, we are making our data quality observability dashboard right now, which is, like, a central spot to go and see, like, anything failing in our pipelines, any DAGs that haven't run. But we have hundreds of eyes on these numbers every single day, like-
And the tools you brought together, you have no-
Yeah, we do, we do all of that, and there's always the, like, human component, too. If that number looks wrong, we should go investigate it. Yeah. Yeah?
So, as, like, the finance data team, are you the final consolidation point, like, are you incorporating the costs from other data teams, too? Or is each team, like, a very separate buyer that also does its own cost modeling and tracking of its costs?
Yeah, that's a great question. Our team was definitely not the only one tasked with reducing our costs, right? Like, if you think about, like, a lot of the data science teams probably have even bigger data that they're running. And so this is, like, a company-wide thing. We definitely had a central team that was tracking all the different projects for this, and it was definitely, like, everyone was on board to try and work on it. This is just the biggest example from the finance side that we did.
Like, when you guys have, like, issues, are you, like, standardized, like, hey, let's look at this pipeline that shows, like, this many credits or whatever, and do those kinds of areas get more responsibility to make, like, decisions?
I think, you know, like, supporting the product finance team, it's more that we'll see, like, you know, large-volume areas of data or large-cost areas of data, like, what we're incurring to store, you know, data. And so we'll attack those that kind of, like, show up on our radar specifically. Our scope, just, you know, as the finance data team, it's pretty obvious, like, which one was our big bottleneck, just 'cause this cost pipeline is so much, you know, slower and more expensive than the rest of them.
But in terms of, like, having, I guess, you know, a uniform metric for costing out things, we typically go by the cost we incur from, like, the cloud providers or where we store our data and things like that. So we have, like, you know, a cost number, but to standardize, we were using, you know, the Snowflake price from a compute perspective.
Yeah. Yep.
I wanted to ask, have you guys experimented with Snowflake streaming tables?
Dynamic Tables? Or-
Dynamic Tables, yeah.
That's great.
With a stream, as the data comes in, into the next model. Any experimentation with that, with your incremental, you know, materialization type?
Yeah, that's a great question. I get asked this one fairly frequently. So we looked into it a little bit, and we ran into some issues with having dbt refresh our tables in full every day and not causing the stream to fail. So hopefully we can figure out how to deal with that. I do think it would probably work better with something incremental since that's not being fully rebuilt, but we haven't done anything yet. Definitely something we're trying to investigate.
I know you're still like that type of-
To be honest, I don't think I've investigated enough to... I don't know. Yeah, let's chat about it later. Anyone else? Yeah.
You mentioned that you guys saved, from the alerting alone, I think you said, $5 million. Can you expand a little bit on that? Like, how do you measure that?
That was the anomaly detection that Dee built out. Yeah.
Yeah. That number didn't come from us.
Okay.
That's from the data science team. They work with us, but they're the ones that did the actual project. So I would assume they looked into, like, all the overrunning warehouses and, like, how much longer those would have kept running, and came up with something that way. Yeah.
Mm-hmm.
Yeah?
Sorry if I missed this, but what's the framework you're planning to bring in? You mentioned at the end that you're still going to integrate that. Do you keep those in Snowflake, or will you integrate them into dbt as well?
I mean, I think the goal is definitely to integrate this into dbt. But I can't speak to that very well, again, being on the analytics engineering team; that's more of, like, what our data science team is working on. I'm sorry. I wish I had a better answer for you. Definitely, if you follow up, I'm going to ask them and share that answer with you, though. Yeah.
Cool.
Yeah.
What were your considerations in selecting dbt Core versus dbt Cloud? Did you have a whole list, or were there no considerations other than cost?
Yeah, my understanding is, like, because we were on it so early, like, 2019, I think that's when Cloud came out, is my understanding. That's about... or I would assume that's why we picked it, and we kind of just stuck with it and ran. Everyone across Snowflake uses dbt Core; no one's on dbt Cloud, that I'm aware of. So yeah, but I wasn't here then. Also, again, happy to follow up if you want more explanation. Yeah. Anyone else? I think we're way over time, so thank you, guys.