Status Update

Jun 18, 2024

Kevin Sturgeon
Director of Solution Engineering, Teradata

My name is Kevin Sturgeon, and I will be your presenter today. Before we jump into the actual demonstration, I want to give a quick overview of today's session and provide some business and technical background on the challenges solved by the technology we're talking about today. We're going to spend the bulk of the time on the demonstration, and we'll take live questions and answers at the end. Throughout today's session, whether during the presentation or the live demonstration, please feel free to put your questions in the chat widget in your interface. With that, we'll jump in and begin.

As a quick background here, the setup is that organizations of all sizes and in every industry are rapidly leveraging new AI-driven tools, techniques, and technologies to drive unprecedented business value. What sets the most successful companies apart is implementing a robust analytic and experimental foundation as a key enabler of the AI-powered enterprise. Organizations must support several pillars. Flexibility: the ability to adopt and experiment with new, rapidly evolving tools and techniques. Productivity: keeping analytic workers focused on innovation, not infrastructure. Cost-effectiveness: it's critical to unleash this sort of innovation on demand without sacrificing economics or business value. Scale: AI tools and techniques are powerful because of their scale, and organizations must support experimentation and operationalization at scale and speed for the largest data sets and consumer types. Unfortunately, many organizations struggle to build this foundation due to several key challenges.

The first is IT legacy: legacy IT controls and SLAs limit access to the analytic resources required for this type of experimentation. Budget constraints limit the acquisition of the powerful analytic services and platforms required to innovate and experiment. Data access: organizations are rightly concerned about exposing PII and other sensitive production data to new experimental AI tools or, even worse, to external services. And a huge issue that's coming up now as these techniques mature: replicating and operationalizing these experimental models and techniques into a pattern that supports broad, production-scale adoption and serves the business. In addition, organizations must also support the evolving needs of the different analytic and data worker personas required for these innovations. Data engineers must now adapt to drastically larger and more complex data sets transformed using new tools and techniques while still serving the needs of traditional analytics and the operational processing that runs the business.

Data scientists must now perform their experiments at even greater scale and speed, with vastly greater requirements for processing and analytic resources. And finally, developers need to adopt and embed the products of AI and machine learning (models, inferences, and services) into business and domain-specific applications. To conclude the background: experimentation at scale is critical. Unfortunately, it can be difficult to plan an all-in-one consumption approach that supports this level of experimentation and also meets operational requirements in production. Innovation is critical, and Teradata AI Unlimited can help unlock that innovation. For decades, the largest organizations in the world have leveraged the power of Teradata to provide them with the most advanced and scalable analytics, including the most innovative new AI-based solutions.

Now, with Teradata AI Unlimited, organizations of any size can enable users to leverage the power of Teradata in a true on-demand AI and machine learning engine in the cloud, eliminating the friction of managing complex environments and the fear of cost overruns. What we're going to see in today's demonstration are some key architectural enablers and how they apply to a typical experimental AI use case, specifically leveraging the industry-leading analytic engine Teradata is known for, including native capabilities that enable AI and ML outcomes. We'll see that the system is completely transient, with quick spin-up and spin-down and no residual or ongoing consumption costs. This is deployed natively in the cloud to any cloud service provider using industry-standard and open-source tools. It also runs on demand in your company's account and infrastructure, allowing organizations to both leverage combined cloud spend and enforce proper security models.

Finally, AI Unlimited will also enable the experiment-to-production pipeline. Since the workload engine is identical to the high-density, enterprise-class Teradata Vantage platform, experiments can be fast-tracked immediately into production. And with full support for reading and writing data to object stores and open table formats across clouds and catalog providers, users can access vast, disparate data sets across their enterprise in an unprecedented and frictionless way. Now on to the demonstration. We'll see how all these concepts combine to allow users to run rapid AI-based experiments on how to extract real business value from generative AI techniques. And again, we'll do questions and answers live at the end, so please put any questions you have in the Q&A widget, and we'll address them then.

The demonstration I'm presenting today is an experimental workflow that tries to answer the question of how we combine generative AI and open-source tools and techniques with traditional machine learning to develop an application with potential real-world business value. I talk to users all over the world, and they're struggling right now with how to take all this new generative AI capability and get real business value out of it. Chatbots are wonderful, but how do I leverage this power at scale for something real in my company? That leads to the experiment I want to run here.

How do I take, say, retail customer reviews, apparel reviews, and use generative AI techniques to turn the semantic meaning of the words people have written into vector embeddings, and then use those semantic representations to feed a customer segmentation model? I'm going to do that using the AI Unlimited engine. As we talked about in the overview, this is highly experimental. I don't know what the business value is, and I don't know if this is even worthy of production. I'm using open-source tools and data that may be of dubious value, so I don't want to commit to large-scale infrastructure. What I've done here is deploy this AI Unlimited engine on demand, as a customer, into my own AWS environment. And we'll see this was deployed via Jupyter Notebook.

If we look at the service architecture, for those interested in AI Unlimited from an infrastructure standpoint, it uses a set of lightweight services deployed either locally on a laptop or, in this case, in the cloud. All those services do is instantiate the analytic engine on demand in your AWS, Azure, or GCP environment. All the work that I do is set up to be stored in GitHub or GitLab: I set up a GitHub or GitLab connection, either through my enterprise or as an individual. When I run the command to start up an engine, whether from a Jupyter Notebook, the CLI, or a button, it records all of the actions I want stored in that Git project. When I shut everything down, all of my resources terminate.

I'll have no persistent instances, and all the work that I've done will be stored in the repo. So we can combine common development techniques, using a GitHub or GitLab repository for state maintenance and information storage, with an ephemeral on-demand engine that has no persistent compute or storage resources costing me money in the cloud. We'll see how that works from a data scientist's standpoint. As an infrastructure person, I've created a demonstration here that walks through all of that administration, but really all we need to do is simply deploy this. And I've done that: I've deployed this engine into AWS. I can have one instance or a number of instances to build a very large or very small cluster, and the deployment tells me how to connect as a client.

So now I, as an analytic worker coming into this, get access as if I am just a user of the system. The demonstration for this AI use case is built in Python. Teradata provides a very powerful Python client connector that allows users to interact with the MPP-powered, highly scalable functions inside Teradata using common Python syntax and design patterns. We wrap a common data management framework, pandas, which allows us to interact with the data as if it were local, while in the background the extraordinarily powerful, massively parallel processing engine that I've deployed on demand gives me hundreds or thousands of units of parallelism to run this data processing.
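
To give a rough sense of what that pattern looks like in code, here is a minimal sketch using the teradataml client; the host, credentials, and table name are placeholders rather than the demo's actual values:

```python
# Minimal connection sketch with the teradataml client library.
# Host, credentials, and table name are placeholders for illustration.
from teradataml import create_context, DataFrame

# Connect to the on-demand AI Unlimited engine using the connection
# details reported by the deployment step.
create_context(host="my-engine.example.com", username="demo_user", password="***")

# A teradataml DataFrame feels like pandas, but operations are pushed
# down and executed in parallel inside the MPP engine.
reviews = DataFrame("retail_reviews")   # placeholder table name
print(reviews.shape)                    # row/column counts, no bulk data transfer
print(reviews.head(5))                  # pull back only a few sample rows
```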

For this demonstration, I'm going to do a couple of different things. I'm going to take a look at my raw data, my retail apparel customer comments, the reviews of their purchases, and push that through an open-source embedding model to get a numeric semantic representation: a 50-dimensional representation of the semantic meaning of the way each person writes their reviews. Then I'm going to use that numeric representation in a native customer segmentation model to further the exercise. The first thing I'll do is use very common Python syntax to look at that customer comment history. This customer comment history is a demo data set in an S3 bucket, but it could be millions or billions of records; we can work on this data in parallel. And we see that we've got a couple of records that come back here.
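
As a hedged sketch of what reading that S3 data in place might look like, here is one way to query it through the Python client with the READ_NOS table operator; the bucket path and authorization object are placeholders, not the demo's actual locations:

```python
# Hedged sketch: query review data in place from S3 via the READ_NOS table
# operator. Bucket path and authorization object are placeholders.
from teradataml import DataFrame

raw_reviews = DataFrame.from_query("""
    SELECT TOP 5 *
    FROM READ_NOS (
        USING
            LOCATION('/s3/my-demo-bucket.s3.amazonaws.com/apparel_reviews/')
            AUTHORIZATION(demo_user.s3_auth)      -- placeholder authorization object
            RETURNTYPE('NOSREAD_RECORD')
    ) AS reviews
""")
print(raw_reviews)
```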

The next thing I've done is create a model table. This is an open-source model called GloVe, a 50-dimensional word-embedding model that represents the semantic meaning of text. I've simply ingested this model into the database. It also lives in an S3 bucket, though it could live in a Glue database or an Iceberg table. I've loaded this model, but it could be a higher-dimensional model, a Hugging Face model, or anything else in the open-source space. The model really is just a token, a word, and the vector representation of that word. Teradata provides a native, built-in vector embedding function called Word Embeddings.

Word Embeddings is a function that runs natively in our analytic database and our MPP engine, so it can address billions, tens of billions, or hundreds of billions of tokens, records, documents, or whole corpora at any scale running in this engine. What it does is take the model and apply those token vectors against all the text in those customer reviews. We can do that based upon a token or word, or based upon the document, the entire review. We can also do similarity comparisons inline with this function. But here, all I want to do is vectorize the reviews: create a 50-dimensional numeric representation of the combined semantic meaning of the way that people write reviews. Sort of an interesting thing.
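
A rough sketch of what that document-level vectorization call might look like is below; the table names, column names, and argument list follow the general pattern of the native word-embeddings function but are assumptions here, not the demo's literal code:

```python
# Rough sketch of document-level vectorization with the native word-embeddings
# function. Table, column, and argument names are illustrative assumptions;
# consult the function documentation for the exact signature.
from teradataml import DataFrame

review_vectors = DataFrame.from_query("""
    SELECT *
    FROM TD_WordEmbeddings (
        ON retail_reviews   AS InputTable
        ON glove_50d_model  AS ModelTable DIMENSION
        USING
            IDColumn('review_id')
            ModelTextColumn('token')
            ModelVectorColumns('v1','v2','v3')   -- ... through 'v50' in practice
            PrimaryColumn('review_text')
            Operation('doc2vec')                 -- one vector per whole review
    ) AS vectorized
""")
print(review_vectors.head(3))
```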

So I've done a short little experiment creating a vector representation of the way that people write. Well, what can I do with that? My thought experiment here is that I can use it for customer segmentation. Is there value, hypothetically, in the way that users write their reviews, so that we can segment them based upon the way that they write? Teenagers, adults, parents, single folks: maybe they all write differently. Maybe they've got a different way of approaching reviews. That may be a very good predictive feature for segmenting users, versus a traditional approach of lifetime value or zip code or income band or age band. We're going to use a generative AI technique to look at the way that people write. That could be very interesting. That's a great experiment. I don't know if it's got any worth.

That's why I'm using an on-demand compute engine that I can spin up for a very short period of time, spend very little money, do my experimentation, and then shut back down again. I'm going to use another traditional machine learning function built into the Teradata analytic engine: a K-means clustering algorithm. K-means takes the number of clusters or segments I want to break the data up into, places those cluster centers randomly in that n-dimensional space, and then, as it moves those centers, measures the closeness of the individual data points to the different centroids. It returns information such as how many users or points are inside each cluster and the distance those points are from the center of that cluster.
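
To make that concrete, here is a hedged sketch of invoking the in-database K-means through the Python client; the wrapper name and arguments follow teradataml conventions but are assumptions, and the table and column names are placeholders:

```python
# Hedged sketch: cluster the 50-dimensional review vectors with in-database
# K-means. Wrapper name and arguments are assumptions following teradataml
# conventions; table and column names are placeholders.
from teradataml import DataFrame, KMeans

review_vectors = DataFrame("review_vectors")            # placeholder table
emb_cols = ["emb_{}".format(i) for i in range(50)]      # placeholder columns

fit = KMeans(
    data=review_vectors,
    id_column="review_id",
    target_columns=emb_cols,
    num_clusters=5,        # candidate segment count for this run
    iter_max=100
)
# The output includes, per cluster, the centroid, the number of assigned
# points, and the within-cluster sum of squares.
print(fit.result)
```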

One of the interesting things about a K-means algorithm is that I have to tell it how many segments to find. There are ways to determine that with human intuition, or we can program it algorithmically very easily, but I do it graphically here because it's a nice way to understand the ideal number of customer segments to feed this K-means algorithm. The way we do that is to look for the inflection point as the distance between the data points in a cluster and the center of that cluster decreases. We take a value called the within-cluster sum of squares, which is essentially the sum of the squared distances of all of the points in a cluster from the center of that cluster.

We see that as we increase the number of clusters, that distance decreases. Of course, if we have the same number of segments or clusters as the number of users, that value will be zero, but that's an extraordinarily inefficient and unwieldy clustering. At the other extreme, a single cluster is very simple but not useful, because everything is a very far distance from its center. So what we can do is try different iterations of this K-means with different values, measure the within-cluster sum of squares, and where the curve inflects is typically where I get my most efficient clustering: where each cluster represents a decent population tucked closely into its center without overusing the number of clusters, the point of decreasing marginal value. This is an iterative process.

One of the great things about the scalability and speed of the Teradata engine is that I can run those experiments iteratively or in parallel to find that value very quickly. Here, I do it iteratively for the demo. The Teradata engine's advanced workload management capabilities allow me to do this in parallel as well, so I can run multiple experiments for multiple cluster values in parallel against the same engine, and it won't miss a beat. And we see here that even run iteratively, it's very fast. It's going to take the numeric representation of the customer comment history, the semantic meaning of their comments, and run these experiments on how far away all the points are for the different cluster counts I train this model on. I'm going to do that for two through eight clusters.
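
The sweep itself is just a loop. Here is an illustrative sketch that reuses the hedged K-means call from above for k = 2 through 8 and plots the within-cluster sum of squares to eyeball the inflection point; the result column name is an assumption:

```python
# Illustrative elbow sweep over k = 2..8. Reuses the hedged K-means sketch
# above; names remain placeholders and the column holding the within-cluster
# sum of squares is an assumption.
import matplotlib.pyplot as plt
from teradataml import DataFrame, KMeans

review_vectors = DataFrame("review_vectors")
emb_cols = ["emb_{}".format(i) for i in range(50)]

def total_wcss(k):
    """Run in-database K-means with k clusters and return total WCSS."""
    fit = KMeans(data=review_vectors, id_column="review_id",
                 target_columns=emb_cols, num_clusters=k)
    return fit.result.to_pandas()["td_withinss_kmeans"].sum()  # assumed column name

ks = list(range(2, 9))
wcss = [total_wcss(k) for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.show()
```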

Then I can plot that data live to see where that inflection point is. Because this starts with random seeding, sometimes when I run it I get four, sometimes five, sometimes six. So we're kind of on the edge here of whether the right segment count is four, five, or six. Again, another experimentation thing. So now what can I do with this? I've got something that suggests there may be a signal here, or at least something worth trialing. I can simply take the cluster IDs for all these user IDs, pull them out, take just that data and run it in my enterprise data warehouse, look at my BI tools, and look back at total customer lifetime value. Is there any correlation there?

I can use this data in further experimentation for other types of analytics like churn prediction or next-best-action recommendations, all sorts of different use cases for segmenting users by non-traditional means, the semantic meaning of how they write. There are a few concepts from this demonstration I want to review in conclusion: not only were we able to get an experimental analytic outcome that we could use further downstream in operations, but I did it against a platform that was ephemeral, deployable on demand, and uses no persistent resources. An administrator, or even I myself using the CLI, a button, or the Jupyter Notebook, can spin up the engine on demand. I, as an analytic worker, get all my project artifacts hydrated from my GitHub or GitLab repo, and I can begin work. Here's my model.

Here's my comment history. I vectorize that data on demand, run my segmentation experimentation, take the results, and then shut everything down. That's the key to experimental innovation and agility with Teradata AI Unlimited: spin up and spin down on demand, and leverage extraordinarily powerful inbuilt and open-source AI and ML techniques to look for unique and powerful outcomes. So that concludes today's demonstration. I hope you all have enjoyed it. Please let me know any questions, and I look forward to seeing you next time. Thank you very much. Take care. Okay, I'm back on the screen. Perfect. Before we jump into Q&A, there are a lot of really great questions in the chat; please keep them coming.

I wanted to let you know, as the announcement on the splash screen as well as this screen up here shows, that this offering is now available in public preview. If you want to, you can sign up for it at these links here in the presentation and in the resource center. We had a question in the chat asking how to get access to this, and this is a complete marketplace offering. One of the things that's very exciting for me here at Teradata is that we do so many great things with our analytic platform and our scalability, and now that we're offering this essentially as a self-service, on-demand engine, either in Microsoft Fabric or directly from the AWS or Azure Marketplace, you can go get it.

I showed in the demonstration attaching it to an enterprise GitHub or GitLab so you can do a sort of enterprise deployment. But you can also deploy this basically standalone: give me a 1-node or a 100-node cluster, deploy it, let it run, do my work, and then shut it back down. So if you're interested in this, please sign up at these links here. The public preview is now available across all these different platform deployments. I also want to include some additional resources. I touched a little bit on our ClearScape Analytics functions; these are the native and extended ML and AI capabilities in the analytic platform that run, again, at scale, both in AI Unlimited and in our traditional Lakehouse and enterprise data warehouse offerings.

We've also got a bit of business proof here on total economic impact and the value of using these two capabilities. I wanted to add a couple of additional follow-ups before we get into the Q&A, so we slipped those in: how to get access to it, and the public preview available as a Fabric offering as well as a marketplace offering for both Azure and AWS. Go ahead and sign up if you're interested. So let's jump into the Q&A, and again, keep the questions coming. One question I'll deal with specifically here is: do I need a GitHub connection to run this? There are multiple deployment types available for AI Unlimited, all in your tenant, in your account. I can deploy it so that it attaches to my corporate enterprise Git repo, and what that does is maintain, again, schema persistence.

So if I create objects and tables and things like that, it will store that in the repo. But it also controls authentication and authorization, so if your company already has an enterprise Git repo, just attach to that and give people grants. It's a nice way of keeping integrity across the different ecosystems without having to contract with Teradata for a whole authorization and integration architecture. We can also deploy this standalone: I can deploy directly from the AWS Marketplace, and it will generate a standalone cluster for as long as I need it, with an onboard Jupyter Notebook. I can also deploy this via Microsoft Fabric in the Fabric portal, where it will deploy implicitly from Fabric. There's a question around how much this costs. This is based upon the marketplace.

You can see the pricing, depending on region, when you go to the marketplace link and accept the terms and conditions; I think the one-unit instance is about $1.90 an hour in US East 1 on AWS, the last time I checked. So this experiment that I ran here, taking customer comment history, vectorizing it, and then doing a couple of experiments, really, if I hadn't been talking, I could have done that in a couple of minutes. At $1.90 an hour, that's something like $0.25, $0.10, a few cents. So it becomes an extraordinarily cost-effective way to get ephemeral, massively parallel computation capabilities and then shut it down when you're done. I think that is a wonderful addition to the portfolio and the power that Teradata has.
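
As a quick back-of-the-envelope check of those numbers, assuming a single one-unit instance at the quoted $1.90 per hour:

```python
# Back-of-the-envelope cost check at the quoted $1.90/hour, one-unit rate.
rate_per_hour = 1.90
for minutes in (2, 5, 10):
    print(f"{minutes:>2} min on one instance: ~${rate_per_hour * minutes / 60:.2f}")
# Roughly $0.06, $0.16, and $0.32, in the few-cents-to-a-few-dimes range quoted.
```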

Let's see. Another question: can I use other tools to access this? I showed Jupyter Notebook because it's a very common interface and lets me build demos and interact with things. This is an analytic engine that has a Python interface, a SQL interface, and an R interface, and it also has connections to a set of ecosystem and partner tools. So I can connect any of my analytic tools, whether that's a desktop SQL client like DBeaver or partner tools like Dataiku and H2O.ai that can run their visual analytics in our engine. This becomes a very interconnected and usable asset for your enterprise as you deploy it ephemerally. There's also a question around what data formats we support. I mentioned in the demonstration that the retail comment history was sitting in S3.

One of the things I didn't show, and that we're going to cover in a future session, is our bidirectional integration with open table formats, which I mentioned during the presentation. One of the innovations we've deployed in both this platform and all the other offerings is the ability to interact with open table formats across clouds, everything from Azure OneLake, Lakehouse, and Fabric to AWS Glue, in Iceberg and Delta formats. There's a huge push now to do data management and table management in these open table formats, using Glue and OneLake and things like that to manage the Lakehouse storage architecture and handle table and schema management externally in an open table format. This engine, as well as our traditional VantageCloud Lake offering, implicitly supports open table formats.

So if I have an entire data catalog sitting in Glue, covering my entire enterprise from landing zone all the way to refined data products, I can spin up this engine, say a 100-node cluster for 5 minutes, attach to that catalog, do all my analytics, write the results of my analytics back into that catalog, shut the engine down, and then let other tools use that data automatically, because the data is being managed by an external catalog provider. And I can do that across clouds. So it gives you not just analytic agility, but also significant flexibility in how and where you manage your data and your data assets. That also allows an implicit evolution from experimentation to production.

If I'm running an enterprise Lakehouse architecture in, say, our VantageCloud Lake environment, and I've done this experiment, vectorized my customer comment history, and want to save the vector representation, I can save that to a Glue catalog and pick it up in VantageCloud Lake. Now my operational people can take that data in and maybe pin it in high-speed storage or cache it, so it gives sub-millisecond response times, and I can work basically seamlessly between the different platforms, from experimentation to operationalization, very easily. Let's see. Okay, I answered that one. There was a question about ClearScape Analytics and language support. We used Python here today. If you've seen any of the prior sessions, I've talked quite a bit about being able to bring your language of choice to the analytic engine.

On the Teradata native side, we have hundreds of inbuilt functions that let you do, as I showed here, vector embeddings, K-means, machine learning models, and prediction models. We can access all those functions using traditional SQL or using our Python connector, which allows you to write in Python pandas and scikit-learn patterns that are translated and run in the engine. We also support R. One of the things we didn't show here, but have shown in prior sessions, is the ability to take open-source and third-party models and bring them into the database. So if I have a model I've built, whether it's a machine learning model in, say, AWS SageMaker or Azure ML, I can serialize that model, load it into our engine, and run that inference at scale.
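
A hedged sketch of that bring-your-own-model flow is below; the helper names follow teradataml's BYOM pattern as an assumption, and the file, table, and column names are placeholders:

```python
# Hedged BYOM sketch: store a serialized model (e.g., ONNX) in a table and
# score it in-database at scale. Helper names follow teradataml's BYOM
# pattern as an assumption; file, table, and column names are placeholders.
from teradataml import DataFrame, save_byom, retrieve_byom, ONNXPredict

save_byom(model_id="churn_v1",
          model_file="churn_model.onnx",        # model exported from SageMaker/Azure ML
          table_name="byom_models")
model = retrieve_byom(model_id="churn_v1", table_name="byom_models")

scored = ONNXPredict(
    modeldata=model,
    newdata=DataFrame("customer_features"),     # placeholder scoring table
    accumulate=["customer_id"]
).result
print(scored.head(5))
```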

So there is a whole open and connected story around the very broad and deep analytic capabilities we typically present as part of our enterprise Lakehouse architecture, VantageCloud Lake, and all of those analytic capabilities apply in the AI Unlimited engine. Maybe we've got time for one more question here. Let's see. Oh, yes, there's a question around what LLM models we support. I mentioned things like Hugging Face; we can load those models and then use that representation with the Word Embeddings function. We also have the ability to serialize the model format and run it in our Open Analytics Framework. There's another whole session we could do around the Open Analytics Framework, where, in this parallel engine, I can run a runtime container with onboard Python.

In the next couple of iterations, we'll see this run on GPUs, where I can run GPU inference using an open-source model in this engine, in parallel, at scale, which is a very exciting evolution and one we'll cover in the next couple of sessions. There's a question that just came in as well around efficient chunking size. If I infer what the question around chunking means: typically I think of chunking as the ability for a single processor to operate on a set of data. In Python pandas, for example, if I have a very large dataset in a single file, I can load it in chunks and operate on each chunk, or I can broadcast it. The Teradata AI Unlimited engine is an MPP processing engine.

So when I spin this up, I can run anywhere from one to many instances of this engine, and inside each instance there are many different elements of parallelism depending on the number of processors I have. We automatically optimize the partition sizes that we work on in parallel; it's a shared-nothing architecture that handles the chunking. So if I've got a billion records and a thousand units of parallelism, the engine will chunk that up as efficiently as possible. We don't have to worry about tuning chunk sizes or broadcasting row partitions the way you would with something like a Spark RDD, where you have to define row partition values, partition keys, and things like that. We do all of that automatically.

That's part of the 40 years of legacy we have of continually refining and optimizing the way we manage very large datasets. We will find the optimum I/O chunk sizes and operation sizes depending on the function, where the data lives, and things like that. We've got a whole set of mature, patented, and continually evolving technologies to optimize the way we handle and work with data in parallel. Another question: how do we take workloads developed on AI Unlimited and move them into VantageCloud Lake?

I think that's a great question. I mentioned a little bit about moving data and data products between AI Unlimited and VantageCloud Lake, and we can mediate that with open table formats: the ability to save the results of all of this code into, say, a Glue catalog and then pull them back in. But consider the code that I write. This experimental pipeline I ran is written in Python, a procedural language. There's nothing that says I can't take that identical notebook or Python file, change the host I'm connecting to, point to my enterprise platform, and now run with enterprise density and enterprise scale. So it can be as easy as this: implicitly, the data products live in a catalog, and the code is completely transferable to the exact same analytic engine.
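
To illustrate just how small that change is, here is a minimal sketch; the hostnames and credentials are placeholders:

```python
# Portability sketch: the same pipeline code, re-pointed at a different engine.
# Hostnames and credentials are placeholders.
from teradataml import create_context, remove_context

# Experimentation: connect to the on-demand AI Unlimited engine.
# create_context(host="ai-unlimited-engine.example.com", username="me", password="***")

# Operationalization: the only required change is the connection target,
# pointing the identical notebook at the enterprise VantageCloud Lake environment.
create_context(host="vantagecloud-lake.example.com", username="me", password="***")

# ... run the exact same vectorization and segmentation pipeline here ...

remove_context()   # clean up the connection when done
```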

I could run that wholesale, without modification, in my operational environment, in my production environment. We can do other things to further optimize that: I can take that procedural Python code, express it in SQL, and run it as an embedded pipeline in my database. What's interesting about these pipelines is that we can go beyond the development cycle; I can start to push a lot of that preparation back into the data engineering and ETL workload. A lot of the experimentation I did was the vectorization: I've got customer comment history, and I match it to a model to create a vector embedding table of my comments. Why not push that into my ETL pipeline? A lot of ETL pipelines are still written in SQL.

In fact, with things like dbt today, you're writing SQL at the core of it; you've got a lot of great automation around it, but SQL is the core. If I want to do vector embedding, we have native functions that can do that vector embedding in about three lines of SQL code, so I can put that step back earlier in the ETL pipeline and leverage that preparation earlier on. So I can sort of implicitly operationalize these things, and I can take it different ways: I can take that data and code wholesale, or I can take the SQL that's been run, translate it, and run it as an expression in my ETL pipeline.

So there are a bunch of different ways, but the key is that the same code I use in experimentation I can run in an enterprise frame, and again, all the automatic workload management, optimization, and concurrency controls are going to be applied. I can run the same code with hundreds, thousands, or tens of thousands of concurrent users and billions of queries a year without having to refactor it to run on an operational platform. So really, it eliminates the distraction of having to rewrite an experimental pipeline into something that's going to run operationally. I think it's a fantastic workflow for people to get very efficient with these things. Any more questions? I think we're over the scheduled time here. So with that, I'll close and leave you with the fact that this is available today.

Please see the links in the resource center for how to sign up for the public preview. We'll certainly make some of these links available with this recording, and I look forward to the next session, where we'll dive into deeper topics. That concludes the session today. If you have any follow-up questions, please let us know, and we'd be happy to support you any way we can. Thank you all very much again for your time and attention.
