Appen Limited (ASX:APX)

Investor Day 2021

May 20, 2021

Speaker 1

Hello, good morning, good afternoon, good evening, everybody, and welcome to Appen's Investor Technology Day. Thank you very much for joining us. Our agenda today covers a number of topics. I'll provide the briefest of introductions, and then I'll hand over to my colleagues to give you, first of all, an update on the AI market and, secondly, an update on our technology, including some demonstrations. We'll have a period for Q&A at the end, and you'll be able to enter your questions via the app.

And then we'll close, and we'll be all done prior to 2 p.m. Sydney time today. I'm joined today by Wilson Pang, our CTO, who's coming live from California, and Ryan Kolln, our Head of Corporate Development. Both Wilson and Ryan are far more interesting than me to talk to, and they'll be doing the bulk of the talking today.

Wilson is an engineer with an extensive background in search and artificial intelligence. He worked for IBM and for eBay for many years. At eBay specifically, he worked on search, and that gives him a lot of expertise in artificial intelligence. He was then Chief Data Officer for Ctrip, a travel company, where he built many, many models to help that business use data more effectively and grow and thrive. Ryan Kolln, also an engineer, has worked for telcos in Australia and in the U.

S., and has had a stint with the Boston Consulting Group, advising technology companies on growth strategy. Today, our theme is all about our transformation into an AI-powered provider of AI data and solutions. Our talk today will tell you why this is important. First of all, though, just to recap some of the things we covered yesterday: how we got to this point and the evolution of our business. When I joined Appen six years ago, we were a leading provider of language data.

We evolved over that time to be a provider not only of language data but also of training data for all AI use cases, including all AI data types: speech and natural language, text, relevance, image, video, and three-dimensional data, including LiDAR, for example. So we've moved quite a lot from our initial position as a provider of language data. We're also evolving our delivery model, from essentially service-led to being more product-led, and you'll see a lot about our products today. From a revenue perspective, we are moving over time to more committed revenue rather than all project revenue, and that obviously goes to revenue visibility and earnings quality. From a customer perspective, we are still very concentrated in our largest customers, as many of you know, but we're working to win many new customers and broaden our customer base over time.

Yesterday, we announced a change to our organizational structure, from one that is functional to one that is more aligned to our customer cohorts. We now have four P&L, customer-facing business units: our global business unit, which serves our five largest tech customers, the U.S. technology giants; our enterprise business unit; and then our business units in China and the government sector. And then finally, yesterday, we announced some changes to our reporting.

We were reporting by data modality (relevance, and speech and image) in Australian dollars. Now we're reporting more by our customer segments and other strategic areas of interest, such as our new markets, which includes the Enterprise, China and Government segments, but also the revenue that flows through our products from our major customers as well. To look back at the passage of time: when I joined the business, again, the majority of our revenue came from the global customers, and we provided services to them essentially on their platforms. We acquired the Butler Hill Group in 2011 and then Leapforce in 2017. And with Leapforce, we also gained Appen Connect, our crowd management platform that helps us manage crowd resources at scale for our customers.

In 2019, we made an important acquisition with Figure8, which gave us our own annotation platform. And this provides a number of opportunities for us. First, to sell to customers that don't have their own annotation and data preparation technologies. Second, it gives us the opportunity to do more types of work, as the platform covers all data modalities.

And finally, we use it ourselves to improve the efficiency of our own operations. Most recently, we invested in the expansion of our business beyond our global customers with the addition of the business units I just mentioned (enterprise, government and China), all of which is fueled by our platform, and all of which requires the technology that we've acquired and invested in over the last few years. So, increasingly, we'll be a product-led organization. Our products give us the opportunity for scale, for quality and for productivity, and they underpin the growth of the business going forward. We are also, as discussed yesterday, increasingly customer-centric with our four customer-facing business units.

That's not the topic for today, though; today's topic is more around product. And on that, I'll hand it over to Ryan to take it from here and provide the AI market update. Thank you, Ryan.

Speaker 2

Thanks, Mark. I'm going to talk about the AI market today, but I'm going to start with the AI application lifecycle, as it's a useful grounding for us to think about, particularly, the role that we play with our customers to help them build AI-enabled applications.

So on the left-hand side, we see a very typical view of how a customer might build an AI application. It all starts with the business need: the hypothesis around what the application is going to deliver. Typically, the first step is to collect and bring together the available data to build the model from. That could be in-house data, or it could be data collected bespoke for the application. The second part is the preparation of the data.

So it's one thing to have the data; it needs to be in the right format, with the right labels, for the AI models to be trained on it. For our side, that typically involves a lot of data labeling, and it's a big part of the role we play. Once the engineers have data that is ready for the model build, the next step is to build the model. This step typically involves the selection of modeling techniques.

There is a wide variety of approaches that can be used to train models, but once one is selected, they apply the training data and build the model. Next is testing: once the model has been built, does its output meet the requirements and support the business need? If it does, the model is put through to deployment.

That's when it's actually put into the application and deployed in the real world. There's an interesting side loop here for some applications that may be high criticality, where the confidence of the model is not where it needs to be: there will be a human in the loop. That is effectively where low-confidence predictions are routed to humans, who make the decision, and that closes the loop. Monitoring is very important for AI models.

There's an adage that it's not a question of if a model will degrade over time, it's when. We'll talk about this a bit more later. Once a model has degraded below acceptable performance, it goes back to acquiring more data. This view is a bit stylized; in reality, it's a lot messier, to be honest.

There are many iteration loops that can occur, and the most common ones are around the testing phase. An engineer will build a model, they'll test the application, and it may or may not work. If it doesn't work, or doesn't reach the performance they're looking for, they'll acquire more data, prepare and label more data, or try different model-building techniques.

And we'll talk through some of the differences across those approaches later today. So, to simplify the AI lifecycle and the model development approach: an AI model consists of two main parts, the model instructions and the training data, where the model instructions are the architecture for how the model learns. It's not saying "here's the output"; it's a guideline for how the model should learn once training data is applied. And this can be as little as 10 lines of code in some instances.
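To make that "10 lines of code" claim concrete, here is a minimal sketch (assumed PyTorch, not the speaker's actual example) of what "model instructions" can look like. These few lines define how a model should learn; the training data, supplied separately, determines what it learns.

```python
# Hedged illustration: "model instructions" can be just a few lines.
import torch.nn as nn

model = nn.Sequential(                 # small image-classifier skeleton
    nn.Conv2d(3, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(10),                 # e.g. 10 output classes
)
loss_fn = nn.CrossEntropyLoss()        # the learning objective
```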

The other important part of model development is the training data: the examples that the model learns from. Typically, the more training data and the higher its quality, the better. I think it's helpful to contrast AI development with traditional software development. In traditional software development, you have an idea of the outcome you're looking to get to.

You write code, and it's deterministic, meaning every time the application is run with the same set of inputs, it delivers the same outputs. You test the code, you deploy, you monitor, and if anything changes, you rewrite or edit the code. In AI model development, it's different: the labeling of the data (the training data composition) is the really important part.

The provision of the instructions, i.e. the architecture we spoke about, is how the code gets written, and then the model is tested. So you can see the difference: in traditional software development, writing the code is the most important step; in AI model development, it's the gathering and labeling of high-quality training data.

So it's interesting to think about what an AI model actually is. I've put up an example here because AI models are really this indecipherable set of nodes, weights and biases that, when you look at it from the outside in, makes absolutely no sense. That's why AI explainability and model debugging are really difficult: the actual "code" produced by the modeling process is a highly complex system that's very hard to debug. The important part, the bit that can be debugged to improve quality, is the training data. We've spoken about training data in past presentations, and obviously it's core to our business.

Breaking it down to really simple components, training data consists of three things. Firstly, there's the file; you can think of that as the example. It could be an image, a text file or a snippet of audio. Then there are attributes of the file.

It's really important, as part of the training process, to assign meaning to the file. Let's say this file was used for autonomous vehicles: the box drawn around a car, saying that within these pixels there is a car, is the attribute of the file. The next thing is the attributes of the labels. This is the metadata: what time it was labeled, who it was labeled by, under what conditions. We'll now step through a few examples of what training data actually looks like.

On the left-hand side here is an example of our LiDAR annotation tool. LiDAR is used in autonomous vehicles, and you can think of it as similar to radar: a pulse is sent out and received, which allows the sensor to measure distance and rough shapes in a 3D point-cloud environment. You can see on the top left of that image what a standard camera sees, and the dark blue points are the LiDAR frame.

In this instance, the task for our annotator has been: can you draw a 3D cuboid (a cube) around this particular car in the frame? On the right-hand side is the label. This is what's called a JSON file, which carries the actual meaning, and I've highlighted a few sections here. The first section, highlighted in light red, is the center of the cuboid.

It's saying: at this position in space, there's a cuboid. The next part is the height, width and depth of the cuboid in meters; so in this space there's a cuboid roughly 1.9 meters high by 1.9 meters wide and four-and-a-bit meters long. And then it's saying that within that space, there's a car, right?
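A hedged reconstruction of the kind of label just described is below. The field names and exact values are illustrative, not Appen's actual schema; the metadata fields echo the "attributes of the labels" mentioned earlier.

```python
cuboid_label = {
    "position":   {"x": 12.4, "y": -3.1, "z": 0.9},               # center of the cuboid
    "dimensions": {"height": 1.9, "width": 1.9, "length": 4.2},   # metres
    "category":   "car",                                          # the meaning assigned
    "metadata":   {"labeled_by": "worker_123",                    # who, when, conditions
                   "labeled_at": "2021-05-20T10:32:00Z"},
}
```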

So when we think about how this trains the system and the model: when there's a representation of these types of cuboids in a real-world environment, because it's been trained to look for cuboids, it will be able to say, okay, I know there's a car in this space, I know the dimensions of the car, and I know how far away it is from me. Obviously super important for autonomous driving. The third highlighted section is interesting: it's saying that in the 2D image, the image in the top left-hand corner, there is also a car.

So in this LiDAR frame, it's actually blending two different types of sensors into one set of training data; this is called sensor fusion. A simple example, but you get the idea that this can be exceptionally complex, particularly when you're looking at hundreds of different objects in a frame: vehicles, pedestrians, bicycles, other stationary objects. It gets quite difficult to label, and the JSON file, the annotation itself, is very complex. Another example of training data is speech, so spoken audio.

On the left-hand side is an example of our speech annotation tool. Here you've got two speakers, a quite simple example, with speaker B saying a few things and then speaker A. On the right-hand side you see the JSON file, a little simpler than the LiDAR frame. Effectively it says: at this start time in the file and this end time, here are the words being spoken. What's a little more complex here, and really important for training voice recognition systems, is the noise associated with it. You'll see there's an insertion of certain types of noise. Here we've kept it quite simple and just tagged it as noise, but it can include quite specific things like a cough or a sneeze.
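As with the LiDAR case, here is a hedged sketch of the speech label just described (illustrative field names, not Appen's schema): time-aligned segments per speaker, plus noise tags.

```python
utterance_label = {
    "audio_file": "sample_0001.wav",
    "segments": [
        {"speaker": "B", "start_s": 0.0, "end_s": 2.4,
         "text": "I think we're going to talk about weddings."},
        {"speaker": "A", "start_s": 2.4, "end_s": 2.9, "text": "[noise]"},  # e.g. a cough
    ],
}
```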

So, training data quality is really important, and I think it's intuitive as we think through the development of an AI model. We're providing a lot of examples, and if those examples are wrong or not representative of the real-world state, the model is not going to perform well. Low-quality training data leads to low-performing models.

The thing is, though, that poor-quality data is not always obvious, and there are many different types of quality issues with training data. Let me step through these. We see three main buckets of problems. The first is where there's been an error in the labeling process: on a specific label, there's something wrong with it.

The next is more about the composition of the training data. Unbalanced training data is a really big issue: you may be overrepresented in some areas and underrepresented in others, and again, that leads to suboptimal performance in the models. The third is bias in the labeling process.

It may be that there are no errors in the labeling and it's balanced, but the individuals who performed the labeling may have certain biases that, again, lead to suboptimal outcomes. We'll go through examples of all of these. A really simple example here: let's say the task for the contributor is to draw a box around the cows, and those were the instructions provided. What we might be looking for in this instance is a tight bounding box, meaning there's not a lot of space between the image of the cow and the box, around these three cows. The one on the left is pretty clear.

The cow in the middle is pretty clear. The one on the right is a little trickier because it's occluded; you can't see all of the cow. But the intent is that the box is drawn around just the part of the cow we can see. The first type of labeling error that may occur is that, for whatever reason, the contributor missed the cow on the right.

And it's pretty clear that that's an error. The next could be the accuracy of the bounding-box fit. Here, the contributor has been a little generous in the space provided around the cow. While it seems somewhat trivial, it's actually really important, because models are trained pixel by pixel, so the box needs to be as accurate as possible to get the best level of prediction.
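A hedged sketch of how box fit is commonly scored: intersection-over-union (IoU), a standard metric where 1.0 means a perfect fit, so a generous box scores visibly lower than a tight one. The example boxes are made up.

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2) in pixels
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((10, 10, 110, 60), (2, 2, 130, 75)))  # generous box -> ~0.54, not 1.0
```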

The next problem might be a misinterpretation of the instructions, or bad instructions. We said the task was to draw a box around the cows. Drawing one box around all of them is not necessarily an incorrect interpretation by the crowd worker, but if you've got a few thousand labels with a box around each cow and a few thousand with one box around all the cows, you can quickly see how that could lead to problems. We spoke about the cow on the right-hand side being occluded.

Another potential error would be that the crowd worker assumes the full length of the cow and draws the box around what they think the cow would look like at its actual size. Again, that isn't the desired outcome we're looking for. So, a bunch of errors in the labeling process: mistakes and misinterpretation. Another big issue in training data is what's called class imbalance. You can think of class imbalance as not having a representative set of examples in the training data.

And we'll just talk through this a little bit more. So on the top, let's assume that we were building a model that was going to recognize cows and it would come back with The breed of cow that you put in a photo and it returns the breed of the cow. If the training data was limited To the top row, it would be probably quite good at recognizing dairy cows on green grass with a blue background. As soon as you put in a different set of cows, so on the bottom left, you've got some white cows on Pretty brown grass, some dairy cows on snow. I think the third one is a yak and then the right is a Texas Longhorn, and I think that's a bull rather than a cow.

Class imbalance, and this is a very simple and straightforward example, is a really big issue for the performance of high-quality models. Another type of class imbalance is around data recency. We mentioned before that all models degrade over time, and that's because the real-world environment continues to evolve. Training data, unless you refresh it, is static: it represents a point in time.

I've got an example here around the search results returned for "corona". Obviously, in May 2021, there are a lot of news articles and statistics around cases. If you did that same search in April 2019, the top result is Liquorland, right? So you start to get an idea of how important recency is. This is an extreme version, but it is a problem for a lot of training data, particularly where the real-world environment continues to evolve and change.

The next example is about bias, another stylized example. Let's assume one was trying to build a model to identify breakfast foods, and you asked a set of workers based in the U.S.: can you look at each of these photos and tag which ones are breakfast foods and which are not? On the left, you've got black pudding, which is, from what I hear, quite acceptable in the UK for breakfast.

In the middle is hagelslag, which is sprinkles on toast from the Netherlands. On the right-hand side, you've got our Vegemite. Someone in the U.S. is probably unlikely to get these right. So it's a form of bias, and it's not intentional bias.

It's bias that creeps in because the crowd worker is not representative of all of humanity and all the different types of breakfast foods we see. A lot of datasets require specific knowledge and/or context for accurate labeling. We spoke earlier about the equation: an AI model is model instructions plus training data. What's really important is that a good AI model requires the model instructions plus high-quality training data. And our role in this AI lifecycle is to deliver that high-quality training data.

And we'll talk a lot more, particularly in Wilson's section, about how we're doing that. The training data market is continuing to evolve, very quickly in some circumstances. So what we want to do now is talk through some of the trends we see that are specific to training data, and then we'll move on to some observations on the model development market overall. There are five major things we want to talk through today. The first is that high-quality data remains a major roadblock for the development of AI. The second is that AI use cases are becoming narrower.

By narrower, we mean more specific, and this has implications for training data and how it is used. Third, we'll talk about the shift from model-centric to data-centric AI, which is a focus more on how to improve the quality of data and less on different modeling techniques. Fourth, as AI models become more mainstream and move into production systems at a lot of companies, there's an emerging need for training data operations. And finally, something we've spoken about before: the use of AI in the training data preparation space is increasing.

We'll talk a bit about this now and also in Wilson's section. So, to the first trend: data remains a major obstacle for AI. There was a survey completed recently by O'Reilly, talking to people who actually build AI models and have AI in production systems. If you look at the largest bottleneck, it's skilled people, hiring the right people.

The second is the lack of data or data-quality issues. Third is identifying the right use case, and fourth is culture. Those four major segments represent roughly 60% of the total bottlenecks, and the one related to the development of the model itself is training data quality.

That remains a huge issue, and it's been fairly consistent over the last few years. AI practitioners spend a lot of time preparing data. On the right-hand side is a quote from Airbnb, one of the, I would say, more advanced players in running AI production models at scale. They did some research and discovered that nearly 70% of the time a data scientist spends developing a model is not the modeling piece; it's actually collecting data and feature engineering, so extracting features, which can mean labeling the data.

So there's a huge amount of time being spent on data collection and preparation for AI. The next trend we see is that AI is becoming narrower. We'll quickly talk through a few examples we've supported at Appen as illustrations, but we see this across the board. The first example: we've supported a bizspeak model. Someone wanted to build a model to suggest improvements to common bizspeak.

You can think of it as: when you've written something in an email, there's a suggestion, hey, this looks a bit bizspeak-like, here's an alternative. Think about the challenge with this: bizspeak is highly nuanced, there are regional differences, there's context. It's a very difficult linguistic task to solve. So our task was to go out, collect a lot of bizspeak, understand the intent and provide suggested alternatives.

And we had to do this at very large scale, with a lot of context involved. The next example of narrow AI is related to personal training. There's a big push now to use computer vision to suggest training regimes and to monitor the performance of the person doing the exercises. The challenge is that a person's movement changes with age; particularly as people get older, they might be limited in their movements, etcetera. So one of the tasks we were asked to do was capture and annotate videos of seniors doing somersaults.

This is an actual task that we supported. So you start to get the idea of how specific some of the data collection work we do is, and it goes back to that class imbalance issue we spoke about: the data needs to be representative, even of the extreme case of seniors doing somersaults. The last example is about long-tail languages. COVID created a unique challenge where a lot of information needed to be shared digitally, in almost real time, around the globe. And this included some specific languages that not a lot of people natively speak.

The translation systems didn't support all of those languages, so we worked in a consortium with a lot of the large tech players to go and collect and annotate some very long-tail languages, to make sure that information about COVID was being disseminated not just in the common languages but across the board. A really important project for us, that last one. We spoke again about "a good AI model equals model instructions plus high-quality training data". But there's a question: okay, if I'm an engineer looking to improve the performance of my models, should I spend time on the model instructions or on the training data?

This is a bit of a long-standing question in the AI community. A quite respected AI practitioner, Andrew Ng, who has a company called Landing AI, tried to answer it. He had built a model to detect defects in steel sheeting: a computer vision model that takes photos of steel and automatically identifies, is that a defect or is that a piece of dirt, etcetera.

They built a model and got to a baseline performance of 76.2%. He then split his team into two tasks. One group was told: hey, go out and improve the code. Get the latest research from the largest tech players and do whatever you can to apply new architectures and new model code to the existing data, and see how you can improve it. He got another part of his team to go and improve the data.

So: let's not change the code; let's just go and collect more data, improve the labels, improve the quality of the data. And you can see the difference here; this is one example. Improving the code had almost, well, no impact on the performance of the model, whereas improving the data produced a really significant uplift. The average human performance for these types of tasks was 90%.

So they actually got it above human performance in identifying steel defects. Again, one example, but an illustrative view of the performance benefit that comes from looking at the training data composition. Another example here is the performance in a competition called ImageNet. ImageNet is a bit of the gold-standard competition in computer vision: you've got a few million labeled examples, and the task is to create a fairly general computer vision model where you load an image and it tells you what's contained within it. It has run over the past seven or eight years, starting with that core dataset.

The performance of the models has been able to get to 86.5%, and these are serious heavy hitters investing time in this competition. By providing extra training data, you can see, particularly in the later years, that there's been a significant uplift in performance. So, another example of the benefit of more training data and how it yields better accuracy and model performance. This really comes back to the shift from model-centric to data-centric AI. In a model-centric world, AI engineers use the available data and try to develop models that compensate for any noise or inaccuracies in the data.

You can think of it as holding the data fixed and trying to improve the model. Data-centric AI flips that on its head: it's all about improving the volume and/or the quality of the training data used to train the model. You might still try some different models, but the focus is on improving the data. So it's holding the model fixed and iteratively improving the data.
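A hedged, toy simulation of that loop is below: the model code stays fixed while noisy labels are repaired each round (standing in for human re-annotation) and the model is retrained. The dataset and the 30% noise rate are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)
labels = y_true.copy()
flip = rng.random(len(labels)) < 0.3        # start with 30% wrong labels
labels[flip] = 1 - labels[flip]

for round_no in range(4):
    model = LogisticRegression().fit(X, labels)   # same "instructions" every round
    print(f"round {round_no}: accuracy vs ground truth = {model.score(X, y_true):.2f}")
    suspect = np.where(model.predict(X) != labels)[0][:150]
    labels[suspect] = y_true[suspect]             # "re-annotate" suspect labels
```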

What we see occurring in the model development world is a shift from model-centric, where the constraint has been "here's the data I have", to a data-centric view, where a lot more focus is placed on how do I improve the data, how do I expand my datasets and enrich the data. The fourth trend we see is the need for training data management. We've spoken about the AI lifecycle, and there's a really important part that we focus on: the data collection, labeling and preparation piece. In pink down the right-hand side are the kinds of tasks we see, and doing these tasks well is really important.

But there's an entire set of capabilities emerging around how to support the development and management of training data. Things like version control get really important: if you build an AI model on one set of training data and the training data has since changed, you'll never get that same performance again. So managing training data versions is really important, one, for experimentation, so you can figure out which composition of training data worked better than others, but also for traceability.

If there are issues in real-world production, they can be linked back to quickly identify exactly which training dataset was used.
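A hedged sketch of one way to get that traceability, content-addressed dataset versioning (illustrative, not a specific product's scheme): any label change yields a new version ID, so a deployed model can be traced back to the exact data it was trained on.

```python
import hashlib, json

def dataset_version(examples):
    blob = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

v1 = dataset_version([{"file": "img_001.jpg", "label": "cow"}])
v2 = dataset_version([{"file": "img_001.jpg", "label": "yak"}])  # one label changed
print(v1, v2, v1 != v2)   # different IDs: the two datasets are distinguishable
```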

Training data security is another important issue that's emerging. We spoke before about the difference between traditional software development being code-centric and AI being data-centric. In traditional software, a hacker can go in and change the code; in the new world of AI, data is the most important piece, so placing security around the training data becomes really important, and there's going to be a lot of focus on that. There are a whole bunch of other issues, but those are just two examples of the ecosystem being built around the management and control of training data. Finally, and this is really important for us and where we're placing a lot of focus, is applying automation to the labeling process. There are three main buckets of automation that make sense for data labeling. The first is pre-labeling.

This is where AI performs an initial pass on the annotation, and the human work becomes more of a check and correction of the pre-labels. It's still human-annotated data; humans have done the validation and the correction, but we're using AI to speed up the process by providing a first pass. It significantly reduces annotation time, and we also see a fairly positive quality uplift through pre-labeling.
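A hedged sketch of the triage logic this implies: keep the model's first pass for a quick human check when its confidence is high, and route to full human annotation when it is low. The 0.9 threshold and field names are illustrative assumptions.

```python
def route(pre_labels, threshold=0.9):
    quick_check, full_annotation = [], []
    for item in pre_labels:   # item: {"file": ..., "label": ..., "confidence": ...}
        bucket = quick_check if item["confidence"] >= threshold else full_annotation
        bucket.append(item)
    return quick_check, full_annotation

check, redo = route([
    {"file": "a.jpg", "label": "car", "confidence": 0.97},
    {"file": "b.jpg", "label": "car", "confidence": 0.41},
])
print(len(check), "to quick review;", len(redo), "to label from scratch")
```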

The next is what we call speed labeling. This is where AI is used to assist the crowd worker in the labeling process; you can think of it as similar to an autocomplete function, with humans plus AI working together to get to a faster and higher-quality outcome. Finally, where we use automation in the labeling process is what we call smart validators. When the crowd worker has completed the annotation, there's a layer of checking of that completed work prior to sending the file back and moving on to the next stage of the annotation.

The benefit of validators is, of course, that they improve quality, but they also act as a guide for crowd workers on how they might do things differently in the future to get better performance. So, we've spoken a fair bit about training data. We'll move on now to some observations around the modeling techniques and how they're evolving. One thing I think is really important to understand is that AI-enabled applications typically involve a large number of models that rely on a large number of modeling techniques. The example here is a voice interface system.

Think about your favorite at-home voice-interactive product. There are three main technical blocks, in dark blue on the left-hand side. First you've got language processing: these are the models that hear what you're saying. There's a wake word, then they listen to what you're saying, and that's processed into text; this is the speech-to-text component. The next is intent handling.

It's one thing to transcribe audio to text; the other part is the natural language understanding component, which is highly complex and requires a lot of different types of models. One part is understanding the intent, but it's also then matching that to the knowledge of the system. And finally, there's response generation, which typically involves different types of responses.

One is the spoken audio returned to you. Let's say you wanted to start a timer on your phone; the voice interface system would say, okay, I've started the timer. But it also needs to go into the application and actually start the timer; that's the actual activity involved. On the right-hand side, and I know this is very hard to see, all of these smaller boxes are different types of techniques and algorithms used for each of those processes.

So you can see there are a lot of different models that need to be brought together to deliver an AI-enabled application. And in the real world, what we see is that there's typically not one modeling technique used end to end. Across the top here are some techniques; this is not exhaustive. You've got transfer learning.

Transfer learning is when you take a piece of a model that's been trained on something else and slot it in, which gets you some benefit out of the model. Self-supervised learning is a modeling technique where no data annotation is required for the training. And the next three are examples of supervised learning. The first might use what we call off-the-shelf data: data that can be bought from a marketplace or a third party, applicable but not necessarily totally specific to the application. The next is supervised learning using AI-assisted human annotation. And then there may be a requirement for supervised learning with human-annotated data only.
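A hedged sketch of the first technique, transfer learning: reuse a backbone trained on something else and train only a new head for your own task. It assumes torchvision's pretrained ResNet-18 purely as an example, not anything named in the talk.

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")   # the borrowed piece
for p in backbone.parameters():
    p.requires_grad = False                            # hold it fixed
backbone.fc = nn.Linear(backbone.fc.in_features, 5)   # new trainable head, 5 classes
```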

The composition of how the different modeling techniques are brought together varies, as you can imagine. In the example of a U.S. English chatbot for a bank: U.S.

English is a very common language, and retail banking is quite a common industry. So there might be a fair number of models that can be used for transfer learning, specific techniques for self-supervised learning, and even a set of off-the-shelf data specific to U.S. English and retail banking.

That will get you a long way in the model development stage. Then you start to get into more specifics. A specific bank is going to have different product sets with different names, terms and conditions, a whole set of company-specific taxonomies, and that's where data needs to be collected. In this case, supervised learning using AI-assisted annotation might get you a long way, and there might be some requirement at the end for human-annotated data where AI-assisted models haven't been developed yet. As you work down in specificity, the next example is a French chatbot.

You start to become more reliant on custom data collection and custom data annotation, simply because the existing models and off-the-shelf data don't exist. The third example is a Qatari Arabic chatbot for a marine insurance company. You start to get the idea that more specific AI models don't have the luxury of a lot of pre-existing work to draw on, so the type of data and the techniques being used start to get very custom.

That said, we've spoken about a limited number of techniques, and there's a huge amount of research being put into new AI approaches. On the left-hand side are the numbers of papers being posted to arXiv, which is a quite common place for researchers and other academics to post their papers. The numbers on the chart are in thousands; in 2019, there were almost 30,000 publications on AI posted to arXiv.

On the right-hand side, there's a lot of research being done by new teams. We're in a nascent industry, and it's emerging very quickly, so there's a lot of forward progression in the types of modeling techniques being used. What we see, though, is that the popular AI techniques actually used in mature AI practices still rely on human involvement. Here on the left-hand side are results from a survey of well over 3,500 companies:

what are the different modeling techniques they're using? You'll see the first is supervised learning; that's where examples are provided with meaning assigned. Then deep learning, typically also supervised, another approach where humans are required for the labeling preparation. Then human-in-the-loop and active learning, and knowledge bases and knowledge graphs.

These are all techniques where a level of human annotation is required. So while there's a lot of research being put into advancing AI, we see in the real world that humans still play a big role in the creation of high-quality training data. So, as Mark mentioned, we've been on a journey: a transformation into an AI-powered provider of AI data and solutions.

We've gone from data types that were language-focused to supporting a wide variety of AI use cases, and from a service-led delivery model to something that relies heavily on our products. This comes with a shift from project-based to more committed revenue. Our customers have been concentrated, and our products have allowed us to support a greater diversity of customers.

Mark spoke more yesterday about the org structure and reporting. This evolution has not occurred overnight; we've been on a journey. Through the acquisitions of Butler Hill and Leapforce, Appen Connect became a really important part of our tooling and infrastructure. Phase two was acquiring Figure8, which gave us a very strong base set of capabilities, but we've continued to invest in and evolve our products.

This has led us to be very focused on a product-led expansion. Products are really important, but they're one part of the capabilities we offer. It's the combination of our crowd, well over 1,000,000 strong, our deep internal expertise in how to deliver high-quality training data, and the products: that's the real differentiator. We're going to focus a lot today on the products, and Wilson will talk more about this.

I'm going to give a quick intro, but there are five main components to our product suite. The first is Appen Connect. Appen Connect is our product that manages our global crowd workforce and does a lot of the matching of the crowd to the work; it's a really smart marketplace that matches the global crowd with projects. We're applying a lot of AI and building a lot of smarts to make that as seamless as possible. The next is the Appen Data Annotation Platform.

This is the real engine of the company, where the crowd workers complete their tasks and our customers can set up and monitor performance and create really bespoke annotation tasks for our crowd workers. Then we've got a set of new products that we're really excited about and that are going to make a real step change in the performance of the business. The first is Appen Intelligence: the set of models we use to improve automation throughout the business. This includes what we've spoken about in the annotation process (how do we improve the productivity of the crowd workforce and deliver better quality for our customers), but it also includes processes to manage the crowd and those workforce tools.

It's a really big part of what we're focusing on. The next is in-platform audit. Wilson will talk a lot more about this, but in-platform audit enables our customers to better understand the composition of their training data. We spoke a lot about class imbalance and quality errors; these can be very hard to diagnose and navigate when you've got datasets of hundreds of thousands of images, as an example.

In-platform audit is a way for our customers to really easily navigate and narrow in on areas that need to be addressed, where performance needs to be improved or more data might need to be collected and brought in. Finally, and this is one I think is super exciting, is Appen Mobile. It's a really great mobile interface that serves a couple of purposes. One, it's a way for crowd workers to engage with us: log into the system and identify what jobs are available to them. And secondly, it serves as a different form factor for data collection and annotation; there are a bunch of features in the mobile-specific domain that aren't available on desktop.

Things like location and other sensors that are inherent in mobiles but not elsewhere. Again, Wilson will talk through all of these; this is just a quick intro. What's really valuable, though, is that these products create a huge amount of value for our customers. First, we've spoken a lot about AI-augmented data labeling and collection; that really improves the speed, quality, scale and unit economics of the work we do.

AI-enabled crowd management increases our internal productivity and improves the experience of our crowd. We've got a lot of expertise in the company, and we're trying as much as we can to productize that expertise and build it into our products, so it automates a lot of the high-quality work. We've also got a lot of inbuilt crowd management features, and this reduces risk for customers, particularly those looking at different crowd solutions and thinking about how they work with a very large crowd.

Then finally, we spoke about the combination of the crowd with technology. That's a real competitive differentiator for us and enables us to do a lot of the work to solve some of those problems we spoke about early on around data quality, diversity and bias. So, we'll have a break now. Wilson's got a really exciting presentation and set of demos, and after that we'll go into some Q&A.

We'll leave now and come back in around 25 minutes.

Speaker 1

Hello, and welcome back. We'll take you over to Wilson shortly, but before that, just a brief recap of some of the things Ryan spoke about. He took us through the evolution of the company from language service provider to AI data provider, from a services-led company to a product-led company. He also took us through the importance of training data, and the importance of quality in particular. He mentioned the number of different techniques that are used and the number of different training data types that are available.

Overall, we're in a complex and evolving space, and that requires a rich set of technologies. We'd like to take you through those now. I'll hand you over to Wilson, who's in our Bay Area location and is pleased to join us via the technology. Take it away, Wilson.

Speaker 3

Thank you, Mark, and welcome back, everyone. As Ryan has shared, the AI industry is moving from model-centric to data-centric AI, and to support that we have evolved our product suite significantly. We upgraded our existing products to give our customers and the crowd a better experience, we built new products to support new use cases, and we brought in a lot of machine learning capabilities to drive efficiency and unit economics. We now have an intelligent platform with a lot of automation capabilities, where a human only needs to be involved when necessary. Let's take a look. We have two existing products: Appen Connect and the Appen Data Annotation Platform.

Appen Connect is the platform where we match our global crowd to annotation tasks and where the crowd can deliver tasks: they can collect data, they can annotate data. It's also the platform where our customers can manage their tasks in a self-service manner.

Both Appen Connect and the Appen Data Annotation Platform have evolved a lot, with many new features, a better experience and a lot of new AI capabilities. We also developed three new products; this is what excites me the most: Appen Intelligence, in-platform audit and Appen Mobile. They already make a huge difference to our business, our customers and our crowd of contributors. Appen Intelligence includes proprietary machine learning models to empower the other products. It has models to automate labeling tasks.

It also has models to automate project management tasks. To support data-centric AI, just collecting data and annotating it is not enough. Our in-platform audit helps data scientists really analyze the training data so they can understand the quality, distribution and potential bias in the data. As you've probably already heard from Ryan, it is really essential to get the data right so they can get better AI performance.

Last but not least, Appen Mobile is our new mobile app. It upgrades the crowd experience, enabling different types of data collection tasks, and it also helps Appen increase our reach to an even broader crowd group. The AI data industry values quality, speed, scalability, security and unit economics. Our products support all of them and really keep our business ahead of the competition.

Now let's look at some details of those different products. Let's first look at Appen Connect. Appen Connect is used by over 1,000,000 crowd workers as well as the Appen internal teams. Project managers set up projects and tasks; crowd workers find a project and deliver tasks. There are two major focuses for Appen Connect.

Number one is efficiency and scalability. We optimize the user experience so that both the crowd workers and our internal team members can be very efficient, and unnecessary effort can be saved. Number two is automation. We want to automate the project management effort as much as possible; we'd like the platform to manage the crowd with a minimal level of human involvement. Let's look at a very typical project lifecycle.

A project manager will set up a project and then source the workers, the candidates, to work on their project. They help those workers ramp up their skills and pass the qualification, and then the workers can start to work on the project; we welcome the worker to the project. The project manager needs to track progress: their productivity, their quality. And when a worker bumps into any issue, the project manager needs to support them to fix it.

Meanwhile, the project manager also needs to constantly detect fraudulent users and take them out of the projects. So it is a pretty complicated and long lifecycle, and some of those tasks are very time-consuming: sourcing candidates, supporting workers when they bump into issues, and doing fraud detection. Those tasks can take a lot of human effort.

We are using Appen Intelligence to automate them and make Appen Connect an intelligent marketplace. Let's look at the automation of the task of sourcing crowd workers. Within Appen Intelligence, we have built a crowd database, which contains a lot of crowd data: their behavior data, their project histories, their skills, their quality and productivity data. Based on those data, we build machine learning models to recommend workers for projects, or recommend projects for workers. We also built a machine learning model to detect fraudulent users.

With those AI capabilities, when a project manager finishes setting up a project, Appen Intelligence can understand the sourcing requirements and find suitable workers for the project. We send out personalized notifications to the workers, and if a worker is interested, they apply for the project. Once a worker has applied, Appen Intelligence will further screen them, checking whether they are eligible or a potential fraud risk. Based on that information, Appen Intelligence passes them or fails them.

If the worker passes the auto-screening, they will be activated on the project automatically. The steps here in green can be done by Appen Intelligence; the steps in gray are manual steps from the contributor. So with Appen Intelligence and this automation, we now save a huge amount of effort for project managers. They only need to get involved when our machine learning model is not sure about a decision; the rest of the time, those tasks are all automated.

But during those times, those tasks All got automated. Let's look at another example, fraud detection. Even Appen Connect is a marketplace that can be fraudulent users. If you look at the example at the left side, Those two accounts, they are from the same IP. And 1 user normally works from 3 am to 5 am.

That is very suspicious; they could be fraudulent users. It is key to remove those users so that project quality is not compromised and we don't pay unnecessary costs. However, you can also understand that analyzing the activity of over 1,000,000 workers is not possible for humans. Fraud detection models from Appen Intelligence help us do the job. They process more than 1,000,000 users every day, handling 200-plus signals for every user, and the fraud detection model's accuracy is pretty good: about 95%.
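A hedged sketch of the kinds of per-account signals such a model might consume; the three features here are illustrative assumptions (the real system is said to use 200-plus signals).

```python
from collections import defaultdict

def fraud_signals(account, accounts_by_ip):
    hours = account["active_hours"]                  # e.g. [3, 4, 5] = 3-5 a.m.
    return {
        "shares_ip": len(accounts_by_ip[account["ip"]]) > 1,
        "night_ratio": sum(1 for h in hours if h < 6) / max(len(hours), 1),
        "tasks_per_minute": account["tasks_done"] / max(account["active_minutes"], 1),
    }

by_ip = defaultdict(list)
acct = {"id": "w1", "ip": "10.0.0.8", "active_hours": [3, 4, 5],
        "tasks_done": 900, "active_minutes": 120}
by_ip[acct["ip"]].append(acct["id"])
by_ip["10.0.0.8"].append("w2")                       # second account, same IP
print(fraud_signals(acct, by_ip))                    # features feed the classifier
```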

It's about 95%. So those models are used in a lot of places. It checks user during the new user registration. It's used to screen the product application. It also runs in the back end all the time To detect any suspicious activity.

Fraud detection with machine learning not only automates a huge amount of human effort but also supports a scale humans just cannot handle; we're talking about over 1,000,000 workers. So for Appen Connect, as you can see, it creates huge value for our customers and our crowd. It connects customers with our global crowd, it automates a lot of project management work and reduces overhead costs, and it enables our business to scale and support future growth.

Future investments focus on two areas. First, we will continue to optimize and create a very good experience for both the crowd and the internal team members. Secondly, we will continue to add more and more automation so that manual effort becomes less and less. Now let's take a look at the Appen Data Annotation Platform. The majority of data annotation platforms in the AI data industry only focus on certain areas.

Some focus on computer vision, some focus on audio and language, while the Appen Data Annotation Platform has the breadth and depth to support all kinds of use cases. It has tools to support different types of data collection, tools to do content relevance, tools to annotate audio and text data, and tools to support image, video and 3D point-cloud data processing. Meanwhile, no matter how many tools you have, there will always be some special customer need you haven't heard before. So we also have a powerful tool called Job Designer, which helps customers design a new tool very easily.

Job Designer also provides a markup language called CML, which developers love. Those annotation tools are pretty powerful and work really well for a single task. But some AI data use cases are very complicated and need multiple steps and different operations to get the data right. An operation can be a human labeling job, a machine learning model or a script to process the data. Appen Workflows enables those use cases.

It stitches all those different operations into a flexible workflow. Let's see how it works. Please play the first demo video.

Speaker 4

AI applications can be complicated, and customers often need the same data to pass through multiple jobs to satisfy their project requirements. Before Workflows, linking multiple steps for complex annotations used to be a manual process. Not anymore. Let's see a real-world example. A restaurant review platform must first categorize a continuous stream of user-generated content to identify key attributes. When a photo contains a menu, the customer would like the menu transcribed.

When a photo contains food, the customer would like to know if it's a main course. And when a photo is outdoors, the customer would like to know a bit more about the restaurant's outdoor amenities. Workflows make it simple for customers to break their complex projects down into smaller basic steps that are connected by flexible routing rules. Routing rules can be set based on specific answers or result confidence, or to route random samples into QA jobs for further review. As always, settings such as targeting, quality controls and pay can be customized on a per-job basis.
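A hedged sketch of what routing rules like those in the demo might look like; the structure and job names are illustrative, not the platform's actual configuration format.

```python
routing_rules = [
    {"when": {"answer": "contains_menu"},  "then": "menu_transcription_job"},
    {"when": {"answer": "contains_food"},  "then": "main_course_question_job"},
    {"when": {"answer": "outdoor_photo"},  "then": "outdoor_amenities_job"},
    {"when": {"confidence_below": 0.8},    "then": "human_review_job"},
    {"when": {"random_sample": 0.05},      "then": "qa_audit_job"},   # 5% spot check
]
```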

Today, most steps in workflows are jobs, but customers can leverage a growing catalog of machine learning models and scripts to automate a variety of simple tasks. With workflows,

Speaker 3

I hope that gives you a better understanding of how complicated those AI data use cases can be, and it's great to be able to support those complicated tasks. Meanwhile, it's also very important to guarantee quality. Quality is always one of the most important factors for training data, and the Appen Data Annotation Platform has a rich set of quality control features. The number one form of proactive quality control is test questions, where customers can define ground-truth data.

Those data can be used to qualify workers before they start, or to monitor their quality performance during the job. QA workflows use a different methodology, where we ask highly qualified workers to review and correct the annotations from other workers. Dynamic judgments collects judgments from multiple workers and aggregates the results to get a high-confidence answer. Machine learning validation (this is the smart validator Ryan mentioned earlier) uses machine learning predictions to validate the annotations from the workers, which is very useful in certain use cases. Normally, when we handle projects, we use some of those features together to achieve high-quality output.
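A hedged sketch of the simplest form of judgment aggregation: majority vote, with the agreement ratio as a confidence score. Real systems also weight by each worker's measured quality; this is not Appen's actual algorithm.

```python
from collections import Counter

def aggregate(judgments):
    counts = Counter(judgments)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(judgments)   # answer plus agreement as confidence

print(aggregate(["cat", "cat", "dog", "cat", "cat"]))  # ('cat', 0.8)
```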

Security is another core consideration. You probably understand clearly how important security is for AI data. Appen's Data Annotation Platform provides very flexible deployment options: customers can use the platform in our public cloud, deploy the platform in their private cloud, or run it in a completely air-gapped environment. For customers who use our public cloud, there's a feature called secure access.

With it, they don't need to move their data into our platform; we only access their data when a worker is labeling it, and that access expires after the data is labeled. This creates an additional layer of data protection. Our platform also meets the major security and privacy compliance standards, like SOC 2, GDPR and HIPAA. With all those security features, our customers' data is well protected.

Our Data Annotation Platform creates huge value for our customers. The full suite of tools supports different customer use cases, Appen Workflows enables complex training data preparation, and the quality and security options help our customers get high-quality training data while protecting their data. Future investment in the Appen Data Annotation Platform focuses on several areas.

We will continue to evolve the tools to support newly surfaced use cases; for example, our team is now working on building a new tool to handle satellite imagery tasks. We're also working to provide better APIs for tighter integration with our customers' systems. And quality and security are never-ending efforts: we will continue to invest more and more in those features and offerings to help our customers get high-quality training data and protect their data. Now let's look at Appen Intelligence. We build products to prepare training data for AI companies.

Meanwhile, Appen itself is also an AI company, and machine learning is used in all our products. Appen Intelligence provides those machine learning capabilities. We saw earlier how Appen Connect uses Appen Intelligence to automate the sourcing of workers and to do fraud detection. Now let's see how it helps automate the annotation effort. Appen Intelligence provides proprietary machine learning models across different data categories.

It has models to identify speakers, to detect languages, to segment audio files, and to convert audio to text for voice recognition. It also has models to analyze text data, to detect gibberish, to extract entities and to do text classification; those are models commonly used in the natural language processing field. It also has models to process image and video data: transcribe text from an image, detect objects, and generate face landmarks or blur faces to protect privacy. It also has a lot of models to handle 3D data, for object detection and tracking. Those models are used to pre-label data, so a human only needs to review the pre-labeled results instead of labeling the data from scratch.

The models are also used to check the annotations from humans, to check their quality and to validate their results. This helps data quality and labeling speed, and saves a lot of labor costs. To better understand how these models are used, let's see a few examples. Understanding documents with machine learning has become very popular now. Finance companies want to process receipts. A law firm wants to find legal information in a document.

To train those machine learning models, we need to transcribe text from images, from PDFs or other files. Now let's see how that training data is labeled.

Speaker 4

Now customers can get OCR training data with machine learning assistance. But first, let's see how it works manually. As a customer, I upload images that I want transcribed and configure the tool. Boxes are then drawn and transcribed one by one, and a whole document will take a long time to finish. Now let's see how machine learning assisted pre-labeling can make this process more efficient. As a customer, I will use the workflows feature to set up the job.

The workflow contains 2 steps. First, we use a machine learning model to automatically draw bounding boxes and transcribe the text in each box. The second step is to route the annotated images to contributors to review and modify as needed. Now I'm ready to upload the data and launch the workflow. After the machine learning model pre-labels the data, contributors will see the image with bounding boxes and transcribed text already in place.

Their job is to review the results and make edits if needed. Thanks to machine learning assistance, OCR transcription can be 5 times faster compared with the manual alternative.
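
A minimal sketch of that two-step pattern, assuming a generic `ocr_model` callable that returns boxes with text hypotheses (a placeholder, not a specific Appen API):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int
    text: str
    reviewed: bool = False

def prelabel(image, ocr_model: Callable) -> List[Box]:
    """Step 1: the model pre-draws boxes and fills in text hypotheses."""
    return [Box(x, y, w, h, text) for (x, y, w, h, text) in ocr_model(image)]

def human_review(boxes: List[Box], corrections: Dict[int, str]) -> List[Box]:
    """Step 2: contributors only fix the boxes the model got wrong.
    `corrections` maps a box index to the corrected text."""
    for i, box in enumerate(boxes):
        if i in corrections:
            box.text = corrections[i]
        box.reviewed = True
    return boxes
```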

Speaker 3

Clearly, machine learning assistance has helped the OCR data labeling. Let's see another example. Voice recognition is another widely used AI technique. Let's see how to prepare training data for a voice recognition machine learning model.

Speaker 4

Now let's look at a voice recognition example. We provide a comprehensive suite of tools and models to enable the information extraction. With manual labeling, there is a 2-step process, starting first with audio annotation. So, what would we like to talk about today?

Speaker 2

I think we're going to talk about weddings.

Speaker 4

Okay. So what was that about today? As you can see, the end-to-end manual annotation process is labor intensive and time consuming. Now let's take a look at Appen's ML-powered audio annotation tool. This uses a diarization model to automate the audio segmentation step, then uses an automatic speech recognition model to generate a transcription hypothesis.

Finally, we add a human review job to correct any possible mistakes in the model hypothesis. In this final step, the transcriber will see all the transcription hypotheses generated, and all they need to do is correct any occasional mistakes. We have estimated that this is 20
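
The pipeline just described might look roughly like this, where `diarize` and `transcribe` stand in for the diarization and speech recognition models rather than actual Appen calls:

```python
def annotate_audio(audio, diarize, transcribe):
    """Diarization -> ASR hypothesis -> human review, as in the demo.
    `diarize` returns (speaker, start_s, end_s) tuples; `transcribe`
    returns text for one segment. Both are placeholder callables."""
    hypotheses = []
    for speaker, start, end in diarize(audio):
        hypotheses.append({
            "speaker": speaker,
            "start": start,
            "end": end,
            "text": transcribe(audio, start, end),  # ASR hypothesis
        })
    # A human review job receives these pre-filled segments and only
    # corrects occasional mistakes instead of transcribing from scratch.
    return hypotheses
```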

Speaker 3

Now let's switch gears to computer vision. Autonomous driving is probably the most exciting AI use case in the computer vision field. To do autonomous driving, the car needs to understand the environment surrounding it. We use sensors like cameras and LiDAR to collect data. LiDAR is a special type of sensor which collects 3D point data of objects in the surrounding environment.

And the machine learning model needs to understand that 3D data, those 3D points, and classify them as cars, pedestrians, bicycles or other object types. To prepare that training data, crowd workers need to operate in a 3D environment, and it requires some special skills. Normally, it takes a long time to label that 3D data. Now let's see how the 3D point data gets labeled.

Can we please play the fourth demo video.

Speaker 4

In this video, we will demonstrate 2 scenarios: labeling LiDAR data manually and labeling LiDAR data with machine learning assistance. This example has 26 frames, and the contributor needs to label all the objects in all those frames. First, the contributor needs to familiarize themselves with the 3D environment. They can navigate around and refer to the image data on the upper left. Then they need to find the car and draw a 3D cuboid, specify some attributes of the car and then adjust the size to make it a tight fit.

Then they go on to the second frame, add a cuboid and repeat the process. They continue this process for all 26 frames. Once all the cars have cuboids, the contributor steps through each frame and makes adjustments frame by frame. After adjustments are made, they can play back the video one last time as a final review. Now let's look at how machine learning assistance can help.

The contributor still needs to do the same process for the first frame: find the car, add the cuboid, specify some attributes and make some adjustments. But when they move to the next frame, they find the cuboids of that car are already there, and they just need to make some small adjustments. They can also switch to grid view, where they'll find that the cuboids have been generated automatically in all 26 frames. They can just navigate across multiple frames, make some adjustments, and the tool will interpolate backward to add or modify the annotations.
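
One simple way to think about the automatic in-between frames is interpolation. The sketch below does plain linear interpolation between a cuboid's first and last keyframes; real tracking models are far more sophisticated, and the field names are invented:

```python
def interpolate_cuboids(first, last, n_frames):
    """Linearly interpolate a cuboid between two keyframes.

    `first` and `last` are dicts of numeric fields for the same object,
    e.g. {"x": ..., "y": ..., "z": ..., "yaw": ...} (invented names).
    Returns one cuboid per frame, endpoints included.
    """
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append({k: first[k] + t * (last[k] - first[k]) for k in first})
    return frames

# A car moving 10 m forward over 26 frames: the contributor places the
# cuboid in the first and last frames, and the in-between frames are
# filled in automatically.
track = interpolate_cuboids({"x": 0.0, "y": 2.0, "z": 0.0, "yaw": 0.0},
                            {"x": 10.0, "y": 2.0, "z": 0.0, "yaw": 0.1}, 26)
```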

Speaker 3

Self-driving is pretty challenging. It needs to handle different situations. We have seen how this cuboid, how this object, is labeled. But besides detecting the objects surrounding the vehicle, the car also needs to detect the lane lines on the road. Now let's see how lane lines can be labeled.

Can we please play the next demo video.

Speaker 4

In this video, we will demonstrate two approaches to LiDAR lane line segmentation: labeling lane lines manually and labeling lane lines with machine learning assistance. Let's start with manual annotation. First, the contributor needs to zoom in so they can see all the points of the lane line. They'll then draw a tight bounding box to include those points, find another lane line and repeat the process. It's not a very complicated task, but it is time consuming, and our contributors have to label thousands or tens of thousands of these lane lines. Now let's look at the machine learning assisted approach.

Switching to intelligent mode, the contributor can draw one large polygon around the whole lane, and the machine learning model will detect the lane lines to create annotations for each individual line. Thanks to machine learning assistance, LiDAR lane segmentation can be 6 times faster.

Speaker 3

As we have seen from those demos, machine learning assistance is very powerful.

Here is a quick summary of all the productivity differences we have observed. For audio and speech, machine learning assisted annotation can be up to 1.6 times faster, and in OCR the results are even better. 2D image bounding box labeling with machine learning assistance can be 30% faster, while OCR with machine learning assistance can be 6 times faster. It also works really well with 3D data.

It can be 4 to 6 times faster. Labeling in the 3D environment is really complicated for humans, so machine learning assistance is very powerful there. Meanwhile, it doesn't really work well for content relevance tasks. Content relevance tasks are very often subjective. They usually require people with a certain cultural background and are super hard to automate.

So overall, a lot of the data annotation effort from the crowd workers is now being automated by Appen Intelligence. That automation improves the data quality. Clearly, Appen Intelligence creates huge value for our customers. Adding all those machine learning capabilities to other Appen products automates the crowd effort and lowers the unit costs. It also helps to improve the delivery speed as well as the data quality, and it automates project management effort so that our business can easily scale. You may recall earlier how Appen Intelligence is used on the Appen Connect side.

In the future, we will just continue to add more AI capabilities to automate more use cases on both the worker side and the customer side. Let's now move to In-Platform Audit, which is the new product we released last month. It's still at an early stage, but I'm super excited about In-Platform Audit. It already brings a lot of value for our customers. As you have seen from the slides Ryan shared earlier, training a good AI model can be expensive. It needs a lot of training data.

It needs compute; it also needs hardware and computation power. It needs effort from a data scientist team. If there are problems with the model, it's better to find those earlier rather than later, so that you don't need to redo all this work, because redoing all this work really introduces a lot of cost.

And AI model performance is driven by the training data, so debugging and detecting problems in the training data early on is key to the model's success. In-Platform Audit is designed to help data scientists analyze training data, whether that's raw data before labeling or data after labeling. And with ground truth data, we can also use it to evaluate model performance. In-Platform Audit will help data scientists detect all these data problems, like class imbalance, accuracy or quality issues, or label imbalance.

Those data problems might not be that straightforward to understand, so let's use an example to explain. Let's say I want to train a machine learning model to classify whether a tweet is a positive tweet or a negative tweet. To train the machine learning model, I first need to collect training data. I scrape 10 million tweets from the Internet.

If 9 million of them are from males and only 1 million are from females, then a model trained on that data set might not work well for tweets from females. This is a class imbalance problem. I detected the class imbalance problem and fixed it. Now I have 5 million tweets from males and 5 million tweets from females. Then I get people to help me label those tweets.

And when I review those label results, I find a lot of positive tweets got labeled as negative. I've got a data quality problem: the accuracy is not high. Now I've detected the accuracy problem and fixed it. However, of those 5 million tweets from males, 4 million of them are positive, while 1 million are negative.

Although the data labels are accurate, I've got a label imbalance problem, which will cause a lot of problems for my model later on. I also detected the label imbalance problem and fixed it. The data set is now well balanced and has high quality, and you can imagine that a model trained using this data set is likely to have good performance. I hope this gives you a good sense of how these training data insights help to detect and fix data problems.
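
A toy version of those checks, using the tweet example above, might look like the following. The thresholds and function names are illustrative, not the product's actual rules:

```python
from collections import Counter

def audit_labels(rows, max_skew=0.7):
    """Flag group (class) imbalance and label imbalance in labeled data.

    `rows` is a list of (group, label) pairs, e.g. ("male", "positive").
    The 0.7 skew threshold is illustrative only.
    """
    issues = []
    groups = Counter(g for g, _ in rows)
    labels = Counter(l for _, l in rows)
    for name, counts in [("group imbalance", groups),
                         ("label imbalance", labels)]:
        top_share = max(counts.values()) / sum(counts.values())
        if top_share > max_skew:
            issues.append(f"{name}: dominant share {top_share:.0%}")
    return issues

# Miniature version of the tweet example: 9 male rows vs 1 female row
# triggers the group check; labels (6 positive vs 4 negative) pass.
sample = ([("male", "positive")] * 5 + [("male", "negative")] * 4
          + [("female", "positive")])
print(audit_labels(sample))  # ['group imbalance: dominant share 90%']
```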

In-Platform Audit provides additional value to our customers. It essentially enables customers to understand their training data, find problems and fix them, which in turn will help them improve their AI performance. We just released In-Platform Audit last month, and there's a lot more to do. Currently, In-Platform Audit focuses on training data analytics, and we will expand it to support model performance evaluation in the future. Ryan also mentioned that there is a trend where people need a lot of tools to manage all that training data.

We are also adding more training data management features into In-Platform Audit. So this is a super exciting product, and it can evolve to be a powerful training data analytics tool. Now let's move to Appen Mobile. We released Appen Mobile early this year. The new mobile app provides a very intuitive user experience. As a crowd worker, they can register to become a user quickly, find projects easily and work on all kinds of data collection tasks. The app has made data collection easier than ever.

Location-based apps have also become very popular, especially during the pandemic, and those apps need location-based data. So the new mobile app is great: it provides a better experience and also supports more data collection use cases. But that's not the only benefit it brings. The new app also increases our reach to mobile-only crowd workers.

That population is actually pretty big. There are a lot of people who only use mobile; this is a pretty big population in Asia and other developing countries. But enough talking about this app. Let's see how it works.

Can we please play the sixth demo video.

Speaker 4

Appen Mobile provides a simple and straightforward registration flow for contributors. New contributors just need to input their e-mail, name, country and state or province as appropriate, and then submit the registration request. They receive a code by e-mail, and after verifying that code, the registration is complete. A contributor can easily log into the app and see projects recommended for them. They can simply click a project to learn more details, and they can apply for those projects if interested.

Some projects may ask contributors to pass a qualification first. After the contributor qualifies for the project, they can start to work on project tasks. This particular project asks the contributor to complete multiple steps, including collecting audio data, transcribing that audio data, tracking their eye movements and then collecting handwriting samples. The first step is for the contributor to collect audio data. They will need to speak as directed by the prompt, record the audio and then transcribe the audio. The app supports the collection of a conversation between multiple people.

One contributor will send an invitation link to the other contributor, and with that link, those 2 people can start a conversation, which will be recorded. Now let's see how video data tracking eye movement is collected. In this example, we will record a selfie video where a contributor watches a bouncing ball on their screen while we capture their eyeball movements. This data helps our clients train their eye-tracking models. Contributors can even do handwriting tasks within our app.

In this case, we are collecting the contributor's handwriting strokes, with start and end times and key point coordinates, while they write.

Speaker 3

Appen Mobile creates huge value for both our customers and the crowd workers. It gives crowd members a much more intuitive experience: they can engage with Appen at any time, from any place. It enables a lot of different data collection use cases, and it also helps us reach a much bigger crowd population.

In the future, we are going to invest more in this mobile app. We're going to support new data collection use cases and also support other data types and different tasks. Whatever task fits on a mobile screen, we want to try it on mobile, too. And with the mobile app, we're going to actively expand the crowd to support diversity and also impact sourcing. So this wraps up my presentation. I will now hand it back to Ryan.

Speaker 2

Thanks, Wilson. I'll spend a little bit of time on a recap and a close before we head into some Q and A. So Wilson took us through the product suite, from Appen Connect, which is used for our crowd management, to the Appen data annotation platform, used by our crowd workers to do the labeling and also by our customers to set up and customize jobs, and then some of the great new features that we've rolled out more recently: Appen Intelligence, In-Platform Audit and Appen Mobile. We spoke about how these capabilities unlock huge value for our customers, from AI augmentation in data collection and labeling, delivering speed, quality, scale and unit economics, to the crowd management and some of the AI that we're using in that, including fraud detection that really increases our internal productivity but also improves the crowd experience.

We've embedded a lot of expertise in the tools, and that's really helping us deliver high quality annotation work for our customers. We've got inbuilt crowd management features, and doing the crowd management on behalf of the customers is really important for them. And that native integration with our crowd creates the competitive differentiation, where we've got a complete set of tools and capabilities both from a technology standpoint and from a crowd standpoint. That's the real differentiator at Appen and how we unlock a lot of value for our customers. It's not just about having the right tools, the large crowd, or the expertise.

It's bringing all three of those together. And that's what our customers really value from us, and that's what we'll continue to focus on in the future. The product is going to be a very large part of what we do. So I'll now hand it to Mark, who will moderate our Q and A session. I think we've got about 30 minutes.

Maybe a little bit more is allocated for some questions.

Speaker 1

Thanks, Ryan, and thanks, Wilson.

I hope you all enjoyed the presentations from Ryan and Wilson, and the demonstrations as well. We have some questions. I'll read through them and throw them first to Ryan, and he can loop in Wilson as required. So the first question is: how does Appen's own data labeling platform and AI investments compare with competitors such as Scale AI? How are they different? Are you investing enough in R and D to keep up with new entrants?

Ryan?

Speaker 2

Thanks, Mark. It's a good question. So we monitor all of our competitors, as you can imagine, from information that's externally available, and we often do feature comparisons to understand where we are relative to the market. We also speak to our customers; a lot of our customers obviously will look at different products in the market and see how we compare. To the best of our knowledge, we have a very comparable set of products, and some areas where we have a lot of deep expertise that is built into our products and creates a lot of differentiation for us. So I think from a technology standpoint, our product suite,

which Wilson just took us through, is comparable and in some areas leading the market. Like what we were just talking about before, there's huge value in the combination of the product suite with the crowd. So we've got the products, and that's comparable to the market; it's the combination with the crowd and our expertise that really makes a huge differentiation for us.

Speaker 1

Yes, thanks, Ryan. I'd also add that we do know that our breadth of functionality is superior to many of our competitors, who tend to focus in on one area. Recall the evolution of our business from a language data provider through to a multimodal data provider. So earlier on in our evolution, we were focused on language and speech data. Similarly, the new entrants are focused on a particular area, mostly image data.

The question also asks about the rate of investment. Clearly, there's visibility into our investment through our publicly available accounts. We don't have that same visibility into our competitors: we see the private competitors, we see the money they raise, but we don't know how much they're putting into R and D. Ultimately, what we try to do is work with our customers to make sure we've got the range of products they want. And per Ryan's feedback, our view is we are comparable, if not superior. The next question is: are any of your technologies being implemented in the electric vehicle industry?

And if so, what and how are they being implemented? Who are your business partners in this market? And what about autonomous vehicles? A few parts to that question, Ryan.

Speaker 2

Yes, a few parts there. I'll focus on the autonomous vehicle part, because I think electric vehicles are more about the drivetrain, while autonomous is more about perception, where there's a lot more need for training data. So through the demos that Wilson just showed, we've got an advanced set of capabilities in the annotation market for autonomous vehicles. Wilson showed LiDAR, but computer vision is also used a lot in this space. We have a range of customers that we work with to support their autonomous driving models. So we definitely have the capabilities and the depth in that market, and it's an important focus area for us.

Speaker 1

Yes. Thanks, Ryan. I'd add that it's a very big problem for autonomous vehicles. As you can see from the demos, just consider the challenge of annotating road lines, and then think about road furniture and other elements that you have to deal with as a driver. It is a big challenge and requires a lot of data.

Okay. The next question also touches on autonomous vehicles: how important is recency of training data?

Speaker 2

Recency is super important across pretty much every AI model there is, and autonomous vehicles are a good example where it's really important. So I'll give an example around recency and why it matters. Last mile commuting is becoming really important. So 2 years ago, there weren't too many electric scooters on the road.

There's not that many today, but if you go to San Francisco, it's a bit of a different story. So if you think about the annotation of an electric scooter, particularly with someone standing upright on it: if you're using data from 2 years ago, you might treat them as a pedestrian. But now all of a sudden, you've got these pedestrian-looking objects that are traveling 20 kilometers an hour down the road. So that's just a basic example of how the real world environment is changing, and in the context of autonomous vehicles, that has a massive set of implications.

I think the other thing, as Mark alluded to, is that autonomous vehicles are really difficult. The specific environment changes geography by geography and country by country: different sets of road rules, different buildings in the background, right-hand side, left-hand side. So a lot of the work that's being done to build the models today is quite U.S. centric, but there's going to be a huge long tail of market-specific training required to make autonomous driving a truly global approach.

Speaker 1

Yes. Thanks, Ryan. And I might hand it over to Wilson to chime in on this one as well. As many of you know, the majority of the work we do is around search, and recency is really important in search. So Wilson, given your background, perhaps you could add something on the importance of recency in search data.

Speaker 3

Yes. In search, actually, recency is core to search. If I remember my old days, when we changed the search algorithm, we basically trained a new model almost every week just to keep up with recency. Search touches a lot of different areas: there's cultural change, there are societal movements, there are all those new keywords, popular music.

There's a lot of new stuff coming out all the time, and search has to be able to support all of it. So recency is super important for search. That's also the reason we retrained our models almost every week.

Speaker 1

Yes. Thanks, Wilson. And if everybody on the call thinks about their own experiences driving, just to go back to autonomous vehicles: certainly, if I go back to the area in Sydney where I grew up, the roads have changed. Sometimes they change very quickly.

And so even humans need recent data, but of course, we can join the dots very well, whereas the AI needs the training data to learn. So it's much harder for an autonomous vehicle to learn than it is for a human. But recency is super important. Great question. Thank you.

Okay. The next question. We talked a lot about the impact of Appen Intelligence on speech, text, image, video, etcetera. We spoke very little about the impact of it on content relevance. Can we go into this a little deeper?

So Ryan?

Speaker 2

Good question. Content relevance is highly subjective, and it's highly specific to a demographic. So we need the subjectivity of a human and the context of that individual's awareness of a specific environment, which is typically driven by demographics, where they live, etcetera. A lot of the work that we do in the automation space is around improving the speed of the crowd workers. So we spoke about pre-labeling, validation and support during the labeling process to improve the speed of the work. With content relevance, there's typically less of a need for some of the time consuming tasks, like drawing a polygon around a shape, for example.

So a lot of the AI that's being used in speech and image related training data support is there to really speed up the process, which helps with throughput and quality. One of the things we are focusing on in the content relevance part, though, is more about user experience changes: not necessarily using AI in the process, but improving the environment that the workers are operating in, to try and get those incremental step changes in the time it takes to complete the task. That's an important focus for the customers we support on our platform.

Speaker 1

Yes. Thanks, Ryan. And again, I'll hand it over to Wilson, because we've talked a lot about this. Wilson, on trying to automate content relevance, perhaps you can share some of your thoughts on that, and maybe some examples that bring it to life.

Speaker 3

Yes, sure. I think this is a really great question. And also, trust me, there is no lack of trying, right? This is a big part of our business, and we try very hard to see how we can save more labor effort there. But it is hard.

It is hard. Content relevance tasks normally require a human to have a certain cultural background, certain knowledge. For some of those, you can try to use machine learning. Say I have a search keyword, and I want to judge whether some search results or some product results are super relevant to the keyword or not. We can use machine learning to try that.

Sometimes you can see some success, but there's a big problem with that: if you do it, you can potentially introduce a lot of bias. How do I put this? Let's say you show the worker a machine learning result up front. Then the worker, because for them the task is easy, right, relevant or not relevant, is going to just pick whatever the model said, and then you've introduced some bias, which is not good.

The other part, and I can't give a great example here, is that the machine doesn't really have that deep understanding of the cultural element. A lot of the subjective component, only a human possesses. So it's really hard. We've tried a lot, but we haven't seen a lot of success. I'll go back to the point Ryan mentioned.

We did have pretty good success when we tried to design a new workflow or maybe a different UI, so that when the workers do content relevance, it's much easier for them to deliver the result. We see some success there.

Speaker 1

Yes. Thanks, Wilson. And perhaps you may recall, earlier in Ryan's presentation, he had the three pictures of breakfast: the black pudding, the chocolate sprinkles and the Vegemite. And depending upon what country you come from, you may think that's breakfast or not. So that's an example of a cultural type question.

The relevance task could be very simple: is this breakfast? Or would you eat this for breakfast? And the majority of people may look at the black pudding and say no, whereas depending upon culture and where you come from, you'd have a different answer. So automating that is super tricky.

So there's no lack of trying. Keep in mind also that the companies who ask us to do this for them, the largest search and social media companies in the world, have got some pretty smart data scientists. And I think if they could have automated it, they would have. But there's still a need for that human element there. Okay.

The next question; sorry, this is quite dynamic, it's got a life of its own. Here's the next question. This may be one for Wilson, but I'll throw it to Ryan first of all. Feature engineering is one of the major time consuming tasks undertaken by data scientists.

Does Appen plan to invest in this area? Any investment in medical or biological data annotation technologies? Fairly specific. So Ryan, do you have a response there?

Speaker 2

So I'll start with the medical part, and maybe I'll throw to Wilson for the feature engineering part of the question. Our tools today are capable of supporting medical imagery. A lot of the imagery work is computer vision related, and the flexibility of our tools can support those types of applications. So that is an area we support today. Feature engineering, yes, I'll pass on to Wilson for that one.

Speaker 3

Sure. I think this is a great question. Data scientists really do spend a lot of time on feature engineering. And when you say feature engineering, most of the time it's related to the training data. For all the products we shared today, all the examples we gave today, the umbrella term is really feature engineering.

We are really preparing all those features, which are also the training data used to train the model, so we are working a lot on this. Besides all the traditional products we have that help people collect data and annotate data, which becomes the features later used to train the machine learning model, and that's already a big part of it, there's also the new product, In-Platform Audit, which actually helps you understand the training data. It helps you understand the features: what's the distribution of a feature?

Is there any bias in your features? So basically, all our work, all our products, are around helping people do better feature engineering.

Speaker 1

Yes. Thank you, Wilson, and thanks for that question as well. The next question is: can Appen Mobile technology be used for data capture as well as being a crowd focused tool? Ryan?

Speaker 2

Yes, absolutely. That's one of the real core components of Appen Mobile. So Wilson spoke through the 2 main features: one being the interface for our crowd, where they can sign on, view their tasks and manage the relationship with Appen. The second is as a really powerful data capture tool. So I'll give you an example around some of the features that are enabled on a mobile device that aren't on a desktop.

Things like GPS: if the task was to go out and take a photo of a real world environment, the GPS data is automatically tagged within the metadata of the image, so that becomes a really important part of the metadata of the training data. So there's a whole raft of different native features of handheld mobile devices that open up a different set of data collection capabilities.

Speaker 1

Thanks, Ryan, and thanks for that question. The next question: slide 46 in the pack refers to different AI technologies that are used in mature practices and highlights techniques that typically require some level of human annotation and/or data preparation. Where it says some level, is that level of human involvement the same, higher or lower than 2 years ago? And are lower levels of human annotation positive, pardon me, negative or neutral

Speaker 2

to Appen? A good question. A lot in AI varies; there are different use cases, and some things evolve differently to others. I think that in those specific areas around supervised learning, there are some supervised learning techniques where data can be taken directly from CRM systems or other sources that are automatically annotated, which may not require as much human annotation to complete the feature engineering, whereas there are other supervised learning techniques that are heavily reliant on human annotation to complete the labeling process that's used to train the systems. So I think that's part one of my response.

The second part is how that has changed. Our market is growing, and we see an increasing need for human annotated data. There's also a lot of need for training on data that's already prepared, because it comes from CRM or other structured data systems. So AI is growing; I think it's growing everywhere. So to answer is it more or less: I think it's definitely more.

Speaker 1

Thanks, Ryan. The next question: with the rising concerns on privacy issues with Big Tech, Big Tech are working on reducing data accessibility. Do you think this would hurt Appen's AI business model? And if yes, what would be a solution?

Speaker 2

We're still seeing how this plays out; a lot of these changes are quite new. But you're right in saying that there seems to be a general view that there's a restriction on data sharing and how it's used inside and outside the broader ecosystem. One of the views that we have is that with this restriction on sharing, there will be less available information to train the models that are used for things like advertising, targeting, search results, etcetera. So again, we're yet to see it play out.

It's very live. There is potential that this could be a net positive for us, as more data is required to fill the gaps that have been created by the restrictions driven by privacy.

Speaker 1

And if you also think about it, what's changing is, for want of a better word, the unsolicited harvesting of data. And we'll move to an environment where more permission is needed, where more protection is provided around personal data, more rights provided around personal data. What's constant is that AI is at the center of many product developments in technology, and it's also constant that AI needs training data. I hope we've illustrated that today. What's changing is the way that firms acquire data.

So there's no reduction in the need for data; it's how companies acquire data that's changing. And as Ryan says, that could play to our advantage, because we could see companies coming to us saying, we can't harvest this data anymore; how do we get this data in a manner that protects the owner of that data? So it's yet to play out, but it is an important part of the AI industry going forward. Okay.

The next question: what proportion of models would you think rely on self-supervised learning?

Speaker 2

It's a tricky one to answer. Self-supervised learning is an interesting technique; it has applicability and it has its benefits, because there's a lot less of the feature engineering and data labeling required. It does have its drawbacks, though. It's really constrained in the assigning of meaning to the data. So for instance, it will group a whole bunch of different shapes together, and that can be inferred to have a certain meaning, but that doesn't quite have the same benefit as what we see through human annotated data used to support supervised learning.

The other thing is that with a lot of the self-supervised learning techniques, particularly in large scale applications, there's an opportunity for bias to be introduced, and it's quite hard to control. Unsupervised learning requires huge amounts of data, so it's very difficult to filter out the inputs that create bias. I'll ask Wilson whether he has a view on that specific question around the proportion of models that rely on self-supervised learning.

Speaker 3

Yes, I do. Actually, this is an area I put a lot of effort into, monitoring the progress and seeing how this technology evolves, right? Self-supervised learning is not a new thing. It's been there for a long time; I used it probably 10 years ago, together with other techniques. One thing super important about self-supervised learning, which is also called self-supervised representation learning: what does that mean?

It really means learning the representation of, let's say, a word, right? For example, back to my Twitter example, the tweet classification example: I want to classify whether a tweet is positive or negative. I'm using a lot of labeled data, like this tweet is positive, this tweet is negative, to train the model. But meanwhile, I also use a lot of self-supervised learning to prepare the features before that.

What does that mean? I can use a lot of techniques to encode words: say, for this particular word, what does this word mean, right? There, I can use self-supervised learning to convert a word into a vector. As a kind of feature engineering step, I convert the word into a vector and then use that vector as an input for my supervised learning to train the end result. I know it's a little bit complicated, but it's just one step of the overall machine learning and training process, and we use those techniques together.

It's not that I use self-supervised learning to replace supervised learning; that's not the case. I use them both to train one model.
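
A compact sketch of that combination: the self-supervised step supplies word vectors (random here as stand-ins for embeddings pretrained on unlabeled text), and the supervised step then trains on labeled tweets using those vectors as features:

```python
import random

# Self-supervised step (stand-in): in practice these word vectors would
# be learned from large amounts of unlabeled text, word2vec-style;
# random vectors are used only to keep the sketch self-contained.
EMBEDDINGS = {w: [random.random() for _ in range(8)]
              for w in ["great", "terrible", "wedding", "rain"]}

def featurize(tweet):
    """Feature engineering step: average word vectors into one
    fixed-length feature vector for the supervised model."""
    vecs = [EMBEDDINGS[w] for w in tweet.split() if w in EMBEDDINGS]
    if not vecs:
        return [0.0] * 8
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Supervised step: a labeled set of tweets would now train the final
# classifier on these features (the model choice is omitted here).
train_x = [featurize("great wedding"), featurize("terrible rain")]
train_y = ["positive", "negative"]
```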

Speaker 1

Yes. Thanks, Wilson. And I think overall, you'll recall the slide in Ryan's deck for the speech application. It was a very complicated slide, the sort of blue shaded slide. There are many, many different models in one application or one product, and there are many techniques that go into those models. Training data is essential for AI, but it's expensive. If the only training data available was human annotated data, it could potentially be prohibitively expensive.

So the developers of AI are looking for any technique they can to accelerate development and improve the cost of developing their AI products. So typically, there's a mix of techniques that goes into building one product. And that's reflected in the way, pardon me, that we're going about our business. Rather than relying on the simple technique of human annotated data alone, we're looking to use AI ourselves to accelerate and improve the unit economics of the production of that data.

Okay. The next question: why do customers use your platform and tools rather than build their own?

Speaker 2

A good question. Customers rely on us for a variety of reasons. Firstly, and I think we covered some of this today, data labeling is difficult. Data labeling and annotation at really high quality levels can be very difficult. So that's point one.

I think the next is that managing a crowd is very difficult also. It's one thing to assemble a million-plus people; it's another to allocate work, manage the quality, do the payments, etcetera. So customers could, and some of our customers, or people in the industry I should say, do build out their own annotation platforms. And it's largely initiated to support a very specific use case.

So they will build a pipeline and a workflow within the business, and some annotation tools to support a specific use case. Then comes the step of: okay, we want to do some different things; it's not just this narrow use case, we want to expand beyond that. And that's when it starts to become very apparent to our customers that this is a big investment, and it's difficult to bring the expertise across that wide variety of use cases.

During our sales processes, we see many customers who have been down this journey, where they start with a narrow use case, build something internally, and quickly realize that it is very difficult to manage quality in particular, and difficult to support a breadth of AI use cases. So customers come to us when AI is getting serious and they really want to move into production and support high quality training data for high performing applications.

Speaker 1

Yes. Thanks, Ryan. I think it's like any developing industry, and ours is still relatively early. There are many techniques that people try on their own; that could even be developing their own platform to do this work. And it gets to a point where it's too complicated.

The scale of the operation is too large. And at the same time as people are learning how to do this, there are companies like Appen emerging to bring specialist expertise to the industry. I'm sure there was a time when every company made their own payroll system, for example, whereas now you would never do so. So I think it's a bit of that evolution as well. And I'd also add that every one of our customers benefits from all of the expertise and knowledge that's embedded in the platform, as opposed to just covering a particular use case. So yes, we see a departure from people building their own platforms to wanting to work with a specialist provider.

Okay. The next question. How is the role of crowdsource workers changing for Appen as the model changes to a product led committed revenue model? What does crowdsource efficiency mean for Appen?

Speaker 2

A good question. There's a large amount of work for our crowd. One evolution that we're seeing is that the demographics are getting more specific, and so is the ask from our customers. So one of the things that we're doing is ensuring that we're able to serve the customers' needs by getting the right demographics. That's very important for us.

The other thing that we're working very hard on is what we spoke about with Appen Mobile: making that crowd experience a lot more seamless, so there's greater visibility into the tasks that are available and greater matching of a person's skills to the task. So when they do come and work with us, it's on tasks where they are able to deliver high quality work, do more of those tasks, and be supported in a really strong way. So with our crowdsourcing approach, we continue to build our crowd.

We continue to find ways to better match the people in the crowd with the right task. That's a good experience for our crowd, it's a good experience for our customers, and it ultimately leads to stronger growth in the business.

Speaker 1

Yes. The crowd is a big expense; it's the cost of goods expense that goes through the business. So if we can get more data per crowd worker, that improves the unit economics, and that goes to our gross margin and ultimately to the bottom line. This next one is a multipart question.

Firstly, data ownership and licensing. If a client owns their data library, does that mean they no longer need that type of service again? Or is ongoing data maintenance required? If so, is this typically provided by Appen or the client? Can data libraries from existing clients be sold to other clients wanting the same type of data?

Or is each dataset private and mostly customized to client needs? And that was just part one.

Speaker 2

Okay. Let me handle part one. So we spoke a lot about data recency. And there's, again, the general view that all AI models degrade in terms of performance. So it's not a matter of if, it's a matter of when.

So the view is that a store of data needs to be refreshed. We support a lot of our customers in updating, whether that's additional collection or, if they have the data themselves, the labeling to refresh those data assets and those features so that the models can be retrained. The second part is around ownership of the data. It varies customer by customer. For some customers where we are doing the data collection on their behalf, depending on the arrangement with the customer, we have the ability to access that data, either for internal uses or to on-sell. For other customers, we don't.

So it is very specific on a case by case basis.

Speaker 1

Yes. Thank you, Ryan. The second part of this question is about semi-supervised learning, and I think we've covered that. So in the interest of time, we'll move to the third part, which is in relation to your expense item, services purchased (data collection): do you expect long term cost improvements here?

And I think that relates to Appen Mobile, for example.

Speaker 2

Yes, definitely. So we continue to invest in, like Mark said, getting more data from a crowd worker. Appen Mobile is one of the big areas that we're focusing on: one, to improve the unit economics of data collection, but more importantly, to create a more feature-rich set of data that we're collecting from the field. So there are a lot of exciting projects that we're working on in this space.

Speaker 1

And I wonder, Wilson, do you have anything to add on ways that we lower the cost of data collection?

Speaker 3

Yes. There are a few areas that we are looking into. One is really just to make the data collection work much easier for the crowd worker, right? So they just pick up their phone, open the app, and the task gets done. Super easy.

So by improving the experience, we can lower some costs. We are also trying to use machine learning in some of those data collection cases. I'll give you an example. Some data collection tasks need a worker to record some voice data and then transcribe that voice data. The technique we use there is that when they record the voice data, we use our machine learning capability in the back end to transcribe that data automatically, but we don't show it to the worker right away, because we want to make sure that the worker also provides their own input.

What we do is use the pre-transcribed data in the back end to provide an autocomplete feature for the worker. When they transcribe the data, we show them: is this what you were going to say? Yes. That saves them time in finishing their data collection tasks. So that's the second area.

Basically, besides the better, easier to use experience, we are also applying machine learning to help with data collection tasks.
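
The autocomplete idea can be illustrated with a tiny prefix-matching sketch; the function name and behavior are invented for illustration:

```python
def suggest_completion(typed, hypothesis):
    """Offer the rest of the backend ASR hypothesis once the worker's
    typing agrees with its prefix (case-insensitive)."""
    if typed and hypothesis.lower().startswith(typed.lower()):
        return typed + hypothesis[len(typed):]
    return None  # typing diverges from the hypothesis: no suggestion

backend_hypothesis = "I think we're going to talk about weddings"
print(suggest_completion("I think we", backend_hypothesis))
# -> "I think we're going to talk about weddings"
```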

Speaker 1

Yes, thanks, Wilson. And data collection is becoming more important, because recall Ryan's example about the chatbot, the U.S. English language chatbot: there could be a lot of off-the-shelf data for that. But when you get down to a much more specific use case, a different language, a more specialized area, you've got to collect a lot of data for that.

So the easier it is and the more cost efficient it is to collect data, the more value it is for the customer. The final part of this question is: in relation to the Figure Eight acquisition, what metrics are you using to measure its success? The key one is the one we took everybody through yesterday, which is the growth in that new markets figure. The new markets figure is revenue that we derive from the enterprise sector, from the government sector, from China, and also revenue that flows through our platform from our major customers. And you can see from yesterday's presentation that that's growing nicely.

None of that revenue would be available without that acquisition. Okay. The next question: is the machine learning doing the OCR and audio transcription our proprietary software, or off the shelf? I think I'll throw this one straight to Wilson. He's built it; you should know.

Speaker 3

They are proprietary models. We started with off-the-shelf models, and it didn't work, for a few reasons. One is that a lot of the use cases we are handling are very specialized, and we need to find the right training data supporting the use case. Off-the-shelf models don't really work well for those use cases.

So we have to train our own models. That's one reason. The second reason is that our models are a little bit different from a general model. You can see, for example, with audio transcription: our model not only needs to transcribe the audio to text, we also need to flag: this is background noise.

This is a different gender. This is a pause. We need to handle all those different activities. An off-the-shelf model doesn't need to handle those, so we have to train our own proprietary models.

So it is more difficult than using an off-the-shelf model, but it gives us an advantage when we're asked to do this type of job.

Speaker 1

Yes. Thanks, Wilson. Thanks for that question. I hope that provides a clear answer. The next question: do the Appen products integrate with client-side data pipelines and applications? Again, I'll throw this one straight to Wilson.

Speaker 3

Yes, that's a great question. And the answer is absolutely yes. We provide a very rich set of APIs to the client. They can use our APIs to set up jobs, upload data, download data, and make our system part of their overall pipeline. That is used a lot.

That's also a big focus for our product engineering team.
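
For a sense of what such an integration pattern typically looks like, here is a hypothetical sketch. The host, endpoints and parameters below are invented placeholders and not Appen's documented API; only the set-up-job, upload, launch and download pattern reflects what was described.

```python
import requests

BASE = "https://api.example.com/v1"              # placeholder host
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # placeholder auth

def run_labeling_job(rows):
    """Set up a job, upload rows, launch, then fetch labeled results."""
    job = requests.post(f"{BASE}/jobs", headers=HEADERS,
                        json={"title": "sentiment labeling"}).json()
    requests.post(f"{BASE}/jobs/{job['id']}/data", headers=HEADERS,
                  json={"rows": rows})
    requests.post(f"{BASE}/jobs/{job['id']}/launch", headers=HEADERS)
    # A real pipeline would poll or receive a webhook before this call.
    results = requests.get(f"{BASE}/jobs/{job['id']}/results",
                           headers=HEADERS)
    return results.json()
```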

Speaker 1

Thanks, Wilson. The next question is: will this webinar be uploaded for replay? And the answer is yes. A recording of today's event will be available in our Investor Center on the Events and Presentations page early next week. The next question, one for Ryan: as use cases become more niche, will you have to develop more tools?

Can AI models be transferred between use cases?

Speaker 2

Yes. Good question on the tooling. Like we've said throughout this presentation, we believe we have a complete set of tools, but new AI use cases with specific data techniques continue to emerge. So we will continue to invest in the breadth of our tooling. One example that's live at the moment is different light spectrums.

The non-visible light spectrum is a good example where there's a lot of interest in AI applications. Our tools support it today, but there's more that we could be doing in that space. So that's an example of an emerging area that we'll be focusing on. On the transferability of models, there is a very common technique in AI model development called transfer learning. Transfer learning is used pretty much across the board for every model; it only gets you a small part of the way, though.

So there is still a lot of fine tuning required, and that's really where the supervised learning comes in, along with the requirement for high quality training data.

Speaker 1

Thanks. You may also recall that during the presentation, Wilson mentioned we're doing some work on satellite image data, which is not just another data type; there's a sort of tiling nature to that data that requires certain tooling, etcetera. Okay. The next question: regarding moving from model centric to data centric, how far into this move to data centric are customers, and in particular the big global customers?

And how much of a difference can this shift make to Appen's financial performance? Ryan?

Speaker 2

I think that we're well into this shift. It does vary industry by industry and customer by customer. I think the more forward-leaning AI companies, including our largest customers, are probably further along that shift, whereas there might be a set of customers who are more used to using internal data for their AI development and who focus more on the models, perhaps before the realization that being able to use human annotated data or different data sources will unlock a lot of value for their AI models. But I might also get Wilson to chime in on this question.

Speaker 3

Yes. I think the whole industry is moving more to data centric AI. That basically applies to almost every company working on AI. It has become a common understanding, common sense, in the machine learning community: data just plays a super critical role in AI model performance. So no matter if you are a professor, or from a big company or a small company, I think all those data scientists just know the importance of data.

Speaker 1

Thanks, Wilson. Thanks, Ryan. The next question. Does everyone need the quality of the data that Appen provides? Ryan?

Speaker 2

Well, it depends on the quality of the model they're looking to produce. If a customer wants to build a low quality model and that's sufficient for the needs of the application, then they may not need high quality training data. But if you want to build a model that has high quality across not just a small subset but a broad set of inputs, then you will need high quality training data. And maybe another way to put it: if you want to build a high quality model and you've got low quality training data, it's kind of not possible.

You need high quality training data to build a high quality model.

Speaker 1

Wilson, maybe you have some examples of when you might choose to use low quality data.

Speaker 3

I do. I do. Actually, when I work on my chatbot project with my daughter at home, I don't need Appen to provide data services to me. It's good enough as a toy, right? But if you are really using AI to do any serious business, high quality data is a must.

Speaker 1

I think that sums it up, folks: building a chatbot with my daughter. And I can tell you, Wilson's daughter is quite young. So that's the extent of the knowledge there. Okay.

This is, I believe, the last question we have. Does Appen market off-the-shelf data libraries for chatbots? And what is the extent of the service that Appen contributes to chatbot setups?

Speaker 2

We absolutely do offer off-the-shelf data. We have a rich catalog of data that we've collected, and one of the big differences for us is that we have very large volumes of data collected. It's used by a lot of customers to kick start their development of chatbots. So yes, it's an important part of our product offering.

Speaker 1

That's all the questions we have. So I'd like to take this opportunity to thank all of you for attending our webinar today. I hope it was useful. If I can leave you with three thoughts from today's presentation: the first is that the future of AI is very robust, and it absolutely relies on large volumes of high quality training data.

I think the examples we provided make that very clear. I think also that we've provided a lot of information on the need for technology to deliver those large volumes of high quality data: the complexity of use cases, the volumes required, dealing with millions of crowd workers, it's not possible without a good, strong product foundation. And the third thing is, I hope you see that we're investing in this. We've made a lot of progress in this area. There's lots to do, but over time we are building much more of a product-first business, and over time building more competitive advantage and resilience into our business as well.

So thank you once again. Thank you to my co-presenters, Ryan and Wilson, for all of their input to this. And I'm looking forward to the next time that we all meet. Thank you, and good day.
