Problem: During our last team meeting, my co-founder described some of the past week’s biggest breakthroughs in the technology behind our product.
“I had to make an ML model to achieve task #2, but I had no training data for the model,” he explained. “So I just went to ChatGPT and asked it to create 1,000 use cases for us.” As he navigated over to his ChatGPT chat history, I nodded. He had also asked ChatGPT to create accuracy ratings on a scale of 1 to 10.
“Now let’s test it!” As he went to our project page and clicked around, my jaw dropped: our product worked perfectly, even in the edge cases. And we didn’t even have to collect any data; generative AI simply created it for us. Could the future of training data for text be synthetic data generation through general-purpose transformers?
Solution: I first stumbled upon this idea while talking with Google’s AI Research Robotics team. They were running experiments on training robots with synthetic data, or with data produced by other robots in the lab. It got me thinking about the power of highly accurate (or inaccurate) data as a means of creating large-scale models. Now, with ChatGPT (and other general-purpose transformers), I can confidently see a direct path to a billion-dollar company in text-based synthetic data generation.
In particular, this business would specialize in fine-tuning transformers that let users generate and use “training data as a service.” The kicker, however, would be that all of this training data would be generated in real time.
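To make the idea concrete, here is a minimal sketch of what that real-time generation layer might look like: a small function that wraps a general-purpose LLM and returns labeled phrasings on demand. It assumes the official OpenAI Python SDK and an OPENAI_API_KEY in the environment; the function name, prompt wording, and model choice are illustrative assumptions, not a real product spec.

```python
# A sketch of real-time "training data as a service": prompt a
# general-purpose LLM for labeled phrasings on demand.
# Assumes the official OpenAI Python SDK (pip install openai) and an
# OPENAI_API_KEY in the environment; names and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def generate_training_data(label: str, n: int = 25) -> list[str]:
    """Return up to `n` distinct ways a customer might phrase `label`."""
    prompt = (
        "Imagine that you are a sophisticated synthetic data generation "
        f"engine. Give me {n} unique ways to order '{label}'. "
        "Return one phrasing per line, with no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model would do
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]


# Generated the moment a client asks for it, no dataset collection required:
if __name__ == "__main__":
    for phrase in generate_training_data("Big Mac", n=25):
        print(phrase)
```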
For example, suppose McDonald’s wanted to create a machine learning model that takes text input and translates it into menu items. This would be extremely useful in their drive-thru: after a customer orders with their voice, the order would need to be transcribed to text and then mapped to one of nearly 100 McDonald’s menu items. To do this, McDonald’s would need to train the text-based model on thousands of variations of each menu item. It would have to recognize that a “Small Fries” is the same as “Not Large and Not Medium fries” or “The smallest size of fries” or “Fries, but small.” No matter what slang a customer uses, McDonald’s would have to understand it and translate it into an order.
Today, the best way to solve this is to manually write out the different ways someone could order each item off the menu, or to record how people actually order and manually tag the transcripts. With GPT SDG, the solution would be significantly simpler: “Hey ChatGPT, imagine that you are a sophisticated synthetic data generation engine. Can you give me 25,000 unique ways to order a Big Mac?” Using this as the base of the training data, McDonald’s would be able to train a phenomenal model. See below for an image of the actual prompt and response:
[Image: ChatGPT prompt and response — “GPT SDG, can you give me the McDonald’s specialty with the classic sesame seed bun?”]
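To close the loop on the McDonald’s example, here is a hedged sketch of the downstream step: feed the generated phrasings into a simple text classifier that maps any customer utterance to a menu item. The hard-coded examples below stand in for the LLM-generated output, and scikit-learn’s TF-IDF plus logistic regression is just one baseline choice, not McDonald’s actual system.

```python
# A minimal sketch of training an utterance-to-menu-item classifier on
# synthetic data. The hard-coded pairs stand in for GPT-generated output.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic (utterance, menu_item) pairs, as the generator would emit them.
data = [
    ("the smallest size of fries", "Small Fries"),
    ("fries, but small", "Small Fries"),
    ("not large and not medium fries", "Small Fries"),
    ("a big mac please", "Big Mac"),
    ("the one with the classic sesame seed bun", "Big Mac"),
    ("your two-patty specialty burger", "Big Mac"),
]
utterances, menu_items = zip(*data)

# TF-IDF features over unigrams and bigrams, then a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(utterances, menu_items)

# Map a never-before-seen phrasing to a menu item.
print(model.predict(["can I get the tiny fries"]))  # expected: ['Small Fries']
```

With 25,000 generated variations per item instead of a handful, the same pipeline (or a fine-tuned transformer in its place) would cover far more of the slang customers actually use.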
Monetization: Selling this to companies as a service: “Synthetic Data as a Service.”
Contributed by: Michael Bervell (Billion Dollar Startup Ideas)