WE POST ONE NEW BILLION-DOLLAR STARTUP IDEA every day.

LLMs and ChatGPT's Future: The Wikipedia View

Problem: What business idea almost made the former Head of Product at OpenAI leave the company back in March 2021? Fraser Kelton described it in a recent tweet: an “AI API” (an AI marketplace).

In March 2021 I contemplated leaving @OpenAI to start a company around the ideas that open source and open research were going to win. I decided it was too early but obviously that's no longer the case. Here was my thinking at the time. Who is building this? I want to help them.

This post is a description of his thinking at the time, which I call “The Wikipedia View.”

Large language models today are quite locked down: companies compete on having a better or bigger model and then building products on top of it. On the flip side, open-source AI is “rather dismal… while performance on benchmarks against comparable models is good, performance in production settings is not” (source).

The Wikipedia View is the idea that, over time, the industry will converge on a world with one large general model, produced entirely through open-source contributions, and product teams will narrow it to their specific needs through an open ecosystem of smaller models tailored or customized to their products. For example, instead of asking ChatGPT math questions, I would use MathGPT; the only difference between the two platforms would be the underlying model. ChatGPT's would be tuned for general conversation, whereas MathGPT's would be fine-tuned on mathematical texts, as sketched below.
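To make that concrete, here is a minimal sketch (my illustration, not part of Fraser's tweet) of how the two products could share one code path and differ only in the checkpoint they load. It uses the Hugging Face Transformers pipeline; distilgpt2 is a real open checkpoint used purely as a stand-in, and the math checkpoint name is hypothetical.

# Toy sketch: "ChatGPT" and "MathGPT" as the same code path with
# different underlying weights. distilgpt2 is a real open checkpoint
# used as a stand-in; the math checkpoint name is hypothetical.
from transformers import pipeline

CHAT_CHECKPOINT = "distilgpt2"        # general-purpose stand-in model
MATH_CHECKPOINT = "example/math-gpt"  # hypothetical math fine-tune

def make_assistant(checkpoint: str):
    """Build an assistant; the product varies only in its weights."""
    return pipeline("text-generation", model=checkpoint)

chat_gpt = make_assistant(CHAT_CHECKPOINT)
print(chat_gpt("The integral of 2x is", max_new_tokens=12)[0]["generated_text"])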

While private companies like Anthropic or OpenAI will certainly have large models vetted by hired humans in the loop, could there also be a future where open models are orders of magnitude larger and significantly more accurate?


Solution: This company would focus on creating an open ecosystem of research, software, and cloud services that lets individual AI enthusiasts achieve cutting-edge performance across a number of use cases. In Fraser's mind, this would take the form of an API for advanced AI whose calls are routed to the base-layer model best suited to each specific use case; a sketch of that routing layer follows below.
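Here is a minimal sketch of what that routing layer could look like. This is my illustration under stated assumptions: the model names, endpoints, and the complete() helper are all hypothetical, not anything Fraser specified.

# Hypothetical routing layer: one public API call, dispatched to the
# specialized open model that best matches the declared use case.
from dataclasses import dataclass

@dataclass
class ModelRoute:
    name: str      # checkpoint from the open ecosystem (hypothetical)
    endpoint: str  # where that model is served (hypothetical)

REGISTRY = {
    "chat": ModelRoute("open-base-chat", "https://models.example/chat"),
    "math": ModelRoute("mathgpt", "https://models.example/math"),
    "code": ModelRoute("codegpt", "https://models.example/code"),
}

def complete(prompt: str, use_case: str = "chat") -> str:
    """Route a single API call to the best model for the use case."""
    route = REGISTRY.get(use_case, REGISTRY["chat"])  # general fallback
    # A real service would POST the prompt to route.endpoint and meter
    # the call for billing; here we only show the dispatch decision.
    return f"[{route.name}] would answer: {prompt!r}"

print(complete("Integrate x^2 from 0 to 1", use_case="math"))

Conveniently, the per-call dispatch point is also the natural place to meter usage, which lines up with the monetization model described at the end of this post.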

Every business solves a problem, so what problem does this solution solve? The issue is that developers are currently blocked from building great, highly specific AI-enabled products because general models are often not the best models for their use case. Right now, the best first mover in the space is Aquila, a bilingual Chinese-English open-source LLM:

The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replacing a batch of underlying operator implementations with more efficient ones and redesigning the tokenizer for Chinese-English bilingual support. It upgrades the BMTrain parallel training method, achieving nearly 8 times the training efficiency of Megatron+DeepSpeed ZeRO-2 in the training process of Aquila. The Aquila language model is trained from scratch on high-quality Chinese and English corpora. Through data quality control and various training optimization methods, it achieves better performance than other open-source models with smaller datasets and shorter training times. It is also the first large-scale open-source language model that supports bilingual Chinese-English knowledge, commercial licensing, and compliance with domestic Chinese data regulations.

Another example of what Fraser's solution could look like is to crowdsource the fine-tuning of a few open-source LLMs with vetted volunteers, the way Wikipedia was built. See below for a description of what this might look like from Yann LeCun, Chief AI Scientist at Meta AI:

If you want those systems ultimately to be the repository of all human knowledge, the dimension of that space (all human knowledge) is enormous. You're not going to do it by paying a few thousand people in Kenya or in India to rate answers. You're going to have to do it with millions of volunteers that fine-tune the system for all possible questions that might possibly be asked. Those volunteers will have to be vetted in the way Wikipedia is being done. Think of LLMs in the long run as a version of Wikipedia plus your favorite newspapers plus the scientific literature plus everything. But you can talk to it. You don't have to read articles, you can just talk to it.

So if it is supposed to become the repository of all human knowledge, the thing will have to be trained in a way similar to how Wikipedia is. This is a very strong argument for having open-source base models for LLMs. In my opinion the future is inevitably going to be that you're going to have a small number of open-source base LLMs that are not trained for any particular application. They just are trained on enormous amounts of data that require enormous amounts of money, so you're not going to have 25 of them; you're going to have two or three. Then the actual applications are going to be built on top of it by fine-tuning those systems for a particular vertical or application. That's the future.
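As a hedged sketch of the pattern LeCun describes, a shared open base model plus a cheap vertical fine-tune, the snippet below trains a LoRA adapter with Hugging Face Transformers and PEFT. The base checkpoint, corpus file name, and hyperparameters are illustrative assumptions, not a reference implementation.

# Sketch (assumptions labeled): fine-tune an open base LLM for a
# vertical by training only a small LoRA adapter, so volunteers can
# contribute and publish adapters without retraining the base model.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "EleutherAI/pythia-160m"  # any small open causal LM works here
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = get_peft_model(AutoModelForCausalLM.from_pretrained(BASE),
                       LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Stand-in for a community-curated vertical corpus (e.g., math texts).
data = load_dataset("text", data_files={"train": "math_corpus.txt"})["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="mathgpt-adapter",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
model.save_pretrained("mathgpt-adapter")  # publish only the small adapter

Publishing only adapter weights keeps the expensive pretraining centralized in the two or three base models LeCun predicts, while the long tail of verticals stays cheap enough for volunteers to maintain.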

And of course, see below for a full description of Fraser’s original idea.

Monetization: A per-API-call usage fee.

Contributed by: Michael Bervell (Billion Dollar Startup Ideas)
