
Microsoft, Google and Meta bet on synthetic data to build AI models

環球市場播報 · May 9 12:17

Every clever response from a chatbot is backed by massive amounts of data: in some cases, trillions of words drawn from articles, books, and online comments to teach artificial intelligence systems to understand users' queries. The industry's traditional view is that building the next generation of artificial intelligence products will require ever more information.

However, this plan has a big problem: the supply of high-quality data on the internet is finite. To obtain it, artificial intelligence companies typically either pay publishers millions of dollars to license content or scrape it from websites, exposing themselves to copyright disputes. A growing number of top artificial intelligence companies are exploring another approach, one that divides the industry: synthetic data, which is essentially fake data.

The approach works like this: technology companies use their own artificial intelligence systems to generate text and other media, then train future versions of the same systems on that artificial data, which Anthropic CEO Dario Amodei calls a potential “limitless data generation engine.” In this way, AI companies can sidestep many legal, ethical, and privacy issues.

The idea of synthetic data in computing isn't new: the technique has been used for decades in everything from anonymizing personal information to simulating road conditions for autonomous driving. However, the rise of generative artificial intelligence has made it easier to create higher-quality synthetic data at scale, and has given the approach new urgency.

At Microsoft, a generative artificial intelligence research team used synthetic data in a recent project. The team wanted to build a smaller, less resource-hungry AI model that still had effective language and reasoning abilities. To do this, they tried to mimic the way children learn language by reading stories.

Instead of feeding the model a large number of children's books, the team compiled a list of 3,000 words that a four-year-old can understand. They then asked an artificial intelligence model to create a children's story using one noun, one verb, and one adjective from the list. The researchers repeated this prompt millions of times over several days, generating millions of short stories that ultimately helped develop a more capable language model. Microsoft has open-sourced the resulting series of “small” language models, Phi-3, and made it available to the public.
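The article does not include Microsoft's code, but the loop it describes is easy to sketch. The Python below is a minimal illustration only: the word lists, prompt wording, and function names are assumptions, and `generate_story` stands in for whichever large model actually wrote the stories.

```python
import random

# Illustrative fragments of the word list; the project reportedly used
# about 3,000 words a four-year-old can understand, split by part of speech.
NOUNS = ["dog", "ball", "tree", "boat", "cookie"]
VERBS = ["jump", "sing", "hide", "share", "find"]
ADJECTIVES = ["happy", "tiny", "shiny", "brave", "sleepy"]

PROMPT_TEMPLATE = (
    "Write a short story for a four-year-old. "
    "Use the noun '{noun}', the verb '{verb}', "
    "and the adjective '{adjective}'."
)

def make_prompt() -> str:
    """Sample one word of each type and fill in the story prompt."""
    return PROMPT_TEMPLATE.format(
        noun=random.choice(NOUNS),
        verb=random.choice(VERBS),
        adjective=random.choice(ADJECTIVES),
    )

def generate_corpus(generate_story, n_stories: int) -> list[str]:
    """Issue the prompt repeatedly; `generate_story` is a placeholder
    for the large model that turns each prompt into a story."""
    return [generate_story(make_prompt()) for _ in range(n_stories)]
```

Random sampling over the three word lists is what gives the corpus its variety: each draw forces the generating model into a slightly different story, even though the prompt template never changes.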

Sébastien Bubeck, Microsoft's vice president of generative artificial intelligence, said: “All of a sudden, you have far more control than you used to. You can decide at a more granular level what you want your model to learn.”

Bubeck said that with synthetic data, you can also attach explanations to the data to better guide the artificial intelligence system through the learning process; otherwise, the machine may get confused along the way.
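One way to read this is that each synthetic example can carry its reasoning alongside the answer. The record below is purely hypothetical; the field names and example are invented for illustration, not Microsoft's actual data format.

```python
# Hypothetical record format: the generating model is asked to emit an
# explanation with every example, so the model being trained sees the
# reasoning steps, not just the final answer.
record = {
    "question": "Tom has 3 apples and gives 1 to Ann. How many does he have left?",
    "answer": "2",
    "explanation": "Tom starts with 3 apples; giving 1 away leaves 3 - 1 = 2.",
}
```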

However, some AI experts worry about the risks of this technique. A group of researchers from Oxford, Cambridge, and several other well-known universities published a paper last year describing how building a new artificial intelligence model on synthetic data generated by ChatGPT led to what they called “model collapse.”

In their experiment, artificial intelligence models built on ChatGPT's output began to show “irreversible defects” and appeared to lose memory of what they were originally trained on. For example, the researchers prompted a large language model with text about historic architecture in England. After retraining the model several times on synthetic data, it began producing meaningless gibberish about jackrabbits.
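The recursive setup behind such experiments can be outlined in a few lines. This is a toy sketch under stated assumptions, not the paper's code; `train` and `sample` are placeholders for a real training and sampling stack.

```python
def recursive_training(real_corpus, train, sample, generations=5):
    """Toy outline of a model-collapse experiment.

    Generation 0 is trained on real text; every later generation is
    trained only on text sampled from its predecessor, so rare patterns
    in the original data are progressively forgotten.

    `train(corpus)` returns a model; `sample(model, n)` draws n texts.
    """
    corpus = list(real_corpus)
    models = []
    for _ in range(generations):
        model = train(corpus)
        models.append(model)
        # Replace the training set with purely synthetic output.
        corpus = sample(model, len(real_corpus))
    return models
```

The key detail is the last line of the loop: once the real corpus is discarded, any information the model fails to reproduce is gone for every subsequent generation, which is why the flaws are described as irreversible.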

Researchers also worry that synthetic data could amplify the biases and toxicity present in a dataset. Some proponents of synthetic data counter that, with appropriate safeguards, models developed this way can be as accurate as, or even better than, models built on real data.

Dr. Zakhar Shumaylov of the University of Cambridge, a co-author of the paper on model collapse, said in an email: “If handled properly, synthetic data can be very useful. However, there is currently no clear answer on how to handle it properly; some biases may be difficult for humans to detect.”

There is also a more philosophical debate: if large language models are caught in an endless loop of training on their own output, will AI eventually become less a machine that mimics human intelligence and more a machine that mimics the language of other machines?

Percy Liang, a computer science professor at Stanford University, said that to generate useful synthetic data, companies still need the distilled product of real human intelligence, such as books, articles, and code. “Synthetic data is not real data,” Liang said in an email. “It's like dreaming about climbing Mount Everest without actually reaching the summit.”

Pioneers of synthetic data in artificial intelligence agree that humans cannot be left out of the process: real people are still needed to create and refine artificial datasets.

Bubeck said, “Synthesizing data isn't about simply pressing a button and saying, ‘Hey, help me generate some data.’ It's a very complicated process. Creating synthetic data at scale requires a significant investment of human effort.”

Disclaimer: This content is for informational and educational purposes only and does not constitute a recommendation or endorsement of any specific investment or investment strategy.