
Microsoft, Google and Meta bet on synthetic data to build AI models

環球市場播報 · May 9 12:17

Every clever response from a chatbot is backed by massive amounts of data: in some cases, trillions of words drawn from articles, books, and online comments to teach artificial intelligence systems to understand users' queries. The industry's traditional view is that building the next generation of artificial intelligence products will require ever more information.

However, this plan has a big problem: the supply of high-quality data on the internet is finite. To obtain it, artificial intelligence companies typically either pay publishers millions of dollars to license content or scrape it from websites, exposing themselves to copyright disputes. A growing number of top artificial intelligence companies are exploring another approach, one that divides the industry: synthetic data, which is essentially fake data.

The approach works like this: technology companies use their own artificial intelligence systems to generate text and other media, then train future versions of the same systems on that artificial data, which Anthropic CEO Dario Amodei calls a potential “limitless data generation engine.” In this way, AI companies can sidestep many legal, ethical, and privacy issues.

The idea of synthetic data in computing isn't new: the technique has been used for decades in everything from anonymizing personal information to simulating road conditions for autonomous driving. However, the rise of generative artificial intelligence has made it easier to create higher-quality synthetic data at scale, and has given the approach new urgency.

At Microsoft, a generative artificial intelligence research team used synthetic data in a recent project. The team wanted to build a smaller, less resource-hungry AI model that still had effective language and reasoning abilities. To do this, they tried to mimic the way children learn language by reading stories.

Instead of feeding the model a large number of children's books, the team compiled a list of 3,000 words that a four-year-old can understand. They then asked an artificial intelligence model to create a children's story using one noun, one verb, and one adjective from the list. The researchers repeated this prompt millions of times over several days, generating millions of short stories that ultimately helped develop a more capable language model. Microsoft has open-sourced the resulting series of “small” language models, Phi-3, and made it available to the public.
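The article does not include Microsoft's code, but the loop it describes is easy to sketch. The Python below is a minimal illustration only: the word lists, prompt wording, and function names are assumptions, and `generate_story` stands in for whichever large model actually wrote the stories.

```python
import random

# Illustrative fragments of the word list; the project reportedly used
# about 3,000 words a four-year-old can understand, split by part of speech.
NOUNS = ["dog", "ball", "tree", "boat", "cookie"]
VERBS = ["jump", "sing", "hide", "share", "find"]
ADJECTIVES = ["happy", "tiny", "shiny", "brave", "sleepy"]

PROMPT_TEMPLATE = (
    "Write a short story for a four-year-old. "
    "Use the noun '{noun}', the verb '{verb}', "
    "and the adjective '{adjective}'."
)

def make_prompt() -> str:
    """Sample one word of each type and fill in the story prompt."""
    return PROMPT_TEMPLATE.format(
        noun=random.choice(NOUNS),
        verb=random.choice(VERBS),
        adjective=random.choice(ADJECTIVES),
    )

def generate_corpus(generate_story, n_stories: int) -> list[str]:
    """Issue the prompt repeatedly; `generate_story` is a placeholder
    for the large model that turns each prompt into a story."""
    return [generate_story(make_prompt()) for _ in range(n_stories)]
```

Random sampling over the three word lists is what gives the corpus its variety: each draw forces the generating model into a slightly different story, even though the prompt template never changes.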

Sébastien Bubeck, Microsoft's vice president of generative artificial intelligence, said: “All of a sudden, you have far more control than you used to. You can decide at a more granular level what you want your model to learn.”

Bubeck said that with synthetic data, you can also attach explanations to the data to better guide the artificial intelligence system through the learning process; otherwise, the machine may get confused along the way.
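One way to read this is that each synthetic example can carry its reasoning alongside the answer. The record below is purely hypothetical; the field names and example are invented for illustration, not Microsoft's actual data format.

```python
# Hypothetical record format: the generating model is asked to emit an
# explanation with every example, so the model being trained sees the
# reasoning steps, not just the final answer.
record = {
    "question": "Tom has 3 apples and gives 1 to Ann. How many does he have left?",
    "answer": "2",
    "explanation": "Tom starts with 3 apples; giving 1 away leaves 3 - 1 = 2.",
}
```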

However, some AI experts worry about the risks of this technique. A group of researchers from Oxford, Cambridge, and several other well-known universities published a paper last year describing how building a new artificial intelligence model on synthetic data generated by ChatGPT led to what they called “model collapse.”

In their experiment, artificial intelligence models built on ChatGPT's output began to show “irreversible defects” and appeared to lose memory of what they were originally trained on. For example, the researchers prompted a large language model with text about historic architecture in England. After retraining the model several times on synthetic data, it began producing meaningless gibberish about jackrabbits.
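The recursive setup behind such experiments can be outlined in a few lines. This is a toy sketch under stated assumptions, not the paper's code; `train` and `sample` are placeholders for a real training and sampling stack.

```python
def recursive_training(real_corpus, train, sample, generations=5):
    """Toy outline of a model-collapse experiment.

    Generation 0 is trained on real text; every later generation is
    trained only on text sampled from its predecessor, so rare patterns
    in the original data are progressively forgotten.

    `train(corpus)` returns a model; `sample(model, n)` draws n texts.
    """
    corpus = list(real_corpus)
    models = []
    for _ in range(generations):
        model = train(corpus)
        models.append(model)
        # Replace the training set with purely synthetic output.
        corpus = sample(model, len(real_corpus))
    return models
```

The key detail is the last line of the loop: once the real corpus is discarded, any information the model fails to reproduce is gone for every subsequent generation, which is why the flaws are described as irreversible.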

Researchers also worry that synthetic data could amplify the biases and toxicity present in a dataset. Some proponents of synthetic data counter that, with appropriate safeguards, models developed this way can be as accurate as, or even better than, models built on real data.

Dr. Zakhar Shumaylov of the University of Cambridge, a co-author of the paper on model collapse, said in an email: “If handled properly, synthetic data can be very useful. However, there is currently no clear answer on how to handle it properly; some biases may be difficult for humans to detect.”

There is also a more philosophical debate: if large language models are caught in an endless loop of training on their own output, will AI eventually become less a machine that mimics human intelligence and more a machine that mimics the language of other machines?

Percy Liang, a computer science professor at Stanford University, said that to generate useful synthetic data, companies still need the distilled product of real human intelligence, such as books, articles, and code. “Synthetic data is not real data,” Liang said in an email. “It's like dreaming about climbing Mount Everest without actually reaching the summit.”

Pioneers of synthetic data in artificial intelligence agree that humans cannot be left out of the process: real people are still needed to create and refine artificial datasets.

Bubeck said, “Synthesizing data isn't about simply pressing a button and saying, ‘Hey, help me generate some data.’ It's a very complicated process. Creating synthetic data at scale requires a significant investment of human effort.”

Disclaimer: This content is for informational and educational purposes only and does not constitute a recommendation or endorsement of any specific investment or investment strategy.