
Intel Executives Appear at Microsoft Build: Unleashing the Superpowers of the AI PC with an Innovative Platform for Optimized AI Model Execution

wallstreetcn ·  May 21 20:07

AI PCs ship with optimized builds of OpenVINO and DirectML that run generative AI models such as Phi-3 efficiently on CPUs, GPUs, and NPUs. Developers can deploy AI agents that reason and act with tools, run models efficiently on the AI PC using speculative decoding and quantization techniques, and target use cases such as personal assistants, secure local chat, code generation, and retrieval-augmented generation (RAG).

Microsoft's annual Build developer conference opened on Tuesday. Intel principal software architect Saurabh Tangri and AI application research team lead Guy Boudoukh presented the current state of AI PC development and application trends.

According to Tangri, AI agents and generative AI applications give PC users unparalleled capabilities. AI PCs include optimized builds of OpenVINO and DirectML that run generative AI models such as Phi-3 efficiently on CPUs, GPUs, and NPUs. Developers can deploy AI agents that reason and use tools to act, run models efficiently on the AI PC with speculative decoding and quantization, and apply them to use cases such as personal assistants, secure local chat, code generation, and retrieval-augmented generation (RAG).
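To make the speculative decoding technique concrete, here is a minimal sketch using Hugging Face transformers' assisted generation, in which a small "draft" model proposes tokens cheaply and the larger target model verifies them. The draft checkpoint name is a placeholder: assisted generation requires a draft that shares the target's tokenizer, and the exact models in Intel's stack are not specified in the talk.

```python
# Minimal sketch of speculative (assisted) decoding with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "microsoft/Phi-3-mini-4k-instruct"
draft_id = "my-org/phi3-draft-tiny"  # hypothetical tokenizer-compatible draft model

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id)
draft = AutoModelForCausalLM.from_pretrained(draft_id)

inputs = tokenizer("Summarize my spending this month:", return_tensors="pt")
# The draft proposes several tokens at a time; the target verifies them in a
# single forward pass and keeps the longest accepted prefix, cutting latency.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```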

Tangri said that current AI technology already allows some of these capabilities to be built into the platform. Because language models are trained once on static data, users need a way to run them against information the models never saw; retrieval-augmented generation (RAG) supplies that information at query time, extending what the AI can do.

As an example, he said that in a consumer scenario a question you often run into is "am I over budget?" AI now lets you bring in your private data, analyze it with an advanced LLM (large language model), and extract conclusions and actions from it.

"This element is very novel. I'm really excited about this; it's our first time showing this complete pipeline, from RAG to the LLM to reasoning and acting, all running on your PC. It's very interesting and very cutting edge."


Guy Boudoukh then demonstrated the Phi-3 small model family, including a multimodal variant, running on an Intel Core Ultra processor: responses from a Phi-3 AI agent, interaction with private data, and how users can talk to documents and generate answers through RAG.

Boudoukh explained that the front end of the Phi-3 ReAct agent is the prompt: the instructions and context the user provides to the language model to perform the required task, which can be chat or Q&A. ReAct prompting, he noted, was first introduced by Princeton University and Google last year; it is a new prompting method whose name stands for reasoning and acting.
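Intel's exact prompt is not shown in the talk; the following is an illustrative ReAct-style template in Python, following the format popularized by the original ReAct paper, to show what such a front end can look like.

```python
# Illustrative ReAct prompt template; the wording in Intel's demo may differ.
REACT_TEMPLATE = """Answer the question using the tools below.
Available tools: {tool_names}

Use this format:
Question: the user's question
Thought: reason about what to do next
Action: the tool to use, one of [{tool_names}]
Action Input: the input to the tool
Observation: the tool's output
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the answer to the question

Question: {question}
"""
```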

This approach, he said, lets the LLM do more than simply generate text: it can use tools and take actions to better handle user input. The LLM can combine tools such as RAG, Gmail, Wikipedia, and Bing Search, some of which access private data on the device while others access the internet.

A user query first goes into the ReAct template, which is then fed to the Phi-3 agent, and the agent decides whether a tool is needed to answer the query. If so, it calls the tool, the tool's output is appended to the prompt dialog, and control returns to the agent. The agent may decide it needs another tool to answer the question, in which case the process repeats. Only when the agent determines it has enough information to answer the user's query does it generate a final answer.
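As a rough sketch of that loop, the following Python function (reusing the REACT_TEMPLATE above) alternates model generation with tool calls until the model emits a final answer. Here `llm` and the tool functions are placeholders for Phi-3 and real RAG/Gmail/search backends, not Intel's actual implementation.

```python
import re

def run_agent(llm, tools, question, max_steps=5):
    """llm(prompt, stop=...) -> str; tools maps tool name -> callable."""
    prompt = REACT_TEMPLATE.format(tool_names=", ".join(tools), question=question)
    for _ in range(max_steps):
        # The model continues the prompt; generation is stopped before it
        # would hallucinate an Observation on its own.
        reply = llm(prompt, stop=["Observation:"])
        prompt += reply
        if "Final Answer:" in reply:  # agent decided it has enough information
            return reply.split("Final Answer:", 1)[1].strip()
        action = re.search(r"Action: (.*)", reply)
        arg = re.search(r"Action Input: (.*)", reply)
        if action and arg and action.group(1).strip() in tools:
            # Call the chosen tool and feed its output back into the prompt.
            result = tools[action.group(1).strip()](arg.group(1).strip())
            prompt += f"\nObservation: {result}\n"
    return "Could not answer within the step limit."
```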


In the presentation, Boudoukh asked how many teams participated in the Champions League this year. The agent reasoned that RAG was needed to answer the question, so it searched 160 BBC Sport news articles; Boudoukh then asked the agent to send the answer via Gmail, so the agent called Gmail, another tool, to complete the task.

Boudoukh then walked through how the Phi-3 agent performs RAG. RAG, he said, lets LLMs access external knowledge by injecting retrieved information. First, the user indexes hundreds or even thousands of files on the device; the files are embedded and the embeddings saved to a vector database (vector DB). Then, when the user submits a query, relevant information is retrieved from the database and combined with the query into a new unified prompt, which is injected into the LLM to generate an answer.
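Here is a minimal sketch of that index-retrieve-prompt flow, using sentence-transformers embeddings and a brute-force cosine search in place of a production vector DB; the embedding model and chunk contents are illustrative, not what Intel used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder

# 1) Index: embed document chunks once and store them (the "vector DB").
chunks = ["Document text chunk 1 ...", "Document text chunk 2 ..."]
index = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=3):
    # 2) Retrieve: embed the query and find the nearest chunks by cosine score.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(index @ q)[::-1][:k]
    return [chunks[i] for i in top]

def rag_prompt(query):
    # 3) Build a unified prompt from the query plus retrieved context; this
    # prompt is then injected into the LLM (generation call omitted here).
    context = "\n".join(retrieve(query))
    return f"Use the context to answer.\nContext:\n{context}\n\nQuestion: {query}"
```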

RAG has several advantages, he said. First, it extends the LLM's knowledge without retraining the model. Second, it uses data efficiently, because only the retrieved passages, not entire documents, are provided to the model. This reduces model hallucination and increases reliability, because the answer is grounded in the data retrieved to produce it.


In the demonstration that followed, Boudoukh bypassed the agent and asked the model directly how many teams participated in the Champions League this year, at first without RAG. The model generated the wrong answer, 32 teams, when in fact 36 teams participated this year. He then enabled RAG, asked the same question, and got the right answer.

Boudoukh said this shows developers how the software stack distributes work across the NPU, CPU, and integrated GPU. For example, Whisper, the speech recognition model in the demo, runs on the NPU; Phi-3 inference runs on the integrated GPU; and the database search runs on the CPU.
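With the OpenVINO runtime, pinning each model to a device looks roughly like the sketch below. The model file paths are placeholders, and the actual demo stack may wire this up differently.

```python
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU'] on a Core Ultra

# Compile each model for the device best suited to its workload.
whisper = core.compile_model("whisper_encoder.xml", device_name="NPU")
phi3 = core.compile_model("phi3.xml", device_name="GPU")
# Embedding/search workloads can stay on the CPU.
embedder = core.compile_model("embedder.xml", device_name="CPU")
```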

Finally, Boudoukh demonstrated the LLaVA Phi-3 multimodal model. He explained that the model is trained on both vision and language, so it can handle multimodal tasks involving text and images. He fed an image into the model and asked it to describe the scene. The model gave a detailed account of the scene and even recommended it as a spot for fishing and relaxing.
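As a sketch of that kind of multimodal inference via Hugging Face transformers, assuming the community LLaVA-Phi-3 checkpoint xtuner/llava-phi-3-mini-hf and its chat format (both assumptions, not confirmed by the talk):

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-phi-3-mini-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("lake.jpg")  # illustrative input image
# Assumed Phi-3 style chat format with an <image> slot for the picture.
prompt = "<|user|>\n<image>\nDescribe this scene.<|end|>\n<|assistant|>\n"
inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```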


He also showed one of the core parts of the demo code, the LLM inference section. Running Phi-3 LLM inference on an Intel Core Ultra processor is easy, he said: define the model name, define the quantization configuration, load the model, load the tokenizer, provide some example prompts, tokenize the input, and generate the results. The demo uses the optimized build of OpenVINO for AI PCs.
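Those steps map naturally onto the optimum-intel OpenVINO integration. A minimal sketch, assuming the public microsoft/Phi-3-mini-4k-instruct checkpoint and 4-bit weight quantization (the demo's exact code was not published):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"        # define the model name
qconfig = OVWeightQuantizationConfig(bits=4)         # define quantization config
model = OVModelForCausalLM.from_pretrained(          # load (and export) the model
    model_id, export=True, quantization_config=qconfig
)
tokenizer = AutoTokenizer.from_pretrained(model_id)  # load the tokenizer

inputs = tokenizer("What is an AI PC?", return_tensors="pt")  # tokenize the input
outputs = model.generate(**inputs, max_new_tokens=64)         # generate results
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```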


Tangri called this a wonderful demonstration of an AI PC running LLMs. Real-world AI, he said, rests on four pillars: efficiency, security, the ability to work in concert with the network, and developer readiness. If you have the first three but developers aren't ready, you won't be able to innovate on the platform.

High efficiency, he said, means extending the device's battery life rather than simply chasing headline teraflops (trillions of floating-point operations per second). "At the end of the day, what we're really looking for is customer experience and user experience, which involves combining a natural language interface with a graphical user interface. So at the end of the day, we're looking for experience rather than false performance metrics."

Tangri said Intel has worked with Microsoft over the past few years to establish standards such as ONNX (Open Neural Network Exchange). On developer readiness, he said Intel now has working demonstrations of cutting-edge research that run entirely in a PC environment. "So we've really catered to developers' needs and lowered the threshold for innovating on our platform; none of it needs to run online or in the cloud, it can all be done on your PC."
