Demand for Real-time AI Inference From Groq Accelerates Week Over Week

PR Newswire ·  04/02 08:30

70,000 Developers in the Playground on GroqCloud and 19,000 New Applications Running on the LPU Inference Engine

MOUNTAIN VIEW, Calif., April 2, 2024 /PRNewswire/ -- Groq, a generative AI solutions company, announced today that more than 70,000 new developers are using GroqCloud and more than 19,000 new applications are running on the LPU Inference Engine via the Groq API. The rapid migration to GroqCloud since its launch on March 1st indicates a clear demand for real-time inference as developers and companies seek lower latency and greater throughput for their generative and conversational AI applications.
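
As an illustration, a request from a developer's application to the LPU Inference Engine might look like the sketch below. The endpoint URL, model identifier, and response shape are assumptions based on GroqCloud's OpenAI-compatible chat-completions interface; consult the GroqCloud documentation for current values.

```python
# Minimal sketch of calling the Groq API (not taken from this press release).
# Assumption: GroqCloud exposes an OpenAI-compatible chat-completions endpoint
# and a hosted Llama 2 70B model id; check the GroqCloud docs for current
# URLs, model names, and authentication details.
import os
import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint
API_KEY = os.environ["GROQ_API_KEY"]  # key issued from the GroqCloud playground

payload = {
    "model": "llama2-70b-4096",  # assumed model id for Llama-2 70B
    "messages": [
        {"role": "user", "content": "Summarize why low-latency inference matters."}
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```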

"From AI influencers and startups to government agencies and large enterprises, the enthusiastic reception of GroqCloud from the developer community has been truly exciting," said GroqCloud General Manager, Sunny Madra. "I'm not surprised by the unprecedented level of interest in GroqCloud. It's clear that developers are hungry for low-latency AI inference capabilities, and we're thrilled to see how it's being used to bring innovative ideas to life. Every few hours, a new app is launched or updated that uses our API."

The total addressable market (TAM) for AI chips is projected to reach $119.4B by 2027. Today, ~40% of AI chips are used for inference, which alone would put the TAM for inference chips at ~$48B by 2027. Once applications reach maturity, however, they often allocate 90-95% of compute resources to inference, indicating a much larger market over time. The world is just beginning to explore the possibilities AI presents, and the share of chips devoted to inference is likely to grow as more applications and products reach the market, making the ~$48B figure an extremely conservative estimate. With nearly every industry and government worldwide looking to leverage generative and/or conversational AI, the TAM for AI chips, and for systems dedicated to inference in particular, appears to be limitless.
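
The inference figure follows directly from the two numbers cited above; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the inference TAM figures cited above.
total_tam_2027_billion = 119.4   # projected AI-chip TAM by 2027 ($B)
inference_share_today = 0.40     # ~40% of AI chips used for inference today

inference_tam = total_tam_2027_billion * inference_share_today
print(f"Inference TAM today's share: ~${inference_tam:.0f}B")  # ~$48B

# If mature applications shift 90-95% of resources to inference,
# the implied share of the same projected TAM is far larger:
for share in (0.90, 0.95):
    print(f"At {share:.0%} inference share: ~${total_tam_2027_billion * share:.0f}B")
```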

"GPUs are great. They're what got AI here today," said Groq CEO and Founder, Jonathan Ross. "When customers ask me whether they should still buy GPUs I say, 'Absolutely, if you're doing training because they're optimal for the 5-10% of the resources you'll dedicate to training, but for the 90-95% of resources you'll dedicate to inference, and where you need real-time speed and reasonable economics, let's talk about LPUs.' As the adage goes, 'what got us here won't get us there.' Developers need low latency inference. The LPU is the enabler of that lower latency and that's what's driving them to GroqCloud."

GPUs are great for training models, bulk batch processing, and running visualization-heavy workloads while LPUs specialize in running real-time deployments of Large Language Models (LLMs) and other AI inference workloads that deliver actionable insights. The LPU fills a gap in the market by providing the real-time inference required to make generative AI a reality in a cost- and energy-efficient way via the Groq API.

Chip Design & Architecture Matter
Real-time AI inference is a specialized system problem. Both hardware and software play a role in speed and latency. No amount of software can overcome hardware bottlenecks created by chip design and architecture.

First, the Groq Compiler is fully deterministic and schedules every memory load, operation, and packet transmission exactly when needed. The LPU Inference Engine never has to wait for a cache that has yet to be filled, resend a packet because of a collision, or pause for memory to load – all of which plague traditional data centers using GPUs for inference. Instead, the Groq Compiler plans every single operation and transmission down to the clock cycle, ensuring the highest possible performance and fastest system response.
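
The toy sketch below illustrates the general idea of a fully static, compile-time schedule. It is a conceptual illustration only and does not reflect the Groq Compiler's actual instruction format or output.

```python
# Toy illustration of static (compile-time) scheduling.
# Every operation is pinned to a cycle ahead of time, so execution never
# waits on cache fills, packet retries, or runtime arbitration.
static_schedule = [
    (0, "load weights tile 0"),        # hypothetical operations, for
    (1, "load activations tile 0"),    # illustration only
    (2, "matmul tile 0"),
    (3, "send result to neighbor chip"),
]

for cycle, op in static_schedule:
    # In a deterministic design the runtime simply follows the plan;
    # there is no cache-miss stall or retry path to handle here.
    print(f"cycle {cycle}: {op}")
```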

Second, the LPU is based on a single-core deterministic architecture, making it faster for LLMs than GPUs by design. The Groq LPU Inference Engine relies on SRAM for memory, which is 100x faster than the HBM memory used by GPUs. Furthermore, HBM is dynamic and has to be refreshed a dozen or so times per second. While the impact on performance isn't necessarily large compared to the slower memory speed, it does complicate program optimization.
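
As a rough sanity check on that refresh figure, assuming the conventional 64 ms refresh window used by DRAM (HBM included):

```python
# Quick check of the "dozen or so times per second" refresh claim,
# assuming the conventional 64 ms DRAM retention/refresh window.
refresh_period_s = 0.064                      # 64 ms per full refresh cycle (assumption)
refreshes_per_second = 1 / refresh_period_s
print(f"~{refreshes_per_second:.1f} full refreshes per second")  # ~15.6, i.e. "a dozen or so"
```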

No CUDA Necessary
GPU architecture is complicated, making it difficult to program efficiently. Enter: CUDA. CUDA abstracts away the complex GPU architecture and makes it possible to program. Each new model must also be accelerated with highly tuned CUDA kernels, which in turn require substantial validation and testing, creating more work and adding complexity to the software stack.

Conversely, the Groq LPU Inference Engine does not require CUDA or kernels – which are essentially low-level hardware instructions – because of the Tensor Streaming architecture of the LPU. The LPU design is elegantly simple because the Groq Compiler maps operations directly onto the LPU without any hand-tuning or experimentation. Furthermore, Groq compiles models quickly and with high performance because it doesn't require custom "kernels" to be written for new operations – a requirement that hamstrings GPUs when it comes to inference speed and latency.

Prioritizing AI's Carbon Footprint Through Efficient Design
LLMs are estimated to grow in size by 10x every year, making AI output incredibly costly when using GPUs. While scaling up yields some economies, energy efficiency will continue to be an issue when working within the GPU architecture because data still needs to move back and forth between the chips and HBM for every single compute task. Constantly shuffling data quickly burns joules of energy, generates heat, and increases the need for cooling, which, in turn, requires even more energy.

Understanding that energy consumption and cooling costs play fundamental roles in compute cost, Groq designed the LPU hardware to work essentially as an AI token factory, maximizing efficiency. As a result, the current-generation LPU is 10x more energy-efficient than the most energy-efficient GPU available today because its assembly-line approach minimizes off-chip data flow. The Groq LPU Inference Engine is the only available solution that leverages an efficiently designed hardware and software system to satisfy today's low carbon footprint requirements while still delivering an unparalleled user experience and production rate.

What Supply Chain Challenges?
From day one, Groq has understood that depending on scarce materials and a complex, global supply chain would increase risk and hinder growth and revenue. Groq has side-stepped supply chain challenges by designing a chip that relies on neither 4-nanometer silicon nor HBM, which is in extremely short supply, to deliver record-breaking speeds. In fact, the current-generation LPU is made with 14-nanometer silicon, and it consistently delivers 300 tokens per second per user when running Llama-2 70B. The LPU is the only AI chip designed, engineered, and manufactured entirely in North America.
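
Taken at face value, the cited 300 tokens-per-second-per-user throughput implies response times along the lines below; the response lengths are illustrative assumptions.

```python
# What the cited 300 tokens/second/user figure implies for response time
# on Llama-2 70B (response lengths are illustrative assumptions).
tokens_per_second = 300

for response_tokens in (100, 500, 1000):
    seconds = response_tokens / tokens_per_second
    print(f"{response_tokens:4d}-token response: ~{seconds:.1f} s")
```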

About Groq
Groq is a generative AI solutions company and the creator of the LPU Inference Engine, the fastest language processing accelerator on the market. It is architected from the ground up to achieve low-latency, energy-efficient, and repeatable inference performance at scale. Customers rely on the LPU Inference Engine as an end-to-end solution for running Large Language Models and other generative AI applications at 10x the speed. Groq Systems powered by the LPU Inference Engine are available for purchase. Customers can also leverage the LPU Inference Engine for experimentation and production-ready applications via an API in GroqCloud by purchasing Tokens-as-a-Service. Jonathan Ross, inventor of the Google Tensor Processing Unit, founded Groq to preserve human agency while building the AI economy. Experience Groq speed for yourself at groq.com.

Media Contact for Groq
Allyson Scott
[email protected]

SOURCE Groq
