Welcome to Semi-Literate, a guide to the chip industry through the lens of public policy.
BLUF: The consolidation of leading-edge semiconductor manufacturing in one company (TSMC) in one country (Taiwan) poses two problems for artificial intelligence: a short-term strategic risk and a long-term innovation risk. Both problems remain under-explored because advances in artificial intelligence algorithms have compensated for a relative lack of advances in AI hardware.
Background
Artificial intelligence (AI) has a hardware problem. Most of the conversation about AI’s hardware problem focuses on technical obstacles and solutions. However, the AI hardware problem is also market-driven: the global semiconductor industry, which creates the hardware on which all AI operates, faces cyclical and structural challenges that pose potential roadblocks to ongoing innovation by AI designers and users.
In March 2021, the congressionally mandated National Security Commission on Artificial Intelligence (NSCAI) released a 756-page report on the state of U.S. AI competitiveness. Among its many findings, the report observed that U.S. leadership in microelectronics is critical to overall U.S. leadership in AI. However, the report went on to note that, “despite tremendous expertise in microelectronics research, development, and innovation across the country, the United States is constrained by a lack of domestically located semiconductor fabrication facilities, especially for state-of-the-art semiconductors.”
It is precisely these semiconductors that power today’s most advanced AI systems. Sustained U.S. leadership in AI is contingent on sustained U.S. firm leadership in semiconductors. However, U.S. leadership in semiconductors has faltered in recent years, at exactly the same time that interest in AI has peaked yet again. This divergence has important strategic implications for the United States and broader implications for the AI innovation environment. The limited number of advanced semiconductor manufacturers shapes the evolution of AI chips and thus the depth, breadth, and pace of AI innovation.
What Are AI Chips and How Do Semiconductors Enable AI
AI, the science of making computer systems intelligent, relies on a combination of algorithms, hardware, data, and talent. AI algorithms are the computational techniques and theorems developed by researchers to create predictive models. AI hardware, or “compute,” refers to commercial off-the-shelf or custom integrated circuits on which AI algorithms function. The term “data” refers to digital information, structured or unstructured, on which AI algorithms can be trained to ultimately make predictive inferences. Finally, talent refers to the highly skilled human know-how that synthesizes the hardware, software, and data into an AI system that creates value.
The development of these four components of AI has occurred at different stages since the 1980s. The enthusiasm AI has generated in the past decade is largely attributable to a favorable confluence of breakthroughs in AI hardware and data. However, chokepoints in the AI hardware supply chain and the semiconductor innovation ecosystem threaten to stymie current advances in AI. In particular, the fragility of the semiconductor supply chain, the concentration of advanced manufacturing resources in Taiwan, the sole-source reliance of most AI systems on chips from these Taiwanese manufacturers, and the slowing of Moore’s law all indicate that hardware is the determinant of AI competitiveness most likely to undermine current breakthroughs.
AI hardware refers to certain semiconductors or “chips” that power all information technologies. While AI algorithms can run on any microprocessor, most models benefit from the speed and efficiency of specialized chips. Three types of chips are commonly used for AI workloads: graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). Each type involves tradeoffs among speed, efficiency, accuracy, and flexibility (Table 1).
GPUs, FPGAs, and ASICs are each optimized for different AI workloads. GPUs were originally designed for image-processing applications and found widespread commercial adoption in gaming systems before their more recent pivot to AI training tasks. GPUs efficiently and accurately process parallel computations, making them ideal for AI training, in which an algorithm is iteratively “taught” using labeled data. FPGAs were invented in the 1980s and offer greater flexibility than other types of chips, allowing for post-fabrication customization depending on the AI workload. FPGAs are well suited to AI inference tasks, where efficiency and customization matter more. ASICs are specially designed chips optimized to execute specific algorithms quickly and efficiently. ASICs can handle AI training or inference workloads depending on designer choices, but because their circuitry is customized for specific algorithms, they inherently lose relevance as AI algorithms evolve.
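The training/inference distinction above can be made concrete with a toy example (a hypothetical sketch in plain Python, not tied to any particular chip): training iteratively adjusts parameters against labeled data, while inference is a single fixed-parameter pass, which is why the two workloads reward different hardware.

```python
# Toy illustration (hypothetical sketch, not tied to any specific chip):
# training iteratively updates weights from labeled data; inference is a
# single forward pass with the weights frozen.
data = [(-1.0, 0), (1.0, 1), (2.0, 1)]   # (input, label) pairs
w, b, lr = 0.0, 0.0, 0.1                 # weight, bias, learning rate

# Training: many iterative passes over labeled data (GPU-friendly at scale,
# since real models perform these updates as large parallel matrix operations).
for _ in range(100):
    for x, y in data:
        pred = 1 if w * x + b > 0 else 0
        err = y - pred
        w += lr * err * x                # perceptron-style update
        b += lr * err

# Inference: one fixed-weight pass per input (the workload FPGAs and
# inference-oriented ASICs are typically tuned for).
def infer(x):
    return 1 if w * x + b > 0 else 0

print([infer(x) for x, _ in data])       # matches the labels: [0, 1, 1]
```

Real AI workloads replace this scalar update with billions of parallel multiply-accumulate operations, which is where chip choice begins to dominate cost and speed.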
Companies engaged in AI research seek competitive advantages through better hardware. Semiconductor firms, which generally do not lead in AI research, compete to maximize the performance of their hardware. These competitive pressures have generally benefited recent advances in AI. Semiconductor manufacturers increase the computational power of their chips by fitting more transistors on progressively smaller pieces of silicon. AI companies leverage advances in chip performance and cost by designing chips that efficiently execute the specific calculations associated with machine learning (ML). This combination of specific design choices and advanced hardware is essential for leading AI firms because training a leading AI algorithm can require months of computing time and cost tens of millions of U.S. dollars.

The high fixed costs of semiconductor fabrication mean AI firms have focused their research and development efforts on algorithms, data, and talent. As a result, no leading firms engaged in AI research manufacture their own hardware, choosing instead to design chips and outsource their fabrication to contract semiconductor manufacturers. These specialized “AI chips” are essential for cost-effectively implementing AI at scale; attempts to deliver comparable AI applications using older AI chips or general-purpose chips can cost tens to thousands of times more. As a result, AI firms are keen to secure access to the most advanced compute. However, after 65+ years of innovation at a relatively predictable pace, the end of Dennard scaling has coincided with a loss of U.S. semiconductor manufacturing leadership, presenting both strategic and innovation risks to AI.
U.S. Firm Leadership in AI Hardware
There are no U.S. semiconductor firms that lead in AI and no U.S. AI firms that lead in semiconductors. As a result, AI firms large and small compete to secure access to state-of-the-art semiconductor manufacturing and leverage hardware advantages for their AI efforts. As the number of leading-edge manufacturers has shrunk, however, these AI firms have all been left competing for capacity at the same ever-diminishing set of leading-edge fabs. Many of the most celebrated AI chips in the United States, like Google’s Tensor Processing Unit (TPU), Cerebras’s Wafer Scale Engine (WSE), and Nvidia’s A100 processor, share one thing in common: they are manufactured by Taiwan Semiconductor Manufacturing Company (TSMC). While U.S. firms continue to lead the world in designing custom silicon to handle AI workloads, the acute lack of advanced domestic manufacturing assets poses a supply chain risk that is under-analyzed by the AI community in the United States.
The loss of U.S. leadership in semiconductor manufacturing, and its impact on AI chip development, is perhaps most apparent in Intel’s efforts to develop AI chips. After completing the US$2 billion acquisition of AI chipmaker Habana Labs in 2019, Intel suffered a series of process slips that delayed commercial rollout of its 10- and 7-nm chip manufacturing processes by several years. At the same time, TSMC and Samsung both ramped up commercial production of 7-nm chips. Though Intel operates 15 wafer fabs in 10 countries, Habana Labs was forced to partner with TSMC to take advantage of the most advanced manufacturing process available and ensure its AI chips would remain competitive. This is a remarkable admission by the once-leading U.S. semiconductor manufacturer that it can no longer meet the fabrication needs of companies it acquires. Yet the decision to turn to TSMC is commonplace among leading U.S. AI companies. Table 2 presents the leading AI hardware developed by U.S. firms, organized by the manufacturer each firm relies on for fabrication.
This reliance on Taiwan for AI hardware fabrication holds for the chip industry more generally. A recent semiconductor supply chain risk review conducted by the White House concluded, “The United States relies primarily on Taiwan for leading edge logic chips and relies on Taiwan, South Korea, and China to meet demand for mature node chips.” The review identified fragile supply chains, malicious supply chain disruptions, and geopolitical factors as key risks associated with this geographic concentration of leading-edge fabrication resources. These findings echo those of the NSCAI.
How Access to Advanced Manufacturing Bottlenecks AI Hardware Innovation
In addition to presenting a strategic risk to U.S. AI firms, the concentration of advanced semiconductor manufacturing in Asia also constrains innovation in the AI start-up world, shaping the depth, breadth, and pace of innovation in the AI field. Data on AI hardware start-ups indicate there are roughly 225 firms engaged in the design of AI accelerators. These 225 firms have raised approximately US$20.8 billion, though the 10 largest AI hardware start-ups account for US$12.7 billion (61%) of that total, and several of them have been acquired or are publicly held. Most AI hardware start-ups are headquartered in the United States (~85) and China (~75), followed by the United Kingdom (10), France (9), and Israel (8). However, AI hardware start-ups are a negligible part of the broader AI/ML start-up ecosystem: investment data indicate they accounted for only 3% of all AI and ML venture capital deals from 2017 to 2021. Because of the high cost of getting their designs fabricated, U.S. AI hardware start-ups are unappealing to private investors. In fact, DARPA observed that from 2012–2017 there was “virtually no early stage venture capital funding for U.S. chip start-ups [AI or otherwise] due to the perceived cost of first working silicon exceeding US$20 million.”
While this ecosystem contains a wide variety of firms engaged in ASIC-, GPU-, and FPGA-specific designs for AI optimization, this diversity is not reflected in the manufacturers on whom these start-ups rely. AI hardware start-ups disproportionately turn to TSMC for manufacturing services, and securing those services correlates strongly with larger subsequent funding rounds. TSMC is the leader in AI hardware manufacturing at advanced nodes. Table 3 presents a list of 23 AI start-ups developing hardware solutions, as compiled by analysts at Credit Suisse. Of these 23 start-ups, seven have an unknown manufacturing roadmap (the foundry and/or the node at which they propose to manufacture is unknown). Of the remaining 16, TSMC provides manufacturing services to 11. GlobalFoundries is the only other contract manufacturer that counts more than one AI hardware start-up as a customer, coming in well behind TSMC with three. The 11 start-ups that use TSMC have collectively raised US$4.065 billion, while the three using GlobalFoundries have raised US$500 million.
In effect, TSMC serves as a “kingmaker” in the AI hardware start-up world: the “best” manufacture with TSMC, while the “rest” settle for other manufacturers. However, TSMC charges high rates and serves purely as a contract manufacturer, taking business only from start-ups that can afford its services. These start-ups compete for TSMC’s capacity with the likes of Nvidia and AMD, along with Apple’s consumer electronics business. TSMC does not accept business based on the technical merits of the AI start-up itself, meaning that presumably some of the most innovative AI start-ups today are not getting access to the best manufacturing because they cannot compete on price. Interestingly, many AI start-ups that manufacture with TSMC do so at more mature nodes: only four of the 11 start-ups had access to TSMC’s then-leading-edge 7-nm process, while the remainder manufactured at 16 and 28 nm, nodes for which substantial capacity exists elsewhere, at the likes of Semiconductor Manufacturing International Corporation (SMIC) (China), United Microelectronics Corporation (Taiwan), and GlobalFoundries (United States). This suggests that AI start-ups prefer to manufacture with TSMC for reasons that extend beyond purely technical merit, and that reputational affiliation with TSMC is itself seen as highly valuable. While some recent AI breakthroughs have not relied on particularly powerful computing architectures, AI companies generally remain (understandably) myopic in their pursuit of maximum compute.
How AI Algorithmic Advances Compensate for AI Hardware
Advances in AI algorithms have compensated for a relative lack of advances in AI compute and the increasing consolidation of firms capable of making the most advanced AI hardware. Researchers have stated that “the recent success of deep neural networks has been driven by advances in algorithms and network architectures, but also, notably, through the growing availability of vast amounts of data and the continuing development of ever more powerful computers.” However, the development of more powerful compute has lagged the other key components of AI. Recently, researchers at OpenAI observed that advances in AI algorithmic efficiency have improved performance such that it now “takes 44 times less compute to train a neural network to the level of AlexNet (by contrast, Moore’s Law would yield an 11x cost improvement over this period).”
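The Moore’s law baseline in that comparison is simple exponential compounding. A back-of-envelope sketch, assuming the measurement spans 2012 (AlexNet) to 2019 as in OpenAI’s analysis:

```python
# Back-of-envelope check of the figures quoted above (sketch; the
# 2012-2019 window is an assumption based on OpenAI's AlexNet baseline).
years = 2019 - 2012                      # span of the measurement
moore_doubling_years = 2.0               # Moore's law: ~2x every 2 years
moore_gain = 2 ** (years / moore_doubling_years)
algorithmic_gain = 44.0                  # efficiency gain quoted in the text

print(f"Moore's law alone: ~{moore_gain:.1f}x")                # ~11.3x
print(f"Algorithms outpaced hardware by ~{algorithmic_gain / moore_gain:.1f}x")
```

In other words, over that window algorithmic efficiency improved roughly four times faster than hardware cost alone would predict.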
AI’s hardware problems have been remarked upon before. However, these observations have focused primarily on the technical aspects of hardware challenges. For example, in April 2018, the journal Nature observed that “the performance of the typical hardware platforms—graphics processing units (GPUs), field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs)—cannot keep pace with the increasing size of leading deep neural network designs.” The divergence in progress between AI algorithms and AI hardware is striking considering that hardware is ever more intensively consumed by AI tasks, making it all the more important. Stanford’s 2021 AI Index Report observed that “[the training time on ImageNet] has fallen from 6.2 min (December 2018) to 47 s (July 2020). At the same time, the amount of hardware used to achieve these results has increased dramatically,” with the number of AI accelerators used to accomplish the task growing from 640 to 4,096. This observation echoes a 2018 OpenAI analysis, which found the “amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period).”
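To put those two doubling times on the same footing, each can be annualized (a minimal sketch; the per-year factors are implied by the quoted doubling times, not stated in the sources):

```python
# Convert doubling times into annual growth factors (sketch based on the
# doubling times quoted above; assumes smooth exponential growth).
def annual_growth(doubling_months: float) -> float:
    return 2 ** (12.0 / doubling_months)

compute_growth = annual_growth(3.4)      # largest AI training runs (OpenAI)
moore_growth = annual_growth(24.0)       # Moore's law baseline

print(f"Training compute: ~{compute_growth:.1f}x per year")   # ~11.5x
print(f"Moore's law:      ~{moore_growth:.2f}x per year")     # ~1.41x
```

Demand for AI compute, on these figures, has been growing roughly an order of magnitude faster per year than the hardware improvement Moore’s law historically delivered.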
This juxtaposition—that AI hardware is increasingly important, but its production is increasingly consolidated—presents a risk to overall advances in AI. Several technical solutions have been proposed to address hardware bottlenecks by more tightly coupling memory and processing (either via nanoscale resistive memory devices or by combining in-memory processing with digital processing). However, these solutions address only the technical challenges of AI hardware. The deeper problem remains that all new or novel solutions to AI’s technical hardware problems run through a small number of companies capable of producing these chips.
The views expressed here are my own and not those of employers past or present.