
OpenAI and Google are launching supercharged AI assistants. Here’s how you can try them out.

15 May 2024 at 14:18

This week, Google and OpenAI both announced they’ve built supercharged AI assistants: tools that can converse with you in real time and recover when you interrupt them, analyze your surroundings via live video, and translate conversations on the fly. 

OpenAI struck first on Monday, when it debuted its new flagship model GPT-4o. The live demonstration showed it reading bedtime stories and helping to solve math problems, all in a voice that sounded eerily like Joaquin Phoenix’s AI girlfriend in the movie Her (a trait not lost on CEO Sam Altman). 

On Tuesday, Google announced its own new tools, including a conversational assistant called Gemini Live, which can do many of the same things. It also revealed that it’s building a sort of “do-everything” AI agent, still in development, that will not be released until later this year.

Soon you’ll be able to try these tools for yourself and gauge whether you’ll turn to them in your daily routine as much as their makers hope, or whether they’re more like a sci-fi party trick that eventually loses its charm. Here’s what you should know about how to access these new tools, what you might use them for, and how much it will cost. 

OpenAI’s GPT-4o

What it’s capable of: The model can talk with you in real time, with a response delay of about 320 milliseconds, which OpenAI says is on par with natural human conversation. You can ask the model to interpret anything you point your smartphone camera at, and it can provide assistance with tasks like coding or translating text. It can also summarize information, and generate images, fonts, and 3D renderings. 

How to access it: OpenAI says it will start rolling out GPT-4o’s text and vision features in the web interface as well as the GPT app, but has not set a date. The company says it will add the voice functions in the coming weeks, although it’s yet to set an exact date for this either. Developers can access the text and vision features in the API now, but voice mode will launch only to a “small group” of developers initially.

How much it costs: Use of GPT-4o will be free, but OpenAI will set caps on how much you can use the model before you need to upgrade to a paid plan. Those who join one of OpenAI’s paid plans, which start at $20 per month, will have five times more capacity on GPT-4o. 

Google’s Gemini Live 

What is Gemini Live? This is the Google product most comparable to GPT-4o—a version of the company’s AI model that you can speak with in real time. Google says that you’ll also be able to use the tool to communicate via live video “later this year.” The company promises it will be a useful conversational assistant for things like preparing for a job interview or rehearsing a speech.

How to access it: Gemini Live launches in “the coming months” via Google’s premium AI plan, Gemini Advanced. 

How much it costs: Gemini Advanced offers a two-month free trial period and costs $20 per month thereafter. 

But wait, what’s Project Astra? Astra is a project to build a do-everything AI agent, which was demoed at Google’s I/O conference but will not be released until later this year.

People will be able to use Astra through their smartphones and possibly desktop computers, but the company is exploring other options too, such as embedding it into smart glasses or other devices, Oriol Vinyals, vice president of research at Google DeepMind, told MIT Technology Review.

Which is better?

It’s hard to tell without getting our hands on the full versions of these models ourselves. Google showed off Project Astra through a polished video, whereas OpenAI opted to debut GPT-4o via a seemingly more authentic live demonstration, but in both cases, the models were asked to do things the designers likely already practiced. The real test will come when they’re debuted to millions of users with unique demands.  

That said, if you compare OpenAI’s published videos with Google’s, the two leading tools look very similar, at least in their ease of use. To generalize, GPT-4o seems to be slightly ahead on audio, demonstrating realistic voices, conversational flow, and even singing, whereas Project Astra shows off more advanced visual capabilities, like being able to “remember” where you left your glasses. OpenAI’s decision to roll out the new features more quickly might mean its product will get more use at first than Google’s, which won’t be fully available until later this year. It’s too soon to tell which model “hallucinates” false information less often or creates more useful responses.

Are they safe?

Both OpenAI and Google say their models are well tested: OpenAI says GPT-4o was evaluated by more than 70 experts in fields like misinformation and social psychology, and Google has said that Gemini “has the most comprehensive safety evaluations of any Google AI model to date, including for bias and toxicity.” 

But these companies are building a future where AI models search, vet, and evaluate the world’s information for us to serve up a concise answer to our questions. Even more so than with simpler chatbots, it’s wise to remain skeptical about what they tell you.

Additional reporting by Melissa Heikkilä.

OpenAI’s new GPT-4o lets people interact using voice or video in the same model

13 May 2024 at 15:27

OpenAI just debuted GPT-4o, a new kind of AI model that you can communicate with in real time via live voice conversation, video streams from your phone, and text. The model is rolling out over the next few weeks and will be free for all users through both the GPT app and the web interface, according to the company. Users who subscribe to OpenAI’s paid tiers, which start at $20 per month, will be able to make more requests. 

OpenAI CTO Mira Murati led the live demonstration of the new release one day before Google is expected to unveil its own AI advancements at its flagship I/O conference on Tuesday, May 14. 

GPT-4 offered similar capabilities, giving users multiple ways to interact with OpenAI’s AI offerings. But it siloed them in separate models, leading to longer response times and presumably higher computing costs. GPT-4o has now merged those capabilities into a single model, which Murati called an “omnimodel.” That means faster responses and smoother transitions between tasks, she said.

The result, the company’s demonstration suggests, is a conversational assistant much in the vein of Siri or Alexa but capable of fielding much more complex prompts.

“We’re looking at the future of interaction between ourselves and the machines,” Murati said of the demo. “We think that GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural.”

Barret Zoph and Mark Chen, both researchers at OpenAI, walked through a number of applications for the new model. Most impressive was its facility with live conversation. You could interrupt the model during its responses, and it would stop, listen, and adjust course. 

OpenAI showed off the ability to change the model’s tone, too. Chen asked the model to read a bedtime story “about robots and love,” quickly jumping in to demand a more dramatic voice. The model got progressively more theatrical until Murati demanded that it pivot quickly to a convincing robot voice (which it excelled at). There were, predictably, some short pauses as the model reasoned through what to say next, but the exchange still stood out as a remarkably naturally paced AI conversation. 

The model can reason through visual problems in real time as well. Using his phone, Zoph filmed himself writing an algebra equation (3x + 1 = 4) on a sheet of paper while GPT-4o followed along. He instructed it not to provide answers, but instead to guide him much as a teacher would.

“The first step is to get all the terms with x on one side,” the model said in a friendly tone. “So, what do you think we should do with that plus one?”
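
For reference, the full solution the model was nudging Zoph toward takes only two steps:

```latex
\begin{aligned}
3x + 1 &= 4 \\
3x &= 4 - 1 = 3 \quad &&\text{(subtract 1 from both sides)} \\
x &= 3/3 = 1 \quad &&\text{(divide both sides by 3)}
\end{aligned}
```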

Like previous generations of GPT, GPT-4o will store records of users’ interactions with it, meaning the model “has a sense of continuity across all your conversations,” according to Murati. Other new highlights include live translation, the ability to search through your conversations with the model, and the power to look up information in real time. 

As is the nature of a live demo, there were hiccups and glitches. GPT-4o’s voice sometimes jumped in awkwardly during the conversation, and it appeared to comment on one of the presenters’ outfits even though it wasn’t asked to. But it recovered well when the demonstrators told the model it had erred. It seems to be able to respond quickly and helpfully across several mediums that other models have not yet merged as effectively. 

Previously, many of OpenAI’s most powerful features, like reasoning through image and video, were behind a paywall. GPT-4o marks the first time they’ll be opened up to the wider public, though it’s not yet clear how many interactions you’ll be able to have with the model before being charged. OpenAI says paying subscribers will “continue to have up to five times the capacity limits of our free users.” 

Additional reporting by Will Douglas Heaven.

Correction: This story has been updated to reflect that the Memory feature, which stores past conversations, is not new to GPT-4o but has existed in previous models.

What’s next in chips

13 May 2024 at 05:00

MIT Technology Review’s What’s Next series looks across industries, trends, and technologies to give you a first look at the future. You can read the rest of them here.

Thanks to the boom in artificial intelligence, the world of chips is on the cusp of a huge tidal shift. There is heightened demand for chips that can train AI models faster and ping them from devices like smartphones and satellites, enabling us to use these models without disclosing private data. Governments, tech giants, and startups alike are racing to carve out their slices of the growing semiconductor pie. 

Here are four trends to look for in the year ahead that will define what the chips of the future will look like, who will make them, and which new technologies they’ll unlock.

CHIPS Acts around the world

On the outskirts of Phoenix, two of the world’s largest chip manufacturers, TSMC and Intel, are racing to construct campuses in the desert that they hope will become the seats of American chipmaking prowess. One thing the efforts have in common is their funding: in March, President Joe Biden announced $8.5 billion in direct federal funds and $11 billion in loans for Intel’s expansions around the country. Weeks later, another $6.6 billion was announced for TSMC. 

The awards are just a portion of the US subsidies pouring into the chips industry via the $280 billion CHIPS and Science Act signed in 2022. The money means that any company with a foot in the semiconductor ecosystem is analyzing how to restructure its supply chains to benefit from the cash. While much of the money aims to boost American chip manufacturing, there’s room for other players to apply, from equipment makers to niche materials startups.

But the US is not the only country trying to onshore some of the chipmaking supply chain. Japan is spending $13 billion on its own equivalent to the CHIPS Act, Europe will be spending more than $47 billion, and earlier this year India announced a $15 billion effort to build local chip plants. The roots of this trend go all the way back to 2014, says Chris Miller, a professor at Tufts University and author of Chip War: The Fight for the World’s Most Critical Technology. That’s when China started offering massive subsidies to its chipmakers. 

“This created a dynamic in which other governments concluded they had no choice but to offer incentives or see firms shift manufacturing to China,” he says. That threat, coupled with the surge in AI, has led Western governments to fund alternatives. In the next year, this might have a snowball effect, with even more countries starting their own programs for fear of being left behind.

The money is unlikely to lead to brand-new chip competitors or fundamentally restructure who the biggest chip players are, Miller says. Instead, it will mostly incentivize dominant players like TSMC to establish roots in multiple countries. But funding alone won’t be enough to do that quickly—TSMC’s effort to build plants in Arizona has been mired in missed deadlines and labor disputes, and Intel has similarly failed to meet its promised deadlines. And it’s unclear whether, whenever the plants do come online, their equipment and labor force will be capable of the same level of advanced chipmaking that the companies maintain abroad.

“The supply chain will only shift slowly, over years and decades,” Miller says. “But it is shifting.”

More AI on the edge

Currently, most of our interactions with AI models like ChatGPT are done via the cloud. That means that when you ask GPT to pick out an outfit (or to be your boyfriend), your request pings OpenAI’s servers, prompting the model housed there to process it and draw conclusions (known as “inference”) before a response is sent back to you. Relying on the cloud has some drawbacks: it requires internet access, for one, and it also means some of your data is shared with the model maker.  
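
For a concrete sense of that round trip, here’s a minimal sketch in Python. The endpoint URL, API key, model name, and response fields are hypothetical placeholders rather than any real provider’s API; the point is simply that the prompt leaves your device, inference happens on a remote server, and only the reply comes back.

```python
import requests  # third-party HTTP library; `pip install requests`

# Hypothetical endpoint, key, and model name -- placeholders for illustration only.
API_URL = "https://api.example-ai-provider.com/v1/chat"
API_KEY = "YOUR_API_KEY"

def ask_cloud_model(prompt: str) -> str:
    """Send a prompt to a remote model and return its reply.

    Nothing runs locally: the prompt travels to the provider's servers,
    inference happens there, and only the finished reply is sent back.
    """
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "example-model", "prompt": prompt},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["reply"]  # hypothetical response schema

if __name__ == "__main__":
    print(ask_cloud_model("Pick out an outfit for a rainy spring day."))
```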

That’s why there’s been a lot of interest and investment in edge computing for AI, where the process of pinging the AI model happens directly on your device, like a laptop or smartphone. With the industry increasingly working toward a future in which AI models know a lot about us (Sam Altman described his killer AI app to me as one that knows “absolutely everything about my whole life, every email, every conversation I’ve ever had”), there’s a demand for faster “edge” chips that can run models without sharing private data. These chips face different constraints from the ones in data centers: they typically have to be smaller, cheaper, and more energy efficient. 

The US Department of Defense is funding a lot of research into fast, private edge computing. In March, its research wing, the Defense Advanced Research Projects Agency (DARPA), announced a partnership with chipmaker EnCharge AI to create an ultra-powerful edge computing chip used for AI inference. EnCharge AI is working to make a chip that enables enhanced privacy but can also operate on very little power. This will make it suitable for military applications like satellites and off-grid surveillance equipment. The company expects to ship the chips in 2025.

Today, AI models are mostly constrained to data centers, and they will always rely on the cloud for some applications. But new investment and interest in improving edge computing could bring faster chips, and therefore more AI, to our everyday devices. If edge chips get small and cheap enough, we’re likely to see even more AI-driven “smart devices” in our homes and workplaces.

“A lot of the challenges that we see in the data center will be overcome,” says EnCharge AI cofounder Naveen Verma. “I expect to see a big focus on the edge. I think it’s going to be critical to getting AI at scale.”

Big Tech enters the chipmaking fray

In industries ranging from fast fashion to lawn care, companies are paying exorbitant amounts in computing costs to create and train AI models for their businesses. Examples include models that employees can use to scan and summarize documents, as well as externally facing technologies like virtual agents that can walk you through how to repair your broken fridge. That means demand for cloud computing to train those models is through the roof. 

The companies providing the bulk of that computing power are Amazon, Microsoft, and Google. For years these tech giants have dreamed of increasing their profit margins by making chips for their data centers in-house rather than buying from companies like Nvidia, a giant with a near monopoly on the most advanced AI training chips and a value larger than the GDP of 183 countries. 

Amazon started its effort in 2015, acquiring startup Annapurna Labs. Google moved next in 2018 with its own chips called TPUs. Microsoft launched its first AI chips in November, and Meta unveiled a new version of its own AI training chips in April.

That trend could tilt the scales away from Nvidia. But Nvidia doesn’t only play the role of rival in the eyes of Big Tech: regardless of their own in-house efforts, cloud giants still need its chips for their data centers. That’s partly because their own chipmaking efforts can’t fulfill all their needs, but it’s also because their customers expect to be able to use top-of-the-line Nvidia chips.

“This is really about giving the customers the choice,” says Rani Borkar, who leads hardware efforts at Microsoft Azure. She says she can’t envision a future in which Microsoft supplies all chips for its cloud services: “We will continue our strong partnerships and deploy chips from all the silicon partners that we work with.”

As cloud computing giants attempt to poach a bit of market share away from chipmakers, Nvidia is also attempting the converse. Last year the company started its own cloud service so customers can bypass Amazon, Google, or Microsoft and get computing time on Nvidia chips directly. As this dramatic struggle over market share unfolds, the coming year will be about whether customers see Big Tech’s chips as akin to Nvidia’s most advanced chips, or more like their little cousins. 

Nvidia battles the startups 

Despite Nvidia’s dominance, there is a wave of investment flowing toward startups that aim to outcompete it in certain slices of the chip market of the future. Those startups all promise faster AI training, but they have different ideas about which flashy computing technology will get them there, from quantum to photonics to reversible computation. 

But Murat Onen, the 28-year-old founder of one such chip startup, Eva, which he spun out of his PhD work at MIT, is blunt about what it’s like to start a chip company right now.

“The king of the hill is Nvidia, and that’s the world that we live in,” he says.

Many of these companies, like SambaNova, Cerebras, and Graphcore, are trying to change the underlying architecture of chips. Imagine an AI accelerator chip as constantly having to shuffle data back and forth between different areas: a piece of information is stored in the memory zone but must move to the processing zone, where a calculation is made, and then be stored back to the memory zone for safekeeping. All that takes time and energy. 
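
A toy back-of-the-envelope model makes the cost of that shuffling concrete. The energy numbers below are made-up illustrative values, not measurements of any real chip; the point is only that moving data between zones can dwarf the cost of computing on it.

```python
# Toy model of the memory-shuttling overhead described above.
# All numbers are illustrative assumptions, not real hardware figures.

MOVE_ENERGY_PJ = 10.0     # energy to move one value between memory and processor
COMPUTE_ENERGY_PJ = 1.0   # energy to perform one multiply-accumulate

def conventional_chip(num_ops: int) -> float:
    """Each operation moves two inputs in, computes, and writes the result back."""
    moves_per_op = 3
    return num_ops * (moves_per_op * MOVE_ENERGY_PJ + COMPUTE_ENERGY_PJ)

def in_memory_chip(num_ops: int) -> float:
    """Compute happens where the data lives, so the shuttling cost disappears."""
    return num_ops * COMPUTE_ENERGY_PJ

ops = 1_000_000
print(f"conventional: {conventional_chip(ops) / 1e6:.1f} microjoules")
print(f"in-memory:    {in_memory_chip(ops) / 1e6:.1f} microjoules")
```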

Making that process more efficient would deliver faster and cheaper AI training to customers, but only if the chipmaker has good enough software to allow the AI training company to seamlessly transition to the new chip. If the software transition is too clunky, model makers such as OpenAI, Anthropic, and Mistral are likely to stick with big-name chipmakers. That means companies taking this approach, like SambaNova, are spending a lot of their time not just on chip design but on software design too.

Onen is proposing changes one level deeper. Instead of traditional transistors, which have delivered greater efficiency over decades by getting smaller and smaller, he’s using a new component called a proton-gated transistor that he says Eva designed specifically for the mathematical needs of AI training. It allows devices to store and process data in the same place, saving time and computing energy. The idea of using such a component for AI inference dates back to the 1960s, but researchers could never figure out how to use it for AI training, in part because of a materials roadblock—it requires a material that can, among other qualities, precisely control conductivity at room temperature. 

One day in the lab, “through optimizing these numbers, and getting very lucky, we got the material that we wanted,” Onen says. “All of a sudden, the device is not a science fair project.” That raised the possibility of using such a component at scale. After months of working to confirm that the data was correct, he founded Eva, and the work was published in Science.

But in a sector where so many founders have promised—and failed—to topple the dominance of the leading chipmakers, Onen frankly admits that it will be years before he’ll know if the design works as intended and if manufacturers will agree to produce it. Leading a company through that uncertainty, he says, requires flexibility and an appetite for skepticism from others.

“I think sometimes people feel too attached to their ideas, and then kind of feel insecure that if this goes away there won’t be anything next,” he says. “I don’t think I feel that way. I’m still looking for people to challenge us and say this is wrong.”

Google DeepMind’s new AlphaFold can model a much larger slice of biological life

8 May 2024 at 11:00

Google DeepMind has released an improved version of its biology prediction tool, AlphaFold, that can predict the structures not only of proteins but of nearly all the elements of biological life.

It’s a development that could help accelerate drug discovery and other scientific research. The tool is currently being used to experiment with identifying everything from resilient crops to new vaccines. 

While the previous model, released in 2020, amazed the research community with its ability to predict protein structures, researchers have been clamoring for the tool to handle more than just proteins. 

Now, DeepMind says, AlphaFold 3 can predict the structures of DNA, RNA, and molecules like ligands, which are essential to drug discovery. DeepMind says the tool provides a more nuanced and dynamic portrait of molecule interactions than anything previously available. 

“Biology is a dynamic system,” DeepMind CEO Demis Hassabis told reporters on a call. “Properties of biology emerge through the interactions between different molecules in the cell, and you can think about AlphaFold 3 as our first big sort of step toward [modeling] that.”

AlphaFold 2 helped us better map the human heart, model antimicrobial resistance, and identify the eggs of extinct birds, but we don’t yet know what advances AlphaFold 3 will bring. 

Mohammed AlQuraishi, an assistant professor of systems biology at Columbia University who is unaffiliated with DeepMind, thinks the new version of the model will be even better for drug discovery. “The AlphaFold 2 system only knew about amino acids, so it was of very limited utility for biopharma,” he says. “But now, the system can in principle predict where a drug binds a protein.”

Isomorphic Labs, a drug discovery spinoff of DeepMind, is already using the model for exactly that purpose, collaborating with pharmaceutical companies to try to develop new treatments for diseases, according to DeepMind. 

AlQuraishi says the release marks a big leap forward. But there are caveats.

“It makes the system much more general, and in particular for drug discovery purposes (in early-stage research), it’s far more useful now than AlphaFold 2,” he says. But as with most models, the impact of AlphaFold will depend on how accurate its predictions are. For some uses, AlphaFold 3 has double the success rate of similar leading models like RoseTTAFold. But for others, like protein-RNA interactions, AlQuraishi says it’s still very inaccurate. 

DeepMind says that depending on the interaction being modeled, accuracy can range from 40% to over 80%, and the model will let researchers know how confident it is in its prediction. With less accurate predictions, researchers have to use AlphaFold merely as a starting point before pursuing other methods. Regardless of these ranges in accuracy, if researchers are trying to take the first steps toward answering a question like which enzymes have the potential to break down the plastic in water bottles, it’s vastly more efficient to use a tool like AlphaFold than experimental techniques such as x-ray crystallography. 

A revamped model  

AlphaFold 3’s larger library of molecules and higher level of complexity required improvements to the underlying model architecture. So DeepMind turned to diffusion techniques, which AI researchers have been steadily improving in recent years and which now power image and video generators like OpenAI’s DALL-E 2 and Sora. These techniques work by training a model to start with a noisy image and then reduce that noise bit by bit until an accurate prediction emerges. That method allows AlphaFold 3 to handle a much larger set of inputs.
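
Here’s a minimal sketch of that denoising loop in Python, using NumPy and a made-up toy “denoiser” in place of the trained network AlphaFold 3 actually uses; it illustrates the start-from-noise, refine-step-by-step idea rather than anything about the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained denoising network: given a noisy sample and a noise
# level, return a slightly cleaner estimate. Here we simply nudge the sample
# toward a known "target" to keep the sketch self-contained.
TARGET = np.array([1.0, -2.0, 0.5, 3.0])

def toy_denoiser(x: np.ndarray, noise_level: float) -> np.ndarray:
    return x + (TARGET - x) * min(1.0, 0.3 / max(noise_level, 1e-6))

def sample(num_steps: int = 50) -> np.ndarray:
    """Start from pure noise and strip it away bit by bit, as in diffusion sampling."""
    x = rng.normal(size=TARGET.shape)                # begin with random noise
    for step in range(num_steps, 0, -1):
        noise_level = step / num_steps               # decreasing noise schedule
        x = toy_denoiser(x, noise_level)             # predict a cleaner estimate
        x += rng.normal(scale=0.05 * noise_level, size=x.shape)  # keep a little noise
    return x

print(sample())  # converges toward TARGET as the noise is removed
```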

That marked “a big evolution from the previous model,” says John Jumper, director at Google DeepMind. “It really simplified the whole process of getting all these different atoms to work together.”

It also presented new risks. As the AlphaFold 3 paper details, the use of diffusion techniques made it possible for the model to hallucinate, or generate structures that look plausible but in reality could not exist. Researchers reduced that risk by adding more training data to the areas most prone to hallucination, though that doesn’t eliminate the problem completely. 

Restricted access

Part of AlphaFold 3’s impact will depend on how DeepMind divvies up access to the model. For AlphaFold 2, the company released the open-source code, allowing researchers to look under the hood to gain a better understanding of how it worked. It was also available for all purposes, including commercial use by drugmakers. For AlphaFold 3, Hassabis said, there are no current plans to release the full code. The company is instead releasing a public interface for the model called the AlphaFold Server, which imposes limitations on which molecules can be experimented with and can only be used for noncommercial purposes. DeepMind says the interface will lower the technical barrier and broaden the use of the tool to biologists who are less knowledgeable about this technology.

The new restrictions are significant, according to AlQuraishi. “The system’s main selling point—its ability to predict protein–small molecule interactions—is basically unavailable for public use,” he says. “It’s mostly a teaser at this point.”

Sam Altman says helpful agents are poised to become AI’s killer function

1 May 2024 at 15:52

A number of moments from my brief sit-down with Sam Altman brought the OpenAI CEO’s worldview into clearer focus. The first was when he pointed to my iPhone SE (the widely disliked model that still has a home button) and said, “That’s the best iPhone.” More revealing, though, was the vision he sketched for how AI tools will become even more enmeshed in our daily lives than the smartphone.

“What you really want,” he told MIT Technology Review, “is just this thing that is off helping you.” Altman, who was visiting Cambridge for a series of events hosted by Harvard and the venture capital firm Xfund, described the killer app for AI as a “super-competent colleague that knows absolutely everything about my whole life, every email, every conversation I’ve ever had, but doesn’t feel like an extension.” It could tackle some tasks instantly, he said, and for more complex ones it could go off and make an attempt, but come back with questions for you if it needs to. 

It’s a leap from OpenAI’s current offerings. Its leading applications, like DALL-E, Sora, and ChatGPT (which Altman referred to as “incredibly dumb” compared with what’s coming next), have wowed us with their ability to generate convincing text and surreal videos and images. But they mostly remain tools we use for isolated tasks, and they have limited capacity to learn about us from our conversations with them. 

In the new paradigm, as Altman sees it, the AI will be capable of helping us outside the chat interface and taking real-world tasks off our plates. 

Altman on AI hardware’s future 

I asked Altman if we’ll need a new piece of hardware to get to this future. Though smartphones are extraordinarily capable, and their designers are already incorporating more AI-driven features, some entrepreneurs are betting that the AI of the future will require a device that’s more purpose-built. Some of these devices are already beginning to appear in his orbit. There is the (widely panned) wearable AI Pin from Humane, for example (Altman is an investor in the company but has not exactly been a booster of the device). He is also rumored to be working with former Apple designer Jony Ive on some new type of hardware. 

But Altman says there’s a chance we won’t necessarily need a device at all. “I don’t think it will require a new piece of hardware,” he told me, adding that the type of app envisioned could exist in the cloud. But he quickly added that even if this AI paradigm shift doesn’t require consumers to buy new hardware, “I think you’ll be happy to have [a new device].” 

Though Altman says he thinks AI hardware devices are exciting, he also implied he might not be best suited to take on the challenge himself: “I’m very interested in consumer hardware for new technology. I’m an amateur who loves it, but this is so far from my expertise.”

On the hunt for training data

Upon hearing his vision for powerful AI-driven agents, I wondered how it would square with the industry’s current scarcity of training data. To build GPT-4 and other models, OpenAI has scoured internet archives, newspapers, and blogs for training data, since scaling laws have long shown that making models bigger also makes them better. But finding more data to train on is a growing problem. Much of the internet has already been slurped up, and access to private or copyrighted data is now mired in legal battles. 

Altman is optimistic this won’t be a problem for much longer, though he didn’t articulate the specifics. 

“I believe, but I’m not certain, that we’re going to figure out a way out of this thing of you always just need more and more training data,” he says. “Humans are existence proof that there is some other way to [train intelligence]. And I hope we find it.”

On who will be poised to create AGI

OpenAI’s central vision has long revolved around the pursuit of artificial general intelligence (AGI), or an AI that can reason as well as or better than humans. Its stated mission is to ensure such a technology “benefits all of humanity.” It is far from the only company pursuing AGI, however. So in the race for AGI, what are the most important tools? I asked Altman if he thought the entity that marshals the largest amount of chips and computing power will ultimately be the winner. 

Altman suspects there will be “several different versions [of AGI] that are better and worse at different things,” he says. “You’ll have to be over some compute threshold, I would guess. But even then I wouldn’t say I’m certain.”

On when we’ll see GPT-5

You thought he’d answer that? When another reporter in the room asked Altman if he knew when the next version of GPT is slated to be released, he gave a calm response. “Yes,” he replied, smiling, and said nothing more. 

The robot race is fueling a fight for training data

30 April 2024 at 05:00

Since ChatGPT was released, we’ve been interacting with AI tools more directly—and regularly—than ever before. 

But interacting with robots, by way of contrast, is still a rarity for most. If you don’t undergo complex surgery or work in logistics, the most advanced robot you encounter in your daily life might still be a vacuum cleaner (if you’re feeling young, the first Roomba was released 22 years ago). 

But that’s on the cusp of changing. Roboticists believe that by using new AI techniques, they will achieve something the field has pined after for decades: more capable robots that can move freely through unfamiliar environments and tackle challenges they’ve never seen before. 

“It’s like being strapped to the front of a rocket,” Russ Tedrake, vice president of robotics research at the Toyota Research Institute, says of the field’s pace right now. Tedrake says he has seen plenty of hype cycles rise and fall, but none like this one. “I’ve been in the field for 20-some years. This is different,” he says. 

But something is slowing that rocket down: lack of access to the types of data used to train robots so they can interact more smoothly with the physical world. It’s far harder to come by than the data used to train the most advanced AI models like GPT—mostly text, images, and videos scraped off the internet. Simulation programs can help robots learn how to interact with places and objects, but the results still tend to fall prey to what’s known as the “sim-to-real gap,” or failures that arise when robots move from the simulation to the real world. 

For now, we still need access to physical, real-world data to train robots. That data is relatively scarce and tends to require a lot more time, effort, and expensive equipment to collect. That scarcity is one of the main things currently holding progress in robotics back. 

As a result, leading companies and labs are in fierce competition to find new and better ways to gather the data they need. It’s led them down strange paths, like using robotic arms to flip pancakes for hours on end, watching thousands of hours of graphic surgery videos pulled from YouTube, or deploying researchers to numerous Airbnbs in order to film every nook and cranny. Along the way, they’re running into the same sorts of privacy, ethics, and copyright issues as their counterparts in the world of chatbots. 

The new need for data

For decades, robots were trained on specific tasks, like picking up a tennis ball or doing a somersault. While humans learn about the physical world through observation and trial and error, many robots were learning through equations and code. This method was slow, but even worse, it meant that robots couldn’t transfer skills from one task to a new one. 

But now, AI advances are fast-tracking a shift that had already begun: letting robots teach themselves through data. Just as a language model can learn from a library’s worth of novels, robot models can be shown a few hundred demonstrations of a person washing ketchup off a plate using robotic grippers, for example, and then imitate the task without being taught explicitly what ketchup looks like or how to turn on the faucet. This approach is bringing faster progress and machines with much more general capabilities. 
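
One common way to frame that kind of learning from demonstrations is behavioral cloning: supervised learning that maps what the robot observes to the action the demonstrator took. Below is a minimal sketch in PyTorch with randomly generated stand-in data; real systems would load thousands of recorded teleoperation frames instead.

```python
import torch
from torch import nn

# Toy demonstration data: each row pairs an observation (e.g. gripper pose and
# camera features) with the action the human demonstrator took next.
obs_dim, act_dim, num_frames = 16, 4, 2048
observations = torch.randn(num_frames, obs_dim)
actions = torch.randn(num_frames, act_dim)

# A small policy network that maps observations to actions.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, act_dim),
)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Behavioral cloning: regress the demonstrator's actions from the observations.
for epoch in range(200):
    predicted = policy(observations)
    loss = loss_fn(predicted, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At run time the robot feeds its current observation through the policy
# and executes the predicted action.
with torch.no_grad():
    print(policy(torch.randn(1, obs_dim)))
```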

Now every leading company and lab is trying to enable robots to reason their way through new tasks using AI. Whether they succeed will hinge on whether researchers can find enough diverse types of data to fine-tune models for robots, as well as novel ways to use reinforcement learning to let them know when they’re right and when they’re wrong. 

“A lot of people are scrambling to figure out what’s the next big data source,” says Pras Velagapudi, chief technology officer of Agility Robotics, which makes a humanoid robot that operates in warehouses for customers including Amazon. The answers to Velagapudi’s question will help define what tomorrow’s machines will excel at, and what roles they may fill in our homes and workplaces. 

Prime training data

To understand how roboticists are shopping for data, picture a butcher shop. There are prime, expensive cuts ready to be cooked. There are the humble, everyday staples. And then there’s the case of trimmings and off-cuts lurking in the back, requiring a creative chef to make them into something delicious. They’re all usable, but they’re not all equal.

For a taste of what prime data looks like for robots, consider the methods adopted by the Toyota Research Institute (TRI). Amid a sprawling laboratory in Cambridge, Massachusetts, equipped with robotic arms, computers, and a random assortment of everyday objects like dustpans and egg whisks, researchers teach robots new tasks through teleoperation, creating what’s called demonstration data. A human might use a robotic arm to flip a pancake 300 times in an afternoon, for example.

The model processes that data overnight, and then often the robot can perform the task autonomously the next morning, TRI says. Since the demonstrations show many iterations of the same task, teleoperation creates rich, precisely labeled data that helps robots perform well in new tasks.

The trouble is, creating such data takes ages, and it’s also limited by the number of expensive robots you can afford. To create quality training data more cheaply and efficiently, Shuran Song, head of the Robotics and Embodied AI Lab at Stanford University, designed a handheld device that can be used more nimbly and built at a fraction of the cost. Essentially a lightweight plastic gripper, it can collect data while you use it for everyday activities like cracking an egg or setting the table. The data can then be used to train robots to mimic those tasks. Using simpler devices like this could fast-track the data collection process.

Open-source efforts

Roboticists have recently alighted upon another method for getting more teleoperation data: sharing what they’ve collected with each other, thus saving them the laborious process of creating data sets alone. 

The Distributed Robot Interaction Dataset (DROID), published last month, was created by researchers at 13 institutions, including companies like Google DeepMind and top universities like Stanford and Carnegie Mellon. It contains 350 hours of data generated by humans doing tasks ranging from closing a waffle maker to cleaning up a desk. Since the data was collected using hardware that’s common in the robotics world, researchers can use it to create AI models and then test those models on equipment they already have. 

The effort builds on the success of the Open X-Embodiment Collaboration, a similar project from Google DeepMind that aggregated data on 527 skills, collected from a variety of different types of hardware. The data set helped build Google DeepMind’s RT-X model, which can turn text instructions (for example, “Move the apple to the left of the soda can”) into physical movements. 

Robotics models built on open-source data like this can be impressive, says Lerrel Pinto, a researcher who runs the General-purpose Robotics and AI Lab at New York University. But they can’t perform across a wide enough range of use cases to compete with proprietary models built by leading private companies. What is available via open source is simply not enough for labs to successfully build models at a scale that would produce the gold standard: robots that have general capabilities and can receive instructions through text, image, and video.

“The biggest limitation is the data,” he says. Only wealthy companies have enough. 

These companies’ data advantage is only getting more thoroughly cemented over time. In their pursuit of more training data, private robotics companies with large customer bases have a not-so-secret weapon: their robots themselves are perpetual data-collecting machines.

Covariant, a robotics company founded in 2017 by OpenAI researchers, deploys robots trained to identify and pick items in warehouses for companies like Crate & Barrel and Bonprix. These machines constantly collect footage, which is then sent back to Covariant. Every time the robot fails to pick up a bottle of shampoo, for example, it becomes a data point to learn from, and the model improves its shampoo-picking abilities for next time. The result is a massive, proprietary data set collected by the company’s own machines. 

This data set is part of why earlier this year Covariant was able to release a powerful foundation model, as AI models capable of a variety of uses are known. Customers can now communicate with its commercial robots much as you’d converse with a chatbot: you can ask questions, show photos, and instruct it to take a video of itself moving an item from one crate to another. These customer interactions with the model, which is called RFM-1, then produce even more data to help it improve.

Peter Chen, cofounder and CEO of Covariant, says exposing the robots to a number of different objects and environments is crucial to the model’s success. “We have robots handling apparel, pharmaceuticals, cosmetics, and fresh groceries,” he says. “It’s one of the unique strengths behind our data set.” Up next will be bringing its fleet into more sectors and even having the AI model power different types of robots, like humanoids, Chen says.

Learning from video

The scarcity of high-quality teleoperation and real-world data has led some roboticists to propose bypassing that collection method altogether. What if robots could just learn from videos of people?

Such video data is easier to produce, but unlike teleoperation data, it lacks “kinematic” data points, which plot the exact movements of a robotic arm as it moves through space. 

Researchers from the University of Washington and Nvidia have created a workaround, building a mobile app that lets people train robots using augmented reality. Users take videos of themselves completing simple tasks with their hands, like picking up a mug, and the AR program can translate the results into waypoints for the robotics software to learn from. 

Meta AI is pursuing a similar collection method on a larger scale through its Ego4D project, a data set of more than 3,700 hours of video taken by people around the world doing everything from laying bricks to playing basketball to kneading bread dough. The data set is broken down by task and contains thousands of annotations, which detail what’s happening in each scene, like when a weed has been removed from a garden or a piece of wood is fully sanded.

Learning from video data means that robots can encounter a much wider variety of tasks than they could if they relied solely on human teleoperation (imagine folding croissant dough with robot arms). That’s important, because just as powerful language models need complex and diverse data to learn, roboticists can create their own powerful models only if they expose robots to thousands of tasks.

To that end, some researchers are trying to wring useful insights from a vast source of abundant but low-quality data: YouTube. With thousands of hours of video uploaded every minute, there is no shortage of available content. The trouble is that most of it is pretty useless for a robot. That’s because it’s not labeled with the types of information robots need, like annotations or kinematic data. 

“You can say [to a robot], Oh, this is a person playing Frisbee with their dog,” says Chen, of Covariant, imagining a typical video that might be found on YouTube. “But it’s very difficult for you to say, Well, when this person throws a Frisbee, this is the acceleration and the rotation and that’s why it flies this way.”

Nonetheless, a few attempts have proved promising. When he was a postdoc at Stanford, AI researcher Emmett Goodman looked into how AI could be brought into the operating room to make surgeries safer and more predictable. Lack of data quickly became a roadblock. In laparoscopic surgeries, surgeons often use robotic arms to manipulate surgical tools inserted through very small incisions in the body. Those robotic arms have cameras capturing footage that can help train models, once personally identifying information has been removed from the data. In more traditional open surgeries, on the other hand, surgeons use their hands instead of robotic arms. That produces much less data to build AI models with. 

“That is the main barrier to why open-surgery AI is the slowest to develop,” he says. “How do you actually collect that data?”

To tackle that problem, Goodman trained an AI model on thousands of hours of open-surgery videos, taken by doctors with handheld or overhead cameras, that his team gathered from YouTube (with identifiable information removed). His model, as described in a paper in the medical journal JAMA in December 2023, could then identify segments of the operations from the videos. This laid the groundwork for creating useful training data, though Goodman admits that the barriers to doing so at scale, like patient privacy and informed consent, have not been overcome. 

Uncharted legal waters

Chances are that wherever roboticists turn for their new troves of training data, they’ll at some point have to wrestle with some major legal battles. 

The makers of large language models are already having to navigate questions of credit and copyright. A lawsuit filed by the New York Times alleges that ChatGPT copies the expressive style of its stories when generating text. The chief technical officer of OpenAI recently made headlines when she said the company’s video generation tool Sora was trained on publicly available data, sparking a critique from YouTube’s CEO, who said that if Sora learned from YouTube videos, it would be a violation of the platform’s terms of service.

“It is an area where there’s a substantial amount of legal uncertainty,” says Frank Pasquale, a professor at Cornell Law School. If robotics companies want to join other AI companies in using copyrighted works in their training sets, it’s unclear whether that’s allowed under the fair-use doctrine, which permits copyrighted material to be used without permission in a narrow set of circumstances. An example often cited by tech companies and those sympathetic to their view is the 2015 case of Google Books, in which courts found that Google did not violate copyright laws in making a searchable database of millions of books. That legal precedent may tilt the scales slightly in tech companies’ favor, Pasquale says.

It’s far too soon to tell whether legal challenges will slow down the robotics rocket ship, since AI-related cases are sprawling and still undecided. But it’s safe to say that roboticists scouring YouTube or other internet video sources for training data will be wading in fairly uncharted waters.

The next era

Not every roboticist feels that data is the missing link for the next breakthrough. Some argue that if we build a good enough virtual world for robots to learn in, maybe we don’t need training data from the real world at all. Why go through the effort of training a pancake-flipping robot in a real kitchen, for example, if it could learn through a digital simulation of a Waffle House instead?

Roboticists have long used simulator programs, which digitally replicate the environments that robots navigate through, often down to details like the texture of the floorboards or the shadows cast by overhead lights. But as powerful as they are, roboticists using these programs to train machines have always had to work around that sim-to-real gap. 

Now the gap might be shrinking. Advanced image generation techniques and faster processing are allowing simulations to look more like the real world. Nvidia, which leveraged its experience in video game graphics to build the leading robotics simulator, called Isaac Sim, announced last month that leading humanoid robotics companies like Figure and Agility are using its program to build foundation models. These companies build virtual replicas of their robots in the simulator and then unleash them to explore a range of new environments and tasks.

Deepu Talla, vice president of robotics and edge computing at Nvidia, doesn’t hold back in predicting that this way of training will nearly replace the act of training robots in the real world. It’s simply far cheaper, he says.

“It’s going to be a million to one, if not more, in terms of how much stuff is going to be done in simulation,” he says. “Because we can afford to do it.”

But even if models can solve some of the “cognitive” problems, like learning new tasks, there is still a host of challenges to realizing that success in an effective and safe physical form, says Aaron Saunders, chief technology officer of Boston Dynamics. We’re a long way from building hardware that can sense different types of materials, scrub and clean, or apply a gentle amount of force.

“There’s still a massive piece of the equation around how we’re going to program robots to actually act on all that information to interact with that world,” he says.

If we solved that problem, what would the robotic future look like? We could see nimble robots that help people with physical disabilities move through their homes, autonomous drones that clean up pollution or hazardous waste, or surgical robots that make microscopic incisions, leading to operations with a reduced risk of complications. For all these optimistic visions, though, more controversial ones are already brewing. The use of AI by militaries worldwide is on the rise, and the emergence of autonomous weapons raises troubling questions.

The labs and companies poised to lead in the race for data include, at the moment, the humanoid-robot startups beloved by investors (Figure AI was recently boosted by a $675 million funding round), commercial companies with sizable fleets of robots collecting data, and drone companies buoyed by significant military investment. Meanwhile, smaller academic labs are doing more with less to create data sets that rival those available to Big Tech. 

But what’s clear to everyone I speak with is that we’re at the very beginning of the robot data race. Since the correct way forward is far from obvious, all roboticists worth their salt are pursuing any and all methods to see what sticks.

There “isn’t really a consensus” in the field, says Benjamin Burchfiel, a senior research scientist in robotics at TRI. “And that’s a healthy place to be.”

Here’s the defense tech at the center of US aid to Israel, Ukraine, and Taiwan

26 April 2024 at 09:55

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

After weeks of drawn-out congressional debate over how much the United States should spend on conflicts abroad, President Joe Biden signed a $95.3 billion aid package into law on Wednesday.

The bill will send a significant quantity of supplies to Ukraine and Israel, while also supporting Taiwan with submarine technology to aid its defenses against China. It’s also sparked renewed calls for stronger crackdowns on Iranian-produced drones. 

Though much of the money will go toward replenishing fairly standard munitions and supplies, the spending bill provides a window into US strategies around four key defense technologies that continue to reshape how today’s major conflicts are being fought.

For a closer look at the military technology at the center of the aid package, I spoke with Andrew Metrick, a fellow with the defense program at the Center for a New American Security, a think tank.

Ukraine and the role of long-range missiles

Ukraine has long sought the Army Tactical Missile System (ATACMS), a long-range ballistic missile made by Lockheed Martin. First used in combat during Operation Desert Storm in Iraq in 1991, it’s 13 feet long, two feet in diameter, and over 3,600 pounds. It can use GPS to accurately hit targets 190 miles away. 

Last year, President Biden was apprehensive about sending such missiles to Ukraine, as US stockpiles of the weapons were relatively low. In October, the administration changed tack. The US sent shipments of ATACMS, a move celebrated by President Volodymyr Zelensky of Ukraine, but they came with restrictions: the missiles were older models with a shorter range, and Ukraine was instructed not to fire them into Russian territory, only Ukrainian territory. 

This week, just hours before the new aid package was signed, multiple news outlets reported that the US had secretly sent more powerful long-range ATACMS to Ukraine several weeks before. They were used on Tuesday, April 23, to target a Russian airfield in Crimea and Russian troops in Berdiansk, 50 miles southwest of Mariupol.

The long range of the weapons has proved essential for Ukraine, says Metrick. “It allows the Ukrainians to strike Russian targets at ranges for which they have very few other options,” he says. That means being able to hit locations like supply depots, command centers, and airfields behind Russia’s front lines in Ukraine. This capacity has grown more important as Ukraine’s troop numbers have waned, Metrick says.

Replenishing Israel’s Iron Dome

On April 13, Iran launched its first-ever direct attack on Israeli soil. In the attack, which Iran says was retaliation for Israel’s airstrike on its embassy in Syria, hundreds of missiles were lobbed into Israeli airspace. Many of them were neutralized by the web of cutting-edge missile launchers dispersed throughout Israel, which can automatically detonate incoming projectiles in the air before they hit the ground. 

One of those systems is Israel’s Iron Dome, in which radar systems detect projectiles and then signal units to launch defensive missiles that detonate the target high in the sky before it strikes populated areas. Israel’s other system, called David’s Sling, works in a similar way but can identify rockets coming from a greater distance, upwards of 180 miles. 

Both systems are hugely costly to research and build, and the new US aid package allocates $15 billion to replenish their missile stockpile. The missiles can cost anywhere from $100,000 to $10 million each, and a system like Iron Dome might fire them daily during intense periods of conflict. 

The aid comes as funding for Israel has grown more contentious amid the dire conditions faced by displaced Palestinians in Gaza. While the spending bill worked its way through Congress, increasing numbers of Democrats sought to put conditions on the military aid to Israel, particularly after an Israeli air strike on April 1 killed seven aid workers from World Central Kitchen, an international food charity. The funding package does provide $9 billion in humanitarian assistance for the conflict, but the efforts to impose conditions for Israeli military aid failed. 

Taiwan and underwater defenses against China

A rising concern for the US defense community—and a subject of “wargaming” simulations that Metrick has carried out—is an amphibious invasion of Taiwan by China. The growing risk of that scenario has driven the US to build and deploy larger numbers of advanced submarines, Metrick says. A bigger fleet of these submarines would be more likely to keep attacks from China at bay, thereby protecting Taiwan.

The trouble is that the US shipbuilding effort, experts say, is too slow. It’s been hampered by budget cuts and labor shortages, but the new aid bill aims to jump-start it. It will provide $3.3 billion to do so, specifically for the production of Columbia-class submarines, which carry nuclear weapons, and Virginia-class submarines, which carry conventional weapons. 

Though these funds aim to support Taiwan by building up the US supply of submarines, the package also includes more direct support, like $2 billion to help it purchase weapons and defense equipment from the US. 

The US’s Iranian drone problem 

Shahed drones are used almost daily on the Russia-Ukraine battlefield, and Iran launched more than 100 against Israel earlier this month. Produced by Iran and resembling model planes, the drones are fast, cheap, and lightweight, capable of being launched from the back of a pickup truck. They’re used frequently for potent one-way attacks, where they detonate upon reaching their target. US experts say the technology is tipping the scales toward Russian and Iranian military groups and their allies. 

The trouble with combating them is partly one of cost. Shooting down the drones, which can be bought for as little as $40,000, can cost millions in ammunition.

“Shooting down Shaheds with an expensive missile is not, in the long term, a winning proposition,” Metrick says. “That’s what the Iranians, I think, are banking on. They can wear people down.”

This week’s aid package renewed White House calls for stronger sanctions aimed at curbing production of the drones. The United Nations previously passed rules restricting any drone-related material from entering or leaving Iran, but those expired in October. The US now wants them reinstated. 

Even if that happens, it’s unlikely the rules would do much to contain the Shahed’s dominance. The components of the drones are not all that complex or hard to obtain to begin with, and experts say that Iran has built a sprawling global supply chain to acquire the materials needed to manufacture them and has worked with Russia to build factories. 

“Sanctions regimes are pretty dang leaky,” Metrick says. “They [Iran] have friends all around the world.”

This US startup makes a crucial chip material and is taking on a Japanese giant

11 April 2024 at 12:04

It can be dizzying to try to understand all the complex components of a single computer chip: layers of microscopic components linked to one another through highways of copper wires, some barely wider than a few strands of DNA. Nestled between those wires is an insulating material called a dielectric, ensuring that the wires don’t touch and short out. Zooming in further, there’s one particular dielectric placed between the chip and the structure beneath it; this material, called dielectric film, is produced in sheets as thin as white blood cells. 

For 30 years, a single Japanese company called Ajinomoto has made billions producing this particular film. Competitors have struggled to outdo it, and today Ajinomoto holds more than 90% of the market for the product, which is used in everything from laptops to data centers. 

But now, a startup based in Berkeley, California, is embarking on a herculean effort to dethrone Ajinomoto and bring this small slice of the chipmaking supply chain back to the US.

Thintronics is promising a product purpose-built for the computing demands of the AI era—a suite of new materials that the company claims have higher insulating properties and, if adopted, could mean data centers with faster computing speeds and lower energy costs. 

The company is at the forefront of a coming wave of new US-based companies, spurred by the $280 billion CHIPS and Science Act, that is seeking to carve out a portion of the semiconductor sector, which has become dominated by just a handful of international players. But to succeed, Thintronics and its peers will have to overcome a web of challenges—solving technical problems, disrupting long-standing industry relationships, and persuading global semiconductor titans to accommodate new suppliers. 

“Inventing new materials platforms and getting them into the world is very difficult,” Thintronics founder and CEO Stefan Pastine says. It is “not for the faint of heart.”

The insulator bottleneck

If you recognize the name Ajinomoto, you’re probably surprised to hear it plays a critical role in the chip sector: the company is better known as the world’s leading supplier of MSG seasoning powder. In the 1990s, Ajinomoto discovered that a by-product of MSG made a great insulator, and it has enjoyed a near monopoly in the niche material ever since. 

But Ajinomoto doesn’t make any of the other parts that go into chips. In fact, the insulating materials in chips rely on dispersed supply chains: one layer uses materials from Ajinomoto, another uses material from another company, and so on, with none of the layers optimized to work in tandem. The resulting system works okay when data is being transmitted over short paths, but over longer distances, like between chips, weak insulators act as a bottleneck, wasting energy and slowing down computing speeds. That’s recently become a growing concern, especially as the scale of AI training gets more expensive and consumes eye-popping amounts of energy. (Ajinomoto did not respond to requests for comment.) 

None of this made much sense to Pastine, a chemist who sold his previous company, which specialized in recycling hard plastics, to an industrial chemicals company in 2019. Around that time, he started to believe that the chemicals industry could be slow to innovate, and he thought the same pattern was keeping chipmakers from finding better insulating materials. In the chip industry, he says, insulators have “kind of been looked at as the redheaded stepchild”—they haven’t seen the progress made with transistors and other chip components. 

He launched Thintronics that same year, with the hope that cracking the code on a better insulator could provide data centers with faster computing speeds at lower costs. That idea wasn’t groundbreaking—new insulators are constantly being researched and deployed—but Pastine believed that he could find the right chemistry to deliver a breakthrough. 

Thintronics says it will manufacture different insulators for all layers of the chip, for a system designed to swap into existing manufacturing lines. Pastine tells me the materials are now being tested with a number of industry players. But he declined to provide names, citing nondisclosure agreements, and similarly would not share details of the formula. 

Without more details, it’s hard to say exactly how well the Thintronics materials compare with competing products. The company recently tested its materials’ Dk values, which are a measure of how effective an insulator a material is. Venky Sundaram, a researcher who has founded multiple semiconductor startups but is not involved with Thintronics, reviewed the results. When compared to other build-up films—the dielectric category in which Thintronics is competing—their most impressive Dk values are better than those of any other material available today, he says.
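
Dk, or dielectric constant, describes how strongly a material responds to an electric field; for the films that separate chip wiring, a lower Dk generally means signals travel faster and couple less with neighboring lines. As a rough illustration of why the number matters, the back-of-the-envelope sketch below estimates how signal delay scales with Dk; the Dk values in it are made up for illustration, not figures from Thintronics or Ajinomoto.

```python
# A rough, illustrative sketch (not from the article) of why a dielectric's
# Dk value matters: signals travel more slowly, and wires couple more
# strongly, as Dk rises. All Dk values below are made up for illustration.
import math

C_VACUUM = 299_792_458  # speed of light in a vacuum, m/s

def propagation_velocity(dk: float) -> float:
    """Approximate signal velocity in a dielectric with relative permittivity Dk."""
    return C_VACUUM / math.sqrt(dk)

for dk in (3.2, 2.6, 2.0):  # hypothetical build-up-film Dk values
    v = propagation_velocity(dk)
    delay_ps_per_cm = 1e12 * 0.01 / v  # picoseconds to cross 1 cm of interconnect
    print(f"Dk={dk:.1f}: ~{delay_ps_per_cm:.1f} ps/cm")
```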

A rocky road ahead

Thintronics’ vision has already garnered some support. The company received a $20 million Series A funding round in March, led by venture capital firms Translink and Maverick, as well as a grant from the US National Science Foundation. 

The company is also seeking funding from the CHIPS Act. Signed into law by President Joe Biden in 2022, it’s designed to boost companies like Thintronics in order to bring semiconductor manufacturing back to American companies and reduce reliance on foreign suppliers. A year after it became law, the administration said that more than 450 companies had submitted statements of interest to receive CHIPS funding for work across the sector. 

The bulk of funding from the legislation is destined for large-scale manufacturing facilities, like those operated by Intel in New Mexico and Taiwan Semiconductor Manufacturing Company (TSMC) in Arizona. But US Secretary of Commerce Gina Raimondo has said she’d like to see smaller companies receive funding as well, especially in the materials space. In February, applications opened for a pool of $300 million earmarked specifically for materials innovation. While Thintronics declined to say how much funding it was seeking or from which programs, the company does see the CHIPS Act as a major tailwind.

But building a domestic supply chain for chips—a product that currently depends on dozens of companies around the globe—will mean reversing decades of specialization by different countries. And industry experts say it will be difficult to challenge today’s dominant insulator suppliers, who have often had to adapt to fend off new competition. 

“Ajinomoto has been a 90-plus-percent-market-share material for more than two decades,” says Sundaram. “This is unheard-of in most businesses, and you can imagine they didn’t get there by not changing.”

One big challenge is that the dominant manufacturers have decades-long relationships with chip designers like Nvidia or Advanced Micro Devices, and with manufacturers like TSMC. Asking these players to swap out materials is a big deal.

“The semiconductor industry is very conservative,” says Larry Zhao, a semiconductor researcher who has worked in the dielectrics industry for more than 25 years. “They like to use the vendors they already know very well, where they know the quality.” 

Another obstacle facing Thintronics is technical: insulating materials, like other chip components, are held to manufacturing standards so precise they are difficult to comprehend. The layers where Ajinomoto dominates are thinner than a human hair. The material must also be able to accept tiny holes, which house wires running vertically through the film. Every new iteration is a massive R&D effort in which incumbent companies have the upper hand given their years of experience, says Sundaram.

If all this is completed successfully in a lab, yet another hurdle lies ahead: the material has to retain those properties in a high-volume manufacturing facility, which is where Sundaram has seen past efforts fail.

“I have advised several material suppliers over the years that tried to break into [Ajinomoto’s] business and couldn’t succeed,” he says. “They all ended up having the problem of not being as easy to use in a high-volume production line.” 

Despite all these challenges, one thing may be working in Thintronics’ favor: US-based tech giants like Microsoft and Meta are making headway in designing their own chips for the first time. The plan is to use these chips for in-house AI training as well as for the cloud computing capacity that they rent out to customers, both of which would reduce the industry’s reliance on Nvidia. 

Though Microsoft, Google, and Meta declined to comment on whether they are pursuing advancements in materials like insulators, Sundaram says these firms could be more willing to work with new US startups rather than defaulting to the old ways of making chips: “They have a lot more of an open mind about supply chains than the existing big guys.”

This story was updated on April 12 to clarify how the Dk values of Thintronics’ materials compare to those of other build-up films.

How ASML took over the chipmaking chessboard

On a drab Monday morning in San Jose, California, at the drab San Jose Convention Center, attendees of the SPIE Advanced Lithography and Patterning Conference filed into the main ballroom until all the seats were taken and the crowd began to line the walls along the back and sides of the room. The convention brings together people who work in the chip industry from all over the world. And on this cool February morning, they had gathered to hear tech industry luminaries extol the late Gordon Moore, Intel’s cofounder and first CEO. 

Craig Barrett, also a former CEO of Intel, paid tribute, as did the legendary engineer Burn-Jeng Lin, a pioneer of immersion lithography, a patterning technology that enabled the chip industry to continue moving forward about 20 years ago. Mostly the speeches tended toward reflections on Moore himself—testaments to his genius, accomplishments, and humanity. But the last speaker of the morning, Martin van den Brink, took a different tone, more akin to a victory lap than a eulogy. Van den Brink is the outgoing co-president and CTO of ASML, the Dutch company that makes the machines that in turn let manufacturers produce the most advanced computer chips in the world. 

Moore’s Law holds that the number of transistors on an integrated circuit doubles every two years or so. In essence, it means that chipmakers are always trying to shrink the transistors on a microchip in order to pack more of them in. The cadence has been increasingly hard to maintain now that transistor dimensions are measured in just a few nanometers. In recent years ASML’s machines have kept Moore’s Law from sputtering out. Today, they are the only ones in the world capable of producing circuitry at the density needed to keep chipmakers roughly on track. It is the premise of Moore’s Law itself, van den Brink said, that drives the industry forward, year after year. 

To showcase how big an achievement it had been to maintain Moore’s Law since he joined ASML in 1984, van den Brink referred to the rice and chessboard problem, in which the number of grains of rice—a proxy for transistors—is doubled on each successive square. The exponential growth in the number of transistors that can be crammed on a chip since 1959 means that a single grain of rice back then has now become the equivalent of three ocean tankers, each 240 meters long, full of rice. It’s a lot of rice! Yet Moore’s Law compels the company—compels all of the technology industry—to keep pushing forward. Each era of computing, most recently AI, has brought increased demands, explained van den Brink. In other words, while three tankers full of rice may seem like a lot, tomorrow we’re going to need six. Then 12. Then 24. And so on. 
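
The arithmetic behind the analogy is easy to check for yourself: one grain on the first square, doubled on each of the chessboard’s 64 squares, ends in astronomically large numbers. The short sketch below simply reproduces that classic calculation; it is not taken from van den Brink’s talk.

```python
# The classic rice-and-chessboard arithmetic van den Brink invoked:
# one grain on the first square, doubling on every successive square.
grains_on_last_square = 2 ** 63          # 64th square
total_grains = 2 ** 64 - 1               # sum over all 64 squares

print(f"last square: {grains_on_last_square:,} grains")
print(f"whole board: {total_grains:,} grains")
```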

ASML’s technology, he assured the gathering, would be there to meet the demands, thanks to the company’s investment in creating tools capable of making ever finer features: the extreme-ultraviolet (EUV) lithography machines it rolled out widely in 2017, the high-numerical-aperture (high-NA) EUV machines it is rolling out now, and the hyper-NA EUV machines it has sketched out for the future. 

The tribute may have been designed for Gordon Moore, but at the end of van den Brink’s presentation the entire room rose to give him a standing ovation. Because if Gordon Moore deserves credit for creating the law that drove the progress of the industry, as van den Brink says, van den Brink and ASML deserve much of the credit for ensuring that progress remains possible. 

Yet that also means the pressure is on. ASML has to try to stay ahead of the demands of Moore’s Law. It has to continue making sure chipmakers can keep doubling the amount of rice on the chessboard. Will that be possible? Van den Brink sat down with MIT Technology Review to talk about ASML’s history, its legacy, and what comes next. 

Betting big on an unwieldy wavelength

ASML is such an undisputed leader in today’s chip ecosystem that it’s hard to believe the company’s market dominance really only dates back to 2017, when its EUV machine, after 17 years of development, upended the conventional process for making chips. 

Since the 1960s, photolithography has made it possible to pack computer chips with more and more components. The process involves crafting small circuits by guiding beams of light through a series of mirrors and lenses and then shining that light on a mask, which contains a pattern. Light conveys the chip design, layer by layer, eventually building circuits that form the computational building blocks of everything from smartphones to artificial intelligence. 

Martin van den Brink (Image: ASML)

Photolithographers have a limited set of tools at their disposal to make smaller designs, and for decades, the type of light used in the machine was the most critical. In the 1960s, machines used beams of visible light. The smallest features this light could draw on the chip were fairly large—a bit like using a marker to draw a portrait. 

Then manufacturers began using smaller and smaller wavelengths of light, and by the early 1980s, they could make chips with ultraviolet light. Nikon and Canon were the industry leaders. ASML, founded in 1984 as a subsidiary of Philips in Eindhoven, the Netherlands, was just a small player.

The way van den Brink tells it, he arrived at the company almost by accident. Philips was one of a few technology companies in Holland. When he began his career there in 1984 and was looking into the various opportunities at the company, he became intrigued by a photo of a lithography machine.

“I looked at the picture and I said, ‘It has mechanics, it has optics, it has software—this looks like a complex machine. I will be interested in that,’” van den Brink told MIT Technology Review. “They said, well, you can do it, but the company will not be part of Philips. We are creating a joint venture with ASM International, and after the joint venture, you will not be part of Philips. I said yes because I couldn’t care less. And that’s how it began.”

When van den Brink joined in the 1980s, little about ASML made the company stand out from other major lithography players at the time. “We didn’t sell a substantial amount of systems until the ’90s. And we almost went bankrupt several times in that period,” van den Brink says. “So for us there was only one mission: to survive and show a customer that we could make a difference.”

By 1995, it had a strong enough foothold in the industry against competitors Nikon and Canon to go public. But all lithography makers were fighting the same battle to create smaller components on chips. 

If you could have eavesdropped on a meeting at ASML in the late 1990s about this predicament, you might have heard chatter about an idea called extreme-ultraviolet (EUV) lithography—along with concerns that it might never work. By that point, with pressure to condense chips beyond current capabilities, it seemed as if everyone was chasing EUV. The idea was to pattern chips with an even smaller wavelength of light (ultimately just 13.5 nanometers). To do so, ASML would have to figure out how to create, capture, and focus this light—processes that had stumped researchers for decades—build a supply chain of specialized materials, including the smoothest mirrors ever produced, and make sure the price point wouldn’t drive away its customers. 

Canon and Nikon were also pursuing EUV, but the US government denied them a license to participate in the consortium of companies and US national labs researching it. Both subsequently dropped out. Meanwhile ASML acquired the fourth major company pursuing EUV, SVG, in 2001. By 2006 it had shipped only two EUV prototype machines to research facilities, and it took until 2010 to ship one to a customer. Five years later, ASML warned in its annual report that EUV sales remained low, that customers weren’t eager to adopt the technology given its slow speed on the production line, and that if the pattern continued, it could have “material” effects on the business given the significant investment. 

Yet in 2017, after an investment of $6.5 billion in R&D over 17 years, ASML’s bet began to pay off. That year the company shipped 10 of its EUV machines, which cost over $100 million each, and announced that dozens more were on backorder. EUV machines went to the titans of semiconductor manufacturing—Intel, Samsung, and Taiwan Semiconductor Manufacturing Company (TSMC)—and a small number of others. With a brighter light source (meaning less time needed to impart patterns), among other improvements, the machines were capable of faster production speeds. The leap to EUV finally made economic sense to chipmakers, putting ASML essentially in a monopoly position.

Chris Miller, a history professor at Tufts University and author of Chip War: The Fight for the World’s Most Critical Technology, says that ASML was culturally equipped to see those experiments through. “It’s a stubborn willingness to invest in technology that most people thought wouldn’t work,” he told MIT Technology Review. “No one else was betting on EUV, because the development process was so long and expensive. It involves stretching the limits of physics, engineering, and chemistry.”

A key factor in ASML’s growth was its control of the supply chain. ASML acquired a number of the companies it relies on, like Cymer, a maker of light sources. That strategy of pointedly controlling power in the supply chain extended to ASML’s customers, too. In 2012, it offered shares to its three biggest customers, which were able to maintain market dominance of their own in part because of the elite manufacturing power of ASML’s machines. 

“Our success depends on their success,” van den Brink told MIT Technology Review. 

It’s also a testament to ASML’s dominance that it is for the most part no longer allowed to sell its most advanced systems to customers in China. Though ASML still does business in China, in 2019, following pressure from the Trump administration, the Dutch government began imposing restrictions on ASML’s exports of EUV machines to China. Those rules were tightened further just last year and now also impose limits on some of the company’s deep-ultraviolet (DUV) machines, which are used to make less highly advanced chips than EUV systems.

Van den Brink says the way world leaders are now discussing lithography was unimaginable when the company began: “Our prime minister was sitting in front of Xi Jinping, not because he was from Holland—who would give a shit about Holland. He was there because we are making EUV.”

Just a few years after the first EUV machines shipped, ASML would face its second upheaval. Around the start of the pandemic, interest and progress in the field of artificial intelligence sent demand for computing power skyrocketing. Companies like OpenAI needed ever more powerful computer chips, and by late 2022 the frenzy of investment in AI began to boil over. 

By that time, ASML was closing in on its newest innovation. Having already adopted a smaller wavelength of light (and realigned the entire semiconductor industry to it in the process), it now turned its attention to the other lever in its control: numerical aperture. That’s the measure of how much light a system can focus, and if ASML could increase it, the company’s machines could print even smaller components.
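
A standard rule of thumb in lithography (not a figure cited in the piece) links the smallest printable feature to wavelength and numerical aperture: CD ≈ k1 × λ / NA, where k1 is a process-dependent factor. The sketch below plugs in EUV’s 13.5-nanometer wavelength and an assumed k1 of 0.33 to show roughly how moving from today’s 0.33-NA optics to high-NA, and to a hypothetical hyper-NA system, shrinks the minimum feature.

```python
# A standard lithography scaling rule (not specific to ASML's machines):
# the smallest printable feature scales as CD ~= k1 * wavelength / NA.
# The k1 process factor below is an assumed, typical value.
WAVELENGTH_NM = 13.5   # EUV wavelength
K1 = 0.33              # assumed process factor

for label, na in [("low-NA EUV", 0.33), ("high-NA EUV", 0.55), ("hyper-NA EUV", 0.75)]:
    cd = K1 * WAVELENGTH_NM / na
    print(f"{label} (NA={na}): minimum feature ~{cd:.1f} nm")
```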

Doing so meant myriad changes. ASML had to source an even larger set of mirrors, which had to be made ultra-smooth, from its supplier Carl Zeiss. Zeiss had to build entirely new machines whose sole purpose was to measure the smoothness of mirrors destined for ASML. The aim was to reduce the number of costly repercussions the change would have on the rest of the supply chain, like the companies that make reticles containing the designs of the chips. 

In December of 2023, ASML began shipping the first of its next-generation EUV devices, a high-NA machine, to Intel’s facility in Hillsboro, Oregon. It’s an R&D version, and so far the only one in the field. It took seven planes and 50 trucks to get it to Intel’s plant, and installation of the machine, which is larger than a double-decker bus, will take six months. 

The high-NA machines will only be needed to produce the most precise layers of advanced chips for the industry; the designs on many others will still be printed using the previous generation of EUV machines or older DUV machines. 

ASML has received orders for high-NA machines from all its current EUV customers. They don’t come cheap: reports put the cost at $380 million. Intel was the first customer to strike, ordering the first machine available in early 2022. The company, which has lost significant market share to competitor TSMC, is betting that the new technology will give it a new foothold in the industry, even though other chipmakers will eventually have access to it too. 

“There are obvious benefits to Intel for being the first,” Miller says. “There are also obvious risks.” Sorting out which chips to use these machines for and how to get its money’s worth out of them will be a challenge for the company, according to Miller. 

The launch of these machines, if successful, might be seen as the crowning achievement of van den Brink’s career. But he is already moving on to what comes next.

The future

The next big idea for ASML, according to van den Brink and other company executives who spoke with MIT Technology Review, is hyper-NA technology. The company’s high-NA machines have a numerical aperture of 0.55. Hyper-NA tools would have a numerical aperture higher than 0.7. What that ultimately means is that hyper NA, if successful, will allow the company to create machines that let manufacturers shrink transistor dimensions even more—assuming that researchers can devise chip components that work well at such small dimensions. As it was with EUV in the early 2000s, it is still uncertain whether hyper NA is feasible—if nothing else, it could be cost prohibitive. Yet van den Brink projects cautious confidence. It is likely, he says, that the company will ultimately have three offerings available: low NA, high NA, and—if all goes well—hyper NA. 

“Hyper NA is a bit more risky,” says van den Brink. “We will be more cautious and more cost sensitive in the future. But if we can pull this off, we have a winning trio which takes care of all the advanced manufacturing for the foreseeable future.”

Yet although today everyone is banking on ASML to keep pushing the industry forward, there is speculation that a competitor could emerge from China. Van den Brink was dismissive of this possibility, citing the gap in even last-generation lithography. 

“SMEE are making DUV machines, or at least claim they can,” he told MIT Technology Review, referring to a company that makes the predecessor to EUV lithography technology, and pointed out that ASML still has the dominant market share. The political pressures could mean more progress for China. But getting to the level of complexity involved in ASML’s suite of machines, with low, high, and hyper NA, is another matter, he says: “I feel quite comfortable that this will be a long time before they can copy that.”

Miller, from Tufts University, is confident that Chinese companies will eventually develop these sorts of technologies on their own, but agrees that the question is when. “If it’s in a decade, it will be too late,” he says. 

The real question, perhaps, is not who will make the machines, but whether Moore’s Law will hold at all. Nvidia CEO Jensen Huang has already declared it dead. But when asked what he thought might eventually cause Moore’s Law to finally stall out, van den Brink rejected the premise entirely. 

“There’s no reason to believe this will stop. You won’t get the answer from me where it will end,” he said. “It will end when we’re running out of ideas where the value we create with all this will not balance with the cost it will take. Then it will end. And not by the lack of ideas.”

He had struck a similar posture during his Moore tribute at the SPIE conference, exuding confidence. “I’m not sure who will give the presentation 10 years from now,” he said, going back to his rice analogy. “But my successors,” he claimed, “will still have the opportunity to fill the chessboard.”

This story was updated to clarify information about ASML’s operations in China.

Apple researchers explore dropping “Siri” phrase & listening with AI instead

Researchers from Apple are probing whether it’s possible to use artificial intelligence to detect when a user is speaking to a device like an iPhone, thereby eliminating the technical need for a trigger phrase like “Siri,” according to a paper published on Friday.

In a study, which was uploaded to arXiv and has not been peer-reviewed, researchers trained a large language model using both speech captured by smartphones and acoustic data from background noise to look for patterns that could indicate when a user wants help from the device. The model was built in part with a version of OpenAI’s GPT-2, “since it is relatively lightweight and can potentially run on devices such as smartphones,” the researchers wrote. The paper describes over 129 hours of data and additional text data used to train the model but does not specify the source of the recordings that went into the training set. Six of the seven authors list their affiliation as Apple, and three of them work on the company’s Siri team, according to their LinkedIn profiles. (The seventh author did work related to the paper during an Apple internship.)
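
Apple has not released code for this work, but the general idea—fusing acoustic features with recognized text and letting a compact language-model-style network decide whether an utterance was meant for the device—can be sketched in a few lines of PyTorch. Everything below, from the layer sizes to the fusion scheme, is an assumption for illustration rather than a detail from the paper.

```python
# Minimal sketch of the general idea described above: combine acoustic
# features with recognized text and let a small language-model-style
# backbone decide whether the speech was directed at the device.
# All layer sizes and the fusion scheme are assumptions, not details
# taken from Apple's paper.
import torch
import torch.nn as nn

class TriggerFreeDetector(nn.Module):
    def __init__(self, vocab_size=50_257, acoustic_dim=80, d_model=256):
        super().__init__()
        self.acoustic_encoder = nn.GRU(acoustic_dim, d_model, batch_first=True)
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(d_model, 1)  # device-directed vs. background

    def forward(self, acoustic_frames, token_ids):
        # acoustic_frames: (batch, frames, acoustic_dim), e.g. log-mel features
        # token_ids: (batch, tokens) from an upstream speech recognizer
        audio, _ = self.acoustic_encoder(acoustic_frames)
        text = self.text_embedding(token_ids)
        fused = torch.cat([audio, text], dim=1)      # concatenate along sequence
        hidden = self.backbone(fused)
        return self.classifier(hidden.mean(dim=1))   # one logit per utterance

# Example with random inputs: one utterance, 200 audio frames, 12 text tokens.
model = TriggerFreeDetector()
logit = model(torch.randn(1, 200, 80), torch.randint(0, 50_257, (1, 12)))
print(torch.sigmoid(logit))  # probability the utterance was meant for the device
```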

The results were promising, according to the paper. The model was able to make more accurate predictions than audio-only or text-only models, and it improved further as the models grew larger. Beyond the research question it explores, though, the paper gives no indication of whether Apple actually plans to eliminate the “Hey Siri” trigger phrase.

Neither Apple nor the paper’s researchers immediately returned requests for comment.

Currently, Siri functions by holding small amounts of audio and does not begin recording or preparing to answer user prompts until it hears the trigger phrase. Eliminating that “Hey Siri” prompt could increase concerns about our devices “always listening,” said Jen King, a privacy and data policy fellow at the Stanford Institute for Human-Centered Artificial Intelligence. 

The way Apple handles audio data has previously come under scrutiny from privacy advocates. In 2019, reporting from The Guardian revealed that Apple’s quality control contractors regularly heard private audio collected from iPhones while they worked with Siri data, including sensitive conversations between doctors and patients. Two years later, Apple responded with policy changes, including storing more data on devices and allowing users to opt out of having their recordings used to improve Siri. A class action suit brought against the company in California in 2021 alleged that Siri was being turned on even when users had not activated it.  

The “Hey Siri” prompt can serve an important purpose for users, according to King. The phrase provides a way to know when the device is listening, and getting rid of it might mean more convenience but less transparency from the device, King told MIT Technology Review. The research did not detail whether the trigger phrase would be replaced by any other signal that the AI assistant is engaged. 

“I’m skeptical that a company should mandate that form of interaction,” King says.

The paper is one of a number of recent signals that Apple, which is perceived to be lagging behind other tech giants like Amazon, Google, and Facebook in the artificial intelligence race, is planning to incorporate more AI into its products. According to news first reported by VentureBeat, Apple is building a generative AI model called MM1 that works with text and images, which would be the company’s answer to OpenAI’s ChatGPT and a host of other chatbots from leading tech giants. Meanwhile, Bloomberg reported that Apple is in talks with Google about using the company’s AI model Gemini in iPhones, and on Friday the Wall Street Journal reported that it had engaged in talks with Baidu about using that company’s AI products.

This self-driving startup is using generative AI to predict traffic

15 March 2024 at 11:00

Self-driving company Waabi is using a generative AI model to help predict the movement of vehicles, it announced today.

The new system, called Copilot4D, was trained on troves of data from lidar sensors, which use light to sense how far away objects are. If you prompt the model with a situation, like a driver recklessly merging onto a highway at high speed, it predicts how the surrounding vehicles will move, then generates a lidar representation of 5 to 10 seconds into the future (showing a pileup, perhaps). Today’s announcement is about the initial version of Copilot4D, but Waabi CEO Raquel Urtasun says a more advanced and interpretable version is deployed in Waabi’s testing fleet of autonomous trucks in Texas that helps the driving software decide how to react. 

While autonomous driving has long relied on machine learning to plan routes and detect objects, some companies and researchers are now betting that generative AI — models that take in data of their surroundings and generate predictions — will help bring autonomy to the next stage. Wayve, a Waabi competitor, released a comparable model last year that is trained on the video that its vehicles collect. 

Waabi’s model works in a similar way to image or video generators like OpenAI’s DALL-E and Sora. It takes point clouds of lidar data, which visualize a 3D map of the car’s surroundings, and breaks them into chunks, similar to how image generators break photos into pixels. Based on its training data, Copilot4D then predicts how all points of lidar data will move. Doing this continuously allows it to generate predictions 5-10 seconds into the future.
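
Waabi has not released Copilot4D’s code, but the recipe described above—discretize each lidar sweep into tokens, then predict the tokens of the next sweep the way a language model predicts the next word—can be sketched schematically. The grid size, vocabulary, and model below are placeholder choices for illustration, not Waabi’s.

```python
# Schematic sketch of the recipe described above, not Waabi's actual code:
# discretize each lidar sweep into occupancy tokens over a coarse grid, then
# treat a sequence of past sweeps as context and predict the next sweep's
# tokens, the way a language model predicts the next word. Grid size,
# vocabulary, and model are all assumptions for illustration.
import torch
import torch.nn as nn

GRID = 32  # assumed 32 x 32 bird's-eye-view grid per sweep

def tokenize_sweep(points: torch.Tensor) -> torch.Tensor:
    """Map (N, 3) lidar points to a flat sequence of 0/1 occupancy tokens."""
    xy = ((points[:, :2] + 50.0) / 100.0 * GRID).long().clamp(0, GRID - 1)
    grid = torch.zeros(GRID, GRID, dtype=torch.long)
    grid[xy[:, 0], xy[:, 1]] = 1
    return grid.flatten()  # (GRID*GRID,) sequence of tokens

class NextSweepPredictor(nn.Module):
    def __init__(self, vocab=2, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, token_history):
        # token_history: (batch, past_sweeps * GRID * GRID)
        h = self.backbone(self.embed(token_history))
        return self.head(h[:, -GRID * GRID:])  # logits for the next sweep's tokens

# Two past sweeps of random points in a 100 m x 100 m area around the car.
history = torch.cat([tokenize_sweep(torch.rand(5_000, 3) * 100 - 50) for _ in range(2)])
logits = NextSweepPredictor()(history.unsqueeze(0))
print(logits.shape)  # (1, GRID*GRID, 2): predicted occupancy for the next sweep
```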

A diptych view of the same image via camera and LiDAR.

Waabi is one of a handful of autonomous driving companies, including competitors Wayve and Ghost, that describe their approach as “AI-first.” To Urtasun, that means designing a system that learns from data, rather than one that must be taught reactions to specific situations. The cohort is betting that its methods might require fewer hours of road-testing self-driving cars, a charged topic following an October 2023 accident in which a Cruise robotaxi dragged a pedestrian in San Francisco. 

Waabi is different from its competitors in building a generative model for lidar, rather than cameras. 

“If you want to be a Level 4 player, lidar is a must,” says Urtasun, referring to the automation level where the car does not require the attention of a human to drive safely. Cameras do a good job of showing what the car is seeing, but they’re not as adept at measuring distances or understanding the geometry of the car’s surroundings, she says.

Though Waabi’s model can generate videos showing what a car will see through its lidar sensors, those videos will not be used as training data in the driving simulator the company uses to build and test its driving model. That’s to ensure any hallucinations arising from Copilot4D do not make their way into the simulator.

The underlying technology is not new, says Bernard Adam Lange, a PhD student at Stanford who has built and researched similar models, but it’s the first time he’s seen a generative lidar model leave the confines of a research lab and be scaled up for commercial use. A model like this would generally help make the “brain” of any autonomous vehicle able to reason more quickly and accurately, he says.

“It is the scale that is transformative,” he says. “The hope is that these models can be utilized in downstream tasks” like detecting objects and predicting where people or things might move next.

Copilot4D can only estimate so far into the future, and motion prediction models in general degrade the farther they’re asked to project forward. Urtasun says that the model only needs to imagine what happens 5 to 10 seconds ahead for the majority of driving decisions, though the benchmark tests highlighted by Waabi are based on 3-second predictions. Chris Gerdes, co-director of Stanford’s Center for Automotive Research, says this metric will be key in determining how useful the model is at making decisions.

“If the 5-second predictions are solid but the 10-second predictions are just barely usable, there are a number of situations where this would not be sufficient on the road,” he says.

The new model resurfaces a question rippling through the world of generative AI: whether or not to make models open-source. Releasing Copilot4D would let academic researchers, who struggle with access to large data sets, peek under the hood at how it’s made, independently evaluate safety, and potentially advance the field. It would also do the same for Waabi’s competitors. Waabi has published a paper detailing the creation of the model but has not released the code, and Urtasun is unsure if they will. 

“We want academia to also have a say in the future of self-driving,” she says, adding that open-source models are more trusted. “But we also need to be a bit careful as we develop our technology so that we don’t unveil everything to our competitors.”

LLMs become more covertly racist with human intervention

11 March 2024 at 14:35

Since their inception, it’s been clear that large language models like ChatGPT absorb racist views from the millions of pages of the internet they are trained on. Developers have responded by trying to make them less toxic. But new research suggests that those efforts, especially as models get larger, are only curbing racist views that are overt, while letting more covert stereotypes grow stronger and better hidden.

Researchers asked five AI models—including OpenAI’s GPT-4 and older models from Facebook and Google—to make judgments about speakers who used African-American English (AAE). The race of the speaker was not mentioned in the instructions.

Even when an AAE sentence and a Standard American English (SAE) sentence had the same meaning, the models were more likely to apply adjectives like “dirty,” “lazy,” and “stupid” to the AAE speaker. The models associated speakers of AAE with less prestigious jobs (or didn’t associate them with having a job at all), and when asked to pass judgment on a hypothetical criminal defendant, they were more likely to recommend the death penalty. 
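
The general measurement is straightforward to sketch with an off-the-shelf model: score how readily a language model continues a sentence about a speaker with a given adjective, then compare paired sentences that differ only in dialect. The sketch below uses GPT-2 purely for illustration; the prompts and adjectives are stand-ins, not the ones used in the study.

```python
# A minimal sketch (using GPT-2 for illustration, not the paper's exact setup)
# of how such associations can be measured: score how readily a model
# continues a sentence about the speaker with a given adjective, and compare
# paired AAE and SAE sentences that mean the same thing. The prompts and
# adjective list below are illustrative, not taken from the study.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Log-probability the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

pairs = {  # same meaning, different dialect (illustrative examples)
    "aae": "A person who says 'I be so happy when I wake up' is",
    "sae": "A person who says 'I am so happy when I wake up' is",
}
for adjective in [" lazy", " brilliant"]:
    gap = continuation_logprob(pairs["aae"], adjective) - continuation_logprob(pairs["sae"], adjective)
    print(f"{adjective.strip():>9}: AAE minus SAE log-prob = {gap:+.2f}")
```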

An even more notable finding may be a flaw the study pinpoints in the ways that researchers try to solve such biases. 

To purge models of hateful views, companies like OpenAI, Meta, and Google use feedback training, in which human workers manually adjust the way the model responds to certain prompts. This process, often called “alignment,” aims to recalibrate the millions of connections in the neural network and get the model to conform better with desired values. 

The method works well to combat overt stereotypes, and leading companies have employed it for nearly a decade. If users prompted GPT-2, for example, to name stereotypes about Black people, it was likely to list “suspicious,” “radical,” and “aggressive,” but GPT-4 no longer responds with those associations, according to the paper.

However, the method fails on the covert stereotypes that researchers elicited when using African-American English in their study, which was published on arXiv and has not been peer-reviewed. That’s partially because companies have been less aware of dialect prejudice as an issue, they say. It’s also easier to coach a model not to respond to overtly racist questions than it is to coach it not to respond negatively to an entire dialect.

“Feedback training teaches models to consider their racism,” says Valentin Hofmann, a researcher at the Allen Institute for AI and a coauthor on the paper. “But dialect prejudice opens a deeper level.”

Avijit Ghosh, an ethics researcher at Hugging Face who was not involved in the research, says the finding calls into question the approach companies are taking to solve bias.

“This alignment—where the model refuses to spew racist outputs—is nothing but a flimsy filter that can be easily broken,” he says. 

The covert stereotypes also strengthened as the size of the models increased, researchers found. That finding offers a potential warning to chatbot makers like OpenAI, Meta, and Google as they race to release larger and larger models. Models generally get more powerful and expressive as the amount of their training data and the number of their parameters increase, but if this worsens covert racial bias, companies will need to develop better tools to fight it. It’s not yet clear whether adding more AAE to training data or making feedback efforts more robust will be enough.

“This is revealing the extent to which companies are playing whack-a-mole—just trying to hit the next bias that the most recent reporter or paper covered,” says Pratyusha Ria Kalluri, a PhD candidate at Stanford and a coauthor on the study. “Covert biases really challenge that as a reasonable approach.”

The paper’s authors use particularly extreme examples to illustrate the potential implications of racial bias, like asking AI to decide whether a defendant should be sentenced to death. But, Ghosh notes, the questionable use of AI models to help make critical decisions is not science fiction. It happens today. 

AI-driven translation tools are used when evaluating asylum cases in the US, and crime prediction software has been used to judge whether teens should be granted probation. Employers who use ChatGPT to screen applications might be discriminating against candidate names on the basis of race and gender, and if they use models to analyze what an applicant writes on social media, a bias against AAE could lead to misjudgments. 

“The authors are humble in claiming that their use cases of making the LLM pick candidates or judge criminal cases are constructed exercises,” Ghosh says. “But I would claim that their fear is spot on.”

An OpenAI spinoff has built an AI model that helps robots learn tasks like humans

11 March 2024 at 09:00

In the summer of 2021, OpenAI quietly shuttered its robotics team, announcing that progress was being stifled by a lack of data necessary to train robots in how to move and reason using artificial intelligence. 

Now three of OpenAI’s early research scientists say the startup they spun off in 2017, called Covariant, has solved that problem and unveiled a system that combines the reasoning skills of large language models with the physical dexterity of an advanced robot.

The new model, called RFM-1, was trained on years of data collected from Covariant’s small fleet of item-picking robots that customers like Crate & Barrel and Bonprix use in warehouses around the world, as well as words and videos from the internet. In the coming months, the model will be released to Covariant customers. The company hopes the system will become more capable and efficient as it’s deployed in the real world. 

So what can it do? In a demonstration I attended last week, Covariant cofounders Peter Chen and Pieter Abbeel showed me how users can prompt the model using five different types of input: text, images, video, robot instructions, and measurements. 

For example, show it an image of a bin filled with sports equipment, and tell it to pick up the pack of tennis balls. The robot can then grab the item, generate an image of what the bin will look like after the tennis balls are gone, or create a video showing a bird’s-eye view of how the robot will look doing the task. 

If the model predicts it won’t be able to properly grasp the item, it might even type back, “I can’t get a good grip. Do you have any tips?” A response could advise it to use a specific number of the suction cups on its arms to give it a better grasp—eight versus six, for example. 

This represents a leap forward, Chen told me, in robots that can adapt to their environment using training data rather than the complex, task-specific code that powered the previous generation of industrial robots. It’s also a step toward worksites where managers can issue instructions in human language without concern for the limitations of human labor. (“Pack 600 meal-prep kits for red pepper pasta using the following recipe. Take no breaks!”)

Lerrel Pinto, a researcher who runs the general-purpose robotics and AI lab at New York University and has no ties to Covariant, says that even though roboticists have built basic multimodal robots before and used them in lab settings, deploying one at scale that’s able to communicate in this many modes marks an impressive feat for the company. 

To outpace its competitors, Covariant will have to get its hands on enough data for the robot to become useful in the wild, Pinto told me. Warehouse floors and loading docks are where it will be put to the test, constantly interacting with new instructions, people, objects, and environments. 

“The groups which are going to train good models are going to be the ones that have either access to already large amounts of robot data or capabilities to generate those data,” he says.

Covariant says the model has a “human-like” ability to reason, but it has its limitations. During the demonstration, in which I could see a live feed of a Covariant robot as well as a chat window to communicate with it, Chen invited me to prompt the model with anything I wanted. When I asked the robot to “return the banana to Tote Two,” it struggled with retracing its steps, leading it to pick up a sponge, then an apple, then a host of other items before it finally accomplished the banana task. 

“It doesn’t understand the new concept,” Chen said by way of explanation, “but it’s a good example—it might not work well yet in the places where you don’t have good training data.”

The company’s new model embodies a paradigm shift rippling through the robotics world. Rather than teaching a robot how the world works manually, through instructions like physics equations and code, researchers are teaching it in the same way humans learn: through millions of observations. 

The result “really can act as a very effective flexible brain to solve arbitrary robot tasks,” Chen said. 

The playing field of companies using AI to power more nimble robotic systems is likely to grow crowded this year. Earlier this month, the humanoid-robotics startup Figure AI announced it would be partnering with OpenAI and raised $675 million from tech giants like Nvidia and Microsoft. Marc Raibert, the founder of Boston Dynamics, recently started an initiative to better integrate AI into robotics.  

This means that advancements in machine learning will likely start translating into advancements in robotics. However, some issues remain unresolved. If large language models continue to be trained on millions of words without compensating the authors of those words, robotics models may likewise be trained on videos without their creators being paid. And if language models hallucinate and perpetuate biases, what equivalents will surface in robotics?

In the meantime, Covariant will push forward, keen on having RFM-1 continually learn and refine its skills. Eventually, the researchers aim to have the robot train on videos that the model itself creates—the type of meta-learning that not only makes my head spin but also sparks concern about what will happen if errors made by the model compound themselves. But with such a hunger for more training data, researchers see it almost as inevitable.

“Training on that will be a reality,” Abbeel says. “If we talk again a half year from now, that’s what we’ll be talking about.”
