MIT Technology Review

Chatbot answers are all made up. This new tool helps you figure out which ones to trust.

25 April 2024 at 08:59

Large language models are famous for their ability to make things up—in fact, it’s what they’re best at. But their inability to tell fact from fiction has left many businesses wondering if using them is worth the risk.

A new tool created by Cleanlab, an AI startup spun out of a quantum computing lab at MIT, is designed to give high-stakes users a clearer sense of how trustworthy these models really are. Called the Trustworthy Language Model, it gives any output generated by a large language model a score between 0 and 1, according to its reliability. This lets people choose which responses to trust and which to throw out. In other words: a BS-o-meter for chatbots.

Cleanlab hopes that its tool will make large language models more attractive to businesses worried about how much stuff they invent. “I think people know LLMs will change the world, but they’ve just got hung up on the damn hallucinations,” says Cleanlab CEO Curtis Northcutt.

Chatbots are quickly becoming the dominant way people look up information on a computer. Search engines are being redesigned around the technology. Office software used by billions of people every day to create everything from school assignments to marketing copy to financial reports now comes with chatbots built in. And yet a study put out in November by Vectara, a startup founded by former Google employees, found that chatbots invent information at least 3% of the time. It might not sound like much, but it’s a potential for error most businesses won’t stomach.

Cleanlab’s tool is already being used by a handful of companies, including Berkeley Research Group, a UK-based consultancy specializing in corporate disputes and investigations. Steven Gawthorpe, associate director at Berkeley Research Group, says the Trustworthy Language Model is the first viable solution to the hallucination problem that he has seen: “Cleanlab’s TLM gives us the power of thousands of data scientists.”

In 2021, Cleanlab developed technology that discovered errors in 10 popular data sets used to train machine-learning algorithms; it works by measuring the differences in output across a range of models trained on that data. That tech is now used by several large companies, including Google, Tesla, and the banking giant Chase. The Trustworthy Language Model takes the same basic idea—that disagreements between models can be used to measure the trustworthiness of the overall system—and applies it to chatbots.

In a demo Cleanlab gave to MIT Technology Review last week, Northcutt typed a simple question into ChatGPT: “How many times does the letter ‘n’ appear in ‘enter’?” ChatGPT answered: “The letter ‘n’ appears once in the word ‘enter.’” That correct answer promotes trust. But ask the question a few more times and ChatGPT answers: “The letter ‘n’ appears twice in the word ‘enter.’”

“Not only does it often get it wrong, but it’s also random, you never know what it’s going to output,” says Northcutt. “Why the hell can’t it just tell you that it outputs different answers all the time?”

Cleanlab’s aim is to make that randomness more explicit. Northcutt asks the Trustworthy Language Model the same question. “The letter ‘n’ appears once in the word ‘enter,’” it says—and scores its answer 0.63. Six out of 10 is not a great score, suggesting that the chatbot’s answer to this question should not be trusted.

It’s a basic example, but it makes the point. Without the score, you might think the chatbot knew what it was talking about, says Northcutt. The problem is that data scientists testing large language models in high-risk situations could be misled by a few correct answers and assume that future answers will be correct too: “They try things out, they try a few examples, and they think this works. And then they do things that result in really bad business decisions.”

The Trustworthy Language Model draws on multiple techniques to calculate its scores. First, each query submitted to the tool is sent to one or more large language models. The tech will work with any model, says Northcutt, including closed-source models like OpenAI’s GPT series, the models behind ChatGPT, and open-source models like DBRX, developed by San Francisco-based AI firm Databricks. If the responses from each of these models are the same or similar, it will contribute to a higher score.

At the same time, the Trustworthy Language Model also sends variations of the original query to each of the models, swapping in words that have the same meaning. Again, if the responses to synonymous queries are similar, it will contribute to a higher score. “We mess with them in different ways to get different outputs and see if they agree,” says Northcutt.

The tool can also get multiple models to bounce responses off one another: “It’s like, ‘Here’s my answer—what do you think?’ ‘Well, here’s mine—what do you think?’ And you let them talk.” These interactions are monitored and measured and fed into the score as well.
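The agreement idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Cleanlab's method: the function names are hypothetical, the responses are hard-coded rather than fetched from real models, and plain string similarity stands in for whatever semantic comparison a production system would use.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Rough textual similarity in [0, 1]; a stand-in for a semantic comparison."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def agreement_score(responses: list[str]) -> float:
    """Mean pairwise similarity across responses gathered from several
    models and several rewordings of the same question (hypothetical helper)."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to measure agreement")
    pairs = list(combinations(responses, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Illustrative responses only; in practice these would come from model calls.
consistent = ["The letter 'n' appears once in 'enter'."] * 3
mixed = [
    "The letter 'n' appears once in 'enter'.",
    "The letter 'n' appears twice in 'enter'.",
    "There are 3 occurrences.",
]

print(agreement_score(consistent))  # identical answers give 1.0
print(agreement_score(mixed))       # disagreement pulls the score down
```

Identical answers score 1.0 and diverging answers score lower, which mirrors the behavior Northcutt describes: the more the models and rewordings disagree, the less the answer should be trusted.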

Nick McKenna, a computer scientist at Microsoft Research in Cambridge, UK, who works on large language models for code generation, is optimistic that the approach could be useful. But he doubts it will be perfect. “One of the pitfalls we see in model hallucinations is that they can creep in very subtly,” he says.

In a range of tests across different large language models, Cleanlab shows that its trustworthiness scores correlate well with the accuracy of those models’ responses. In other words, scores close to 1 line up with correct responses, and scores close to 0 line up with incorrect ones. In another test, they also found that using the Trustworthy Language Model with GPT-4 produced more reliable responses than using GPT-4 by itself.

Large language models generate text by predicting the most likely next word in a sequence. In future versions of its tool, Cleanlab plans to make its scores even more accurate by drawing on the probabilities that a model used to make those predictions. It also wants to access the numerical values that models assign to each word in their vocabulary, which they use to calculate those probabilities. This level of detail is provided by certain platforms, such as Amazon’s Bedrock, that businesses can use to run large language models.
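The raw numerical values a model assigns to each word are commonly called logits, and the standard way to turn them into probabilities is the softmax function. The sketch below shows that conversion in Python. The three-word vocabulary and the logit values are invented for illustration, and nothing here reflects how Cleanlab plans to use these numbers.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert per-token logits into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


# Hypothetical logits for the next word after "The letter 'n' appears ..."
logits = {"once": 2.0, "twice": 1.5, "never": -1.0}
probs = softmax(logits)

for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{token}: {p:.3f}")
```

When two candidates end up with similar probabilities, as "once" and "twice" do here, the model was close to a coin flip on that word, which is the kind of signal a confidence score could draw on.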

Cleanlab has tested its approach on data provided by Berkeley Research Group. The firm needed to search for references to health-care compliance problems in tens of thousands of corporate documents. Doing this by hand can take skilled staff weeks. By checking the documents using the Trustworthy Language Model, Berkeley Research Group was able to see which documents the chatbot was least confident about and check only those. It reduced the workload by around 80%, says Northcutt.
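The triage step described above, in which people check only the documents the model is least sure about, amounts to sorting by score and applying a cutoff. The Python sketch below assumes each document already has a trustworthiness score; the document names, the scores, and the 0.8 threshold are all made up for illustration.

```python
def split_for_review(scores: dict[str, float], threshold: float = 0.8):
    """Split documents into those needing a human check and those accepted as-is.

    The threshold is an assumption; a real deployment would tune it
    against hand-labeled examples.
    """
    needs_review = sorted(
        (doc for doc, score in scores.items() if score < threshold),
        key=scores.get,  # least trusted first
    )
    auto_accepted = [doc for doc, score in scores.items() if score >= threshold]
    return needs_review, auto_accepted


# Hypothetical scores for five documents.
scores = {
    "doc-001": 0.97,
    "doc-002": 0.41,
    "doc-003": 0.88,
    "doc-004": 0.93,
    "doc-005": 0.99,
}

needs_review, auto_accepted = split_for_review(scores)
share_checked = len(needs_review) / len(scores)
print(needs_review)
print(f"share of documents checked by hand: {share_checked:.0%}")
```

In this toy case one document in five falls below the cutoff, so reviewers would read 20% of the pile instead of all of it.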

In another test, Cleanlab worked with a large bank (Northcutt would not name it but says it is a competitor to Goldman Sachs). Similar to Berkeley Research Group, the bank needed to search for references to insurance claims in around 100,000 documents. Again, the Trustworthy Language Model reduced the number of documents that needed to be hand-checked by more than half.

Running each query multiple times through multiple models takes longer and costs a lot more than the typical back-and-forth with a single chatbot. But Cleanlab is pitching the Trustworthy Language Model as a premium service to automate high-stakes tasks that would have been off limits to large language models in the past. The idea is not for it to replace existing chatbots but to do the work of human experts. If the tool can slash the amount of time that you need to employ skilled economists or lawyers at $2,000 an hour, the costs will be worth it, says Northcutt.

In the long run, Northcutt hopes that by reducing the uncertainty around chatbots’ responses, his tech will unlock the promise of large language models to a wider range of users. “The hallucination thing is not a large-language-model problem,” he says. “It’s an uncertainty problem.”

Correction: This article has been updated to clarify that the Trustworthy Language Model works with a range of different large language models.

Generative AI can turn your most precious memories into photos that never existed

10 April 2024 at 05:53

Maria grew up in Barcelona, Spain, in the 1940s. Her first memories of her father are vivid. As a six-year-old, Maria would visit a neighbor’s apartment in her building when she wanted to see him. From there, she could peer through the railings of a balcony into the prison below and try to catch a glimpse of him through the small window of his cell, where he was locked up for opposing the dictatorship of Francisco Franco.

There is no photo of Maria on that balcony. But she can now hold something like it: a fake photo—or memory-based reconstruction, as the Barcelona-based design studio Domestic Data Streamers puts it—of the scene that a real photo might have captured. The fake snapshots are blurred and distorted, but they can still rewind a lifetime in an instant.

“It’s very easy to see when you’ve got the memory right, because there is a very visceral reaction,” says Pau Garcia, founder of Domestic Data Streamers. “It happens every time. It’s like, ‘Oh! Yes! It was like that!’”

a generated black and white image of people dancing
In 1960s post-civil war Barcelona, 14-year-old Denia (now 73) and her family, newly arrived from Alcalá de Júcar, found solace and excitement in the lively dance hall ‘La Gavina Azul’. It was a sanctuary of joy amid the post-war reality, where the thrill of music and dance promised freedom from daily monotony and poverty of that time.
DOMESTIC DATA STREAMERS

Dozens of people have now had their memories turned into images in this way via Synthetic Memories, a project run by Domestic Data Streamers. The studio uses generative image models, such as OpenAI’s DALL-E, to bring people’s memories to life. Since 2022, the studio, which has received funding from the UN and Google, has been working with immigrant and refugee communities around the world to create images of scenes that have never been photographed, or to re-create photos that were lost when families left their previous homes.

Now Domestic Data Streamers is taking over a building next to the Barcelona Design Museum to record people’s memories of the city using synthetic images. Anyone can show up and contribute a memory to the growing archive, says Garcia. 

Synthetic Memories could prove to be more than a social or cultural endeavor. This summer, the studio will start a collaboration with researchers to find out if its technique could be used to treat dementia.

Memorable graffiti

The idea for the project came from an experience Garcia had in 2014, when he was working in Greece with an organization that was relocating refugee families from Syria. A woman told him that she was not afraid of being a refugee herself, but she was afraid of her children and grandchildren staying refugees because they might forget their family history: where they shopped, what they wore, how they dressed.

Garcia got volunteers to draw the woman’s memories as graffiti on the walls of the building where the families were staying. “They were really bad drawings, but the idea for synthetic memories was born,” he says. Several years later, when Garcia saw what generative image models could do, he remembered that graffiti. “It was one of the first things that came to mind,” he says.

a generated image of a mother walking on a footpath with three children in a green field
In 1990, 14-year-old Emerund lived in a small Cameroonian village, spending his afternoons helping his mother in the fields planting corn and potatoes after school. These moments were a mix of duty and joy as he balanced the responsibilities towards his family with the simple pleasures of being close to nature and his siblings. These memories from his childhood hold a special place in his heart, as he remembered one specific part of the fields where his siblings would play hide and seek with their mother.
DOMESTIC DATA STREAMERS

The process that Garcia and his team have developed is simple. An interviewer sits down with a subject and gets the person to recall a specific scene or event. A prompt engineer with a laptop uses that recollection to write a prompt for a model, which generates an image.

His team has built up a kind of glossary of prompting terms that have proved to be good at evoking different periods in history and different locations. But there’s often some back and forth, some tweaks to the prompt, says Garcia: “You show the image generated from that prompt to the subject and they might say, ‘Oh, the chair was on that side’ or ‘It was at night, not in the day.’ You refine it until you get it to a point where it clicks.”

So far Domestic Data Streamers has used the technique to preserve the memories of people in various migrant communities, including Korean, Bolivian, and Argentine families living in São Paulo, Brazil. But it has also worked with a care home in Barcelona to see how memory-based reconstructions might help older people. The team collaborated with researchers in Barcelona on a small pilot with 12 subjects, applying the approach to reminiscence therapy, a treatment for dementia that aims to stimulate cognitive abilities by showing someone images of the past. Developed in the 1960s, reminiscence therapy has many proponents, but researchers disagree on how effective it is and how it should be done.

The pilot allowed the team to refine the process and ensure that participants could give informed consent, says Garcia. The researchers are now planning to run a larger clinical study in the summer with colleagues at the University of Toronto to compare the use of generative image models with other therapeutic approaches.

One thing they did discover in the pilot was that older people connected with the images much better if they were printed out. “When they see them on a screen, they don’t have the same kind of emotional relation to them,” says Garcia. “But when they could see it physically, the memory got much more important.”    

Blurry is best

The researchers have also found that older versions of generative image models work better than newer ones. They started the project using two models that came out in 2022: DALL-E 2 and Stable Diffusion, a free-to-use generative image model released by Stability AI. These can produce images that are glitchy, with warped faces and twisted bodies. But when they switched to the latest version of Midjourney (another generative image model that can create more detailed images), the results did not click with people so well.

“If you make something super-realistic, people focus on details that were not there,” says Garcia. “If it’s blurry, the concept comes across better. Memories are a bit like dreams. They do not behave like photographs, with forensic details. You do not remember if the chair was red or green. You simply remember that there was a chair.” 

a group of people cluster around a synthetic memory with expressions of surprise
“When they could see it physically, the memory got much more important.”
@MARCASENSIO_FOTO

The team has since gone back to using the older models. “For us, the glitches are a feature,” says Garcia. “Sometimes things can be there and not there. It’s kind of a quantum state in the images that works really well with memories.”

Sam Lawton, an independent filmmaker who is not involved with the studio, is excited by the project. He’s especially happy that the team will be looking at the cognitive effects of these images in a rigorous clinical study. Lawton has used generative image models to re-create his own memories. In a film he made last year, called Expanded Childhood, he used DALL-E to extend old family photos beyond their borders, blurring real childhood scenes with surreal ones.

“The effect exposure to this kind of generated imagery has on a person’s brain was what spurred me to make the film in the first place,” says Lawton. “I was not in a position to launch a full-blown research effort, so I pivoted to the kind of storytelling that’s most natural to me.”

Lawton’s work explores a number of questions: What effect will long-term exposure to AI-generated or altered images have on us? Can such images help reframe traumatic memories? Or do they create a false sense of reality that can lead to confusion and cognitive dissonance?

Lawton showed the images in Expanded Childhood to his father and included his comments in the film: “Something’s wrong. I don’t know what that is. Do I just not remember it?”

Nuria, now 90, vividly recalls the men and boys who waited outside bomb shelters in Barcelona during the Spanish Civil War, ready with picks and axes to rescue anyone trapped inside. These individuals, braving the danger of falling bombs, showed incredible courage and selflessness. Their actions, risking their lives to save others, left a lasting impression on Nuria. Even now, she remembers in detail the clothes and dirty coats these men wore.
DOMESTIC DATA STREAMERS

Garcia is aware of the dangers of confusing subjective memories with real photographic records. His team’s memory-based reconstructions are not meant to be taken as factual documents, he says. In fact, he notes that this is another reason to stick with the less photorealistic images produced by older versions of generative image models. “It is important to differentiate very clearly what is synthetic memory and what is photography,” says Garcia. “This is a simple way to show that.”

But Garcia is now worried that the companies behind the models might retire their previous versions. Most users look forward to bigger and better models; for Synthetic Memories, less can be more. “I’m really scared that OpenAI will close DALL-E 2 and we will have to use DALL-E 3,” he says.

A conversation with OpenAI’s first artist in residence

29 March 2024 at 11:37

Alex Reben’s work is often absurd, sometimes surreal: a mash-up of giant ears imagined by DALL-E and sculpted by hand out of marble; critical burns generated by ChatGPT that thumb the nose at AI art. But its message is relevant to everyone. Reben is interested in the roles humans play in a world filled with machines, and how those roles are changing.

“I kind of use humor and absurdity to deal with a lot of these issues,” says Reben. “Some artists may come at things head-on in a very serious manner, but I find if you’re a little absurd it makes the ideas more approachable, even if the story you’re trying to tell is very serious.”

COURTESY OF ALEXANDER REBEN

Reben is OpenAI’s first artist in residence. Officially, the appointment started in January and lasts three months. But Reben’s relationship with the San Francisco–based AI firm seems casual: “It’s a little fuzzy, because I’m the first, and we’re figuring stuff out. I’m probably going to keep working with them.”

In fact, Reben has been working with OpenAI for years already. Five years ago, he was invited to try out an early version of GPT-3 before it was released to the public. “I got to play around with that quite a bit and made a few artworks,” he says. “They were quite interested in seeing how I could use their systems in different ways. And I was like, cool, I’d love to try something new, obviously. Back then I was mostly making stuff with my own models or using websites like Ganbreeder [a precursor of today’s generative image-making models].”

In 2008, Reben studied math and robotics at MIT’s Media Lab. There he helped create a cardboard robot called Boxie, which inspired the cute robot Baymax in the movie Big Hero 6. He is now director of technology and research at Stochastic Labs, a nonprofit incubator for artists and engineers in Berkeley, California. I spoke to Reben via Zoom about his work, the unresolved tension between art and technology, and the future of human creativity.

Our conversation has been edited for length and clarity.

You’re interested in ways that humans and machines interact. As an AI artist, how would you describe what you do with technology? Is it a tool, a collaborator?

Firstly, I don’t call myself an AI artist. AI is simply another technological tool. If something comes along after AI that interests me, I wouldn’t, like, say, “Oh, I’m only an AI artist.”

Okay. But what is it about these AI tools? Why have you spent your career playing around with this kind of technology?

My research at the Media Lab was all about social robotics, looking at how people and robots come together in different ways. One robot [Boxie] was also a filmmaker. It basically interviewed people, and we found that the robot was making people open up to it and tell it very deep stories. This was pre-Siri, or anything like that. These days people are familiar with the idea of talking to machines. So I’ve always been interested in how humanity and technology co-evolve over time. You know, we are who we are today because of technology.

three small sculptures on a white plinth. The first is a puppet head wearing a white cowboy hat and the other two are small smiling cardboard robots on plastic conveyor wheels
A few cardboard BlabDroids displayed next to a plastic mask from a performative art piece, entitled Five Dollars Can Save Planet Earth.
COURTESY OF ALEXANDER REBEN

Right now, there’s a lot of pushback against the use of AI in art. There’s a lot of understandable unhappiness about technology that lets you just press a button and get an image. People are unhappy that these tools were even made and argue that the makers of these tools, like OpenAI, should maybe carry some more responsibility. But here you are, immersed in the art world, continuing to make fun, engaging art. I’m wondering what your experience of those kinds of conversations has been?

Yeah. So as I’m sure you know, being in the media, the negative voices are always louder. The people who are using these tools in positive ways aren’t quite as loud sometimes.

But, I mean, it’s also a very wide issue. People take a negative view for many different reasons. Some people worry about the data sets, some people worry about job replacement. Other people worry about, you know, disinformation and the world being flooded with media. And they’re all valid concerns.

When I talk about this, I go to the history of photography. What we’re seeing today is basically a parallel of what happened back then. There are no longer artists who paint products for a living—like, who paint cans of peaches for an advertisement in a magazine or on a billboard. But that used to be a job, right? Photography eliminated that swath of folks.

You know, you used the phrase—I wrote it down—“just press a button and get an image,” which also reminds me of photography. Anyone can push a button and get an image, but to be a fine-art photographer, it takes a lot of skill. Just because artwork is quick to make doesn’t necessarily mean it’s any worse than, like, someone sculpting something for 60 years out of marble. They’re different things.

AI is moving fast. We’ve moved past the equivalent of wet-plate photography using cyanide. But we’re certainly not in the Polaroid phase quite yet. We’re still coming to terms with what this means, both in a fine-art sense but also for jobs.

But, yeah, your question has so many facets. We could pick any one of them and go at it. There’s definitely a lot of valid concerns out there. But I also think looking at the history of technology, and how it’s actually empowered artists and people to make new things, is important as well.

There’s another line of argument that if you have a potentially infinite supply of AI-generated images, it devalues creativity. I’m curious about the balance you see in your work between what you do and what the technology does for you. How do you relate that balance to this question of value, and where we find value in art?

Sure, value in art—there’s an economic sense and there’s a critical sense, right? In an economic sense, you could tape a banana to a wall and sell it for 30,000 dollars. It’s just who’s willing to buy it or whatever.

In a critical sense, again, going back to photography, the world is flooded with images and there are still people making great photography out there. And there are people who set themselves apart by doing something that is different.

installation view from "AI am I?"
Reben’s exhibition “AI am I?” featuring The Plungers is on view at Sacramento’s Crocker Art Museum until the end of April.
COURTESY OF ALEXANDER REBEN

I play around with those ideas. A little bit like, you know, the plunger work was the first one. [The Plungers is an installation that Reben made by creating a physical version of an artwork invented by GPT-3.] I got GPT to describe an artwork that didn’t exist; then I made it. Which kind of flips the idea of authorship on its head but still required me to go through thousands of outputs to find one that was funny enough to make.

Back then GPT wasn’t a chatbot. I spent a good month coming up with the beginning bits of texts—like, wall labels next to art in museums—and getting GPT to complete them.

I also really like your ear sculpture, Ear we go again. It’s a sculpture described by GPT-3, visualized by DALL-E, and carved out of marble by a robot. It’s sort of like a waterfall, with one kind of software feeding the next.

When text-to-image came out, it made obvious sense to feed it the descriptions of artworks I’d been generating. It’s a chain, sort of back and forth, human to machine back to human. That ear, in particular: it starts with a description that’s fed into DALL-E, but then that image was turned into a 3D model by a human 3D artist.

And after that it was carved by robots. But the robots get only so far with the detail, so human sculptors have to come in and finish it by hand. I’ve made 10 or 15 permutations of this, playing with those back-and-forths, chaining technology together. And the final thing that happens now is that I will take a picture of the artwork and get GPT-4 to create the wall label for it. 

Yeah, that keeps coming up in your work, the different ways that humans and machines interact.

You know, I made some videos of the process of these things being made to show how many artisans were employed in making them. There are still huge industries where I can see AI increasing work for folks, people who will make stuff that AI comes up with.  

I’m struck by the serendipity that often comes with generative tools, making art out of something random. Do you see a connection between your work and found art or ready-mades, like Duchamp’s Fountain? I mean, you’re maybe not just coming across a urinal and thinking, “Oh, that’s cool.” But when you play around with these tools, at some point you must get something presented to you that you react to and think, “I can use that.”

For sure. Yeah, it actually reminds me a little bit more of street photography, which I used to do when I was in college in New York City, where you would just kind of roam around and wait for something to inspire you. Then you’d set yourself up to capture the image in the way that you wanted. It’s kind of like that for sure. There’s definitely a curatorial process to it. There’s a process of finding things, which I think is interesting.

We talked about photography. Photography changed the art that came after it. You know, you had movements where people wanted to try to get at a reality that wasn’t photographic reality—things like Impressionism, and Cubism or Picasso. Do you think we’ll see something similar happening because of AI?

I think so. Any new artistic tool definitely changes the field as people figure out not only how to use that tool but how to differentiate themselves from what that tool can do.

Talking of AI as a tool—do you think that art will always be something made by humans? That no matter how good the tech gets, it will always just be a tool? You know, the way you’ve strung together these different AIs—you could do that without being in the loop. You could just have some kind of curator AI at the end that chooses what it likes best. Would that ever be art?

I actually have a couple of works in which an AI creates an image, uses the image to create a new image, and just keeps going. But I think even in a super-automated process you can go back far enough to find some human somewhere who made a decision to do something. Like, maybe they chose what data set to use.

We might see hotel rooms filled with robot paintings. I mean, stuff we hardly even look at, that never even makes its way through human curation.

I guess the question is really how much human involvement is needed to make something art. Is there a threshold or, like, a percentage of involvement? It’s a good question.

Yeah, I guess it’s like, is it still art if there’s no one there to see it?

You know, what is and isn’t art is one of those questions that has been asked forever. I think more to the point is: What is good art versus bad art? And that’s very personal.

But I think humans are always going to be doing this stuff. We will still be painting in the far future, even when robots are making paintings.

How three filmmakers created Sora’s latest stunning videos

28 March 2024 at 08:49

In the last month, a handful of filmmakers have taken Sora for a test drive. The results, which OpenAI published this week, are amazing. The short films are a big jump up even from the cherry-picked demo videos that OpenAI used to tease its new generative model just six weeks ago. Here’s how three of the filmmakers did it.

“Air Head” by Shy Kids

Shy Kids is a pop band and filmmaking collective based in Toronto that describes its style as “punk-rock Pixar.” The group has experimented with generative video tech before. Last year it made a music video for one of its songs using an open-source tool called Stable Warpfusion. It’s cool, but low-res and glitchy. The film it made with Sora, called “Air Head,” could pass for real footage—if it didn’t feature a man with a balloon for a face.

One problem with most generative video tools is that it’s hard to maintain consistency across frames. When OpenAI asked Shy Kids to try out Sora, the band wanted to see how far they could push it. “We thought a fun, interesting experiment would be—could we make a consistent character?” says Shy Kids member Walter Woodman. “We think it was mostly successful.”

Generative models can also struggle with anatomical details like hands and faces. But in the video there is a scene showing a train car full of passengers, and the faces are near perfect. “It’s mind-blowing what it can do,” says Woodman. “Those faces on the train were all Sora.”

Has generative video’s problem with faces and hands been solved? Not quite. We still get glimpses of warped body parts. And text is still a problem (in another video, by the creative agency Native Foreign, we see a bike repair shop with the sign “Biycle Repaich”). But everything in “Air Head” is raw output from Sora. After editing together many different clips produced with the tool, Shy Kids did a bunch of post-processing to make the film look even better. They used visual effects tools to fix certain shots of the main character’s balloon face, for example.

Woodman also thinks that the music (which they wrote and performed) and the voice-over (which they also wrote and performed) help to lift the quality of the film even more. Mixing these human touches in with Sora’s output is what makes the film feel alive, says Woodman. “The technology is nothing without you,” he says. “It is a powerful tool, but you are the person driving it.”

[Update: Shy Kids have posted a behind-the-scenes video for Air Head on X. Come for the pro tips, stay for the Sora bloopers: “How do you maintain a character and look consistent even though Sora is a slot machine as to what you get back?” asks Woodman.]

“Abstract” by Paul Trillo

Paul Trillo, an artist and filmmaker, wanted to stretch what Sora could do with the look of a film. His video is a mash-up of retro-style footage with shots of a figure who morphs into a glitter ball and a breakdancing trash man. He says that everything you see is raw output from Sora: “No color correction or post FX.” Even the jump-cut edits in the first part of the film were produced using the generative model.

Trillo felt that the demos that OpenAI put out last month came across too much like clips from video games. “I wanted to see what other aesthetics were possible,” he says. The result is a video that looks like something shot with vintage 16-millimeter film. “It took a fair amount of experimenting, but I stumbled upon a series of prompts that helps make the video feel more organic or filmic,” he says.

“Beyond Our Reality” by Don Allen Stevenson

Don Allen Stevenson III is a filmmaker and visual effects artist. He was one of the artists invited by OpenAI to try out DALL-E 2, its text-to-image model, a couple of years ago. Stevenson’s film is a NatGeo-style nature documentary that introduces us to a menagerie of imaginary animals, from the girafflamingo to the eel cat.

In many ways working with text-to-video is like working with text-to-image, says Stevenson. “You enter a text prompt and then you tweak your prompt a bunch of times,” he says. But there’s an added hurdle. When you’re trying out different prompts, Sora produces low-res video. When you hit on something you like, you can then increase the resolution. But going from low to high res involves another round of generation, and what you liked in the low-res version can be lost.

Sometimes the camera angle is different or the objects in the shot have moved, says Stevenson. Hallucination is still a feature of Sora, as it is in any generative model. With still images this might produce weird visual defects; with video those defects can appear across time as well, with weird jumps between frames.

Stevenson also had to figure out how to speak Sora’s language. It takes prompts very literally, he says. In one experiment he tried to create a shot that zoomed in on a helicopter. Sora produced a clip in which it mixed together a helicopter with a camera’s zoom lens. But Stevenson says that with a lot of creative prompting, Sora is easier to control than previous models.

Even so, he thinks that surprises are part of what makes the technology fun to use: “I like having less control. I like the chaos of it,” he says. There are many other video-making tools that give you control over editing and visual effects. For Stevenson, the point of a generative model like Sora is to come up with strange, unexpected material to work with in the first place.

The clips of the animals were all generated with Sora. Stevenson tried many different prompts until the tool produced something he liked. “I directed it, but it’s more like a nudge,” he says. He then went back and forth, trying out variations.

Stevenson pictured his fox crow having four legs, for example. But Sora gave it two, which worked even better. (It’s not perfect: sharp-eyed viewers will see that at one point in the video the fox crow switches from two legs to four, then back again.) Sora also produced several versions that he thought were too creepy to use.

When he had a collection of animals he really liked, he edited them together. Then he added captions and a voice-over on top. Stevenson could have created his made-up menagerie with existing tools. But it would have taken hours, even days, he says. With Sora the process was far quicker.

“I was trying to think of something that would look cool and experimented with a lot of different characters,” he says. “I have so many clips of random creatures.” Things really clicked when he saw what Sora did with the girafflamingo. “I started thinking: What’s the narrative around this creature? What does it eat, where does it live?” he says. He plans to put out a series of extended films following each of the fantasy animals in more detail.

Stevenson also hopes his fantastical animals will make a bigger point. “There’s going to be a lot of new types of content flooding feeds,” he says. “How are we going to teach people what’s real? In my opinion, one way is to tell stories that are clearly fantasy.”

Stevenson points out that his film could be the first time a lot of people see a video created by a generative model. He wants that first impression to make one thing very clear: This is not real.

What’s next for generative video

28 March 2024 at 08:23

MIT Technology Review’s What’s Next series looks across industries, trends, and technologies to give you a first look at the future. You can read the rest of them here.

When OpenAI revealed its new generative video model, Sora, last month, it invited a handful of filmmakers to try it out. This week the company published the results: seven surreal short films that leave no doubt that the future of generative video is coming fast. 

The first batch of models that could turn text into video appeared in late 2022, from companies including Meta, Google, and video-tech startup Runway. It was a neat trick, but the results were grainy, glitchy, and just a few seconds long.

Fast-forward 18 months, and the best of Sora’s high-definition, photorealistic output is so stunning that some breathless observers are predicting the death of Hollywood. Runway’s latest models can produce short clips that rival those made by blockbuster animation studios. Midjourney and Stability AI, the firms behind two of the most popular text-to-image models, are now working on video as well.

A number of companies are racing to make a business on the back of these breakthroughs. Most are figuring out what that business is as they go. “I’ll routinely scream, ‘Holy cow, that is wicked cool’ while playing with these tools,” says Gary Lipkowitz, CEO of Vyond, a firm that provides a point-and-click platform for putting together short animated videos. “But how can you use this at work?”

Whatever the answer to that question, it will probably upend a wide range of businesses and change the roles of many professionals, from animators to advertisers. Fears of misuse are also growing. The widespread ability to generate fake video will make it easier than ever to flood the internet with propaganda and nonconsensual porn. We can see it coming. The problem is, nobody has a good fix.

As we continue to get to grips with what’s ahead—good and bad—here are four things to think about. We’ve also curated a selection of the best videos filmmakers have made using this technology, including an exclusive reveal of “Somme Requiem,” an experimental short film by Los Angeles–based production company Myles. Read on for a taste of where AI moviemaking is headed.

1. Sora is just the start

OpenAI’s Sora is currently head and shoulders above the competition in video generation. But other companies are working hard to catch up. The market is going to get extremely crowded over the next few months as more firms refine their technology and start rolling out Sora’s rivals.

The UK-based startup Haiper came out of stealth this month. It was founded in 2021 by former Google DeepMind and TikTok researchers who wanted to work on technology called neural radiance fields, or NeRF, which can transform 2D images into 3D virtual environments. They thought a tool that turned snapshots into scenes users could step into would be useful for making video games.

But six months ago, Haiper pivoted from virtual environments to video clips, adapting its technology to fit what CEO Yishu Miao believes will be an even bigger market than games. “We realized that video generation was the sweet spot,” says Miao. “There will be a super-high demand for it.”

“Air Head” is a short film made by Shy Kids, a pop band and filmmaking collective based in Toronto, using Sora.

Like OpenAI’s Sora, Haiper’s generative video tech uses a diffusion model to manage the visuals and a transformer (the component in large language models like GPT-4 that makes them so good at predicting what comes next) to manage the consistency between frames. “Videos are sequences of data, and transformers are the best model to learn sequences,” says Miao.

Consistency is a big challenge for generative video and the main reason existing tools produce just a few seconds of video at a time. Transformers for video generation can boost the quality and length of the clips. The downside is that transformers make stuff up, or hallucinate. In text, this is not always obvious. In video, it can result in, say, a person with multiple heads. Keeping transformers on track requires vast silos of training data and warehouses full of computers.

That’s why Irreverent Labs, founded by former Microsoft researchers, is taking a different approach. Like Haiper, Irreverent Labs started out generating environments for games before switching to full video generation. But the company doesn’t want to follow the herd by copying what OpenAI and others are doing. “Because then it’s a battle of compute, a total GPU war,” says David Raskino, Irreverent’s cofounder and CTO. “And there’s only one winner in that scenario, and he wears a leather jacket.” (He’s talking about Jensen Huang, CEO of the trillion-dollar chip giant Nvidia.)

Instead of using a transformer, Irreverent’s tech combines a diffusion model with a model that predicts what’s in the next frame on the basis of common-sense physics, such as how a ball bounces or how water splashes on the floor. Raskino says this approach reduces both training costs and the number of hallucinations. The model still produces glitches, but they are distortions of physics (like a bouncing ball not following a smooth curve, for example) with known mathematical fixes that can be applied to the video after it is generated, he says.

Which approach will last remains to be seen. Miao compares today’s technology to large language models circa GPT-2. Five years ago, OpenAI’s groundbreaking early model amazed people because it showed what was possible. But it took several more years for the technology to become a game-changer.

It’s the same with video, says Miao: “We’re all at the bottom of the mountain.”

2. What will people do with generative video? 

Video is the medium of the internet. YouTube, TikTok, newsreels, ads: expect to see synthetic video popping up everywhere there’s video already.

The marketing industry is one of the most enthusiastic adopters of generative technology. Two-thirds of marketing professionals have experimented with generative AI in their jobs, according to a recent survey Adobe carried out in the US, with more than half saying they have used the technology to produce images.

Generative video is next. A few marketing firms have already put out short films to demonstrate the technology’s potential. The latest example is the 2.5-minute-long “Somme Requiem,” made by Myles. You can watch the film below in an exclusive reveal from MIT Technology Review.

“Somme Requiem” is a short film made by Los Angeles production company Myles. Every shot was generated using Runway’s Gen 2 model. The clips were then edited together by a team of video editors at Myles.

“Somme Requiem” depicts snowbound soldiers during the World War I Christmas ceasefire in 1914. The film is made up of dozens of different shots that were produced using a generative video model from Runway, then stitched together, color-corrected, and set to music by human video editors at Myles. “The future of storytelling will be a hybrid workflow,” says founder and CEO Josh Kahn.

Kahn picked the period wartime setting to make a point. He notes that the Apple TV+ series Masters of the Air, which follows a group of World War II airmen, cost $250 million. The team behind Peter Jackson’s World War I documentary They Shall Not Grow Old spent four years curating and restoring more than 100 hours of archival film. “Most filmmakers can only dream of ever having an opportunity to tell a story in this genre,” says Kahn.

“Independent filmmaking has been kind of dying,” he adds. “I think this will create an incredible resurgence.”

Raskino hopes so. “The horror movie genre is where people test new things, to try new things until they break,” he says. “I think we’re going to see a blockbuster horror movie created by, like, four people in a basement somewhere using AI.”

So is generative video a Hollywood-killer? Not yet. The scene-setting shots in “Somme Requiem”—empty woods, a desolate military camp—look great. But the people in it are still afflicted with mangled fingers and distorted faces, hallmarks of the technology. Generative video is best at wide-angle pans or lingering close-ups, which creates an eerie atmosphere but little action. If “Somme Requiem” were any longer it would get dull.

But scene-setting shots pop up all the time in feature-length movies. Most are just a few seconds long, but they can take hours to film. Raskino suggests that generative video models could soon be used to produce those in-between shots for a fraction of the cost. This could also be done on the fly in later stages of production, without requiring a reshoot.

Michal Pechoucek, CTO at Gen Digital, the cybersecurity giant behind a range of antivirus brands including Norton and Avast, agrees. “I think this is where the technology is headed,” he says. “We’ll see many different models, each specifically trained in a certain domain of movie production. These will just be tools used by talented video production teams.”

We’re not there quite yet. A big problem with generative video is the lack of control users have over the output. Producing still images can be hit and miss; producing a few seconds of video is even more risky.

“Right now it’s still fun, you get a-ha moments,” says Miao. “But generating video that is exactly what you want is a very hard technical problem. We are some way off generating long, consistent videos from a single prompt.”

That’s why Vyond’s Lipkowitz thinks the technology isn’t yet ready for most corporate clients. These users want a lot more control over the look of a video than current tools give them, he says.

Thousands of companies around the world, including around 65% of the Fortune 500 firms, use Vyond’s platform to create animated videos for in-house communications, training, marketing, and more. Vyond draws on a range of generative models, including text-to-image and text-to-voice, but provides a simple drag-and-drop interface that lets users put together a video by hand, piece by piece, rather than generate a full clip with a click.

Running a generative model is like rolling dice, says Lipkowitz. “This is a hard no for most video production teams, particularly in the enterprise sector where everything must be pixel-perfect and on brand,” he says. “If the video turns out bad—maybe the characters have too many fingers, or maybe there is a company logo that is the wrong color—well, unlucky, that’s just how gen AI works.”

The solution? More data, more training, repeat. “I wish I could point to some sophisticated algorithms,” says Miao. “But no, it’s just a lot more learning.”

3. Misinformation isn’t new, but deepfakes will make it worse.

Online misinformation has been undermining our faith in the media, in institutions, and in each other for years. Some fear that adding fake video to the mix will destroy whatever pillars of shared reality we have left.

“We are replacing trust with mistrust, confusion, fear, and hate,” says Pechoucek. “Society without ground truth will degenerate.”

Pechoucek is especially worried about the malicious use of deepfakes in elections. During last year’s elections in Slovakia, for example, attackers shared a fake video that showed the leading candidate discussing plans to manipulate voters. The video was low quality and easy to spot as a deepfake. But Pechoucek believes it was enough to turn the result in favor of the other candidate.

“Adventurous Puppies” is a short clip made by OpenAI using Sora.

John Wissinger, who leads the strategy and innovation teams at Blackbird AI, a firm that tracks and manages the spread of misinformation online, believes fake video will be most persuasive when it blends real and fake footage. Take two videos showing President Joe Biden walking across a stage. In one he stumbles, in the other he doesn’t. Who is to say which is real?

“Let’s say an event actually occurred, but the way it’s presented to me is subtly different,” says Wissinger. “That can affect my emotional response to it.” As Pechoucek noted, a fake video doesn’t even need to be that good to make an impact. A bad fake that fits existing biases will do more damage than a slick fake that doesn’t, says Wissinger.

That’s why Blackbird focuses on who is sharing what with whom. In some sense, whether something is true or false is less important than where it came from and how it is being spread, says Wissinger. His company already tracks low-tech misinformation, such as social media posts showing real images out of context. Generative technologies make things worse, but the problem of people presenting media in misleading ways, deliberately or otherwise, is not new, he says.

Throw bots into the mix, sharing and promoting misinformation on social networks, and things get messy. Just knowing that fake media is out there will sow seeds of doubt into bad-faith discourse. “You can see how pretty soon it could become impossible to discern between what’s synthesized and what’s real anymore,” says Wissinger.

4. We are facing a new online reality.

Fakes will soon be everywhere, from disinformation campaigns, to ad spots, to Hollywood blockbusters. So what can we do to figure out what’s real and what’s just fantasy? There are a range of solutions, but none will work by themselves.

The tech industry is working on the problem. Most generative tools try to enforce certain terms of use, such as preventing people from creating videos of public figures. But there are ways to bypass these filters, and open-source versions of the tools may come with more permissive policies.

Companies are also developing standards for watermarking AI-generated media and tools for detecting it. But not all tools will add watermarks, and watermarks can be stripped from a video’s metadata. No reliable detection tool exists either. Even if such tools worked, they would become part of a cat-and-mouse game of trying to keep up with advances in the models they are designed to police.

Online platforms like X and Facebook have poor track records when it comes to moderation. We should not expect them to do better once the problem gets harder. Miao used to work at TikTok, where he helped build a moderation tool that detects video uploads that violate TikTok’s terms of use. Even he is wary of what’s coming: “There’s real danger out there,” he says. “Don’t trust things that you see on your laptop.” 

Blackbird has developed a tool called Compass, which lets you fact check articles and social media posts. Paste a link into the tool and a large language model generates a blurb drawn from trusted online sources (these are always open to review, says Wissinger) that gives some context for the linked material. The result is very similar to the community notes that sometimes get attached to controversial posts on sites like X, Facebook, and Instagram. The company envisions having Compass generate community notes for anything. “We’re working on it,” says Wissinger.

But people who put links into a fact-checking website are already pretty savvy—and many others may not know such tools exist, or may not be inclined to trust them. Misinformation also tends to travel far wider than any subsequent correction.

In the meantime, people disagree on whose problem this is in the first place. Pechoucek says tech companies need to open up their software to allow for more competition around safety and trust. That would also let cybersecurity firms like his develop third-party software to police this tech. It’s what happened 30 years ago when Windows had a malware problem, he says: “Microsoft let antivirus firms in to help protect Windows. As a result, the online world became a safer place.”

But Pechoucek isn’t too optimistic. “Technology developers need to build their tools with safety as the top objective,” he says. “But more people think about how to make the technology more powerful than worry about how to make it more safe.”

Made by OpenAI using Sora.

There’s a common fatalistic refrain in the tech industry: change is coming, deal with it. “Generative AI is not going to get uninvented,” says Raskino. “This may not be very popular, but I think it’s true: I don’t think tech companies can bear the full burden. At the end of the day, the best defense against any technology is a very well-educated public. There’s no shortcut.”

Miao agrees. “It’s inevitable that we will massively adopt generative technology,” he says. “But it’s also the responsibility of the whole of society. We need to educate people.” 

“Technology will move forward, and we need to be prepared for this change,” he adds. “We need to remind our parents, our friends, that the things they see on their screen might not be authentic.” This is especially true for older generations, he says: “Our parents need to be aware of this kind of danger. I think everyone should work together.”

We’ll need to work together quickly. When Sora came out a month ago, the tech world was stunned by how quickly generative video had progressed. But the vast majority of people have no idea this kind of technology even exists, says Wissinger: “They certainly don’t understand the trend lines that we’re on. I think it’s going to catch the world by storm.”
