Here’s how OpenAI’s magical DALL·E image generator works

It seems like every few months, someone publishes a machine learning paper or demo that makes my jaw drop. This month, it’s OpenAI’s new image-generating model, DALL·E.

This gigantic 12-billion-parameter neural network takes a text caption (e.g., “an armchair in the shape of an avocado”) and generates images to match it:


I find the images quite inspiring (I would buy one of those avocado chairs), but what’s even more impressive is DALL·E’s ability to understand and render concepts of space, time, and even logic (more on that in a second).

In this post, I’ll give you a quick rundown of what DALL·E can do, how it works, how it fits in with the latest trends in ML, and why it’s important. Let’s go!

What is DALL·E and what can it do?

In July, DALL·E’s creator, the company OpenAI, released a similarly huge model called GPT-3 that wowed the world with its ability to generate human-like text, including op-eds, poems, sonnets, and even computer code. DALL·E is a natural extension of GPT-3 that parses text prompts and then responds with pictures rather than words. In one example from the OpenAI blog, the model renders images from the prompt, “A living room with two white armchairs and a painting of the Colosseum. The painting is mounted above a modern fireplace”:

DALL·E generated images.

Pretty smart, isn’t it? You can probably already see how useful this could be for designers. Note that DALL·E can generate a large set of images from a single prompt. The images are then ranked by a second OpenAI model, called CLIP, that tries to determine which pictures match the prompt best.
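To make the generate-then-rank idea concrete, here is a rough sketch in Python. It is a toy illustration, not OpenAI’s code: the embeddings are hand-made vectors standing in for what a CLIP-like model would compute, and `cosine_similarity` and `rerank` are hypothetical helper names invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(text_embedding, image_embeddings, top_k=2):
    """Return the names of the top_k candidates that best match the text."""
    scored = sorted(
        image_embeddings.items(),
        key=lambda item: cosine_similarity(text_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

# Hypothetical embeddings; a real system would compute these with a trained model.
prompt_embedding = [1.0, 0.0, 1.0]
candidates = {
    "image_a": [0.9, 0.1, 1.1],   # points almost the same way as the prompt
    "image_b": [-1.0, 0.5, 0.0],  # points away from the prompt
    "image_c": [1.0, 1.0, 1.0],   # partially matches
}

best = rerank(prompt_embedding, candidates, top_k=2)
print(best)  # ['image_a', 'image_c']
```

The generator never has to be perfect here. It only has to produce some good candidates, and the ranking step filters out the rest.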

How was DALL · E built?

Unfortunately, we don’t have many details on this yet, as OpenAI has not yet published a full paper. At its core, however, DALL·E uses the same new neural network architecture that has driven recent advances in ML: the Transformer. Transformers, introduced in 2017, are an easy-to-parallelize type of neural network that can be scaled up and trained on huge datasets. They have been particularly revolutionary in natural language processing (they are the basis of models like BERT, T5, GPT-3, and others), improving the quality of Google Search results, translation, and even predictions of protein structures.
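If you’re curious what sits at the heart of a Transformer, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is illustrative only and says nothing about DALL·E’s actual implementation: it leaves out the learned projection weights, multiple heads, and masking that real models use, and the function names are made up for this example.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score every position against the current query.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all the value vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
print(attended)
```

Notice that each pass through the outer loop depends only on the inputs, never on another position’s output. That independence is what makes the computation easy to run in parallel.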


Most of these large language models are trained on huge text datasets (like all of Wikipedia or crawls of the web). What makes DALL·E unique, however, is that it was trained on sequences that were a combination of words and pixels. We don’t yet know what the dataset was (it probably contained images and captions), but I can guarantee you it was probably huge.
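One way to picture “sequences that combine words and pixels” is a single list of integer tokens, with the caption’s tokens first and the image’s tokens after. The sketch below is a guess at the general idea, not a description of how OpenAI built its dataset: the three-word vocabulary, the tiny 2x2 “image”, and the `build_sequence` helper are all hypothetical.

```python
# Hypothetical toy vocabulary; a real model would use a far larger one.
TEXT_VOCAB = {"an": 0, "avocado": 1, "armchair": 2}
TEXT_VOCAB_SIZE = len(TEXT_VOCAB)

def build_sequence(caption, pixels):
    """Concatenate caption tokens and image tokens into one list of ints.

    Image values are shifted by TEXT_VOCAB_SIZE so that the two kinds of
    token never share an id.
    """
    text_tokens = [TEXT_VOCAB[word] for word in caption.lower().split()]
    # Flatten the image row by row, then offset each value past the text ids.
    image_tokens = [TEXT_VOCAB_SIZE + value for row in pixels for value in row]
    return text_tokens + image_tokens

# A 2x2 "image" whose pixels are already quantized to small integers (an assumption).
tiny_image = [[0, 3], [2, 1]]
sequence = build_sequence("an avocado armchair", tiny_image)
print(sequence)  # [0, 1, 2, 3, 6, 5, 4]
```

A model trained to predict the next token in sequences shaped like this could, in principle, be handed just the caption and asked to continue with image tokens.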

How “smart” is DALL·E?

While these results are impressive, whenever we train a model on a huge dataset, the skeptical machine learning engineer is right to ask whether the results are only high-quality because they were copied or memorized from the source material.

To prove that DALL·E isn’t just regurgitating images, the OpenAI authors had it render some pretty unusual prompts:

“A professional, high quality illustration of a giraffe turtle chimera.”


“A snail made from a harp.”


It’s hard to imagine that the model encountered many giraffe-turtle hybrids in its training dataset, which makes the results more impressive.

In addition, these weird prompts hint at something even more intriguing about DALL·E: its ability to perform “zero-shot visual reasoning.”

Zero-Shot Visual Reasoning

Typically in machine learning, we train models by giving them thousands or millions of examples of tasks to perform and hoping they’ll pick up on the pattern.

For example, to train a model that identifies dog breeds, we might show a neural network thousands of pictures of dogs labeled by breed and then test its ability to label new pictures of dogs. It’s a limited-scope task that seems almost quaint compared to OpenAI’s latest feats.
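Here is what that kind of narrow, supervised setup looks like in miniature. To keep it short, a simple nearest-centroid classifier stands in for the neural network, and the “dogs” are made-up (height, weight) pairs rather than photos, so treat it as a sketch of the workflow rather than a realistic breed classifier.

```python
def train_centroids(examples):
    """Average the feature vectors of each label to get one centroid per label."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }

def predict(centroids, features):
    """Pick the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=distance)

# Made-up training data: (height in cm, weight in kg) with a breed label.
training_data = [
    ((25.0, 3.0), "chihuahua"),
    ((22.0, 2.5), "chihuahua"),
    ((60.0, 30.0), "labrador"),
    ((57.0, 28.0), "labrador"),
]
model = train_centroids(training_data)
print(predict(model, (24.0, 2.8)))   # chihuahua
print(predict(model, (58.0, 31.0)))  # labrador
```

A model like this can only ever answer with one of the labels it saw during training, which is exactly the limitation that makes the next idea interesting.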

Zero-shot learning, on the other hand, is the ability of models to perform tasks they weren’t specifically trained to do. For example, DALL·E was trained to generate images from captions. With the right text prompt, however, it can also transform images into sketches:

Results of the prompt “the exact same cat on the top as a sketch on the bottom”.

DALL·E can also render custom text on street signs:

Results of the prompt “a store front that has the word ‘openai’ written on it”.

In this way, DALL·E can act almost like a Photoshop filter, even though it wasn’t specifically designed to behave this way.

The model even shows an “understanding” of visual concepts (e.g., “macroscopic” or “cross-section” pictures), places (e.g., “a photo of the food from China”), and time (“a photo of Alamo Square, San Francisco, from a street at night”; “a photo of a phone from the 1920s”). For example, here is what it produced in response to the prompt, “a photo of the food from China”:

Results of the prompt “a photo of the food from China”.

In other words, DALL·E can do more than just paint a pretty picture for a caption. In a sense, it can also answer questions visually.

To test DALL·E’s visual reasoning ability, the authors had it take a visual IQ test. In the examples below, the model had to complete the lower-right corner of the grid, following the test’s hidden pattern.

A screenshot of the visual IQ test OpenAI used to test DALL·E.

“DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning,” the authors write, but it did better on some problems than others. When the puzzles’ colors were inverted, DALL·E did worse, “suggesting its capabilities may be brittle in unexpected ways.”

What does that mean?

What strikes me most about DALL·E is its ability to perform surprisingly well on so many different tasks, including ones the authors didn’t even anticipate:

“We find that DALL·E […] is able to perform several kinds of image-to-image translation tasks when prompted in the right way.

We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it.”

It’s amazing, but not entirely unexpected. DALL·E and GPT-3 are two examples of a larger theme in deep learning: extraordinarily large neural networks trained on unlabeled internet data (an example of “self-supervised learning”) can be highly versatile, able to do lots of things they weren’t specifically designed for.

Of course, don’t mistake this for general intelligence. It’s not hard to make these types of models look pretty dumb. We’ll know more when they’re openly available and we can start playing around with them. But that doesn’t mean I can’t get excited in the meantime.

This article was written by Dale Markowitz, an applied AI engineer at Google in Austin, Texas, where she is working on applying machine learning to new areas and industries. She also likes solving her own life problems with AI and talks about it on YouTube.

Published on January 10, 2021 – 11:00 UTC
