Google announces text-to-image AI model Imagen: better than DALL-E 2

The artificial intelligence (AI) world is still figuring out what to do with the amazing display of capabilities in DALL-E 2, which can draw, paint, or imagine almost anything on request. But OpenAI isn't the only one working on something like this: Google Research has unveiled a similar model it's been working on, and it says this one is even better.

Text-to-image models, which take a prompt such as "a dog on a bike" and produce a corresponding image, have been around for years, but they have recently made a huge leap in quality and accessibility.

Part of that is the use of diffusion techniques, which basically start with a pure-noise image and refine it bit by bit until the model decides it can't make the picture look any more like a dog on a bike than it already does. This was an improvement over top-to-bottom generators, which could get things hilariously wrong on the first guess, or be easily led astray along the way.
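The loop at the heart of that idea is simple to sketch. Below is a minimal, illustrative version in Python, where `denoise_step` is a hypothetical callable standing in for the trained model; Imagen's actual sampler is considerably more involved:

```python
import numpy as np

def sample(denoise_step, steps=1000, size=(64, 64, 3)):
    # Start from pure Gaussian noise.
    x = np.random.randn(*size)
    # Refine bit by bit: each hypothetical denoise_step(x, t) call nudges
    # the image a little closer to something matching the prompt.
    for t in reversed(range(steps)):
        x = denoise_step(x, t)
    return x
```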


Another part is improved language understanding from large language models built on the transformer architecture, which, along with some other recent advances, has produced convincing language models such as GPT-3.

Imagen first generates a small (64×64 pixel) image, then runs it through two "super-resolution" stages, bringing it up to 1024×1024. This isn't ordinary upscaling, though: building on the original image, the AI's super-resolution creates new details that are in harmony with the smaller version.
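In code form, the cascade amounts to three stages chained together. A minimal sketch, where `base_model`, `sr_256`, and `sr_1024` are hypothetical stand-ins for the three stages rather than real APIs, and the 256×256 intermediate is the step between the 64×64 base and 1024×1024 output:

```python
def cascade(prompt, base_model, sr_256, sr_1024):
    # Stage 1: a 64x64 image sampled directly from the text prompt.
    img_64 = base_model(prompt)
    # Stage 2: first super-resolution pass, 64x64 -> 256x256,
    # still conditioned on the prompt so invented detail stays on-topic.
    img_256 = sr_256(img_64, prompt)
    # Stage 3: second pass, 256x256 -> 1024x1024.
    img_1024 = sr_1024(img_256, prompt)
    return img_1024
```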

Take the dog on the bike above: in the first image, the dog's eye is only 3 pixels across. In the second image, it's 12 pixels. Where do those extra details come from? The AI knows what a dog's eye looks like, so it generates more detail as it draws.

Then the same thing happens again when the eye is drawn once more, this time 48 pixels across. Like many artists, the model starts with what amounts to a rough sketch, fills it in with a study, then goes over it on the final canvas.
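Those eye widths follow directly from the resolutions involved: each super-resolution stage enlarges the image 4×, so a 3-pixel feature grows to 12 and then 48 pixels. A trivial check:

```python
eye_px = 3
for res_in, res_out in [(64, 256), (256, 1024)]:
    eye_px *= res_out // res_in  # each stage is a 4x upsample
    print(f"{res_in} -> {res_out}: eye is now {eye_px}px across")
# 64 -> 256: eye is now 12px across
# 256 -> 1024: eye is now 48px across
```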

This is not without precedent; in fact, artists working with AI models already use this technique to create pieces far larger than what the AI can handle in a single pass. If you divide a canvas into tiles and super-resolve each one separately, you end up with something much bigger and more detailed, and you can even do it repeatedly.
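A naive version of that tiling trick is easy to sketch. The code below assumes a hypothetical `sr_model` that upscales any patch by a fixed factor; real tools additionally overlap and blend tiles to hide the seams:

```python
import numpy as np

def upscale_tiled(image, sr_model, tile=256, scale=4):
    # Allocate the enlarged canvas.
    h, w, c = image.shape
    out = np.zeros((h * scale, w * scale, c), dtype=image.dtype)
    # Super-resolve each tile independently and paste it back in place.
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            ph, pw = patch.shape[:2]
            out[y * scale:(y + ph) * scale,
                x * scale:(x + pw) * scale] = sr_model(patch)
    return out
```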

Google's researchers say Imagen's advances come in several areas. Existing text models can be used for the text-encoding portion, and, they claim, their quality matters more to the final result than simply boosting visual fidelity. That makes intuitive sense: a detailed picture of gibberish is definitely worse than a slightly less detailed picture of exactly what you asked for.

For instance, the paper describing Imagen compares its results to DALL-E 2's on the prompt "a panda making latte art." In all of DALL-E 2's images, the latte art depicts a panda; in most of Imagen's, a panda is actually making the art.

In Google's tests, Imagen came out ahead in human evaluations, both for accuracy and for fidelity. This is obviously fairly subjective, but even matching the perceived quality of DALL-E 2, which until now has been considered a giant leap ahead of everything else, is quite remarkable.

That said, OpenAI is a step or two ahead of Google in a couple of ways. DALL-E 2 is more than a research paper: it's in private beta, and people are using it just as they used its predecessor and GPT-2 and GPT-3. Ironically, the company with "open" in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to attempt it.

That's evident in the choice DALL-E 2's researchers made to curate the training dataset ahead of time and remove anything that might violate their own guidelines; the model couldn't make NSFW imagery even if it tried. Google's team, however, used some large datasets known to include inappropriate material. In an insightful section of the Imagen site describing "limitations and societal impact," the researchers write:

"Downstream applications of text-to-image models are varied and may impact society in complex ways. The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo.

The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset, which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place."

While some might scoff at this, saying Google is afraid its AI might not be sufficiently politically correct, that's an uncharitable and shortsighted view. An AI model is only as good as the data it's trained on, and not every team can spend the time and effort it takes to remove the really awful stuff these scrapers pick up as they assemble datasets of millions of images or billions of words.

Biases like these are meant to surface during research, which exposes how the systems work and provides an unfettered testing ground for identifying these and other limitations.

While dismantling systemic bias is a lifelong project for many people, it's easier for an AI, whose creators can simply remove the content that caused it to misbehave in the first place. Perhaps one day there will be a need for an AI to write in the style of a racist, sexist pundit from the 1950s, but for now the benefits of including that data are too small and the risks too large.

Either way, Imagen, like other technologies of its kind, is clearly still experimental, and it's not ready to be deployed in anything but a strictly human-supervised way.
