Midjourney Flips the Formula with New Image-to-Text Generator

Midjourney has announced a new “/describe” command that allows users to leverage the powerful artificial intelligence (AI) platform to transform images into words, upending Midjourney’s typical procedure of converting text to images.

Paul DelSignore describes the feature on Medium, writing that “describe” has numerous significant benefits for a wide range of use cases.


One of the best aspects of the describe feature is that it should improve accessibility. For people with visual impairments, navigating the web can be challenging. Alt text that describes images makes pages more accessible, but writing it manually is time-consuming, and Midjourney’s describe functionality may help overcome this hurdle.

Improved search functionality is beneficial to nearly every internet user. Search engines can index images more effectively when they include better and more plentiful descriptions.

DelSignore also highlights the importance of captions, as detailed captions help explain images and provide more clarity to viewers.

Image-to-text generation creates an interesting feedback loop with Midjourney’s text-to-image system. While Midjourney users can already generate similar images based on a selection, image-to-text tools may make it easier to develop alternate and potentially more fruitful descriptions for the text-to-image generator.


As with its text-to-image generator, Midjourney currently creates four different text descriptions of an uploaded image. It’s also possible to generate new variations based on a selected description. To upload a photo, users type “/describe” into the text field, and a drag-and-drop upload field appears.

Users can then select one of the generated descriptions and “remix” the uploaded image using the new text prompt. The user can also edit the text prompt, adding a new element of control to the creative process.

Testing Midjourney’s new /describe feature

PetaPixel tested the feature, first using a portrait captured by editor-in-chief Jaron Schneider.

Jaron Schneider Tamron 35mm
Photo by Jaron Schneider

Midjourney’s four generated descriptions are of varying quality.

  • young woman in polka dot dress standing on top of hill, in the style of pensive portraiture, lens flares, Voigtlander Heliar 15mm f/4.5, troubadour style, close up, calarts, simple
  • a young woman wearing a black and white polka dot dress standing, in the style of hazy landscapes, pensive portraiture, sun-soaked colors, candid portraiture, tumblewave, troubadour style, close up
  • a young woman in black and white checkered dress standing on slope of hill, in the style of anamorphic lens flare, polka dots, california plein air, pensive stillness, sun-soaked colors, candid portraits, delicate
  • leigh wyatt | san francisco weddings|daughtersmaidsoflake benson | san francisco, in the style of neo-pop sensibility, gentle lyricism

The first two descriptions are pretty good, especially the second one. It’s interesting that Midjourney described a specific Voigtlander 15mm prime lens, though, for the record, the image was shot with a Tamron 35mm f/1.8 prime. Using the second description to generate a remix leads to pretty impressive results.

Midjourney v5
Four AI-generated images that Midjourney v5 made using the “a young woman wearing a black and white polka dot dress standing, in the style of hazy landscapes, pensive portraiture, sun-soaked colors, candid portraiture, tumblewave, troubadour style, close up” description of the original portrait.

Using another image by Schneider, this time a landscape photo of Mono Lake in California, Midjourney again generates mostly useful text descriptions, albeit with the wrong location in two of the four, which place Mono Lake in Utah.

Jaron Schneider Mono Lake
Photo by Jaron Schneider

  • monolake, las vegas, utah, united states of america near crystal, in the style of shot on 70mm, mikalojus konstantinas ciurlionis, post processing, 32k uhd, antoni gaudí, hazy landscapes, fenghua zhong
  • mono lake at sunset after a rain, in the style of focus stacking, light sky-blue and bronze, 32k uhd, national geographic photo, stock photo, dansaekhwa
  • mono lake, utah, in the style of 32k uhd, balanced symmetry, american tonalist, hazy, dreamlike quality, nikon d850, fenghua zhong
  • mono lake, california, sunrise photograph 1, in the style of 32k uhd, isolated landscapes, low depth of field

Using the third description as a remix prompt, Midjourney delivered four very realistic new images.

Midjourney v5
Four AI-generated images based upon “mono lake, utah, in the style of 32k uhd, balanced symmetry, american tonalist, hazy, dreamlike quality, nikon d850, fenghua zhong”

Midjourney’s /describe tool is intriguing, even in its early state. The tool should help creators write more detailed alt text and captions, and even generate new AI artwork. While some parts of the descriptions are puzzling, to say the least, they show promise.
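
As a rough illustration of the alt-text use case, the sketch below takes one of the descriptions generated above, keeps only the part before “, in the style of” (the style modifiers are useful as a prompt but add little for a screen reader), and places it in an HTML image tag. The helper names `subject_of` and `img_tag` and the file name `portrait.jpg` are hypothetical, and the split assumes the description follows the “subject, in the style of ...” pattern seen in most of the examples here. Midjourney itself does not provide this step, and any generated description should be reviewed by a person before publishing.

```python
from html import escape

# Assumption: the subject precedes this marker, as in most of the
# descriptions shown in this article.
STYLE_MARKER = ", in the style of "


def subject_of(description: str) -> str:
    """Return the text before the style modifiers, or the whole string
    if the marker is absent."""
    subject, _, _ = description.partition(STYLE_MARKER)
    return subject.strip()


def img_tag(src: str, description: str) -> str:
    """Build an <img> tag whose alt attribute is the escaped subject."""
    alt = escape(subject_of(description), quote=True)
    return f'<img src="{escape(src, quote=True)}" alt="{alt}">'


description = (
    "a young woman wearing a black and white polka dot dress standing, "
    "in the style of hazy landscapes, pensive portraiture, sun-soaked colors, "
    "candid portraiture, tumblewave, troubadour style, close up"
)

print(img_tag("portrait.jpg", description))
# <img src="portrait.jpg" alt="a young woman wearing a black and white polka dot dress standing">
```

Escaping the text matters because a description can contain quotation marks or angle brackets that would otherwise break the attribute.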


Image credits: Jaron Schneider and Midjourney
