Meta Launches ‘Human-Like’ AI Image Generation Model


Meta has announced a “human-like” artificial intelligence (AI) image generation model, I-JEPA, and has made components of it available to researchers. The company says the model is more effective than generative AI models because it relies on background knowledge and context, mimicking typical human cognition.

Reuters reports that I-JEPA “uses background knowledge about the world to fill in missing pieces of images, rather than looking at nearby pixels like other generative AI models.”

In a research post, Meta describes how I-JEPA builds on Chief AI Scientist Yann LeCun’s vision for “more human-like AI.”

“We’re excited to introduce the first AI model based on a key component of LeCun’s vision. This model, the Image Joint Embedding Predictive Architecture (I-JEPA), learns by creating an internal model of the outside world, which compares abstract representations of images (rather than comparing the pixels themselves). I-JEPA delivers strong performance on multiple computer vision tasks, and it’s much more computationally efficient than other widely used computer vision models,” Meta explains.

“The Image-based Joint-Embedding Predictive Architecture (I-JEPA) uses a single context block to predict the representations of various target blocks originating from the same image. The context encoder is a Vision Transformer (ViT) that only processes the visible context patches. The predictor is a narrow ViT that takes the context encoder output and predicts the representations of a target block at a specific location, conditioned on positional tokens of the target (shown in color). The target representations correspond to the outputs of the target-encoder, the weights of which are updated at each iteration via an exponential moving average of the context encoder weights.” | Credit: Meta
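For readers who want a more concrete picture of that description, below is a minimal, hypothetical sketch of the training step it outlines. It is not Meta’s released code: simple linear layers stand in for the ViT encoders, the context is pooled to a single vector, and masking and positional conditioning of the target block are omitted. What it does show is the three-part structure Meta describes: a context encoder for the visible patches, a predictor for the target block’s representation, and a target encoder updated as an exponential moving average (EMA) of the context encoder’s weights.

```python
# Minimal, hypothetical sketch of the training step described in the caption above.
# Linear layers stand in for the ViT context encoder and the narrow ViT predictor;
# masking logic and positional conditioning of the target block are omitted.
import copy

import torch
import torch.nn.functional as F

context_encoder = torch.nn.Linear(768, 768)      # stand-in for the ViT context encoder
predictor = torch.nn.Linear(768, 768)            # stand-in for the narrow ViT predictor
target_encoder = copy.deepcopy(context_encoder)  # same weights at the start, EMA-updated after
for p in target_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def train_step(context_patches, target_patches, ema_momentum=0.996):
    # Encode only the visible context patches, then predict the hidden target
    # block's representation from that context (pooled to one vector for simplicity).
    predicted = predictor(context_encoder(context_patches).mean(dim=0))

    # Target representations come from the EMA target encoder; no gradients flow here.
    with torch.no_grad():
        target = target_encoder(target_patches).mean(dim=0)

    # The loss is computed in representation space, not pixel space.
    loss = F.smooth_l1_loss(predicted, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Update the target encoder as an exponential moving average of the context encoder.
    with torch.no_grad():
        for t, c in zip(target_encoder.parameters(), context_encoder.parameters()):
            t.mul_(ema_momentum).add_(c, alpha=1.0 - ema_momentum)
    return loss.item()

# Example call with random embeddings standing in for context and target patches.
print(train_step(torch.rand(16, 768), torch.rand(4, 768)))
```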

Meta also highlights I-JEPA’s training efficiency, saying that a 632-million-parameter Vision Transformer (ViT) model can be trained with 16 A100 GPUs in under 72 hours, which the company claims is two to 10 times faster than other methods while achieving lower error rates.

“Our work on I-JEPA (and Joint Embedding Predictive Architecture (JEPA) models more generally) is grounded in the fact that humans learn an enormous amount of background knowledge about the world just by passively observing it. It has been hypothesized that this common sense information is key to enable intelligent behavior such as sample-efficient acquisition of new concepts, grounding, and planning,” Meta says, speaking about the importance of background knowledge within human intelligence.

To effectively incorporate human-like learning into AI algorithms, Meta says that systems must encode common-sense background knowledge into a digital representation that an algorithm can access later. Further, it says that a system must learn these representations in a self-supervised manner, that is, from unlabeled data rather than from manually labeled datasets.

“Illustrating how the predictor learns to model the semantics of the world. For each image, the portion outside of the blue box is encoded and given to the predictor as context. The predictor outputs a representation for what it expects to be in the region within the blue box. To visualize the prediction, we train a generative model that produces a sketch of the contents represented by the predictor output, and we show a sample output within the blue box. Clearly the predictor recognizes the semantics of what parts should be filled in (the top of the dog’s head, the bird’s leg, the wolf’s legs, the other side of the building).” | Credit: Meta

“At a high level, the JEPA aims to predict the representation of part of an input (such as an image or piece of text) from the representation of other parts of the same input. Because it does not involve collapsing representations from multiple views/augmentations of an image to a single point, the hope is for the JEPA to avoid the biases and issues associated with another widely used method called invariance-based pretraining,” says Meta.
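To make that contrast concrete, here is a rough, hypothetical sketch, not drawn from Meta’s code and using stand-in linear layers instead of real encoders: an invariance-based objective pulls the embeddings of two augmented views of the same image toward a single point, while a JEPA-style objective predicts the representation of one block of an image from the representation of another block of the same image.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(768, 256)    # stand-in for an image encoder
predictor = torch.nn.Linear(256, 256)  # stand-in for the narrow predictor

view_a = torch.rand(1, 768)         # one augmented view of an image (hypothetical features)
view_b = torch.rand(1, 768)         # a second augmented view of the same image
context_block = torch.rand(1, 768)  # visible context block from one image
target_block = torch.rand(1, 768)   # hidden target block from the same image

# Invariance-based pretraining: encode both views and push their embeddings toward the
# same point, which is where the collapse and bias issues Meta mentions can arise.
invariance_loss = F.mse_loss(encoder(view_a), encoder(view_b))

# JEPA-style objective: predict the target block's representation from the context
# block's representation instead of collapsing multiple views to a single point.
jepa_loss = F.mse_loss(predictor(encoder(context_block)), encoder(target_block).detach())

print(invariance_loss.item(), jepa_loss.item())
```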

Meta argues that by “predicting representations at a high level of abstraction rather than predicting pixel values directly,” I-JEPA can “avoid the limitations of generative approaches.”


Some of these limitations are well-documented, including the issues that generative models have with hands. Meta says that these issues arise because generative methods try to fill in “every bit of missing information,” leading to pixel errors humans wouldn’t make. Meta says generative methods focus on “irrelevant details” instead of “capturing high-level predictable concepts.”

“The idea behind I-JEPA is to predict missing information in an abstract representation that’s more akin to the general understanding people have. Compared to generative methods that predict in pixel/token space, I-JEPA uses abstract prediction targets for which unnecessary pixel-level details are potentially eliminated, thereby leading the model to learn more semantic features,” explains Meta.
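A simple way to see the difference Meta describes is in how the prediction error is measured. In the hedged sketch below (hypothetical tensors, not Meta’s implementation), a generative objective scores every pixel of the missing region, while an I-JEPA-style objective scores the gap between abstract feature vectors, so pixel-level detail drops out of the prediction target.

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors standing in for one missing image region.
true_pixels = torch.rand(1, 3, 32, 32)       # ground-truth pixels of the hidden block
generated_pixels = torch.rand(1, 3, 32, 32)  # a generative model's pixel-level prediction

predicted_features = torch.rand(1, 768)      # predictor output: an abstract representation
target_features = torch.rand(1, 768)         # target-encoder output for the same block

# Generative objective: every pixel of the missing region counts toward the error,
# including the "irrelevant details" Meta refers to.
pixel_loss = F.mse_loss(generated_pixels, true_pixels)

# I-JEPA-style objective: the error is measured between abstract representations,
# so unnecessary pixel-level detail is not part of the prediction target.
feature_loss = F.mse_loss(predicted_features, target_features)

print(pixel_loss.item(), feature_loss.item())
```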

While Meta’s JEPA models remain in development, the company has emphasized the importance of sharing components of its AI models with researchers, a move it believes will aid innovation. The full details of I-JEPA are outlined in a research paper that Meta will present at this week’s Conference on Computer Vision and Pattern Recognition (CVPR).


Image credits: Header photo licensed via Depositphotos.
