Adobe’s Magic Fixup is an AI Cut-and-Paste Photo Editor Trained on Videos

Aug 22, 2024

Jeremy Gray

Two images of the same male lion in a grassy savanna. The left image shows the lion partially pixelated with a zebra mane, while the right image shows the lion's natural mane. An arrow and a wand icon between the images indicate transformation or editing.

Adobe routinely releases new artificial intelligence features in its software, but the company’s research division works on the underlying technology long before related features make it into consumer-facing products. Adobe Research’s latest creation is Magic Fixup, which automates complex image editing tasks.

Photo editing tools, including AI ones, are typically trained on still photos. Adobe’s engineers believe such tools could perform better when trained on video content instead.

“Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions,” the researchers explain on a GitHub page for Magic Fixup.

As VentureBeat writes in its coverage, this “novel method” enables the new AI technology to better understand how “objects and scenes change under varying conditions of light, perspective, and motion.”
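To make the idea of video as supervision more concrete, here is a minimal sketch of how pairs of frames might be drawn from a single clip, so that one frame can act as the "before" and a later frame as the "after." The function name, its parameters, and the sampling scheme are illustrative assumptions, not code from the Magic Fixup project.

```python
import random

def sample_frame_pairs(num_frames, num_pairs, max_gap, seed=0):
    """Pick (earlier, later) frame indices from one video clip.

    Hypothetical helper: each pair is two frames of the same scene, a short
    time apart, so the later frame shows how the scene really changed.
    Assumes num_frames > max_gap >= 1.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pairs = []
    for _ in range(num_pairs):
        gap = rng.randint(1, max_gap)                # frames between the two picks
        start = rng.randint(0, num_frames - 1 - gap) # keep the later frame in range
        pairs.append((start, start + gap))
    return pairs

# Example: a 120-frame clip, 5 training pairs, at most 30 frames apart.
pairs = sample_frame_pairs(num_frames=120, num_pairs=5, max_gap=30)
```

Because both frames come from the same scene, any difference between them in lighting, pose, or viewpoint is something that actually happened in front of the camera, which is the kind of signal the researchers describe.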

A three-column, three-row image shows comparisons of an original scene, user edits with object removal, and results using a 'Magic Eraser' feature. Scenes include a colorful room, a building with pipes, and a group of oranges. The 'Magic Eraser' blends edits seamlessly. — ‘Spatial recomposition results: By spatially rearranging the scene in a coarse way, we can quickly clean up the edit and make it photorealistic through Magic Fixup, fixing up the global illumination, connecting edited pieces together, and addressing moving objects to different regions of focus.’ | Credit: Adobe Research, University of Maryland

The team calls image editing a “labor-intensive process.” It claims that while human editors can easily rearrange parts of an image to compose a new one, edits can look unrealistic, especially when the object has been moved to an area where the prevailing lighting conditions on that object no longer make sense. Suppose a person takes a picture of a home interior being side-lit by a window. If they move a piece of furniture from the brighter side to a darker area of the room, the object’s lighting will not make sense in its new location.

A comparison of image editing techniques featuring two sets of images. Top row: a coastal hill with a statue. Bottom row: a palace and garden scene with a fountain. Columns: Reference image, User edit, MagicFixup (ours), SDEdit, and ZoomShop (for reference only). — ‘Perspective editing: Inspired by ZoomShop (Liu et al. 2022), we edit the scene perspective by rendering regions at different depths with different focal lengths. To clean up the editing, ZoomShop required as many as 4 hours of manual editing. However, with Magic Fixup, this can be done in less than 5 seconds! The ZoomShop outputs are not directly comparable, since they do not use the coarse edit we create, but they are results taken directly from the ZoomShop paper for reference.’ | Credit: Adobe Research, University of Maryland

It is one thing to move objects around in an image, but another to make them look like they fit in the new spot. That is where Magic Fixup hopes to shine.

The team has built a user interface for its tool that can live within a developer version of Photoshop. The user can essentially cut out the object they want to change and paste it where they want it to go. Magic Fixup interprets the user’s cut-and-paste operation to recreate the image.
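As a rough illustration of what a coarse cut-and-paste edit looks like before any AI cleanup, the sketch below moves a rectangular patch inside a tiny grayscale "image" stored as a list of rows. The function and its parameters are hypothetical and are not part of Adobe's tool; they only show the kind of rough input the article describes.

```python
from copy import deepcopy

def cut_and_paste(image, box, dest, fill=0):
    """Move the pixels inside `box` so their top-left corner lands at `dest`.

    image: list of rows of pixel values (a toy grayscale image)
    box:   (top, left, height, width) of the region to cut
    dest:  (top, left) of where to paste it
    fill:  value left behind in the hole the cut creates
    """
    top, left, h, w = box
    dtop, dleft = dest
    out = deepcopy(image)  # leave the original image untouched
    # Copy the patch first so overlapping source and destination are safe.
    patch = [row[left:left + w] for row in image[top:top + h]]
    # Cut: blank the source region.
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = fill
    # Paste: write the patch at the destination, clipping at the border.
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            rr, cc = dtop + r, dleft + c
            if 0 <= rr < len(out) and 0 <= cc < len(out[0]):
                out[rr][cc] = value
    return out

image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
coarse_edit = cut_and_paste(image, box=(0, 0, 2, 2), dest=(1, 2))
# coarse_edit is now [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
```

The pasted patch keeps its original pixel values and the cut leaves a blank hole, which is why an edit like this looks unrealistic on its own and needs a second step to fix lighting and fill in gaps.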

A comparative image showing different transformations using a teddy bear, panda, and car as examples. The transformations are organized in columns: "Reference image," "User edit," "Ours (sample 1)," "Ours (sample 2)," and "Baseline (SDEdit)". — ‘Colorization: Although the model was not trained to see any color editing, we find that by coloring objects with partial opacity brush, we can get a much cleaner edit through Magic Fixup. Here we show multiple samples to highlight the diverse generation we get from Magic Fixup.’ | Credit: Adobe Research, University of Maryland

The researchers explain that videos offer helpful information about how objects exist in the real world, including how they respond to changing conditions. Precisely how it works is beyond the scope of this article, but in basic terms, the Magic Fixup pipeline relies on two different diffusion models that operate simultaneously.

One model handles the reference image, pulling out the required detail for recreation, while the other synthesizes the user’s coarse edit and the details from the reference image.
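The data flow between the two models can be pictured with a small sketch. Everything here is a stand-in: `run_fixup`, `toy_extract`, and `toy_synthesize` are hypothetical names, the "images" are short lists of numbers, and the loop starts from the coarse edit purely for simplicity. It is not Adobe's implementation, only an outline of one branch reading the reference while the other refines the edit.

```python
from typing import Callable, List

Image = List[float]  # toy stand-in for an image: a flat list of pixel values

def run_fixup(
    reference: Image,
    coarse_edit: Image,
    extract_details: Callable[[Image, int], Image],
    synthesize: Callable[[Image, Image, Image, int], Image],
    steps: int = 4,
) -> Image:
    """Run both branches together for a fixed number of refinement steps."""
    current = list(coarse_edit)  # simplification: begin from the user's edit
    for step in reversed(range(steps)):
        # Branch 1: pull detail out of the original, unedited photo.
        details = extract_details(reference, step)
        # Branch 2: combine the coarse layout with that detail.
        current = synthesize(current, coarse_edit, details, step)
    return current

def toy_extract(reference: Image, step: int) -> Image:
    # Placeholder: a real model would return learned features, not pixels.
    return list(reference)

def toy_synthesize(current: Image, coarse_edit: Image, details: Image, step: int) -> Image:
    # Placeholder: move each pixel halfway toward the reference detail.
    return [0.5 * c + 0.5 * d for c, d in zip(current, details)]

result = run_fixup([1.0, 1.0], [0.0, 1.0], toy_extract, toy_synthesize, steps=2)
# result is [0.75, 1.0]
```

The point of the split is that the layout comes from the user's edit while the fine detail comes from the original photo, so the output can follow the new composition without losing what the scene actually looked like.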

A two-row image comparing different image editing techniques. The top row shows a lion overlay on a landscape split into five stages: reference image, user edit, Magic Fixup, DragDiffusion, and MotionGuidance. The bottom row shows similar stages with a galloping horse. — ‘Comparison with reposing methods: By augmenting our user interface to keep track of dense correspondences, we can generate dragging key-handles for DragDiffusion (Shi et al. 2024), and the dense flow needed for Motion Guidance (Geng et al. 2024). However, we find that both of these SoTA methods are unable to handle complex reposing scenarios.’ | Credit: Adobe Research, University of Maryland

The tool is still in early development, but the results are very impressive. Experimental data shows that most users prefer Magic Fixup’s results to those of contemporary competing models. There are limitations, as the team admits its model has trouble with hands, faces, and small objects, but the results show that, at least in some situations, Magic Fixup can perform as well as a human editor in a fraction of the time.

The complete research is detailed in the paper “Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos.” The research was conducted by Hadi AlZayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, and Michael Gharbi, who work for Adobe and the University of Maryland.

Image credits: Adobe Research, University of Maryland. AlZayer, Xia, Zhang, Shechtman, Huang, Gharbi
