Image segmentation — the process of separating an image into meaningful regions — is one of the most fundamental tasks in computer vision and image editing. Every time you select an object in Photoshop, remove a background from a photo, or use a portrait mode camera feature, segmentation is at work.

But the technology behind segmentation has changed dramatically over the decades. What once required painstaking manual work now happens with a single click, thanks to AI models like SAM2. Let’s trace the journey from the earliest selection tools to today’s foundation models.

The Manual Era: Pixel by Pixel

In the earliest days of digital image editing, segmentation was entirely manual. If you wanted to isolate an object from its background, you had to define the boundary yourself, pixel by pixel.

The Lasso Tool

One of the earliest selection tools in image editors was the freehand lasso. A fixture of Adobe Photoshop since its first releases in the early 1990s, the lasso let you draw a selection boundary by tracing around an object with your mouse.

The lasso was simple and intuitive — just draw around what you want. But it was also painfully imprecise. Tracing around complex shapes like hair, fur, or tree branches was nearly impossible to do cleanly. A shaky hand or a momentary lapse in concentration meant starting over. Professional retouchers would spend hours carefully tracing selections around a single subject.

The polygonal lasso improved things slightly by letting you click to place straight-line segments, building up a selection point by point. This was more precise for objects with angular edges but even worse for organic, curved shapes.

The Magic Wand

The magic wand tool represented the first attempt at semi-automated selection. Instead of tracing a boundary manually, you clicked on a pixel, and the tool selected all contiguous pixels of a similar color. Adjust the “tolerance” slider and you could select larger or smaller regions.

The magic wand was revolutionary for simple cases — selecting a solid blue sky, for instance, worked beautifully with a single click. But it fell apart with textured surfaces, gradients, or anything without uniform color. Selecting a person’s face against a similarly toned background was an exercise in frustration, requiring dozens of clicks with carefully adjusted tolerance settings.
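
At its core, the magic wand is a flood fill with a tolerance threshold: starting from the clicked pixel, it spreads to contiguous neighbors whose values are close enough to the clicked value. A toy sketch of that idea on a grayscale grid (the function name and the tiny image are illustrative, not Photoshop's actual implementation):

```python
from collections import deque

def magic_wand(image, seed_row, seed_col, tolerance):
    """Select all contiguous pixels within `tolerance` of the clicked pixel.

    `image` is a 2D list of grayscale values; returns the set of
    (row, col) coordinates in the selection.
    """
    rows, cols = len(image), len(image[0])
    target = image[seed_row][seed_col]
    selected = {(seed_row, seed_col)}
    queue = deque([(seed_row, seed_col)])
    while queue:
        r, c = queue.popleft()
        # Spread to 4-connected neighbors, as in a basic flood fill
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in selected
                    and abs(image[nr][nc] - target) <= tolerance):
                selected.add((nr, nc))
                queue.append((nr, nc))
    return selected

# A tiny "bright sky over dark ground" image: clicking the sky
# selects the top two rows and stops at the contrast boundary.
img = [
    [200, 205, 210],
    [198, 202, 208],
    [ 40,  35,  45],
]
sky = magic_wand(img, 0, 0, tolerance=20)
```

Raising `tolerance` makes the selection spill further — exactly the behavior of the tolerance slider described above.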

Refinement Tools

To compensate for the limitations of basic selection tools, editors introduced refinement features. Feathering softened selection edges to reduce harsh boundaries. Grow and Similar expanded selections to adjacent or matching pixels. Quick Mask mode let you paint selections with a brush, combining the precision of manual work with the flexibility of painting.

These tools helped, but they were fundamentally band-aids on a manual process. Segmentation remained one of the most time-consuming tasks in image editing.

The Edge Detection Era: Let the Computer Help

The next major step was asking the computer to detect object boundaries automatically, at least partially.

Classical Edge Detection

Algorithms like the Sobel filter, Canny edge detector, and Laplacian of Gaussian could identify pixels where sharp intensity changes occurred — which often corresponded to object boundaries. These mathematical techniques, developed from the late 1960s through the 1980s, were among the first computer vision algorithms to gain practical use.

Edge detection could find boundaries that the human eye could see, but it produced raw edges without understanding what those edges meant. A Canny edge detector applied to a photo of a cat would highlight the cat’s silhouette, but also every whisker, every fur texture, every shadow, and every pattern in the background. Converting raw edges into a meaningful object selection required significant additional processing.
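
The Sobel filter mentioned above amounts to convolving the image with two small 3x3 kernels — one sensitive to horizontal intensity changes, one to vertical — and combining the responses into a gradient magnitude. A minimal pure-Python sketch (a real implementation would use vectorized convolution):

```python
# Standard Sobel kernels for horizontal (x) and vertical (y) gradients
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(image):
    """Gradient magnitude at each interior pixel of a 2D grayscale list."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    v = image[r + dr][c + dc]
                    gx += SOBEL_X[dr + 1][dc + 1] * v
                    gy += SOBEL_Y[dr + 1][dc + 1] * v
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out

# A hard vertical edge: dark on the left, bright on the right.
img = [[0, 0, 255, 255]] * 4
mag = sobel_magnitude(img)
```

The magnitude spikes along the dark/bright boundary and is zero in the flat regions — which is precisely the problem described next: the filter reports *every* sharp change, with no notion of which ones form an object.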

Active Contours (Snakes)

In 1988, Kass, Witkin, and Terzopoulos introduced active contours, playfully called “snakes.” A snake was a curve that you placed roughly around an object, and it would automatically shrink or expand to snap to nearby edges. The algorithm balanced three forces: internal forces kept the curve smooth, image forces attracted it to edges, and external forces let the user guide it.
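
The balance of forces described above is written in the original paper as an energy functional over the contour, parameterized as a curve v(s), which the algorithm minimizes iteratively:

```latex
E_{\text{snake}} = \int_0^1 \Big[
    \underbrace{\tfrac{1}{2}\big(\alpha\,|v'(s)|^2 + \beta\,|v''(s)|^2\big)}_{\text{internal: smoothness}}
  \;+\; \underbrace{E_{\text{image}}(v(s))}_{\text{attraction to edges}}
  \;+\; \underbrace{E_{\text{con}}(v(s))}_{\text{user constraints}}
\Big]\, ds
```

Here α penalizes stretching and β penalizes bending (the internal forces), while E_image is typically the negative gradient magnitude, so minimizing it pulls the curve onto strong edges.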

Snakes were a significant conceptual advance — the computer was now actively helping to find the boundary. Adobe Photoshop’s Magnetic Lasso tool, introduced in version 5.0 (1998), used similar principles. As you moved your cursor near an edge, the selection line would snap to it, dramatically speeding up manual tracing.

Graph-Based Methods

The early 2000s saw the rise of graph cut methods, most notably GrabCut (2004). The idea was to model the image as a graph where pixels were nodes and edges represented similarity between neighboring pixels. By marking some pixels as definitely foreground and others as definitely background, the algorithm could compute an optimal cut through the graph that separated the object from its surroundings.

GrabCut was remarkably effective for its time. You drew a rough rectangle around an object, and the algorithm figured out the precise boundary. It handled gradual color transitions and complex textures far better than the magic wand ever could. This technology found its way into tools like Photoshop’s Quick Selection and Select and Mask features.

The Deep Learning Revolution

Everything changed when deep learning entered the picture. Starting around 2015, neural networks began outperforming all traditional approaches to image segmentation.

Fully Convolutional Networks (FCN)

In 2015, Long, Shelhamer, and Darrell published their landmark paper on Fully Convolutional Networks for semantic segmentation. Instead of classifying entire images, FCNs could classify every pixel in an image, assigning each one to a category (person, car, sky, etc.).

With FCNs, a computer could label an entire image into meaningful regions without any human guidance, at a quality earlier automatic methods never approached. The results weren’t perfect — boundaries were often rough and small objects were frequently missed — but the paradigm shift was profound. Segmentation had moved from a human-guided process to a largely automated one.
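
Concretely, a semantic-segmentation head outputs one score per class at every pixel, and the final label map is just a per-pixel argmax over those scores. A sketch with illustrative 2x2 score maps (the class names and numbers are made up for the example):

```python
def per_pixel_labels(logits):
    """Turn a class-score volume into a label map.

    `logits[k][r][c]` is the network's score for class k at pixel (r, c);
    each pixel is assigned its highest-scoring class, which is all a
    semantic-segmentation head does after the convolutions.
    """
    n_classes = len(logits)
    rows, cols = len(logits[0]), len(logits[0][0])
    return [
        [max(range(n_classes), key=lambda k: logits[k][r][c]) for c in range(cols)]
        for r in range(rows)
    ]

# Two classes ("sky" = 0, "person" = 1) over a tiny 2x2 image.
scores = [
    [[2.0, 2.0], [0.5, 0.1]],   # class 0 score map
    [[0.1, 0.3], [1.9, 2.4]],   # class 1 score map
]
labels = per_pixel_labels(scores)
```

Note what this representation cannot express: every "person" pixel gets the same label, which is exactly the limitation that instance segmentation (below, in the document's own progression) addresses.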

U-Net: Precision for Medical Imaging

Published in 2015, U-Net introduced a symmetric encoder-decoder architecture with skip connections that preserved fine spatial details through the network. Originally designed for medical image segmentation (where precise boundaries are critical for diagnosing tumors or measuring organs), U-Net achieved remarkable accuracy even with limited training data.

U-Net’s architecture became one of the most influential designs in deep learning. Its core idea — encoding an image down to a compact representation and then decoding it back to full resolution with help from skip connections — appears in countless segmentation models to this day.
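
The encode-then-decode-with-skips idea can be seen purely in terms of array shapes. A NumPy sketch, where average pooling and nearest-neighbor upsampling stand in for the paper's learned convolutions (channel counts and helper names are illustrative, not U-Net's exact architecture):

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling stands in for the encoder's downsampling path
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    # Nearest-neighbor upsampling stands in for the decoder's up-convolution
    return x.repeat(2, axis=1).repeat(2, axis=2)

# One encoder level, a bottleneck, and one decoder level with a skip connection.
features = np.random.rand(8, 64, 64)          # (channels, height, width)
skip = features                                # high-resolution features, saved
bottleneck = downsample(features)              # (8, 32, 32): coarse, abstract
up = upsample(bottleneck)                      # (8, 64, 64): back to full size
fused = np.concatenate([up, skip], axis=0)     # (16, 64, 64): skip restores detail
```

The concatenation is the skip connection: the decoder sees both the coarse, semantically rich features and the fine spatial detail the pooling threw away, which is why U-Net's boundaries are so much sharper than a plain encoder-decoder's.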

Mask R-CNN: Instance-Level Understanding

While semantic segmentation assigns a class to every pixel, it doesn’t distinguish between individual objects of the same class. In a photo of three people, semantic segmentation labels all person pixels the same way. Mask R-CNN (2017) by He et al. solved this with instance segmentation — identifying and separately masking each individual object.

Mask R-CNN combined object detection (finding bounding boxes around objects) with pixel-level segmentation within each box. For the first time, a computer could not only say “these pixels are a person” but “these pixels are Person #1, those pixels are Person #2, and those over there are Person #3.”

This capability unlocked applications from autonomous driving (tracking individual cars and pedestrians) to photo editing (selecting a specific person in a group photo). But Mask R-CNN and similar models had a significant limitation: they could only segment object categories they were trained on. A model trained to segment people and cars couldn’t segment an unusual object like a rubber duck or a specific piece of furniture without retraining.

The Foundation Model Era: Segment Anything

The most recent and arguably most transformative development came with foundation models — large AI models trained on massive datasets that can generalize to virtually any task.

SAM: Segment Anything Model (2023)

Meta’s Segment Anything Model (SAM), released in 2023, represented a fundamental shift in segmentation technology. Trained on over 1 billion masks across 11 million images, SAM could segment any object in any image — not just the categories it was explicitly trained on.

SAM’s key innovation was its promptable interface. Instead of requiring class labels or extensive training data, SAM accepts simple prompts: click on an object, draw a box around it, or supply a rough mask (the paper also explored text prompts). The model then generates a high-quality segmentation mask.

This made professional-quality segmentation accessible to everyone. No training data needed. No category limitations. No technical expertise required. Just click on what you want, and SAM figures out the rest.

SAM2: Faster, Better, and Video-Ready (2024)

SAM2 improved on the original with better accuracy, faster inference, and the ability to segment objects in video (tracking them across frames). The architecture was refined to be more efficient, making it practical to run on consumer hardware rather than just research servers.

SAM2 handles ambiguous cases more gracefully — when you click on an object that could be interpreted at multiple scales (is it the eye, the face, or the whole person?), SAM2 generates multiple candidate masks and lets you choose. It also handles challenging boundaries like hair, transparent objects, and objects with holes more accurately than its predecessor.

SAM2 in Practice: WobblePic

WobblePic brings SAM2’s capabilities to a unique application: physics-based image interaction. When you click on an object in WobblePic, SAM2 generates a segmentation mask that defines the object’s boundary. WobblePic then creates an independent physics mesh for that region, allowing it to wobble, jiggle, and bounce independently from the rest of the image.

This represents the full arc of segmentation history compressed into a single click. What once took a professional retoucher hours of painstaking lasso work — isolating an object from its background with pixel-perfect precision — now happens in seconds, powered by a foundation model running locally on your GPU through ONNX Runtime and DirectML.

The tutorial page walks through the segmentation process step by step, and the gallery showcases the results. If you want to experience the state of the art in image segmentation yourself, download WobblePic and try clicking on any object in any image. It’s a vivid demonstration of just how far this technology has come.

What’s Next?

Segmentation technology continues to advance rapidly. Current research explores real-time video segmentation at higher resolutions, 3D segmentation from single images, segmentation guided by natural language descriptions, and integration with generative AI for seamless object manipulation.

The trajectory is clear: segmentation is becoming faster, more accurate, and more accessible with each generation. What started as manual pixel tracing in the 1990s has evolved into AI that can pick out objects about as reliably as a person can — and in some cases, more so. The gap between “I want to select this object” and “it’s selected” has shrunk from hours to seconds to a single click.