Virtual Try-On for Eyewear

Latent-Space Inpainting with Identity Preservation for Real-Time Eyewear Preview

Author: Alex Nix
Status: Published research


Abstract

Virtual try-on for eyewear occupies a uniquely difficult niche in computer vision. Unlike garments or footwear — where the item replaces a body region — glasses must coexist with the face: transparent lenses refract light, frames interact with skin tone and facial geometry, and the user's eyes must remain identifiable behind the lenses at all times. Any drift in eye color, position, or shape erodes user trust immediately.

This case study describes a production try-on system built on a ComfyUI pipeline combining LoRA fine-tuning for frame styles, ControlNet canny edge detection for facial structure preservation, and latent-space inpainting for seamless integration. The system addresses four technical challenges specific to eyewear: transparent lens rendering in latent space, eye identity preservation across generation, eyebrow masking to prevent frame deformation, and non-standardized input handling. The architecture evolved into two generalizable frameworks — the Lock Layer Pattern for identity preservation and Typed Reference Composition for multi-condition generation.


1. Problem

Why glasses are harder than clothes

Virtual try-on for fashion has matured rapidly — garment replacement, shoe fitting, and makeup transfer all benefit from a shared property: the generated item replaces a body region. The model can generate the entire region from scratch.

Glasses break this assumption. The frames sit on the face, but the eyes must remain the user's eyes. The lenses are transparent — they don't replace skin, they sit above it and refract light through it. This creates three interlocking problems:

  1. Transparent lens rendering. Latent diffusion models struggle with transparency. A glass lens in latent space tends toward either opaque (blocking the eye) or invisible (no lens rendered). The model has no native concept of "transparent material that refracts what's behind it."

  2. Eye identity preservation. The user's eye color, iris pattern, pupil size, and eye shape are identity signals. Any generation that alters these — even subtly — makes the try-on result feel wrong. Users don't articulate "my eye color drifted from hazel to brown"; they say "this doesn't look like me." Trust breaks instantly.

  3. Frame-face interaction. Glasses frames cast shadows on the nose bridge and cheeks, touch the skin at the temple arms, and their shape depends on face geometry. Generating frames without face-awareness produces floating rectangles.

The user-trust constraint

In virtual try-on, the user is evaluating a purchase decision. The output isn't art — it's a fidelity proxy. If the eyes look wrong, the user rejects the tool, not just the specific pair. This makes eye preservation a hard constraint, not a nice-to-have.


2. Approach

Pipeline architecture

The system is implemented as a ComfyUI graph with three core stages:

Glasses VTON pipeline: face detection → canny edge extraction → LoRA frame conditioning → ControlNet-guided inpainting → output

Stage 1 — Face analysis and mask generation. The input portrait is analyzed for facial landmark positions. A precise inpainting mask is generated covering the eye region — frames, lenses, and surrounding area — while excluding the eyebrows (a critical detail discussed in §3.3).

Stage 2: Structure conditioning via ControlNet. Canny edge detection extracts the face's structural edges: jawline, nose bridge, eye socket contour. ControlNet (canny model) receives these edges as a structural condition, ensuring the generation respects the underlying facial geometry. The canny map also carries the outline of the desired frame for the target style, so a single control image encodes both the facial structure and the frame placement.

Stage 3 — LoRA-conditioned inpainting. An inpainting diffusion model, conditioned on both the ControlNet structure map and a LoRA adapter trained on the target frame style, generates the glasses region. The LoRA encodes frame geometry, material, and color; ControlNet anchors it to the face; the inpainting model blends the result into the surrounding context.
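
The structure map from Stage 2 can be illustrated with a toy edge extractor. The sketch below is a simplified stand-in for Canny (Sobel gradient magnitude plus a single threshold, with no non-maximum suppression or hysteresis), written in plain Python so it runs without dependencies; a real graph would more likely use a Canny preprocessor node or OpenCV's `cv2.Canny`. The function names, the threshold, and the way the frame outline is merged into the face edges are assumptions for illustration, not the author's implementation.

```python
def sobel_edges(gray, threshold=100.0):
    """Binary edge map from a 2D list of grayscale values (simplified, not full Canny)."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):          # border pixels are left as 0
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def merge_edge_maps(face_edges, frame_edges):
    """Union of face edges and the target frame outline: the combined control image."""
    return [[max(a, b) for a, b in zip(face_row, frame_row)]
            for face_row, frame_row in zip(face_edges, frame_edges)]


# Toy 6x6 image: dark left half, bright right half, so one vertical edge.
image = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
face_edges = sobel_edges(image)

frame_outline = [[0] * 6 for _ in range(6)]
frame_outline[2][1] = 1  # hypothetical frame-outline pixel
control_map = merge_edge_maps(face_edges, frame_outline)
```

Merging the two maps with a per-pixel maximum keeps every face edge and adds the frame outline on top, which matches the idea of one control image carrying both face structure and frame placement.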

Why this combination

  • LoRA over full fine-tuning: Frame styles change seasonally. LoRA adapters can be trained per style on roughly 20 to 30 reference images in under an hour, whereas full fine-tuning requires hundreds of images and hours of compute. New styles ship as new adapters.

  • ControlNet over pure text conditioning: "Aviator glasses on a face" is ambiguous — the model doesn't know which face or which size. Canny edge maps encode the exact facial structure and desired frame placement, removing the ambiguity.

  • Inpainting over full image generation: Regenerating the entire face risks identity drift everywhere. Inpainting the eye region only constrains the generation space and preserves the rest of the face identically.
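
The cost argument for LoRA follows from its parameterization in the cited paper (Hu et al.): the pretrained weight $W_0$ stays frozen and only a low-rank update is trained. The symbols below follow that paper; the example dimensions afterwards are illustrative and not taken from this system.

```latex
W = W_0 + \frac{\alpha}{r} B A, \qquad
B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

\text{trainable parameters per adapted matrix: } r(d + k)
\quad \text{vs.} \quad dk \text{ for full fine-tuning}
```

For example, with d = k = 1024 and r = 16, an adapted matrix has 16 × 2048 = 32,768 trainable parameters instead of 1,048,576, about 3% of the full matrix. That ratio is what makes per-style adapters cheap to train and to ship.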


3. Key technical challenges

3.1 Transparent lens rendering in latent space

Latent diffusion operates in a compressed representation. Transparency — the property of "showing what's behind, but modified" — doesn't have a clean encoding in latent space. The model tends to either:

  • Fill the lens area opaque, blocking the eye entirely (the model treats the lens region as "glasses = filled")
  • Skip the lens, rendering only the frames (the model can't reconcile "present but transparent")

Solution: Prompt engineering combined with weighted attention. The generation prompt explicitly specifies "transparent lenses, eyes visible through glass, slight reflection." Attention weights on the eye-region latent are boosted during the denoising process, ensuring the underlying eye detail bleeds through the lens generation. The result isn't physically accurate refraction — it's a perceptually convincing approximation where the eye color and shape remain readable through a subtle lens tint.
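
The prompt side of this solution can be sketched with ComfyUI's `(text:weight)` emphasis syntax, which scales how strongly a phrase influences the text conditioning. The source does not say which phrases were weighted or by how much, so the weights and the helper `weighted_prompt` are hypothetical. The sketch covers prompt emphasis only; the boosting of attention on the eye-region latent is a separate mechanism that is not implemented here.

```python
def weighted_prompt(phrases):
    """Join (phrase, weight) pairs into a prompt using (text:weight) emphasis.

    Phrases with weight 1.0 are emitted bare, since 1.0 is the default weight.
    """
    parts = []
    for text, weight in phrases:
        if weight == 1.0:
            parts.append(text)
        else:
            parts.append(f"({text}:{weight:.2f})")
    return ", ".join(parts)


# Hypothetical weights; the phrases are the ones quoted in the text.
prompt = weighted_prompt([
    ("transparent lenses", 1.3),
    ("eyes visible through glass", 1.2),
    ("slight reflection", 1.0),
])
```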

3.2 Eye identity preservation

Standard inpainting treats the masked region as "to be replaced." For glasses try-on, the eyes inside the mask must remain identical to the input. Two approaches were evaluated:

  • Post-generation blending: Generate the full glasses region, then alpha-blend the original eyes back in. Produces visible seam artifacts at the eye-lens boundary.
  • Latent-space eye locking: Before inpainting, extract the eye-region latents from the original image. During denoising, constrain the eye sub-region to these original latents at each step, allowing only the frame and lens areas to change.

The latent-space eye locking approach proved significantly more effective. The eyes remain identical at the pixel level, and the frame-lens boundary renders naturally because the diffusion model sees coherent surrounding context during denoising.

This technique directly informed the development of the Lock Layer Pattern — the general principle of separating "what must not change" from "what is being generated."
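
The eye-locking step can be sketched as a loop that re-imposes the original latents on the locked region after every denoising step. This is a minimal illustration under stated assumptions: the latent is a flat list of floats, the noise schedule is a toy linear blend rather than a real diffusion schedule, and `denoise_step` stands in for the model. All names are hypothetical and none of this is the author's ComfyUI graph.

```python
import random


def noised(original, noise, signal):
    """Blend the original latent with noise; signal=1.0 returns the original values."""
    return [signal * o + (1.0 - signal) * n for o, n in zip(original, noise)]


def apply_lock(latent, original, locked, noise, signal):
    """Replace locked positions with the original latent at the current noise level."""
    reference = noised(original, noise, signal)
    return [ref if keep else cur
            for cur, ref, keep in zip(latent, reference, locked)]


def denoise_with_lock(original, locked, denoise_step, steps=10, seed=0):
    """Run a toy denoising loop, re-imposing the locked region after every step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in original]
    for i in range(steps):
        signal = (i + 1) / steps  # toy linear schedule; reaches 1.0 on the final step
        latent = denoise_step(latent, signal)
        noise = [rng.gauss(0.0, 1.0) for _ in original]
        latent = apply_lock(latent, original, locked, noise, signal)
    return latent


def halve(latent, signal):
    """Stand-in for the diffusion model: shrinks every value toward zero."""
    return [0.5 * v for v in latent]


# Toy example: a flattened 6-value "latent" where positions 2 and 3 are the eyes.
original = [0.1, 0.2, 0.9, 0.8, 0.3, 0.4]
locked = [False, False, True, True, False, False]
result = denoise_with_lock(original, locked, halve)
```

Because the locked region is noised to the same level as the rest of the latent at each step, the model always sees a coherent neighbourhood around the eyes, and on the final step the locked values equal the original latents exactly.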

3.3 Eyebrow masking

An unexpected failure mode: without explicit eyebrow exclusion from the inpainting mask, the model would deform the eyebrows to match the frame shape. Thin-frame glasses would thin the eyebrows; thick frames would thicken them. The model treated the entire masked region as "glasses-related" and made everything harmonize — including unrelated facial features.

Solution: The inpainting mask was refined to exclude a margin above the eye socket, preserving the eyebrow region from any generation. This small masking change eliminated the deformation entirely.
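
The mask refinement can be sketched as clipping the top of the inpainting region so it never reaches the eyebrows. The rectangular region, the landmark inputs, and the `margin` parameter are assumptions made for illustration; the production mask is presumably shaped from facial landmarks rather than a box.

```python
def eyewear_mask(height, width, eye_box, brow_bottom, margin=2):
    """Binary inpainting mask (1 = regenerate) that never reaches the eyebrows.

    eye_box is (top, left, bottom, right) in pixels, with bottom/right exclusive.
    brow_bottom is the lowest image row occupied by an eyebrow landmark.
    """
    top, left, bottom, right = eye_box
    # Start the mask below the eyebrows, leaving `margin` untouched rows between.
    safe_top = max(top, brow_bottom + 1 + margin)
    return [
        [1 if safe_top <= y < bottom and left <= x < right else 0
         for x in range(width)]
        for y in range(height)
    ]


# Toy 10x12 image: the requested region starts at row 2, the eyebrows end at row 3.
mask = eyewear_mask(10, 12, eye_box=(2, 1, 8, 11), brow_bottom=3, margin=1)
```

In this toy case the requested region would have covered rows 2 to 7, but the mask starts at row 5, so the eyebrow rows and a one-row margin are excluded from generation.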

3.4 Non-standardized input handling

User-submitted photos vary wildly: different lighting, angles, resolutions, backgrounds, partial occlusion (hair over eyes), and accessories (existing glasses, sunglasses). The pipeline handles this through:

  • Automatic face detection and alignment before processing
  • Fallback paths for extreme angles (beyond ~30° off-axis, the system flags the input rather than producing a bad result)
  • Pre-processing normalization for lighting and white balance, ensuring consistent LoRA behavior across input conditions
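
Two of these steps can be sketched in a few lines: the off-axis gate and a white-balance normalization. The ~30° limit comes from the text; everything else is an assumption, including the `Triage` type, the idea that a head-pose estimator supplies `yaw_degrees`, and the choice of gray-world balancing (the source does not name the normalization method).

```python
from dataclasses import dataclass
from typing import Optional

MAX_YAW_DEGREES = 30.0  # approximate limit stated in the text


@dataclass
class Triage:
    accepted: bool
    reason: Optional[str] = None


def triage_input(face_found: bool, yaw_degrees: float) -> Triage:
    """Decide whether a photo enters the pipeline or is flagged back to the user."""
    if not face_found:
        return Triage(False, "no face detected")
    if abs(yaw_degrees) > MAX_YAW_DEGREES:
        return Triage(False, "head turned too far off-axis")
    return Triage(True)


def gray_world(pixels):
    """Gray-world white balance: scale each channel so its mean matches the overall mean."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    target = sum(means) / 3
    gains = [target / m if m else 1.0 for m in means]
    return [tuple(min(255.0, p[c] * gains[c]) for c in range(3)) for p in pixels]


# A reddish two-pixel "image" as (R, G, B) tuples.
balanced = gray_world([(100, 50, 50), (200, 100, 100)])
decision = triage_input(face_found=True, yaw_degrees=42.0)
```

Flagging rather than attempting a result keeps a visibly bad output from reaching the user, which fits the trust constraint described in section 1.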

4. Results

Virtual try-on result: frames rendered with transparent lenses, original eye color preserved
Virtual try-on result: different frame style, consistent eye identity across styles

The system achieves:

  • Eye identity preservation at pixel-level accuracy — the generated eyes match the input with no perceptible drift in color, shape, or position.
  • Transparent lens rendering that reads as "glass over my eyes" rather than "colored filters" or "missing lenses." The lens tint is subtle and realistic.
  • Frame integration that respects facial geometry — frames sit on the nose bridge correctly, temple arms align with head width, and shadow/lighting is consistent.
  • Style consistency across different LoRA adapters — swapping frame styles produces visually coherent results without pipeline reconfiguration.

5. Lessons and evolution

This project was the starting point for two broader architectural patterns that became central to subsequent work:

Lock Layer Pattern. The latent-space eye locking technique — constraining specific identity features during generation while allowing creative variation elsewhere — generalized into the Lock Layer Pattern. The pattern separates preservation constraints from generative instructions, replaying the preservation layer verbatim across every generation in a batch. What began as "don't change the eyes" became "don't change the identity, the brand, the product."
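
A minimal sketch of the pattern, assuming the preservation layer is expressed as text and paired with each generative instruction in a batch. The source does not specify how the lock layer is represented, so the `LockLayer` class, the `PRESERVE:` prefix, and the request layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockLayer:
    """Preservation constraints, fixed once and reused unchanged."""
    constraints: tuple

    def render(self) -> str:
        return "PRESERVE: " + "; ".join(self.constraints)


def build_batch(lock: LockLayer, instructions: list) -> list:
    """Pair the same rendered lock layer with each generative instruction."""
    locked_text = lock.render()  # rendered once, then replayed verbatim
    return [{"lock": locked_text, "generate": text} for text in instructions]


lock = LockLayer(("eye color", "eye shape", "eye position"))
batch = build_batch(lock, ["thin gold aviator frames", "thick black square frames"])
```

Making the lock layer immutable and rendering it once means no individual request can drift from the shared preservation constraints; only the `generate` field varies.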

Typed Reference Composition. The multi-condition architecture — LoRA for style, ControlNet for structure, inpainting mask for region — evolved into Typed Reference Composition. The general principle: different types of visual information (style, structure, identity, region) should be carried by dedicated, typed conditioning channels rather than competing inside a single prompt.
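
One way to picture typed channels is a small data model in which every reference declares what kind of information it carries. This is a sketch under assumptions: the `RefType` names mirror the types listed above, the file names are placeholders, and the one-reference-per-channel rule is a design choice made for this example, not something the source states.

```python
from dataclasses import dataclass
from enum import Enum


class RefType(Enum):
    STYLE = "style"          # e.g. a LoRA adapter
    STRUCTURE = "structure"  # e.g. a ControlNet edge map
    REGION = "region"        # e.g. an inpainting mask
    IDENTITY = "identity"    # e.g. locked eye latents


@dataclass(frozen=True)
class Reference:
    kind: RefType
    source: str  # hypothetical file name or asset id


def compose(references):
    """Index references by type, rejecting two references for the same channel."""
    channels = {}
    for ref in references:
        if ref.kind in channels:
            raise ValueError(f"duplicate {ref.kind.value} reference")
        channels[ref.kind] = ref.source
    return channels


channels = compose([
    Reference(RefType.STYLE, "aviator_lora.safetensors"),
    Reference(RefType.STRUCTURE, "face_canny.png"),
    Reference(RefType.REGION, "eye_mask.png"),
])
```

Rejecting duplicates makes conflicts explicit at composition time instead of leaving two style or structure signals to compete silently during generation.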

Identified next steps

  • Video try-on: Extending the single-frame pipeline to video with temporal consistency across frames.
  • Physical accuracy: Replacing the perceptual lens approximation with a physics-based refraction model for higher-fidelity lens rendering.
  • Multi-angle synthesis: Generating try-on results from a single input photo at multiple head angles, using face-aware 3D priors.

References

  • ControlNet: Zhang, Rao, and Agrawala. "Adding Conditional Control to Text-to-Image Diffusion Models." ICCV 2023.
  • LoRA: Hu, Shen, et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
  • ComfyUI: Node-based GUI for Stable Diffusion. github.com/comfyanonymous/ComfyUI