[Reels] Imagine Yourself

BayesianBacteria 2024. 10. 15. 21:49

Link , Personalized text-to-image generation by Meta AI

Insights

The authors design a kind of improved IP-Adapter for “any-personality” generation.
- Therefore, the model does not need to be trained for each new subject, unlike LoRA or DreamBooth.
- Meanwhile, other “any-personality” generation models can overfit strongly, for example by copy-pasting the reference image. (This is addressed by the synthetic paired dataset described below.)
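
To make the “no training per subject” point concrete, here is a minimal sketch of IP-Adapter-style decoupled cross-attention. It is not the paper's exact architecture: the function names, the `image_scale` parameter, and the simplification that keys and values are the raw tokens (no learned projections) are all assumptions made for illustration. The point is only that the reference identity enters as an extra conditioning input at inference time, so swapping the subject means swapping `image_tokens`, not updating weights.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def decoupled_cross_attention(latent_queries, text_tokens, image_tokens, image_scale=1.0):
    """Attend to text and reference-image tokens separately, then sum.

    Assumption: keys == values == the tokens themselves. A real adapter
    would apply separate learned K/V projections to each branch.
    """
    text_out = attention(latent_queries, text_tokens, text_tokens)
    image_out = attention(latent_queries, image_tokens, image_tokens)
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]


# Toy inputs: 2 latent positions, 3 text tokens, 1 identity token, width 2.
latents = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 2.0], [0.5, 0.5], [2.0, 1.0]]
identity = [[5.0, 7.0]]
conditioned = decoupled_cross_attention(latents, text, identity)
```

Setting `image_scale=0.0` recovers plain text-conditioned cross-attention, which is one way to see how the strength of the identity branch trades off against prompt following.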

The limitation of existing personalization methods is the “copy-paste” effect: the generated image looks nearly identical to the given reference image.
- This means that the generated image does not follow the given prompt.
To resolve this issue, the authors propose a synthetic-data pipeline that gathers several real and synthetic images for each identity.
- Sadly, the details of the pipeline are not included (for example, how the synthetic “personalized images” are generated).
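
Since the paper does not describe the pipeline, the following is only a guess at the pairing idea, not the authors' method. Assume each identity has several images (real or synthetic) and that a training example uses one image as the reference and a different image of the same identity as the target. The function name, the dictionary layout, and the image ids are all hypothetical.

```python
from itertools import permutations


def build_training_pairs(images_by_identity):
    """Build (reference, target) pairs within each identity.

    images_by_identity maps an identity name to a list of
    (image_id, source) tuples, where source is "real" or "synthetic".
    Reference and target are always different images, so reproducing
    the reference pixel-for-pixel can never match the target.
    """
    pairs = []
    for identity, images in images_by_identity.items():
        for (ref_id, ref_src), (tgt_id, tgt_src) in permutations(images, 2):
            pairs.append(
                {
                    "identity": identity,
                    "reference": ref_id,
                    "reference_source": ref_src,
                    "target": tgt_id,
                    "target_source": tgt_src,
                }
            )
    return pairs


# Hypothetical toy data: one identity with one real and two synthetic images.
toy_dataset = {
    "person_a": [
        ("a_real_0", "real"),
        ("a_syn_0", "synthetic"),
        ("a_syn_1", "synthetic"),
    ]
}
toy_pairs = build_training_pairs(toy_dataset)
```

If the target differs from the reference in pose, expression, or scene, the model is pushed to carry over identity while taking everything else from the prompt, which is the intuition behind using such pairs against the copy-paste effect.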

(common space between image and text) CLIP
(encoding “characters”) ByT5: byte-level (character-level) T5 architecture. It might improve rendering of text inside the image, for instance a sign that reads “moreh is cool”.
(comprehending long and intricate text prompts) UL2: an “improved T5”
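
To illustrate what “byte-level” means for ByT5, here is a small sketch of byte tokenization. It assumes the usual ByT5 convention where ids 0, 1, and 2 are reserved for pad, end-of-sequence, and unknown, so each UTF-8 byte maps to its value plus 3. The function names are hypothetical; this is not the library tokenizer itself.

```python
SPECIAL_OFFSET = 3  # assumed: ids 0, 1, 2 are pad, eos, unk
EOS_ID = 1


def byte_level_encode(text):
    """Map each UTF-8 byte to an id and append end-of-sequence."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def byte_level_decode(ids):
    """Drop special ids and turn the remaining ids back into text."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return raw.decode("utf-8", errors="ignore")


sign_ids = byte_level_encode("moreh is cool")
sign_text = byte_level_decode(sign_ids)
```

Because every character of the sign text gets its own token or tokens, the encoder sees the exact spelling rather than subword pieces, which is the plausible reason it helps when the prompt asks for specific text in the image.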

Only applicable to models with cross-attention text conditioning.
- At least, that is the case for the architecture proposed in the paper.
- To apply it to an SD3-like architecture (without cross-attention), it needs to be adjusted.