세균맨

(3.5/5.0) Imun Seolnongtang (이문설농탕): big brother Doo-han's go-to restaurant

BayesianBacteria — Sat, 15 Mar 2025 12:50:36 +0900

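Excavating Ancient Restaurants, Chapter 2

Imun Seolnongtang in a nutshell

The favorite restaurant of "big brother" Kim Doo-han, and the place where he once worked part-time. The oldest restaurant in Korea (or so it is said).
The kkakdugi (cubed radish kimchi) is very good. And I rarely eat kkakdugi. (So my verdict may not be reliable.)
Somehow plump grains of rice.
A solidly tasty seolleongtang (ox bone soup).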

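Entrance

A signboard in which the feel of "ancient Jongno" has been well smoked in. Of course, Imun Seolnongtang (age 123) moved to its current home in 2011. This is the point where I get curious about the 100-year-old hanok building, which is said to have collapsed since.

Further in is the entrance. A pity that it is not worn enough. When did worn-out things start to be preferred? Since everyone became more comfortable, or because people dislike the present? I wonder which.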

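Interior and food

The feel of the "menu board" is the highlight. Was it perhaps brought over as-is from the previous location? If so, the most senior elder of this restaurant would be the menu board. What on earth is "mana" (마나)? It sounds like it would refill my mental energy.

"Bring me a bowl of seolleongtang", "lots of green onion".

The rice grains are plump. There is plenty of meat, too. At the places in Jongno known for good seolleongtang, they tend to give you as much meat as you would get eating at home.

As someone who prefers gukbap, the reason I rarely reach for seolleongtang or gomguk is that the amount of meat is frustratingly small (that is, I cannot get one piece of meat per spoonful), but here one spoonful, one piece of meat is possible. Is this what appealed to big brother Doo-han? Then again, back in those days there may not have been this much meat.

I also tried ordering some kind of assorted hot pot (jeongol), and its broth tastes a bit different from the seolleongtang. Do they use a different stock? Or does the difference come from the vegetable-stock flavor?

The real star of this place was actually the kkakdugi... oops, I was so busy stuffing my face with it that I have no photo of it. That's how it goes. Things that are too delicious and too beautiful vanish from the mind and from the belly before you can even capture them, intense for a moment and then quickly gone, like the smell of a nasty fart you just let out. So should I make an effort to record delicious and beautiful things, or should I simply enjoy the nasty moment (that harmony of indole and skatole) more gladly?

- I rarely eat kkakdugi. It may have tasted good to me because it was a flavor that is kind to non-kkakdugi people like me; the sour taste of the radish seemed somewhat milder.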

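Bongu's opinion

The kkakdugi was not bad, but the kimchi was tastier.
"Mana" has the texture of chewing a mushy sponge (with lots of holes) stuffed with cotton, and tastes like conch. Not recommended (not to my taste).
The hot pot with a bit of everything in it reminded me of Hanwoo Gomtang (한우곰탕), the super-famous spot in Jeonmin-dong, Daejeon. That said, the vegetable-stock flavor was strong, and there did not seem to be anything extracted more richly than in the seolleongtang.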

(4.5/5.0) Vangundy Steakhouse (밴건디 스테이크하우스)

BayesianBacteria — Sun, 16 Feb 2025 15:28:45 +0900

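Vangundy in a nutshell

A -local- steakhouse at the entrance of Seorae Village: most customers tend to be families living nearby.
Reasonable prices: with Wolfgang's in mind, I end up thinking this is a bargain.
A restaurant of "taste": the romaine salad is very good. Stuff a whole romaine leaf in your mouth in one bite and you could die happy. The "tenderloin" of the T-bone steak is a shock. Delicious.
Rating: 4.5 / 5.0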

Entrance

Why do they park the valet cars right up against the windows? You can't see outside...

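Food, before the steak

We ordered something like a two-person set for 275,000 won. Before the steak, we could choose a salad plus one pasta.

The romaine Caesar salad with slab bacon seems to be a kind of "signature": they appear to coat the romaine with Caesar dressing into every corner, the way you would make kimchi, then grate cheese on top. The romaine wilts just enough as it soaks up the dressing, and, ah, it is very good. The slab bacon is also very good. The funny part is that we could not finish the bacon and took the leftovers home, and after crisping it up in the air fryer it was about twice as good.

Below is the "Bisque Gamberi something" pasta, which we ordered because of its (limited edition) tag. The noodles did not seem to be the thick or rough-surfaced kind. Honestly, rather than the noodles or the (crustacean) sauce being good, the point of the pasta was that the shrimp were very plump and springy. I think I ate all the shrimp and left some of the noodles.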

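Steak

The T-bone steak. This one is truly to die for. Doneness: medium. I did not know this, but apparently leaner cuts are more tender when cooked a bit longer, and fattier cuts are more tender when cooked a bit less. Applied here, the tenderloin should be cooked a little more and the sirloin a little less. I usually order medium rare without knowing any better, but for some reason I felt like ordering medium.

Then, told to eat the tenderloin first, I took a piece and knew the moment I stuck the knife in. This thing, the knife slides right in. This fellow is soft. Sure enough, it was ridiculously good. The sirloin was good too, but the tenderloin is three moves ahead (세 수, as in board-game moves, not 세수 "washing up"). They do not give you salt here. You just get steak sauce and whole-grain mustard, but perhaps because the steak is well seasoned, it tastes good on its own. And the steak sauce is not too... assertive. It feels like it provides exactly the assist that is needed.

Around when the steak arrives, you get the side dishes you chose along with it: things like mashed potatoes, or spinach, the "aged kimchi" (that is, the ever-present staple) of Wolfgang's (it is really a spinach porridge; do not expect seasoned namul). The sides here were quite good too. There were two: (1) porridge-like spinach mixed with a rather strong-smelling cheese (no idea which) into something like a sauce, and (2) likewise, a very smooth mashed potato with a texture close to aligot (cheese potato). Ah. Quite good. The spinach-cheese porridge (?) in particular goes rather well with the meat.

Revisit intent ++, if only to eat that knife-friendly tenderloin once more.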

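Bongu's opinion

A tenderloin you could eat without teeth.
After eating the romaine salad, ordered romaine separately for home.
Better than Wolfgang's.
The shrimp are so plump and springy.

We mostly agree.

Oh, and although I honestly could not eat it properly because I was full, the tiramisu is also fairly good. It would not be out of line to call it 0.8 of a Bistecca (비스테까; Bistecca has since closed).
At the tables next to us, parties of two finished all of that. I think I could have too if I had been in good shape, which is a pity. My companion has a very small appetite. They packed in food until nearly bursting, but it was not enough: thanks to that, I always have to eat the large set at every dining outing.
After a steakhouse or fine-dining meal, a walk is mandatory. Go straight home and lie down, and you will throw up.
Corkage is allowed here, and I brought a port wine. When I drank port with meat at Wolfgang's, they asked me to confirm several times. Were they thinking, "These punks, do they have no taste? They drink this with meat?" Not my problem, I like it this way.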

HiRA: parameter efficient Hadamard high rank adaptation for Large Language Models

BayesianBacteria — Tue, 11 Feb 2025 11:36:56 +0900

ICLR2025

Text in italics marks my own thoughts or additions, which are not part of the original paper.

Limitations of LoRA

LoRA and most of its variants do not perform well when applied to complex tasks—such as commonsense reasoning.
This degradation might be inevitable since the additional trainable parameters are “lower-rank” matrices.

Goal: Achieve higher-rank adaptation for LLMs under the PEFT (parameter-efficient fine-tuning) strategy.

Motivation: recap the goal of PEFT

PEFT requires a careful balance between model expressiveness and computational efficiency.
- Or, additionally, PEFT may help to prevent catastrophic forgetting of prior knowledge of models (https://arxiv.org/abs/2405.09673)
(Observation 1) LoRA with higher ranks enhances the performance of Llama3-8B, implying that higher-rank adaptation offers significant advantages.

However, increasing the rank in LoRA raises the computational budget, and training becomes difficult due to gradient explosion.
We need another, better tool for PEFT that still provides "higher-rank" updates.

HiRA in a nutshell: LoRA vs HiRA

Figure 3 above captures essentially all of HiRA: the product of the trainable low-rank matrices $A$ and $B$ is multiplied element-wise with the original weight matrix (in other words, a Hadamard product).

(Q: Connection with "Gated Linear Unit" (GLU)-style activations): GLU activation functions have a structure similar to HiRA: $h(x) = f(x) \odot g(x)$, where $f$ and $g$ are some functions. Is the shared factor (higher rank) the main reason GLUs outperform other activations?

Why does HiRA outperform LoRA?

Simply put, LoRA has a lower rank. By the properties of matrix rank, the decomposed matrix $\Delta W = L_1 L_2$, where $L_1 \in \mathbb{R}^{d \times r}, L_2 \in \mathbb{R}^{r\times k}$ for $r < d, k$, has rank at most $r$. This may limit its capacity to capture high-rank updates (which may be what complex tasks, such as LLM reasoning, require).

However, HiRA is free from this low-rank degradation. HiRA uses the Hadamard product (element-wise product), and the crucial benefit of the Hadamard product ($\odot$) is that the rank of the resulting matrix $P\odot Q$ satisfies:

$$Rank(P \odot Q) \leq Rank(P) \times Rank(Q)$$

This bound is much larger (in most practical scenarios) than the bound on the rank of the matrix product used by LoRA:

$$Rank(PQ) \leq \min(Rank(P), Rank(Q))$$

The above only states upper bounds on the rank. However, the authors found empirically that the Hadamard product does increase the rank, as shown below:

HiRA can even have a higher rank than the original weight matrix

Consider the additional weight $\Delta W$ below, which enters the fine-tuned weight as $W_{\text{new\_weight}} = W_{\text{original}} + \Delta W$:

$$\Delta W = W_{\text{original}} \odot W_{hi}$$

Here, $W_{hi}$ is the new trainable parameter. By the rank inequality for the Hadamard product above, the rank of $\Delta W$ is bounded as follows:

$$Rank(\Delta W) \leq Rank(W_{\text{original}}) \times Rank(W_{hi})$$

Therefore, the rank of the additional weight that we newly train can possibly exceed the rank of the original matrix. Importantly, it is still computationally efficient (equivalent to LoRA) if we decompose $W_{hi}$ as $W_{hi} = AB$, where $A$ and $B$ are low-rank matrices. And of course, the additional weights can be merged into the original weights, just like LoRA.
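To make the rank argument concrete, here is a minimal sketch in pure Python (exact rational arithmetic, no ML library). The toy matrices are my own assumptions chosen for illustration, not anything from the paper: a small full-rank `W0` and a rank-1 adapter `A B`. The LoRA-style update `A B` has rank 1, while the HiRA-style update `W0 ⊙ (A B)` reaches full rank with the same number of trainable parameters.

```python
from fractions import Fraction

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def hadamard(X, Y):
    """Element-wise (Hadamard) product of two same-shaped matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rank(M):
    """Exact matrix rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

d = 4
# Toy "pretrained" weight (assumption): a full-rank Vandermonde matrix, W0[i][j] = (i + 1) ** j.
W0 = [[(i + 1) ** j for j in range(d)] for i in range(d)]
# Rank-1 adapter factors (r = 1): A is d x 1, B is 1 x d.
A = [[i + 1] for i in range(d)]
B = [[1] * d]

W_hi = matmul(A, B)              # low-rank trainable part, rank 1
delta_lora = W_hi                # LoRA-style update: Delta W = A B
delta_hira = hadamard(W0, W_hi)  # HiRA-style update: Delta W = W0 (element-wise) A B

trainable_params = d * 1 + 1 * d  # identical for both updates
print(rank(W0), rank(delta_lora), rank(delta_hira), trainable_params)
```

In this toy case the Hadamard product with a rank-1 matrix only rescales the rows of `W0`, so the update inherits the full rank of `W0`. Real weights will behave less cleanly, but the upper bounds from the inequalities above still hold.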

Intrinsic dimensionality, expressive power: Does higher rank for finetuning make sense?

Intrinsic dimensionality and matrix rank

Several works have shown that LLMs have a low intrinsic dimensionality, meaning that only a small subset of parameters is necessary for fine-tuning. LoRA is grounded in this finding. In contrast, HiRA uses higher-rank trainable parameters. Does this make sense?

The answer is yes. Intrinsic dimensionality only concerns the number of parameters, and a small number of parameters does not inherently mandate low rank. There is no paradox.

(Is the combination of a small number of parameters and high expressive power (represented here by the rank of the weights) the key to fine-tuning? Or, beyond that, is it the key to model design?) I guess not. Deep learning architectures have evolved from poor inductive biases toward well-calibrated ones. For instance, consider the move from MLPs to CNNs: an MLP has higher expressive power than a CNN, but the function space of CNNs is much better suited to computer vision tasks than that of MLPs.

I am not sure about the fine-tuning regime. Once we choose a model architecture with a good inductive bias, the rule of few parameters plus high expressive power might work?

Formal analysis of expressive power of HiRA, and the role of the original weight

The authors also give a formal analysis of the expressive power of HiRA. They define expressive power via the minimal difference between the updated weight (the original weight plus the additional weight) and its optimal parameter update (following previous work). The smaller this minimal difference, the higher the expressive power. For LoRA, it equals the $(r + 1)$-th largest singular value of the target update (denoted $\bar{E}$ below), where $r$ is the rank of the LoRA update.

Under this definition, the expressive power of HiRA is bounded in terms of both the singular values and the original weight (unlike LoRA, whose bound depends only on the singular values), like below:

However, the role and contribution of the pretrained weight $W_0$ is somewhat unclear. They claim that $W_0$ serves a dual role, both confining and facilitating the adaptation. This may be because it limits flexibility, while the term $\sigma_{r+1}(\bar{E} \oslash W_0)$ in the bound might shrink the singular value?

Gradient analysis: gradient exploits the prior knowledge of the original weight

The authors also claim that training with HiRA surpasses LoRA because it leverages the information encoded in the original weight $W_0$. This can be explained by noting that the gradients of the low-rank matrices contain $W_0$, unlike in LoRA:
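A sketch of why $W_0$ shows up in the gradients, in my own notation rather than necessarily the paper's: let $G = \partial \mathcal{L} / \partial W$ be the gradient of the loss with respect to the full (merged) weight. With $\Delta W = W_0 \odot (AB)$, the chain rule gives

$$\frac{\partial \mathcal{L}}{\partial A} = (G \odot W_0) B^\top, \qquad \frac{\partial \mathcal{L}}{\partial B} = A^\top (G \odot W_0)$$

whereas for LoRA with $\Delta W = AB$ the same computation gives $\partial \mathcal{L} / \partial A = G B^\top$ and $\partial \mathcal{L} / \partial B = A^\top G$. In HiRA, each entry of the backpropagated gradient is therefore rescaled by the corresponding entry of $W_0$ before it reaches $A$ and $B$.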

Experiments

(1) It outperforms LoRA, DoRA, and MoRA on various tasks

Especially on mathematics (GSM8K), which arguably requires relatively complex reasoning.

Note that the ConvAI2 dataset in Table 2 is used to evaluate open-domain dialogue generation.

(2) The singular value scales of HiRA are somehow well matched to those of full fine-tuning (FFT)

If we take the singular value scale (the norms and the counts of large singular values in the weight) as an indicator of the expressive power of the model, HiRA is in the "high, but not dangerously high" zone. For instance, MoRA tends to have larger singular values, and many of them, which may increase the risk of forgetting. In LoRA, only a small number of singular values contribute substantially (in other words, a few singular values with high norm). Unlike both, HiRA has a proper number of large singular values with proper norms. The close match with FFT supports this claim.

(3) $W_0$ may contribute to the fine-tuning

They explore the impact of different choices of $R$ in $\Delta W = R \odot W_{hi}$ (the original choice is $R=W_0$, the pretrained weight). If $W_0$ is replaced with another $R$, fine-tuning is doomed. This suggests that $W_0$ plays a key role in HiRA fine-tuning by conveying useful information from the pretrained weight, as they claim.

(4) HiRA works best when applied to both the fully-connected layers and the QKV projections of transformers.

(Q) Does LoRA show the same tendency or not?

(5) Interestingly, a hybrid approach (LoRA + HiRA) works

$r_1$: rank of HiRA, $r_2$: rank of LoRA

The authors also demonstrate the usefulness of LoRA + HiRA: $W_0\odot A_{hira}B_{hira} + A_{lora}B_{lora}$ is added to the original pretrained weight $W_0$. Furthermore, they claim that Table 6 shows that allocating rank to HiRA is preferable to allocating it to LoRA, since a higher $r_1$ achieves the best score.

On the "forget-less" benefit

Recently, Biderman et al. (2024) have shown that although LoRA fine-tuning underperforms full fine-tuning, it better maintains the base model's performance on tasks outside the target domain. This suggests a tradeoff between preserving the original information (or knowledge) of the pretrained model and adapting to the target task.

The figure below from Biderman et al.'s paper is a good example of such tradeoffs. The Y axis represents the target task the model newly learns, and the X axis represents the other tasks that the pre-trained model has already learned.

Full fine-tuning and higher-rank LoRA can achieve higher accuracy on the target task, but they sacrifice the prior knowledge in the pre-trained weights (HellaSwag, ARC-Challenge, WinoGrande here). If so, a question naturally arises: can HiRA achieve a better Pareto front for the forgetting-adaptation trade-off? If it can, I believe this would be another significant contribution of HiRA. If it cannot, it still has the benefit of reaching full-fine-tuning-level expressive power despite its computational efficiency. Exploring methods with LoRA-level efficiency and full-fine-tuning-level expressive power in terms of the forgetting-adaptation trade-off could be a valuable research direction.

Design patterns for machine learning in a nutshell (1): Creational Design Pattern

BayesianBacteria — Thu, 31 Oct 2024 14:19:24 +0900

Creational design patterns and their motivation

Creational design pattern: patterns concerned with instance creation. They make it possible to create and manage instances efficiently.

As stated above, creational patterns exist for the purpose of instance creation. So what would be examples when building something like a model-training pipeline? In which cases does instance creation become more effective and intuitive?

Consider HuggingFace's AutoModel. From the end user's perspective, it makes creating a model instance very convenient: AutoModel.from_pretrained. Or suppose that creating some object involves various complicated logic. Before creating a torch.nn.Module instance, you may have to parse a configuration and decide whether to use accelerate or deep-speed, or a single GPU; or there may be some logic required before creating a Dataloader instance. Or you may want to write new dataloader logic without modifying existing code while doing so.

Using ML training as the main example, this post explains several creational patterns. It introduces the traditionally used patterns and, along the way, also covers modern patterns with the same purpose as seen in Huggingface and elsewhere.

Patterns

Factory Patterns

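A pattern whose purpose is to create instances of several classes. Suppose you are experimenting with a particular training method and want to see whether it works well across several architectures. If the code for all the other logic is the same and only the part that loads the model differs, a naive implementation could use branching, as below.

import torch.nn as nn

def create_model(model_type):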
    if model_type == "CNN":
        return nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 10)
        )
    elif model_type == "RNN":
        return nn.Sequential(
            nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
        )
    elif model_type == "Transformer":
        return nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6)
    else:
        raise ValueError("Unknown model type")

Instead of writing a branching create_model like this in the main training code, the Factory pattern lets a separate class, ModelFactory, handle model instance creation. Having a single class that can create several kinds of instances, as below, is called the Simple Factory pattern.

from dataclasses import dataclass

@dataclass
class SomeConfig:
    # Placeholder for a complex configuration object; its fields are left open here.
    pass

class ModelFactory:
    def create_model(self, model_type: str, some_complex_config: SomeConfig):
        # All branching on the model type lives in this one place.
        if model_type == "CNN":
            ...
            return CNNModel()
        elif model_type == "RNN":
            ...
            return RNNModel()
        elif model_type == "Transformer":
            ...
            return TransformerModel()
        else:
            raise ValueError("Unknown model type")

class CNNModel:
    def train(self):
        print("Training CNN model...")

class RNNModel:
    def train(self):
        print("Training RNN model...")

class TransformerModel:
    def train(self):
        print("Training Transformer model...")

factory = ModelFactory()
some_complex_configuration = SomeConfig()
model = factory.create_model("CNN", some_complex_configuration)
model.train()

However, internally the factory class is still based on -branching-. As a better solution, we can create a model-creation factory per model, as below. This pattern is called the Factory Method pattern.

from abc import ABC, abstractmethod
import torch.nn as nn

class ModelFactory(ABC):
    @abstractmethod
    def create_model(self):
        pass

class CNNFactory(ModelFactory):
    def create_model(self):
        return nn.Sequential(...)

class RNNFactory(ModelFactory):
    def create_model(self):
        return nn.Sequential(nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True))

class TransformerFactory(ModelFactory):
    def create_model(self):
        return nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6)

def get_model(factory: ModelFactory):
    model = factory.create_model()
    return model

cnn_factory = CNNFactory()
cnn_model = get_model(cnn_factory)
print("CNN Model:", cnn_model)

rnn_factory = RNNFactory()
rnn_model = get_model(rnn_factory)
print("RNN Model:", rnn_model)

transformer_factory = TransformerFactory()
transformer_model = get_model(transformer_factory)
print("Transformer Model:", transformer_model)

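Now we no longer have to add a branch every time a new model appears. At first glance, though, having to add something new each time a model is added is the same with or without factories (you have to write the corresponding factory class), so it may seem that only the complexity has increased.

In practice, separating things this way has a few advantages, as follows.

When adding a new model, doing so in a completely separate Factory class means there is no chance of accidentally touching existing code. That is, stability increases.
Adherence to the Open/Closed Principle (OCP): new functionality can be added without modifying existing code at all. You only need to care about the implementation of the class for the new model you want to add. The existing interface (the base class or abstract class of the factory) does not need to be modified. (Or rather, it must not be!)
Since every factory has to follow the same interface (ModelFactory), a consistent pattern can be enforced across the code.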
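The points above can be sketched in a few lines. This is a minimal illustration under my own assumptions: plain strings stand in for real `nn.Module` models, and `MLPFactory` and `BrokenFactory` are hypothetical names that do not appear in the post. It shows a new factory being added without editing existing code, and the abstract base class rejecting a factory that does not follow the interface.

```python
from abc import ABC, abstractmethod

# --- existing code, never edited again ---
class ModelFactory(ABC):
    @abstractmethod
    def create_model(self):
        ...

class CNNFactory(ModelFactory):
    def create_model(self):
        return "cnn-model"  # stand-in for a real nn.Module

def get_model(factory: ModelFactory):
    return factory.create_model()

# --- new code, added later in a separate place ---
class MLPFactory(ModelFactory):  # hypothetical new model
    def create_model(self):
        return "mlp-model"

class BrokenFactory(ModelFactory):  # hypothetical: forgot to implement create_model
    pass

print(get_model(MLPFactory()))

try:
    BrokenFactory()
except TypeError as err:
    # ABC refuses to instantiate a class with unimplemented abstract methods.
    print("rejected:", type(err).__name__)
```

Nothing above the "new code" marker had to change to support the new model, which is the Open/Closed Principle in miniature.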

working with the classmethod: AutoClass

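This is not a Factory pattern in the traditional sense, but let us look at a design with a similar purpose that takes advantage of Python's flexibility. Using Python's classmethod, we can serve the same purpose as the Factory pattern (instance creation) in the style of Huggingface's AutoClass: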

class AutoModel:
    @classmethod
    def create_model(cls, model_type):
        if model_type == "CNN":
            return CNNModel()
        elif model_type == "RNN":
            return RNNModel()
        elif model_type == "Transformer":
            return TransformerModel()
        else:
            raise ValueError("Unknown model type")

...

model = AutoModel.create_model("CNN")
model.train()  # Output: Training CNN model...

If the branching still bothers you, you can use a model registry, as HF does, to remove the branching, so that adding a model does not even require a separate Factory! Instead of writing a factory for each model, you just add a class decorator.

class AutoModel:
    _registry = {}

    @classmethod
    def register_model(cls, model_type: str):
        def inner_wrapper(wrapped_class):
            cls._registry[model_type] = wrapped_class
            return wrapped_class
        return inner_wrapper

    @classmethod
    def from_type(cls, model_type: str, some_complex_config):
        if model_type not in cls._registry:
            raise ValueError(f"Unknown model type: {model_type}")
        return cls._registry[model_type](some_complex_config)

@AutoModel.register_model("CNN")
class CNNModel:
    def __init__(self, config):
        print("Initializing CNN model with config:", config)
    def train(self):
        print("Training CNN model...")

@AutoModel.register_model("RNN")
class RNNModel:
    def __init__(self, config):
        print("Initializing RNN model with config:", config)
    def train(self):
        print("Training RNN model...")

@AutoModel.register_model("Transformer")
class TransformerModel:
    def __init__(self, config):
        print("Initializing Transformer model with config:", config)
    def train(self):
        print("Training Transformer model...")

model = AutoModel.from_type("CNN", some_complex_configuration)
model.train()

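This does not have the advantages of the original Factory pattern, namely OCP and the separation of object creation from class functionality. However, the code becomes more concise and it suits a 'library structure'. That is, the end user just calls AutoSomething.give_my_instance, no questions asked.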

Abstract Factory Pattern

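As the name suggests, like the Factory pattern, this is a pattern with a class whose job is to "create instances". The difference is that an Abstract Factory class creates a family of instances. Taking ML training as an example, a single class creates all of the instances related to the dataloader, the model, and the training loop (trainer).

Suppose we are setting up various experimental environments. With the Abstract Factory pattern, it could probably be implemented as follows.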

from abc import ABC, abstractmethod
import torch
import torch.nn as nn
import torch.optim as optim

# Abstract Factory
class ExperimentFactory(ABC):
    @abstractmethod
    def create_model(self):
        pass
    
    @abstractmethod
    def create_data_loader(self):
        pass
    
    @abstractmethod
    def create_trainer(self):
        pass

class CNNExperimentFactory(ExperimentFactory):
    def create_model(self):
        return nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 10)
        )

    def create_data_loader(self):
        return torch.utils.data.DataLoader(
            torch.randn(100, 3, 32, 32), batch_size=32, shuffle=True)

    def create_trainer(self):
        return Trainer(self.create_model(), self.create_data_loader())

class RNNExperimentFactory(ExperimentFactory):
    def create_model(self):
        return nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
    
    def create_data_loader(self):
        return torch.utils.data.DataLoader(
            torch.randn(100, 10, 10), batch_size=32, shuffle=True)

    def create_trainer(self):
        return Trainer(self.create_model(), self.create_data_loader())

class Trainer:
    def __init__(self, model, data_loader, num_epochs=1):
        self.model = model
        self.data_loader = data_loader
        self.num_epochs = num_epochs  # replaces the undefined global N; the default of 1 is a placeholder
        self.optimizer = optim.SGD(model.parameters(), lr=0.01)

    def train_loop(self):
        print(f"Starting training with {self.model.__class__.__name__}")
        for epoch in range(self.num_epochs):
            for data in self.data_loader:
                outputs = self.model(data)
                ...  # compute `loss` from `outputs` here (elided)
                loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()
        print("Training completed.")

cnn_factory = CNNExperimentFactory()
cnn_trainer = cnn_factory.create_trainer()
cnn_trainer.train_loop()

print("\n---\n")

rnn_factory = RNNExperimentFactory()
rnn_trainer = rnn_factory.create_trainer()
rnn_trainer.train_loop()

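Above, create_data_loader simply returns a DataLoader, but in practice a Dataloader factory could go there.

This way, common logic such as the Trainer can be reused and extensibility improves. That is, when experimenting with a new model and data, you only need to add a new ExperimentFactory without touching existing code (so no problems can be introduced into the existing code).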

Singleton Pattern

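A pattern that forces a class to have only a single instance. It is useful when dealing with resources that must be shared (logs, GPU resources, and so on). For example, when training with torch, you often set a device such as cpu or cuda:0 via model.to(device), and this device must be the same throughout the code (there must be a single device setting).
That is, unless you use something like Accelerate. It can be implemented as in the example below: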

import torch

class DeviceManager:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            print("Initializing Device Manager...")
            cls._instance = super(DeviceManager, cls).__new__(cls)
            cls._instance.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        return cls._instance

    def get_device(self):
        return self.device

device_manager1 = DeviceManager()
print("Device:", device_manager1.get_device())  # cuda or cpu

device_manager2 = DeviceManager()
print("Device:", device_manager2.get_device()) 

print(device_manager1 is device_manager2)  # True, since two different device_managers share the instance

Surpassing global variable

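At this point, the Singleton pattern may seem similar to a global variable.

However, the Singleton pattern has several advantages over managing a single value with a global variable:

A Singleton object can be "initialized only when needed" (so the GPU check does not always have to run),
it is easier to add related settings later (for example, running models below a certain parameter count on the CPU, or taking a config and running on the CPU when testing), and
unlike a global variable, it is easy to mock, which makes testing easier.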
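The lazy-initialization and mocking points can be sketched as follows. This is a minimal illustration under my own assumptions: `_probe_device` is a hypothetical stand-in for a real hardware check such as `torch.cuda.is_available()`, and `reset` and `pick_batch_size` are helpers added only for this sketch.

```python
from unittest import mock

class DeviceManager:
    _instance = None

    @staticmethod
    def _probe_device():
        # Hypothetical stand-in for an expensive hardware check.
        return "cpu"

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.device = cls._probe_device()  # runs once, on first use
        return cls._instance

    @classmethod
    def reset(cls):
        # Test helper (an addition for this sketch): drop the cached instance.
        cls._instance = None

def pick_batch_size():
    # Code under test: depends on the singleton, not on a module-level global.
    return 64 if DeviceManager().device == "cuda" else 8

# In a test, pretend a GPU exists without touching real hardware.
DeviceManager.reset()
with mock.patch.object(DeviceManager, "_probe_device", return_value="cuda"):
    gpu_batch = pick_batch_size()

# Outside the patch, the real probe is used again.
DeviceManager.reset()
cpu_batch = pick_batch_size()

print(gpu_batch, cpu_batch)
```

With a plain global such as `DEVICE = ...` computed at import time, the check would already have run before the test could intervene, which is the practical difference the list above describes.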

Builder Pattern

It is easy to understand if you think of TensorFlow's or torch's Sequential. It is used when a complex instance is composed in several steps. The simplest example is as follows.

import torch.nn as nn

class CNNModelBuilder:
    def __init__(self):
        self.layers = []

    def add_conv_layer(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        self.layers.append(nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding))
        return self

    def add_relu(self):
        self.layers.append(nn.ReLU())
        return self

    def add_max_pool(self, kernel_size, stride):
        self.layers.append(nn.MaxPool2d(kernel_size, stride))
        return self

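    def add_fc_layer(self, in_features, out_features):
        self.layers.append(nn.Linear(in_features, out_features))
        return self

    def add_flatten(self):
        # Flatten (N, C, H, W) feature maps to (N, C*H*W) before a fully connected layer.
        self.layers.append(nn.Flatten())
        return self

    def build(self):
        return nn.Sequential(*self.layers)

builder = CNNModelBuilder()
model = (
    builder.add_conv_layer(3, 16, 3)
           .add_relu()
           .add_max_pool(2, 2)
           .add_conv_layer(16, 32, 3)
           .add_relu()
           .add_max_pool(2, 2)
           .add_flatten()                 # required before the linear layer
           .add_fc_layer(32 * 6 * 6, 10)  # 32 channels x 6 x 6 for a 32x32 input
           .build()
)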

print(model)

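In practice, this kind of logic is widely used in the ResNet code blocks that you would always see when doing ML research in the early 2020s. That said, there do not seem to be many cases where the builder pattern is actually applied as a class in ML projects.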

Extracting a secret sauce from Meta MovieGen

BayesianBacteria — Thu, 17 Oct 2024 23:14:47 +0900

Intro

Modern foundation models for image and video generation are often closed-source, with many technical details remaining unavailable to the public; this is especially true for video generation models. (In my view, this closed-source trend is even more pronounced than in the LLM domain.) For example, we know very little, if anything, about the technical specifics behind models like Runway Gen3, LumaLabs, or OpenAI’s Sora. However, thanks to the Meta MovieGen team, a detailed technical report on their work has been made publicly available. In this article, I won’t focus on MovieGen itself, but rather on the know-how it provides: we will explore their recipes for designing and training video (and image) generation models.

Following the structure of the original paper, we will explore (1) their design choices, including architecture, loss functions, and others; (2) dataset construction; (3) training methods, such as training stages and parallelism; and finally (4) inference techniques. Each ‘recipe’ contains only the bullet points, with details excluded for simplicity. The aim is to provide a high-level overview of the key aspects without diving into technical specifics.

Design choices: architecture, loss-functions, and others

Common

Use flow matching, the rising star and a (maybe) better alternative to diffusion. It ensures "zero terminal SNR": pure Gaussian noise inputs are seen during training.
- use $\sigma_{min}=10^{-5}$ rather than zero, unlike SD3
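As a rough illustration of what such a training pair looks like, here is a minimal pure-Python sketch. It assumes the standard optimal-transport flow-matching path with $t=0$ as noise and $t=1$ as data, which I believe matches the report but have not verified term by term. The function name and toy vectors are my own.

```python
import random

SIGMA_MIN = 1e-5  # value quoted above

def flow_matching_pair(x1, x0, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target velocity) for data x1, noise x0, and time t in [0, 1].

    Assumed path: x_t = t * x1 + (1 - (1 - sigma_min) * t) * x0,
    whose time derivative is v = x1 - (1 - sigma_min) * x0.
    """
    noise_coeff = 1.0 - (1.0 - sigma_min) * t
    x_t = [t * d + noise_coeff * n for d, n in zip(x1, x0)]
    v_target = [d - (1.0 - sigma_min) * n for d, n in zip(x1, x0)]
    return x_t, v_target

rng = random.Random(0)
x1 = [0.3, -1.2, 0.8]                    # toy "data" sample
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # pure Gaussian noise

x_start, v = flow_matching_pair(x1, x0, t=0.0)  # equals x0: the pure-noise input is trained on
x_end, _ = flow_matching_pair(x1, x0, t=1.0)    # x1 plus a tiny sigma_min * x0 residual
print(x_start == x0, x_end)
```

A model would then be trained to regress `v_target` from `(x_t, t)`. At `t = 0` the input is exactly the noise sample, which is the "zero terminal SNR" property mentioned above.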

Temporal VAE

Use a spatio-temporal autoencoder, initialized from a (commonly used) spatial VAE
- Train jointly on images and videos, at a ratio of 1 batch of images to 3 batches of videos.
- More channels in the latent space
- impose an "outlier penalty loss" to remove meaningless "high-norm" pixels from the decoded image.
  - (Opinion) This could be replaced with register tokens, which aim to alleviate "high-norm" tokens in vision transformers.
- Tiling for efficient inference
- train with variable-length videos

Main transformer backbone architecture

On their philosophy for the architecture design:
"We intentionally keep the design of our backbone simple and similar to LLMs, specifically LLaMa3. This design choice allows us scale the model size and training, as discussed in ..."

- Guess: they might want to leverage their infrastructure and AI-framework, optimized to the LLaMa-family, rather than exploring the T2I or T2V suitable architecture.
- Furthermore, they found that their simple architecture works better than specialized blocks while being more stable to train.

patchify via a 3D conv, without compressing the temporal dimension (like OpenSora or others)
learnable positional embedding for arbitrary size, aspect ratio and video length (from the NeurIPS2024-submitted architecture NaViT)
LLama3-like transformer architecture: RMSNorm, SwiGLU
- Cross-attention for text conditioning
- use multiple text encoders (the rationale for each text encoder is interesting)
- No causal attention, unlike the original LLama3

Text encoder choice and its rationale

Use 3 encoders: UL2, ByT5, and MetaCLIP. Interestingly, each choice has a specific reason.
- MetaCLIP is obtained by fine-tuning CLIP to increase the token length from 77 to 256. MetaCLIP is specialized in "text-image alignment".
- ByT5 is a byte-level text encoder. Therefore, it is specialized in supporting "character generation" (e.g. the prompt A developer holding a sign that says 'I need a job').
- UL2 is for long-sentence understanding.
(Controlling FPS with input text) they control the FPS (frames per second) of the video through the input text prompt: during pre-training, the text prompt includes "FPS-16", "FPS-XX"

Spatial upsampling

Train a new super-resolution transformer, but a smaller one (e.g. original: 30B, here: 7B)
- transformer trained with only spatial computation (there is no temporal-attention)
- input channel is doubled, due to low-resolution frame inputs.
- (Improving temporal consistency with Multi-Diffusion, not with temporal-attention) they utilize the Multi-Diffusion (training-free, applied at inference time) to ensure consistency between upsampled-frames.

As mentioned above, according to the ablation study, flow matching and the simple LLaMa3-like architecture both contribute to video quality.

Dataset

they use a lot of visual filtering, like above.
- they even remove videos with excessive text with a video OCR model. why?
- scene boundary detection is done by FFmpeg
- with aspect ratio filtering, they achieve a mix of 60% landscape and 40% portrait videos
- remove the first few seconds of clips, which usually contain unstable camera movement or transition effects
they filter out low motion videos
remove "perceptually duplicate clips" via similarity in a copy-detection embedding space (feature space specialized in de-duplication)
of course, like many other text-to-something model developments, they generate synthetic captions.
at least 60% of their high-resolution training set contains humans
bucketing their training dataset by different aspect ratios and video durations (5 buckets)

Training

Towards model scaling and training efficiency

they have well-organized, optimized, scaled architecture. yes.
use vanilla multi-head attention rather than grouped-query attention, since video generation does not benefit from GQA's main merit, which applies to auto-regressive generation with a causal mask (a smaller KV cache).
like other video model training, multi-stage training is used: stage 1 for image-only training, stage 2 for low-resolution videos, stage 3 for high-resolution videos...
- the majority of training budget is long-context high-resolution video training.

On the parallelism

- they combine multiple forms of parallelism for efficient training: tensor parallelism, sequence parallelism, context parallelism, and fully-sharded data parallelism
- the fundamental aim of using a different parallelism for each transformer component is to reduce communication between machines
- for instance, layer-norm treats sequence dimension independently. So, apply the sequence parallelism.
- they have done this with just "pytorch" and compiled into CUDAGraphs not with DeepSpeed or others. How?

Fine-tuning

(Dataset 1) they focus on balancing the concepts in the set of videos in their fine-tuning dataset.
- this is done by deduplication with k-NN in a text feature space.
- the details of the dataset pipeline are described under "Dataset" in this article.
(Dataset 2) they manually identify cinematic videos and, again manually, caption them to obtain a subset of high-quality video data
(Model averaging) they average model checkpoints obtained from SFT experiments that use various versions of fine-tuning data, hyperparameters, and pre-trained checkpoints.

Comments on model averaging

Model averaging has a rich history of improving generalization in machine learning: for instance, SWA (stochastic weight averaging), EMA (exponential moving average, with Adam), Model Soups, merging LoRA weights and its variants, and so on.

In modern deep learning, some of these approaches are still actively applied: for instance, EMA in diffusion models and LoRA weight merging for large language models. MovieGen demonstrates additional possibilities for leveraging these techniques. They average more "diverse" weights, obtained with different datasets, hyperparameters, and even pre-trained checkpoints.

In ensemble and Bayesian deep learning, the diversity of the ‘merged’ (or marginalized) models often leads to better generalization, such as models with weights in different basins of the weight space. I believe ‘diversity’ is also a key factor in model averaging, even though the mechanism of weight averaging is not identical to that in Bayesian methods.
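
To make the averaging step concrete, here is a minimal sketch of uniform (or weighted) checkpoint averaging. It is not MovieGen's actual procedure, which the report does not spell out at this level. The checkpoints are plain dictionaries of flat weight lists standing in for real state dicts, and the names `average_checkpoints`, `run_a`, `run_b`, `run_c` are made up for illustration.

```python
from typing import Dict, List, Optional

# A "checkpoint" here is just a name -> flat list of weights mapping,
# a stand-in for a real state dict (assumption for illustration).
Checkpoint = Dict[str, List[float]]


def average_checkpoints(
    checkpoints: List[Checkpoint],
    weights: Optional[List[float]] = None,
) -> Checkpoint:
    """Return the (weighted) element-wise average of several checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)  # uniform average by default
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)

    averaged: Checkpoint = {}
    for name, first in checkpoints[0].items():
        acc = [0.0] * len(first)
        for ckpt, w in zip(checkpoints, weights):
            if len(ckpt[name]) != len(first):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i, value in enumerate(ckpt[name]):
                acc[i] += w * value
        averaged[name] = [a / total for a in acc]
    return averaged


# Three hypothetical SFT runs sharing one architecture.
run_a = {"linear.weight": [1.0, 2.0], "linear.bias": [0.0]}
run_b = {"linear.weight": [3.0, 4.0], "linear.bias": [3.0]}
run_c = {"linear.weight": [2.0, 0.0], "linear.bias": [3.0]}
merged = average_checkpoints([run_a, run_b, run_c])
```

Averaging only makes sense when all checkpoints share the same parameter names and shapes, which is why the sketch raises on a mismatch instead of silently skipping.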

Learning-rate

unlike previous public T2I reports, they repeatedly adjust the learning rate during training.
- they reduce the learning rate by half at some point, which further reduces the loss.
- they decrease the learning rate whenever the validation loss plateaus
  - This kind of 'learning-rate' engineering used to be Deep Learning 101, but in modern training setups (such as LLMs and diffusion models), it is often overlooked. It serves as a reminder of the importance of this kind of 'basic engineering', even if it seems a bit tedious.
  - Additionally, the validation flow-matching loss is highly correlated with the quality of images or videos, according to reports from Stable Diffusion 3 and MovieGen.
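
The "halve it when the validation loss plateaus" rule can be sketched in a few lines. This is an illustration under my own assumptions, not the report's exact recipe: the class name `HalveOnPlateau`, the `patience` and `min_delta` values, and the loss numbers are all made up.

```python
class HalveOnPlateau:
    """Halve the learning rate when the validation loss stops improving."""

    def __init__(self, lr: float, patience: int = 2,
                 min_delta: float = 1e-4, factor: float = 0.5):
        self.lr = lr
        self.patience = patience    # evaluations to wait without improvement
        self.min_delta = min_delta  # smallest drop that counts as progress
        self.factor = factor        # 0.5 means "reduce by half"
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> float:
        """Record one validation loss and return the (possibly reduced) lr."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.lr *= self.factor
                self.bad_evals = 0
        return self.lr


scheduler = HalveOnPlateau(lr=1e-4, patience=2)
val_losses = [0.90, 0.80, 0.75, 0.75, 0.75, 0.70, 0.70, 0.70]  # made-up curve
lr_trace = [scheduler.step(loss) for loss in val_losses]
```

In PyTorch the same behaviour is available off the shelf as `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.5`; the hand-rolled version just makes the bookkeeping visible.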

Inference

they rewrite the input text prompt at inference time
- they fine-tune a separate 8B LLaMa3 model specialized in prompt rewriting, with a human in the loop (generate rewrites with LLaMa3 70B, then select the high-quality rewrite pairs)
they use a "linear-quadratic t-schedule" focusing on the initial timesteps of the flow-matching ODE, since the difference between model input and output is high at the initial steps (they might be trying to balance it)
- (Q) is this a common phenomenon of flow-matching models?
- if not, we could design a timestep schedule better suited to our own model, one that concentrates steps where the variation is greatest.
- this linear-quadratic schedule can significantly reduce the number of steps needed to generate high-quality videos.
(Observation 1) they found that a simple Euler ODE solver outperforms higher-order solvers.
(Observation 2) Video generation is more sensitive to the number of inference steps than image generation (the more steps, the more significant the improvement)
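
A rough sketch of how such a schedule and the Euler solver fit together is below. The construction is my own reading of "linear first, quadratic after", not the report's exact formula: the function names, the 50/25/1000 step counts, and the toy straight-line velocity field (a stand-in for the trained model) are all assumptions.

```python
from typing import Callable, List


def linear_quadratic_schedule(num_steps: int, num_linear: int,
                              ref_steps: int) -> List[float]:
    """Timesteps in [0, 1]: fine linear steps first, then quadratic spacing.

    The first `num_linear` steps copy the start of a `ref_steps`-step linear
    schedule; the remaining steps grow quadratically until t = 1.
    """
    if not 0 < num_linear < num_steps <= ref_steps:
        raise ValueError("expected 0 < num_linear < num_steps <= ref_steps")
    linear = [i / ref_steps for i in range(num_linear + 1)]
    t_switch = linear[-1]
    num_quad = num_steps - num_linear
    quadratic = [t_switch + (1.0 - t_switch) * (j / num_quad) ** 2
                 for j in range(1, num_quad + 1)]
    return linear + quadratic


def euler_sample(velocity: Callable[[List[float], float], List[float]],
                 x0: List[float], timesteps: List[float]) -> List[float]:
    """Integrate dx/dt = velocity(x, t) with first-order Euler steps."""
    x = list(x0)
    for t_now, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = velocity(x, t_now)
        dt = t_next - t_now
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the trained model: a straight-line flow from noise to data.
noise, data = [0.5, -1.0], [2.0, 3.0]


def toy_velocity(x: List[float], t: float) -> List[float]:
    return [d - n for d, n in zip(data, noise)]


ts = linear_quadratic_schedule(num_steps=50, num_linear=25, ref_steps=1000)
sample = euler_sample(toy_velocity, noise, ts)
```

The point of the shape is visible in `ts`: the step size near t = 0 is tiny and grows toward t = 1, so most of the solver's effort lands on the initial timesteps.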

To conclude

We have distilled the essential components of MovieGen, focusing on its core text-to-video backbone model. While it’s clear that training such a massive video generation model is not feasible for us, even with full knowledge of its underlying mechanisms (and I believe we don’t have all the details), my goal was to gain meaningful insights into the creation of text-to-something foundation models and to learn from their trial and error.

The original technical reports provide detailed applications of the text-to-video backbone, including personalized generation, video editing through instructions, and even sound generation corresponding to the videos. For more amazing results, check out the original paper and their youtube playlist.

[Reels] LCM-Lookahead for Encoder-based Text-to-Image Personalization

BayesianBacteria — Tue, 15 Oct 2024 21:55:11 +0900

Summary

("Lookahead-loss") During encoder training for models like IP-Adapter, they use a single-pass generation model, such as a consistency model, to produce a one-step preview image from the noisy input, which is then compared with the reference image to compute an additional loss (the LCM-lookahead loss).
- However, the LCM model can end up optimizing the loss regardless of the input (z_rt below), which could break the alignment between the model and its intended task. To prevent this and regularize the LCM's generation capability, they randomly adjust the LoRA scale within a specific range.
(Additional reference image encoding through KV value) The Key-Value (KV) pairs of the noisy reference image are computed (using a duplicated Denoising network) and concatenated with the KV pairs of the target image (the main image we’re generating), as shown on the left side of figure 3.
- The KV Encoder consists of parameters copied from the original UNet.
- If the KV Encoder is completely frozen, it can cause “excessive appearance transfer” or a “loss of editing capability,” so it's made trainable using LoRA (a bit unclear here).
(Synthetic data generation for personalization) Inspired by the mode-collapse issue in SDXL-Turbo (e.g., similar prompts generating very similar images, leading to repeated faces, etc.), they exploit it to generate multiple images of the same identity, using around 500k images for this purpose.

Look-ahead loss

architecture design with KV encoder
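
The KV-sharing idea from the summary can be sketched as plain attention in which the target queries see the reference keys/values as extra tokens. This is only a single-head toy in dependency-free Python, not the paper's code: the function names and all numbers are made up, and the LoRA-tuned KV encoder that would produce `k_ref`/`v_ref` is left out.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention on lists of token vectors."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for query in q:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in k]
        probs = softmax(scores)
        out.append([sum(p * val[j] for p, val in zip(probs, v))
                    for j in range(len(v[0]))])
    return out


def attention_with_reference(q: Matrix, k_tgt: Matrix, v_tgt: Matrix,
                             k_ref: Matrix, v_ref: Matrix) -> Matrix:
    """Target queries attend over target AND reference keys/values."""
    return attention(q, k_tgt + k_ref, v_tgt + v_ref)


# Two target tokens, one reference token (all values are made up).
q = [[1.0, 0.0], [0.0, 1.0]]
k_tgt, v_tgt = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
k_ref, v_ref = [[5.0, 5.0]], [[9.0, 9.0]]

plain = attention(q, k_tgt, v_tgt)
mixed = attention_with_reference(q, k_tgt, v_tgt, k_ref, v_ref)
```

The output keeps one row per target token; only the pool of keys/values grows, which is how the reference identity can leak into the generated image without changing the sequence length of the target.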

Key Insight

Single-pass models (consistency models, progressive distillation, etc.) distilled from the original model can be used to apply an "image-space loss".
- the distilled model and the original model are "aligned": they generate similar outputs given an identical prompt and initial noise.
- Here the target task was personalization, but I believe the approach could serve other tasks that need an image-space loss (for instance, an aesthetic score)
The key-values of the noisy reference image carry the features of the target identity.
Diffusion models trained with a discriminator (e.g. SDXL-Turbo) may suffer from mode collapse, and that can be useful for generating synthetic personalization data (with a consistent identity).

[Reels] Imagen yourself

BayesianBacteria — Tue, 15 Oct 2024 21:49:02 +0900

Link, personalized text-to-image generation by Meta AI

An IP-Adapter-like architecture (IP-Adapter being almost the standard design choice for tuning-free personalization), but improved.

Insights

They create an "architecture tailored for personalized image generation"

they design a kind of improved IP-Adapter for "any-personality" generation.
- therefore, the model does not need to be trained for a new subject, unlike LoRA or DreamBooth.
- meanwhile, other "any-personality generation models" can show strong over-fitting behavior, such as a copy-paste effect toward the reference image. (this is addressed by the synthetic pair dataset below)

Synthetic pair dataset for personalization task

a limitation of existing personalization methods is the "copy-paste" effect: the generated image looks nearly identical to the given reference image.
- it means that the generated image "does not follow" the given prompt.
to resolve this issue, the authors propose a synthetic-data pipeline that gathers several real and synthetic images for one identity.
- sadly, the details of the pipeline are not included (such as how the synthetic "personalized images" are generated)

Rationale in their text encoders

(common space between image and text) CLIP
(encoding “characters”) ByT5: Byte-Level (Character-level) T5 architecture. (might improve the “text image” generation—for instance, the sign of “moreh is cool”)
(Comprehending long and intricate text prompts) UL2: “improved T5”

Limitations

only applicable to models with cross-attention text conditioning.
- at least, judging from the architecture proposed in the paper
- to apply it to an SD3-like architecture (without cross-attention), it would need to be adjusted

Vision transformers need registers

BayesianBacteria — Fri, 23 Aug 2024 15:25:24 +0900

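This post sits somewhere between an Academic-reels entry and a feature piece... not quite a short, but not a 10-minute video either. Think of it as an organized summary.
This is a very scientifically well-written paper. Its review scores are very high. They set up an interesting hypothesis, made good observations that support it, and proposed a simple-but-effective method accordingly. The writing itself also seems very good.
5 stars.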

FAIR, Meta, ICLR 2024 Oral

Objective and motivation

(Our objective) If you visualize the last attention layer of a vision transformer, you see "abnormal patches" like the ones above (semantically not important, but strongly attended by other patches). We want the attention score to be high only on "semantically meaningful" patches, as on the right.
(Doubtful point) But why do abnormal patches arise in the first place? What are they?

Observation: the problem of the abnormal patches

(Observation 1) When you use the attention maps of self-supervised models to solve problems such as object detection, object discovery, and segmentation (the LOST method), "DINO" does well among the models above.
(Observation 2) In contrast, DINOv2, developed after DINO, outperforms DINO on other tasks, yet its attention maps are poor as shown above, so attention-map-driven methods (like LOST) cannot be used with it.
- on closer inspection, this problem is a general phenomenon that shows up in many other ViTs as well

Why does this happen? And how can it be resolved? Would the model work better once it is? Observations 3 and 4 below first address why this happens.

(Observation 3) In the attention heatmaps above, the "abnormal patches" have norms more than 10x higher than the other patches. This phenomenon arises in the middle layers of a ViT, and when a sufficiently large ViT has been trained for a sufficiently long time.
(Observation 4: Strong clue what the abnormal patch is) Attaching a linear layer to those patches performs classification very well (compared with attaching it to other patches).
- → Interpretation: the ViT stores the image's "global information" in tokens that are not semantically meaningful. Hence their attention scores are high, and downstream tasks that use only those tokens also work well.

Take a closer look at the problem

"Artifacts" in the local features of DINOv2

Definition of the -Artifacts-: they are high-norm outlier tokens

As in figure 3 above, they are high-norm outlier patches. Being an "outlier" here means a patch that contributes little to the semantics of the image (like the white background in the figure above).
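
A quick way to hunt for such artifacts in your own model is to compare each token's norm against the typical norm. The sketch below flags tokens whose L2 norm is more than 10x the median; the 10x figure follows the observation above, but using the median as the reference point is my own assumption (not the paper's exact criterion), and the embeddings are made up.

```python
import math
from statistics import median
from typing import List


def token_norms(tokens: List[List[float]]) -> List[float]:
    """L2 norm of every token embedding."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_outlier_tokens(tokens: List[List[float]],
                        ratio: float = 10.0) -> List[int]:
    """Indices of tokens whose norm exceeds `ratio` times the median norm."""
    norms = token_norms(tokens)
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Five made-up patch embeddings; index 3 plays the role of an artifact.
patch_tokens = [
    [0.3, -0.4],
    [0.5, 0.0],
    [-0.3, 0.4],
    [30.0, 40.0],
    [0.0, 0.5],
]
outliers = find_outlier_tokens(patch_tokens)
```

On real features you would run this on the output tokens of a chosen layer and overlay the flagged indices on the image grid to see whether they land on background-like regions.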

Below, the paper presents several pieces of evidence for the hypothesis that these outlier patches, or artifacts, actually carry the global information.

(Evidence 1) Outliers appear during the training of large models

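Outliers appear only after more than 1/3 of training
Outliers appear in layers past the middle
Outliers increase only once the model is large enough

Small or under-trained models can be interpreted as not having learned global information at all. Conversely, these high-norm tokens become more frequent in larger models, longer-trained models, and upper layers, all of which can be said to "hold more information", and that suggests a link between global information and outlier patches.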

(Evidence 2) High norm tokens appear where patch information is redundant & High norm tokens hold little local information

Outlier patches all look alike and meaningless... backgrounds and the like.
Accordingly, the cosine similarity among them is also high.
Also, because these outlier patches hold only global information instead of local information, they do poorly on both tasks that need local information: predicting where the patch is located, and reconstructing its pixels (high errors and poor accuracy in the above figure-5 (b)).

(Evidence 3) Artifacts can resolve classification problem, maybe due to their global information-carrying

Solving classification and similar tasks with outlier patches alone works better than with normal patches alone.

Hypothesis and remediation

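The paper builds the following hypothesis throughout, and the evidence above does shed some light on it.

(Hypothesis) A sufficiently large, sufficiently long-trained ViT uses certain unneeded tokens to store, process, and retrieve global information

So how should we actually put this to use? Is there a way to exploit this global information while keeping the attention maps clean, which was the original goal? The paper offers register tokens as a simple answer.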

Remediation: the register tokens

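Define register tokens, which are genuinely unneeded tokens, and expect them to take on the global information instead (in addition to the CLS token).
- the register tokens are dropped from the final output
The structure is said to be similar to the memory transformer... but I don't know the memory transformer well. It was reportedly used for translation.
- (Opinion 1) If it was used for seq2seq generation... a daydream that diffusion might need something like this too (it is seq2seq after all...?)
- (Opinion 2) In T2I diffusion, if we jam the text embedding into the register tokens, couldn't we do better text conditioning on the global information?... -> though using CLIP's pooled embedding already serves a similar function.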

Experiments

Below, the paper verifies whether using register tokens actually produces the expected effects (whether the outliers disappear, whether the attention maps become normal, etc.).

Verification: the outliers really do disappear

(maybe) side-effect: downstream tasks get a bit better

The paper's claim was not that meaningful attention maps without outlier patches lead to better performance, so... this reads as a kind of side effect ~~(surprisingly, this was not the major contribution. Did a reviewer push for it?)~~
The large model (DINOv2) seems to improve a bit more.

Effect of the number of register tokens

It is odd, though, that depth estimation and segmentation, where local seems to matter more than global, need fewer registers... while ImageNet, where global information seems to matter more, needs more tokens.

Object discovery via attention map is now working.

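This is the part that resolves the degradation of attention-map-based downstream tasks, which the paper raised at the very beginning.
In particular, DINOv2, the large model where these outliers occur severely, shows a fairly large improvement.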

attention map of the register token

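Interestingly, looking at the attention maps between individual register tokens and the other patches, each register token seems to have some "object" it is responsible for: reg0 covers the overall edges, reg6 the caramel, reg8 the spoon, and reg12 the texture of the coffee. One can expect these to help interpretability as well.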

Discussion point: How about the registers for DiT?

Here I briefly discuss what would happen if register tokens were applied to a diffusion transformer.

DINOv2 and CLIP share the model structure with DiT, but their objectives are completely different. Is it still fine to apply register tokens to CLIP?
- CLIP: text-image contrastive learning
- DINOv2: aligning randomly-cropped images; patch masking; and other regularization terms
- however, at least in the experiments above, abnormal patches also exist in CLIP (OpenCLIP), and register tokens can be seen to remove them.
  - but what effect that has on downstream tasks is unknown. That is, it is unclear whether the improved attention map actually helps the downstream task (feature embedding for image generation).
Such global features arise in the self-supervised learning regime; could it be different for diffusion?
- the memory transformer, which used this dummy-token concept, applied a register-token-like method to a translation task (I don't know whether it was for the same problem). Thinking very, very naively, couldn't it then also be applied to diffusion, which is likewise a generative task?
- restoring masked patches is also similar to diffusion in the sense of "restoring the input".

Example of the super-easy implementation

The actual implementation is simple: add the register tokens as a trainable parameter and concatenate them to the input of the existing transformer blocks, as below.

The example below shows this in an MMDiT model class for stable-diffusion 3 (not the official code; my own implementation).

In __init__(), define self.register_tokens, and
in forward(), concat the register tokens along the sequence dimension onto emb_x, the transformer's input sequence, then drop that part at the very end.

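from typing import List

import torch
import torch.nn as nn

# TextConditionModule, LatentPatchModule, TextTimeEmbedding, MMDiTBlock and
# get_2d_sincos_pos_embed are assumed to be defined elsewhere in the same codebase.

class MMDiT(nn.Module):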
    def __init__(self, config):
        super().__init__()
        self.patch_size = config.patch_size
        self.h, self.w = config.height, config.width
        self.register_token_num = config.register_token_num

        # Embedding for (1) text; (2) input image; (3) time 
        self.text_cond = TextConditionModule(config.text_emb_size, config.hidden_size)
        self.patching = LatentPatchModule(config.patch_size, config.hidden_size)
        self.time_emb = TextTimeEmbedding(config.time_embed_size, config.pooled_text_size, config.cond_size)
        
        self.mmdit_blocks = nn.ModuleList(
            [MMDiTBlock(config.hidden_size, config.time_embed_size, config.attn_embed_size, config.mlp_dim, config) for layer_idx in range(config.num_layers)]
        )
        
        self.final_linear = nn.Linear(config.hidden_size, config.out_channel)
        self.modulation = nn.Linear(config.cond_size, 2)
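        # registered as a non-persistent buffer so it follows .to(device); cast to float32 to match the patch embeddings
        self.register_buffer("pos_embed", torch.from_numpy(get_2d_sincos_pos_embed(config.hidden_size, (self.h//self.patch_size, self.w//self.patch_size))).float(), persistent=False)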
        self.register_tokens = nn.Parameter(torch.zeros(1, self.register_token_num, config.hidden_size))

    def forward(self, latent, t, text_embs: List[torch.Tensor], pooled_text_embs):
        """
            latent (torch.Tensor)
            t (torch.Tensor)
            text_embs (List[torch.Tensor])
            pooled_text_embs (torch.Tensor)
        """
        emb_c = self.text_cond(*text_embs) # (N, L, D)
        emb_t = self.time_emb(pooled_text_embs, t) # (N, D)
        emb_x = self.patching(latent) + self.pos_embed # (N, T, D), where T = H*W / (patch_size ** 2)

        # additional "register" tokens, to convey the global information
        # see https://openreview.net/forum?id=2dnO3LLiJ1
        emb_x = torch.cat((self.register_tokens.expand(emb_x.shape[0], -1, -1), emb_x), dim=1)
        

        for block in self.mmdit_blocks:
            emb_x, emb_c = block(emb_x, emb_c, emb_t)

        # remove register token for the output layer
        emb_x = emb_x[:,self.register_token_num:] 

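        # nn.Linear takes a single input; assumed intent: one (scale, shift) pair per sample, predicted from emb_t
        scale, shift = self.modulation(emb_t).unsqueeze(1).chunk(2, dim=-1) # each (N, 1, 1)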
        emb_x = self.final_linear(scale*emb_x + shift) # (N, T, patch_size**2 * out_channels)
        return self.patching.unpatchify(emb_x) # (N, out_channels, H, W)

(3.8/5.0) 몽중식: a shop that "really" peddles experience

BayesianBacteria — Tue, 20 Aug 2024 23:26:03 +0900

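Experience peddling

Honestly, these days, when I wander around shops beyond my own neighborhood or office area, whether restaurants or passing general stores (to put it very old-fashionedly), many of them make me think they have drifted away from the essence of the shop (tasty food, useful or pretty things) and are simply selling play or experiences. Well, everyone has probably thought this at least once...

Pop-up stores in Seongsu-dong that you walk into out of curiosity even though there is nothing you would actually buy; trinket shops where you look at cute things, say "wow, that's really cute", and turn away without buying (I'm genuinely curious about this: I'd like to measure the actual purchase rate at trinket shops against other kinds of business); fine dining you visit to taste something unfamiliar and put on a bit of style, even though the 짜계치 (Chapagetti with egg and cheese) you cook at a PC bang actually tastes better (at least for my unrefined palate; I can't speak for others). All of these sometimes look like "experience peddling".

But so what if it is experience peddling? As long as it delivers a quality experience and makes for good entertainment, that is all that matters. In that sense, 몽중식's experience peddling is excellent. It is, so far, the first dining place (is that even the right thing to call it?) that I have revisited 3 times.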

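So, what experience do they sell?

First of all, 몽중식 is a restaurant. Naturally, it sells food. Like typical fine dining, the menu changes every season, and it is a course with no choice of dishes. It is based on Chinese cuisine, and you can add a kaoliang liquor pairing. So what, beyond the experience peddling of ordinary fine dining, makes 몽중식 a real experience peddler?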

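몽중식 is a restaurant with a theme, and it sells that theme. One theme is a quintuple:

theme_key = ('movie', 'food', 'storytelling', 'dress code', 'props')

Of these, the movie determines the theme's identity. Each season they pick a movie (mostly from the Chinese-speaking world); the dishes are served following the movie's storyline (each dish is tied to a particular scene or character); a dress code related to the movie is suggested; and during the meal a storyteller (is that what they call them? anyway, there really is a person in charge of this) explains the flow of the movie. When you first sit down, props related to the movie and a set of "picture cards" explaining the movie's story and the dish matching each part are laid out at your seat. A card is flipped each time a course passes.

Things that count as 'props'. Ahn Ok-yun's glasses and gun from the movie '암살' (Assassination), plus some vaguely enlightenment-era objects and a glasses case.

Things that count as 'dress code'. The actual dress code was Korean beauty, but we ignored it and dressed up as the characters on our own. The mustache is drawn on.

(Spoiler alert) The dish symbolizing Yeom Seok-jin (played by Lee Jung-jae) in Assassination. Something about him having a pitch-black heart, or whatever. I half-listened and dug in happily.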

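Interesting, isn't it? There are a few other small delights too. Around the 3rd or 4th round of the kaoliang pairing (I don't remember exactly), everyone always has to pour kaoliang into the 몽비어 together (they brew beer, it turns out; it is 몽중식's own beer), that sort of thing...

The many experience elements threaded together by a single movie running through one theme are what make 몽중식 a real experience peddler.

...is what I said, but it is also perfectly fine to go lightheartedly just because it looks fun. Reservations were not exactly easy, though. And there are a lot of repeat visitors. A lot... Midway they ask first-timers to raise their hands, and there aren't many.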

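It's a restaurant, but is the food good?

By my personal standard, it is hit-and-miss. I chase strong flavors, I expect a certain punch from Chinese food, and my palate is unrefined on top of that, so there are moments of "ah, this makes me think of malaxiangguo" or "ah, this makes me think of Buldak". It is varied, though. It usually runs 9 courses, and you often eat something you have never had anywhere else. And it usually lands at or above "edible" or "I think this is good".

An example of something novel but (to my taste only) not great. I don't remember the name; it was shiitake on top and something fried underneath. I don't even remember what was inside the fritter... I disliked it because (1) it was bland by my standard and (2) the shiitake was too hard. But still! It is faithful to the theme and the experience. It depicts the moment Ahn Ok-yun bought her glasses at the department store... meaning that shiitake is the glasses, crazy...

A liquor from the pairing. The name is, of all things, 혁명 (Revolution, 레볼루씨옹) soju. The red palette and that distinctive propaganda art style. Ah! Bring me the sickle and hammer. (Or so I thought, but the storyteller told me it actually has little to do with that.)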

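Will you go back?

If there is an interesting movie theme, I'd like to. For an old Hong Kong movie theme, I want to wear a leather jacket and sunglasses, print 100-dollar bills on flash paper, and burn them. Experience augmentation, you might say.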

[Reels] Battle of the Backbones: A large-Scale Comparison of pre-trained models across computer vision tasks

BayesianBacteria — Tue, 20 Aug 2024 12:56:48 +0900

NeurIPS 2023, Arxiv Link

What they did

They propose "Battle of the Backbones" (BoB), benchmarking a diverse suite of (vision) pre-trained models, including classic ImageNet-trained CNNs and vision-language models (perhaps CLIP).

The target tasks are OOD-generalization, classification, object-detection and Image retrieval.

Key observations

with more than 1,500 training runs, they found that CNNs pretrained in a supervised fashion still perform best on most tasks.

IN stands for ImageNet, i.e. the CNNs trained with supervision

(My opinion 1) it is not surprising that the supervised pre-trained models (IN-nK in the above figure) work better on a "supervised-learning classification problem."
(My opinion 2) the recent attention to and success of the ViT architecture is not because the architecture is well-suited to vision tasks (in other words, not because it has a good inductive bias for vision), but because transformer-like architectures have been heavily optimized throughout the deep-learning community since the advent of LLMs.
ViTs are more sensitive to the amount of data and the number of parameters than CNNs
- (My opinion 3) this may suggest that the ViT architecture scales better than CNNs beyond some scale of data and model parameters. (In fact, the authors already discuss this in Sec 5.)
Monocular depth-estimation backbones achieve performance competitive with top conventional supervised and SSL backbones -> the authors claim this implies that depth estimation may serve as a powerful and generalizable primary or auxiliary pre-training task.
- (Q) Could a backbone pre-trained on depth estimation even work well on a generative task such as Stable Diffusion?? (Fortunately, the input and output dimensions match between the two tasks, unlike classification and others, so the model itself needs no modification.)

And there are many other insightful observations in the paper! Please check out the original paper.
