[Reels] Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
BayesianBacteria · 2024. 8. 20. 12:56 · NeurIPS 2023, Arxiv Link
What they did
- They propose "Battle of the Backbones" (BoB), a benchmark covering a diverse suite of pretrained vision models, including classic ImageNet-trained CNNs, self-supervised models, and vision-language models such as CLIP.
- The target tasks are classification, object detection, image retrieval, and OOD generalization.
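One common way to compare frozen pretrained backbones like those above (a simplified stand-in for the paper's full evaluation protocol, not the authors' exact setup) is a linear probe: extract features with each backbone, fit a linear classifier on them, and compare test accuracy. The sketch below uses synthetic random features as hypothetical stand-ins for two backbones' embeddings, so it runs without any model downloads:

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, l2=1e-3):
    """Fit a ridge-regression linear probe on frozen features; return test accuracy."""
    n_classes = train_y.max() + 1
    onehot = np.eye(n_classes)[train_y]  # (N, C) one-hot targets
    d = train_x.shape[1]
    # Closed-form ridge solution: W = (X^T X + l2*I)^{-1} X^T Y
    w = np.linalg.solve(train_x.T @ train_x + l2 * np.eye(d), train_x.T @ onehot)
    preds = (test_x @ w).argmax(axis=1)
    return (preds == test_y).mean()

# Synthetic stand-ins for features from two hypothetical frozen backbones.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=500)
# "Good" backbone: features carry class signal; "bad" backbone: pure noise.
good = rng.normal(size=(500, 64))
good[np.arange(500), labels] += 3.0
bad = rng.normal(size=(500, 64))

split = 400  # train/test split on the same 500 examples
for name, feats in [("good", good), ("bad", bad)]:
    acc = linear_probe_accuracy(feats[:split], labels[:split],
                                feats[split:], labels[split:])
    print(name, round(acc, 2))
```

The probe with informative features should score far above the 10% chance level, while the noise features should stay near it; a full benchmark like BoB additionally fine-tunes backbones end-to-end and spans many tasks and datasets.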
Key observations
- Across more than 1,500 training runs, they found that CNNs pretrained in a supervised fashion still perform best on most tasks.
- (My opinion 1) It is not surprising that models pretrained with supervision (IN-nK in the figure above) work better on supervised classification problems.
- (My opinion 2) The recent attention on, and success of, the ViT architecture may stem not from the architecture being especially well-suited to vision tasks (i.e., having a good inductive bias for vision), but from the fact that transformer-like architectures have been heavily optimized across the deep-learning community since the advent of LLMs.
- ViTs are more sensitive to the amount of data and the number of parameters than CNNs.
- (My opinion 3) This may suggest that the ViT architecture scales better than CNNs at the same scale of data and model parameters. (In fact, the authors already discuss this in Sec. 5.)
- Backbones pretrained on monocular depth estimation achieve performance competitive with top conventional supervised and SSL backbones -> the authors claim this implies depth estimation may serve as a powerful and generalizable primary or auxiliary pretraining task.
- (Q) Could a backbone pretrained on depth estimation also work well on generative tasks such as Stable Diffusion? (Conveniently, the input and output dimensions match between the two tasks, unlike for classification and other tasks, so the model itself would not need modification.)
And there are many other insightful observations in the paper! Please check out the original paper.