ESMfold: Language models of protein sequences at the scale of evolution enable accurate structure prediction

BayesianBacteria 2024. 1. 1. 02:03

This post was moved over from notes I wrote in Notion quite a while ago.

Contributions

Predicts protein structure from protein sequence data alone, through a simple language modeling task.
ESM-2 (up to 15B parameters) can predict structure at atomic resolution from a single protein sequence, and the authors observe that performance keeps improving as the language model grows (scalability).

Unlike AF2 and RoseTTAFold, which use templates from similar sequences for an initial structure and an MSA for co-evolutionary information, it relies only on the internal representations of the language model.
Compared with other deep-learning-based folding methods, ESM-2 offers (1) faster prediction and (2) strong performance on proteins with low similarity to other proteins (proteins with novel structures).

How to Achieve the Goal?

(Evaluation metric) Perplexity in NLP tasks
Perplexity is a metric used in language modeling tasks. It is built from the average log-likelihood of the next token given all of the preceding tokens (lower is better). More precisely, it is the exponential of the negative average log-likelihood: $$ \exp(-\frac{1}{N} \sum_i \log P(x_i|x_{<i})) $$
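To make the perplexity definition concrete, here is a minimal sketch that evaluates it from a list of per-token probabilities $P(x_i|x_{<i})$. The function name `perplexity` and the toy inputs are assumptions for illustration, not anything from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity from per-token probabilities P(x_i | x_{<i}).

    Computes exp(-(1/N) * sum_i log p_i), i.e. the exponential of the
    negative average log-likelihood.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_likelihood)


# Toy example: a model that spreads probability uniformly over the
# 20 standard amino acids at every position.
uniform_probs = [1 / 20] * 8
print(perplexity(uniform_probs))  # approximately 20
```

A uniform guess over 20 amino acids gives a perplexity of about 20, so perplexity can be read as the effective number of candidates the model is still choosing between at each position.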
(Evaluation metric) TM-score
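The notes do not spell out TM-score, so as a reminder: for a target of length $L$, the standard definition is $\frac{1}{L}\sum_i \frac{1}{1+(d_i/d_0)^2}$ with $d_0 = 1.24\sqrt[3]{L-15} - 1.8$, maximized over superpositions (higher is better, 1 is a perfect match). The sketch below only evaluates the score for one fixed superposition, assuming the per-residue distances $d_i$ are already given. The function name is hypothetical, and clamping $d_0$ at 0.5 for very short proteins is an assumption of this sketch.

```python
def tm_score_from_distances(distances, target_length):
    """TM-score for a single, fixed superposition.

    distances: distances (in angstroms) between aligned residue pairs
               after superposing the prediction onto the reference.
    target_length: number of residues in the reference structure.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    # Length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.
    d0 = 1.24 * max(target_length - 15, 0) ** (1.0 / 3.0) - 1.8
    # Assumption of this sketch: keep d0 positive for very short proteins.
    d0 = max(d0, 0.5)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# A perfect prediction (all distances zero) scores 1.
print(tm_score_from_distances([0.0] * 100, 100))
```

Unlike RMSD, each residue contributes at most $1/L$, so a few badly placed residues cannot dominate the score.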

(Main idea) For the model's "initial representation", ESM-2 adopts standard MLM pretraining from NLP, a fairly naive strategy compared with AF2's elaborate embedding. In biology, this naive MLM strategy has already been used successfully for secondary structure prediction, binding site prediction, and unsupervised contact prediction, and the paper appears to extend it to structure prediction. Among these, the strategy leans heavily on unsupervised contact prediction (predicting residue-to-residue distance from the transformer's attention maps) as an input to the model.

The authors from Meta AI (FAIR team) also list several factors behind the large improvement over the previous version, ESM-1b:

(1) Relative positional embeddings (AF2 already has this procedure)
(2) More parameters, based on scaling laws in the large language modeling field
(3) Model architecture, with the advent of transformers

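They go as far as to say "These modifications lead to a significantly better model", citing as evidence that the new model performs better even with roughly one quarter of the parameters of the previous version. Factors (2) and (3) are essentially approaches that have been used to solve NLP tasks. The authors justify attacking the problem with NLP methods by showing that (1) as perplexity, the NLP evaluation metric, improves (decreases) with model size, the gap in TM-score also widens considerably, and (2) in the ablation study, the language model makes the largest contribution.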

Method: Overview

Getting started: so what is different from AF2?

As mentioned above, the main difference from AF2 is that MSAs and templates are not used; they are replaced by the language model's representations:

“A key difference between ESMfold and AlphaFold2 is the use of language model representations to remove the need for explicit homologous sequences (in the form of an MSA) as input.”

Comparison of the "embedding (representation, feature engineering)" stage that precedes the transformer-based computation. ESM-2 uses a pretrained language model, whereas AF2 applies an elaborate set of differentiable operations to templates, MSAs, and so on, which is what makes end-to-end modeling possible.

The subsequent Evoformer (the folding trunk in ESM-2) and structure module do not differ much from those of AF2.

Connection to the experimental results

(+) In the end, the main idea is that the language model's representations can replace evolutionary information. To support this, the authors show experimentally that the language modeling metric (perplexity) correlates with the protein structure prediction metric (TM-score).

(+) In addition, the ablation study measures performance while shrinking the folding trunk (AF2's Evoformer) and the language model, and shows that reducing the size of the language model has the largest effect. → This suggests that the language model makes the largest contribution.

Ablation study results for ESM-2. It seems natural that the folding blocks have a large effect (imagine AF2 without the Evoformer). Still, the largest drop here comes from removing the language model.

Method: Details

Dataset

Dataset acquisition and sequence preprocessing (mainly bioinformatics procedures) are not covered here.

The ESM-2 training procedure basically follows that of AF2, with a few differences and peculiarities.

Sequences that have identical residues are excluded.
At training time, batches are built so that clusters are sampled evenly (AF2 clusters sequences for the computational efficiency of its MSA). → This can be seen as taking a method that was used to build MSAs and repurposing it to build batches.
Predicted sequence-structure pairs obtained from the inverse folding model [Hsu et al., 2022 ICML] are used as data. → This is also work by FAIR, Meta AI researchers... how much do the results actually change depending on whether this data is used? (See the ablation study figure above.)
- However, the data are filtered to keep only those with lDDT above 70.
- lDDT (Local Distance Difference Test): “In CASP9, the local Distance Difference Test (lDDT) score was introduced, assessing how well local atomic interactions in the reference protein structure are reproduced in the prediction”. Roughly speaking, it plays the same role as RMSD, a structural distance between proteins.
- Interestingly, the ESM-2 model also "predicts" this lDDT per residue, and residues predicted at 70 or below are excluded from back-propagation. The authors note that this is essential for getting an improvement from predicted structures.
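The per-residue lDDT cutoff in the last item can be pictured as a mask on the loss: residues at or below the threshold contribute nothing, so no gradient flows back through them. The sketch below is a toy illustration in plain Python. The function name, the per-residue loss values, and the choice to average over the kept residues are all assumptions, not the paper's implementation.

```python
def masked_mean_loss(per_residue_loss, predicted_lddt, threshold=70.0):
    """Average the loss over residues whose predicted lDDT exceeds threshold.

    Residues with lDDT <= threshold are dropped from the sum, which in an
    autodiff framework means they receive zero gradient.
    """
    if len(per_residue_loss) != len(predicted_lddt):
        raise ValueError("loss and lDDT lists must have the same length")
    kept = [
        loss
        for loss, lddt in zip(per_residue_loss, predicted_lddt)
        if lddt > threshold
    ]
    if not kept:
        return 0.0  # nothing confident enough to train on
    return sum(kept) / len(kept)


# Toy values: the second residue (lDDT 50) is ignored.
print(masked_mean_loss([1.0, 2.0, 3.0], [90.0, 50.0, 71.0]))  # 2.0
```

In a real training loop the same effect would typically come from multiplying a per-residue loss tensor by a 0/1 mask before reduction.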

Architecture: what is the difference?

Broadly, ESM-2 follows the same Evoformer → structure module flow as AF2, and the architectural differences are small. In AF2's Evoformer, row- and column-wise attention is used to process the two-dimensional MSA input; here it is replaced with standard attention. In addition, the slot where the template's pair-wise representation would go is filled with a contact map predicted from the language model's attention heads.

To summarize, the main differences from AF2 are:

(1) Row- and column-wise attention for the MSA is replaced with standard attention
(2) The computation for templates is applied to the predicted contact map instead

The largest share of ESM-2's method section is devoted to item (2), that is, to the question of how the "contact map" is obtained from the language model. This is described below.

Language Models

Unsupervised Contact Prediction

The biggest difference between ESM-2 and AF2 is that the initial structure (the contact map) is inferred by a language model instead of being taken from templates. As the section title suggests, ESM-2's contact prediction (predicting distances between residues) is produced without any supervision. It builds on the work of Rao et al., 2021 (Meta AI). Concretely, the contact map is derived from the attention maps of a language model trained on the MLM task, and it becomes the input to ESM-2's folding trunk (the counterpart of AF2's pair representation).

Figure 1 of Transformer Protein Language Models are Unsupervised Structure Learners. After MLM, the contact map is inferred from the attention maps.

The following describes how the attention map is converted into a contact map.

Attention map → Contact map

The attention maps of an MLM-pretrained protein language model are converted into a contact map through the following steps.

(Step 1) Symmetrize - making a symmetric matrix

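By definition, a contact map is a symmetric matrix. The given attention map is therefore first converted into a symmetric matrix, which is usually done as follows.

Let M be a square matrix (attention map). M splits into a symmetric part and an antisymmetric part,

$$ M=\frac{M+M^T}{2} + \frac{M-M^T}{2} $$

and only the symmetric part is kept:

$$ M_{sym}=\frac{M+M^T}{2} $$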
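A minimal sketch of the symmetrization step, taking the symmetric part $(M+M^T)/2$ of a square matrix. It uses plain nested lists to stay dependency-free; the function name `symmetrize` and the toy attention values are assumptions for illustration.

```python
def symmetrize(matrix):
    """Return the symmetric part (M + M^T) / 2 of a square matrix."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("expected a square matrix")
    return [
        [(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)]
        for i in range(n)
    ]


# Toy attention map: rows sum to 1, but the matrix is not symmetric.
attention = [
    [0.50, 0.25, 0.25],
    [0.75, 0.125, 0.125],
    [0.00, 0.50, 0.50],
]
sym = symmetrize(attention)
print(sym[0][1], sym[1][0])  # both 0.5
```

Attention from residue i to j generally differs from attention from j to i, so averaging the two directions is what makes the result usable as a contact score for the unordered pair.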

(Step 2) Average Product Correction (APC) - reducing background noise

APC is a classical technique in protein contact prediction. Given an L × L coupling matrix F, APC is defined as

$$ F^{APC}_{ij} = F_{ij} - \frac{F_i F_j}{F} $$

where $F_i$, $F_j$, and $F$ are the sum over the i-th row, j-th column, and the full matrix respectively.
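The APC formula, $F^{APC}_{ij} = F_{ij} - F_i F_j / F$ with $F_i$ the i-th row sum, $F_j$ the j-th column sum, and $F$ the total sum, can be sketched in a few lines. This is a plain-Python illustration with a hypothetical function name `apc`, not the reference implementation.

```python
def apc(matrix):
    """Average Product Correction: F_ij - (row_sum_i * col_sum_j) / total."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("expected a square matrix")
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_sums)
    if total == 0:
        raise ValueError("matrix sums to zero; APC is undefined")
    return [
        [matrix[i][j] - row_sums[i] * col_sums[j] / total for j in range(n)]
        for i in range(n)
    ]


# A rank-one "background" matrix (outer product of [1, 2] with itself)
# is removed entirely by the correction.
background = [[1.0, 2.0], [2.0, 4.0]]
print(apc(background))  # [[0.0, 0.0], [0.0, 0.0]]
```

The example shows the intent of the correction: signal that can be explained purely by how active each row and column is overall gets subtracted, leaving only pair-specific signal.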

(Step 3) Logistic regression for contact map - which attention heads are informative?

Once the attention maps of all heads have gone through operations (1) and (2), we need to decide which heads actually carry structural information. Under the assumption that the true structural information may be a linear combination of several attention maps, the linear coefficients are optimized using a small number (n < 20) of known structures.

Let $c_{ij} \in [0,1]$ denote the contact between the i-th and j-th amino acids, and let $\alpha^{kl}_{ij}$ denote the attention value between the i-th and j-th amino acids in the k-th attention head of the l-th attention layer. The predicted contact at position (i, j) is then: $$ p(c_{ij})= (1+\exp(-\beta_0 -\sum^L_{l=1}\sum^K_{k=1} \beta_{kl} \alpha^{kl}_{ij}))^{-1} $$
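The sketch below evaluates the logistic regression formula, $p(c_{ij}) = (1+\exp(-\beta_0 - \sum_l \sum_k \beta_{kl}\,\alpha^{kl}_{ij}))^{-1}$, for a single residue pair. Fitting the coefficients is not shown. The function name, the `attention[l][k][i][j]` layout, and every numeric value are made-up placeholders for illustration.

```python
import math


def contact_probability(attention, beta, beta0, i, j):
    """Logistic combination of attention heads for residue pair (i, j).

    attention[l][k] is the processed attention map of head k in layer l,
    and beta[l][k] is the coefficient for that head.
    """
    logit = beta0
    for layer_maps, layer_betas in zip(attention, beta):
        for head_map, coefficient in zip(layer_maps, layer_betas):
            logit += coefficient * head_map[i][j]
    return 1.0 / (1.0 + math.exp(-logit))


# Toy setup: 1 layer, 2 heads, 2 residues.
attention = [[
    [[0.0, 1.0], [1.0, 0.0]],  # head 0
    [[0.0, 2.0], [2.0, 0.0]],  # head 1
]]
beta = [[1.0, -0.5]]  # one coefficient per (layer, head)

# 1.0 * 1.0 + (-0.5) * 2.0 = 0, so the predicted probability is 0.5.
print(contact_probability(attention, beta, beta0=0.0, i=0, j=1))
```

Heads whose coefficient ends up near zero are effectively ignored, which is how the regression answers the question of which heads are informative.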

(Note) Strictly speaking, one cannot say that (1) ESM-2 makes no use of the concept of a template at all, or that (2) the contact map is learned in an unsupervised manner, because deciding which layers and attention heads to combine requires supervision from structural information (the labels). Nevertheless, thanks to the nature of a linear model, the amount of structural information needed is very small.
