24.08.23 Language models for biological research: a primer, Nat. Methods (2024)
spnz3 2024. 8. 29. 10:25
Paper
Language models for biological research: a primer
https://www.nature.com/articles/s41592-024-02354-y
Why I read it:
I've recently been getting a lot of use out of ChatGPT and Claude. Models like ChatGPT are large language models (LLMs), and I was intrigued by the idea that LLMs could be extremely useful for biological and medical research, so I read this paper.
As an undergraduate I kept wondering how the complexity of living things is possible, and I thought the core answer was the essentially unlimited number of combinations of the bases A, C, T and G. Similarly, human language is built from endless combinations of letters and words, and I thought this is closely tied to our remarkable creative ability. That creativity seems to reach its peak in computing, where the endless combinations of 0s and 1s in computer languages are used to create an endless variety of programs.
I think I have always been drawn to topics like these: information and language.
What I find interesting about LLMs is that they use computer language to interpret and make use of both biological language and everyday human language.
Summary / parts I found interesting
Contents
- LLMs & Natural language models
- Biological language models
- Protein language models
- Single-cell language models
- Multimodal language models for biology
- Best practices when using language models for biology
- Limitations
LLMs & Natural language models
LLMs can handle not only natural language but also biological language.
Owing to this flexibility, language models are often foundation models that enable broad downstream applications. Language models are not limited to natural language (for example, English); they can also process biological language, which consists of sequences of biological entities, such as amino acids3 or genes4.
An LLM can be applied to any data that is sequential. For example, single-cell expression data can be converted into a sequence format and then fed to an LLM. > It's interesting that this applies to more kinds of data than I expected. It also made me think about which other data types might be worth converting into sequences for an LLM.
Language models can be applied to any sequential data, whether the basic unit of the sequence, called a token, is a word in a sentence or an amino acid in a protein. Although sentences and proteins are naturally sequential, other types of biological data can be formulated as sequences. For example, single-cell gene expression data, which are not typically represented as sequences, can be formulated sequentially by creating a sequence in which genes appear in the order of their RNA expression levels in a cell. By viewing each single cell as a sequence of genes, a biological language model can then use these sequences as input to model single-cell RNA expression levels between cells.
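To make the rank-based formulation concrete, here is a minimal Python sketch; the gene names and expression values are invented for illustration, and real pipelines do more preprocessing (normalization, vocabulary filtering) before ranking.

```python
# Toy expression vector for one cell (gene -> normalized RNA expression).
# Gene names and values are invented for illustration only.
cell = {"GAPDH": 12.1, "CD3E": 8.4, "MS4A1": 0.0, "NKG7": 5.2, "LYZ": 9.7}

# Rank-value encoding: order genes by decreasing expression and keep only
# expressed genes, turning the cell into a "sentence" whose tokens are genes.
ranked_genes = [g for g, x in sorted(cell.items(), key=lambda kv: -kv[1]) if x > 0]
print(ranked_genes)  # ['GAPDH', 'LYZ', 'CD3E', 'NKG7']
```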
A mention of alternative modalities; I'm not sure exactly what this refers to.
Additionally, natural language models can be augmented with data from alternative modalities, such as images or gene sequences, to form multimodal models6 that can provide insight into various forms of biological entities.
Biological language models
Protein language models
During training, random subsets of amino acids in each sequence are replaced with fake ‘mask’ amino acids, and the model predicts the original amino acids that were masked. By learning to accurately predict which amino acids fit into a given sequence context, the models learn the patterns and constraints that govern protein structure and function.
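As a toy illustration of this masked-prediction objective (not the actual ESM training code), the sketch below corrupts a protein sequence and records the residues the model would be trained to recover:

```python
import random

def mask_sequence(seq, mask_rate=0.15, mask_token="<mask>"):
    """Replace a random subset of residues with a mask token.

    Returns the corrupted sequence (model input) and the positions/residues
    the model must predict (training targets).
    """
    tokens = list(seq)
    targets = {}
    for i, aa in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = aa          # original residue = prediction target
            tokens[i] = mask_token   # corrupted input seen by the model
    return tokens, targets

# Toy protein sequence; the language model is trained to recover `targets` from `corrupted`.
corrupted, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(targets)
```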
Application: direct prediction
Therefore, these predictions can be used out-of-the-box to estimate the effects of protein-coding mutations18.
The models can also assess whether protein sequences are likely to form functional structures, which has enabled protein language models to evaluate and design new sequences21,22.
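A hedged sketch of this kind of out-of-the-box variant scoring. The paper's notebook uses ESM-2; the specific checkpoint below (facebook/esm2_t6_8M_UR50D via Hugging Face transformers and torch) is my choice, and the sequence and Q11P mutation are made-up examples:

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint (assumption, for speed)
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy sequence
pos, wt, mut = 10, "Q", "P"                  # hypothetical Q11P mutation (0-based index 10)
assert seq[pos] == wt

# Mask the mutated position and compare the model's log-probabilities of the
# wild-type vs mutant residue; a strongly negative score suggests a damaging change.
masked = seq[:pos] + tokenizer.mask_token + seq[pos + 1:]
inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
log_probs = torch.log_softmax(logits[0, mask_idx], dim=-1)
score = (log_probs[tokenizer.convert_tokens_to_ids(mut)]
         - log_probs[tokenizer.convert_tokens_to_ids(wt)]).item()
print(f"{wt}{pos + 1}{mut} log-odds: {score:.2f}")
```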
Application: embedding analysis
The embeddings for each amino acid can then be used on their own or combined into a single protein representation. For example, prior work has found that clustering protein sequence embeddings can identify homologous proteins.
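A small sketch of embedding analysis under the same assumptions (ESM-2 via Hugging Face, scikit-learn for clustering); the sequences are toy strings, but the idea is that similar proteins should land in the same cluster:

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, EsmModel

name = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmModel.from_pretrained(name).eval()

# Toy sequences; in practice these would be the full protein sequences of interest.
sequences = ["MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSRA", "GAVLIPFMWSTCYNQDEKRH"]

embeddings = []
with torch.no_grad():
    for seq in sequences:
        inputs = tokenizer(seq, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]   # one vector per token/residue
        embeddings.append(hidden.mean(dim=0).numpy())   # mean-pool to one protein vector

# Cluster the protein-level embeddings; homologous/similar sequences should co-cluster.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.stack(embeddings))
print(labels)
```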
Application: transfer learning
For example, these embeddings have been used to predict protein stability24, immune escape with viral antigen mutations25 and, using a small quantity of labeled data, the pathogenicity of missense variants26.
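A minimal transfer-learning sketch: freeze the language model, treat its embeddings as features, and fit a small supervised head on a limited labeled set. The arrays below are random placeholders standing in for real ESM-2 embeddings and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# `embeddings` would come from a protein language model as in the sketch above;
# `labels` is a small, hypothetical labeled set (e.g., 1 = pathogenic variant).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 320))   # placeholder for real ESM-2 embeddings
labels = rng.integers(0, 2, size=200)      # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.25, random_state=0
)

# A simple head on top of frozen embeddings is often enough when labels are scarce.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```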
Interactive example
As a demonstration of how protein language models can be applied to various downstream tasks, we provide an interactive notebook with examples for direct prediction, embedding analysis and transfer learning with ESM-2, which can be run from a browser using Google Colab: https://colab.research.google.com/drive/1zIIRGeqpXvKyz1oynHsyLYRxHHiIbrV5?usp=sharing.
Protein structure models
Although models for protein structure prediction, such as AlphaFold2 (ref. 28) and ESMFold3, are not the focus of this primer, it is worth mentioning that including structural information with protein sequences to train the model, as is done in models for protein structure prediction, can improve protein representations for various downstream tasks. Protein-structure-prediction models, like language models, have proven to be widely adaptable for diverse downstream applications through direct prediction, embedding analysis and transfer learning.
Single-cell language models
Why is large-scale pre-training useful?
Single-cell gene expression data provide insights into the cellular state and function of individual cells, but their high dimensionality makes interpretation challenging. AI methods have recently been developed to help analyze these complex data.
Single-cell language model example: Geneformer
Trained on single-cell data to predict expression patterns!
- Geneformer represents each cell as a list of the top 2,048 genes expressed in the cell, sorted on the basis of RNA expression levels.
- The training process is similar to that of the previously described protein language models, in that subsets of the genes are masked out and the model is trained to predict the missing genes. > Do language models always train this way? Aren't there other approaches?
- To properly predict the missing genes in the order of their expression levels, the model must understand interactions between expression levels of various genes and implicitly learn cell-type-specific patterns and context. > If the model learns context information along with expression, could context be predicted from expression? And then, say, diseases treated according to such predictions...
- Geneformer was trained on 30 million single-cell transcriptomes spanning 40 tissue types, which helps it learn diverse expression patterns.
Application: direct prediction
A variety of creative in silico experiments... For example, Geneformer simulated the reprogramming of fibroblasts by artificially adding POU5F1, SOX2, KLF4 and MYC to the top of the gene rankings for the cells, thereby computationally shifting the cells toward the induced pluripotent stem cell state. Similarly, single-cell language models can predict the sensitivity of cells to gene removal by artificially deleting genes from the ranked list for a cell and examining the effect on the cell embeddings. > Predicting how the rest of the expression profile changes when certain genes are perturbed.
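A schematic sketch of this kind of in silico perturbation; embed_cell here is a hypothetical stand-in for a trained single-cell language model (it is not the Geneformer API), and the gene lists are invented:

```python
import numpy as np

def embed_cell(ranked_genes):
    """Hypothetical stand-in for a trained single-cell language model that
    maps a ranked gene list to a cell embedding (e.g., something like Geneformer)."""
    rng = np.random.default_rng(abs(hash(tuple(ranked_genes))) % (2**32))
    return rng.normal(size=256)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A fibroblast represented as genes ranked by expression (toy list).
fibroblast = ["COL1A1", "FN1", "ACTB", "VIM", "GAPDH"]

# In-silico reprogramming: push the Yamanaka factors to the top of the ranking.
reprogrammed = ["POU5F1", "SOX2", "KLF4", "MYC"] + fibroblast

# In-silico deletion: remove a gene and see how far the embedding moves.
deleted = [g for g in fibroblast if g != "COL1A1"]

base = embed_cell(fibroblast)
print("shift after reprogramming:", 1 - cosine(base, embed_cell(reprogrammed)))
print("shift after deleting COL1A1:", 1 - cosine(base, embed_cell(deleted)))
```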
Application: embedding analysis
Single-cell language models can implicitly reduce batch effects while maintaining biological variability, enabling them to identify nuanced cell subtypes from datasets containing many experimental batches4.
Application: transfer learning
...models can also be fine-tuned to predict properties of individual cells. For example, single-cell language models can be fine-tuned to integrate data across experimental conditions and predict cell-type labels and cell states. They can even support multimodal representations of genes. For example, scGPT can be fine-tuned to include chromatin accessibility and protein abundance alongside gene expression levels, enabling dataset integration across modalities.
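A rough PyTorch sketch of what such fine-tuning looks like: a pretrained encoder plus a newly initialized classification head, updated end to end. The encoder below is a toy stand-in; in practice it would be a pretrained model such as Geneformer loaded from Hugging Face:

```python
import torch
import torch.nn as nn

class CellTypeClassifier(nn.Module):
    """Pretrained encoder (e.g., a single-cell language model) + a new head."""

    def __init__(self, encoder, hidden_dim, n_cell_types):
        super().__init__()
        self.encoder = encoder                            # pretrained weights, updated during fine-tuning
        self.head = nn.Linear(hidden_dim, n_cell_types)   # randomly initialized classification head

    def forward(self, gene_token_ids):
        hidden = self.encoder(gene_token_ids)             # (batch, seq_len, hidden_dim)
        cell_embedding = hidden.mean(dim=1)               # pool gene tokens into one cell vector
        return self.head(cell_embedding)

# Toy stand-in for a pretrained encoder; a real run would load Geneformer or scGPT weights.
pretrained_encoder = nn.Sequential(nn.Embedding(25_000, 128))
model = CellTypeClassifier(pretrained_encoder, hidden_dim=128, n_cell_types=10)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR for fine-tuning
loss_fn = nn.CrossEntropyLoss()

gene_token_ids = torch.randint(0, 25_000, (4, 2048))  # toy batch of ranked-gene cells
labels = torch.randint(0, 10, (4,))                   # toy cell-type labels
loss = loss_fn(model(gene_token_ids), labels)
loss.backward()
optimizer.step()
```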
Multimodal language models for biology
- Pathology Language-Image Pre-training (PLIP)30 was trained on Twitter data to match pathology images to their captions, enabling users to get captions for a given image or find images for a given text description.
- Similarly, Med-PaLM Multimodal31 was trained to answer questions on the basis of biomedical images,
- and MolT5 (ref. 32) was trained to describe molecules in natural language, including information about their potential biological functions, on the basis of their molecular structures.
- Natural language models can also be applied in a multimodal setting without additional training by combining fixed language model embeddings of biological text with data from other domains. GenePT33 provides an example for single-cell data. GenePT leverages the implicit genomic knowledge of language models to embed cells. Specifically, GenePT embeds cells with ChatGPT by first embedding text descriptions of genes from NCBI using ChatGPT and then creating single-cell embeddings by averaging text-based gene embeddings, weighted by single-cell expression. In some applications, these embeddings derived from a natural language model match or outperform embeddings from biological language models such as Geneformer. > An interesting example, but I don't fully understand it (see the sketch below).
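My attempt to make the GenePT idea concrete with a sketch: embed each gene's text description with a natural language model, then average those text embeddings weighted by the cell's expression. The OpenAI client and the model name text-embedding-3-small are assumptions (GenePT's exact embedding model may differ), and the gene descriptions are shortened placeholders rather than the real NCBI summaries:

```python
import numpy as np
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

# Shortened, paraphrased gene summaries standing in for the NCBI descriptions GenePT uses.
gene_descriptions = {
    "CD3E": "CD3 epsilon subunit of the T-cell receptor complex ...",
    "MS4A1": "Membrane-spanning protein CD20, a B-lymphocyte surface marker ...",
    "LYZ": "Lysozyme, an antimicrobial enzyme abundant in monocytes ...",
}

# Step 1: embed each gene's text description with a natural language model.
resp = client.embeddings.create(model="text-embedding-3-small",  # model name is an assumption
                                input=list(gene_descriptions.values()))
gene_emb = {g: np.array(d.embedding) for g, d in zip(gene_descriptions, resp.data)}

# Step 2: embed a cell as the expression-weighted average of its genes' text embeddings.
cell_expression = {"CD3E": 7.5, "MS4A1": 0.1, "LYZ": 0.4}   # toy expression values
weights = np.array([cell_expression[g] for g in gene_descriptions])
weights = weights / weights.sum()
cell_embedding = np.average([gene_emb[g] for g in gene_descriptions], axis=0, weights=weights)
print(cell_embedding.shape)
```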
Best practices when using language models for biology
The figure uses terms I don't know, so I can't fully follow it, but noting it here for now...

Biological language model resources
| Resource | Language type | Link | Notes |
| --- | --- | --- | --- |
| ChatGPT | Natural language (general) | https://chat.openai.com | Chat with the model at chat.openai.com or programmatically query the API as per the documentation. |
| BioBERT | Natural language (biomedical) | https://github.com/dmis-lab/biobert | Links to Hugging Face repositories with pre-trained BioBERT models of different sizes, along with code to fine-tune BioBERT. |
| Med-PaLM 2 | Natural language (biomedical) | https://cloud.google.com/vertex-ai/generative-ai/docs/medlm/overview | Available to certain customers as part of Google’s Vertex AI platform. |
| ESM | Protein language | https://github.com/facebookresearch/esm | Links to model code, pre-trained models and tutorials. |
| ProGen | Protein language | https://github.com/salesforce/progen/tree/main | Links to pre-trained models and code for making predictions. |
| Geneformer | Single-cell language | https://huggingface.co/ctheodoris/Geneformer | Trained models available through Hugging Face. Provides documentation with example code for various applications. |
| scGPT | Single-cell language | https://github.com/bowang-lab/scGPT | Links to pre-trained models and websites with cell annotations and other features. |
| GenePT | Multimodal: natural language and single-cell gene expression | https://github.com/yiqunchen/GenePT | Pre-computed gene embeddings and notebooks with example tutorials for various applications. |
| PLIP | Multimodal: natural language and pathology images | https://huggingface.co/spaces/vinid/webplip | Links to training data, pre-trained models and code. |
| Hugging Face | Multiple language model types | https://huggingface.co | Repository of many different trained machine-learning models. Query for specific terms (for example, protein) to find models in a specific domain. |
Limitations
1. Language models still cannot perfectly solve many biological problems, even the ones that they were originally trained to address. For example, natural language models contain only the biological knowledge that is included in their training data, so they will be unaware of findings discovered after training. Protein language models are typically trained on the standard amino acids and therefore cannot reflect the significance of any post-translational modifications in the input representation.
2. Models tailored to specific biological applications can still sometimes outperform biological language models, particularly when prior knowledge can inform the model design. For example, methods that include information about protein structures have been shown to outperform methods that use language models trained on protein sequences34,35.
