Weimin Wu

Ph.D. Candidate, Computer Science, Northwestern University

DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models


Conference paper


Zhihan Zhou*, Weimin Wu*, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, Han Liu
2024

PDF: https://arxiv.org/abs/2402.08777

Cite

APA
Zhou*, Z., Wu*, W., Ho, H., Wang, J., Shi, L., Davuluri, R. V., … Liu, H. (2024). DNABERT-S: Learning species-aware DNA embedding with genome foundation models.


Chicago/Turabian
Zhou*, Zhihan, Weimin Wu*, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, and Han Liu. “DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models,” 2024.


MLA
Zhou*, Zhihan, et al. DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models. 2024.


BibTeX

@inproceedings{zhihan2024a,
  title = {{DNABERT-S}: Learning Species-Aware {DNA} Embedding with Genome Foundation Models},
  year = {2024},
  author = {Zhou*, Zhihan and Wu*, Weimin and Ho, Harrison and Wang, Jiayi and Shi, Lizhen and Davuluri, Ramana V and Wang, Zhong and Liu, Han}
}

Abstract:

Effective DNA embedding remains crucial in genomic analysis, particularly in scenarios lacking labeled data for model fine-tuning, despite the significant advancements in genome foundation models. A prime example is metagenomics binning, a critical process in microbiome research that aims to group DNA sequences by species from a complex mixture of DNA sequences derived from potentially thousands of distinct, often uncharacterized species. To address the lack of effective DNA embedding models, we introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C2LR) strategy. Empirical results on 18 diverse datasets demonstrate DNABERT-S’s remarkable performance: it outperforms the top baseline’s 10-shot species classification performance with just 2-shot training, while doubling the Adjusted Rand Index (ARI) in species clustering and substantially increasing the number of correctly identified species in metagenomics binning. The code, data, and pre-trained model are publicly available at https://github.com/Zhihan1996/DNABERT_S.
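In brief, the MI-Mix objective interpolates the hidden states of two sequences in a batch at a randomly chosen encoder layer and then trains the model so that the mixed embedding matches each sequence’s positive view in proportion to the mixing coefficient. Below is a minimal PyTorch sketch of this idea; the backbone’s forward_to/forward_from helpers and num_layers attribute are hypothetical conveniences for splitting the forward pass at a layer, and names and signatures are illustrative, not the actual DNABERT-S implementation.

import torch
import torch.nn.functional as F

def mi_mix_loss(backbone, view_a, view_b, temperature=0.07):
    """Sketch of a Manifold-Instance-Mixup-style contrastive loss.

    view_a and view_b are two augmented views (token id tensors) of the
    same batch of DNA sequences. View A's hidden states are mixed pairwise
    at a random layer; the model must recover the mixing proportions
    against the unmixed view-B embeddings.
    """
    batch = view_a.size(0)
    lam = torch.distributions.Beta(1.0, 1.0).sample().item()   # mixing ratio
    perm = torch.randperm(batch, device=view_a.device)          # mixing partners
    layer = torch.randint(0, backbone.num_layers + 1, (1,)).item()

    # Run view A up to the sampled layer, mix hidden states, then finish.
    h = backbone.forward_to(view_a, layer)                      # hypothetical helper
    h_mixed = lam * h + (1.0 - lam) * h[perm]
    z_mixed = F.normalize(backbone.forward_from(h_mixed, layer), dim=-1)

    # View B is embedded normally and serves as the set of positive anchors.
    z_b = F.normalize(backbone(view_b), dim=-1)

    logits = z_mixed @ z_b.t() / temperature                    # (batch, batch)
    targets = torch.arange(batch, device=view_a.device)

    # The mixed embedding should match its own positive with weight lam
    # and its partner's positive with weight (1 - lam).
    loss = lam * F.cross_entropy(logits, targets) \
         + (1.0 - lam) * F.cross_entropy(logits, targets[perm])
    return loss

In the paper’s C2LR curriculum, training proceeds in phases of increasing difficulty; roughly, an easier standard contrastive phase precedes this harder mixed-instance phase.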



