Weimin Wu

Ph.D. Candidate, Computer Science, Northwestern University

GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies


Preprint (bioRxiv)


Zhihan Zhou, Robert Riley, Satria Kautsar, Weimin Wu, Rob Egan, Steven Hofmeyr, Shira Goldhaber-Gordon, Mutian Yu, Harrison Ho, Fengchen Liu, Feng Chen, Rachael Morgan-Kiss, Lizhen Shi, Han Liu, Zhong Wang
2025

View PDF https://www.biorxiv.org/content/10.1101/2025....

Cite

APA
Zhou, Z., Riley, R., Kautsar, S., Wu, W., Egan, R., Hofmeyr, S., … Wang, Z. (2025). GenomeOcean: An efficient genome foundation model trained on large-scale metagenomic assemblies. bioRxiv.


Chicago/Turabian
Zhou, Zhihan, Robert Riley, Satria Kautsar, Weimin Wu, Rob Egan, Steven Hofmeyr, Shira Goldhaber-Gordon, et al. “GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies.” bioRxiv, 2025.


MLA
Zhou, Zhihan, et al. “GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies.” bioRxiv, 2025.


BibTeX

@unpublished{zhou2025genomeocean,
  title = {GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies},
  author = {Zhou, Zhihan and Riley, Robert and Kautsar, Satria and Wu, Weimin and Egan, Rob and Hofmeyr, Steven and Goldhaber-Gordon, Shira and Yu, Mutian and Ho, Harrison and Liu, Fengchen and Chen, Feng and Morgan-Kiss, Rachael and Shi, Lizhen and Liu, Han and Wang, Zhong},
  year = {2025},
  note = {bioRxiv preprint}
}

Abstract:

Genome foundation models hold transformative potential for precision medicine, drug discovery, and understanding complex biological systems. However, existing models are often inefficient, constrained by suboptimal tokenization and architectural design, and biased toward reference genomes, limiting their representation of low-abundance, uncultured microbes in the rare biosphere. To address these challenges, we developed GenomeOcean, a 4-billion-parameter generative genome foundation model trained on over 600 Gbp of high-quality contigs derived from 220 TB of metagenomic datasets collected from diverse habitats across Earth’s ecosystems. A key innovation of GenomeOcean is training directly on large-scale co-assemblies of metagenomic samples, enabling enhanced representation of rare microbial species and improving generalizability beyond genome-centric approaches. We implemented a byte-pair encoding (BPE) tokenization strategy for genome sequence generation, alongside architectural optimizations, achieving up to 150× faster sequence generation while maintaining high biological fidelity. GenomeOcean excels in representing microbial species and generating protein-coding genes constrained by evolutionary principles. Additionally, its fine-tuned model demonstrates the ability to discover novel biosynthetic gene clusters (BGCs) in natural genomes and perform zero-shot synthesis of biochemically plausible, complete BGCs. GenomeOcean sets a new benchmark for metagenomic research, natural product discovery, and synthetic biology, offering a robust foundation for advancing these fields.
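The BPE tokenization the abstract mentions can be illustrated with a minimal from-scratch sketch: starting from single-nucleotide tokens, repeatedly merge the most frequent adjacent pair into a new token. This is a toy illustration of the general BPE algorithm, not GenomeOcean's actual tokenizer, vocabulary, or merge count.

```python
from collections import Counter

def train_bpe(sequences, num_merges):
    """Learn BPE merge rules over DNA strings (illustrative toy)."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent pair
        merges.append((a, b))
        corpus = [_apply(toks, a, b) for toks in corpus]
    return merges

def _apply(toks, a, b):
    """Greedily replace every adjacent (a, b) pair with the merged token."""
    out, j = [], 0
    while j < len(toks):
        if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
            out.append(a + b)
            j += 2
        else:
            out.append(toks[j])
            j += 1
    return out

def tokenize(seq, merges):
    """Apply learned merges, in training order, to a new sequence."""
    toks = list(seq)
    for a, b in merges:
        toks = _apply(toks, a, b)
    return toks

merges = train_bpe(["ATATAT", "ATGC"], num_merges=2)
print(tokenize("ATATAT", merges))  # -> ['ATAT', 'AT']
```

Because merged tokens cover several nucleotides each, a model consumes far fewer tokens per base pair than with single-nucleotide or fixed k-mer vocabularies, which is one route to the generation speedups the abstract reports.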

