Accelerating biology through machine learning and open science

OpenBioML is a decentralized, collaborative research community founded on the belief that open source machine learning and open science can accelerate biotechnology. Join our Discord server, participate in the community's research, or suggest an entirely new project.

Our Mission

We seek to support the broader community by providing an avenue where talented researchers worldwide can discuss, develop, and release machine learning models and tools with a focus on downstream impact. We aim to build a diverse, globally distributed community where skills and resources are shared openly to solve important problems.

Skills are global. Resources are not.

While talent is distributed globally, opportunities and resources are not. We aim to bridge this gap by allowing researchers worldwide to connect and work together. Through open collaboration and access to large-scale computational resources, we seek to break down barriers and enable high-impact research at the intersection of machine learning and biology.

Explore some of our research

DNA Diffusion

  • Diffusion models are powerful generative models that can create realistic and diverse data based on text or other inputs. In this project, we aim to develop and apply these models to generate and analyze DNA sequences with specific regulatory functions.

    We will use a large-scale genomic dataset to train our models to generate synthetic DNA sequences with desired properties, such as cell type specificity, gene expression level, transcription factor binding, chromatin accessibility, etc. We will evaluate the generated sequences using various metrics and experimental validation.

    The project will also explore how diffusion models can help us study the function of regulatory elements, such as enhancers and locus control regions. We will modify existing sequences to test the effects of single nucleotide variations, evolutionary changes, or combinatorial interactions. We will also generate novel regulatory loci by combining multiple elements.

    We will leverage the latest advances in diffusion models, such as bit diffusion and guided diffusion, to handle discrete and conditional data. We will also investigate different prompting and sampling strategies to optimize the generative process and the quality of the outputs.

    This project has a range of potential applications in genomics research and biotechnology, such as designing synthetic regulatory circuits, discovering new regulatory mechanisms, understanding disease-associated variants, and creating educational tools for learning about DNA.

    Join the discussion on this project in our #dna-diffusion channel on our Discord server.
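As a rough illustration of how bit diffusion can treat a discrete DNA sequence as continuous data, the sketch below encodes each nucleotide as two "analog bits" in {-1, +1}, applies one forward-diffusion noising step, and decodes by thresholding at zero. This is a minimal sketch, not the project's implementation: the 2-bit code, the function names, and the `alpha_bar` value are all assumptions made for illustration.

```python
import math
import random

# Hypothetical 2-bit code for the four nucleotides (an illustrative choice,
# not taken from the project).
NUCLEOTIDE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
BITS_TO_NUCLEOTIDE = {bits: base for base, bits in NUCLEOTIDE_BITS.items()}


def encode_analog_bits(sequence):
    """Map a DNA string to a flat list of real values in {-1.0, +1.0}."""
    values = []
    for base in sequence.upper():
        for bit in NUCLEOTIDE_BITS[base]:
            values.append(2.0 * bit - 1.0)
    return values


def add_noise(values, alpha_bar, rng):
    """Draw one forward-diffusion sample: sqrt(a) * x0 + sqrt(1 - a) * noise."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in values]


def decode_analog_bits(values):
    """Threshold each value at zero and map bit pairs back to nucleotides."""
    bits = [1 if v > 0.0 else 0 for v in values]
    pairs = zip(bits[0::2], bits[1::2])
    return "".join(BITS_TO_NUCLEOTIDE[pair] for pair in pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = encode_analog_bits("GATTACA")
    noisy = add_noise(clean, alpha_bar=0.9, rng=rng)
    print(decode_analog_bits(clean))  # round-trips the input sequence
    print(decode_analog_bits(noisy))  # may differ once noise is added
```

In a full model, a neural network would be trained to reverse the noising step, optionally conditioned on a label such as cell type; only the encoding and the forward process are shown here.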

Large Language Chemistry Dataset and Model

  • This project aims to create an extensive chemistry dataset and use it to train large language models (LLMs) that can leverage the data for a wide range of chemistry applications.

    The dataset itself will be valuable for downstream NLP research focused on chemistry. Use cases for the model include generating chemical data (e.g., molecular structures, properties, and reactions) from text and text from chemical data, generating code for computational chemistry applications, and predicting chemical reactions and the properties of materials and molecules (i.e., classification and regression tasks). The applications should not be limited to these, however: we specifically want to look for new emergent behaviors not present in smaller LMs or in LLMs trained on less chemistry data.

    The resulting models should also be investigated with different prompting and sampling strategies to examine their capabilities. A capable chemistry LLM has many potential use cases, from a tool for experts who want to generate new ideas to one that non-experts can use to explore new chemistry topics. Such a tool could also be helpful in education as an intelligent learning assistant.

    Join the discussion on this project in our #chemnlp channel on our Discord server.

Journal Club

Attend our journal clubs, where we have bi-weekly presentations with biology's best and brightest. Catch up on the previous talks on our YouTube channel.
