Large Language Models (LLMs) have revolutionized artificial intelligence (AI). While originally designed for text, these models turn out to be remarkably effective for biological sequences like DNA and proteins. We were among the first to develop protein language models (pLMs) as general-purpose predictors of protein structure and function. We also showed that pLMs can accurately distinguish between disease-causing and benign mutations. Thanks in part to our work, LLMs are now regarded as one of the most promising approaches for studying proteins and the consequences of genetic variation. The Brandes Lab continues to work on the remaining challenges to unlock their full potential for diagnosing and treating disease and understanding our genomes.
Join Us!
If you are excited about this research agenda, come work with us.
1. Incorporate variant effect prediction in statistical genetics to implicate rare variants and establish causality.
Genome-wide association studies (GWAS) and polygenic risk scores (PRS) are purely statistical: they search for genetic variants correlated with disease status without knowing anything about the variants' molecular effects. In contrast, we are leveraging variant effect predictions, especially those made by frontier AI models, to guide GWAS and PRS towards variants more likely to have a real effect. We are testing these methods on large-scale genetic cohorts. Functional priors are especially valuable when statistical evidence is limited (as with rare mutations) and when trying to distinguish causal from non-causal associations.
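To make the idea concrete, here is a minimal sketch of how a functional prior can enter a rare-variant burden test. This is a toy illustration, not our actual pipeline: the genotypes, phenotypes, and AI effect scores are all simulated, and a real analysis would adjust for covariates and population structure.

```python
# Toy sketch: weighting a rare-variant burden test by predicted effect scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_variants = 5_000, 30

# Rare variants: allele counts are mostly 0, occasionally 1.
genotypes = rng.binomial(2, 0.005, size=(n_samples, n_variants))

# Suppose only the first ten variants are truly causal.
causal = np.zeros(n_variants, dtype=bool)
causal[:10] = True
phenotype = genotypes[:, causal].sum(axis=1) * 0.8 + rng.normal(size=n_samples)

# Hypothetical functional prior: noisy AI scores enriched for causal variants.
scores = np.clip(causal * 0.8 + rng.normal(0.1, 0.2, n_variants), 0, 1)

# Unweighted burden: every rare allele counts equally.
burden_flat = genotypes.sum(axis=1)
# Prior-weighted burden: alleles count in proportion to predicted effect.
burden_weighted = genotypes @ scores

for name, burden in [("flat", burden_flat), ("weighted", burden_weighted)]:
    result = stats.linregress(burden, phenotype)
    print(f"{name:>8} burden: p = {result.pvalue:.2e}")
```

With informative scores, the weighted burden concentrates the signal on the variants that matter, yielding a stronger association from the same sample size; this is exactly where priors help most for rare variants.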
2. Identify specific genetic effects and optimize genomic sequences.
Existing variant effect prediction algorithms try to predict whether a given variant is damaging or neutral. But variant effect is not a one-dimensional phenomenon: different mutations in the same gene may lead to loss-of-function, gain-of-function, or dominant-negative effects. We are using modern AI, which is inherently high-dimensional, to tease apart these distinct effects. We are also using data from high-throughput experiments such as deep mutational scans and single-cell RNA sequencing (Perturb-seq) to refine our predictions across the mutational landscape. We further leverage this approach to search for combinations of mutations that optimize the genetic background of cells, for example to make immune cells more potent against tumors.
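One way to see how high-dimensional models enable this, sketched below: represent each mutation not as a single score but as the difference between the mutant and wild-type embeddings, then group mutations by where they move the protein in embedding space. The `embed` function here is a stand-in for any pLM encoder (e.g., a mean-pooled transformer embedding) and is faked so the example runs; the sequence and clustering are purely illustrative.

```python
# Sketch: multi-dimensional variant effect representations via embedding deltas.
import numpy as np
from sklearn.cluster import KMeans

D = 64  # embedding dimension

def embed(seq: str) -> np.ndarray:
    """Placeholder pLM encoder: a pseudo-embedding derived from the sequence
    hash (deterministic within a run). A real encoder would go here."""
    seed = abs(hash(seq)) % (2**32)
    return np.random.default_rng(seed).normal(size=D)

wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
mutants = [wild_type[:i] + aa + wild_type[i + 1:]
           for i in (3, 8, 15, 22, 30) for aa in "VGD"]

# Each mutation becomes a vector: how it moves the protein in embedding space.
deltas = np.stack([embed(m) - embed(wild_type) for m in mutants])

# Cluster the deltas; in practice, clusters can be matched against labeled
# examples (e.g., known loss-of-function vs. gain-of-function mutations).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(deltas)
for mutant, label in zip(mutants[:6], labels[:6]):
    print(f"cluster {label}: {mutant[:12]}...")
```

Deep mutational scans and Perturb-seq readouts then serve as labels for such representations, turning unsupervised clusters into calibrated predictions of specific effect classes.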
3. Make AI suitable for the non-coding genome.
Protein language models, despite their effectiveness, are restricted to the 1-2% of our genome that codes for proteins. DNA language models, on the other hand, can handle any genomic region, but they currently lag behind due to a host of technical challenges. We aim to tackle these challenges and develop generalized AI models that can predict the effects of non-coding variants, implicate their roles in disease, and identify the specific mutations most likely to push cells into desirable states.
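A common recipe for scoring a non-coding variant with a sequence model is a log-likelihood ratio: compare the model's probability of the alternate versus the reference allele in its genomic context. The sketch below implements that scoring logic with a trivial 3-mer frequency model standing in for a real DNA language model; the sequence and variant are invented for illustration.

```python
# Sketch: log-likelihood-ratio scoring of a non-coding variant.
import math
from collections import Counter

def kmer_model(context: str, k: int = 3):
    """Toy stand-in for a DNA LM: conditional P(base | previous k-1 bases),
    estimated from the reference context with add-1 smoothing."""
    counts = Counter(context[i:i + k] for i in range(len(context) - k + 1))
    prefix = Counter(context[i:i + k - 1] for i in range(len(context) - k + 2))
    def prob(prev: str, base: str) -> float:
        return (counts[prev + base] + 1) / (prefix[prev] + 4)
    return prob

reference = "ACGTACGGTCACGTTACGGACGTACCGTAACGT"  # hypothetical non-coding region
pos, alt_base = 15, "G"
ref_base = reference[pos]

prob = kmer_model(reference)
prev = reference[pos - 2:pos]
llr = math.log(prob(prev, alt_base)) - math.log(prob(prev, ref_base))
print(f"variant {ref_base}{pos}{alt_base}: log-likelihood ratio = {llr:+.3f}")
# A negative LLR means the alternate allele is less expected in context,
# a common proxy for functional disruption.
```

A real DNA language model would replace the k-mer table with learned per-position nucleotide probabilities over a much longer context, but the scoring step stays the same.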