Imagine you have 20 or so experimentally validated EGFR-binding proteins, maybe from our recent protein design competition. None of them are fit for your application yet, so you want to further optimize them before you submit them to round two. Or maybe you got some decent binders, but need to re-optimize to find a trade-off between their binding strength and their expression in a high-yield bioreactor for commercialization.
You can of course re-use tools like RFdiffusion or the newly released AlphaProteo. You can have the model propose hundreds of promising variants of your designs, and then use heuristics (for example other ML models) to select candidates for another round of validation. However, you might not have the budget of DeepMind or the capacity of the Baker lab to filter out and screen hundreds of candidates playing this “de novo slot machine” in hopes of improvement.
In this blog post, we will give a brief introduction to optimization techniques that aim to minimize the number of real-world measurements required, i.e. aim for "sample efficiency", to use the technical term. We will use a hypothetical lab-budget-constrained protein engineering scenario, characteristic of an individual designer or a small company that just got some angel investment to build a proof of concept, but the same techniques can be used purely in silico, for example to optimize the score function used in your favourite EGFR binding competition.
This series of blog posts is aimed at practitioners or those aiming to become one, so while we try to avoid obscure jargon, if you are completely new to this your favorite LLM or search engine might be required for some more technical sections and concepts.
To ground the discussion and constrain the survey at the end of the post, we will assume a total experimental validation budget of around 200 sequences.
As we will see, most techniques aiming at sample-efficient protein optimization* fall under the framework of model-based optimization** (MBO, Hie & Yang, 2022). Here a surrogate model is trained to predict a fitness score (e.g., protein folding stability, binding affinity, expressibility) from a set of labelled training points. This model is then used to score candidate sequences and/ or to guide the generation of new ones.
* We distinguish optimization from single-shot de novo/ conditional design with filtering, the key difference being that we are trying to improve a seed sequence using a surrogate model trained on true fitness values (from a small labelled dataset - the low-N regime - or from entire Deep Mutational Scanning/ DMS datasets).
** For those in the know, think Bayesian Optimization (BO) using deep learning models applied to protein engineering.
Following Hie & Yang, 2022, we can distinguish between sequentially-optimized models and fixed model approaches. Sequentially-optimized means we are retraining the surrogate used for scoring and/ or the generator used to propose new candidates online, as we obtain newly validated data. The generator can be a generative model such as a VAE or a GAN, but also a simple mutation and recombination scheme, although it generally won't be as adaptable.
Thus, the optimization algorithm acts in a self-guided manner: it proposes new sequences, receives feedback and then learns to either generate better sequences in the same sequence space (if updating the generator) and/ or improves its fitness estimates (if updating the surrogate).
Sequentially-optimized models need to balance exploration (learning the fitness landscape) and exploitation (sampling from regions with the highest expected fitness). Some models can adaptively transition between these two modes by directly using the surrogate, while others might balance them with a fixed parameter or avoid exploration completely and only suggest high-fitness variants (exploit-only).
We will refer to the former category as model-based adaptive sampling methods, and to the latter as greedy heuristics. You can also check out this resource to get a better feel for the explore-exploit trade-off and Bayesian optimization.
In contrast, within the fixed model approach, all training happens before the optimization campaign. A model can be conditioned, guided, or simply trained once on the initial set of training points and used to validate randomly sampled designs ("sample and filter").
This might be done because adaptive sampling is too computationally demanding, because it's too complex to reason about, implement or justify to stakeholders, or for other reasons. Although this approach is not the focus of this blog post, it is very popular, so we will briefly give it some consideration in the appendix.
As mentioned, we will split sequentially-optimized models into explicit adaptive models — which rely on the trained surrogate (or the generative model generating the candidates, in some cases) to balance between exploration and exploitation — and greedy/ heuristic methods, which either select the top predicted sequences to be evaluated (exploit-only) or a priori establish how the fitness landscape should be parsed (fixed parameter).
Or, to make the distinction clear: models either are told how to navigate the landscape beforehand, or they adaptively learn how to do it. In both cases, our final goal is to keep the number of experimental evaluations within our budget while maximizing fitness, in our example binding affinity to EGFR.
For the model-based sampling methods, an uncertainty-aware surrogate (a Gaussian Process, for example) models a distribution over a set of fitness scores given an initial pool of sequences.
If you bear with us through some mathematical notation, we can write the true fitness of a protein \( \mathfrak{p} \) as \( f^*(\mathfrak{p})\) and our estimate as the random function \( \mathbf{f}(\mathfrak{p}) \sim \mathcal{N}\left( \mu\left(\mathfrak{p}\right),\sigma^2\left(\mathfrak{p}\right)\right) \), making \( \mu(\mathfrak{p})\) our expected value of the protein fitness and \( \sigma(\mathfrak{p}) \) our measure of uncertainty.
Some sequences might have a high expected fitness \( \mu(\mathfrak{p})\) with low uncertainty, indicating a region in the fitness landscape from which it might be desirable to sample to optimize binding (exploitation). Some might have a slightly lower but still good \( \mu(\mathfrak{p})\) but also high uncertainty \( \sigma(\mathfrak{p})\), and we might want to look into these to see if they lead to regions with even better binding affinity (exploration).
The strategy that determines which regions the model should explore next, i.e. which sequence should be validated next, is captured in what is called an acquisition function \( \alpha(\mathfrak{p})\). A popular choice is the Upper Confidence Bound (\( \text{UCB}_\beta\)) (Srinivas 2009), which is simply \( \alpha(\mathfrak{p})\coloneqq \text{UCB}_\beta(\mathfrak{p})\coloneqq \mu(\mathfrak{p})+\sqrt{\beta}\sigma(\mathfrak{p})\), with \( \beta \geq 0 \) being a hyperparameter that lets you choose your risk appetite. For a finite space with \( \vert D \vert \) options and a confidence level \( \delta \), the theory developed in (Srinivas 2009) suggests choosing \( \beta=\beta_t^*= 2 \log( \vert D \vert t^2 \pi^2 / 6\delta) \), increasing the importance of exploration at each round \( {t} \) as we learn more about the domain. With this, the model will act conservatively, making sure to keep exploring uncertain areas about as much as highly promising areas, while \( \beta=0\) means we are greedily improving the target property. (Srinivas 2009) found that the sweet spot lies between the two: choosing a smaller but non-zero value than their theory suggests, e.g. \( \beta=0.2\beta_t^* \), yielded the best results for them, but this will depend on the exact problem you are tackling.
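To make the trade-off concrete, here is a minimal numpy sketch of UCB scoring with the (scaled) schedule above. The \( \mu \) and \( \sigma \) values are made up, standing in for a real surrogate's predictions over five candidate sequences:

```python
import numpy as np

def ucb(mu, sigma, beta):
    """Upper Confidence Bound: trade off expected fitness (mu)
    against uncertainty (sigma) with risk parameter beta >= 0."""
    return mu + np.sqrt(beta) * sigma

def beta_srinivas(t, domain_size, delta=0.1):
    """Theoretical schedule from Srinivas et al. for a finite domain of
    |D| = domain_size options; exploration weight grows slowly with round t."""
    return 2 * np.log(domain_size * t**2 * np.pi**2 / (6 * delta))

# Made-up surrogate outputs for five candidate sequences
mu = np.array([0.9, 0.7, 0.5, 0.8, 0.6])
sigma = np.array([0.05, 0.4, 0.6, 0.1, 0.5])

# Scaled-down schedule (0.2 * beta_t^*), as the paper found beneficial
scores = ucb(mu, sigma, beta=0.2 * beta_srinivas(t=1, domain_size=5))
best = int(np.argmax(scores))  # a high-uncertainty candidate can now win
```

With `beta=0` the same function collapses to pure exploitation and simply picks the candidate with the highest \( \mu \); here the exploration bonus instead favours an uncertain candidate.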
Visualizing things, it will look like this:
You might encounter the terms Bayesian Optimization and Active Learning as separate concepts while digging through the literature. Both fit into this model-based sampling category, the former being more common when the task is optimizing a protein's property, the latter when the goal is a more accurate fitness predictor or better training data.
Historically, Bayesian optimization involved using a budget to make a single, close to optimal decision as measured by a completely black-box fitness function, while active learning was independently developed for the efficient acquisition of a training set in hopes of learning a better predictive surrogate model. However, recent research has blurred the distinction between the two, and mathematically they can be treated similarly (Kirsch, 2024). We therefore recommend thinking of them as a single approach and taking ideas from both communities as you embark on your protein optimization journey.
All methods we will discuss share the same key components: a generator proposing candidate sequences, a surrogate scoring them, an acquisition function deciding which candidates to evaluate, and an experimental validation step.
We start the optimization by pre-training our models and generating candidates. Then the surrogate scores each candidate, with the acquisition function providing the final decision on which candidate to evaluate. Finally, we validate the chosen sequences, retrain the models on the extended dataset and repeat the cycle. The algorithm converges when it reaches a fitness threshold, no longer improves after a set number of iterations, or when we can no longer validate new candidates because our sampling budget of 200 sequences is exhausted.
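The cycle above can be sketched end-to-end in a few lines. This is a toy illustration, not any particular paper's method: the `oracle` function stands in for wet-lab validation, a bootstrap ensemble of polynomial fits stands in for an uncertainty-aware surrogate, and UCB picks the next measurement:

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle(x):
    """Stand-in for the wet lab: a hidden ground-truth fitness function."""
    return np.exp(-(x - 0.7) ** 2 / 0.02)

pool = np.linspace(0, 1, 200)            # candidate "sequences" as 1-D features
labelled_x = list(rng.choice(pool, 5))   # initial seed measurements
labelled_y = [oracle(x) for x in labelled_x]
budget = 20                              # remaining validation budget

for t in range(1, budget + 1):
    # Surrogate: bootstrap ensemble of cubic fits -> mean and uncertainty
    preds = []
    for _ in range(10):
        idx = rng.integers(0, len(labelled_x), len(labelled_x))
        coeffs = np.polyfit(np.array(labelled_x)[idx], np.array(labelled_y)[idx], 3)
        preds.append(np.polyval(coeffs, pool))
    mu, sigma = np.mean(preds, axis=0), np.std(preds, axis=0)

    # Acquisition: UCB decides which candidate to validate next
    # (a real loop would also exclude already-measured candidates)
    nxt = pool[np.argmax(mu + 2.0 * sigma)]
    labelled_x.append(nxt)
    labelled_y.append(oracle(nxt))       # experimental feedback, then retrain

best = max(labelled_y)
```

Swapping the polynomial ensemble for a Gaussian Process or deep ensemble, and the 1-D pool for embedded sequences, gives the general shape of the methods discussed below.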
In addition to choosing the way we incorporate information during optimization, we also need to decide what space we want to operate in.
In the literature, optimization commonly occurs directly at the sequence level: new mutations are sampled, fed through the surrogate, and selected if they reach a fitness threshold. However, since sequences are discrete objects, i.e. there is no residue that is halfway between S and I, optimizing them directly is often noisy and non-smooth, since single-point mutations can cause drastic fitness changes (landscape roughness).
Even worse, fitness values can be greatly influenced by epistasis, where several sites in the protein bear long-range interactions. In this case combinations of mutations can have a non-additive effect on the overall activity, whilst most of the landscape bears a similar fitness (landscape sparsity). This landscape roughness and sparsity problem has been addressed in two major ways: latent optimization and landscape smoothing techniques.
Latent optimization involves replacing sequences with embedding vectors. These vectors come from a pre-trained model, usually a variational autoencoder or a protein language model.
Several studies have found them to be meaningful, evolutionary-informed representations (Lin et al., 2023; Detlefsen, Hauberg & Boomsma, 2022; Ding, Zou & Brooks, 2019). Assuming our representations are good enough, optimization can then take place in a continuous vector space. Such a space allows for arbitrarily small (i.e. careful) exploration steps. Together with other aspects of representational embeddings, this makes the task more amenable to classic methods of Bayesian optimization like Gaussian Processes using RBF kernels (Maus et al., 2022; Gómez-Bombarelli et al., 2018; Tripp, Daxberger & Hernández-Lobato, 2020). Moreover, we can combine gradient-based optimization with Bayesian methods on continuous inputs to find the global optimum faster and more robustly (Stanton et al., 2022; Khang Ngo et al., 2024; Gómez-Bombarelli et al., 2018). You should check out the references above to get more familiar with latent optimization.
Kirjner, Yim, and colleagues addressed the roughness problem with a different approach. They converted the fitness landscape into a graph, in which node attributes were based on fitness values and edges on the Levenshtein distance between sequences. They then regularized the resulting graph Laplacian and used it to train their fitness surrogate. Under the crucial assumption that single-point mutants likely have an incremental effect on fitness, this landscape smoothing ensures that similar sequences have similar fitness values.
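A minimal numpy sketch of this smoothing idea (not the authors' implementation): build a graph connecting nearby sequences, then solve a Laplacian-regularized system so that neighbouring nodes are pulled toward similar fitness values:

```python
import numpy as np
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Tiny toy landscape: four sequences with rough, disagreeing fitness labels
seqs = ["AAAA", "AAAB", "AABB", "ABBB"]
y = np.array([0.1, 0.9, 0.2, 0.8])

# Adjacency: connect sequences within edit distance 1
n = len(seqs)
A = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    if hamming(seqs[i], seqs[j]) <= 1:
        A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(1)) - A          # graph Laplacian
gamma = 1.0                        # smoothing strength
# Tikhonov smoothing: (I + gamma * L) y_smooth = y
y_smooth = np.linalg.solve(np.eye(n) + gamma * L, y)
```

The smoothed labels keep the same mean but shrink the neighbour-to-neighbour jumps, giving the surrogate a less rugged target to fit.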
Now, let's say you want to use one of these methods for your protein design campaign. Where should you start?
We have performed a review of the state-of-the-art protein ML models used for optimization and summarized our findings for you. We have focused on design and property targets, costs for optimization, fold-change improvements versus the wild-type, availability of good implementations, and other aspects practitioners might care about, to help you easily find the papers that are of interest to you.
In the following, we highlight some of these, mainly recent adaptive models (Bayesian optimization/ active learning/ greedy), that we consider state of the art or just good baselines. Any of these should be a decent starting point for budget-constrained optimization campaigns.
Evolutionary algorithms like AdaLead (Sinai et al., 2020) and PEX (Ren et al., 2022) generate new candidates by mutating and/ or recombining previous ones, followed by selection. In our framework they are sequentially-optimized models, with greedy/ heuristic candidate propositions, as sampling does not explicitly depend on the learned model and its acquisition function.
Code: https://flexs.readthedocs.io/en/latest/
AdaLead is a greedy search optimization method, used in the literature as a robust benchmark. At each step, a model predicts fitness values for a selected batch of variants, with fixed parameter rates for recombination and mutation. The explore-exploit trade-off is not learned as the optimization proceeds. Instead, it is instilled a priori via a parameter κ that controls the fitness threshold for selecting new variants (within 1 − κ of the maximum fitness observed previously).
The top variants within this threshold are then selected (greedy algorithm). The paper argues this is still an adaptive sampling approach, since exploration is encouraged on flat fitness landscapes, followed by a switch to exploitation as the maximum observed fitness increases, with fewer potential candidates being proposed. It is relatively robust, with minimal assumptions, high interpretability, and easy to implement, thanks to the FLEXS landscape simulation environment and Python package (Sinai et al., 2020). Moreover, FLEXS offers implementations of other landscape explorers like DbAS, CbAS, and reinforcement learning ones like DyNA-PPO.
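Stripped of details like rollouts and batching, a single AdaLead-style round can be sketched as below. The κ-threshold parent selection is the key idea; the mutation scheme and the toy surrogate are simplified stand-ins, not the FLEXS implementation:

```python
import random

random.seed(0)
AAS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, rate=0.05):
    """Point-mutate each position with a small fixed probability."""
    return "".join(random.choice(AAS) if random.random() < rate else c for c in seq)

def adalead_round(observed, surrogate, kappa=0.05, children=100, top_k=10):
    """One simplified AdaLead round: keep parents within (1 - kappa) of the
    best observed fitness, mutate them, return the top predicted children."""
    f_max = max(observed.values())
    parents = [s for s, f in observed.items() if f >= (1 - kappa) * f_max]
    candidates = {mutate(p) for p in parents for _ in range(children)}
    return sorted(candidates, key=surrogate, reverse=True)[:top_k]

# Hypothetical toy surrogate: fitness = fraction of alanines
surrogate = lambda s: s.count("A") / len(s)
observed = {"ACDEF": 0.2, "AADEF": 0.4, "AADEA": 0.6}
batch = adalead_round(observed, surrogate)  # next batch to validate
```

As the maximum observed fitness rises, the κ threshold admits fewer parents, which is exactly the implicit explore-to-exploit switch the paper argues for.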
Code: https://github.com/HeliXonProtein/proximal-exploration/tree/main
Next, PEX adapted this idea to focus more on local search. The acquisition function now includes an edit distance regularization term, controlled by a parameter λ, which penalizes candidates with high edit distance from wild-type sequences. This algorithm also deviates from the greedy selection approach: it instead selects candidates along the "proximal frontier", defined as all maximal points at various λ. With this, PEX optimization yielded a large fitness improvement on an avGFP landscape. Maximum fitness (fluorescence intensity) was improved by 2-fold over the maximum value in the starting pool of sequences after 10 cycles with 100 sequences each, with a relatively low mutation count (<10). On the same data, AdaLead performs slightly worse (fitness score 3.12 for PEX vs 2.61 for AdaLead), also with a mutation count below 15 for its best-performing variant. On an adeno-associated virus VP1 protein dataset, AdaLead's best variant had a mutation count of over 20 and a fitness score of 3.58, while PEX kept the mutation count below 5 with a score of 4.45. The paper above offers an implementation of the PEX explorer.
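The frontier idea can be sketched as a simple Pareto filter over (edit distance, predicted fitness) pairs: a candidate survives unless some other candidate is at least as close to the wild-type and at least as fit. This is an illustrative simplification, not the paper's exact formulation:

```python
def proximal_frontier(candidates):
    """Keep candidates for which no other candidate has both a smaller-or-equal
    edit distance to wild-type and a higher-or-equal predicted fitness.
    `candidates` is a list of (distance, predicted_fitness, seq) tuples."""
    frontier = []
    for d, f, s in candidates:
        dominated = any(d2 <= d and f2 >= f and (d2, f2) != (d, f)
                        for d2, f2, _ in candidates)
        if not dominated:
            frontier.append((d, f, s))
    return frontier

# Hypothetical candidates: (edit distance to WT, surrogate fitness, id)
cands = [(1, 0.5, "s1"), (2, 0.4, "s2"), (2, 0.8, "s3"),
         (5, 0.9, "s4"), (6, 0.7, "s5")]
front = proximal_frontier(cands)  # s2 and s5 are dominated and dropped
```

Selecting along this frontier, rather than taking the globally top-scored sequences, is what keeps PEX's proposals close to the wild-type.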
Both PEX and AdaLead can achieve relatively similar maximum fitness scores with the same number of oracle evaluations (1000). However, this is clearly beyond our constraint of 200. We recommend you use these two as benchmarks for any custom sequentially-optimized adaptive model you're building, on either simulated fitness landscapes (using the NK model, for example) or available ones. Ensure your method is robust by testing it on several ground-truth landscapes. If it can beat these two evolutionary algorithms while reaching your fitness fold-change threshold within 200 total candidates, it could be used in a real active learning setting.
Code: https://github.com/churchlab/low-N-protein-engineering
Biswas and colleagues engineered a pre-training method using unlabeled data that can be fine-tuned with a small enough number of samples (N) to qualify as a heuristics-guided sequential optimization scheme.
Sequences evolutionarily related to the optimization target are first mined into an unsupervised training dataset for a language model and used for "evotuning". In this case UniRep (Alley et al., 2019) was used, but the approach could easily be adapted to more recent models like ESM2 (Lin et al., 2023) or ESM3 (Hayes et al., 2024).
The model is subsequently fine-tuned given a small amount of labelled data (N = 24 or 96 random seed sequences) and used to navigate a fitness landscape with a Markov-chain Monte Carlo sampling scheme and greedy selection.
Finally, following ranking and a threshold of at most 15 mutations distance from the wild-type, 300 optimized designs are selected for a single round. From these, 10% were hits, with a fitness higher than the wild-type's. Further restricting the mutation distance (to 7) enabled a hit rate of 18% with N = 96 and 1.8% with N = 24. We believe this approach could plausibly be employed in a sequential optimization platform given its decent hit rates in the low-N training regime; however, it doesn't have the theoretical interpretability of "proper" BO methods and might be sensitive to initialization.
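A Metropolis-style sampler of the kind described above fits in a few lines. The toy surrogate below (similarity to a hypothetical target sequence) merely stands in for the fine-tuned language model's fitness predictions:

```python
import math, random

random.seed(1)
AAS = "ACDEFGHIKLMNPQRSTVWY"

def propose(seq):
    """Propose a single random point mutation."""
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(AAS) + seq[i + 1:]

def mcmc_optimize(seed, surrogate, steps=2000, temperature=0.05):
    """Metropolis sampling over sequence space: always accept uphill moves,
    accept downhill moves with probability exp(delta / T) to escape local optima."""
    current, f_current = seed, surrogate(seed)
    best, f_best = current, f_current
    for _ in range(steps):
        cand = propose(current)
        f_cand = surrogate(cand)
        if f_cand >= f_current or random.random() < math.exp((f_cand - f_current) / temperature):
            current, f_current = cand, f_cand
            if f_current > f_best:
                best, f_best = current, f_current
    return best, f_best

# Hypothetical stand-in surrogate: fraction of positions matching a target
target = "MKVLAYHEW"
surrogate = lambda s: sum(a == b for a, b in zip(s, target)) / len(target)
best, f_best = mcmc_optimize("MAAAAAAAA", surrogate)
```

In the actual scheme, the chain runs from seed sequences and the greedy selection plus mutation-distance threshold is applied to the sampled pool afterwards.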
Despite this, when scaled to a greater screening budget (larger batches than normally seen in classical BO), such “almost fixed” model loops with a surrogate-guided random search have been shown to effectively find high fold-change variants.
For example, Li and colleagues achieved a 28.7-fold improvement in antibody binding affinity over the training set following MLDE (Machine Learning-assisted Directed Evolution) using an ensemble of fitness predictors, a protein language model and a greedy hill-climb sampling strategy. This required pre-training on a library of 22,188 heavy chain sequences with binding affinities (Engelhart et al., 2022). We can imagine how these models could be employed in sequential optimization loops given large enough batches or numbers of iterations. Another option would be likelihood-based directed evolution - Hie and colleagues leveraged this to obtain an antibody with a 5.1-fold affinity increase against the Omicron BA.1 receptor-binding domain. You can play with the same technique here, or with more recent ones employing inverse folding (structure-to-sequence, Shanker et al., 2024) models here.
If your screening budget is a bit larger than our assumptions, or you already have a substantial screened library, you could fine-tune some PLMs. With model repositories like HuggingFace and exhaustive online docs, it is now quite straightforward to do so.
Code: https://github.com/idmjky/EvolvePro
Jiang and colleagues achieved impressive results with their EVOLVEpro active learning framework which makes use of ESM2 embeddings and iterative optimization of a random forest (RF) surrogate.
They managed to evolve a miniature CRISPR nuclease with a 2.2-44-fold increase in the indel%, an antibody with 63% better binding to the SARS-CoV-2 spike protein, a 4-fold improved integrase, and even a T7 polymerase with a 7-fold increase in expression and a 2-fold decrease in product immunogenicity.
Starting from a set of seed sequences, an initial round of experimental validation is performed (N = 10-16 variants), and the surrogate is trained on this data. All single-point mutants of the seed sequences are then embedded, their fitness is predicted with the RF surrogate, and the top-ranked mutants are selected.
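One round of this rank-and-select step might look like the sketch below, with random vectors standing in for the ESM2 embeddings and scikit-learn's random forest as the surrogate (the real pipeline differs in its data and feature handling):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins: real embeddings would come from ESM2, labels from the wet lab
emb_dim = 32
train_emb = rng.normal(size=(12, emb_dim))    # ~12 validated variants so far
train_fitness = rng.uniform(size=12)

mutant_emb = rng.normal(size=(500, emb_dim))  # all single-point mutants, embedded

# Fit the RF surrogate on validated data, then rank every candidate mutant
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(train_emb, train_fitness)
scores = rf.predict(mutant_emb)
next_batch = np.argsort(scores)[::-1][:12]    # top 12 go to the wet lab
```

After validation, the new measurements are appended to the training set and the forest is refit, closing the active learning loop.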
For the CRISPR nuclease evolution, 12 mutants were screened for 5 rounds, yielding the impressive fold-change improvement with a minimal experimental burden of 60 total variants. A similar number of total variants was screened across the other optimization campaigns (e.g., 96 variants across 8 iterations for the Bxb1 integrase). If these impressive results hold up - which one might hope for given the solid experimental validation in the paper - EVOLVEpro appears well suited for our budget-constrained use case.
The code is accessible here and is easy to understand if you are in the field. We think EVOLVEpro, alongside AdaLead and PEX (all of them sequentially-optimized models with greedy selection), will become the baselines to beat for any optimization endeavour.
Code: https://github.com/jsunn-y/ALDE
Within their optimization framework ALDE (Active Learning-assisted Directed Evolution), Yang and colleagues compared active learning loops using Bayesian methods (acquisition functions, uncertainty-aware surrogates) with simple machine learning-assisted directed evolution.
They tackled an epistatic fitness landscape for a biocatalyst based on the protoglobin ParPgb. Epistasis represents a major challenge in protein engineering: fitness cannot be decomposed as a simple sum of each mutation's effect, and some combinations might greatly increase it or drastically reduce it (Lipsh-Sokolik & Fleishman, 2024).
ALDE benchmarked several sequence encoding schemes (AAIndex, Georgiev, one-hot, ESM2), fitness surrogates (boosting ensemble, Gaussian Processes, deep neural network ensemble/ DNN, deep kernel learning), acquisition functions (greedy, upper confidence bound, Thompson sampling) on two ground-truth landscapes - GB1 (Wu et al., 2016) and TrpB (Johnston et al., 2024). The most successful combination was one-hot encodings, a DNN ensemble, and Thompson sampling.
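Thompson sampling with an ensemble surrogate can be approximated by drawing a random ensemble member for each batch slot and taking its top remaining candidate, so that exploration follows the ensemble's disagreement. Below, a tiny linear ensemble stands in for ALDE's DNN ensemble, and random vectors for the encoded variants:

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_batch(models, candidates, batch_size):
    """Ensemble-based Thompson sampling: for each slot in the batch, draw one
    ensemble member at random and take its top-scoring unused candidate."""
    chosen = []
    for _ in range(batch_size):
        model = models[rng.integers(len(models))]
        scores = model(candidates)
        scores[chosen] = -np.inf          # don't pick the same variant twice
        chosen.append(int(np.argmax(scores)))
    return chosen

# Stand-in "ensemble": linear heads with different random weights,
# mimicking the disagreement of bootstrap-trained networks
X = rng.normal(size=(50, 8))              # 50 encoded candidate variants
models = [(lambda X, w=rng.normal(size=8): X @ w) for _ in range(5)]

batch = thompson_batch(models, X, batch_size=10)  # next round's screen
```

Because each slot consults a different (random) member, high-disagreement candidates get sampled roughly in proportion to how often a member believes they are best, which is what makes Thompson sampling naturally balance exploration and exploitation.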
Adapted to the biocatalyst evolution task, two rounds of optimization increased the cyclopropanation yield of the lead variant from 12% to 99%, with better cis:trans selectivity (14:1). Overall, the entire campaign required only 396 experimental calls (216 seed sequences and 90 per round), yielding a variant with an 8.5-fold fitness increase.
ALDE identified a promising recipe for parsing epistatic landscapes. The paper's code is accessible and well-documented, and you should find it easy to implement from scratch if you are familiar with active learning and Bayesian optimization. These validation requirements are moderately beyond our budget - one could explore reducing the number of seed variants, but this could have a detrimental effect on overall convergence. As stated in the paper, zero-shot fitness predictors (PLMs, or mutation-effect predictors like EVmutation, Hopf et al., 2017) could be used to design the starting library if the target property you want to optimize is captured by evolutionary (as predicted by EVmutation) or stability-based (as predicted by ESM2, owing to its training data) metrics. Similarly, a recent pre-print from JURA Bio showed that variational synthesis models can generate a more informative library than random mutagenesis/ NNK libraries (Weinstein et al., 2024).
We hope this blog post clarified some similarities and key differences between the classes of machine learning models in protein design, offered insights into the currently used approaches, and whetted your appetite to explore model-based sampling.
Now what are you waiting for? Start optimizing those binders!