The Need For An AI “Game” in Multiple Sequence Alignments for Genomics

Gregory Scott Muhs
9 min read · Oct 3, 2024


Tonight I had a conversation with ChatGPT, trying to find tools that I could use to align multiple bacterial genomes.

Backstory

Recently, I downloaded multiple genomes from bacterial chromosomes and was surprised at how different they were from one another, despite being from the same species. I loaded these genomes into MEGA 11, expecting to be able to align them. Yet, the program kept crashing.

I decided to do two things: 1.) Perform the alignments by highlighting portions of the genomes at a time. This semi-manual approach has taken the better part of the day, and I am maybe a quarter of the way done.

2.) Find better tools to perform this process. While chatting with ChatGPT (after my Google searches), I realized that many of the tools currently in use are antiquated. Many of the MSA tools being suggested date from the early 2000s or early 2010s, and at least one is no longer maintained.

Gamify The Problem

Throughout all this, one idea kept coming to mind: a game in which the player inserts “gaps” into the sequences and earns “points” as the alignment “score” improves. I thought about DeepMind, whose AI systems have been given video games and learned to play them on their own. (Clearly, an amazing advancement since the days of Deep Blue, which could only play chess.)

Yes… this computer teaches itself how to play video games:

How AI Is Unlocking the Secrets of Nature and the Universe | Demis Hassabis | TED

After a long conversation with ChatGPT, I finally said the following:

In the day and age of AI, this all seems antiquated. I heard that there is an AI that can learn to play any game that it is given. I think that it is called “Deep Mind” — could an alignment task be converted into a game that an AI can learn to “play” in order to create the best multiple sequence alignments?

I was thinking of games such as Foldit, which tackles protein folding and assigns points as virtual protein structures improve. Since this tackles moving amino acids in 3D space, moving letters along a 1-dimensional axis should be much more doable.

Chat gave several long responses to the effect that similar bioinformatics AIs are in early development. These were the key paragraphs:

Current Status:

As of now, if you’re looking for a fully AI-powered tool to take an unaligned FASTA file of multiple genomes and perform a high-quality MSA automatically, there isn’t a widely available tool that fits that description exactly. However, traditional tools like MAFFT, MUSCLE, or ClustalW are still among the best for this task, and there are ongoing efforts to incorporate AI-driven methods into these tools.

Summary:

At this stage, no fully AI-based tool is available that can directly take an unaligned FASTA file and perform MSAs on multiple genomes from scratch. However, some promising developments in AI and bioinformatics, including tools like DeepAlign and MSA Transformer, show that AI-enhanced MSA is an area of active research. For now, traditional MSA tools with AI-inspired features remain the best option, though fully AI-powered MSA tools may become more common in the future.

Finally, I asked the following:

What would it take to create a tool like what I am describing? An AI similar to “Deep Mind” that can create MSAs.

Chat gave two responses at this time, and I chose the following:

Creating an AI tool like **DeepMind’s AlphaZero** but designed to perform **multiple sequence alignments (MSA)** would be a groundbreaking development in bioinformatics. The key challenge is to adapt the AI to “play” the alignment process like a game, learning to optimize the placement of sequences, gaps, and matches across multiple genomes. Below is a breakdown of the requirements, challenges, and opportunities to create such a tool.

### 1. **Define the Problem as a Game**:
To create an AI that can perform MSAs effectively, the task must be framed as a problem that the AI can learn and optimize, similar to how **AlphaGo/AlphaZero** learned to play Go or chess.

- **State Space**: The state in this game would be the current configuration of sequences, gaps, matches, and mismatches in the alignment. This state space is vast, as it involves every possible arrangement of the genomes.

- **Actions**: The AI’s actions would be operations like introducing gaps, shifting sequences, or matching particular regions. These actions modify the state of the alignment.

- **Rewards**: The AI needs a reward function to guide it toward better alignments. This function would likely be based on biological criteria, such as:
  - **Maximizing sequence identity**.
  - **Preserving conserved regions**.
  - **Minimizing the number of gaps**.
  - **Aligning functionally or structurally important motifs**.
  - **Respecting evolutionary constraints** (e.g., using phylogenetic information).

- **Goal**: The AI’s ultimate objective is to generate an MSA that maximizes alignment quality, given these rewards, across multiple genomes.
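
As an illustration (not part of the ChatGPT response), the state, action, and reward described above might be sketched in Python roughly as follows. The names `pad`, `insert_gap`, and `sum_of_pairs`, as well as the match, mismatch, and gap values, are assumptions chosen for this toy example. A simple sum-of-pairs score stands in for the richer biological criteria listed above.

```python
from itertools import combinations

GAP = "-"

def pad(rows):
    """Right-pad rows with gaps so every row has the same length."""
    width = max(len(r) for r in rows)
    return [r + GAP * (width - len(r)) for r in rows]

def insert_gap(rows, row, col):
    """Action: insert one gap into rows[row] before position col."""
    new_rows = list(rows)
    new_rows[row] = rows[row][:col] + GAP + rows[row][col:]
    # Drop old trailing padding, then re-pad to a common width.
    return pad([r.rstrip(GAP) for r in new_rows])

def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Reward: score every pair of rows column by column (toy values)."""
    total = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            if x == GAP and y == GAP:
                continue  # a gap facing a gap carries no information
            elif x == GAP or y == GAP:
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

# State: the current rows of the alignment.
state = pad(["ACGT", "AGT"])
before = sum_of_pairs(state)
state = insert_gap(state, row=1, col=1)
after = sum_of_pairs(state)
print(before, after, state)
```

In this tiny case, one well-placed gap lines up three letters that were previously misaligned, so the reward goes up. That change in score is the kind of signal a learning agent would be trained on.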

### 2. **Reinforcement Learning (RL) Framework**:
The AI would likely use **reinforcement learning (RL)**, which is the same technique behind **AlphaZero** and other game-playing AIs. Here’s how RL could be applied to MSA:

- **Agent**: The AI would act as an “agent” making decisions about where to place sequences and gaps in the alignment.

- **Environment**: The environment is the collection of sequences and the current state of the alignment.

- **Training**: The AI would be trained through millions of iterations of aligning sequences. Each time it aligns sequences (or makes a partial alignment), it would receive a reward (or penalty) based on the quality of the alignment. Over time, the AI would learn to make more accurate and biologically meaningful decisions.
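
As an illustration (not part of the ChatGPT response), the agent and environment roles above could be wired together as in the sketch below. The `AlignmentEnv` class, its `reset`/`step` methods, and the `column_score` function are hypothetical names invented for this example. The “agent” here only picks random moves, so this shows the interaction loop, not actual learning; a real reinforcement learning agent would replace the random choice with a learned policy.

```python
import random

GAP = "-"

def column_score(rows):
    """Count columns where at least two letters are present and all agree."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width, GAP) for r in rows]
    score = 0
    for col in zip(*padded):
        letters = [c for c in col if c != GAP]
        if len(letters) >= 2 and len(set(letters)) == 1:
            score += 1
    return score

class AlignmentEnv:
    """Toy environment: the state is the list of rows, a move inserts one gap."""

    def __init__(self, sequences, max_steps=20):
        self.sequences = list(sequences)
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.rows = list(self.sequences)
        self.steps = 0
        return tuple(self.rows)

    def actions(self):
        """Every (row, column) where a gap may be inserted."""
        return [(r, c) for r, row in enumerate(self.rows) for c in range(len(row))]

    def step(self, action):
        row, col = action
        old = column_score(self.rows)
        seq = self.rows[row]
        self.rows[row] = seq[:col] + GAP + seq[col:]
        self.steps += 1
        reward = column_score(self.rows) - old  # reward = change in score
        done = self.steps >= self.max_steps
        return tuple(self.rows), reward, done

def run_episode(env, rng):
    """Play one episode with a random policy and return the total reward."""
    env.reset()
    total, done = 0, False
    while not done:
        action = rng.choice(env.actions())
        _, reward, done = env.step(action)
        total += reward
    return total

env = AlignmentEnv(["ACGT", "AGT"], max_steps=5)
print(run_episode(env, random.Random(0)))
```

Because each reward is the change in score, the rewards in an episode add up to the final score minus the starting score.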

### 3. **Training Data**:
- **MSA Datasets**: The AI would need a large training dataset of pre-aligned sequences, such as from protein families or whole genomes, to learn from. Datasets like **Pfam**, **Rfam**, or **TreeFam** could be used to train the AI on what good alignments look like.

- **Synthetic Data**: The AI could also be trained on synthetic datasets with known alignments, which could help it learn faster by providing clear examples of optimal solutions.

- **Phylogenetic Trees**: Including evolutionary data, such as phylogenetic trees, would allow the AI to better understand evolutionary relationships between sequences, helping to guide the alignment.

### 4. **Model Architecture**:
The architecture of the AI would be critical to handling the complexity of MSA tasks. Here are some possible approaches:

- **Transformer Architecture**: Like **MSA Transformer**, this model could use an attention-based mechanism to learn which parts of the sequence are most important to align. Attention models are effective at handling long-range dependencies, which is important for aligning distant regions in large genomes.

- **Deep Learning**: Deep learning models could be trained to recognize patterns of conservation, divergence, and functional motifs across multiple genomes. A deep neural network could learn features from existing alignments and generalize them to new, unseen data.

- **Convolutional Neural Networks (CNNs)**: CNNs could be used to scan through sequences, detecting conserved regions and generating optimal alignments.

### 5. **Scalability**:
One of the biggest challenges is scaling the AI to handle **large genome datasets**. Unlike Go or chess, where the state space is limited by the rules of the game, genome alignment involves billions of potential moves, especially for large, complex datasets. To overcome this:

- **Parallel Processing**: The AI could be trained using **massive parallel processing** on GPUs or TPUs, allowing it to explore multiple alignment strategies simultaneously.

- **Divide and Conquer**: The AI could break the alignment task into smaller, manageable chunks, aligning regions of genomes independently and then stitching them together — this is somewhat akin to **progressive alignment** strategies used in traditional tools like **MAFFT**.

### 6. **Algorithmic Enhancements**:
The AI could also be augmented with additional algorithmic techniques to improve performance:

- **Genetic Algorithms**: Inspired by evolutionary processes, the AI could apply genetic algorithms to evolve better alignments over time, mutating and recombining sequence alignments to find optimal solutions.

- **Monte Carlo Tree Search (MCTS)**: This technique, used by **AlphaZero**, could be employed to explore different alignment possibilities, with the AI iteratively refining its search for the best alignment.

- **Simulated Annealing**: The AI could use simulated annealing, where it allows for suboptimal moves early on in the alignment process (to explore more of the alignment space) and gradually refines its alignments as it converges on the best solution.
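
As an illustration (not part of the ChatGPT response), simulated annealing over gap placements might look roughly like the sketch below. The function names, the scoring values, and the cooling schedule (`t_start`, `t_end`, `steps`) are all assumptions for this toy example and have not been tuned for real genomes.

```python
import math
import random
from itertools import combinations

GAP = "-"

def score(rows, match=1, mismatch=-1, gap=-2):
    """Toy sum-of-pairs score over rows padded to equal length."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width, GAP) for r in rows]
    total = 0
    for a, b in combinations(padded, 2):
        for x, y in zip(a, b):
            if x == GAP and y == GAP:
                continue
            total += gap if GAP in (x, y) else (match if x == y else mismatch)
    return total

def neighbor(rows, rng):
    """Propose a move: insert one gap, or delete one existing gap, in a random row."""
    rows = list(rows)
    i = rng.randrange(len(rows))
    gaps = [k for k, ch in enumerate(rows[i]) if ch == GAP]
    if gaps and rng.random() < 0.5:
        k = rng.choice(gaps)
        rows[i] = rows[i][:k] + rows[i][k + 1:]
    else:
        k = rng.randrange(len(rows[i]) + 1)
        rows[i] = rows[i][:k] + GAP + rows[i][k:]
    return rows

def anneal(seqs, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Accept worse moves with a probability that shrinks as the temperature cools."""
    rng = random.Random(seed)
    current = list(seqs)
    current_score = score(current)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down toward t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        candidate = neighbor(current, rng)
        delta = score(candidate) - current_score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

best, best_score = anneal(["ACGT", "AGT"])
print(best, best_score)
```

The moves only ever add or remove gaps, so the letters of each sequence are never altered, and the best alignment found can never score lower than the starting point.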

### 7. **Incorporating Biological Knowledge**:
To make sure the AI’s alignments are biologically relevant, it must incorporate biological domain knowledge:

- **Evolutionary Models**: The AI should be aware of evolutionary relationships between the sequences, using models like **maximum likelihood** or **Bayesian methods** to guide alignment decisions.

- **Structural Constraints**: If aligning protein sequences, the AI could use information about **protein structures** to ensure that functional motifs are properly aligned.

- **Functionality Preservation**: For genomes, the AI could incorporate knowledge about conserved genes, regulatory regions, and other biologically important features to ensure that they are correctly aligned.

### 8. **Validation and Testing**:
Once developed, the AI would need to be rigorously tested and validated:

- **Benchmarking**: The AI’s alignments would be compared against alignments generated by existing tools like **MAFFT**, **MUSCLE**, or **ClustalW** to ensure they are at least as accurate.

- **Real-World Genomic Datasets**: The AI would be tested on real-world datasets, such as microbial genomes, plant genomes, or mammalian genomes, to ensure it can handle the complexity and variability of real sequences.

### 9. **User Interface and Deployment**:
For the AI to be useful in practice, it needs to be accessible to researchers:

- **Web Interface**: The AI could be deployed as a cloud-based tool where users can upload unaligned FASTA files and receive high-quality MSAs.

- **Command-Line Interface (CLI)**: Advanced users could interact with the AI through a command-line tool that allows for more flexibility and customization.

- **API Access**: The AI could also be available as an API, allowing other tools to integrate it into bioinformatics pipelines.

### 10. **Challenges and Opportunities**:
#### a. **Computational Resources**:
- **Training** an AI on this scale requires enormous computational resources, particularly for large genome datasets. The AI would likely need access to **GPU/TPU clusters** or **high-performance computing** environments.

- **Efficiency**: AI models, while powerful, can be computationally expensive to train and run. Careful design choices would be needed to balance alignment quality and runtime performance.

#### b. **Data Complexity**:
- Handling the complexity of biological data, including **repetitive elements**, **conserved motifs**, and **variable regions**, would require highly specialized reward functions and training datasets.

### 11. **Future Prospects**:
If successful, such an AI could:
- **Revolutionize MSA** by producing faster, more accurate alignments, especially for large, complex genomes.
- **Handle divergent sequences** that are difficult to align with traditional methods.
- **Scale effortlessly** across massive datasets, reducing the time and computational resources needed for alignment tasks.

### Summary:
To create an AI for multiple sequence alignment, like **DeepMind** for MSAs, you would need:
- A clear **game-like formulation** of the alignment task.
- **Reinforcement learning** or **deep learning** methods to train the AI.
- A **reward system** that encourages biologically accurate alignments.
- Access to **large-scale training data** and **high-performance computational resources**.
- Collaboration with experts in **machine learning** and **bioinformatics** to design the architecture, train the model, and integrate biological knowledge.

This approach is feasible with today’s AI technology, but it would require significant development effort and computational power. It could represent a major breakthrough in genomics and bioinformatics.

Would you like help exploring any aspects of this development, or would you be interested in working with available tools as a starting point?

Conclusions

In my humble opinion, a “game” should be developed around the concept of moving “chunks” of letters back and forth, like beads on an abacus, to earn the highest possible “score.” Such a game would not have to be entertaining for a human mind; it just has to be conceptually easy and repetitive. Early versions of this game would not have to understand biochemistry, evolutionary motifs, or any other complex details. They would just have to let the “player” (i.e., the eventual AI) move letters back and forth to build up a score.
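
To make the abacus idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `points`, `slide`, and `greedy_player` are made up, a “chunk” is simply the tail of one row, and the “player” is a plain greedy search standing in for the eventual AI. The game knows nothing about biochemistry; it only counts columns where every row shows the same letter.

```python
GAP = "-"

def points(rows):
    """One point per column in which every row shows the same letter."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width, GAP) for r in rows]
    return sum(1 for col in zip(*padded) if col[0] != GAP and len(set(col)) == 1)

def slide(rows, row, start, shift):
    """Move: slide the chunk rows[row][start:] to the right by `shift` columns."""
    moved = list(rows)
    moved[row] = rows[row][:start] + GAP * shift + rows[row][start:]
    return moved

def greedy_player(rows, max_shift=3):
    """Repeatedly take the single slide that earns the most points; stop when none helps."""
    rows = list(rows)
    best = points(rows)
    while True:
        moves = [slide(rows, r, s, k)
                 for r in range(len(rows))
                 for s in range(len(rows[r]))
                 for k in range(1, max_shift + 1)]
        top = max(moves, key=points)
        if points(top) <= best:
            return rows, best
        rows, best = top, points(top)

board, score = greedy_player(["ACGT", "AGT"])
print(board, score)  # ['ACGT', 'A-GT'] 3
```

A greedy player like this gets stuck easily on anything larger than a toy board, which is exactly where a player that learns from experience could do better.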

It would also have to handle very large files (very large “abacus” boards) in a way that minimizes how much memory is in use at any given time. This way the process can be done a little at a time and build on itself. (In technical terms, the use of recursion and/or iteration would be key.)
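
One way to work on a huge file a little at a time is to iterate over it in fixed-size windows instead of loading it whole. The sketch below assumes FASTA input and uses a made-up generator name, `fasta_windows`, with an arbitrary `window` size; only one window plus the current line is held in memory at once.

```python
import io

def fasta_windows(handle, window=1000):
    """Yield (header, offset, chunk) pieces of at most `window` letters,
    reading the file line by line."""
    header, offset, buffer = None, 0, ""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if buffer:  # flush what is left of the previous record
                yield header, offset, buffer
            header, offset, buffer = line[1:], 0, ""
            continue
        buffer += line
        while len(buffer) >= window:
            yield header, offset, buffer[:window]
            offset += window
            buffer = buffer[window:]
    if buffer:
        yield header, offset, buffer

# A tiny in-memory stand-in for a real file opened with open(path).
demo = io.StringIO(">g1\nACGTAC\nGT\n>g2\nACG\n")
for piece in fasta_windows(demo, window=4):
    print(piece)
```

Each piece carries its offset within the sequence, so scores computed on one window can be recorded and built upon when the next window is processed.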

Image credit: Pixabay

A “game” like this would offer a new way to approach bioinformatics problems that are currently time-consuming, and it would hopefully open the door to similar tools for solving other problems down the road.
