Save Sequences to Workspace
Without a centralized repository, sequence data can become disorganized, making it difficult to store, compare, and analyze. The existing MSA limits are too low for many workflows involving longer sequences and larger result sets. You can now save up to 100,000 sequences to a Workspace, so that all your work is organized and managed in one place. Workspaces also make collaboration more efficient by letting you share different types of data with colleagues. Please note that historical MSA data will be automatically migrated to a Workspace.
To save sequences to a workspace, select the sequences of interest (or all sequences); a pop-up will open. In the pop-up, click 'Save Sequence'. A second pop-up will then open where you can select the workspace to save the sequences to.
Other Functionalities
Some other functionalities of the sequence management platform include:
- Import Sequences via other identifiers (Seq Code, GenbankID, CAS RN) in Workspace.
- Renaming the Seq Code to a more specific or descriptive name to make it easier to find the data you need. The Seq Code is Patsnap-specific so renaming this can make it clearer to the user what the sequence is.
- Add comments to your sequences. Collaborate with colleagues by sharing insights, asking questions, or providing feedback. Please note that the upper limit of a single sequence export is 50,000.
- Compare up to 100 sequences at a time with the ‘Align Sequence’ view. This generates an Alignment Report and replaces the previous MSA feature.
Please note that Multi-sequence alignment reports need to be saved manually. Saved reports can then be accessed in the sidebar under Alignment Report.
Sequence Alignment Algorithms
When users input multiple sequences, they can choose between two alignment modes: "Multi-sequence alignment" and "Pair-sequence alignment". Multi-sequence alignment is selected by default. Under the Multi-sequence alignment option, users can select from two alignment algorithms: Clustal Omega and MAFFT. Clustal Omega is used by default, and the alignment result is regenerated when the user switches to the other option. Each alignment algorithm is described below.
In the Pair-sequence alignment mode, the first sequence is the template sequence by default; you can change this using the drop-down list. The available alignment algorithms are Local Alignment (Smith-Waterman), Global Alignment (Needleman-Wunsch), Semi-global alignment, and Blast. Local Alignment is used by default. When a Global or Semi-global Alignment algorithm is used, the selected target sequences are displayed in groups. Each of these algorithm types is described below.
1. Clustal Omega
Clustal Omega is a multiple-sequence alignment program for aligning three or more sequences together in a computationally efficient and accurate manner. It uses a progressive alignment strategy, where sequences are first grouped in pairs and aligned based on pairwise alignments, and then these pairwise alignments are combined to create a final multiple-sequence alignment. The algorithm employs a combination of heuristics and iterative refinement steps to optimize the alignment, making it efficient for aligning large numbers of sequences. The different parameter options include the following:
- Cluster size: Soft maximum of sequences in sub-clusters.
- Full Matrix: Use full distance matrix for guide-tree calculation (might be slow; mBed is the default).
- Full: Use full distance matrix for guide-tree calculation during iteration (might be slow; mBed is the default).
- Iteration Matrix: Number of (combined guide-tree/HMM) iterations.
- Iterations Guide Tree: Maximum number of guide-tree iterations.
- Iterations HMM: Maximum number of HMM iterations.
- Iterations Transitivity: use transitivity.
- Kimura: Use Kimura distance correction for aligned sequences.
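As an illustration of what a Kimura distance correction does, the sketch below computes the fraction of differing residues between two already-aligned sequences and applies the commonly cited Kimura approximation, d = -ln(1 - p - p²/5). This is a minimal sketch: the function names are hypothetical, and it is an assumption that the platform's Kimura option behaves like this formula.

```python
import math


def p_distance(aligned_a, aligned_b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def kimura_distance(p):
    """Kimura approximation: corrects the observed difference p for multiple substitutions."""
    inner = 1 - p - p * p / 5
    if inner <= 0:
        raise ValueError("p is too large for the approximation")
    return -math.log(inner)
```

The corrected distance grows faster than the observed difference, reflecting that some positions may have changed more than once.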
2. MAFFT
MAFFT (Multiple Alignment using Fast Fourier Transform) is a widely used algorithm for multiple sequence alignment, a bioinformatics technique that aligns multiple biological sequences, such as DNA, RNA, or protein sequences, to identify conserved regions and infer evolutionary relationships. MAFFT is known for its efficiency and accuracy in aligning large numbers of sequences. It uses a progressive alignment strategy, where sequences are first grouped in pairs and aligned based on pairwise alignments, and then these pairwise alignments are combined to create a final multiple-sequence alignment. MAFFT also incorporates a Fast Fourier Transform (FFT) technique, which allows for rapid calculation of sequence similarity scores, making it computationally efficient for aligning large datasets. The different parameter options include the following:
- FFT: Use Fast Fourier Transform approximation in group-to-group alignment. A mathematical technique used by MAFFT to accelerate the alignment process.
- Penalty Offset: Adjust the gap penalty offset by this amount. It can be used to fine-tune the alignment by adding a constant value to the gap penalty.
- Gap Penalty: Gap opening penalty.
- Matrix: Comparison matrix to be used when adding sequences to the alignment.
- Sparse Pickup Iterations: Number of pickup iterations.
- Tree Building Iterations: The number of iterations the guide tree is built in the progressive stage. Valid with 6mer distance.
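To give a feel for the "6mer distance" mentioned above, the sketch below estimates how different two unaligned sequences are from the number of 6-letter words they share. It is a simplified illustration only: the function names are hypothetical, and MAFFT's actual implementation may differ in details (for example, in how residues are grouped before counting).

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-letter word in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """1 minus (shared k-mers / k-mers in the shorter sequence); 0 means identical word content."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    smaller = min(sum(counts_a.values()), sum(counts_b.values()))
    if smaller == 0:
        raise ValueError("sequences must be at least k letters long")
    return 1 - shared / smaller
```

Because no alignment is needed, a distance like this can be computed quickly for every pair of sequences and then used to build the guide tree.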
3. Local Alignment (Smith-Waterman)
Local alignment, also known as the Smith-Waterman algorithm, is a widely used pairwise sequence alignment algorithm in bioinformatics and computational biology. It identifies regions of local similarity between two sequences, such as proteins or DNA/RNA sequences. Rather than aligning the entire sequences from start to end, it finds the best local alignment, that is, the most similar sub-regions or segments within the sequences. This makes it particularly useful for identifying conserved regions or functional domains in proteins, or local sequence similarities in DNA/RNA sequences, such as exons or regulatory regions. The different parameter options include the following:
- Gap Extent: Whole number penalty that must be >=0
- Gap Open: Whole number penalty that must be >=0
- Match Score: Must be >=0
- Mismatch Score: Must be >=0
- Substitution Matrix: Score matrix for aligning any possible pair of residues.
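The parameters above can be illustrated with a small local-alignment scorer. This is a minimal sketch, not the platform's implementation: the function name and default values are hypothetical, a simple match/mismatch score stands in for a substitution matrix, and a gap of length k is assumed to cost gap_open + (k - 1) * gap_extend.

```python
def smith_waterman_score(a, b, match=2, mismatch=1, gap_open=3, gap_extend=1):
    """Best local alignment score with affine gaps.

    Penalties are passed as non-negative numbers and subtracted from the score.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols  # same, but ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # best score ending with a gap in a
        for j in range(1, cols):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            sub = match if a[i - 1] == b[j - 1] else -mismatch
            diag = h_prev[j - 1] + sub
            # The floor of 0 lets a local alignment restart anywhere.
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

For example, two sequences that share only the segment ACGT score 8 with these defaults (four matches at 2 each), regardless of what flanks that segment.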
4. Global Alignment (Needleman-Wunsch)
Global alignment, also known as the Needleman-Wunsch algorithm, is a widely used pairwise sequence alignment algorithm in bioinformatics and computational biology. It finds the best alignment between two sequences, such as proteins or DNA/RNA sequences, over their entire length. The sequences are aligned from start to end, including gaps (insertions/deletions), and the algorithm returns the optimal alignment that maximizes a given scoring scheme. This makes it particularly useful for comparing two sequences in their entirety and identifying regions of similarity or dissimilarity. The different parameter options include the following:
- Gap Extent: Whole number penalty that must be >=0
- Gap Open: Whole number penalty that must be >=0
- Match Score: Must be >=0
- Mismatch Score: Must be >=0
- Substitution Matrix: Score matrix for aligning any possible pair of residues.
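A global alignment can be sketched in a few lines. The example below is illustrative only: the function name and defaults are hypothetical, and for brevity it assumes a single linear gap penalty (Gap Open equal to Gap Extent) and a simple match/mismatch score instead of a substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=1, gap=2):
    """Return (score, aligned_a, aligned_b) for a global alignment with linear gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # End gaps are penalized: the alignment must span both sequences entirely.
    for i in range(1, rows):
        score[i][0] = -gap * i
    for j in range(1, cols):
        score[0][j] = -gap * j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else -mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] - gap,
                              score[i][j - 1] - gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else -mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling needleman_wunsch("ACGT", "AGT") aligns the two sequences end to end and places a single gap opposite the C.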
5. Semi-global Alignment
Semi-global alignment combines features of both global and local alignment. It seeks to find the optimal alignment between two sequences but allows for gaps or insertions/deletions at the beginning or end of sequences while penalizing gaps within the sequences. Semi-global alignment is often used in situations where the sequences being compared have known regions of similarity at the termini but may differ in length or contain gaps within the sequences. It is commonly used in bioinformatics and computational biology for tasks such as gene prediction, sequence assembly, and identification of conserved regions in proteins or nucleotide sequences. The different parameter options include the following:
- Gap Extent: Whole number penalty that must be >=0
- Gap Open: Whole number penalty that must be >=0
- Match Score: Must be >=0
- Mismatch Score: Must be >=0
- Substitution Matrix: Score matrix for aligning any possible pair of residues.
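The difference from global alignment is only in how the ends are treated, which the sketch below tries to show. It is a simplified illustration with a hypothetical function name: gaps at the start or end of either sequence are free, internal gaps use a single linear penalty, and a simple match/mismatch score stands in for a substitution matrix.

```python
def semi_global_score(a, b, match=1, mismatch=1, gap=2):
    """Best alignment score when leading and trailing gaps cost nothing."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and first column stay at 0: leading gaps are free.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else -mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] - gap,   # internal gap in b
                              score[i][j - 1] - gap)   # internal gap in a
    # Trailing gaps are free: take the best cell in the last row or last column.
    last_row = score[-1]
    last_col = [row[-1] for row in score]
    return max(last_row + last_col)
```

With these settings a short sequence contained inside a longer one, or two sequences that overlap at their ends, scores as well as the overlapping part alone.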
6. BLASTn
BLASTn is a widely used algorithm and software tool for performing sequence similarity searches of nucleotide sequences against a nucleotide sequence database. It stands for "Basic Local Alignment Search Tool for nucleotides". It is designed to identify regions of similarity or homology between a query nucleotide sequence and a nucleotide sequence database. It uses a heuristic approach to rapidly search for local alignments, or regions of high similarity, between the query sequence and sequences in the database. The algorithm is based on the principles of local alignment, where short regions of similarity, rather than global alignments, are sought between the query sequence and the database sequences. The different parameter options include the following:
- Match with gaps: This refers to the ability of the BLASTn algorithm to allow for gaps (insertions or deletions) in the alignments between the query sequence and the database sequences.
- Word size: Larger word sizes may increase the speed of the search but may reduce sensitivity to shorter matches, while smaller word sizes may increase sensitivity to shorter matches but may decrease the speed of the search.
- Reward/penalty for match/mismatch: The reward is the score assigned for a match, indicating similarity, while the penalty is the score assigned for a mismatch, indicating dissimilarity. A mismatch penalty that is large relative to the match reward generally results in more stringent matching criteria, which suits highly similar sequences.
- Gap costs: Lower gap costs generally result in more permissive gap alignments, allowing for longer gaps, while higher gap costs result in more stringent gap alignments, allowing for shorter or fewer gaps.
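The effect of word size can be seen in a sketch of the seeding step, where exact word matches between the query and a database sequence are located before any alignment is extended. This is a simplified illustration with hypothetical names, not BLASTn's actual implementation.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) for every exact word shared by both sequences."""
    # Index every word of the subject (database) sequence by its start position.
    index = defaultdict(list)
    for pos in range(len(subject) - word_size + 1):
        index[subject[pos:pos + word_size]].append(pos)
    # Look up each query word in the index.
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for spos in index.get(word, []):
            seeds.append((qpos, spos))
    return seeds
```

With a word size of 4, the query ACGTAC produces three seeds against TTACGTACTT; with a word size of 7 it produces none, because the query is shorter than a single word. This is the trade-off described above: larger words mean fewer lookups but missed short matches.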
Custom Highlighting
A new custom highlighter view has been added to help users better identify regions of conservation and variation between sequences. Users can compare sequences to a reference sequence and highlight matches/mismatches, use 4-color highlighting, or view CDRs.
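The idea behind match/mismatch highlighting can be sketched as follows: each aligned sequence is compared to the reference column by column, matches are replaced with a dot, and mismatches are left visible. The function name and the dot convention are assumptions for illustration; how the platform renders highlights is not described here.

```python
def mask_matches(reference, aligned):
    """Show only the positions where an aligned sequence differs from the reference."""
    if len(reference) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    return "".join("." if r == s else s for r, s in zip(reference, aligned))
```

For a reference ACGT-A and an aligned sequence ACCTTA, only the third and fifth columns remain visible, which makes variable positions easy to spot.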