Welcome to DiMA User Manual

1. About

1.1. Sequence Diversity Dynamics Analyser for Viruses (DiMA)

Viral infectious diseases are a major public health threat. Sequence diversity is one of the major challenges in the design of diagnostic, prophylactic and therapeutic interventions against viruses. The diversity can be an outcome of a combination of underlying evolutionary processes (mutation, recombination, and assortment). A continuing goal is a greater understanding of viral proteome sequence diversity, the dynamics of substitutions, and effective strategies to overcome the diversity for drug or vaccine design.

Herein, we present Diversity Motif Analyser (DiMA), a tool designed to facilitate the quantification and dissection of viral sequence diversity dynamics. DiMA provides a quantitative measure of sequence diversity by use of Shannon’s entropy (PMID: 18698358), applied via a user-defined k-mer sliding window to a protein alignment. DiMA further interrogates the diversity by dissecting the entropy at each aligned k-mer position into diversity motifs (PMIDs: 32518710, 23593157), based on the incidence of distinct k-mer sequences at the position. At a given position, the index is the predominant sequence, and all other distinct k-mers are referred to as total variants of the index. These are sub-classified into the major variant (the most common variant), minor variants (k-mers with an incidence lower than the major variant but seen more than once), and unique variants (k-mers seen only once in the alignment). The description line of each sequence in the input alignment can also be enriched with metadata, such as spatio-temporal information, for inclusion in the analysis. DiMA outputs a JSON file that provides multiple facets of sequence diversity: sequence name, k-mer position, entropy, the distinct k-mers at the position and their incidence, motif classification, and metadata (if available). DiMA enables comparative analyses of sequence diversity dynamics within and between proteins of a virus species, and between proteomes of different species.

DiMA is the outcome of many years of viral studies on several different species since 2007. These studies have been published in peer-reviewed journals (PMIDs: 17434154, 18030326, 18698358, 19401763, 21471731, 22573867, 23593157, 26680743, 29322922, 29363421, 31874646, 34068495 and 32518710) and presented at international conferences. Besides being available as a webserver, DiMA can also be downloaded as a standalone client tool (https://github.com/PU-SDS/DiMA), particularly for big data analyses.

DiMA webserver has been under development since March 2020. It has been extensively tested with 18 datasets (9 structural and 9 non-structural proteins) from six viral species. External validation of our tool has been performed by three individuals, with a total of 36 datasets (18 structural and 18 non-structural proteins), originating from three viral species.

1.2. Accessibility

The webserver is publicly available at: https://dima.bezmialem.edu.tr/

1.3. Browser compatibility

[Figure: browser compatibility]

1.4. Frontend/Backend Frameworks

The DiMA webserver backend is built with Python FastAPI and the frontend with ReactJS.

1.5. Novel features

Table 1. Novel features of DiMA in comparison with other web servers for viral sequence variation analysis
Features	DiMA	PVS	LANL	BV-BRC
Analysis of nucleic acid - amino acid sequences	✅ - ✅	❌ - ✅	✅ - ✅	✅ - ✅
Shannon Entropy on user-defined sliding window	✅	❌	✅	❌
Entropy correction for size bias	✅	❌	❌	❌
K-mer/variant frequency calculation	✅	❌	❌	✅
K-mer diversity motifs classification	✅	❌	❌	❌
Metadata inclusion	✅	❌	❌	❌
Identification of historically conserved sequences	✅	✅	❌	❌
Multiple interactive visualizations	✅	❌	❌	❌
Web service input size limit	100 MB ⭐	~0.2 MB	?	No limit

: ⭐ Larger files can be analysed with the CLI version, which has no size limit.

1.6. Defining diversity motifs

For a given sequence alignment, all sequences at each of the aligned k-mer positions are quantified for distinct sequences and ranked-classified into diversity motifs based on their incidences, as described in Hu et al. (2013) (Supplementary Figure 1, see extract below) (PMID: 23593157).

: Figure 1. Definitions of diversity motifs. The “Index” nonamer is the most prevalent sequence, present in 8 of the 20 isolates. The “Major” variant is the most common variant of the index (5/20). “Minor” variants are the different repeated sequences, each with an incidence lower than that of the major variant. “Unique” variants are those represented by a single aligned sequence. Distinct variant sequences at a given nonamer position are the different sequences at that position; in this example there are one major, two minor, and three unique variants.
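The classification in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration and not DiMA's own implementation: the function name `classify_motifs`, the peptide strings, and the handling of ties (the first k-mer encountered wins) are assumptions made for the example.

```python
from collections import Counter

def classify_motifs(kmers):
    """Map each distinct k-mer at one position to (motif, incidence)."""
    ranked = Counter(kmers).most_common()  # highest incidence first
    labels = {}
    major_assigned = False
    for rank, (kmer, count) in enumerate(ranked):
        if rank == 0:
            labels[kmer] = ("index", count)   # predominant sequence
        elif count == 1:
            labels[kmer] = ("unique", count)  # seen only once
        elif not major_assigned:
            labels[kmer] = ("major", count)   # most common variant
            major_assigned = True
        else:
            labels[kmer] = ("minor", count)   # remaining repeated variants
    return labels

# The 20-isolate layout of Figure 1 (peptide strings are made up).
position = (["SLYNTVATL"] * 8 + ["SLFNTVATL"] * 5 + ["SLYNTIATL"] * 2
            + ["SLYNAVATL"] * 2 + ["SLYNTVAAL", "TLYNTVATL", "SLYQTVATL"])
motifs = classify_motifs(position)
```

With this input, `motifs` holds one index (8/20), one major (5/20), two minor and three unique variants, matching the caption above.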

2. Algorithms

: Figure 2. Workflow schema. Input: Viral protein sequences, typically obtained from publicly available databases (NCBI Virus and GISAID, among others), are aligned and submitted to DiMA in aligned FASTA (.afa) format. Process: DiMA provides a quantitative measure of sequence diversity by use of Shannon’s entropy, applied via a user-defined k-mer sliding window. The entropy value is then corrected for sample size bias by applying a statistical adjustment (Lipinski’s rule). DiMA further interrogates the diversity by dissecting the entropy value at each k-mer position into the distinct k-mer sequences, which are classified into diversity motifs (index, major, minor and unique; see Section 1.6 for the definitions) based on their incidence. Output: The entropy values, diversity motifs, and the metadata corresponding to each k-mer are plotted to provide a panoramic overview of the protein sequence diversity.

2.1. Entropy algorithm

: Figure 3. Entropy algorithm.
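As a rough illustration of the uncorrected part of this calculation, the sketch below computes Shannon's entropy for every k-mer window of a small alignment. It is not DiMA's code: the function names, the use of log base 2 (entropy in bits), and the rule of skipping k-mers that contain a gap are assumptions, and the sample size bias correction is left out.

```python
import math
from collections import Counter

def kmer_entropy(kmers):
    """Shannon entropy (in bits) of the k-mer distribution at one position."""
    total = len(kmers)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(kmers).values())

def sliding_entropy(alignment, k=9):
    """Entropy at each k-mer position of an alignment (equal-length strings)."""
    length = len(alignment[0])
    entropies = []
    for start in range(length - k + 1):
        window = [seq[start:start + k] for seq in alignment]
        valid = [w for w in window if "-" not in w]  # assumed: drop gapped k-mers
        entropies.append(kmer_entropy(valid) if valid else 0.0)
    return entropies

# Toy alignment of four sequences, analysed with a window of 3.
toy = ["ACDE", "ACDE", "ACDF", "ACEF"]
values = sliding_entropy(toy, k=3)
```

An alignment of length L yields L - k + 1 k-mer positions, so the toy alignment above gives two values. A fully conserved position has an entropy of zero.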

2.2. Performance testing of DiMA

DiMA has been extensively tested with 18 protein datasets from six viral species. External validation of our tool has been performed by three individuals, with a total of 36 protein datasets, originating from three viral species.

2.3. Performance testing of Sample size bias correction

As shown in Figure 3, entropy is corrected for sample size bias. Uncorrected (baseline) and corrected entropy values were calculated over a wide range of sample sizes (100 to 100,000 sequences) subsetted from a SARS-CoV-2 Spike protein alignment. The protein-wide average entropy was then plotted for each dataset.

Conclusion:

The baseline entropy value appears to be generally an under-estimate relative to the corrected entropy, which can be a reflection of better data distribution achieved through the resampling approach for the corrected entropy.
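One resampling-based way to make such a correction is to measure the entropy of random subsets of several sizes N, regress entropy on 1/N, and take the intercept as the estimate for an infinitely large sample. The sketch below illustrates that idea only. Whether DiMA uses exactly this procedure is an assumption, and the function names, subset sizes, number of repeats and random seed are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(kmers):
    """Uncorrected Shannon entropy (bits) of a list of k-mers."""
    total = len(kmers)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(kmers).values())

def corrected_entropy(kmers, sizes=(25, 50, 100), repeats=10, seed=0):
    """Extrapolate entropy to infinite sample size by regressing on 1/N."""
    rng = random.Random(seed)
    xs, ys = [], []
    for size in sizes:
        if size > len(kmers):
            continue  # cannot draw a subset larger than the data
        for _ in range(repeats):
            xs.append(1.0 / size)
            ys.append(entropy(rng.sample(kmers, size)))
    if len(set(xs)) < 2:
        return entropy(kmers)  # too few subset sizes to fit a line
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x  # intercept at 1/N = 0
```

Small samples tend to miss rare k-mers, which is consistent with the baseline value being an under-estimate relative to the corrected one.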

3. Input file and parameters

3.1. Input file

DiMA accepts only a multiple sequence alignment (MSA) of protein sequences (DNA should also work) in aligned FASTA format (.afa or .fas). Any established alignment tool, such as MAFFT or MUSCLE, can be used to produce the MSA, as long as the aligned sequences are provided to DiMA in aligned FASTA format.

3.1.1. MSA preparation workflow

Sequence Collection

MSA for DiMA begins with the careful retrieval of sequences from at least one reliable data source. When working with multiple data sources, the retrieved sequences should be concatenated into a single FASTA file for consistency and ease of analysis. Sequences can be retrieved as full-genome data and subsequently separated using a BLASTp search. Alternatively, if the database supports it (e.g., NCBI Virus), sequences can be directly retrieved in their separated form. Adding metadata to the sequence headers is crucial, as it provides essential context for each sequence, aiding in downstream analysis.

Sequence Cleaning and Deduplication

Irrelevant sequences in your database can hinder analysis, and identifying them can be challenging. These sequences may not meet the specific criteria required for your study, such as host, geography, or other factors. A quick check involves verifying the species taxonomy lineage to ensure all records are correct. Sometimes, irrelevant sequences become apparent during alignment when a sequence doesn’t align well with others, indicating it may be an outlier needing removal. While we assume the initial sequencing was done correctly, BLAST can help catch outliers by flagging hits that are significantly different. However, in cases where you skip BLAST, such as with flu viruses where protein sequences are directly downloaded from databases, you might miss the opportunity to remove these outliers.

To eliminate redundant data, run CD-HIT to remove 100% duplicate sequences, whether they are full-length duplicates or subsets of another sequence. You can also cluster sequences at a lower identity threshold, like 90%, to group similar sequences together. CD-HIT is confirmed to handle subset duplicates effectively, ensuring reliable deduplication.
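To make the 100% deduplication rule concrete, the sketch below removes sequences that are identical to, or fully contained in, another sequence. It is a plain Python illustration for small datasets and not a replacement for CD-HIT; the function name `deduplicate` and the record layout are assumptions.

```python
def deduplicate(records):
    """Drop sequences identical to, or contained in, another sequence.

    records: list of (header, sequence) tuples of unaligned sequences.
    Longer sequences are considered first, so subsets are the ones removed.
    """
    kept = []
    by_length = sorted(records, key=lambda r: len(r[1]), reverse=True)
    for header, seq in by_length:
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((header, seq))
    return kept

records = [
    ("seq_a", "MKTAYIAK"),
    ("seq_b", "MKTAYIAK"),  # full-length duplicate of seq_a
    ("seq_c", "KTAYI"),     # subset of seq_a
    ("seq_d", "MKTGYIAK"),  # differs by one residue, so it is kept
]
unique_records = deduplicate(records)
```

The comparison is quadratic in the number of sequences, which is why a dedicated tool such as CD-HIT is the better choice for large collections.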

MSA

After extracting individual protein sequences, you can proceed with the alignment process. MSA is a heuristic method, and while you could rely on a single alignment tool, it’s often beneficial to test multiple tools to determine which provides the most reliable results for your specific dataset. However, consistency is key—if you use multiple tools, apply them uniformly across all proteins and ultimately choose the best alignment for further analysis.

In literature, it’s common to see just one alignment tool being used, especially when dealing with relatively conserved viruses. However, for more diverse viruses, like HIV-1 or different subtypes of influenza, different tools might perform better for different proteins. If multiple tools are used, this choice must be justified clearly in any publication, supported by benchmarks against existing alignments.

Alignment QC

Manual Inspection to Identify Mismatches: Alignment anchors or conserved blocks can guide manual inspection by providing reference points within the alignment. This helps ensure that regions between these anchors are correctly aligned. For misaligned sequences, search for similar sequences within the alignment to determine the correct positioning. Address issues block by block to systematically correct the alignment.
Use of Auxiliary Tools: Several auxiliary tools, such as GUIDANCE (Penn et al., 2010), SATe (Liu et al., 2009), HoT (Landan and Graur, 2008), Gblocks (Castresana, 2000), and SuiteMSA (Thompson et al., 2002), have been developed to facilitate this process. These tools assist in improving alignment quality by providing metrics for alignment confidence or by refining the alignment to reduce errors. However, they do not entirely eliminate the need for manual inspection.
Dealing with Mismatches: Mismatches may indicate misalignment. If a mismatched sequence appears only in one sequence, use BLAST to check its validity. If it’s an outlier (e.g., matching bacteria instead of the target organism), consider removing that part or the entire sequence, especially if it’s at the beginning or end. For mismatches in the middle, it’s safer to delete the entire sequence, although careful inspection is needed.

3.2. Parameters

3.2.1. Sample name

Name of sequence to be analysed.

3.2.2. Low support threshold

Support is defined as the number of sequences at a given k-mer position that contain no gap and no unknown or ambiguous nucleotide base or amino acid residue. Positions with a support below 100 sequences (default) are flagged as low support. The user can change this threshold.
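The support count can be illustrated with a short sketch. The set of characters treated as invalid (gap plus the protein codes X, B, J and Z) and the function names are assumptions for the example; DiMA's own list of excluded characters may differ.

```python
# Assumed set: gap plus unknown/ambiguous amino acid codes.
INVALID_PROTEIN = set("-XBJZ")

def position_support(alignment, start, k=9, invalid=INVALID_PROTEIN):
    """Count sequences whose k-mer at `start` has no invalid character."""
    return sum(
        1 for seq in alignment
        if not invalid.intersection(seq[start:start + k])
    )

def is_low_support(alignment, start, k=9, threshold=100):
    """True when the position falls below the support threshold."""
    return position_support(alignment, start, k) < threshold

toy = ["MKTAY", "MK-AY", "MKXAY", "MKTAY"]
support_first = position_support(toy, 0, k=3)  # k-mers MKT, MK-, MKX, MKT
support_last = position_support(toy, 3, k=2)   # k-mers AY in all four
```

Here the first position has a support of 2, because one k-mer contains a gap and another an unknown residue, while the last position is supported by all four sequences.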

3.2.3. K-mer length

Select a k-mer window size that is appropriate for the analysis.
The minimum window size is 3, and the maximum equals the alignment length of the uploaded input file. By default, DiMA uses a window size of nine (nonamer; 9-mer) to evaluate viral diversity with respect to the cellular immune response.

3.2.4. Header format to include metadata

This optional functionality annotates the distinct sequences at each k-mer position with the metadata of their source sequences, such as collection date, geographical location, and isolation host. It does so by parsing the information in the sequence header (definition/description line).

Note

Example of a definition line: >ATY74257.1 |2017-03-02|China: Kunming|Homo sapiens

Because the metadata format varies between databases, DiMA relies on the format used by NCBI Virus.
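Parsing such a definition line against a pipe-delimited header format can be sketched as follows. The function name `parse_header` and the field names in the format string are illustrative assumptions, not DiMA's internal API.

```python
def parse_header(header, header_format, sep="|"):
    """Split a FASTA definition line into named metadata fields."""
    fields = [name.strip() for name in header_format.split(sep)]
    values = [value.strip() for value in header.lstrip(">").split(sep)]
    if len(fields) != len(values):
        raise ValueError(
            f"expected {len(fields)} fields, found {len(values)}: {header!r}"
        )
    return dict(zip(fields, values))

metadata = parse_header(
    ">ATY74257.1 |2017-03-02|China: Kunming|Homo sapiens",
    "accession|date|geography|host",
)
```

For the example definition line, `metadata["host"]` is `Homo sapiens` and `metadata["geography"]` is `China: Kunming`. A header with a different number of fields raises an error instead of being silently misassigned.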

4. How to interpret the results

: Figure 4. Sample output from DiMA analysis.

Note

Sample results are accessible for self-exploration:

4.1. Summary

The summary (Figure 4.1) shows information that is general to the input alignment as well as information specific to a given k-mer position:

alignment length
download results
query name
support threshold
position support
distinct variants
position entropy
selected position

4.2. Sequence diversity

Entropy values indicate the level of variability at the corresponding k-mer positions, with zero representing completely conserved positions. The plot (Figure 4.2) provides a holistic view of the diversity and is responsive and interactive: hovering over a position shows its approximate entropy value.

Note

For a benchmark, the peak absolute entropy of 9.2 and total variants of 98% were observed for HIV-1 clade B (Hu et al., 2013).

The methodology for calculating Shannon’s entropy at each k-mer position is as per Khan et al. (2008).

4.3. Diversity motifs

All sequences at each of the k-mer positions in the protein alignment are quantified for distinct sequences and ranked-classified into diversity motifs (Figure 4.3-4) based on their incidences, as explained in Section 1.6.

Users can select a position from the “SELECTED POSITION” box in the upper right corner (Figure 4.1) to browse the motif distribution at that position.

4.4. Sequence Metadata

If the header format is provided in the analysis parameters (as described under Parameters above), DiMA draws a pie chart (Figure 4.5) for each type of metadata.

The user should select a specific k-mer from the selected position for the metadata to appear. By default, the first peptide will be selected. In the example below, the index sequence is selected and host species distribution is shown in the plot.

4.5. Data Synthesis and Further Analysis

Data from DiMA can be synthesised and analysed further in various ways, such as:

scatter plot of the relationship between entropy and incidence (frequency) of total variants;
scatter plot of motif incidence (for each diversity motif) against total variants;
frequency distribution violin plots of the diversity motifs;
distribution of conservation level for k-mer positions in the protein; and
a table indicating the minimum and maximum total variants at each entropy boundary value.

Examples of such synthesis and analyses are demonstrated in Hu et al. (2013) and Abd Raman et al. (2020).

4.6. Download

The top panel of the DiMA output (Figure 4.1) allows the analysis results to be downloaded in JSON and XLSX formats. The JSON file contains the complete analysis results as key-value pairs, which can be viewed with a public JSON viewer tool (such as https://jsonformatter.org/json-viewer). The XLSX file offers easier viewing in MS Excel or an equivalent application that supports the format. The concatenated list of historically conserved sequences (HCS), based on a user-defined index incidence threshold, can also be downloaded (JSON format) and viewed in a text editor or JSON viewer.

5. DiMA-CLI (Large samples)

5.1. Considerations for CLI

The DiMA web service can handle files of up to 100 MB. Larger datasets can be analysed locally with DiMA-CLI, and this section provides instructions for doing so. You can download the sample data, MERS-CoV Spike protein, here.

5.2. Installation

The latest version can be installed with pip install dima-cli. Python >=3.7, <3.11 is required.

5.3. Basic Usage

The standard way to run the tool:

dima-cli -i aligned_sequences.afa -o results.json

To print the full list of arguments with short explanations:

dima-cli --help

To receive the results in tabular format:

dima-cli -i aligned_sequences.afa -t xlsx -o results.xlsx

To run a DNA analysis:

dima-cli -i aligned_sequences_nt.afa -a nucleotide -o results.json

To receive the HCS along with the results:

dima-cli -i aligned_sequences.afa -o results.json -c hcs.json

5.4. Advanced Usage Examples

Customising the parameters may be necessary, especially for big data analyses. Typical reasons to adjust them are: including metadata by parsing the headers, setting a support threshold for positions, and choosing a different k-mer size. Example usage:

dima-cli -i aligned_sequences.afa -l 9 -f "accession|host|geography|year" -a protein -t json -c hcs.json -e 100 -o results.json
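When many alignments need to be processed, the command line can be assembled programmatically. The helper below is a hypothetical sketch that uses only the flags shown in this manual. The meanings assumed for `-l` (k-mer length) and `-e` (support threshold) are inferred from the example values above, so check them against `dima-cli --help` before relying on them.

```python
def build_dima_command(input_path, output_path, kmer_length=9,
                       header_format=None, alphabet="protein",
                       output_type="json", hcs_path=None, support=100):
    """Assemble a dima-cli argument list from the flags shown in this manual."""
    cmd = ["dima-cli", "-i", input_path,
           "-l", str(kmer_length),   # assumed: k-mer length
           "-a", alphabet,
           "-t", output_type,
           "-e", str(support)]       # assumed: low support threshold
    if header_format:
        cmd += ["-f", header_format]
    if hcs_path:
        cmd += ["-c", hcs_path]
    cmd += ["-o", output_path]
    return cmd

command = build_dima_command(
    "aligned_sequences.afa", "results.json",
    header_format="accession|host|geography|year", hcs_path="hcs.json",
)
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the Python standard library, which avoids shell quoting problems with the `|` characters in the header format.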

6. FAQs and Support

How to cite?

Tharanga, S., Hu, Y., Unlu, E. S., Sjaugi, M. F., Celik, M. A., Hekimoglu, H., Miotto, O., Oncel, M. M., & Khan, A. M. (2022). DiMA: Sequence Diversity Dynamics Analyser for Viruses. https://doi.org/10.48550/arxiv.2205.13915

6.1. Support

Please don’t hesitate to reach out to the developers with questions, comments, or other feedback by email at makhan@bezmialem.edu.tr

6.2. Team

Shan Tharanga
Yongli Hu
Eyyüb Selim Ünlü
Muhammad Farhan Sjaugi
Muhammet A. Çelik
Hilal Hekimoğlu
Olivo Miotto
Muhammed Miran Öncel
Mohammad Asif Khan

7. Acknowledgement

We thank all those who used and/or evaluated DiMA and/or its earlier forms, directly or indirectly during its development. Names are listed in alphabetical order. Our apologies if we missed anyone.

Ayesha Fatima
Benjamin, Tan Yong Liang
Chong Li Chuin
Emre Herdan
Esin Özkan
Esra Busra Isik
Faruk Üstünel
Gizem Yılmaz
Gokcen Sahin
Hadia Syahirah Abd Raman
Hasiba Karimi
Heiny Tan
Johann Shane Tian
Lim Wan Ching
Melike Karakaya
Natascha May A/P Thevasagayam
Pendy Tok
Qi Ying Koo
Rashid Mukaila
Rashmi Sukumaran
Robandeep Kaur Saini
Tcharé Adnaane Bawa
Zarife Aslan