Welcome to the DiMA User Manual

1. About

1.1. Sequence Diversity Dynamics Analyser for Viruses (DiMA)

Viral infectious diseases are a major public health threat. Sequence diversity is one of the major challenges in the design of diagnostic, prophylactic and therapeutic interventions against viruses. The diversity can be an outcome of a combination of underlying evolutionary processes (mutation, recombination, and assortment). A continuing goal is a greater understanding of viral proteome sequence diversity, the dynamics of substitutions, and effective strategies to overcome the diversity for drug or vaccine design.

Herein, we present Diversity Motif Analyser (DiMA), a tool designed to facilitate the quantification and dissection of viral sequence diversity dynamics. DiMA provides a quantitative measure of sequence diversity by use of Shannon’s entropy (PMID: 18698358), applied via a user-defined k-mer sliding window to a protein alignment. DiMA further interrogates the diversity by dissecting the entropy at each aligned k-mer position into diversity motifs (PMIDs: 32518710, 23593157), based on the incidence of distinct k-mer sequences at the position. At a given position, the index is the predominant sequence, and all other distinct k-mers are referred to as total variants of the index. Total variants are sub-classified into the major variant (the most common variant), minor variants (k-mers with an incidence lower than the major variant but seen more than once), and unique variants (k-mers seen only once in the alignment). Moreover, the description line of the sequences in the input alignment can be enriched with metadata, such as spatio-temporal information, for inclusion in the analysis. DiMA outputs a JSON file that provides multiple facets of sequence diversity: sequence name, k-mer position, entropy, distinct k-mers at the position and their incidence, motif classification, and metadata (if available). DiMA enables comparative analyses of sequence diversity dynamics within and between proteins of a virus species, and between proteomes of different species.

DiMA is an outcome of many years of viral studies on several different species since 2007. These studies have been published in peer-reviewed journals (PMIDs: 17434154, 18030326, 18698358, 19401763, 21471731, 22573867, 23593157, 26680743, 29322922, 29363421, 31874646, 34068495 and 32518710) and presented at international conferences. Besides being available as a webserver, DiMA can also be downloaded as a standalone client tool (https://github.com/PU-SDS/DiMA), particularly for big data analyses.

DiMA webserver has been under development since March 2020. It has been extensively tested with 18 datasets (9 structural and 9 non-structural proteins) from six viral species. External validation of our tool has been performed by three individuals, with a total of 36 datasets (18 structural and 18 non-structural proteins), originating from three viral species.

1.2. Accessibility

The webserver is publicly available at: https://dima.bezmialem.edu.tr/

1.3. Browser compatibility

[Image: browser compatibility]

1.4. Frontend/Backend Frameworks

The DiMA webserver backend is built with Python FastAPI, and the frontend is built with ReactJS.

1.5. Novel features

Table 1. Novel features of DiMA in comparison with other web servers for viral sequence variation analysis

Features                                              DiMA        PVS        LANL       BV-BRC

Analysis of nucleic acid - amino acid sequences       ✅ - ✅     ❌ - ✅    ✅ - ✅    ✅ - ✅

Shannon entropy on user-defined sliding window

Entropy correction for size bias

K-mer/variant frequency calculation

K-mer diversity motifs classification

Metadata inclusion

Identification of historically conserved sequences

Multiple interactive visualizations

Web service input size limit                          100 MB ⭐   ~0.2 MB    ?          No limit

⭐ Larger files can be analysed with the CLI version, which has no size limit.

1.6. Defining diversity motifs

For a given sequence alignment, all sequences at each of the aligned k-mer positions are quantified for distinct sequences and ranked-classified into diversity motifs based on their incidences, as described in Hu et al. (2013) (Supplementary Figure 1, see extract below) (PMID: 23593157).

diversitymotifs

Figure 1. Definitions of diversity motifs. The “Index” nonamer is the most prevalent sequence, present in 8 of the 20 isolates. The “Major” variant is the most common variant of the index (5/20). “Minor” variants are multiple different repeated sequences, each with an incidence lower than that of the major variant. “Unique” variants are those represented by a single aligned sequence. Distinct variant sequences at a given nonamer position are the different sequences at the position; in this example, there are one major, two minor, and three unique variants.
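The classification described in Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the DiMA implementation: the function name `classify_motifs` and the toy nonamers are made up for the example, and the handling of ties (the first k-mer encountered wins) and of positions where every variant is seen only once (all are labelled unique, with no major variant) are assumptions.

```python
from collections import Counter


def classify_motifs(kmers):
    """Label each distinct k-mer at one aligned position with a diversity motif.

    Assumption: ties are broken by first appearance, and a variant seen only
    once is always "unique", even if it is the most common variant.
    """
    if not kmers:
        return {}
    ranked = Counter(kmers).most_common()  # highest incidence first
    motifs = {ranked[0][0]: "index"}       # predominant sequence
    major_assigned = False
    for sequence, incidence in ranked[1:]:
        if incidence == 1:
            motifs[sequence] = "unique"
        elif not major_assigned:
            motifs[sequence] = "major"     # most common variant of the index
            major_assigned = True
        else:
            motifs[sequence] = "minor"
    return motifs


# Toy position mirroring Figure 1: 20 isolates, index 8/20, major 5/20,
# two minor variants (2/20 each) and three unique variants (1/20 each).
position = (
    ["AAAAAAAAA"] * 8
    + ["AAAAAAAAC"] * 5
    + ["AAAAAAAAD"] * 2
    + ["AAAAAAAAE"] * 2
    + ["AAAAAAAAF", "AAAAAAAAG", "AAAAAAAAH"]
)
motifs = classify_motifs(position)
```

With this input, `motifs` holds one index, one major, two minor and three unique entries, matching the counts in the figure caption.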

2. Algorithms

workflow

Figure 2. Workflow schema. Input: Viral protein sequences, typically obtained from publicly available databases (NCBI Virus and GISAID, among others), are aligned and submitted to DiMA in aligned FASTA (.afa) format. Process: DiMA provides a quantitative measure of sequence diversity by use of Shannon’s entropy, applied via a user-defined k-mer sliding window. The entropy value is then corrected for sample size bias by applying a statistical adjustment (Lipinski’s rule). DiMA further interrogates the diversity by dissecting the entropy value at each k-mer position into the distinct k-mer sequences, which are classified into diversity motifs (index, major, minor and unique; see Section 1.6 for the definitions) based on their incidence. Output: The entropy values, the diversity motifs, and the metadata corresponding to each k-mer are plotted to provide a panoramic overview of the protein sequence diversity.

2.1. Entropy algorithm

entropy_calculation

Figure 3. Entropy algorithm.
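As a rough illustration of the uncorrected part of this calculation, the sketch below computes Shannon's entropy, H = sum over distinct k-mers of p * log2(1/p), for every k-mer window of an alignment. It is a minimal sketch under stated assumptions, not the DiMA code: the function names are hypothetical, a base-2 logarithm is assumed, no sample size correction is applied, and k-mers containing gaps or ambiguous residues are not filtered out.

```python
import math
from collections import Counter


def kmer_entropy(kmers):
    """Shannon entropy (bits) of the k-mers observed at one position."""
    counts = Counter(kmers)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def sliding_entropy(alignment, k=9):
    """Entropy for each k-mer window of an alignment (equal-length strings)."""
    length = len(alignment[0])
    entropies = []
    for start in range(length - k + 1):  # an alignment of length L has L - k + 1 windows
        window = [sequence[start:start + k] for sequence in alignment]
        entropies.append(kmer_entropy(window))
    return entropies


# Toy alignment of three 5-residue sequences, analysed with k = 3.
toy_alignment = ["ACDEF", "ACDEF", "ACDQF"]
toy_entropies = sliding_entropy(toy_alignment, k=3)
```

The first window of the toy alignment is identical in all three sequences, so its entropy is zero; the two windows that cover the E/Q difference have a positive entropy.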

2.2. Performance testing of DiMA

DiMA has been extensively tested with 18 protein datasets from six viral species. External validation of our tool has been performed by three individuals, with a total of 36 protein datasets, originating from three viral species.

2.3. Performance testing of sample size bias correction

As shown in Figure 3, entropy is corrected for sample size bias. Uncorrected (baseline) and corrected entropy values were calculated for subsets of a SARS-CoV-2 Spike protein alignment over a wide range of sample sizes (100 to 100,000 sequences). The protein-wide average entropy was then plotted for each dataset.

mean_deviation

Conclusion:

  • The baseline entropy value appears to be generally an under-estimate relative to the corrected entropy, which can be a reflection of better data distribution achieved through the resampling approach for the corrected entropy.
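One way a resampling-based correction of this kind can work is sketched below. This manual does not spell out the exact procedure, so everything here is an assumption made for illustration: the k-mers at a position are randomly subsampled at several sizes N, the mean entropy at each size is regressed against 1/N, and the intercept (the extrapolated entropy as 1/N approaches zero) is taken as the corrected value. The function names, the number of repeats, and the fixed random seed are all hypothetical choices.

```python
import math
import random
from collections import Counter


def shannon_entropy(kmers):
    """Shannon entropy (bits) of a list of k-mers."""
    counts = Counter(kmers)
    total = len(kmers)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def corrected_entropy(kmers, sizes, repeats=20, seed=0):
    """Extrapolate entropy to infinite sample size by regressing on 1/N.

    Assumed procedure (not confirmed by the manual): subsample without
    replacement at each size, average the entropy, fit a straight line
    against 1/N by least squares and return the intercept.
    `sizes` needs at least two distinct values, each <= len(kmers).
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for size in sizes:
        mean_h = sum(
            shannon_entropy(rng.sample(kmers, size)) for _ in range(repeats)
        ) / repeats
        xs.append(1.0 / size)
        ys.append(mean_h)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x  # intercept: entropy estimate at 1/N = 0
```

Because small samples tend to miss rare k-mers, the entropy measured on a subsample is usually lower than that of the full population, and the extrapolated value is usually at or above the baseline, consistent with the conclusion above.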

3. Input file and parameters

3.1. Input file

DiMA accepts only a multiple sequence alignment (MSA) as input (protein sequences; DNA should also work), in aligned FASTA (.afa or .fas) format. Any existing, published alignment tool, such as MAFFT or MUSCLE, can be used to produce the MSA, as long as the aligned sequences are provided to DiMA in aligned FASTA format.

3.1.1. MSA preparation workflow

entropy_calculation

Sequence Collection

MSA for DiMA begins with the careful retrieval of sequences from at least one reliable data source. When working with multiple data sources, the retrieved sequences should be concatenated into a single FASTA file for consistency and ease of analysis. Sequences can be retrieved as full-genome data and subsequently separated using a BLASTp search. Alternatively, if the database supports it (e.g., NCBI Virus), sequences can be directly retrieved in their separated form. Adding metadata to the sequence headers is crucial, as it provides essential context for each sequence, aiding in downstream analysis.

Sequence Cleaning and Deduplication

Irrelevant sequences in your database can hinder analysis, and identifying them can be challenging. These sequences may not meet the specific criteria required for your study, such as host, geography, or other factors. A quick check involves verifying the species taxonomy lineage to ensure all records are correct. Sometimes, irrelevant sequences become apparent during alignment when a sequence doesn’t align well with others, indicating it may be an outlier needing removal. While we assume the initial sequencing was done correctly, BLAST can help catch outliers by flagging hits that are significantly different. However, in cases where you skip BLAST, such as with flu viruses where protein sequences are directly downloaded from databases, you might miss the opportunity to remove these outliers.

To eliminate redundant data, run CD-HIT to remove 100% duplicate sequences, whether they are full-length duplicates or subsets of another sequence. You can also cluster sequences at a lower identity threshold, like 90%, to group similar sequences together. CD-HIT is confirmed to handle subset duplicates effectively, ensuring reliable deduplication.
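To make the idea of "100% duplicates, full-length or subsets" concrete, here is a small Python sketch of that filtering logic. It is not CD-HIT and does not reproduce its algorithm; it is a naive illustration with a hypothetical function name, and its pairwise comparison is only practical for small datasets. For real analyses, use CD-HIT as described above.

```python
def remove_redundant(records):
    """Drop exact duplicates and sequences fully contained in a longer one.

    `records` maps sequence IDs to unaligned sequences. Longer sequences are
    examined first, so a fragment is always compared against the sequences
    that could contain it. Among identical sequences, the first one listed
    is kept.
    """
    ordered = sorted(records.items(), key=lambda item: len(item[1]), reverse=True)
    kept = {}
    for sequence_id, sequence in ordered:
        if any(sequence in other for other in kept.values()):
            continue  # exact duplicate or subset of a sequence already kept
        kept[sequence_id] = sequence
    return kept


# Toy records: "b" duplicates "a", and "c" is a fragment of "a".
toy_records = {
    "a": "MKTAYIAK",
    "b": "MKTAYIAK",
    "c": "KTAYI",
    "d": "MKTAYIAR",
}
non_redundant = remove_redundant(toy_records)
```

Only records "a" and "d" survive in this toy example. Note that discarding a duplicate also discards the metadata in its header, which may matter if the metadata are part of the analysis.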

MSA

After extracting individual protein sequences, you can proceed with the alignment process. MSA is a heuristic method, and while you could rely on a single alignment tool, it’s often beneficial to test multiple tools to determine which provides the most reliable results for your specific dataset. However, consistency is key—if you use multiple tools, apply them uniformly across all proteins and ultimately choose the best alignment for further analysis.

In literature, it’s common to see just one alignment tool being used, especially when dealing with relatively conserved viruses. However, for more diverse viruses, like HIV-1 or different subtypes of influenza, different tools might perform better for different proteins. If multiple tools are used, this choice must be justified clearly in any publication, supported by benchmarks against existing alignments.

Alignment QC

  • Manual Inspection to Identify Mismatches: Alignment anchors or conserved blocks can guide manual inspection by providing reference points within the alignment. This helps ensure that regions between these anchors are correctly aligned. For misaligned sequences, search for similar sequences within the alignment to determine the correct positioning. Address issues block by block to systematically correct the alignment.

  • Use of Auxiliary Tools: Several auxiliary tools, such as GUIDANCE (Penn et al., 2010), SATe (Liu et al., 2009), HoT (Landan and Graur, 2008), Gblocks (Castresana, 2000), and SuiteMSA (Thompson et al., 2002), have been developed to facilitate this process. These tools assist in improving alignment quality by providing metrics for alignment confidence or by refining the alignment to reduce errors. However, they do not entirely eliminate the need for manual inspection.

  • Dealing with Mismatches: Mismatches may indicate misalignment. If a mismatched region appears in only one sequence, use BLAST to check its validity. If it is an outlier (e.g., matching bacteria instead of the target organism), consider removing that region or the entire sequence, especially if the region is at the beginning or end. For mismatches in the middle, it is safer to delete the entire sequence, although careful inspection is needed.

3.2. Parameters

3.2.1. Sample name

Name of sequence to be analysed.

3.2.2. Low support threshold

The support is defined as the number of sequences at a given k-mer position that do not contain a gap or an unknown or ambiguous nucleotide base or amino acid residue. Positions with a support below 100 sequences (default) are defined as having low support. The user can change this threshold.
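The support definition can be illustrated with a short Python sketch. The function names and the set of characters treated as invalid are assumptions for this example (a gap "-", plus the protein codes X, B, Z and J); DiMA's own list of excluded characters may differ, and a nucleotide analysis would need a different set.

```python
# Assumed invalid characters for a protein alignment: gap, unknown (X)
# and ambiguous (B, Z, J) residue codes.
INVALID_PROTEIN = frozenset("-XBZJ")


def position_support(alignment, start, k=9, invalid=INVALID_PROTEIN):
    """Count the sequences whose k-mer at `start` has no invalid character."""
    support = 0
    for sequence in alignment:
        kmer = sequence[start:start + k].upper()
        if len(kmer) == k and not (set(kmer) & invalid):
            support += 1
    return support


def is_low_support(support, threshold=100):
    """A position is of low support when its support is below the threshold."""
    return support < threshold


# Toy alignment: the second sequence has a gap and the third an unknown residue.
toy_alignment = ["ACDE", "A-DE", "ACXE", "ACDE"]
support_first = position_support(toy_alignment, start=0, k=2)
support_second = position_support(toy_alignment, start=1, k=2)
```

In the toy alignment, the window starting at the first column is valid in three sequences, while the window starting at the second column is valid in only two, because both the gap and the X fall inside it.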

3.2.3. K-mer length

Select a k-mer window size that is appropriate for the analysis. The minimum applicable size is 3, and the maximum can equal the alignment length of the uploaded input file. By default, DiMA uses a window size of nine (9; nonamer; 9-mer) to evaluate the viral diversity with respect to cellular immune response.

3.2.4. Header format to include metadata

This optional functionality allows annotation of the distinct sequences at each k-mer position with respective cognate sequence metadata, such as collection date, geographical location, isolation host. Simply, it parses the information on the sequence header (definition/description line).

Note

Example of a definition line: >ATY74257.1 |2017-03-02|China: Kunming|Homo sapiens

Because the format of metadata varies between databases, DiMA relies on the format of NCBI Virus.
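To show what parsing such a definition line involves, the sketch below splits a pipe-delimited header into named fields. It is an illustration rather than DiMA's parser: the function name `parse_header` and the field names in the format string ("accession|date|country|host") are hypothetical labels chosen to match the example definition line above.

```python
def parse_header(header, header_format):
    """Map the pipe-delimited parts of a FASTA header to named fields."""
    names = [name.strip() for name in header_format.split("|")]
    values = [value.strip() for value in header.lstrip(">").split("|")]
    if len(names) != len(values):
        raise ValueError(
            f"expected {len(names)} fields but found {len(values)}: {header!r}"
        )
    return dict(zip(names, values))


example_header = ">ATY74257.1 |2017-03-02|China: Kunming|Homo sapiens"
metadata = parse_header(example_header, "accession|date|country|host")
```

Here `metadata` maps "accession" to "ATY74257.1", "date" to "2017-03-02", "country" to "China: Kunming" and "host" to "Homo sapiens". A header whose number of fields does not match the format raises an error, which is one reason to keep headers consistent across the whole alignment.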

4. How to interpret the results

result

Figure 4. Sample output from DiMA analysis.

4.1. Summary

The summary panel (Figure 4.1) shows information that is general to the input alignment and information that is specific to a given k-mer position:

  • alignment length

  • download results

  • query name

  • support threshold

  • position support

  • distinct variants

  • position entropy

  • selected position

4.2. Sequence diversity

Entropy values indicate the level of variability at the corresponding k-mer positions, with zero representing completely conserved positions. The plot (Figure 4.2) provides a holistic view of the diversity and is responsive and interactive (hovering over a position shows its approximate entropy value).

Note

For a benchmark, the peak absolute entropy of 9.2 and total variants of 98% were observed for HIV-1 clade B (Hu et al., 2013).

The methodology for calculation of Shannon’s entropy at each k-mer position is as per Khan et al. (2008).

4.3. Diversity motifs

All sequences at each of the k-mer positions in the protein alignments were quantified for distinct sequences and ranked-classified into diversity motifs (Figure 4.3-4) based on their incidences, as explained above under the About section.

Users can select a position from the “SELECTED POSITION” box (Figure 4.1), in the upper right corner to browse the motif distribution of the position.

4.4. Sequence Metadata

If the header format is provided in the analysis parameters (as described under Parameters above), DiMA will draw a pie chart (Figure 4.5) for each type of metadata.

The user should select a specific k-mer from the selected position for the metadata to appear. By default, the first peptide will be selected. In the example below, the index sequence is selected and host species distribution is shown in the plot.

4.5. Data Synthesis and Further Analysis

Data from DiMA can be synthesised and analysed further in various ways, such as:

  1. scatter plot of the relationship between entropy and incidence (frequency) of total variants;

  2. scatter plot of motif incidence (for each diversity motif) against total variants;

  3. frequency distribution violin plots of the diversity motifs;

  4. distribution of conservation level for k-mer positions in the protein; and

  5. a table indicating the minimum and maximum total variants at each entropy boundary values.

Examples of such synthesis and analyses are demonstrated in Hu et al. (2013) and Abd Raman et al. (2020).

4.6. Download

The top panel of the DiMA output (Figure 4.1) allows the analysis results to be downloaded in JSON and XLSX formats. The JSON file contains the complete analysis results as key-value pairs, which can be viewed using a public JSON viewer tool (such as https://jsonformatter.org/json-viewer). The XLSX file provides easier viewing through MS Excel or an equivalent application that supports the format. The concatenated list of historically conserved sequences (HCS), based on a user-defined index incidence threshold, can also be downloaded (JSON format) and viewed in a text editor or JSON viewer.
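The downloaded JSON file can also be explored programmatically. Since the exact key names of the DiMA output are not listed in this manual, the sketch below makes no assumption about them: it only reports the top-level structure of whatever JSON document it is given. The function name and the file name mentioned in the comments are hypothetical.

```python
import json


def summarize_top_level(data):
    """Return the type of each top-level entry of a parsed JSON document."""
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    return {"<root>": type(data).__name__}


# Typical use with a downloaded result file (file name is an example):
#
#     with open("results.json", encoding="utf-8") as handle:
#         print(summarize_top_level(json.load(handle)))
```

Inspecting the structure first makes it easier to decide which keys to extract for the further analyses described in Section 4.5.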

5. DiMA-CLI (Large samples)

5.1. Considerations for CLI

The DiMA web service can handle files of up to 100 MB. Larger datasets can be analysed locally with DiMA-CLI, and this section provides instructions for doing so. You can download the sample data, MERS-CoV Spike protein, here.

5.2. Installation

Install the latest version with pip install dima-cli. Python >=3.7 and <3.11 is required.

5.3. Basic Usage

  • The standard way to run the tool:

dima-cli -i aligned_sequences.afa -o results.json

  • To print the full list of arguments with short explanations:

dima-cli --help

  • To receive the results in tabular format:

dima-cli -i aligned_sequences.afa -t xlsx -o results.xlsx

  • To run a DNA analysis:

dima-cli -i aligned_sequences_nt.afa -a nucleotide -o results.json

  • To receive the HCS along with the results:

dima-cli -i aligned_sequences.afa -o results.json -c hcs.json

5.4. Advanced Usage Examples

Customizing the parameters may be necessary, especially for big data analyses. The parameters most commonly adjusted are the header format (to include metadata by parsing the headers), the support threshold for positions, and the k-mer size. Example usage:

dima-cli -i aligned_sequences.afa -l 9 -f "accession|host|geography|year" -a protein -t json -c hcs.json -e 100 -o results.json

6. FAQs and Support

  1. How to cite?

Tharanga, S., Hu, Y., Unlu, E. S., Sjaugi, M. F., Celik, M. A., Hekimoglu, H., Miotto, O., Oncel, M. M., & Khan, A. M. (2022). DiMA: Sequence Diversity Dynamics Analyser for Viruses. https://doi.org/10.48550/arxiv.2205.13915

6.1. Support

Please don’t hesitate to reach out to the developers with your questions, comments, or other feedback by email at makhan@bezmialem.edu.tr

6.2. Team

  • Shan Tharanga

  • Yongli Hu

  • Eyyüb Selim Ünlü

  • Muhammad Farhan Sjaugi

  • Muhammet A. Çelik

  • Hilal Hekimoğlu

  • Olivo Miotto

  • Muhammed Miran Öncel

  • Mohammad Asif Khan

7. Acknowledgement

We thank all those who used and/or evaluated DiMA and/or its earlier forms, directly or indirectly during its development. Names are listed in alphabetical order. Our apologies if we missed anyone.

  • Ayesha Fatima

  • Benjamin, Tan Yong Liang

  • Chong Li Chuin

  • Emre Herdan

  • Esin Özkan

  • Esra Busra Isik

  • Faruk Üstünel

  • Gizem Yılmaz

  • Gokcen Sahin

  • Hadia Syahirah Abd Raman

  • Hasiba Karimi

  • Heiny Tan

  • Johann Shane Tian

  • Lim Wan Ching

  • Melike Karakaya

  • Natascha May A/P Thevasagayam

  • Pendy Tok

  • Qi Ying Koo

  • Rashid Mukaila

  • Rashmi Sukumaran

  • Robandeep Kaur Saini

  • Tcharé Adnaane Bawa

  • Zarife Aslan