1.1. Sequence Diversity Dynamics Analyser for Viruses (DiMA)

Viral infectious diseases are a major public health threat. Sequence diversity is one of the major challenges in the design of diagnostic, prophylactic and therapeutic interventions against viruses. The diversity can be an outcome of a combination of underlying evolutionary processes (mutation, recombination, and assortment). A continuing goal is a greater understanding of viral proteome sequence diversity, the dynamics of substitutions, and effective strategies to overcome the diversity for drug or vaccine design.

Herein, we present Diversity Motif Analyser (DiMA), a tool designed to facilitate the quantification and dissection of viral sequence diversity dynamics. DiMA provides a quantitative measure of sequence diversity by use of Shannon’s entropy (PMID: 18698358), applied via a user-defined k-mer sliding window to a protein alignment. Additionally, DiMA further interrogates the diversity by dissecting the entropy at each aligned k-mer position to various diversity motifs (PMIDs: 32518710, 23593157), based on the incidence of distinct k-mer sequences at the position. At a given position, index is the predominant sequence and all other distinct k-mers are referred to as total variants to the index, sub-classified into major variant (the most common variant), minor variants (comprising of k-mers with incidence lower than major and higher than unique), and unique variants (k-mers seen only once in the alignment). Moreover, the description line of the sequences in the input alignment can be enriched for inclusion of meta-data as part of the analysis, such as spatio-temporal information, among others. DiMA outputs a JSON file that provides multiple facets of sequence diversity: sequence name, k-mer position, entropy, distinct k-mers at the position, and their incidence, motif classification and metadata (if available). DiMA enables comparative sequence diversity dynamics analyses, within and between proteins of a virus species, and proteomes of different species.

DiMA is an outcome of many years of viral studies on several different species since 2007. These studies have been published in peer-reviewed journals (PMIDs: 17434154, 18030326, 18698358, 19401763, 21471731, 22573867, 23593157, 26680743, 29322922, 29363421, 31874646, 34068495 and 32518710 and international conferences. Besides being available as a webserver, it can also be downloaded as a standalone client tool (https://github.com/PU-SDS/DiMA), particularly for big data analyses.

DiMA webserver has been under development since March 2020. It has been extensively tested with 18 datasets (9 structural and 9 non-structural proteins) from six viral species. External validation of our tool has been performed by three individuals, with a total of 36 datasets (18 structural and 18 non-structural proteins), originating from three viral species.

1.2. Accesibility

The webserver is publicly available at: https://dima.bezmialem.edu.tr/

1.3. Browser compatibility

browserc

1.4. Frontend/Backend Frameworks

Python FastAPI utilized for DiMA webserver Backend. ReactJS utilized for DiMA webserver Frontend.

1.5. Novel features

Table 1. Novel features of DiMA compared to similar web servers
Features	DiMA	PVS	LANL Entropy	BV-BRC Variation Analysis
Analysis of nucleic acid - amino acid sequences	✅ - ✅	❌ - ✅	✅ - ✅	✅ - ✅
Shannon Entropy on user-defined sliding window	✅	❌	✅	❌
Entropy correction for size bias	✅	❌	❌	❌
K-mer/variant frequency calculation	✅	❌	❌	✅
K-mer diversity motifs classification	✅	❌	❌	❌
Metadata inclusion	✅	❌	❌	❌
Idenfication of historically conserved sequences	✅	✅	❌	❌
Output and visualizations	Dashboard with interactive plots	Static page with a line plot and interactive 3D-structure	Static page and a line plot	Static images: line plot, sequence logo
Web service input size limit	100 MB ⭐	~0.2 MB	No limit	No limit

: ⭐ Analysis of larger files possible with CLI version.

1.6. Defining diversity motifs

For a given sequence alignment, all sequences at each of the aligned k-mer positions are quantified for distinct sequences and ranked-classified into diversity motifs based on their incidences, as described in Hu et al. (2013) (Supplementary Figure 1, see extract below) (PMID: 23593157).

: Figure 1. Definitions of diversity motifs. The ‘‘Index’’ nonamer is the most prevalent sequence, present in 8 of the 20 isolates. The ‘‘Major’’ variant is the most common variant of the index (5/20). ‘‘Minor’’ variants are multiple different repeated sequences, each with incidences less than the major variant. ‘‘Unique’’ variants are those represented by a single aligned sequence. Distinct variant sequences at a given nonamer position are the different sequence at the position; in this example one of major, two of minor, and three of unique.