UPGMA

An initial UPGMA phenogram was constructed, based on 32 microsatellite loci, and used as a benchmark for comparisons with SNP-based phenograms, using a number of different criteria for SNP selection.

From: Sketches of Nature, 2016

Volume 1

Michael Weiß, Markus Göker, in The Yeasts (Fifth Edition), 2011

4.3 Cluster Analysis: UPGMA and WPGMA

UPGMA (unweighted pair group method with arithmetic mean; Sokal and Michener 1958) is a straightforward approach to constructing a phylogenetic tree from a distance matrix. It is the only method of phylogenetic reconstruction dealt with in this chapter in which the resulting trees are rooted. UPGMA implicitly assumes a constant substitution rate, over time and phylogenetic lineages (known as the molecular clock hypothesis). Since this assumption is often violated, this method is now rarely used.

In a first step, the two terminal taxa with the smallest genetic distance (e.g., taxa A and B) are clustered together to form a new operational taxonomic unit (OTU) AB. Next, a new, smaller distance matrix is computed, which includes OTU AB instead of taxa A and B. In this process, means are used to derive distances between the new operational taxonomic unit AB and the remaining terminal taxa; this distance is

$$d_{(AB)X} = \frac{1}{2}\left(d_{AX} + d_{BX}\right)$$

for any terminal taxon X. In a next iteration, again the two taxa with the smallest distance are clustered, and this process is repeated until only two OTUs are left.

In the clustering process, the formula used to compute mean distances is as follows. If $C_1$ and $C_2$ are clusters including $n_1$ and $n_2$ terminal taxa, respectively, that are to be merged into a new OTU $C_1 C_2$, then the mean distance to any other cluster D is given by

$$d_{(C_1 C_2)D} = \frac{n_1}{n_1 + n_2}\, d_{C_1 D} + \frac{n_2}{n_1 + n_2}\, d_{C_2 D}$$

($d_{C_1 D}$ and $d_{C_2 D}$ have already been calculated in an earlier clustering step). An alternative method is to use simple means, i.e.,

$$d_{(C_1 C_2)D} = \frac{1}{2}\left(d_{C_1 D} + d_{C_2 D}\right);$$

this variant is called the weighted pair group method (WPGMA).
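The difference between the two update rules only shows up once clusters of unequal size are merged. The following Python sketch illustrates this under stated assumptions: the function name `cluster`, the `weighted` flag, and the four-taxon distance values are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
from itertools import combinations


def cluster(names, dist, weighted=False):
    """Agglomerative clustering on a symmetric distance table.

    dist maps frozenset({a, b}) -> distance between a and b.
    weighted=False applies the size-proportional (UPGMA) update,
    weighted=True applies the simple-mean (WPGMA) update.
    Returns a list of merges (left, right, distance_at_merge).
    """
    d = dict(dist)                      # working copy of the distance table
    size = {n: 1 for n in names}        # number of terminal taxa per cluster
    active = list(names)
    merges = []
    while len(active) > 1:
        # pick the pair of active clusters with the smallest distance
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset((a, b))]))
        new = (a, b)
        for x in active:
            if x == a or x == b:
                continue
            dax, dbx = d[frozenset((a, x))], d[frozenset((b, x))]
            if weighted:                # WPGMA: simple mean
                d[frozenset((new, x))] = 0.5 * (dax + dbx)
            else:                       # UPGMA: weight by cluster sizes
                d[frozenset((new, x))] = (size[a] * dax + size[b] * dbx) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        active = [x for x in active if x != a and x != b] + [new]
    return merges


# Hypothetical distances among four taxa (illustration only).
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2.0, frozenset(("A", "C")): 4.0, frozenset(("B", "C")): 4.0,
    frozenset(("A", "D")): 10.0, frozenset(("B", "D")): 10.0, frozenset(("C", "D")): 7.0,
}
upgma_merges = cluster(names, dist, weighted=False)
wpgma_merges = cluster(names, dist, weighted=True)
print(upgma_merges[-1][2], wpgma_merges[-1][2])
```

In this example both variants agree on the first two merges; they differ only in the last step, where the cluster containing A and B (two taxa) and the single taxon C contribute to the distance to D with weights 2:1 under UPGMA and 1:1 under WPGMA.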

URL:

https://www.sciencedirect.com/science/article/pii/B9780444521491000124

Genetic Variation and Phylogenetic Relationship Among the Two Different Stocks of Catla (Catla catla) in the Indian State of Orissa Based on RAPD Profiles

Nihar Ranjan Chattopadhyay , in Induced Fish Breeding, 2017

10.3.7 Cluster Analysis

The UPGMA clustering method was used in the present study to generate a dendrogram for the two stocks of C. catla in Orissa, with GS values computed using the DICE coefficient in the NTSYSpc 2.2 program. The dendrogram showed one cluster containing the Puri and Ganjam stocks.

Then the GS matrices for intraspecies variation were calculated with DICE coefficient with NTSYS software by pair-wise comparison of 10 individuals in each stock following the method of Nei and Li (1979). The GS values for each stock ranged from 0.67 ± 0.663 in Puri and 0.70 ± 0.599 in the case of Ganjam stock (Table 10.15). The mean intraspecies GS values were 0.7897 ± 0.663 for Puri stock and 0.836 ± 0.599 in the case of Ganjam stock (Table 10.20). The highest GS values within the stock were obtained for Ganjam followed by Puri. These intraspecies GS values estimated for two stocks were checked by one-way ANOVA (SPSS version 16) and found to be significantly different at p < 0.001 (Table 10.2). Similarly the two representative DNA samples from each of the two stocks of catla were selected primer-wise for interspecies genetic variability analysis (Figs. 10.12–10.15). The matching coefficients of GS for the same four primers that amplified 78 bands were calculated using the NTSYS pc 2.2 software with DICE coefficient. The interspecies GS values obtained for Puri/Ganjam through the program were estimated to be (all values per 1.000) 0.943.

Table 10.20. Size of Each Amplified DNA Band in the Individuals of Catla catla With Primer OPA-11 of Puri District

Lanes Rows Lane 1 (mol.wt.) Lane 2 (mol.wt.) Lane 3 (mol.wt.) Lane 4 (mol.wt.) Lane 5 (mol.wt.) Lane 6 (mol.wt.) Lane 7 (mol.wt.) Lane 8 (mol.wt.) Lane 9 (mol.wt.) Lane 10 (mol.wt.) Lane 11 (mol.wt.)
Row 1 1400 1400 1400 1400 1400 1400 1400 1400 1400 1400
Row 2 1200 1200 1200 1200 1200 1200 1200 1200 1200 1200
Row 3 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100
Row 4 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
Row 5 900 900 900 900 900 900 900 900 900 900 900
Row 6 800 800 800 800 800 800 800 800 800 800 800
Row 7 700
Row 8 620 620 620 620 620 620 620 620 620 620
Row 9 600
Row 10 550 550 550 550 550 550 550 550 550 550
Row 11 500 500 500 500 500 500 500 500 500 500 500
Row 12 400 400 400 400 400 400 400 400 400 400 400
Row 13 320 320 320
Row 14 300 300 300
Row 15 280 280 280 280 280 280 280
Row 16 200
Row 17 100

Molecular weight standard: 100 bp ladder. Molecular weight unit: bp.

URL:

https://www.sciencedirect.com/science/article/pii/B9780128017746000109

Biometrics Applied to Molecular Analysis in Genetic Diversity

Cosme Damião Cruz , ... Leonardo Lopes Bhering , in Biotechnology and Plant Breeding, 2014

Method of Average Binding among Clusters or UPGMA (Unweighted Pair-Cluster Method using Arithmetic Averages)

The method of unweighted average binding among clusters, better known as UPGMA, has been used most frequently in ecology and systematics (James and McCulloch, 1990) and in numerical taxonomy (Sneath and Sokal, 1973). UPGMA is treated as a clustering technique that uses the (unweighted) arithmetic averages of the measures of dissimilarity, thus avoiding characterizing the dissimilarity by extreme values (minimum and maximum) between the considered genotypes.

As a general rule, the construction of the dendrogram is established by the genotype of greatest similarity. The distance between an individual k and a cluster formed by individuals i and j is given by:

$$d_{(ij)k} = \operatorname{average}(d_{ik};\, d_{jk}) = \frac{d_{ik} + d_{jk}}{2}$$

that is, $d_{(ij)k}$ is the average of the distances of the pairs of individuals (i and k) and (j and k).

The distance between two clusters is given by:

$$d_{(ij)(klm)} = \operatorname{average}(d_{ik};\, d_{il};\, d_{im};\, d_{jk};\, d_{jl};\, d_{jm}) = \frac{d_{ik} + d_{il} + d_{im} + d_{jk} + d_{jl} + d_{jm}}{6}$$

that is, the distance between two clusters, formed respectively by individuals (i and j) and (k, l, and m), is the average over the set of distances between pairs of individuals drawn from the two clusters: (i and k), (i and l), (i and m), (j and k), (j and l), and (j and m).

A general expression for the unweighted average among clusters can be presented in the following manner:

$$d_{(ij)k} = \frac{n_i}{n_i + n_j}\, d_{ik} + \frac{n_j}{n_i + n_j}\, d_{jk}$$

in which $d_{(ij)k}$ is defined as the distance between the cluster (ij), whose components i and j have internal sizes $n_i$ and $n_j$, respectively, and the cluster k. In this expression, the indices i, j, and k may denote individuals or clusters. This interpretation should be the same for the subsequent methods.

Thus, for the calculation of the distance $d_{(12)3}$, in which one considers the cluster formed by accessions 1 ($n_i = 1$) and 2 ($n_j = 1$), one has:

$$d_{(12)3} = \frac{1}{1+1}\, d_{13} + \frac{1}{1+1}\, d_{23} = \frac{d_{13} + d_{23}}{2}$$

For the calculation of the distance $d_{(12.3)4}$, in which one considers the cluster formed by accessions 1 and 2 ($n_i = 2$) and 3 ($n_j = 1$), one has:

$$d_{(12.3)4} = \frac{2}{2+1}\, d_{(12)4} + \frac{1}{2+1}\, d_{34} = \frac{2}{3}\left(\frac{1}{1+1}\, d_{14} + \frac{1}{1+1}\, d_{24}\right) + \frac{d_{34}}{3} = \frac{d_{14} + d_{24} + d_{34}}{3}$$
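The worked calculation above shows that applying the size-weighted update recursively gives the plain average over all individual distances. A short Python check of that equivalence follows; the helper name `merged_distance` and the three distance values are assumptions made for illustration only.

```python
def merged_distance(n_i, d_ik, n_j, d_jk):
    """Size-weighted (UPGMA) distance between the merged cluster (ij) and k."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)


# Hypothetical distances from accessions 1, 2 and 3 to accession 4.
d14, d24, d34 = 6.0, 8.0, 10.0

# Step 1: cluster (12) versus 4, both components of size 1.
d_12_4 = merged_distance(1, d14, 1, d24)

# Step 2: cluster (12.3) versus 4; (12) has size 2, accession 3 has size 1.
d_123_4 = merged_distance(2, d_12_4, 1, d34)

# Direct average over all three individual distances.
plain_mean = (d14 + d24 + d34) / 3

print(d_123_4, plain_mean)
```

Because the weights track the cluster sizes, every original accession contributes equally to the final distance, which is what "unweighted" refers to in the method's name.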

URL:

https://www.sciencedirect.com/science/article/pii/B9780124186729000039

Taxonomy of Prokaryotes

Rainer Borriss , ... Hans-Peter Klenk , in Methods in Microbiology, 2011

4 Phylogenetic trees

Phylogenetic trees are branching diagrams illustrating the evolutionary relationships among species. Usually such trees are constructed based on sequence similarity between the highly conserved 16S rRNA genes or a set of housekeeping genes of several organisms. This limitation to a small set of input sequences can be problematic as the phylogeny of single genes does not necessarily reflect the phylogeny of the complete organisms. It is therefore highly desirable to use all genes of the core genome as input for the tree calculation, which dramatically increases its reliability (Gontcharov et al., 2004). EDGAR creates multiple alignments of all orthologue-sets of the core genome by using MUSCLE (Edgar, 2004), removes unaligned parts with GBLOCKS (Talavera and Castresana, 2007), concatenates the multiple alignments of the single genes to one large alignment and finally creates a phylogenetic tree with the neighbour-joining implementation of the PHYLIP package (Felsenstein, 1995).

PHYLIP is a comprehensive collection of software tools that implement various algorithms for the creation of phylogenetic trees. Four of the most prominent algorithms are:

UPGMA: Unweighted Pair Group Method with Arithmetic Mean: A simple clustering method that assumes a constant rate of evolution (molecular clock hypothesis). It needs a distance matrix of the analysed taxa that can be calculated from a multiple alignment.

Neighbour-joining (NJ): Bottom-up clustering method that also needs a distance matrix. NJ is a heuristic approach that does not guarantee to find the perfect result, but under normal conditions has a very high probability to do so. It has a very good computational efficiency, making it well suited for large datasets.

Maximum parsimony (MP): This method tries to create a phylogeny that requires the least evolutionary change. It may suffer from long branch attraction, a problem that leads to incorrect trees in rapidly evolving lineages (Felsenstein, 1978).

Maximum-likelihood (ML): ML uses a statistical approach to infer a phylogenetic tree. ML is well suited for the analysis of distantly related sequences, but is computationally expensive and thus not that well suited for larger input data.

While phylogenetic trees calculated from large sets of orthologous genes are quite reliable, trees generated from smaller samples may need some further confirmation. In such cases the use of an outgroup and further bootstrapping support can be helpful:

In this context, two terms have to be defined:

Outgroups: When using distance matrix methods it is highly recommended to include at least one distantly related sequence for the analysis. This can be seen as a negative control. The outgroup should appear near the root of the tree and should have a longer branch length than any other sequence.

Bootstrapping: Bootstrapping is a resampling technique that is often used to assess the confidence that the inferred tree is correct. In a defined number of iterations (usually 100–1000), the columns of the multiple alignment that serves as input are resampled randomly with replacement and a phylogenetic tree is calculated from each resampled alignment. When the procedure is finished, a majority-rule consensus tree is constructed from the resulting trees of each bootstrap sample. The branches of the final tree are labelled with the number of times they were recovered during the procedure.
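The resampling step of the bootstrap can be sketched in a few lines of Python. This is a minimal illustration and not the PHYLIP implementation: the toy alignment, the taxon names, and the function names are hypothetical, and "which pair of sequences is closest" is used here only as a stand-in for a full tree calculation.

```python
import random
from itertools import combinations


def bootstrap_replicate(alignment, rng):
    """Draw alignment columns at random with replacement (same length as input)."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with the fewest differences."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(combinations(sorted(alignment), 2),
               key=lambda p: hamming(alignment[p[0]], alignment[p[1]]))


# Hypothetical toy alignment (illustration only).
alignment = {
    "t1": "ACGTACGTAC",
    "t2": "ACGTACGTTC",
    "t3": "TCGAACCTAG",
    "t4": "TCGAGCCTAG",
}

rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
support = sum(closest_pair(r) == ("t1", "t2") for r in replicates)
print(f"('t1', 't2') was the closest pair in {support} of {len(replicates)} replicates")
```

In a real analysis the stand-in would be replaced by the chosen tree-building method, and the count would be collected for every branch of the tree rather than for a single pair.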

URL:

https://www.sciencedirect.com/science/article/pii/B9780123877307000188

Methods of Molecular Study

Jitra Waikagul , Urusa Thaenkham , in Approaches to Research on the Systematics of Fish-Borne Trematodes, 2014

6.3.4 Tree Building

There are two general categories of methods for constructing phylogenetic trees: clustering methods and tree searching methods. The clustering methods are also known as distance-matrix methods, of which UPGMA and neighbor-joining are the most commonly used. Distance-matrix methods require a genetic distance, which is determined for all pairwise combinations of OTUs; those distances are then assembled into a tree. The tree searching methods are known as discrete data methods. Maximum parsimony, maximum likelihood, and Bayesian inference methods are applied directly to nucleotide sequences. Discrete data methods examine the nucleotide variation in each column of the alignment separately and consider only "phylogenetically informative sites" when searching for the best tree that conforms to all of the information. Because of these algorithmic differences, distance-matrix methods are much faster than tree searching methods. The clustering methods, however, only group the most closely related organisms, whereas discrete data analyses try to find the set of all possible classification schemes and then measure how the characteristics evolve on each of the possible trees [20,21,23]. User-friendly programs for constructing phylogenetic trees are listed in Table 6.3.
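The text does not define "phylogenetically informative sites". A common working definition, used in maximum parsimony, is that an alignment column is informative when at least two different character states each occur in at least two sequences. The Python sketch below applies that definition; the definition itself, the function names, and the four short sequences are assumptions for illustration, and gaps or ambiguity codes are not handled.

```python
from collections import Counter


def is_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(sequences):
    """Return the 0-based indices of informative alignment columns."""
    return [i for i, col in enumerate(zip(*sequences)) if is_informative(col)]


# Hypothetical aligned sequences (illustration only).
sequences = ["TAGG", "TACG", "AAGC", "AGCC"]
sites = informative_sites(sequences)
print(sites)
```

Here the second column (A, A, A, G) is variable but not informative under this definition, because the state G occurs in only one sequence.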

The most reliable tree is required for revealing the most probable evolutionary relationships among organisms. Therefore, the trees are always constructed by more than one phylogenetic method and then the congruent phylogenetic relationships found can be supported with high confidence.

URL:

https://www.sciencedirect.com/science/article/pii/B9780124077201000066

Synthetic Biology and Metabolic Engineering in Plants and Microbes Part B: Metabolism in Plants

E. Foureau , ... V. Courdavault , in Methods in Enzymology, 2016

2.2.3.2 Hierarchical Clustering

Given a dissimilarity matrix (computed from Euclidean distances, for example), a hierarchical clustering procedure acts iteratively, joining the most similar individuals (genes) at each step. New dissimilarity measures (Ward, UPGMA, Lance–Williams, etc.) are calculated at each iteration between the newly formed cluster and the remaining genes. In the final tree, genes with similar expression patterns are grouped together.

#Rscript
# Assumes fpkm.table (numeric matrix: genes in rows, samples in columns) and
# annot.ok (annotation table with gene identifiers in column 1) are already loaded.
library(ggplot2)

# calculate dissimilarity matrix
d.mat <- dist(fpkm.table, method = "euclidean")

# cluster genes (method = "average" would give UPGMA instead of Ward's criterion)
hclust.dmat <- hclust(d.mat, method = "ward.D2")

# plot tree
plot(hclust.dmat)

# create clusters by cutting the tree
# first, observe how the tree is cut for different thresholds; try different k values
rect.hclust(hclust.dmat, k = 5)

# get cluster composition and size
cluster.hclust.dmat <- cutree(hclust.dmat, k = 5)
cluster.levels <- levels(as.factor(cluster.hclust.dmat))
cluster.size <- sapply(cluster.levels, function(x) sum(cluster.hclust.dmat == x))
names(cluster.size) <- paste("Cluster", cluster.levels, sep = "")

# plot cluster expression profiles (mean expression per sample within each cluster)
mean.clust <- lapply(seq_along(cluster.levels), function(x) {
  cbind.data.frame(
    "Mean" = colMeans(fpkm.table[which(cluster.hclust.dmat == x), , drop = FALSE]),
    "Sample" = colnames(fpkm.table),
    "Cluster" = rep(paste(x, cluster.size[x], sep = "_"), ncol(fpkm.table)))
})
mean.clust.table <- do.call("rbind", mean.clust)
p <- ggplot(mean.clust.table, aes(x = Sample, y = Mean))
p + geom_point() + geom_line(aes(group = Cluster)) + facet_wrap(~ Cluster, ncol = 4)

# retrieve annotation for a given cluster
clusternumber <- 1  # placeholder: set to the cluster of interest
tmp.list <- names(which(cluster.hclust.dmat == clusternumber))
annot.tmp <- sapply(tmp.list, function(x) which(annot.ok[, 1] %in% x)[1])
annot.tmp.2 <- annot.ok[annot.tmp, ]

# export as a csv file, readable in Microsoft Excel or LibreOffice Calc
write.csv(annot.tmp.2, "annotation_genes_clusternumber.csv")

In addition, agglomerative clustering may also be tested for investigation purposes with the "agnes" function from the cluster package. Many dissimilarity measures are available, with parameters (arguments to the dissimilarity methods) that may be fine-tuned to improve clustering.

URL:

https://www.sciencedirect.com/science/article/pii/S0076687916000975

Volume 4

M. Forina , ... P. Oliveri , in Comprehensive Chemometrics, 2009

4.04.4.2.1 Clustering

Almost all applications of clustering techniques in food chemistry are based on the Euclidean distance and the natural perception of clusters. This in turn is based on two elements, distance and structure, and generally in the perception that the latter one has greater importance. For this and other reasons, among the hierarchical techniques, we prefer the single linkage method, frequently and unfairly maltreated.

The aspects of CA considered here are

Nomenclature

Representation

Interpretation of dendrograms, and

Statistical tests

4.04.4.2.1(i) Nomenclature

The agglomeration technique should always be clearly reported. The denominations we suggest, according to Massart and Kaufman [52], are single linkage or nearest neighbor clustering, complete linkage or furthest neighbor, unweighted average linkage or pair-group average or the Sneath-Sokal abbreviation UPGMA, weighted average linkage or weighted pair-group average or WPGMA, unweighted pair-group centroid or UPGMC, weighted pair-group centroid (median) or WPGMC, and Ward's method.

4.04.4.2.1(ii) Representation

One of the dendrogram coordinates (we use the ordinate) must be the similarity or the distance between clusters. The similarity between two objects or clusters (i and j) is defined [52] as

(1) $$s_{ij} = 1 - \frac{d_{ij}}{d_{\mathrm{MAX}}}$$

where $d_{ij}$ is the distance between the two objects or clusters and $d_{\mathrm{MAX}}$ is the maximum distance between two objects in the data set before agglomeration.

The second coordinate of the dendrogram is used just to arrange the objects, so it is not informative. However, the user interprets the dendrogram by paying attention to both axes and with the instinctive sensation that close objects are similar and that objects in different clusters are very dissimilar, which could be true but also completely false. The graphical representation of a dendrogram and its underlying similarity matrix can be improved by seriation, that is, the optimal reordering of the objects, to obtain a Robinson matrix [53] of similarities, a matrix in which the magnitude of the elements monotonically decreases moving away from the main diagonal.

4.04.4.2.1(iii) Interpretation of dendrograms

Users generally obtain the significant clusters by cutting the dendrogram at selected levels, generally in correspondence with the longest branches. In the case of the bidimensional data in Figure 50, the usual interpretation is that there are three main clusters (A, B, and C) and that cluster A is constituted by two well-separated clusters (A1 and A2). By cutting at the similarity level of 0.75, seven clusters can be identified. Visual inspection of the data themselves, which contain all the information, completely disagrees with this interpretation of the dendrogram. The conclusion is that the dendrograms (those obtained with Complete Linkage, Weighted Average Linkage, and Ward's method) indicate the presence of clusters when there are clusters, but also when there are not.

Figure 50. (a) Two-dimensional data from bivariate uniform distribution. (b) Clusters suggested by dendrogram. (c) Dendrogram (unweighted average linkage).

Moreover, as noticed before, very frequently the natural perception of the similarity is based on the structure. In the example of Figure 51, there are two linear structures. Ward's method (and also Complete Linkage and Average Linkages) is not able to detect the structures.

Figure 51. (a) Two-dimensional data from two bivariate distributions, both uniform along two different directions and clusters suggested by dendrogram. (b) Dendrogram (Ward's method).

Instead, the Single Linkage method (Figure 52) detects the two linear structures very neatly, as when we observe two close grape clusters.

Figure 52. (a) Two-dimensional data from two bivariate distributions, both uniform along two different directions and clusters suggested by dendrogram. (b) Dendrogram (single linkage).

4.04.4.2.1(iv) Statistical tests

We do not remember examples of the use of statistical tests to assess the significance of clusters in the food literature. However, some tests or significance indexes can be found in the software used by food chemists.

The cophenetic distance (e.g., in Mathworks Statistics Toolbox) between two objects is defined by the similarity level of the cluster where they are joined, as

(2) $$d_{ij}^{\mathrm{cophenetic}} = d_{\mathrm{MAX}}\left(1 - s_{ij}^{\mathrm{join}}\right)$$

The cophenetic correlation is the correlation coefficient between the original distances and the cophenetic distances. A large value indicates a good representation of the original data in the dendrogram. However, on the basis of our experience, we cannot recommend the use of cophenetic index, at least to compare single linkage with the other agglomerative techniques.
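As an illustration of how the similarity scale, the cophenetic distance, and the cophenetic correlation relate to each other, the following Python sketch computes all three for a small example. The function names, the six original pairwise distances, and the levels at which each pair is assumed to be joined in a dendrogram are hypothetical values chosen for illustration.

```python
from math import sqrt


def similarity(d_ij, d_max):
    """Similarity on the dendrogram scale: 1 - d_ij / d_MAX."""
    return 1.0 - d_ij / d_max


def cophenetic_distance(s_join, d_max):
    """Distance implied by the similarity level at which two objects join."""
    return d_max * (1.0 - s_join)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: original distances for the six pairs of four objects,
# and the distance level at which each pair is first joined in a dendrogram.
original = [1.0, 2.0, 4.0, 3.0, 3.0, 2.0]
join_level = [1.0, 3.0, 3.0, 3.0, 3.0, 2.0]

d_max = max(original)
s_join = [similarity(d, d_max) for d in join_level]
cophenetic = [cophenetic_distance(s, d_max) for s in s_join]
r = pearson(original, cophenetic)
print(round(r, 3))
```

The dendrogram forces several pairs to share the same joining level, so the cophenetic distances can only approximate the original ones; the correlation summarizes how much is lost.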

The Sneath test has been suggested [51] for clustering based on Euclidean distances and convex clusters with normal distribution (very rare in practice). The Sneath disjunction index is defined for each pair of clusters as

(3) $$W = \frac{d_{12}}{\sqrt{(n_1 + n_2)\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)}}$$

where $d_{12}$ is the distance between the two clusters, $n_1$ and $n_2$ are the number of objects in the two clusters, and $s_1$ and $s_2$ are the standard deviations of the projections of the objects on the axis joining the two centroids.

The statistical significance of the Sneath index can be obtained by comparing

(4) $$t_\omega = W\sqrt{n_1 + n_2}$$

with the critical value of a Student distribution.

From Equations (3) and (4)

(5) $$t_\omega = \frac{d_{12}}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Equation (5) shows clearly that the variable used for the statistical test is the Student variable for the test on the difference of two means. Moreover, even when the difference between the means is significant, the separation is rather poor (Figure 53). Therefore, the test seems inadequate for the practical objectives of exploration in food chemistry and is possibly better suited to evaluating clusters that arise from different densities of points in the space.

Figure 53. Jittered dot diagram for one variable: the significance of the separation is (a) about 5% and (b) about 0.01%.
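The relation between the Sneath index and the Student variable can be checked numerically. The Python sketch below assumes the reading of the index as W = d12 / sqrt((n1 + n2)(s1²/n1 + s2²/n2)) and of the test variable as t = W·sqrt(n1 + n2); the function names and the cluster values are hypothetical, and no critical value is computed because the degrees of freedom are not specified in the text.

```python
from math import sqrt


def sneath_w(d12, n1, n2, s1, s2):
    """Sneath disjunction index (assumed form, see lead-in)."""
    return d12 / sqrt((n1 + n2) * (s1 ** 2 / n1 + s2 ** 2 / n2))


def sneath_t(d12, n1, n2, s1, s2):
    """Test variable obtained from W by multiplying with sqrt(n1 + n2)."""
    return sneath_w(d12, n1, n2, s1, s2) * sqrt(n1 + n2)


def two_means_t(d12, n1, n2, s1, s2):
    """Student variable for the difference of two means with unequal variances."""
    return d12 / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Hypothetical pair of clusters (illustration only).
d12, n1, n2, s1, s2 = 3.0, 10, 15, 1.0, 1.5

w = sneath_w(d12, n1, n2, s1, s2)
t_from_w = sneath_t(d12, n1, n2, s1, s2)
t_direct = two_means_t(d12, n1, n2, s1, s2)
print(w, t_from_w, t_direct)
```

The two routes to the test variable agree, which is the point made in the text: the cluster sizes cancel out of the index, leaving the ordinary two-sample statistic.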

There are other tests on the significance of clusters. All seem to have major weak points, because of the difference in the shape of clusters and their size; therefore, in the exploration analysis, it is advisable to use pragmatism and interpretation to evaluate the practical significance of clusters.

URL:

https://www.sciencedirect.com/science/article/pii/B9780444527011001241

Analysis of Genetic Diversity and Population Structure of Buckwheat (Fagopyrum esculentum Moench.) Landraces of Korea Using SSR Markers

Jae Young Song , ... Myung-Chul Lee , in Buckwheat Germplasm in the World, 2018

Genetic Diversity and Phylogenetic Relationships

A neighbor-joining tree of 179 landrace accessions was constructed based on Nei's genetic distance. The genetic distance matrix was generated by PowerMarker software and used to construct an unrooted neighbor-joining tree. The dendrogram revealed a complex accession distribution pattern (Fig. 30.1A). DNA polymorphism detected by 10 SSR markers allowed the genetic distance to be estimated, and the UPGMA tree showed that 179 accessions of Korean buckwheat cultivars were classified in three major groups. The genetic distance among the buckwheat populations from 8 different regions was also used to construct a UPGMA tree (Fig. 30.1B). The genotypic diversity of buckwheat from 8 geographical regions is compared in Table 30.4. The genetic diversity of buckwheat populations from 8 geographical regions was characterized by an average of 4.08 alleles, ranging from 2 in GG to 6 in GB province.

Figure 30.1. Unrooted neighbor-joining trees of 179 buckwheat accessions collected from different regions in Korea based on Nei's genetic distances among 10 SSR loci (A) and the genetic relationships among different populations in different regions (B).

Table 30.4. Characterization of the 10 Microsatellite Loci According to 8 Geographical Regions in Korea

Region    Sample size   M_AF    N_A     H_E     H_O     PIC
GW        19            0.586   3.80    0.517   0.408   0.459
GG        3             0.792   2.00    0.310   0.383   0.261
GN        24            0.609   4.50    0.491   0.448   0.438
GB        60            0.588   6.00    0.502   0.401   0.450
JN        14            0.580   3.90    0.515   0.449   0.454
JB        43            0.558   5.70    0.540   0.389   0.483
CN        4             0.650   2.70    0.422   0.325   0.378
CB        12            0.542   4.00    0.549   0.521   0.497
Total     179           4.905   32.60   3.846   3.324   3.420
Average                 0.613   4.08    0.481   0.415   0.428

The mean frequency of major alleles (M_AF) per locus was 0.613, varying from 0.542 in CB to 0.792 in GG province. The expected heterozygosity (H_E) values ranged from 0.310 (GG) to 0.549 (CB) with an average of 0.481, and the observed heterozygosity (H_O) ranged from 0.325 (CN) to 0.521 (CB) with an average of 0.415. The overall polymorphic information content (PIC) values ranged from 0.261 (GG) to 0.497 (CB) with an average of 0.428.

The phylogenetic distribution of buckwheat accessions and populations from the 8 geographical regions was complex: accessions from the same region did not cluster together. This result suggests that common buckwheat is widely dispersed with small local differentiation due to strong migration pressure into new geographical regions. Similar results were reported by other studies (Cho et al., 2011).

URL:

https://www.sciencedirect.com/science/article/pii/B9780128110065000306

Applied Mycology and Biotechnology

Vladimir Makarenkov , ... Pierre Legendre , in Applied Mycology and Biotechnology, 2006

2.1. Distance-Based Methods

Distance-based methods estimate pairwise distances prior to computing a branch-weighted phylogenetic tree. If the pairwise distances are sufficiently close to the number of evolutionary events between pairs of taxa, these methods reconstruct a correct tree (Kim and Warnow 1999). This assumption is true for many models of biomolecular sequence evolution, in which case distance-based methods give sufficiently accurate results (Li 1997). The main advantage of distance-based methods is their small time complexity that makes them applicable to the analysis of large data sets.

If the rate of evolution is constant over the entire tree and the "molecular clock" hypothesis holds, corrections to the pairwise distances required during inference of the phylogenetic tree may be small. However, the "molecular clock" assumption is usually inappropriate for distantly related sequences and the reconstruction of a correct phylogenetic tree becomes problematic under this hypothesis. If the molecular clock assumption does not hold, the observed differences among sequences do not accurately reflect the evolutionary distances. In that case, multiple substitutions at the same site obscure the true distances and make sequences seem artificially closer to each other than they really are. Correction of the pairwise distances that accounts for multiple substitutions at the same site should be used in such cases. There are many Markov models for modeling sequence evolution; each of them implies a specific way to estimate and correct pairwise distances. Furthermore, these corrections have substantial variance when the distances are large. Among the most popular sequence-distance transformation models are the Hamming, Jukes-Cantor (Jukes and Cantor 1969), Kimura 2-parameter (Kimura 1981), and LogDet (Steel 1994) distances. When the goal is to infer relationships with high divergence between sequences, it can be difficult to obtain reliable values for the distance matrix; as a consequence, the distance-based algorithms have little chance of succeeding. A more detailed description of some distance-based methods is presented below:

UPGMA: The UPGMA [Unweighted Pair-Group Method using Arithmetic averages (Rohlf 1963)] method was originally proposed for taxonomic purposes. It can also be used for phylogenetic inference, but one has to assume that the rate of nucleotide or amino acid substitution is the same for all evolutionary lineages. UPGMA always produces an ultrametric tree (i.e. a dendrogram). In practice, this method recovers the correct tree with reasonably high probability when the "molecular clock" hypothesis applies and the evolutionary distance is large for all pairs of sequences. This method can be useful to biologists interested in constructing species trees.

At present, however, many investigators use relatively short DNA sequences for which the "molecular clock" hypothesis is often not valid. Therefore, one should be cautious about UPGMA trees. This method produces a rooted tree because of the assumption of a constant rate of evolution, though it is possible to remove the root if necessary. We illustrate the application of the UPGMA procedure using a set of four species characterized by the sequences TAGG, TACG, AAGC, and AGCC. Using the number of differences as an estimate of the dissimilarity among species, we obtain the distance matrix shown in Table 1.

Table 1. Distance matrix for the four sequences TAGG, TACG, AAGC, and AGCC

        TAGG   TACG   AAGC   AGCC
TAGG    0      1      2      4
TACG           0      3      3
AAGC                  0      2
AGCC                         0

The smallest distance in Table 1 is 1 (between the sequences TAGG and TACG). Consequently, the first cluster to be formed is {TAGG, TACG} and the phylogeny will contain the tree fragment shown in Fig. 1.

Fig. 1. The first cluster {TAGG, TACG} created by the UPGMA algorithm.

The combined node {TAGG, TACG}, formed by the nodes TAGG and TACG, replaces them in the initial distance matrix to obtain the reduced distance matrix shown in Table 2.

Table 2. Reduced distance matrix

                 {TAGG, TACG}   AAGC              AGCC
{TAGG, TACG}     0              ½ (2+3) = 2.5     ½ (4+3) = 3.5
AAGC                            0                 2
AGCC                                              0

The next cluster with the closest nodes (distance = 2) is {AAGC, AGCC}. These two sequences have two differences in the homologous sites. The final cluster fusion links clusters {TAGG, TACG} and {AAGC, AGCC} (Fig. 2).

Fig. 2. Phylogenetic tree obtained by UPGMA for the set of sequences in Table 1.
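The worked example can be reproduced with a few lines of Python. This is a minimal sketch rather than a reference implementation: the function names are hypothetical, and the cluster-to-cluster distance is computed directly as the average over all pairs of member sequences, which for UPGMA gives the same values as reducing the matrix step by step.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def average_distance(c1, c2):
    """Mean Hamming distance over all pairs with one member in each cluster."""
    return sum(hamming(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def upgma(sequences):
    """Return the sequence of merges as (cluster, cluster, distance) tuples."""
    clusters = [(s,) for s in sequences]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


sequences = ["TAGG", "TACG", "AAGC", "AGCC"]
merges = upgma(sequences)
for left, right, distance in merges:
    print(left, right, distance)
```

The three merges occur at distances 1, 2, and 3: first {TAGG, TACG}, then {AAGC, AGCC}, and finally the fusion of the two clusters, whose distance is the mean of the four between-cluster entries of Table 1 (2, 4, 3, and 3).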

Neighbor-joining (NJ): Neighbor-joining (Saitou and Nei 1987; Studier and Keppler 1988) is arguably the most popular among the distance-based methods. For some time, the success of NJ was inexplicable for computational biologists, due to the lack of approximation bounds. One of the first bounds was found by Atteson (1999), who showed that this method would be able to return the true phylogeny given that the observed distance is sufficiently close to the true evolutionary distance. Compared to UPGMA, NJ is designed to correct for unequal rates of evolution in different branches of the tree. NJ has a low O(n^3) time complexity, where n is the number of species, and like other distance methods performs well when the divergence between sequences is low. In its first step, NJ considers a bush tree with n leaves and n branches. The tree is gradually transformed into a binary phylogenetic tree with the same n leaves and 2n-3 branches by merging at each iteration a pair of branches corresponding to the shortest possible tree. Computationally, the tree generation by NJ is similar to UPGMA. When two nodes are linked, their common ancestral node is added to the reduced matrix and the terminal nodes with their respective branches are removed from it. Contrary to UPGMA, neighbor-joining does not produce a dendrogram (ultrametric distance) but an additive tree (additive distance).
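The pair-selection step of neighbor-joining can be illustrated with the standard criterion Q(i, j) = (n − 2)·d(i, j) − Σ_k d(i, k) − Σ_k d(j, k), where the pair with the smallest Q is joined. The text above does not spell out this formula, so it is stated here as an assumption, as are the function names and the five-taxon distance matrix used in the Python sketch below. Only the selection step is shown, not branch-length estimation or matrix reduction.

```python
from itertools import combinations


def nj_pick(names, d):
    """Return the pair minimising the neighbor-joining Q criterion, with its Q value."""
    n = len(names)
    total = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(i, j):
        return (n - 2) * d[frozenset((i, j))] - total[i] - total[j]

    pair = min(combinations(names, 2), key=lambda p: q(*p))
    return pair, q(*pair)


def closest_pair(names, d):
    """Pair with the smallest raw distance (what UPGMA would join first)."""
    return min(combinations(names, 2), key=lambda p: d[frozenset(p)])


# Hypothetical distance matrix for five taxa (illustration only).
names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(k): v for k, v in raw.items()}

nj_pair, nj_q = nj_pick(names, d)
upgma_pair = closest_pair(names, d)
print(nj_pair, nj_q, upgma_pair)
```

In this example the smallest raw distance is between d and e, but the Q criterion selects a and b, because it discounts each taxon's average distance to all others. This is how NJ compensates for lineages that evolve at different rates.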

Bio Neighbor-joining (BioNJ): The BioNJ (Gascuel 1997a) method is an improved version of the neighbor-joining method of Saitou and Nei (1987). The branch length estimation and distance matrix reduction formulae in NJ provide low variance estimators (Gascuel 1997a). In the paper describing BioNJ, Gascuel (1997a) showed how to improve the accuracy of NJ by incorporating minimum variance optimization in the NJ reduction formula. BioNJ follows an agglomerative scheme similar to that of NJ. It works iteratively, picking a pair of taxa, creating a new node which represents the cluster of these taxa, and reducing the distance matrix by replacing the two taxa by this node. BioNJ uses a simple, first-order model of the variances and covariances of evolutionary distance estimates. This model is well adapted when the estimates are obtained from aligned sequences. At each step it permits the selection, from the class of admissible reductions, of the reduction that minimizes the variance of the new distance matrix. In this way, BioNJ obtains better estimates to choose the pair of taxa to be agglomerated during the next steps. Like NJ, the BioNJ method has a time complexity of O(n^3) for n species. This makes it applicable to the analysis of large data sets. The performances of the two methods are similar when the substitution rates are low, or when they are the same in various lineages. When the substitution rates are high and varying among lineages, BioNJ outperforms NJ in terms of topological accuracy (Gascuel 1997a).

Other popular distance-based methods include ADDTREE by Sattath and Tversky (1977), Unweighted Neighbor-Joining (UNJ) by Gascuel (1997b), the Method of Weighted least-squares (MW) by Makarenkov and Leclerc (1999), and FITCH by Felsenstein (1997).

Recommended software: PHYLIP (Felsenstein), PAUP (Swofford), MEGA (Kumar, Tamura and Nei), DAMBE (Xia), T-Rex (Makarenkov), and BIONJ (Gascuel).

URL:

https://www.sciencedirect.com/science/article/pii/S1874533406800067

CRISPR-Cas Enzymes

Matthew A. Nethery , Rodolphe Barrangou , in Methods in Enzymology, 2019

3.2 Extract and visualize the repeat spacer array

3.2.1.

Extract repeats and spacers from the genome of interest using the CRISPRviz bioinformatic pipeline (Nethery & Barrangou, 2018). This tool derives repeats and spacers from detected CRISPR loci and places them into separate files, which can subsequently be inspected for trends in length, sequence, and abundance.

# Run CRISPRviz using background threads (-p), and split loci (-x)

crisprviz.sh -pxc

3.2.2.

Once CRISPRviz has completed extraction, visualize repeats and spacers by going to localhost:4444 in your web browser. Repeat and spacer similarity can be quickly approximated by comparing the color and shape combination for each unit (Fig. 3A; Table S1). Use the progressive alignment function to compare spacer acquisitions across strains throughout evolutionary time (Fig. 3B). This function implements progressive sequence alignment using the Needleman–Wunsch algorithm to generate initial similarity scores (Needleman & Wunsch, 1970), then generates a guide tree using the UPGMA approach (Sokal & Michener, 1958), and produces final scores using a sum-of-pairs calculation (Thompson, Plewniak, & Poch, 1999). The distinctive feature of this alignment algorithm is that the computational logic operates at the whole-spacer level, in contrast to traditional nucleotide-by-nucleotide alignments; it therefore requires less processor time and is better adapted to the alignment of entire CRISPR arrays (Nethery & Barrangou, 2018).
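To make the idea of aligning at the whole-spacer level concrete, the following Python sketch applies Needleman–Wunsch global alignment to two arrays in which each spacer is treated as a single token. It is an illustration only and is not taken from CRISPRviz: the function name align_spacer_arrays, the spacer labels, and the scoring values (match 1, mismatch -1, gap -1) are assumptions made for the example.

```python
def align_spacer_arrays(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two spacer arrays, treating each spacer as one token.

    a, b: lists of spacer identifiers (for example, sequence strings).
    Returns (score, aligned_a, aligned_b); None marks a gap.
    """
    n, m = len(a), len(b)
    # Dynamic-programming table of best scores for prefixes of a and b.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair the two spacers
                score[i - 1][j] + gap,      # spacer of a against a gap
                score[i][j - 1] + gap,      # spacer of b against a gap
            )
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]
```

Because the table has one cell per pair of spacers rather than per pair of nucleotides, its size depends on the number of spacers in each array, which is consistent with the lower processing cost noted above. Pairwise scores of this kind could then feed a guide tree, for instance one built with UPGMA.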

Fig. 3. Visualizing features of the CRISPR array. (A) The nucleotide sequences of loci from two different strains of Yersinia pseudotuberculosis. CRISPRviz converts these spacer sequences to a color/symbol block array for visual comparison. (B) CRISPRviz spacer alignment across nine strains of Y. pseudotuberculosis (type I-F) using the genomes found in Table S1 in the online version at https://doi.org/10.1016/bs.mie.2018.10.016. Each row displays an individual CRISPR locus with the ancestral end positioned to the right. (C) Repeat alignment indicates repeat eight is the terminal repeat of this locus based on the presence of four SNPs.

Genomes used in the Fig. 3 analysis are available in Table S1 in the online version at https://doi.org/10.1016/bs.mie.2018.10.016.

URL:

https://www.sciencedirect.com/science/article/pii/S0076687918304300