Hierarchical clustering

22.1.1 Hierarchical clustering

The cluster command performs hierarchical agglomerative clustering of arbitrary data, optionally with a user-supplied distance function.

cluster takes one or more arguments:
- data, a list of data points.
- Optionally, opt, a sequence of options which may contain the following:
  - k or count=k, where k is a positive integer specifying the number of clusters (unset by default).
  - count_inf=k_l, where k_l is a positive integer specifying the minimal number of clusters (unset by default).
  - count_sup=k_u, where k_u is a positive integer specifying the maximal number of clusters (by default, k_u=16).
  - type=linkage, where linkage is a string specifying the linkage method. Available methods are: single (the default), complete, average, weighted, and ward.
  - distance=dist, where dist is a distance function. By default, the squared Euclidean distance function distance2 is used unless data is a list of strings, in which case Levenshtein distance is used. For example, the taxicab distance function could be defined by typing:
    taxicab:=(p1,p2)->l1norm(p1-p2)
  - index=ind, where ind is a string or a list of strings specifying the index function(s) used for selecting the optimal number of clusters (unset by default). Available index functions are: silhouette, mcclain-rao, dunn, and all. ind may also take a boolean value, in which case no index is used (ind=false) or the silhouette index is used (ind=true).
  - output=out, where out is one of the following:
    - part, for outputting the partition of data into clusters.
    - list, for outputting the list of cluster indices for data points (the default).
    - plot, for outputting a colored visualization of data points with additional specifications given in the display option (see below). Note that this is possible only with 2D and 3D numerical data.
    - count, for outputting only the number of clusters.
    - index, for outputting the list of values of the (first) used index.
    - tree, for outputting a dendrogram drawing which visualizes the linking process.
  - display=disp, where disp is a configuration for plot output (by default, points are drawn as dots of width 2).
  - color=colors, where colors is a list of colors used for coloring clusters in the plot output mode (by default, the standard palette of the first 16 colors is used, with white and yellow replaced by more visible colors).
  - labels=lab, where lab is a boolean value which specifies whether to show data points in the dendrogram.
cluster(data ⟨,opt ⟩) returns the result as specified by the output argument (see above).
Hierarchical clustering is slower than the k-means algorithm (see Section 22.1.2) but may produce a better classification of the data. It is also the method of choice for custom data types and distance functions.
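To illustrate how agglomerative clustering combines a linkage method with an arbitrary distance function, here is a minimal sketch in Python rather than Xcas. It is not the implementation behind cluster: the names agglomerate, taxicab and pts are hypothetical, the sample points are made up, and the naive search for the closest pair is chosen for readability, not speed.

```python
from itertools import combinations

def agglomerate(points, k, dist, linkage="single"):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def cluster_distance(a, b):
        # All pairwise distances between members of the two clusters.
        pairwise = [dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError("unsupported linkage: " + linkage)

    while len(clusters) > k:
        # combinations yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return [[points[i] for i in c] for c in clusters]

def taxicab(p, q):
    """Taxicab (L1) distance between two points of equal dimension."""
    return sum(abs(x - y) for x, y in zip(p, q))

# Hypothetical sample data: two well-separated groups of three points.
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
parts = agglomerate(pts, 2, taxicab, linkage="average")
print(parts)
```

Because the distance is passed in as a plain function, the same loop works for strings or any other data type for which a distance can be defined.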
If more than one index is computed in a clustering process, then the optimal number of clusters is decided by voting: the number which was selected by most indices is used.
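The voting rule can be sketched as follows (in Python, for illustration only). The function name vote and the per-index choices are hypothetical. The text above does not say how ties are resolved; this sketch simply keeps the first count reached by the most indices, which is an assumption.

```python
from collections import Counter

def vote(selected):
    """Return the number of clusters chosen by the most indices.

    selected maps an index name to the number of clusters it prefers.
    Ties go to the count seen first (an assumption of this sketch).
    """
    return Counter(selected.values()).most_common(1)[0][0]

# Hypothetical choices made by three indices.
choices = {"silhouette": 3, "mcclain-rao": 3, "dunn": 4}
winner = vote(choices)
print(winner)
```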

Examples

To apply the cluster command to a 2D “aggregation” shape dataset which can be downloaded from here, first load the data by entering:

data:=csv2gen("/home/luka/Downloads/Aggregation.txt","\t"):;

The original file contains three data columns: the first holds the x-coordinates, the second the y-coordinates, and the third the cluster indices. To cluster and plot the 2D points given by the first two columns, using the average linkage method and the silhouette index (which is selected by index=true), enter:

cluster(delcols(data,2),type="average",index=true,output=plot)

Levenshtein distance (see Section 5.2.16) is used by default for string data. For example:

cluster(["cat","mouse","rat","spouse","house","cut"],output=part)

[["cat","rat","cut"],
 ["mouse","spouse","house"]]
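To see why the words fall into these two groups, it helps to compute a few Levenshtein distances by hand. The following Python sketch uses the standard dynamic-programming recurrence; it is an illustration only, not the routine used by cluster, and the name levenshtein is hypothetical.

```python
def levenshtein(s, t):
    """Minimal number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # substitute a by b, free if equal
            ))
        prev = cur
    return prev[-1]

for pair in [("cat", "rat"), ("mouse", "house"),
             ("mouse", "spouse"), ("cat", "mouse")]:
    print(pair, levenshtein(*pair))
```

Words within a group are one or two edits apart, whereas "cat" and "mouse" share no letters and are five edits apart.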

In the following example, genomic sequences are split into three clusters using average linkage and the Hamming distance function hamdist.

data:=["GTCTT","AAGCT","GGTAA","AGGCT","GTCAT","CGGCC","GGGAG","GTTAT","GTCAT","AGGCT","GTCAG","AGGAT"]:;
cluster(data,type="average",count=3,distance=hamdist,output=part)

[["GTCTT","GTCAT","GTTAT","GTCAT","GTCAG"],
 ["AAGCT","AGGCT","CGGCC","AGGCT","AGGAT"],
 ["GGTAA","GGGAG"]]
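For strings of equal length, the Hamming distance is the number of positions at which they differ. A small Python sketch (an illustration only; the name hamming is hypothetical and this is not the hamdist implementation) shows how close sequences within a cluster are compared with sequences from different clusters:

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))

print(hamming("GTCTT", "GTCAT"))  # same cluster
print(hamming("AAGCT", "AGGCT"))  # same cluster
print(hamming("GGTAA", "GGGAG"))  # same cluster
print(hamming("GTCTT", "AAGCT"))  # different clusters
```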

To display the dendrogram, enter:

cluster(data,type="average",count=3,distance=hamdist,output=tree)