22.1.1 Hierarchical clustering
The cluster command is used for hierarchical agglomerative clustering
of custom data, optionally with a custom distance function.
- cluster takes one or more arguments:
  - data, a list of data points.
  - Optionally, opt, a sequence of options which may contain the following:
    - k or count=k, where k is a positive integer
      specifying the number of clusters (unset by default).
    - count_inf=kl, where kl is a positive integer specifying the
      minimal number of clusters (unset by default).
    - count_sup=ku, where ku is a positive integer specifying the
      maximal number of clusters (by default, ku=16).
    - type=linkage, where linkage is a string specifying
      the linkage method. Available methods are:
      single (the default), complete, average, weighted,
      and ward.
    - distance=dist, where dist is a distance function. By default,
      the squared Euclidean distance function distance2 is used unless data
      is a list of strings, in which case the Levenshtein distance is used.
      For example, the taxicab distance function could be defined by typing:
      taxicab:=(p1,p2)->l1norm(p1-p2)
      and then passed as distance=taxicab.
    - index=ind, where ind is a string or a list of strings
      specifying the index function(s) used for selecting the optimal number
      of clusters (unset by default). Available index functions are:
      silhouette, mcclain-rao, dunn, and all.
      ind may also take a boolean value, in which case no index is used
      (ind=false) or the silhouette index is used (ind=true).
    - output=out, where out is one of the following:
      - part, for outputting the partition of data into clusters.
      - list, for outputting the list of cluster indices for data points (the default).
      - plot, for outputting a colored visualization of data points with additional
        specifications given by the display option (see below). Note that this is
        possible only with 2D and 3D numerical data.
      - count, for outputting only the number of clusters.
      - index, for outputting the list of values of the (first) used index.
      - tree, for outputting a dendrogram drawing which visualizes the linking process.
    - display=disp, where disp is a configuration for the plot output
      (by default, points are drawn as dots of width 2).
    - color=colors, where colors is a list of colors that will be used
      for cluster coloring in the plot output mode (by default, the standard palette
      of the first 16 colors is used, with white and yellow replaced by more visible colors).
    - labels=lab, where lab is a boolean value which specifies
      whether to show data points in the dendrogram.
- cluster(data ⟨,opt⟩) returns
the result as specified by the output option (see above).
- Hierarchical clustering is slower than the k-means algorithm (see Section 22.1.2)
but may produce a better classification of the data. Hierarchical clustering is also a method of
choice for custom data types and distance functions.
- If more than one index is computed in a clustering process, then the optimal number of clusters
is decided by voting: the number which was selected by most indices is used.
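The voting rule described above can be illustrated with a short Python sketch. It is not the Giac implementation: the function name vote_cluster_count is hypothetical, and the tie-breaking rule (prefer the smaller count) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

def vote_cluster_count(selections):
    """Return the cluster count preferred by the largest number of indices.

    `selections` maps an index name to the cluster count that index selects.
    Ties go to the smaller count (an assumption of this sketch only).
    """
    tally = Counter(selections.values())
    best_votes = max(tally.values())
    return min(k for k, votes in tally.items() if votes == best_votes)

# Two of three indices prefer 3 clusters, so 3 wins the vote.
choice = vote_cluster_count({"silhouette": 3, "mcclain-rao": 3, "dunn": 4})
print(choice)
```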
Examples
To apply the cluster command to a 2D “aggregation”
shape dataset, which can be downloaded from here,
first load the data by entering:
data:=csv2gen("/home/luka/Downloads/Aggregation.txt","\t"):;
The original file contains three data columns: the first holds the x-coordinates, the second the
y-coordinates, and the third the cluster indices. To cluster and plot the 2D points specified by
the first two columns, with the average linkage method and the silhouette index (which is used
by default if index=true), enter:
cluster(delcols(data,2),type="average",index=true,output=plot)
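To show what average linkage does in principle, here is a minimal Python sketch of agglomerative clustering on a handful of 2D points. It is an illustration under stated assumptions, not the algorithm Giac uses internally: the function names are hypothetical, the cluster count k is fixed in advance instead of being chosen by an index, and the linkage is taken to be the plain mean of pairwise squared Euclidean distances.

```python
from itertools import combinations

def squared_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def average_linkage(points, k, dist=squared_euclidean):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Returns a list of clusters, each a sorted list of point positions.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise distance between the members of clusters a and b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # combinations yields pairs (a, b) with a < b, so deleting b is safe.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Made-up sample: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = average_linkage(points, 2)
print(result)
```

Passing a different function as dist plays the same role as the distance option of cluster.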
Levenshtein distance (see Section 5.2.16) is used by default for string
data. For example:
cluster(["cat","mouse","rat","spouse","house","cut"],output=part)
[["cat","rat","cut"],["mouse","spouse","house"]]
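The partition above is plausible once the pairwise edit distances are inspected. The following Python sketch computes the Levenshtein distance with the standard dynamic-programming recurrence. The function name levenshtein is chosen for illustration and is not a Giac identifier.

```python
def levenshtein(s, t):
    """Edit distance: fewest single-character insertions, deletions
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

words = ["cat", "mouse", "rat", "spouse", "house", "cut"]
for w in words:
    print(w, [levenshtein(w, v) for v in words])
```

Words inside each cluster are at distance at most 2 from one another (for example, "cat" and "rat" differ by one substitution), whereas "cat" and "mouse" are at distance 5.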
In the following example, genomic sequences are split into three clusters
using the average linkage method and the Hamming distance function hamdist.
data:=["GTCTT","AAGCT","GGTAA","AGGCT","GTCAT","CGGCC",
"GGGAG","GTTAT","GTCAT","AGGCT","GTCAG","AGGAT"]:;
cluster(data,type="average",count=3,distance=hamdist,output=part)
[["GTCTT","GTCAT","GTTAT","GTCAT","GTCAG"],
 ["AAGCT","AGGCT","CGGCC","AGGCT","AGGAT"],
 ["GGTAA","GGGAG"]]
To display the dendrogram, enter:
cluster(data,type="average",count=3,distance=hamdist,output=tree)
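For reference, the Hamming distance between two equal-length sequences is the number of positions at which they differ. The Python sketch below shows this standard definition. It assumes that hamdist follows the same definition for strings, and the function name hamming is hypothetical.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))

# Sequences from the same cluster in the example differ in few positions,
# while sequences from different clusters differ in more.
print(hamming("GTCTT", "GTCAT"))
print(hamming("GTCTT", "AAGCT"))
```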