
22.1.1  Hierarchical clustering

The cluster command performs hierarchical agglomerative clustering of arbitrary data, optionally using a custom distance function.

Examples

To apply the cluster command to the 2D “Aggregation” shape dataset, which can be downloaded from here, first load the data by entering:

data:=csv2gen("/home/luka/Downloads/Aggregation.txt","\t"):;

The original file contains three data columns: the x-coordinates, the y-coordinates, and the cluster indices. To cluster and plot the 2D points given by the first two columns, using the average linkage method and the silhouette index (the default when index=true), enter:

cluster(delcols(data,2),type="average",index=true,output=plot)

Levenshtein distance (see Section 5.2.16) is used by default for string data. For example:

cluster(["cat","mouse","rat","spouse","house","cut"],output=part)

     [["cat","rat","cut"],["mouse","spouse","house"]]
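
To see why these words group the way they do, the following is a minimal Python sketch of the Levenshtein (edit) distance, written outside Xcas. The function name levenshtein_distance is illustrative and is not a Giac command; the sketch assumes unit cost for each insertion, deletion, and substitution, and is not necessarily how cluster computes the distance internally.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


words = ["cat", "mouse", "rat", "spouse", "house", "cut"]
# Print the distance from the first word to every other word.
for w in words[1:]:
    print(words[0], w, levenshtein_distance(words[0], w))
```

Words inside each reported cluster are at most two edits apart, while "cat" and "mouse" share no characters and are five edits apart.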

In the following example, genomic sequences are split into three clusters using average linkage and the Hamming distance function hamdist.

data:=["GTCTT","AAGCT","GGTAA","AGGCT","GTCAT","CGGCC","GGGAG","GTTAT","GTCAT","AGGCT","GTCAG","AGGAT"]:;
cluster(data,type="average",count=3,distance=hamdist,output=part)

     [["GTCTT","GTCAT","GTTAT","GTCAT","GTCAG"],["AAGCT","AGGCT","CGGCC","AGGCT","AGGAT"],["GGTAA","GGGAG"]]
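
The two ingredients of this example can be sketched in Python, outside Xcas. The names hamming_distance and average_linkage are hypothetical helpers for illustration, not Giac commands, and the sketch assumes the usual definitions: the Hamming distance counts differing positions in equal-length sequences, and average linkage is the mean pairwise distance between the members of two clusters.

```python
from itertools import product


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def average_linkage(c1, c2, dist=hamming_distance) -> float:
    """Mean of dist(a, b) over all pairs with a in c1 and b in c2."""
    return sum(dist(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))


# Distance between two sequences that end up in the same cluster above.
print(hamming_distance("GTCTT", "GTCAT"))
# Average-linkage distance between a pair of sequences and a single sequence.
print(average_linkage(["GTCTT", "GTCAT"], ["GGTAA"]))
```

An agglomerative procedure repeatedly merges the two clusters with the smallest linkage value until the requested number of clusters (here, count=3) remains.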

To display the dendrogram, enter:

cluster(data,type="average",count=3,distance=hamdist,output=tree)
