Clustering by Pattern Similarity:

the pCluster Algorithm

 

What is a pCluster:

 

               Assume we have a dataset:

 

               1    1    1    100    2    3    4
               2    2    2     80    7    8    9
               3    3    3      3    3    3    3
               4    4    4    103    9    9    9

 

               Each row is an object and each column is an attribute. We want to find groups of the 4 objects that follow a similar pattern on some subset of the columns.

 

               The pCluster algorithm finds 5 clusters in the dataset (objects and columns are numbered from 0):

 

               Cluster 1:

                              Objects: 0, 1, 2, 3

                              Columns: 0, 1, 2

              

               Cluster 2:

                              Objects: 0, 1

                              Columns: 4, 5, 6

 

               Cluster 3:

                              Objects: 0, 3

                              Columns: 0, 1, 2, 3

 

               Cluster 4:

                              Objects: 1, 3

                              Columns: 0, 1, 2, 4

 

               Cluster 5:

                              Objects: 2, 3

                              Columns: 4, 5, 6
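
               The clusters listed above can be checked by hand. The following is a minimal sketch, not the pCluster algorithm itself: it assumes the pScore definition from the referenced paper, where the pScore of a 2 x 2 submatrix is |(d_xa - d_xb) - (d_ya - d_yb)| and a set of objects and columns is a delta-pCluster when every such submatrix has pScore <= delta. The function names and the choice delta = 0 are assumptions made for illustration; this page does not state which parameters produced the 5 clusters.

```python
from itertools import combinations

# Example dataset from this page: 4 objects (rows) x 7 columns, numbered from 0.
DATA = [
    [1, 1, 1, 100, 2, 3, 4],
    [2, 2, 2, 80, 7, 8, 9],
    [3, 3, 3, 3, 3, 3, 3],
    [4, 4, 4, 103, 9, 9, 9],
]


def p_score(data, x, y, a, b):
    """pScore of the 2x2 submatrix formed by objects x, y and columns a, b."""
    return abs((data[x][a] - data[x][b]) - (data[y][a] - data[y][b]))


def is_delta_pcluster(data, objects, columns, delta):
    """True if every 2x2 submatrix of (objects, columns) has pScore <= delta."""
    return all(
        p_score(data, x, y, a, b) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(columns, 2)
    )


# The five clusters listed above, as (objects, columns) pairs.
CLUSTERS = [
    ([0, 1, 2, 3], [0, 1, 2]),
    ([0, 1], [4, 5, 6]),
    ([0, 3], [0, 1, 2, 3]),
    ([1, 3], [0, 1, 2, 4]),
    ([2, 3], [4, 5, 6]),
]

# Assumed threshold: delta = 0, i.e. pure shifting patterns only.
results = [is_delta_pcluster(DATA, objs, cols, delta=0) for objs, cols in CLUSTERS]
print(results)

# Extending Cluster 1 with column 3 breaks the pattern:
# the values 100, 80, 3, 103 do not shift together with columns 0-2.
print(is_delta_pcluster(DATA, [0, 1, 2, 3], [0, 1, 2, 3], delta=0))
```

               In each listed cluster the objects differ by a constant offset on the chosen columns. For example, in Cluster 3 object 3 equals object 0 plus 3 on columns 0 through 3, which is why column 3 belongs to that cluster even though its values (100 and 103) are far from the others.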

 

Download:

 

               For MS Windows (Cygwin)

 

Usage:

 

               delta.exe FILE delta nc nr

 

               where:

                              FILE                   a space-delimited text file

                              delta                    the threshold δ for δ-pCluster

                              nc                         minimum number of columns in a cluster

                              nr                          minimum number of rows in a cluster

 

               The first line of the FILE contains

                              #ROWS            # of rows

                              #COLS              # of columns

               of the data that follows.
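
               The input format described above can be produced with a short script. This is a sketch under stated assumptions: it assumes the two counts share the first line separated by a space, and that each following line holds one object (row). The helper names, the file name example.txt, and the parameter values 0, 3, 2 are hypothetical and only illustrate the argument order given under Usage; the command is built but not run.

```python
def format_input(data):
    """Return input-file text: a '#ROWS #COLS' header, then one space-delimited row per line.

    Assumption: both counts sit on the first line, separated by a space.
    """
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0
    if any(len(row) != n_cols for row in data):
        raise ValueError("all rows must have the same number of columns")
    lines = [f"{n_rows} {n_cols}"]
    lines += [" ".join(str(value) for value in row) for row in data]
    return "\n".join(lines) + "\n"


def build_command(path, delta, nc, nr, exe="delta.exe"):
    """Return the argument list for: delta.exe FILE delta nc nr."""
    return [exe, path, str(delta), str(nc), str(nr)]


# The 4 x 7 example dataset from this page.
data = [
    [1, 1, 1, 100, 2, 3, 4],
    [2, 2, 2, 80, 7, 8, 9],
    [3, 3, 3, 3, 3, 3, 3],
    [4, 4, 4, 103, 9, 9, 9],
]

text = format_input(data)
print(text)

# Hypothetical file name and parameter values, for illustration only.
command = build_command("example.txt", delta=0, nc=3, nr=2)
print(" ".join(command))
```

               Writing the returned text to a file and passing that file as FILE should match the layout described here, provided the assumption about the header line holds.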

 

Sample Datasets:

 

               Synthetic Datasets (each with 10 embedded clusters):

                              3000 x 100                       (suggested parameters:  delta=1, nc=8, nr=27)

                              3000 x 30                         (suggested parameters:  delta=1, nc=6, nr=27)

               Yeast DNA Microarray (after data cleaning)

 

Reference:

 

               Haixun Wang et al., “Clustering by pattern similarity in large datasets”, in SIGMOD, pp. 394-405, June 2002, Madison, Wisconsin, USA.