Clustering by Pattern Similarity: the pCluster Algorithm
What is a pCluster:
Assume we have a dataset:
| 1 | 1 | 1 | 100 | 2 | 3 | 4 |
| 2 | 2 | 2 |  80 | 7 | 8 | 9 |
| 3 | 3 | 3 |   3 | 3 | 3 | 3 |
| 4 | 4 | 4 | 103 | 9 | 9 | 9 |
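Each row is an object and each column is an attribute. We want to find which of the 4 objects follow similar patterns, and on which columns.
The pCluster algorithm finds 5 clusters in the dataset (objects and columns are numbered from 0):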
Cluster 1:
Objects: 0, 1, 2, 3
Columns: 0, 1, 2
Cluster 2:
Objects: 0, 1
Columns: 4, 5, 6
Cluster 3:
Objects: 0, 3
Columns: 0, 1, 2, 3
Cluster 4:
Objects: 1, 3
Columns: 0, 1, 2, 4
Cluster 5:
Objects: 2, 3
Columns: 4, 5, 6
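
The clusters listed above can be checked by hand against the definition in the referenced paper, where the pScore of a 2 x 2 submatrix on objects x, y and columns a, b is |(d_xa - d_xb) - (d_ya - d_yb)|, and a set of objects and columns forms a pCluster when every such 2 x 2 submatrix has a pScore of at most delta. The following Python sketch only verifies that condition; it is not the mining algorithm itself. The function names are illustrative, and delta = 0 is an assumption, since the threshold used for this example is not stated here.

```python
from itertools import combinations

# The 4 x 7 example dataset (rows = objects, columns = attributes).
DATA = [
    [1, 1, 1, 100, 2, 3, 4],
    [2, 2, 2, 80, 7, 8, 9],
    [3, 3, 3, 3, 3, 3, 3],
    [4, 4, 4, 103, 9, 9, 9],
]

# The 5 clusters listed above, as (objects, columns) pairs, 0-indexed.
CLUSTERS = [
    ([0, 1, 2, 3], [0, 1, 2]),
    ([0, 1], [4, 5, 6]),
    ([0, 3], [0, 1, 2, 3]),
    ([1, 3], [0, 1, 2, 4]),
    ([2, 3], [4, 5, 6]),
]


def pscore(data, x, y, a, b):
    """pScore of the 2 x 2 submatrix on objects x, y and columns a, b."""
    return abs((data[x][a] - data[x][b]) - (data[y][a] - data[y][b]))


def is_pcluster(data, objects, columns, delta):
    """True if every 2 x 2 submatrix of (objects, columns) has pScore <= delta."""
    return all(
        pscore(data, x, y, a, b) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(columns, 2)
    )


if __name__ == "__main__":
    # delta = 0 is an assumed threshold for this illustration.
    for objects, columns in CLUSTERS:
        print(objects, columns, is_pcluster(DATA, objects, columns, delta=0))
```

Under that assumption, adding column 3 to Cluster 2's objects breaks the pattern: objects 0 and 1 on columns 0 and 3 give |(1 - 100) - (2 - 80)| = 21, which is why column 3 appears only in Cluster 3.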
Download:
Usage:
delta.exe FILE delta nc nr
where:
FILE   a space-delimited text file
delta  the threshold δ for the δ-pCluster
nc     minimal # of columns of a cluster
nr     minimal # of rows of a cluster
The first line of the FILE contains
#ROWS  # of rows
#COLS  # of columns
of the data that follows
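
As an illustration of the input format described above, the following Python sketch builds and re-reads the contents of such a file. It assumes that #ROWS and #COLS appear in that order, separated by a space, on the first line, and that each object then occupies one space-delimited line; the exact layout the program expects is not spelled out here, so treat this as a guess. The names `format_input` and `parse_input` are hypothetical helpers, not part of the program.

```python
def format_input(rows):
    """Return file contents: a '#ROWS #COLS' header line, then one row per line.

    Assumed layout; the header order (rows first, then columns) follows the
    order in which the two values are listed in the description.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all rows must have the same number of columns")
    lines = [f"{n_rows} {n_cols}"]
    lines += [" ".join(str(v) for v in r) for r in rows]
    return "\n".join(lines) + "\n"


def parse_input(text):
    """Parse contents in the assumed layout back into a list of rows."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("header does not match the amount of data")
    return [values[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


if __name__ == "__main__":
    example = [
        [1, 1, 1, 100, 2, 3, 4],
        [2, 2, 2, 80, 7, 8, 9],
        [3, 3, 3, 3, 3, 3, 3],
        [4, 4, 4, 103, 9, 9, 9],
    ]
    print(format_input(example), end="")
```

Text produced this way would be saved to disk and passed as the FILE argument.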
Sample Datasets:
Synthetic Dataset (each with 10 embedded clusters):
3000 x 100 (suggested parameters: δ=1, nc=8, nr=27)
3000 x 30 (suggested parameters: δ=1, nc=6, nr=27)
Yeast DNA Microarray (after data cleaning)
Reference:
Haixun Wang et al. "Clustering by pattern similarity in large datasets", in SIGMOD, pp. 394-405, June 2002, Madison, Wisconsin, USA.