Some of you may have followed the recent back-and-forth on the LingPipe newsgroup concerning the non-determinstic behavior of LingPipe’s implementation of complete-link and single-link agglomerative clustering.
It’s not our fault. Really. The basic algorithm is non-determistic.
The Usual Algorithm
As it’s usually presented, we initialize with every element in its own singleton cluster. Then at each step, we choose two clusters to merge based on the distance between them. For single-link clustering, distance is the distance between the two closest points (i.e. min of pairwise distances); for complete-link, the distance between the two furthest points (i.e. max of pairwise distances). Usually we choose the closest pair of clusters to merge. But what if there’s a tie?
Some Examples
Neither the Wikipedia nor the new(ish) Stanford IR book discuss what to do in the case of ties, leaving the algorithm non-determinstic. Here are links:
- Wikipedia: Cluster Analysis
- Stanford IR Book: Hierarchical Agglomerative Clustering
- Stanford IR Book: Single-Link and Complete-Link Clustering
(As of today), the Wikipedia says [my emphasis], “This method builds the hierarchy from the individual elements by progressively merging clusters. … The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.”
Similarly, the Stanford IR book’s section on hierarchical agglomerative clustering says [my emphasis], “In each iteration, the two most similar clusters are merged.”
Neither source says what to do when there are ties.
Example
Suppose we have three points on a line, {1, 3, 4, 6} with the usual Euclidean distance function (e.g. d(1,3)=2; d(3,4)=1, d(6,1)=5, d(4,4)=0, …). And let’s suppose we go with complete-link clustering. There are two alternative possibilities when we come to the second step:
1 3 4 6 | |
1 {3,4}:1.0 6 | |
{1, {3,4}:1.0}:3.0 6 | 1 {{3,4}:1.0, 6}:3.0 |
{{1, {3,4}:1.0}:3.0}, 6}:5.0 | {1, {{3,4}:1.0, 6}:3.0}:5.0 |
because d(1,{3,4}) = 3.0 and d({3,4},6) = 3.0.
Without the scores, the dendrograms are {{1,{3,4}},6} vs. {1,{{3,4},6}}. Very different structures.
Single-link clustering suffers from the same non-determinsm. Specifically, for the example above, the same first step applies, and then the same ambiguity in the second step where both distances are 2.0 (rather than the 3.0 in the complete-link algorithm).
Breaking the Symmetry
Usually an implementation will break the tie in some programmatic way, such as choosing the first one to appear in some data structure. The problem for LingPipe, which uses hash-based Java collections over objects with reference-based equality is that the ordering’s not fixed. (This is well documented, though a bit subtle.) Specifically, different runs can produce different results. See the last blog post for more details on the non-determinism of Java hash-based collections.
Cophenetic Distance
I suspect it’s not possible to normalize at the dendrogram level for either complete-link or single-link clustering.
On one interpretation of clustering, dendrograms allow us to infer cophenetic distances, which is the distance at which clusters are merged. Unfortunately, the cophenetic distances derived from the two clusterings resulting from the non-deterministic complete-link example above are different, with either 1 or 6 being closer to {3,4} depending on which is merged first. I’m guessing the same thing will happen with single-link, but haven’t worked out the details (in this example, the cophenetic distances are the same).
Thus I believe the real problem’s in the model implicitly assumed in the clustering strategies.
Furthermore, both of these algorithms are notoriously non-robust to small differences in distances or the set of objects being clustered.
Does it Matter?
Are there applications where this will matter? I suppose if you really believe in clustering results and try to analyze them via cophenetic correlation or other measures it will.
May 27, 2010 at 5:41 am |
maybe a way to break ties for single link would be to prefer smaller clusters over big ones. as big clusters have a multitude of chances to win the ‘who’s joined next’ – competition this would be kind of a fair thing to do. For complete link, it’s the other way around. If the clusters are equally sized, the problem remains, of course.
June 6, 2010 at 6:48 pm |
maybe a way to break ties for single link would be to prefer smaller clusters over big ones. as big clusters have a multitude of chances to win the ‘who’s joined next’ – competition this would be kind of a fair thing to do. For complete link, it’s the other way around. If the clusters are equally sized, the problem remains, of course.
+1
October 23, 2011 at 3:23 pm |
You can find C# download DEMO.
http://semanticsearchart.com/researchHAC.html
There is also comparison to other methods of English texts categorization. HAC works better than latent semantic analysis and probabilistic latent semantic analysis.