HOW TO USE HICLUST (1)

S. C. Johnson
Bell Telephone Laboratories
Murray Hill, New Jersey


NON-TECHNICAL INTRODUCTION

Johnson's clustering program performs cluster analysis on a hierarchical basis. The hierarchical clustering refers to the approach where a set of clusters are formed by merging or splitting a cluster into two or more sub-clusters. Because clusters at a given level arise from higher levels in the cluster hierarchy, a tree is produced.

Johnson's approach calls for a simple set of input specifying the number of stimuli to be clustered, the specification of distance or proximity data, a format statement and a data matrix composed of similarity or dissimilarity data, similar to what may be used for KYST or INDSCAL analysis.

Program output consists of a listing of parameter values selected, a listing of original data, and two tree diagrams connecting the stimuli into clusters hierarchically. The first diagram is for the "min" or "Connectedness" method, and the second is for the "max" or "Diameter" method. For each level of clustering the associated value of , the height is printed.

Finally, Johnson's cluster statistic is printed.

Johnson has created an easily computed cluster statistic that is defined and discussed.  It provides some measure of the compactness of a cluster of k objects out of a given n objects.  In the technical introduction, the distribution of this statistic is examined, and it is shown that given n objects together with a distance matrix, the means and variances are computed explicitly for k = 2,3,...,n-1.


TECHNICAL INTRODUCTION

Suppose we are given n objects, numbered 1 through n, together with an nxn symmetric positive valued zero-diagonal matrix representing the distances between objects.  We shall represent the distance from the i-th to the j-th object by d(i,j).  We should like to test the hypothesis that some subset of k points forms a cluster,. that is, that these k points are actually significantly closer together than one would expect by chance.  In order to make precise this intuitive notion, we must agree on acceptable definitions for the terms 'close together' and 'by chance.'

Examine the entire set of n(n-1)/2 interpoint distances.  Of these, a subset of k(k-1)/2 distances represents the interpoint distances within a cluster of k objects (the 'inner' distances) a disjoint subset of k(n-k) distances represents the distances from the objects in the cluster to the objects outside of the cluster (the 'middle' distances), and the set of remaining (n-k)(n-k-1)/2 distances represents the distances which do not involve the objects in the cluster at all (the 'outer' distances).  To decide how close together a cluster is, we must somehow measure the extent to which the distributions of inner, middle, and outer conform to what we expect for a 'tight' cluster.  It seems that the outer distances have very little to do with the measurement of how close together a given cluster is; we might expect the outer distances to be larger than the inner distances, but there are times (for example, examining one out of two tight clusters) when in fact the outer distances might be of about the same distribution as the inner distances, or even smaller.  Thus, we propose leaving the outer distances out of consideration entirely, realizing that they will be important in our discussion of 'by chance,' below.  This leaves us with the inner and middle distances; certainly, the extent to which the middle distances are larger than the inner distances gives one reasonable measure of the tightness of the cluster.  We may measure this difference in several ways:

1. If we have interval scale distances, we may use the mean middle distance minus the mean inner distance as a measure.

2. If we have only rank order distances, we may convert the distances to ranks, and use the mean middle rank minus the mean inner rank as a measure.

We shall from here on treat only case 2, since the other case may be easily transformed into it by replacing the distances by ranks.  Given n objects and a distance matrix, if K is a cluster we shall denote the first statistic by (K).

We now examine the 'by chance' condition.  Typically, a researcher obtains a large matrix empirically, and then looks to a clustering program for help in organizing the data.  If he asks for assurance that a given cluster did not occur by chance, he really wishes to be told 'if you do the experiment again, this cluster has a very high probability of reappearing.' Thus, to give a completely satisfactory answer we must have a complete knowledge of the experimental errors in the data and how they affect the distance matrix.  This is a difficult problem, and seems to make a general solution quite improbable.

As one way out of this difficulty, we may examine the distribution of all the distances in the matrix; taking a cluster K out of n objects, we compute (K) as above, and then ask 'What is the probability that the observed value of the statistic could be obtained by a random sample from the matrix?' That is, our null hypothesis is that the value of (K) is not significantly different from one obtained by picking n(n-k) distances at random from the matrix and taking the mean, picking k(k-1)/2 distances at random from the matrix and taking the mean, and then forming the difference of the means.

This hypothesis, however, tacitly assumes independence of the inner and middle distances in a cluster; in fact, if the triangle inequality were satisfied, some clusters would appear excessively significant, since a few small inner distances and a few large middle distances (together with the triangle inequality) will imply many small inner distances and many large middle distances, with an artificially high confidence level.  In general, however, we have no idea how much this phenomenon is important in the given matrix.

Thus, it seems necessary to take the interdependence of the distances into account.  Accordingly, we shall examine the following null hypothesis:  'the observed value of the statistic is not significantly different from one obtained by picking k objects at random and computing the statistic.' Thus, for a given distance matrix and for each k = 2,3,...,n-1, we wish to examine the distribution of the cluster statistic over all clusters of k objects.  Then, given a cluster K, we compare (K) relative to all other clusters having the same number of _ objects.  We shall derive (in the next section) the mean and standard deviation of the statistic over all clusters of k objects, given n, k and a distance matrix.  If K is a cluster, we then compute the 'z score'

this as our final cluster measure.

Many problems remain; the major one is not knowing the distribution of the statistic in enough detail.  It would be possible, although with much more labor, to compute the higher moments of by a technique similar to that of the next section.  It is not clear what general type of distribution will have.  We can make a good case, however, for this distribution not being normal, or even approaching normality as n increases.  If fact, in the interesting cases where the data clusters tightly, the distribution of distances may be strongly bi-modal, thus affecting the distribution strongly for small or large k.  On the other hand, we believe that when k is close to n/2, the distribution in fact approaches a normal distribution.



The Cluster Statistic.

The program also computes a cluster statistic for each cluster which arises from either method.  This statistic is still experimental, in the sense that it is not clear that it is, in general, useful.  The statistic is not rank order invariant, and is certainly useless unless the data are at least interval scale data.  On the other hand, if the raw data are replaced by their ranks, the resulting statistic does have some meaning.

Basically, for each cluster a number is printed out.  The higher the number, the better the cluster.  The number is a kind of 'z score,' but computed over a statistic that is not, in general, normal.  For fairly good sized clusters, however, when n is over 20 or 30, and the distance matrix has a fairly reasonable distribution, the number may be treated like a z score with reasonable confidence.  In any case, however, if the number is denoted by z, you can say that the proportion of clusters better than the given cluster is less than 1/(z2).


HICLUST Input

Parameter Line:

The first line of the input file is a parameter line, which is followed by a format statement.  The parameter line contains two parameters:



Format Line:

The second line contains the format statement for reading the data matrix.  Standard FORTRAN formats must be used.  Note that the format must be a floating point format, since the data is stored as real data.



Data Input:

Data is input as a lower triangular matrix; That is, the first data line contains d(2,1), the second data line contains d(3,1) and d(3,2), the third data line contains d(4,1), d(4,2), and d(4,3), and so on down to and including the last row of the matrix.


NOTES

1. This paper has been modified for the PC-MDS version of HI-CLUST by Scott M. Smith.


Suppose, for example, we wished to input a distance matrix:

	0	1	2	3	4	5
			1	0	1	2	3	4
			2	1	0	1	2	3
			3	2	1	0	1	2
			4	3	2	1	0	1
			5	4	3	2	1	0

	Then the listing of the input file could appear:
			   6    1
			(5F2.0)
			01
			0201
			030201
			04030201
			0504030201


Sample HICLUST Data File: HICLUST.DAT
   15    1
(15F10.3)
59.130                                                                                                                                                          
62.420    30.790                                                                                                                                                
43.640    83.890    82.370                                                                                                                                                                                                                     
36.010    44.150    57.760    65.460                                                                                                                            
60.210    60.940    23.800    53.330    64.110                                                                                                                  
78.010    27.020    26.850    93.710    72.920    50.490                                                                                                        
34.570    36.360    52.440    55.350    32.870    59.690    64.230                                                                                              
32.100    32.290    50.180    49.110    36.290    55.830    61.550     5.520                                                                                    
61.520     8.490    25.290    85.640    48.960    58.730    31.520    37.940    38.320                                                                          
51.260    78.350    72.210    32.680    21.650    47.420    78.210    64.230    61.300    80.140                                                                
46.650    84.550    76.140    21.400    49.300    46.980    83.700    59.800    57.480    86.420    20.310                                                      
55.170    83.820    78.070     7.830    55.700    55.000    88.850    60.380    65.850    63.860    23.330    20.540                                            
53.670    83.010    70.680    30.560    49.870    47.080    78.130    63.770    65.560    84.240    23.100    10.700    24.760                                  
72.170    48.130    21.870    73.620    62.490    15.250    33.370    63.520    64.490    52.930    58.260    62.460    66.770    54.170   
Sample HICLUST Output
                                 H I C L U S T            
                         HIERARCHICAL  CLUSTER  ANALYSIS 
                            WRITTEN BY S. C. JOHNSON 
                                 PC-MDS  VERSION    
    
 ANALYSIS TITLE: HICLUST TEST DATA
 DATA IS READ FROM FILE:    HICLUST.DAT     
 OUTPUT PRINT FILE IS:      HICLUST.PRN     
  
 NO. OF VARIABLES        15 
 DISTANCE MEASURES        1 
 DATA INPUT FORMAT: 
 (15F10.3)                                                                         
 PRINTBACK OF SYMMETRIC MATRIX FOR INPUT DATA 
     .00  59.13  62.42  43.64  36.01  60.21  78.01  34.57  32.10  61.52  51.26  46.65  55.17  53.67  72.17 
   59.13    .00  30.79  83.89  44.15  60.94  27.02  36.36  32.29   8.49  78.35  84.55  83.82  83.01  48.13 
   62.42  30.79    .00  82.37  57.76  23.80  26.85  52.44  50.18  25.29  72.21  76.14  78.07  70.68  21.87 
   43.64  83.89  82.37    .00  65.46  53.33  93.71  55.35  49.11  85.64  32.68  21.40   7.83  30.56  73.62 
   36.01  44.15  57.76  65.46    .00  64.11  72.92  32.87  36.29  48.96  21.65  49.30  55.70  49.87  62.49 
   60.21  60.94  23.80  53.33  64.11    .00  50.49  59.69  55.83  58.73  47.42  46.98  55.00  47.08  15.25 
   78.01  27.02  26.85  93.71  72.92  50.49    .00  64.23  61.55  31.52  78.21  83.70  88.85  78.13  33.37 
   34.57  36.36  52.44  55.35  32.87  59.69  64.23    .00   5.52  37.94  64.23  59.80  60.38  63.77  63.52 
   32.10  32.29  50.18  49.11  36.29  55.83  61.55   5.52    .00  38.32  61.30  57.48  65.85  65.56  64.49 
   61.52   8.49  25.29  85.64  48.96  58.73  31.52  37.94  38.32    .00  80.14  86.42  63.86  84.24  52.93 
   51.26  78.35  72.21  32.68  21.65  47.42  78.21  64.23  61.30  80.14    .00  20.31  23.33  23.10  58.26 
   46.65  84.55  76.14  21.40  49.30  46.98  83.70  59.80  57.48  86.42  20.31    .00  20.54  10.70  62.46 
   55.17  83.82  78.07   7.83  55.70  55.00  88.85  60.38  65.85  63.86  23.33  20.54    .00  24.76  66.77 
   53.67  83.01  70.68  30.56  49.87  47.08  78.13  63.77  65.56  84.24  23.10  10.70  24.76    .00  54.17 
   72.17  48.13  21.87  73.62  62.49  15.25  33.37  63.52  64.49  52.93  58.26  62.46  66.77  54.17    .00 


 CONNECTEDNESS METHOD: 

    SIMILARITY          CONSTANTS 
    VALUE            
                     0 0 1 1 1 1 0 0 0 0 0 1 0 0 1  
                     5 4 3 1 2 4 1 8 9 7 2 0 3 6 5  
    .55200000E+01    . . . . . . . XXX . . . . . . 
    .78300000E+01    . XXX . . . . XXX . . . . . . 
    .84900000E+01    . XXX . . . . XXX . XXX . . . 
    .10700000E+02    . XXX . XXX . XXX . XXX . . . 
    .15250000E+02    . XXX . XXX . XXX . XXX . XXX 
    .20310000E+02    . XXX XXXXX . XXX . XXX . XXX 
    .20540000E+02    . XXXXXXXXX . XXX . XXX . XXX 
    .21650000E+02    XXXXXXXXXXX . XXX . XXX . XXX 
    .21870000E+02    XXXXXXXXXXX . XXX . XXX XXXXX 
    .25290000E+02    XXXXXXXXXXX . XXX . XXXXXXXXX 
    .26850000E+02    XXXXXXXXXXX . XXX XXXXXXXXXXX 
    .32100000E+02    XXXXXXXXXXX XXXXX XXXXXXXXXXX 
    .32290000E+02    XXXXXXXXXXX XXXXXXXXXXXXXXXXX 
    .32870000E+02    XXXXXXXXXXXXXXXXXXXXXXXXXXXXX 

 END OF METHOD
  ---------------------------------------------------------------- 
 **CLUSTER**************STATISTIC******************************** 
 ---------------------------------------------------------------- 
     2 PTS.             2.1036510000 
   8  9 
 ---------------------------------------------------------------- 
     2 PTS.             2.2725360000 
   4 13 
 ---------------------------------------------------------------- 
     2 PTS.             2.2342120000 
   2 10 
 ---------------------------------------------------------------- 
     2 PTS.             2.0240680000 
  12 14 
 ---------------------------------------------------------------- 
     2 PTS.             1.7727890000 
   6 15 
 ---------------------------------------------------------------- 
     3 PTS.             3.0681480000 
  11 12 14 
 ---------------------------------------------------------------- 
     5 PTS.             6.1619810000 
   4 13 11 12 14 
 ---------------------------------------------------------------- 
     6 PTS.             5.8046500000 
   5  4 13 11 12 14 
 ---------------------------------------------------------------- 
     3 PTS.             2.8645100000 
   3  6 15 
 ---------------------------------------------------------------- 
     5 PTS.             3.5864870000 
   2 10  3  6 15 
 ---------------------------------------------------------------- 
     6 PTS.             5.3076040000 
   7  2 10  3  6 15 
 ---------------------------------------------------------------- 
     3 PTS.             2.3808690000 
   1  8  9 
 ---------------------------------------------------------------- 
     9 PTS.             5.1098720000 
   1  8  9  7  2 10  3  6 15 
 ---------------------------------------------------------------- 
  
 DIAMETER METHOD: 
  
    SIMILARITY          CONSTANTS 
    VALUE 
                     0 0 0 0 1 0 1 1 1 0 0 1 0 0 1  
                     1 8 9 5 1 4 3 2 4 7 2 0 3 6 5  
  
    .55200000E+01    . XXX . . . . . . . . . . . . 
    .78300000E+01    . XXX . . XXX . . . . . . . . 
    .84900000E+01    . XXX . . XXX . . . XXX . . . 
    .10700000E+02    . XXX . . XXX XXX . XXX . . . 
    .15250000E+02    . XXX . . XXX XXX . XXX . XXX 
    .21650000E+02    . XXX XXX XXX XXX . XXX . XXX 
    .23800000E+02    . XXX XXX XXX XXX . XXX XXXXX 
    .30560000E+02    . XXX XXX XXXXXXX . XXX XXXXX 
    .31520000E+02    . XXX XXX XXXXXXX XXXXX XXXXX 
    .34570000E+02    XXXXX XXX XXXXXXX XXXXX XXXXX 
    .60940000E+02    XXXXX XXX XXXXXXX XXXXXXXXXXX 
    .64230000E+02    XXXXXXXXX XXXXXXX XXXXXXXXXXX 
    .65850000E+02    XXXXXXXXXXXXXXXXX XXXXXXXXXXX 
    .93710000E+02    XXXXXXXXXXXXXXXXXXXXXXXXXXXXX 
 END OF METHOD
  ---------------------------------------------------------------- 
 **CLUSTER**************STATISTIC******************************** 
 ---------------------------------------------------------------- 
     2 PTS.             2.1036510000 
   8  9 
 ---------------------------------------------------------------- 
     2 PTS.             2.2725360000 
   4 13 
 ---------------------------------------------------------------- 
     2 PTS.             2.2342120000 
   2 10 
 ---------------------------------------------------------------- 
     2 PTS.             2.0240680000 
  12 14 
 ---------------------------------------------------------------- 
     2 PTS.             1.7727890000 
   6 15 
 ---------------------------------------------------------------- 
     2 PTS.             1.3951500000 
   5 11 
 ---------------------------------------------------------------- 
     3 PTS.             2.8645100000 
   3  6 15 
 ---------------------------------------------------------------- 
     4 PTS.             4.7521320000 
   4 13 12 14 
 ---------------------------------------------------------------- 
     3 PTS.             3.1424840000 
   7  2 10 
 ---------------------------------------------------------------- 
     3 PTS.             2.3808690000 
   1  8  9 
 ---------------------------------------------------------------- 
     6 PTS.             5.3076040000 
   7  2 10  3  6 15 
 ---------------------------------------------------------------- 
     5 PTS.             2.4994310000 
   1  8  9  5 11 
 ---------------------------------------------------------------- 
     9 PTS.             6.1253430000 
   1  8  9  5 11  4 13 12 14 
 ---------------------------------------------------------------- 
  
 N, THE NUMBER OF VARIABLES, EQUALS ZERO. HICLUST STOP