Johnson's clustering program performs cluster analysis on a hierarchical basis. The hierarchical clustering refers to the approach where a set of clusters are formed by merging or splitting a cluster into two or more sub-clusters. Because clusters at a given level arise from higher levels in the cluster hierarchy, a tree is produced.
Johnson's approach calls for a simple set of input specifying the number of stimuli to be clustered, the specification of distance or proximity data, a format statement and a data matrix composed of similarity or dissimilarity data, similar to what may be used for KYST or INDSCAL analysis.
Program output consists of a listing of parameter values selected, a listing of original data, and two tree diagrams connecting the stimuli into clusters hierarchically. The first diagram is for the "min" or "Connectedness" method, and the second is for the "max" or "Diameter" method. For each level of clustering the associated value of , the height is printed.
Finally, Johnson's cluster statistic is printed.
Johnson has created an easily computed cluster statistic that is defined and discussed. It provides some measure of the compactness of a cluster of k objects out of a given n objects. In the technical introduction, the distribution of this statistic is examined, and it is shown that given n objects together with a distance matrix, the means and variances are computed explicitly for k = 2,3,...,n-1.
Suppose we are given n objects, numbered 1 through n, together with an nxn symmetric positive valued zero-diagonal matrix representing the distances between objects. We shall represent the distance from the i-th to the j-th object by d(i,j). We should like to test the hypothesis that some subset of k points forms a cluster,. that is, that these k points are actually significantly closer together than one would expect by chance. In order to make precise this intuitive notion, we must agree on acceptable definitions for the terms 'close together' and 'by chance.'
Examine the entire set of n(n-1)/2 interpoint distances. Of these, a subset of k(k-1)/2 distances represents the interpoint distances within a cluster of k objects (the 'inner' distances) a disjoint subset of k(n-k) distances represents the distances from the objects in the cluster to the objects outside of the cluster (the 'middle' distances), and the set of remaining (n-k)(n-k-1)/2 distances represents the distances which do not involve the objects in the cluster at all (the 'outer' distances). To decide how close together a cluster is, we must somehow measure the extent to which the distributions of inner, middle, and outer conform to what we expect for a 'tight' cluster. It seems that the outer distances have very little to do with the measurement of how close together a given cluster is; we might expect the outer distances to be larger than the inner distances, but there are times (for example, examining one out of two tight clusters) when in fact the outer distances might be of about the same distribution as the inner distances, or even smaller. Thus, we propose leaving the outer distances out of consideration entirely, realizing that they will be important in our discussion of 'by chance,' below. This leaves us with the inner and middle distances; certainly, the extent to which the middle distances are larger than the inner distances gives one reasonable measure of the tightness of the cluster. We may measure this difference in several ways:
1. If we have interval scale distances, we may use the mean middle distance minus the mean inner distance as a measure.
2. If we have only rank order distances, we may convert the distances to ranks, and use the mean middle rank minus the mean inner rank as a measure.
We shall from here on treat only case 2, since the other case may be easily transformed into it by replacing the distances by ranks. Given n objects and a distance matrix, if K is a cluster we shall denote the first statistic by (K).
We now examine the 'by chance' condition. Typically, a researcher obtains a large matrix empirically, and then looks to a clustering program for help in organizing the data. If he asks for assurance that a given cluster did not occur by chance, he really wishes to be told 'if you do the experiment again, this cluster has a very high probability of reappearing.' Thus, to give a completely satisfactory answer we must have a complete knowledge of the experimental errors in the data and how they affect the distance matrix. This is a difficult problem, and seems to make a general solution quite improbable.
As one way out of this difficulty, we may examine the distribution of all the distances in the matrix; taking a cluster K out of n objects, we compute (K) as above, and then ask 'What is the probability that the observed value of the statistic could be obtained by a random sample from the matrix?' That is, our null hypothesis is that the value of (K) is not significantly different from one obtained by picking n(n-k) distances at random from the matrix and taking the mean, picking k(k-1)/2 distances at random from the matrix and taking the mean, and then forming the difference of the means.
This hypothesis, however, tacitly assumes independence of the inner and middle distances in a cluster; in fact, if the triangle inequality were satisfied, some clusters would appear excessively significant, since a few small inner distances and a few large middle distances (together with the triangle inequality) will imply many small inner distances and many large middle distances, with an artificially high confidence level. In general, however, we have no idea how much this phenomenon is important in the given matrix.
Thus, it seems necessary to take the interdependence of the distances into account. Accordingly, we shall examine the following null hypothesis: 'the observed value of the statistic is not significantly different from one obtained by picking k objects at random and computing the statistic.' Thus, for a given distance matrix and for each k = 2,3,...,n-1, we wish to examine the distribution of the cluster statistic over all clusters of k objects. Then, given a cluster K, we compare (K) relative to all other clusters having the same number of _ objects. We shall derive (in the next section) the mean and standard deviation of the statistic over all clusters of k objects, given n, k and a distance matrix. If K is a cluster, we then compute the 'z score'

this as our final cluster measure.
Many problems remain; the major one is not knowing the distribution of the statistic in enough detail. It would be possible, although with much more labor, to compute the higher moments of by a technique similar to that of the next section. It is not clear what general type of distribution will have. We can make a good case, however, for this distribution not being normal, or even approaching normality as n increases. If fact, in the interesting cases where the data clusters tightly, the distribution of distances may be strongly bi-modal, thus affecting the distribution strongly for small or large k. On the other hand, we believe that when k is close to n/2, the distribution in fact approaches a normal distribution.
The Cluster Statistic.
The program also computes a cluster statistic for each cluster which arises from either method. This statistic is still experimental, in the sense that it is not clear that it is, in general, useful. The statistic is not rank order invariant, and is certainly useless unless the data are at least interval scale data. On the other hand, if the raw data are replaced by their ranks, the resulting statistic does have some meaning.
Basically, for each cluster a number is printed out. The higher the number, the better the cluster. The number is a kind of 'z score,' but computed over a statistic that is not, in general, normal. For fairly good sized clusters, however, when n is over 20 or 30, and the distance matrix has a fairly reasonable distribution, the number may be treated like a z score with reasonable confidence. In any case, however, if the number is denoted by z, you can say that the proportion of clusters better than the given cluster is less than 1/(z2).
Parameter Line:
The first line of the input file is a parameter line, which is followed by a format statement. The parameter line contains two parameters:
(1) N, the size of the data matrix parameter, (Max = 200 x 200)
(2) IS, the type of data parameter, where
1 = matrix of distances
2 = matrix of proximities
Format Line:
The second line contains the format statement for reading the data matrix. Standard FORTRAN formats must be used. Note that the format must be a floating point format, since the data is stored as real data.
Data Input:
Data is input as a lower triangular matrix; That is, the first data line contains d(2,1), the second data line contains d(3,1) and d(3,2), the third data line contains d(4,1), d(4,2), and d(4,3), and so on down to and including the last row of the matrix.
1. This paper has been modified for the PC-MDS version of HI-CLUST by Scott M. Smith.
Suppose, for example, we wished to input a distance matrix:
0 1 2 3 4 5 1 0 1 2 3 4 2 1 0 1 2 3 3 2 1 0 1 2 4 3 2 1 0 1 5 4 3 2 1 0 Then the listing of the input file could appear: 6 1 (5F2.0) 01 0201 030201 04030201 0504030201
15 1 (15F10.3) 59.130 62.420 30.790 43.640 83.890 82.370 36.010 44.150 57.760 65.460 60.210 60.940 23.800 53.330 64.110 78.010 27.020 26.850 93.710 72.920 50.490 34.570 36.360 52.440 55.350 32.870 59.690 64.230 32.100 32.290 50.180 49.110 36.290 55.830 61.550 5.520 61.520 8.490 25.290 85.640 48.960 58.730 31.520 37.940 38.320 51.260 78.350 72.210 32.680 21.650 47.420 78.210 64.230 61.300 80.140 46.650 84.550 76.140 21.400 49.300 46.980 83.700 59.800 57.480 86.420 20.310 55.170 83.820 78.070 7.830 55.700 55.000 88.850 60.380 65.850 63.860 23.330 20.540 53.670 83.010 70.680 30.560 49.870 47.080 78.130 63.770 65.560 84.240 23.100 10.700 24.760 72.170 48.130 21.870 73.620 62.490 15.250 33.370 63.520 64.490 52.930 58.260 62.460 66.770 54.170Sample HICLUST Output
H I C L U S T
HIERARCHICAL CLUSTER ANALYSIS
WRITTEN BY S. C. JOHNSON
PC-MDS VERSION
ANALYSIS TITLE: HICLUST TEST DATA
DATA IS READ FROM FILE: HICLUST.DAT
OUTPUT PRINT FILE IS: HICLUST.PRN
NO. OF VARIABLES 15
DISTANCE MEASURES 1
DATA INPUT FORMAT:
(15F10.3)
PRINTBACK OF SYMMETRIC MATRIX FOR INPUT DATA
.00 59.13 62.42 43.64 36.01 60.21 78.01 34.57 32.10 61.52 51.26 46.65 55.17 53.67 72.17
59.13 .00 30.79 83.89 44.15 60.94 27.02 36.36 32.29 8.49 78.35 84.55 83.82 83.01 48.13
62.42 30.79 .00 82.37 57.76 23.80 26.85 52.44 50.18 25.29 72.21 76.14 78.07 70.68 21.87
43.64 83.89 82.37 .00 65.46 53.33 93.71 55.35 49.11 85.64 32.68 21.40 7.83 30.56 73.62
36.01 44.15 57.76 65.46 .00 64.11 72.92 32.87 36.29 48.96 21.65 49.30 55.70 49.87 62.49
60.21 60.94 23.80 53.33 64.11 .00 50.49 59.69 55.83 58.73 47.42 46.98 55.00 47.08 15.25
78.01 27.02 26.85 93.71 72.92 50.49 .00 64.23 61.55 31.52 78.21 83.70 88.85 78.13 33.37
34.57 36.36 52.44 55.35 32.87 59.69 64.23 .00 5.52 37.94 64.23 59.80 60.38 63.77 63.52
32.10 32.29 50.18 49.11 36.29 55.83 61.55 5.52 .00 38.32 61.30 57.48 65.85 65.56 64.49
61.52 8.49 25.29 85.64 48.96 58.73 31.52 37.94 38.32 .00 80.14 86.42 63.86 84.24 52.93
51.26 78.35 72.21 32.68 21.65 47.42 78.21 64.23 61.30 80.14 .00 20.31 23.33 23.10 58.26
46.65 84.55 76.14 21.40 49.30 46.98 83.70 59.80 57.48 86.42 20.31 .00 20.54 10.70 62.46
55.17 83.82 78.07 7.83 55.70 55.00 88.85 60.38 65.85 63.86 23.33 20.54 .00 24.76 66.77
53.67 83.01 70.68 30.56 49.87 47.08 78.13 63.77 65.56 84.24 23.10 10.70 24.76 .00 54.17
72.17 48.13 21.87 73.62 62.49 15.25 33.37 63.52 64.49 52.93 58.26 62.46 66.77 54.17 .00
CONNECTEDNESS METHOD:
SIMILARITY CONSTANTS
VALUE
0 0 1 1 1 1 0 0 0 0 0 1 0 0 1
5 4 3 1 2 4 1 8 9 7 2 0 3 6 5
.55200000E+01 . . . . . . . XXX . . . . . .
.78300000E+01 . XXX . . . . XXX . . . . . .
.84900000E+01 . XXX . . . . XXX . XXX . . .
.10700000E+02 . XXX . XXX . XXX . XXX . . .
.15250000E+02 . XXX . XXX . XXX . XXX . XXX
.20310000E+02 . XXX XXXXX . XXX . XXX . XXX
.20540000E+02 . XXXXXXXXX . XXX . XXX . XXX
.21650000E+02 XXXXXXXXXXX . XXX . XXX . XXX
.21870000E+02 XXXXXXXXXXX . XXX . XXX XXXXX
.25290000E+02 XXXXXXXXXXX . XXX . XXXXXXXXX
.26850000E+02 XXXXXXXXXXX . XXX XXXXXXXXXXX
.32100000E+02 XXXXXXXXXXX XXXXX XXXXXXXXXXX
.32290000E+02 XXXXXXXXXXX XXXXXXXXXXXXXXXXX
.32870000E+02 XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
END OF METHOD
----------------------------------------------------------------
**CLUSTER**************STATISTIC********************************
----------------------------------------------------------------
2 PTS. 2.1036510000
8 9
----------------------------------------------------------------
2 PTS. 2.2725360000
4 13
----------------------------------------------------------------
2 PTS. 2.2342120000
2 10
----------------------------------------------------------------
2 PTS. 2.0240680000
12 14
----------------------------------------------------------------
2 PTS. 1.7727890000
6 15
----------------------------------------------------------------
3 PTS. 3.0681480000
11 12 14
----------------------------------------------------------------
5 PTS. 6.1619810000
4 13 11 12 14
----------------------------------------------------------------
6 PTS. 5.8046500000
5 4 13 11 12 14
----------------------------------------------------------------
3 PTS. 2.8645100000
3 6 15
----------------------------------------------------------------
5 PTS. 3.5864870000
2 10 3 6 15
----------------------------------------------------------------
6 PTS. 5.3076040000
7 2 10 3 6 15
----------------------------------------------------------------
3 PTS. 2.3808690000
1 8 9
----------------------------------------------------------------
9 PTS. 5.1098720000
1 8 9 7 2 10 3 6 15
----------------------------------------------------------------
DIAMETER METHOD:
SIMILARITY CONSTANTS
VALUE
0 0 0 0 1 0 1 1 1 0 0 1 0 0 1
1 8 9 5 1 4 3 2 4 7 2 0 3 6 5
.55200000E+01 . XXX . . . . . . . . . . . .
.78300000E+01 . XXX . . XXX . . . . . . . .
.84900000E+01 . XXX . . XXX . . . XXX . . .
.10700000E+02 . XXX . . XXX XXX . XXX . . .
.15250000E+02 . XXX . . XXX XXX . XXX . XXX
.21650000E+02 . XXX XXX XXX XXX . XXX . XXX
.23800000E+02 . XXX XXX XXX XXX . XXX XXXXX
.30560000E+02 . XXX XXX XXXXXXX . XXX XXXXX
.31520000E+02 . XXX XXX XXXXXXX XXXXX XXXXX
.34570000E+02 XXXXX XXX XXXXXXX XXXXX XXXXX
.60940000E+02 XXXXX XXX XXXXXXX XXXXXXXXXXX
.64230000E+02 XXXXXXXXX XXXXXXX XXXXXXXXXXX
.65850000E+02 XXXXXXXXXXXXXXXXX XXXXXXXXXXX
.93710000E+02 XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
END OF METHOD
----------------------------------------------------------------
**CLUSTER**************STATISTIC********************************
----------------------------------------------------------------
2 PTS. 2.1036510000
8 9
----------------------------------------------------------------
2 PTS. 2.2725360000
4 13
----------------------------------------------------------------
2 PTS. 2.2342120000
2 10
----------------------------------------------------------------
2 PTS. 2.0240680000
12 14
----------------------------------------------------------------
2 PTS. 1.7727890000
6 15
----------------------------------------------------------------
2 PTS. 1.3951500000
5 11
----------------------------------------------------------------
3 PTS. 2.8645100000
3 6 15
----------------------------------------------------------------
4 PTS. 4.7521320000
4 13 12 14
----------------------------------------------------------------
3 PTS. 3.1424840000
7 2 10
----------------------------------------------------------------
3 PTS. 2.3808690000
1 8 9
----------------------------------------------------------------
6 PTS. 5.3076040000
7 2 10 3 6 15
----------------------------------------------------------------
5 PTS. 2.4994310000
1 8 9 5 11
----------------------------------------------------------------
9 PTS. 6.1253430000
1 8 9 5 11 4 13 12 14
----------------------------------------------------------------
N, THE NUMBER OF VARIABLES, EQUALS ZERO. HICLUST STOP