Let the number of objects being clustered be n, each of which is measured on N variables. Let the objects be denoted by x1, x2, ...,xn, each xi being a vector (xi1,..., xin) in N-dimensional space. Let P(S,p) represent a p-fold partitioning of the set S into disjoint subsets, A, B, and so forth.
With the above notation, the problem of clustering may be stated as follows: Given a set of objects (x1, x2,...,xn|xiS) each characterized by a number of variables, to partition S into subsets such that the subsets are as internally homogeneous and at the same time as mutually dissimilar as possible. The dissimilarity between any two objects, xi and xj in S can be defined as the squared Euclidean distance:
![]()
For a group A with na members it can be shown that the sum of all squared interpoint distances is directly proportional to the sum of all interpoint deviations from the mean of the group.
Mathematically stated,

where xA is the centroid of A, nA is the number of points in A, and the squared distance between any two points is defined by the previous equation.
The total variance VT of all n points in S can be written as:
![]()
For any p-fold partition into p groups, this variance can be divided into two components - a within group variance, Vw, and a between-group variance, VB. The total variance within any group A of na points has been shown earlier. Thus, the total within-group variance, Vw can be obtained by:
![]()
The criterion of optimally partitioning S into p groups is simply: Find P(S,p) such that VW is a
minimum. Unfortunately no algorithm presently exists for finding such a split in N dimensions.
However a locally optimal solution can be found by choosing a good split and then shifting points
until a minimum Vw is found by a procedure to be described subsequently. The result will be at least
locally optimal, but may be to some degree dependent on the initial split. In the Howard-Harris
program an initial split is made on the basis of the maximum variance component of the vectors in the
group being split.
The steps involved in finding a (p+1) fold partitioning of S such that Vw is minimum given a p-fold partition of S, used in the program are as follows:
1. A group A is chosen for splitting on the basis of maximum within group variance.
2. A component of the vectors xi in A is chosen to serve as the splitting variable on the basis of maximum variance.
3. New group membership for the vectors xi (formerly A) is found on the basis of splitting at the mean value of the maximum variance component of xi in A.
4. Finally, a local optimization is carried out by comparing the distance from each xiS to the centroid of each of the p + 1 existing groups. Points are shifted and centroids recomputed within group assignments are represented by minimum squared distance to the centroid of each cluster.
It can be shown, using proof by counter-example, Howard and Harris, that the solution contained in step 4 is locally optimal. The property of the locally optimal solution is: if any single point be moved from its assigned group to any other group, total within-group variance would be increased.
The actual test for final group membership is made by testing for changes in within-group variance that would accompany a hypothetical shift of xiA to any other group B. If the shift would result in decreased variance, it is made.
The PC-MDS version of the Howard-Harris program will handle 2000 objects and 40 variables. Input data can be standardized by column to a mean of zero and standard deviation of 1.0 at the user's option. Furthermore, different scaling factors for each variable can be read in. The user can also specify the maximum number of groups to be formed.
The program does not have a menu or input parameters that must be specified with the data set. Instead, the program interactively requests the following information:
| ENTER THE MAXIMUM NUMBER OF CLUSTERS TO BE ALLOWED IN THE SOLUTION (MAXIMUM = 40) |
| ENTER THE MAXIMUM NUMBER OF VARIABLES TO BE INCLUDED IN THE ANALYSIS. (MAXIMUM = 40) |
| YOUR DATA WILL BE STANDARDIZED UNLESS INDICATED.
PRESS ENTER TO STANDARDIZE, OR 1 IF NOT DESIRED. |
| ARE SCALING FACTORS INCLUDED FOR EACH VARIABLE?
ENTER 1 IF SCALING FACTORS ARE INCLUDED WITH DATA, (NOTE: SCALING FACTORS MUST APPEAR IMMEDIATELY BEFORE THE DATA AND ARE READ AS 16F5.0 ) PRESS ENTER IF SCALING FACTORS ARE NOT DESIRED. |
(F3.0,6F2.0,F3.0,4F2.0,F4.0,F2.0,2F3.0,F4.0,2F3.0,F4.0)
125 0 1 2 0 0 4 08 1 0 0 1 300 5 38 46 184 72 53 350 ASTON
110 0 1 2 0 0 2 12 1 1 0 1 380 5 39 37 177 78 45 354 FERRA
118 0 1 2 0 0 2 12 1 1 0 1 420 5 39 33 165 79 42 341 LAMBO
057 0 1 2 0 0 2 04 1 1 0 1 215 4 36 27 169 73 45 194 LOTUS
.....
.....
.....
011 1 1 5 1 0 5 04 1 0 1 1 096 3 35 32 175 65 70 125 ISUZU
010 1 1 2 0 0 4 04 1 0 1 1 109 3 34 33 157 66 71 142 MITSU
017 1 1 5 1 0 5 04 1 0 1 1 125 3 41 42 184 71 70 200 TOYOT
HOWARD HARRIS CLUSTERING PROGRAM |INPUT DATA: 90 CARS BY 20 ATTRIBUTES
PROGRAM WRITTEN BY DRS. N. HOWARD AND B. HARRIS |
PC - MDS VERSION |(F3.0,6F2.0,F3.0,4F2.0,F4.0,F2.0,2F3.0,F4.0,2F3.0,F4.0)
|125 0 1 2 0 0 4 08 1 0 0 1 300 5 38 46 184 72 53 350 ASTON
|110 0 1 2 0 0 2 12 1 1 0 1 380 5 39 37 177 78 45 354 FERRA
ANALYSIS TITLE: 1987 AUTOMOBILE DATA 90 CARS BY 20 VARIABLES |118 0 1 2 0 0 2 12 1 1 0 1 420 5 39 33 165 79 42 341 LAMBO
DATA IS READ FROM FILE: CLUSTER2.DAT |057 0 1 2 0 0 2 04 1 1 0 1 215 4 36 27 169 73 45 194 LOTUS
OUTPUT PRINT FILE IS: CLUSTER.PRN | .....
| .....
TITLE: 1987 AUTOMOBILE DATA 90 CARS BY 20 VARIABLES | .....
NO OF OBSERVATIONS OR CASES 90 | .....
NO OF VARIABLES PER CASE 20 |011 1 1 5 1 0 5 04 1 0 1 1 096 3 35 32 175 65 70 125 ISUZU
MAX NO OF CLUSTERS 12 |010 1 1 2 0 0 4 04 1 0 1 1 109 3 34 33 157 66 71 142 MITSU
FORMAT: (F3.0,6F2.0,F3.0,4F2.0,F4.0,F2.0,2F3.0,F4.0,2F3.0,F4.0) |017 1 1 5 1 0 5 04 1 0 1 1 125 3 41 42 184 71 70 200 TOYOT
|
VARIABLE STANDARD |
NUMBER MEAN DEVIATION | RAW DATA PRINTOUT OPTION MAY BE SELECTED
---------------------------------------- |
1 22.3889 31.3772 | STANDARDIZED DATA PRINTOUT OPTION MAY BE SELECTED
2 .7444 .4362 |
3 .6556 .4752 |
4 2.8444 1.0318 |
5 .2556 .4362 |
6 .2000 .4000 |
7 4.3556 1.6351 |
8 5.1556 1.9259 |
9 .9889 .1048 |
10 .0778 .2678 |
11 .4444 .4969 |
12 .6333 .4819 |
13 134.4666 71.9849 |
14 2.6333 1.2776 |
15 35.8000 3.7274 |
16 29.5778 7.2938 |
17 175.4889 16.4528 |
18 68.4222 4.5092 |
19 57.4000 8.5554 |
20 162.2000 77.7629 |
|
***********************************************************************|
GROUP NUMBER OF SPLIT NUMBER 1 |
NUMBER SUBJ. IN GRP. LIST OF SUBJECTS |
-----------------------------------------------------------------------|---| IDENTIFICATION OF CASES FOR EACH CLUSTER
1 31 1 2 3 4 5 8 9 10 11 12 13 14| BEGINS WITH TWO GROUPS AND CONTINUES UNTIL
15 16 17 18 19 20 21 22 30 33 42 62| THE MAXIMUM NUMBER OF CLUSTERS SPECIFIED
63 64 65 71 72 73 89 | IS REACHED.
2 59 6 7 23 24 25 26 27 28 29 31 32 34|
35 36 37 38 39 40 41 43 44 45 46 47|
48 49 50 51 52 53 54 55 56 57 58 59|
60 61 66 67 68 69 70 74 75 76 77 78|
79 80 81 82 83 84 85 86 87 88 90 |
***************************************************************************|
GROUP NUMBER OF SPLIT NUMBER 11 |
NUMBER SUBJ. IN GRP. LIST OF SUBJECTS |
---------------------------------------------------------------------------| PRINTOUT OF GROUP MEMBERSHIP FOR THE
1 1 71 | LAST CLUSTER.
2 8 14 24 25 26 31 50 52 54 |
3 10 5 10 11 13 16 17 18 20 30 62 |
4 8 68 69 70 83 84 85 87 89 |
5 12 6 7 8 19 21 22 23 29 33 41 42 45|
6 6 63 64 65 66 72 73 |
7 3 86 88 90 |
8 4 2 3 4 9 |
9 9 74 75 76 77 78 79 80 81 82 |
10 3 1 12 15 |
11 18 27 28 32 34 35 36 37 38 39 40 43 44|
46 47 48 49 53 61 |
12 8 51 55 56 57 58 59 60 67 |
------------------------------------------------------------------------------
WITHIN GROUP SUM OF SQUARES FOR EACH FINAL GROUP |
|
1800.000 1348.924 1220.728 992.773 841.051 773.725 | WITHIN GROUPS SUM OF SQUARES FOR EACH GROUP.
704.403 628.197 564.702 517.529 483.684 452.608 | THIS IS A MEASURE OF COMPACTNESS OF EACH GROUP.
|
|
|
TITLE:1987 AUTOMOBILE DATA 90 CARS BY 20 VARIABLES |
|
***************************************************************************|
|
GROUP SUMMARY TABLE | IDENTIFICATION OF GROUP MEMBERSHIP:
SUBJECT BY SPLIT NUMBER | 2 TO 12 CLUSTER SOLUTIONS.
---------------------------------------------------------------------------| THE FINAL ASSIGNMENT OF OBSERVATIONS TO
| CLUSTERS.
0 1 2 3 4 5 6 7 8 9 10 11 | ASSIGNMENTS ARE SHOWN FOR VARIOUS LEVELS OF
1 1 1 3 3 3 3 3 3 3 10 10 10 | GROUPINGS UP TO THE MAXIMUM LEVEL SPECIFIED.
2 1 1 3 3 3 3 3 8 8 8 8 8 |
3 1 1 3 3 3 3 3 8 8 8 8 8 |
4 1 1 3 3 3 3 3 8 8 8 8 8 |
. . . . . |
. . . . . |
. . . . . |
87 1 2 2 4 4 4 4 4 4 4 4 4 |
88 1 2 2 2 5 5 5 5 5 5 7 7 |
89 1 1 3 4 4 4 4 4 4 4 4 4 |
90 1 2 2 2 2 2 7 7 7 7 7 7 |
|
|
FINISHED PROCESSING |