HOW TO USE HOWARD-HARRIS CLUSTER ANALYSIS(1)


The Howard-Harris program forms groups of objects sequentially using an objects-by-variables (Respondent-by-variables) data matrix. The program uses the criterion of minimum within-group variance at each level of clustering.


THEORETICAL DISCUSSION

Let the number of objects being clustered be n, each of which is measured on N variables. Let the objects be denoted by x1, x2, ...,xn, each xi being a vector (xi1,..., xin) in N-dimensional space. Let P(S,p) represent a p-fold partitioning of the set S into disjoint subsets, A, B, and so forth.

With the above notation, the problem of clustering may be stated as follows: Given a set of objects (x1, x2,...,xn|xiS) each characterized by a number of variables, to partition S into subsets such that the subsets are as internally homogeneous and at the same time as mutually dissimilar as possible. The dissimilarity between any two objects, xi and xj in S can be defined as the squared Euclidean distance:

For a group A with na members it can be shown that the sum of all squared interpoint distances is directly proportional to the sum of all interpoint deviations from the mean of the group.

Mathematically stated,

where xA is the centroid of A, nA is the number of points in A, and the squared distance between any two points is defined by the previous equation.

The total variance VT of all n points in S can be written as:

For any p-fold partition into p groups, this variance can be divided into two components - a within group variance, Vw, and a between-group variance, VB. The total variance within any group A of na points has been shown earlier. Thus, the total within-group variance, Vw can be obtained by:

The criterion of optimally partitioning S into p groups is simply: Find P(S,p) such that VW is a minimum. Unfortunately no algorithm presently exists for finding such a split in N dimensions. However a locally optimal solution can be found by choosing a good split and then shifting points until a minimum Vw is found by a procedure to be described subsequently. The result will be at least locally optimal, but may be to some degree dependent on the initial split. In the Howard-Harris program an initial split is made on the basis of the maximum variance component of the vectors in the group being split.


COMPUTATIONAL PROCEDURE

The steps involved in finding a (p+1) fold partitioning of S such that Vw is minimum given a p-fold partition of S, used in the program are as follows:

It can be shown, using proof by counter-example, Howard and Harris, that the solution contained in step 4 is locally optimal. The property of the locally optimal solution is: if any single point be moved from its assigned group to any other group, total within-group variance would be increased.

The actual test for final group membership is made by testing for changes in within-group variance that would accompany a hypothetical shift of xiA to any other group B. If the shift would result in decreased variance, it is made.


INPUT PARAMETERS, SAMPLE DATA FILE, AND SAMPLE OUTPUT

The PC-MDS version of the Howard-Harris program will handle 2000 objects and 40 variables. Input data can be standardized by column to a mean of zero and standard deviation of 1.0 at the user's option. Furthermore, different scaling factors for each variable can be read in. The user can also specify the maximum number of groups to be formed.

The program does not have a menu or input parameters that must be specified with the data set. Instead, the program interactively requests the following information:


ENTER THE MAXIMUM NUMBER OF CLUSTERS TO BE
ALLOWED IN THE SOLUTION (MAXIMUM = 40)

ENTER THE MAXIMUM NUMBER OF VARIABLES TO BE INCLUDED IN
THE ANALYSIS. (MAXIMUM = 40)

YOUR DATA WILL BE STANDARDIZED UNLESS INDICATED.
PRESS ENTER TO STANDARDIZE, OR 1 IF NOT DESIRED.

ARE SCALING FACTORS INCLUDED FOR EACH VARIABLE?
ENTER 1 IF SCALING FACTORS ARE INCLUDED WITH DATA,
(NOTE: SCALING FACTORS MUST APPEAR IMMEDIATELY
BEFORE THE DATA AND ARE READ AS 16F5.0 )
PRESS ENTER IF SCALING FACTORS ARE NOT DESIRED.
Following this input, execution of the program begins.

NOTES

1. This explanation of the Howard-Harris Cluster analysis program was written by Paul E. Green and Yoram Wind and is found in Multiattribute Decisions in Marketing, The Dryden Press, 1973.

PROGRAM WRITTEN BY DRS. N. HOWARD AND B. HARRIS
SAMPLE DATA FILE
(F3.0,6F2.0,F3.0,4F2.0,F4.0,F2.0,2F3.0,F4.0,2F3.0,F4.0) 
125 0 1 2 0 0 4 08 1 0 0 1 300 5 38 46 184 72 53 350 ASTON 
110 0 1 2 0 0 2 12 1 1 0 1 380 5 39 37 177 78 45 354 FERRA 
118 0 1 2 0 0 2 12 1 1 0 1 420 5 39 33 165 79 42 341 LAMBO 
057 0 1 2 0 0 2 04 1 1 0 1 215 4 36 27 169 73 45 194 LOTUS 
      ..... 
      ..... 
      ..... 
011 1 1 5 1 0 5 04 1 0 1 1 096 3 35 32 175 65 70 125 ISUZU 
010 1 1 2 0 0 4 04 1 0 1 1 109 3 34 33 157 66 71 142 MITSU 
017 1 1 5 1 0 5 04 1 0 1 1 125 3 41 42 184 71 70 200 TOYOT 


SAMPLE HOWARD-HARRIS OUTPUT
                           HOWARD HARRIS CLUSTERING PROGRAM             |INPUT DATA: 90 CARS BY 20 ATTRIBUTES 
                    PROGRAM WRITTEN BY DRS. N. HOWARD AND B. HARRIS     |
                                  PC - MDS VERSION                      |(F3.0,6F2.0,F3.0,4F2.0,F4.0,F2.0,2F3.0,F4.0,2F3.0,F4.0) 
                                                                        |125 0 1 2 0 0 4 08 1 0 0 1 300 5 38 46 184 72 53 350 ASTON
                                                                        |110 0 1 2 0 0 2 12 1 1 0 1 380 5 39 37 177 78 45 354 FERRA
 ANALYSIS TITLE: 1987 AUTOMOBILE DATA 90 CARS BY 20 VARIABLES           |118 0 1 2 0 0 2 12 1 1 0 1 420 5 39 33 165 79 42 341 LAMBO
 DATA IS READ FROM FILE:    CLUSTER2.DAT                                |057 0 1 2 0 0 2 04 1 1 0 1 215 4 36 27 169 73 45 194 LOTUS 
 OUTPUT PRINT FILE IS:      CLUSTER.PRN                                 |      ..... 
                                                                        |      ..... 
 TITLE: 1987 AUTOMOBILE DATA 90 CARS BY 20 VARIABLES                    |      ..... 
 NO OF OBSERVATIONS OR CASES        90                                  |      ..... 
 NO OF VARIABLES PER CASE           20                                  |011 1 1 5 1 0 5 04 1 0 1 1 096 3 35 32 175 65 70 125 ISUZU 
 MAX NO OF CLUSTERS                 12                                  |010 1 1 2 0 0 4 04 1 0 1 1 109 3 34 33 157 66 71 142 MITSU 
 FORMAT: (F3.0,6F2.0,F3.0,4F2.0,F4.0,F2.0,2F3.0,F4.0,2F3.0,F4.0)        |017 1 1 5 1 0 5 04 1 0 1 1 125 3 41 42 184 71 70 200 TOYOT 
                                                                        |
                   VARIABLE              STANDARD                       |
                    NUMBER    MEAN       DEVIATION                      | RAW DATA PRINTOUT OPTION MAY BE SELECTED
                ----------------------------------------                | 
                        1   22.3889        31.3772                      | STANDARDIZED DATA PRINTOUT OPTION MAY BE SELECTED
                        2     .7444          .4362                      | 
                        3     .6556          .4752                      | 
                        4    2.8444         1.0318                      | 
                        5     .2556          .4362                      | 
                        6     .2000          .4000                      | 
                        7    4.3556         1.6351                      | 
                        8    5.1556         1.9259                      | 
                        9     .9889          .1048                      | 
                       10     .0778          .2678                      | 
                       11     .4444          .4969                      | 
                       12     .6333          .4819                      | 
                       13  134.4666        71.9849                      | 
                       14    2.6333         1.2776                      | 
                       15   35.8000         3.7274                      | 
                       16   29.5778         7.2938                      | 
                       17  175.4889        16.4528                      | 
                       18   68.4222         4.5092                      | 
                       19   57.4000         8.5554                      | 
                       20  162.2000        77.7629                      | 
                                                                        | 
 ***********************************************************************|
     GROUP    NUMBER OF            SPLIT NUMBER   1                     | 
     NUMBER  SUBJ. IN GRP.         LIST OF SUBJECTS                     | 
 -----------------------------------------------------------------------|---| IDENTIFICATION OF CASES FOR EACH CLUSTER
         1         31          1   2   3   4   5   8   9  10  11  12  13  14| BEGINS WITH TWO GROUPS AND CONTINUES UNTIL
                              15  16  17  18  19  20  21  22  30  33  42  62| THE MAXIMUM NUMBER OF CLUSTERS SPECIFIED
                              63  64  65  71  72  73  89                    | IS REACHED.
         2         59          6   7  23  24  25  26  27  28  29  31  32  34|
                              35  36  37  38  39  40  41  43  44  45  46  47|
                              48  49  50  51  52  53  54  55  56  57  58  59|
                              60  61  66  67  68  69  70  74  75  76  77  78|
                              79  80  81  82  83  84  85  86  87  88  90    |
 
 ***************************************************************************| 
     GROUP    NUMBER OF            SPLIT NUMBER  11                         |
     NUMBER  SUBJ. IN GRP.         LIST OF SUBJECTS                         |
 ---------------------------------------------------------------------------| PRINTOUT OF GROUP MEMBERSHIP FOR THE 
         1          1         71                                            | LAST CLUSTER.
         2          8         14  24  25  26  31  50  52  54                |
         3         10          5  10  11  13  16  17  18  20  30  62        |
         4          8         68  69  70  83  84  85  87  89                |
         5         12          6   7   8  19  21  22  23  29  33  41  42  45|
         6          6         63  64  65  66  72  73                        |
         7          3         86  88  90                                    |
         8          4          2   3   4   9                                |
         9          9         74  75  76  77  78  79  80  81  82            |
        10          3          1  12  15                                    |
        11         18         27  28  32  34  35  36  37  38  39  40  43  44| 
                              46  47  48  49  53  61                        |
        12          8         51  55  56  57  58  59  60  67                |
 ------------------------------------------------------------------------------
              WITHIN GROUP SUM OF SQUARES FOR EACH FINAL GROUP              |
                                                                            | 
         1800.000  1348.924  1220.728   992.773   841.051   773.725         |  WITHIN GROUPS SUM OF SQUARES FOR EACH GROUP.
          704.403   628.197   564.702   517.529   483.684   452.608         |  THIS IS A MEASURE OF COMPACTNESS OF EACH GROUP.
                                                                            | 
                                                                            | 
                                                                            | 
 TITLE:1987 AUTOMOBILE DATA 90 CARS BY 20 VARIABLES                         | 
                                                                            | 
 ***************************************************************************|
                                                                            | 
                           GROUP SUMMARY TABLE                              |  IDENTIFICATION OF GROUP MEMBERSHIP:
                        SUBJECT  BY  SPLIT  NUMBER                          |  2 TO 12 CLUSTER SOLUTIONS.
 ---------------------------------------------------------------------------|  THE FINAL ASSIGNMENT OF OBSERVATIONS TO 
                                                                            |  CLUSTERS.
       0   1   2   3   4   5   6   7   8   9  10  11                        |  ASSIGNMENTS ARE SHOWN FOR VARIOUS LEVELS OF 
   1   1   1   3   3   3   3   3   3   3  10  10  10                        |  GROUPINGS UP TO THE MAXIMUM LEVEL SPECIFIED.
   2   1   1   3   3   3   3   3   8   8   8   8   8                        | 
   3   1   1   3   3   3   3   3   8   8   8   8   8                        | 
   4   1   1   3   3   3   3   3   8   8   8   8   8                        | 
       .   .   .   .   .                                                    |
       .   .   .   .   .                                                    |
       .   .   .   .   .                                                    |
  87   1   2   2   4   4   4   4   4   4   4   4   4                        | 
  88   1   2   2   2   5   5   5   5   5   5   7   7                        | 
  89   1   1   3   4   4   4   4   4   4   4   4   4                        | 
  90   1   2   2   2   2   2   7   7   7   7   7   7                        | 
                                                                            |
                                                                            |
  FINISHED PROCESSING                                                       |