FREQUENCY ANALYSIS


Requirements: The FREQUENCY distribution of the variables is computed. The number of observations, cumulative frequency and statistics are reported.


The analysis of data often begins with what is called a "frequency analysis." Once the data collection process is completed, the analyst begins to explore the data by measuring its central tendency and, more importantly, its dispersion around this central tendency.

Frequency analysis is particularly useful for describing discrete categories of data having multiple choice or yes-no response formats. This analysis involves constructing a frequency distribution. The frequency distribution is a record of the number of scores that fall within each response category. The frequency distribution, then, has two elements: (1) the categories of response, and (2) the frequency with which respondents are identified with each category.

The only technical requirement of the frequency analysis is that the categories of response be mutually exclusive and exhaustive. This means that the same observation cannot be counted as belonging to more than one response category. The frequency analysis must be exhaustive in the sense that all respondents must fit into a category.
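
As an illustration of these two requirements, the following minimal Python sketch builds a frequency distribution and rejects any observation that does not fit a declared category. The response data and the function name are hypothetical and are not part of PC-MDS.

```python
from collections import Counter

# Hypothetical yes/no survey responses (illustrative data only).
responses = ["yes", "no", "yes", "yes", "no", "yes"]
categories = ["yes", "no"]  # mutually exclusive, exhaustive response categories


def frequency_distribution(values, allowed):
    """Count how many observations fall in each allowed category."""
    unknown = [v for v in values if v not in allowed]
    if unknown:
        # Exhaustiveness check: every observation must fit some category.
        raise ValueError(f"observations outside the categories: {unknown}")
    counts = Counter(values)
    # Each observation is tallied exactly once, so no response is double counted.
    return {category: counts.get(category, 0) for category in allowed}


table = frequency_distribution(responses, categories)
print(table)  # {'yes': 4, 'no': 2}
```

The counts always sum to the number of observations, which is a quick check that the categories are both exclusive and exhaustive.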


Measuring Central Tendency

Every set of data has a tendency to cluster around a central value. There are various measures of central tendency, including the Mean, Median, Mode, Quartiles and various percentiles.



(1) The Arithmetic Mean:

For most statistical analyses, the mean is the most often used measure of central tendency, largely because of its relationship to the variance statistic. The mean is also important in the sampling distribution, which is formed from the distribution of all possible sample means and has as its center the mean of the population. The mean is affected by the presence of extreme scores (outliers), which may not be typical of the sample (or population) as a whole. The mean is preferred when a distribution is symmetric and interest is centered on a score that represents all scores.

The mean value, X, is obtained by summing the score values for a given variable and dividing by the number of scores.

For any variable X having n observations X1, X2, X3, ..., Xn, the mean of variable X is defined as

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}$$

Sometimes, however, the data are not a simple set of values that can be summed, but instead a set of categories. Consider an example where respondents are grouped into age categories. In this example, the midpoint of each category is treated as the mean of the category and is substituted for the individual responses in the computation of the mean.



        Age Group             Midpoint
           (X)     Frequency   (Mean)    Freq * Mean
    ---------------------------------------------------
          41-43        1         42            42
          38-40        3         39           117
          35-37        4         36           144
          32-34        4         33           132
          29-31        5         30           150
          26-28        8         27           216
          23-25       10         24           240
          20-22       18         21           378
          17-19       23         18           414
          14-16       17         15           255
          11-13       10         12           120
    ---------------------------------------------------
          Sum        103                     2,208

                           _     2208
                    Mean = X  =  ----  =  21.44
                                  103



Thus we see that for ungrouped data, the identified categories X1, X2, X3, ..., Xn have corresponding frequencies f1, f2, f3, ..., fn, and for grouped data, the values X1, X2, X3, ..., Xn are the class midpoints of the n class intervals with class frequencies f1, f2, f3, ..., fn respectively.

Then the mean is defined as

$$\bar{X} = \frac{X_1 f_1 + X_2 f_2 + X_3 f_3 + \cdots + X_n f_n}{f_1 + f_2 + f_3 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
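
The grouped-mean calculation for the age example can be checked with a short Python sketch. The midpoints and frequencies are taken from the age-group table; the function and variable names are illustrative assumptions.

```python
# (class midpoint, frequency) pairs from the age-group table
age_groups = [
    (42, 1), (39, 3), (36, 4), (33, 4), (30, 5), (27, 8),
    (24, 10), (21, 18), (18, 23), (15, 17), (12, 10),
]


def grouped_mean(groups):
    """Weighted mean: sum of (midpoint * frequency) divided by total frequency."""
    total_frequency = sum(freq for _, freq in groups)
    weighted_sum = sum(midpoint * freq for midpoint, freq in groups)
    return weighted_sum / total_frequency


mean_age = grouped_mean(age_groups)
print(round(mean_age, 2))  # 21.44
```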


(2) The Median:

The median is the value that divides a frequency distribution into two equal parts when the distribution is arranged in ascending or descending order of magnitude.

The median value is an appropriate indicator of central tendency when the distribution of points is skewed and when the most typical value is desired (typical here meaning the middle value of the ordered observations).



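If X1, X2, X3, ..., Xn is a set of data arranged in ascending order of magnitude, then the median of the set of data is given by:

$$\text{Median} = \begin{cases} X_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(X_{n/2} + X_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$

This result is true also for an ungrouped frequency distribution. If the data is a grouped frequency distribution, then the median is

$$\text{Median} = l_1 + \frac{N/2 - c}{f}\,(l_2 - l_1)$$

where l1 and l2 are the lower and upper bounds of the median class, f is the frequency of the median class, c is the cumulative frequency of the class preceding the median class, and N = (f1 + f2 + f3 + ... + fn)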

It should be noted that cumulative frequencies may be defined in two ways: (1) the cumulative frequency less than the upper boundary of a class, and (2) the cumulative frequency greater than the lower boundary of a class. Unless otherwise stated, the term cumulative frequency refers to the first definition.

In this example (the same example as for the mean calculation, except that the classes are arranged in ascending order), the cumulative frequencies of the categories are computed, and the median class is identified as the first class whose cumulative frequency reaches N/2. The bounds and frequencies of that class are then used as input to the computation of the median.

                              Cumulative
        Age Group    Freq   Frequency (Less Than)
      ---------------------------------------------
          11-13       10       10
          14-16       17       27
          17-19       23       50   <--- Modal Class
          20-22       18       68   <--- Median Class
          23-25       10       78
          26-28        8       86
          29-31        5       91
          32-34        4       95
          35-37        4       99
          38-40        3      102
          41-43        1      103
      ---------------------------------------------
          Sum        103

N/2 = 103/2 = 51.5, which is first reached by the cumulative frequency of the 20-22 class.

Median Class Lower Bound = l1 = 20
Median Class Upper Bound = l2 = 22
N = 103, l2 - l1 = 2, f = 18 (frequency of the median class), c = 50 (cumulative frequency of the class preceding the median class)

Median = 20 + ((103/2 - 50)/18)*2 = 20.1667
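
The grouped-median interpolation can be sketched in Python as follows. This is only an illustration under two stated assumptions: f is taken to be the frequency of the median class itself (18 for the 20-22 class), which is the usual textbook reading of the formula, and the class width is taken from the stated class limits (l2 - l1 = 2), as in the text, rather than from true class boundaries. The function name is hypothetical.

```python
# (lower bound, upper bound, frequency) for each age class, in ascending order
classes = [
    (11, 13, 10), (14, 16, 17), (17, 19, 23), (20, 22, 18), (23, 25, 10),
    (26, 28, 8), (29, 31, 5), (32, 34, 4), (35, 37, 4), (38, 40, 3), (41, 43, 1),
]


def grouped_median(classes):
    """Interpolate the median within the first class whose cumulative
    frequency reaches N/2."""
    n_total = sum(freq for _, _, freq in classes)
    half = n_total / 2
    cumulative = 0  # cumulative frequency of all classes below the current one
    for lower, upper, freq in classes:
        if cumulative + freq >= half:
            # Assumption: f is the frequency of the median class itself.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no classes supplied")


median_age = grouped_median(classes)
print(round(median_age, 4))  # 20.1667
```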

(3) The Mode:

The mode is defined as the most frequently observed value. For grouped data, the mode is the most commonly observed category, and for ungrouped data, the mode is the value which occurs most frequently.

For a grouped frequency distribution, the mode is given by

Mo = l1 + ((f1 - f0)/(2f1 - f0 - f2))*(l2 - l1)

where:

l1, l2 = lower and upper bounds of the modal class

f1 = frequency of the modal class

f2 = frequency of the class following the modal class

f0 = frequency of the class preceding the modal class

Using the same example for computation of the mode, the modal class is identified as the class with the highest frequency (17-19), and its bounds and the neighboring class frequencies are used in the computation of the mode.

l1 = 17, l2 = 19, l2 - l1 = 2, f1 = 23, f0 = 17, f2 = 18

Mo = 17 + ((23-17)/(2*23-17-18))*(2) = 17 + (6/11)*2 =18.091
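
The same calculation can be written as a small Python sketch. The class data come from the age example; the function name is hypothetical, and the fallback to the class midpoint when the modal class and both neighbors have equal frequencies is an assumption added here to avoid division by zero.

```python
# (lower bound, upper bound, frequency) for each age class, in ascending order
classes = [
    (11, 13, 10), (14, 16, 17), (17, 19, 23), (20, 22, 18), (23, 25, 10),
    (26, 28, 8), (29, 31, 5), (32, 34, 4), (35, 37, 4), (38, 40, 3), (41, 43, 1),
]


def grouped_mode(classes):
    """Mo = l1 + ((f1 - f0) / (2*f1 - f0 - f2)) * (l2 - l1)."""
    freqs = [freq for _, _, freq in classes]
    k = freqs.index(max(freqs))  # modal class (first one if there is a tie)
    lower, upper, f1 = classes[k]
    f0 = freqs[k - 1] if k > 0 else 0               # class preceding the modal class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0  # class following the modal class
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        # Assumption: with a flat peak, fall back to the class midpoint.
        return (lower + upper) / 2
    return lower + (f1 - f0) / denominator * (upper - lower)


mode_age = grouped_mode(classes)
print(round(mode_age, 3))  # 18.091
```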


Measuring Dispersion

Dispersion is the spread of the data about the measure of central tendency. There are various measures of dispersion that may be applied to a data set. The most commonly used measures include the range, the variance, and the standard deviation.

(1) Maximum & Minimum

The largest and smallest values of the variable.

(2) Range

The Range is the difference between the highest and the lowest values of the variable.

(3) The Variance

The variance is defined as the "Mean of the Square of deviations around the mean". The calculation of the variance occurs as a three step process.

Step 1: Calculate the mean of the set of data (Let mean = X)

Step 2: Calculate the Deviation of each score from the mean score for the variable. [ Deviation = (Xi - X) ]

Step 3: Square each deviation, sum the squares, and divide the sum by n - 1.

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

If the set of data is very large, the denominator ( n-1 ) can be approximated by n.
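
The three steps can be followed line by line in a short Python sketch. The scores are hypothetical and the function name is illustrative; the result agrees with the standard library's `statistics.variance`, which also divides by n - 1.

```python
import statistics


def sample_variance(scores):
    """Sample variance computed with the three steps described in the text."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are required")
    mean = sum(scores) / n                             # Step 1: the mean
    deviations = [x - mean for x in scores]            # Step 2: deviations from the mean
    return sum(d * d for d in deviations) / (n - 1)    # Step 3: sum of squares / (n - 1)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
variance = sample_variance(scores)
print(round(variance, 4))  # 4.5714
```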

(4) The Standard Deviation

The standard deviation is defined as the root-mean-square-of-deviation-around-the-mean. The standard deviation for a sample can be expressed as the square root of the variance and is represented by s. The advantage of the standard deviation is that it is expressed in the same unit as the original variable.

(5) The Standard Error

If an infinite number of samples (or all possible samples) of equal size were drawn from the population, the mean of each sample would be an estimate of the population mean, but the sample means would not all be identical. For sufficiently large samples, these means would be approximately normally distributed, forming what is called a `sampling distribution'. The standard deviation of this sampling distribution is called the standard error. The standard error is an estimate of the potential for discrepancy between the sample mean and the population mean. Because the population standard deviation is usually unknown, the standard error cannot be calculated directly, but is estimated by dividing the sample standard deviation by the square root of the number of cases in the sample.
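
As a check on these definitions, the following Python sketch rebuilds the valid V1 responses from the V1 frequency table in the sample output later in this section (112 ones, 68 twos, 4 threes; the single missing case is excluded) and recomputes the mean, standard deviation and standard error. The variable names are illustrative; the rounded results match the V1 line of the sample statistics.

```python
import math
import statistics

# Valid V1 responses rebuilt from the sample frequency table (missing case excluded).
v1_scores = [1] * 112 + [2] * 68 + [3] * 4

n = len(v1_scores)                       # sample size
mean = statistics.mean(v1_scores)
std_dev = statistics.stdev(v1_scores)    # square root of the (n - 1) variance
std_err = std_dev / math.sqrt(n)         # estimated standard error of the mean

print(n, f"{mean:.3f}", f"{std_dev:.3f}", f"{std_err:.3f}")  # 184 1.413 0.536 0.040
```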


THE FREQ PROGRAM

The Frequency Analysis program produces frequency distributions and descriptive statistics for variables identified in the data set. As previously indicated, three distinct files are required to run the frequency analysis.

The PC-MDS Command File

The command file defines the various variables, their format and locations, defines missing values for variables and recodes the values of the variables, if desired. For purposes of clarification, the command files are designated as files with an "SPS" extension (i.e., *.SPS). As an example command file, we shall refer to the command file for the hospital data. The file is designated HOSP.SPS and is found on the data disk. The name of the PC-MDS command file is specified interactively by the user when each program is run. (Note that the .SPS designation is used for instructional clarity only. The command file may have any name and does not, in reality, require the .SPS extension).

The Data File

The data file contains the data in the format described in the Command File. The data files are usually named with a "DAT" extension (i.e., *.DAT). The example data file for the hospital data is called HOSP.DAT and is found on the data disk. The data file is specified in line 2 of the command file (the FILENAME command).

(Note that the .DAT designation is used for instructional clarity only. The data file may have any name and does not, in reality, require the .DAT extension).

The Output File

An output file must be interactively specified by the user while running each of the PC-MDS programs. The output file is the file to which the frequency analysis is printed. A common convention is to name the file with a "PRN" extension to signify a print file (i.e., *.PRN). For the frequency analysis, the output file contains the following output.

1) The variable number              2) The mean
3) The standard deviation           4) The standard error
5) The sample size                  6) The maximum and minimum values
7) The range                        8) The frequency distribution with percentages

HOW TO RUN THE FREQ PROGRAM

STEP 1: Enter the EDITOR (a word processor or program editor that produces ASCII files will suffice), and prepare the command file and the data file.

STEP 2: Load the FREQ program. The program is loaded by simply typing FREQ and then pressing the [ENTER] key.

  C> FREQ [ENTER]

STEP 3: After the initial logo identifying the program, a message will appear on the screen requesting the location and name of the command file.

  ENTER THE NAME OF THE PC-MDS COMMAND FILE

USE THE FORM: DRV:FILENAME.EXT (e.g. B:STAT.SPS)
C:HOSP.SPS

RESPOND with the location and name of the command file: C:HOSP.SPS [ENTER] (this assumes the HOSP.SPS file is in the main directory of the C: drive). If the specification of the command file name was not acceptable, then a message will ask you to re-enter the command file name.

STEP 4: If the name of the command file was specified correctly, then the next menu item will pop up asking you to specify the location and name of the output file.

  ENTER THE NAME OF THE PC-MDS COMMAND FILE         

USE THE FORM: DRV:FILENAME.EXT (e.g. B:STAT.SPS)
C:HOSP.SPS
      ENTER THE NAME OF THE FILE TO SAVE OUTPUT           
                                                          
      USE THE FORM:  DRV:FILENAME.EXT(e.g. B:STAT.PRN)
                                                     
                                                     
      C:HOSP.PRN                                      

Enter the name of the output file: C:HOSP.PRN [ENTER] (this assumes you want to write the file HOSP.PRN to the C: drive). If a file already exists with the same name, then the following message will appear on screen:

  THIS OUTPUT FILE NAME ALREADY EXISTS!         
  DO YOU WANT TO OVERWRITE IT? (Y/N) Y          

STEP 5: Once the output file name is correctly entered, the initial computations required for reading the command file take place. Initial error messages associated with the command file, if any, will be displayed on screen as follows:

  ERROR MESSAGES
  ERROR: LINE # : MESSAGE

If errors are found, the program aborts. It is recommended that the user make a note of the errors. The user must edit the command file to correct the errors. The Frequency program may then be rerun. If there were no errors, then the message on screen will be:

   FREQUENCY PROGRAM OPTIONS:                        
                                                     
   64 VARIABLES HAVE BEEN DECLARED.                  
   SELECT THE APPROPRIATE OPTION:                    
                                                     
   (1) SPECIFY THE VARIABLES FOR ANALYSIS            
       (VARIABLES ARE SPECIFIED BY SEQUENCE NUMBER)  
   (2) VIEW A LIST OF VARIABLE NUMBERS               
   (3) QUIT PROGRAM                                  
                                                     
   YOUR CHOICE : 2                                   
SEQ# NAME    VARIABLE LABEL   
  1  V1      PERSON FILLING OUT                      
  2  V2      TYPE OF SURGERY                         
  3  V3      OUTPATIENT INSURANCE                    
  4  V4      PERCENT DR COVERAGE                     
             

Option 2 was selected to VIEW THE VARIABLE LIST. Option 1 is then selected to specify the variables that are to be included in the analysis.

STEP 6: The option to SPECIFY THE VARIABLES will give the following message. Enter the variables you want to study and press [ENTER]. The selected variables are then listed.

   VARIABLES SPECIFICATION:                          
                                                     
 ENTER VARIABLES ONE AT A TIME.                      
 A blank space must follow each variable number.     
 The dash (-) may be used to simplify statements.    
 PRESS ENTER to quit this menu                       
 For example,                                        
 1 2 3 4 5 and 1 - 5 are equivalent statements.      
                                                     
 1 -3                                                

               SELECTED    VARIABLES      
 
  1  V1          2  V2          3  V3                 
                                                      
               VARIABLES CORRECT? Y 
The program next reads the first line of data, displays the input format for reading the data, and lists the values for the first data case. If the data is read incorrectly, you may re-specify the format statement. After you indicate that the data was read correctly, the program proceeds with the frequency analysis of the data.
 STMT#  #VARIABLES    FORMAT STATEMENT AND DATA  
   1        64                                                            
   (4X,F1.0,F2.0,53F1.0,2X,9F1.0)  
    1.00000e+000  6.00000e+000  1.00000e+000  1.00000e+000  1.00000e+000  
    5.00000e+000  4.00000e+000  4.00000e+000  4.00000e+000  4.00000e+000  
    4.00000e+000  4.00000e+000  4.00000e+000  4.00000e+000  4.00000e+000  
    2.00000e+000  3.00000e+000  1.00000e+000  2.00000e+000  1.00000e+000  
    2.00000e+000  1.00000e+000  2.00000e+000  2.00000e+000  2.00000e+000  
    1.00000e+000  3.00000e+000  2.00000e+000  2.00000e+000  3.00000e+000  
    
 WAS THE DATA READ CORRECTLY? Y
PLEASE ENTER THE NEW FORMAT STATEMENT                                   
                                                                          
  (4X,F1.0,F2.0,53F1.0,2X,9F1.0)
The FREQuency program next computes and lists the initial statistics. Once the initial statistics are prepared, the program prompts for the print option. You may request frequency distribution tables, or Quit the program.
PRINT OPTION:
Press ENTER to QUIT ANALYSIS
OR Type 1 for FREQUENCY TABLES

STEP 7: The output is written to the HOSP.PRN file when the analysis is complete. Enter the EDITOR or a word processing program to read the output file. The output file may be printed from the editor. A printed copy of the output from the sample Hospital Data example follows.



SAMPLE FREQUENCY ANALYSIS COMMAND AND DATA FILES

THE HOSP.SPS FILE
TITLE  PATIENT SURVEY, LOCAL HOSPITAL
FILE NAME 'C:HOSP.DAT'
DATA LIST V1 TO V64
64 (4X,F1.0,F2.0,53F1.0,2X,9F1.0)
VARIABLE LABELS 
   V1     'PERSON FILLING OUT'
   V2     'TYPE OF SURGERY'
   V3     'OUTPATIENT INSURANCE'
   V4     'PERCENT DR COVERAGE'
   V5     'PERCENT HOSPITAL COVERAGE'
   V6     'DECISION OF WHERE TO HAVE SURGERY'
----------------------------------------------
   ... Continues for V7 to V64 ...
----------------------------------------------
RECODE          V7 TO V10(1,2=1)(3,4=2)(5=0)
MISSING VALUES  V1 TO V10(0,9)
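
The effect of the RECODE and MISSING VALUES lines can be pictured with the Python sketch below. This is an interpretation, not the PC-MDS implementation: it assumes that (1,2=1)(3,4=2)(5=0) maps old codes to new codes, that values without a recode rule pass through unchanged, and that recoding is applied before the missing-value check. The raw responses are hypothetical.

```python
# Assumed reading of: RECODE V7 TO V10(1,2=1)(3,4=2)(5=0)
RECODE_MAP = {1: 1, 2: 1, 3: 2, 4: 2, 5: 0}

# Assumed reading of: MISSING VALUES V1 TO V10(0,9)
MISSING_CODES = {0, 9}


def recode(value):
    """Return the recoded value, leaving unlisted codes unchanged (assumption)."""
    return RECODE_MAP.get(value, value)


def is_missing(value):
    """True when the (recoded) value is declared as a missing-value code."""
    return value in MISSING_CODES


raw_v7 = [4, 2, 5, 9, 1]  # hypothetical raw responses for V7
recoded = [recode(v) for v in raw_v7]
valid = [v for v in recoded if not is_missing(v)]

print(recoded)  # [2, 1, 0, 9, 1]
print(valid)    # [2, 1, 1]
```

Under this reading, a raw response of 4 on V7 becomes 2, which agrees with the first record shown in the sample output.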

THE HOSP.DAT FILE
00101061115444444444231212122213223111326355533326424666250200213141735
00201071412222299223331122111211221111222222222222233202331100115141222
00301061334233333333131111111111111111113442442325344333441200124102343
-----------------------------------------------------------------------
   Continues for Observations 4 - 181 (NUMBERING IS NOT CONSECUTIVE)
-----------------------------------------------------------------------
19602011115444333324441111111111111111112334324224025432222100122137125
19702031113111449994941111111111111911311111111111111911311100124137124
19802011115224444444441111111211222111211112121111122221212100124131122
19902021435333233222231111111111111111111313111213425429232200113131743
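
To see how the format statement (4X,F1.0,F2.0,53F1.0,2X,9F1.0) maps onto a data line, the following Python sketch reads the first HOSP.DAT record by column position. It is a simplified illustration rather than the program's own reader: it handles only the nX and nFw.d items used here, ignores the d part (all zero in this format), and does not handle blank fields. The function names are hypothetical.

```python
import re


def parse_format(fmt):
    """Expand a format such as (4X,F1.0,53F1.0) into ('skip' | 'field', width) pairs."""
    layout = []
    for token in fmt.strip("() ").split(","):
        token = token.strip()
        skip = re.fullmatch(r"(\d+)X", token)
        field = re.fullmatch(r"(\d*)F(\d+)\.(\d+)", token)
        if skip:
            layout.append(("skip", int(skip.group(1))))
        elif field:
            repeat = int(field.group(1) or 1)
            layout.extend([("field", int(field.group(2)))] * repeat)
        else:
            raise ValueError(f"unsupported format item: {token!r}")
    return layout


def read_record(line, fmt):
    """Return the numeric values of one fixed-width data line."""
    values, pos = [], 0
    for kind, width in parse_format(fmt):
        if kind == "field":
            values.append(float(line[pos:pos + width]))
        pos += width  # skipped columns and fields both advance the position
    return values


fmt = "(4X,F1.0,F2.0,53F1.0,2X,9F1.0)"
line = "00101061115444444444231212122213223111326355533326424666250200213141735"
values = read_record(line, fmt)

print(len(values))  # 64
print(values[:6])   # [1.0, 6.0, 1.0, 1.0, 1.0, 5.0]
```

The first four columns (the case number) are skipped, V1 is read from column 5, V2 from columns 6-7, and the remaining variables are single-column fields.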



SAMPLE FREQUENCY ANALYSIS OUTPUT FILE
                     PC-MDS
               FREQUENCY ANALYSIS
 
 
 ANALYSIS TITLE      PATIENT SURVEY, LOCAL HOSPITAL
 INPUT DATA  FILE    C:HOSP.DAT                                         
 OUTPUT PRINT FILE   C:HOSP.PRN                                       
 NO. OF VARIABLES      64 
 
 DATA FOR RECORD:     1 

.10E+01 .60E+01 .10E+01 .10E+01 .10E+01 .50E+01 .20E+01 .20E+01 .20E+01 .20E+01 .20E+01 .20E+01 .20E+01 .20E+01 .20E+01 .10E+01 .20E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .20E+01 .10E+01 .10E+01 .20E+01 .10E+01 .10E+01 .10E+01 .20E+01 .10E+01 .00E+00 .10E+01 .20E+01 .20E+01 .20E+01 .10E+01 .10E+01 .10E+01 .10E+01 .00E+00 .20E+01 .10E+01 .20E+01 .00E+00 .00E+00 .00E+00 .10E+01 .20E+01 .00E+00 .10E+01 .20E+01 .10E+01 .30E+01 .10E+01 .40E+01 .10E+01 .70E+01 .30E+01 .50E+01 
 
 
 DATA FOR RECORD:   185 
.20E+01 .20E+01 .10E+01 .40E+01 .30E+01 .50E+01 .20E+01 .20E+01 .20E+01 .10E+01 .20E+01 .20E+01 .10E+01 .10E+01 .10E+01 .10E+01 .20E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .20E+01 .10E+01 .20E+01 .20E+01 .10E+01 .90E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .10E+01 .30E+01 .10E+01 .30E+01 .10E+01 .70E+01 .40E+01 .30E+01 
 
 

 DATA MODIFICATION COMPLETE   185 OBSERVATIONS READ. 
 VAR #         MEAN  STD.DEV.  STD.ERR. SAMPLE   MAXIMUM   MINIMUM     RANGE 
 --------- --------- --------- --------- ----- ---------- --------- --------- 
 V1           1.413      .536      .040   184       3.00      1.00      2.00 
 V2           4.286     2.476      .182   185      16.00      1.00     15.00 
 V3           1.059      .237      .017   185       2.00      1.00      1.00 
 

           VARIABLE:  V1       PERSON FILLING OUT                       
           ___________________________________________________________ 
               VALUE     COUNT       VALID %     PERCENT CUMULATIVE%
           ___________________________________________________________ 
                 .000         1     MISSING        .005     MISSING 
                1.000       112        .609        .605        .609 
                2.000        68        .370        .368        .978 
                3.000         4        .022        .022       1.000 
           =========================================================== 
                            185       1.000       1.000       1.000 

1. Count = Total number of responses for each value of the variable in the frequency distribution.

2. Valid % = The percentage after adjustment for missing values (to account for non-respondents or undesired codes), i.e. (Count of Each Value)/(Total Count less Missing Count).

3. Percent = (Count of Each Value)/(Total Count)

4. Cumulative % = The sum of the valid percentages of all values less than or equal to the current value. The cumulative % of the last value must be 100%.

Note that the program prints these quantities as proportions (for example, .609 rather than 60.9%).
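
The columns of the V1 table can be reproduced with the Python sketch below. The counts are taken from the V1 output; treating the value 0 as the missing code, and basing the cumulative column on valid cases only, are assumptions inferred from that table. The variable names are illustrative.

```python
# Counts from the V1 table: value -> count
counts = {0: 1, 1: 112, 2: 68, 3: 4}
missing_codes = {0}  # assumption: 0 is the missing-value code observed for V1

total = sum(counts.values())                                            # 185
valid_total = sum(c for v, c in counts.items() if v not in missing_codes)  # 184

rows = []
cumulative = 0
for value in sorted(counts):
    count = counts[value]
    percent = count / total
    if value in missing_codes:
        # Missing values get a Percent but no Valid % or Cumulative %.
        rows.append((value, count, None, percent, None))
        continue
    cumulative += count
    rows.append((value, count, count / valid_total, percent, cumulative / valid_total))

for value, count, valid_pct, percent, cum_pct in rows:
    valid_text = "MISSING" if valid_pct is None else f"{valid_pct:.3f}"
    cum_text = "MISSING" if cum_pct is None else f"{cum_pct:.3f}"
    print(f"{value:6.3f} {count:6d} {valid_text:>9} {percent:8.3f} {cum_text:>9}")
```

Rounded to three decimals, the rows for values 1, 2 and 3 give Valid % of .609, .370 and .022, Percent of .605, .368 and .022, and Cumulative % of .609, .978 and 1.000.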

            VARIABLE:  V2       TYPE OF SURGERY                          
           ___________________________________________________________ 
               VALUE     COUNT       VALID %     PERCENT CUMULATIVE%   
           ___________________________________________________________ 
                1.000        42        .227        .227        .227 
                2.000         6        .032        .032        .259 
                3.000        28        .151        .151        .411 
                4.000        12        .065        .065        .476 
                5.000        35        .189        .189        .665 
                6.000        31        .168        .168        .832 
                7.000        10        .054        .054        .886 
                8.000        20        .108        .108        .995 
               16.000         1        .005        .005       1.000 
           =========================================================== 
                            185       1.000       1.000       1.000 


           VARIABLE:  V3       OUTPATIENT INSURANCE                     
           ___________________________________________________________ 
               VALUE     COUNT       VALID %     PERCENT CUMULATIVE%   
           ___________________________________________________________ 
                1.000       174        .941        .941        .941 
                2.000        11        .059        .059       1.000 
           =========================================================== 
                            185       1.000       1.000       1.000