THE UNIVERSITY OF TEXAS AT AUSTIN
SCHOOL OF INFORMATION


LIS 397.1
INTRODUCTION TO RESEARCH IN LIBRARY AND INFORMATION SCIENCE
R. E. Wyllys

Notes on Using Microsoft Excel for Statistical Analysis


Introduction

Microsoft Excel XP (2000, 97, and 95) provides good capabilities for doing statistical analyses. These notes are intended to help you get started using Excel for this purpose. The notes are intentionally brief; you are expected to supplement them by using the online help that is available in Excel.

Getting Data into Excel

In general, you enter observations from your sample into a row or a column in Excel. I have a mild personal preference for entering data in columns, since if I am keying them in by hand, all I have to do to finish the entry in one cell and move to the next cell is to hit the Enter key; but otherwise, rows and columns work equally well as the source of data for most of the statistical functions.

In any case, if you enter some data originally into, say, a column and later decide that you would prefer that the data be in a row (or vice versa), it is easy to use the Cut to clipboard and Paste Special--Transpose capabilities to accomplish the desired change. ("Transposing" moves data from a column to a row, or vice versa.)

If you are using the clipboard to enter a set of numbers obtained from some other program (or even from a text file), you will probably find that when you insert data from the clipboard, they all go into a single cell. You should then select (i.e., click on) that cell, and choose the Text to Columns option from the Data menu (despite its name, this option can place data into cells of a row [or rows] as well as cells of a column [or columns]). Use Excel's Help to learn how to use the Text to Column options, which are quite flexible.

Activating the Data-Analysis Tools

To activate the Data-Analysis tools, choose Tools--Add-Ins and click on the box next to Analysis ToolPak. If this option is not available, you will need to add it to your Excel installation. Use Excel Help to learn how to get the necessary additional files from your Excel or Microsoft Office CD-ROM. The basic Analysis ToolPak (not the Analysis ToolPak - VBA) is all you need for the exercises in LIS 397.1.

Using Excel to Determine a Confidence Interval

To determine a confidence interval for a set of observations, enter the observations into a row (or column). Choose Tools--Data Analysis--Descriptive Statistics. In the Descriptive Statistics popup, you will need to specify:

When you are done with the specifications, click on OK in the Descriptive Statistics popup. Your results will appear in the location you chose for output. Excel provides you with basic sample statistics, including the sample mean. It also provides, labeled simply as "Confidence", the half-width of the 95% confidence interval: viz., the number that is to be subtracted from the sample mean to yield the left end of the confidence interval, and is to be added to the sample mean to yield the right end of the confidence interval.

Note: (You may ignore this paragraph unless you wish to use the data from Excel's Descriptive Statistics procedure to develop a confidence interval, for the population mean, having a confidence level different from 95%.) The value that Excel reports for the half-width of the 95% confidence interval is calculated by multiplying the standard error of the mean by the exact value of Student's t corresponding to the number of observations, i.e., the size of the sample. Excel does this regardless of whether the sample is (a) of size 30 or less, in which case the proper procedure is always to use the Student's t value rather than the Gaussian value (e.g., 1.96 or 2.58), or (b) of size 31 or more, in which case the rule-of-thumb is that the Gaussian value is acceptable (though the more accurate t value is preferred if it is readily available). Therefore, the reported value of the half-width of the confidence interval will always be slightly larger (but more accurate) than the value you could obtain by multiplying the standard error of the mean by 1.96, or by 2.58, etc. To get the exact value of Student's t, you can use the Excel function, TINV. As the Excel Help function explains in more detail, the syntax of TINV is that it takes two arguments: probability, and degrees_freedom. Probability is the probability associated with the two-tailed Student's t-distribution. Degrees_freedom is the number of degrees of freedom in which you are interested. For example, TINV(0.05,10) = 2.228139238; TINV(0.05,30) = 2.042270353; TINV(0.01,60) = 2.660272003. (You should compare these values with those shown in a table of the Student's t distribution.

Using Excel for t-Tests of Hypotheses

The t-Test for Independent Samples

To use Excel to carry out a t-test for independent samples, enter the sample observations into a worksheet. One way to do this is to use two rows for the data, i.e., one row for each set of sample observations. It is handy to have each row begin with a label in the left-most cell. Choose Tools--Data Analysis--t-Test: Two-Sample Assuming Equal Variances. In the resulting popup, you will need to specify:

The results will include the observed value of the t-statistic, which--for the purposes of LIS 397.1--is the one labeled "t Critical two-tail". The results will also include the probability (Mendenhall's "p-value") of your observing the t value that you did observe, when the null hypothesis is true; this is labeled "P(T<=t) two-tail".

The t-Test for Dependent (and Matched-Pair) Samples

To use Excel to carry out a t-test for dependent samples, carry out essentially the same steps as in the independent-sample procedure, except that you will choose Tools--Data Analysis--t-Test: Paired Two Sample for Means. Excel assumes that cells in corresponding positions in the two rows (or columns) contain the paired observations.

Using Excel for ANOVA

Enter the data as outlined above for the t-test procedures, using one row (or column) for each set of sample observations (e.g., for a 3-population ANOVA test, you will take a sample of observations from each of the 3 populations, and you will enter each different sample in a different row [or column]). It is easiest to use adjacent rows (or columns) for the data and to begin every row in the same column (or to begin every column in the same row). Choose Tools--Data Analysis--Anova: Single Factor. In the Anova: Single Factor popup, you will need to specify:

Excel provides a version of the standard ANOVA table, including the value of the observed F-ratio and the corresponding P-value (i.e., the probability of your observing the value of the F-ratio that you did observe, when the null hypothesis is true). The output also includes the threshold value, labeled "F crit", against which you can compare the observed value of the F-ratio; this is the value that you would find in a table of the F-ratio, such as the table on pp. 488-497 of the Mendenhall text.

Using Excel for Correlation

Excel does Pearson and Spearman correlation, as well as linear regression. For regression Excel insists that the observations be placed in columns rather than rows, i.e., it insists that the independent and dependent variables have their respective observed values entered in columns. Since often you will want to do both regression and correlation, you may as well develop the habit of entering the pairs of values in successive rows in columns (typically, in adjacent columns, although adjacency is not required).

To do correlation, choose Tools--Data Analysis--Correlation. In the Correlation popup, you will have to specify:

Excel displays the results in a 2x2 table (for the 2-variable case), showing the correlation of each variable with itself (viz., 1) and with the other variable. Only the cells along the diagonal and in the lower half of the table are filled in, since the table is symmetric with respect to the diagonal. The sample Pearson correlation coefficient thus appears in the cell in the lower left corner.

(Note: The design of this table is such as to make it easily expandable to handle situations involving correlations among more than just 2 variables. For example, for 3 variables, a 3x3 table will be displayed, and for larger numbers of variables, appropriately larger tables are used. In these larger tables, the entries in the various cells are the pairwise Pearson partial correlation coefficients [partial correlation is not treated in LIS 397.1].)

Since the arithmetic of calculating the Spearman rank-order correlation coefficient is (by design) identical to the arithmetic of calculating the Pearson coefficient, the above procedures work for the Spearman coefficient. But be sure to remember that observed Spearman coefficients must be compared with threshold values from a Spearman table rather than a Pearson table.

Using Excel for Linear Regression

As noted earlier, in doing regression Excel insists that the observations be placed in columns rather than rows, i.e., it insists that the independent and dependent variables have their respective observed values entered in columns.

To do regression, choose Tools--Data Analysis--Regression. In the Regression popup, you will have to specify:

Excel displays the results in several tables. The values we are primarily interested in for LIS 397.1 are shown in the column headed coefficients. In its first row, this column contains the value of the intercept coefficient, the one we call B0. The second row of this column contains the slope coefficient, the one we call B1. If you chose the Labels option in the Regression popup, Excel uses for this coefficient the label you provided for the independent variable; if you did not choose the Labels option, Excel calls this "X Variable 1." Excel reports the value of the Pearson correlation coefficient as "Multiple R" in the table called "Regression Statistics."

Using Excel for Chi-Square Tests

The Chi-Square Goodness-of-Fit Test and the Chi-Square Test of Association

Excel does not provide a particularly convenient means for doing the chi-square goodness-of-fit and association tests. You will have to provide the observed values in one range, and the corresponding expected values in another range, in an Excel spreadsheet. You can, of course, use ordinary spreadsheet functions to calculate the expected values, but it may be quicker and easier to use a calculator to produce the expected-value numbers and then copy the numbers into Excel.

When you have provided a range of observed values and another range of corresponding expected values, you choose another cell and place in it the formula

=CHIINV(CHITEST(range1, range2), df)

where range1 contains the observed values, range2 contains the expected values, and df is the pertinent number of degrees of freedom. The chosen cell will display the observed value of chi-square. If you are working the chi-square goodness-of-fit test, it will be convenient to have each range be a span of cells within a column. If you are working the chi-square association test, it will be convenient to have the ranges take the form of rectangles of cells.

Note that by itself the formula

=CHITEST(range1, range2)

yields the probability, in the circumstance that the null hypothesis is true, of observing the value of chi-square that was observed.

In Conclusion

These notes are intended to help you get started using Microsoft Excel to perform statistical analyses. You should experiment for yourself with other statistical tools provided by the Analysis ToolPak in Excel.


Go to Guide to Course Materials for LIS 397.1
Go to Wyllys Webpage


Last revised 2002 Nov 8