Excel For Statistical Data Analysis
This is a webtext companion site of
Business Statistics USA Site
Para mis visitantes del mundo de habla hispana, este sitio se encuentra disponible en espaГ±ol en:
Sitio Espejo para AmГ©rica Latina Sitio de los E.E.U.U.
To search the site. try E dit | F ind in page [Ctrl + f]. Enter a word or phrase in the dialogue box, e.g. " variance" or " mean" If the first appearance of the word/phrase is not what you are looking for, try F ind Next.
This site provides illustrative experience in the use of Excel for data summary, presentation, and for other basic statistical analysis. I believe the popular use of Excel is on the areas where Excel really can excel. This includes organizing data, i.e. basic data management, tabulation and graphics. For real statistical analysis on must learn using the professional commercial statistical packages such as SAS, and SPSS.
Microsoft Excel 2000 (version 9) provides a set of data analysis tools called the Analysis ToolPak which you can use to save steps when you develop complex statistical analyses. You provide the data and parameters for each analysis; the tool uses the appropriate statistical macro functions and then displays the results in an output table. Some tools generate charts in addition to output tables.
If the Data Analysis command is selectable on the Tools menu, then the Analysis ToolPak is installed on your system. However, if the Data Analysis command is not on the Tools menu, you need to install the Analysis ToolPak by doing the following:
Step 1: On the Tools menu, click Add-Ins. If Analysis ToolPak is not listed in the Add-Ins dialog box, click Browse and locate the drive, folder name, and file name for the Analysis ToolPak Add-in В— Analys32.xll В— usually located in the Program Files\Microsoft Office\Office\Library\Analysis folder. Once you find the file, select it and click OK.Step 2: If you don't find the Analys32.xll file, then you must install it.
- Insert your Microsoft Office 2000 Disk 1 into the CD ROM drive.
- Select Run from the Windows Start menu.
- Browse and select the drive for your CD. Select Setup.exe, click Open, and click OK.
- Click the Add or Remove Features button.
- Click the + next to Microsoft Excel for Windows.
- Click the + next to Add-ins.
- Click the down arrow next to Analysis ToolPak.
- Select Run from My Computer.
- Select the Update Now button.
- Excel will now update your system to include Analysis ToolPak.
- Launch Excel.
- On the Tools menu, click Add-Ins. - and select the Analysis ToolPak check box.
Microsoft Excel is a powerful spreadsheet package available for Microsoft Windows and the Apple Macintosh. Spreadsheet software is used to store information in columns and rows which can then be organized and/or processed. Spreadsheets are designed to work well with numbers but often include text. Excel organizes your work into workbooks; each workbook can contain many worksheets; worksheets are used to list and analyze data.
Excel is available on all public-access PCs (i.e. those, e.g. in the Library and PC Labs). It can be opened either by selecting Start - Programs - Microsoft Excel or by clicking on the Excel Short Cut which is either on your desktop, or on any PC, or on the Office Tool bar.Opening a Document:
- Click on File-Open (Ctrl+O) to open/retrieve an existing workbook; change the directory area or drive to look for files in other locations
Saving and Closing a Document:
To save your document with its current filename, location and file format either click on File - Save. If you are saving for the first time, click File-Save; choose/type a name for your document; then click OK. Also use File-Save if you want to save to a different filename/location.
When you have finished working on a document you should close it. Go to the File menu and click on Close. If you have made any changes since the file was last saved, you will be asked if you wish to save them.
The Excel screen
Workbooks and worksheets:
When you start Excel, a blank worksheet is displayed which consists of a multiple grid of cells with numbered rows down the page and alphabetically-titled columns across the page. Each cell is referenced by its coordinates (e.g. A3 is used to refer to the cell in column A and row 3; B10:B20 is used to refer to the range of cells in column B and rows 10 through 20).
Your work is stored in an Excel file called a workbook. Each workbook may contain several worksheets and/or charts - the current worksheet is called the active sheet. To view a different worksheet in a workbook click the appropriate Sheet Tab.
You can access and execute commands directly from the main menu or you can point to one of the toolbar buttons (the display box that appears below the button, when you place the cursor over it, indicates the name/action of the button) and click once.
Moving Around the Worksheet:
It is important to be able to move around the worksheet effectively because you can only enter or change data at the position of the cursor. You can move the cursor by using the arrow keys or by moving the mouse to the required cell and clicking. Once selected the cell becomes the active cell and is identified by a thick border; only one cell can be active at a time.
To move from one worksheet to another click the sheet tabs. (If your workbook contains many sheets, right-click the tab scrolling buttons then click the sheet you want.) The name of the active sheet is shown in bold.
Moving Between Cells:
Here is a keyboard shortcuts to move the active cell:
- Home - moves to the first column in the current row
- Ctrl+Home - moves to the top left corner of the document
- End then Home - moves to the last cell in the document
To move between cells on a worksheet, click any cell or use the arrow keys. To see a different area of the sheet, use the scroll bars and click on the arrows or the area above/below the scroll box in either the vertical or horizontal scroll bars.
Note that the size of a scroll box indicates the proportional amount of the used area of the sheet that is visible in the window. The position of a scroll box indicates the relative location of the visible area within the worksheet.
A new worksheet is a grid of rows and columns. The rows are labeled with numbers, and the columns are labeled with letters. Each intersection of a row and a column is a cell. Each cell has an address. which is the column letter and the row number. The arrow on the worksheet to the right points to cell A1, which is currently highlighted. indicating that it is an active cell. A cell must be active to enter information into it. To highlight (select) a cell, click on it.
To select more than one cell:
- Click on a cell (e.g. A1), then hold the shift key while you click on another (e.g. D4) to select all cells between and including A1 and D4.
One worksheet can have up to 256 columns and 65,536 rows, so it'll be a while before you run out of space.
Each cell can contain a label. value. logical value. or formula .
- Labels can contain any combination of letters, numbers, or symbols.
To enter information into a cell, select the cell and begin typing.
Note that as you type information into the cell, the information you enter also displays in the formula bar. You can also enter information into the formula bar, and the information will appear in the selected cell.
When you have finished entering the label or value:
- Press "Enter" to move to the next cell below (in this case, A2)
Unless the information you enter is formatted as a value or a formula, Excel will interpret it as a label, and defaults to align the text on the left side of the cell.
If you are creating a long worksheet and you will be repeating the same label information in many different cells, you can use the AutoComplete function. This function will look at other entries in the same column and attempt to match a previous entry with your current entry. For example, if you have already typed "Wesleyan" in another cell and you type "W" in a new cell, Excel will automatically enter "Wesleyan." If you intended to type "Wesleyan" into the cell, your task is done, and you can move on to the next cell. If you intended to type something else, e.g. "Williams," into the cell, just continue typing to enter the term.
To turn on the AutoComplete funtion, click on "Tools" in the menu bar, then select "Options," then select "Edit," and click to put a check in the box beside "Enable AutoComplete for cell values."
Another way to quickly enter repeated labels is to use the Pick List feature. Right click on a cell, then select "Pick From List." This will give you a menu of all other entries in cells in that column. Click on an item in the menu to enter it into the currently selected cell.
A value is a number, date, or time, plus a few symbols if necessary to further define the numbers [such as. + - ( ) % $ / ].
Numbers are assumed to be positive; to enter a negative number, use a minus sign "-" or enclose the number in parentheses "()".
Dates are stored as MM/DD/YYYY, but you do not have to enter it precisely in that format. If you enter "jan 9" or "jan-9", Excel will recognize it at January 9 of the current year, and store it as 1/9/2002. Enter the four-digit year for a year other than the current year (e.g. "jan 9, 1999"). To enter the current day's date, press "control" and ";" at the same time.
Times default to a 24 hour clock. Use "a" or "p" to indicate "am" or "pm" if you use a 12 hour clock (e.g. "8:30 p" is interpreted as 8:30 PM). To enter the current time, press "control" and ":" (shift-semicolon) at the same time.
An entry interpreted as a value (number, date, or time) is aligned to the right side of the cell, to reformat a value.Rounding Numbers that Meet Specified Criteria: To apply colors to maximum and/or minimum values:
- Select a cell in the region, and press Ctrl+Shift+* (in Excel 2003, press this or Ctrl+A) to select the Current Region.
- From the Format menu, select Conditional Formatting.
- In Condition 1, select Formula Is, and type =MAX($F:$F) =$F1.
- Click Format, select the Font tab, select a color, and then click OK.
- In Condition 2, select Formula Is, and type =MIN($F:$F) =$F1.
- Repeat step 4, select a different color than you selected for Condition 1, and then click OK.
Rounding Numbers that Meet Specified Criteria
Problem: Rounding all the numbers in column A to zero decimal places, except for those that have "5" in the first decimal place.
Solution: Use the IF, MOD, and ROUND functions in the following formula: =IF(MOD(A2,1)=0.5,A2,ROUND(A2,0))
To Copy and Paste All Cells in a Sheet
- Select the cells in the sheet by pressing Ctrl+A (in Excel 2003, select a cell in a blank area before pressing Ctrl+A, or from a selected cell in a Current Region/List range, press Ctrl+A+A).
To Copy the Entire Sheet
Copying the entire sheet means copying the cells, the page setup parameters, and the defined range Names.Option 1:
- Move the mouse pointer to a sheet tab.
- Press Ctrl, and hold the mouse to drag the sheet to a different location.
- Release the mouse button and the Ctrl key.
- Right-click the appropriate sheet tab.
- From the shortcut menu, select Move or Copy. The Move or Copy dialog box enables one to copy the sheet either to a different location in the current workbook or to a different workbook. Be sure to mark the Create a copy checkbox.
- From the Window menu, select Arrange.
- Select Tiled to tile all open workbooks in the window.
- Use Option 1 (dragging the sheet while pressing Ctrl) to copy or move a sheet.
Sorting by Columns
The default setting for sorting in Ascending or Descending order is by row. To sort by columns:
- From the Data menu, select Sort, and then Options.
- Select the Sort left to right option button and click OK.
- In the Sort by option of the Sort dialog box, select the row number by which the columns will be sorted and click OK.
The Data Analysis ToolPak has a Descriptive Statistics tool that provides you with an easy way to calculate summary statistics for a set of sample data. Summary statistics includes Mean, Standard Error, Median, Mode, Standard Deviation, Variance, Kurtosis, Skewness, Range, Minimum, Maximum, Sum, and Count. This tool eliminates the need to type indivividual functions to find each of these results. Excel includes elaborate and customisable toolbars, for example the "standard" toolbar shown here:
Some of the icons are useful mathematical computation:
is the "Autosum" icon, which enters the formula "=sum()" to add up a range of cells.
is the "FunctionWizard" icon, which gives you access to all the functions available.
is the "GraphWizard" icon, giving access to all graph types available, as shown in this display:
Excel can be used to generate measures of location and variability for a variable. Suppose we wish to find descriptive statistics for a sample data: 2, 4, 6, and 8.
Step 1. Select the Tools *pull-down menu, if you see data analysis, click on this option, otherwise,
click on add-in. option to install analysis tool pak.
Step 2. Click on the data analysis option.
Step 3. Choose Descriptive Statistics from Analysis Tools list.
Step 4. When the dialog box appears:
Enter A1:A4 in the input range box, A1 is a value in column A and row 1. in this case this value is 2. Using the same technique enter other VALUES until you reach the last one. If a sample consists of 20 numbers, you can select for example A1, A2, A3, etc. as the input range.
Step 5. Select an output range. in this case B1. Click on summary statistics to see the results.
Select OK .
When you click OK. you will see the result in the selected range.
As you will see, the mean of the sample is 5, the median is 5, the standard deviation is 2.581989, the sample variance is 6.666667,the range is 6 and so on. Each of these factors might be important in your calculation
of different statistical procedures.
Consider the problem of finding the probability of getting less than a certain value under any normal probability distribution. As an illustrative example, let us suppose the SAT scores nationwide are normally distributed with a mean and standard deviation of 500 and 100, respectively. Answer the following questions based on the given information:
A: What is the probability that a randomly selected student score will be less than 600 points?
B: What is the probability that a randomly selected student score will exceed 600 points?
C: What is the probability that a randomly selected student score will be between 400 and 600?
Hint: Using Excel you can find the probability of getting a value approximately less than or equal to a given value. In a problem, when the mean and the standard deviation of the population are given, you have to use common sense to find different probabilities based on the question since you know the area under a normal curve is 1.
In the work sheet, select the cell where you want the answer to appear. Suppose, you chose cell number one, A1. From the menus, select "insert pull-down".
Steps 2-3 From the menus, select insert, then click on the Function option.
Step 4. After clicking on the Function option, the Paste Function dialog appears from Function Category. Choose Statistical then NORMDIST from the Function Name box; Click OK
Step 5. After clicking on OK, the NORMDIST distribution box appears:
i. Enter 600 in X (the value box);
ii. Enter 500 in the Mean box;
iii. Enter 100 in the Standard deviation box;
iv. Type "true" in the cumulative box, then click OK.
As you see the value 0.84134474 appears in A1, indicating the probability that a randomly selected student's score is below 600 points. Using common sense we can answer part "b" by subtracting 0.84134474 from 1. So the part "b" answer is 1- 0.8413474 or 0.158653. This is the probability that a randomly selected student's score is greater than 600 points. To answer part "c", use the same techniques to find the probabilities or area in the left sides of values 600 and 400. Since these areas or probabilities overlap each other to answer the question you should subtract the smaller probability from the larger probability. The answer equals 0.84134474 - 0.15865526 that is, 0.68269. The screen shot should look like following:
Calculating the value of a random variable often called the "x" value
You can use NORMINV from the function box to calculate a value for the random variable - if the probability to the left side of this variable is given. Actually, you should use this function to calculate different percentiles. In this problem one could ask what is the score of a student whose percentile is 90? This means approximately 90% of students scores are less than this number. On the other hand if we were asked to do this problem by hand, we would have had to calculate the x value using the normal distribution formula x = m + zd. Now let's use Excel to calculate P90. In the Paste function, dialog click on statistical, then click on NORMINV. The screen shot would look like the following:
When you see NORMINV the dialog box appears.
i. Enter 0.90 for the probability (this means that approximately 90% of students' score is less than the value we are looking for)
ii. Enter 500 for the mean (this is the mean of the normal distribution in our case)
iii. Enter 100 for the standard deviation (this is the standard deviation of the normal distribution in our case)
At the end of this screen you will see the formula result which is approximately 628 points. This means the top 10% of the students scored better than 628.
Confidence Interval for the Mean
Suppose we wish for estimating a confidence interval for the mean of a population. Depending on the size of your sample size you may use one of the following cases:
Large Sample Size (n is larger than, say 30):
The general formula for developing a confidence interval for a population means is:
In this formula
is the mean of the sample; Z is the interval coefficient, which can be found from the normal distribution table (for example the interval coefficient for a 95% confidence level is 1.96). S is the standard deviation of the sample and n is the sample size.
Now we would like to show how Excel is used to develop a certain confidence interval of a population mean based on a sample information. As you see in order to evaluate this formula you need
"the mean of the sample" and the margin of error
Excel will automatically calculate these quantities for you.
The only things you have to do are:
add the margin of error
to the mean of the sample,
; Find the upper limit of the interval and subtract the margin of error from the mean to the lower limit of the interval. To demonstrate how Excel finds these quantities we will use the data set, which contains the hourly income of 36 work-study students here, at the University of Baltimore. These numbers appear in cells A1 to A36 on an Excel work sheet.
After entering the data, we followed the descriptive statistic procedure to calculate the unknown quantities. The only additional step is to click on the confidence interval in the descriptive statistics dialog box and enter the given confidence level, in this case 95%.
Here is, the above procedures in step-by-step:
Step 1. Enter data in cells A1 to A36 (on the spreadsheet)
Step 2. From the menus select Tools
Step 3. Click on Data Analysis then choose the Descriptive Statistics option then click OK .
On the descriptive statistics dialog, click on Summary Statistic. After you have done that, click on the confidence interval level and type 95% - or in other problems whatever confidence interval you desire. In the Output Range box enter B1 or what ever location you desire.
Now click on OK. The screen shot would look like the following:
As you see, the spreadsheet shows that the mean of the sample is
= 6.902777778 and the absolute value of the margin of error
= 0.231678109. This mean is based on this sample information. A 95% confidence interval for the hourly income of the UB work-study students has an upper limit of 6.902777778 + 0.231678109 and a lower limit of 6.902777778 - 0.231678109.
On the other hand, we can say that of all the intervals formed this way 95% contains the mean of the population. Or, for practical purposes, we can be 95% confident that the mean of the population is between 6.902777778 - 0.231678109 and 6.902777778 + 0.231678109. We can be at least 95% confident that interval [$6.68 and $7.13] contains the average hourly income of a work-study student.
Smal Sample Size (say less than 30) If the sample n is less than 30 or we must use the small sample procedure to develop a confidence interval for the mean of a population. The general formula for developing confidence intervals for the population mean based on small a sample is:
In this formula
is the mean of the sample.
is the interval coefficient providing an area of
in the upper tail of a t distribution with n-1 degrees of freedom which can be found from a t distribution table (for example the interval coefficient for a 90% confidence level is 1.833 if the sample is 10). S is the standard deviation of the sample and n is the sample size.
Now you would like to see how Excel is used to develop a certain confidence interval of a population mean based on this small sample information.
As you see, to evaluate this formula you need
"the mean of the sample" and the margin of error
Excel will automatically calculate these quantities the way it did for large samples.
Again, the only things you have to do are: add the margin of error
to the mean of the sample,
, find the upper limit of the interval and to subtract the margin of error from the mean to find the lower limit of the interval.
To demonstrate how Excel finds these quantities we will use the data set, which contains the hourly incomes of 10 work-study students here, at the University of Baltimore. These numbers appear in cells A1 to A10 on an Excel work sheet.
After entering the data we follow the descriptive statistic procedure to calculate the unknown quantities (exactly the way we found quantities for large sample). Here you are with the procedures in step-by-step form:
Step 1. Enter data in cells A1 to A10 on the spreadsheet
Step 2. From the menus select Tools
Step 3. Click on Data Analysis then choose the Descriptive Statistics option. Click OK on the descriptive statistics dialog, click on Summary Statistic, click on the confidence interval level and type in 90% or in other problems whichever confidence interval you desire. In the Output Range box, enter B1 or whatever location you desire. Now click on OK. The screen shot will look like the following:
Now, like the calculation of the confidence interval for the large sample, calculate the confidence interval of the population based on this small sample information. The confidence interval is:
6.8 В± 0.414426102
We can be at least 90% confidant that the interval [$6.39 and $7.21] contains the true mean of the population.
Test of Hypothesis Concerning the Population Mean
Again, we must distinguish two cases with respect to the size of your sample
Large Sample Size (say, over 30): In this section you wish to know how Excel can be used to conduct a hypothesis test about a population mean. We will use the hourly incomes of different work-study students than those introduced earlier in the confidence interval section. Data are entered in cells A1 to A36. The objective is to test the following Null and Alternative hypothesis:
The null hypothesis indicates that the average hourly income of a work-study student is equal to $7 per hour; however, the alternative hypothesis indicates that the average hourly income is not equal to $7 per hour.
I will repeat the steps taken in descriptive statistics and at the very end will show how to find the value of the test statistics in this case, z, using a cell formula.
Step 1. Enter data in cells A1 to A36 (on the spreadsheet)
Step 2. From the menus select Tools
Step 3. Click on Data Analysis then choose the Descriptive Statistics option, click OK .
On the descriptive statistics dialog, click on Summary Statistic. Select the Output Range box, enter B1 or whichever location you desire. Now click OK.
(To calculate the value of the test statistics search for the mean of the sample then the standard error. In this output, these values are in cells C3 and C4.)
Step 4. Select cell D1 and enter the cell formula = (C3 - 7)/C4. The screen shot should look like the following:
The value in cell D1 is the value of the test statistics. Since this value falls in acceptance range of -1.96 to 1.96 (from the normal distribution table), we fail to reject the null hypothesis.
Small Sample Size (say, less than 30):
Using steps taken the large sample size case, Excel can be used to conduct a hypothesis for small-sample case. Let's use the hourly income of 10 work-study students at UB to conduct the following hypothesis.
The null hypothesis indicates that average hourly income of a work-study student is equal to $7 per hour .The alternative hypothesis indicates that average hourly income is not equal to $7 per hour.
I will repeat the steps taken in descriptive statistics and at the very end will show how to find the value of the test statistics in this case "t" using a cell formula.
Step 1. Enter data in cells A1 to A10 (on the spreadsheet)
Step 2. From the menus select Tools
Step 3. Click on Data Analysis then choose the Descriptive Statistics option. Click OK .
On the descriptive statistics dialog, click on Summary Statistic. Select the Output Range boxes, enter B1 or whatever location you chose. Again, click on OK.
(To calculate the value of the test statistics search for the mean of the sample then the standard
error, in this output these values are in cells C3 and C4.)
Step 4. Select cell D1 and enter the cell formula = (C3 - 7)/C4. The screen shot would look like the following:
Since the value of test statistic t = -0.66896 falls in acceptance range -2.262 to +2.262 (from t table, where
= 0.025 and the degrees of freedom is 9), we fail to reject the null hypothesis.
Difference Between Mean of Two Populations
In this section we will show how Excel is used to conduct a hypothesis test about the difference between two population means assuming that populations have equal variances. The data in this case are taken from various offices here at the University of Baltimore. I collected the hourly income data of 36 randomly selected work-study students and 36 student assistants. The hourly income range for work-study students was $6 - $8 while the hourly income range for student assistants was $6-$9. The main objective in this hypothesis testing is to see whether there is a significant difference between the means of the two populations. The NULL and the ALTERNATIVE hypothesis is that the means are equal and the means are not equal, respectively.
Referring to the spreadsheet, I chose A1 and A2 as label centers. The work-study students' hourly income for a sample size 36 are shown in cells A2:A37. and the student assistants' hourly income for a sample size 36 is shown in cells B2:B37
Data for Student Assistant: 6, 6, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 6.5, 7, 7, 7, 7, 7, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 8, 8, 8, 8, 8, 8, 8, 8.5, 8.5, 8.5, 8.5, 8.5, 9, 9, 9, 9.
Use the Descriptive Statistics procedure to calculate the variances of the two samples. The Excel procedure for testing the difference between the two population means will require information on the variances of the two populations. Since the variances of the two populations are unknowns they should be replaced with sample variances. The descriptive for both samples show that the variance of first sample is s1 2 = 0.55546218. while the variance of the second sample s2 2 =0.969748 .