Using Excel to Do Test Item AnalysesPosted: April 2, 2009 | Author: Jamie Madigan | Filed under: Uncategorized | 7 Comments »
Occasionally I get some free time at work and I get to dip into my “wouldn’t it be fun to…” file. Recently I did this and pulled out the idea of something I’ve not had a real chance to do since that Tests and Measurements class I took in grad school: Item Analysis.
Specifically, I had data from a few hundred test-takers on a written, multiple-choice job knowledge test. No follow-up research had been done since the initial test development, and I thought it would be fun (nerd!) to see how the individual items were performing according to good old classical test theory.
Unfortunately my current employer hasn’t equipped me with any statistical analysis software like SAS or SPSS –boo!– so I set to trying to figure out how to do this with plain old Microsoft Excel. Fortunately after some time with Google I found this page, which gives you a pretty good road map as to how to do it. I made a few tweaks to the process to meet my own needs, and thought I’d share with you all here.
Let me say something first in the interest of covering myself professionally and ethically: All data I present here are fictional; no actual test data from my employer or anywhere else are reproduced here. In fact, for the sake of illustration I’m simplifying the data set to 9 test takers and 10 test items. It’s easy to extrapolate to whatever number of test-takers and items you need.
First, a Brief Review
To start, in addition to basic test-level statistics (mean, standard deviation, reliability), I wanted to know item difficulty and item discrimination. Item difficulty is often measured using the p statistic and basically describes the percentage of test-takers who get that item right. Here’s the formula for p:
pi = (number of people getting an item correct) / (number of people taking the test)
Item discrimination, according to this old textbook I have here, is measured with the D statistic. It reveals how good an item is at discriminating between “good” candidates and “bad” candidates. Here, good/bad is conceptualized as whether the test-taker scored in the upper or lower third of the score distribution (some other sources I’ve seen use top/bottom quartile –you choose). The formula for D is:
D = (U – L) / n
Where U = number of people in the top third (according to overall test score) who got the item right, L = number of people in the lower group who passed the item, and n = number of people in each group (assuming they’re equal).
There are other item-level stats I could look at, like item-total correlations (a.k.a., point-biserial correlations) or distractor analyses, but I thought difficulty and discrimination were a good start. That goal identified, the challenge was now to do this in Excel.
Setting up the Spreadsheet
Following the model here I created this spreadsheet:
(Click image above to embiggen or click here to download a copy of the spreadsheet.)
The first 11 rows of the spreadsheet contain the test data for our 9 hypothetical candidates. Row 13 shows the answer key.
Rows 14 through 18 give some information about what percentage of candidates chose each response option for each question. You can start to see trouble here, as with Item 1 –everybody answerd B, which is the correct answer. But more on that shortly. Row 18 shows the totals, and if it’s below 100% (like with Item 5) that means one or more candidates left that one blank. If a LOT of people leave an item blank, that’s generally a big red flag.
Calculating Scores and Results
The “Results” half of the spreadsheet starts to get into the meat of things. Rows 22 through 31 show each candidate’s score on each item –1 point for getting a question right, and 0 points for getting it wrong. (I guess now is a good time to point out that this model doesn’t work when you give partial credit.) It’s worth taking a second to discuss the formulas in this section. The formula “=IF($E$5=$E$13,1,0)” in cell E23, for example, basically says that if that Manny Calavera’s answer on Item 1 (cell E5) is equal to what’s in the answer key for Item 1 (cell E13), then a “1” goes into that cell. If not, a “0.”
It’s also noteworthy that I used the $ symbol in those cell references (e.g., $E$5 instead of just E5), which locks those references in and prevents Excel from outsmarting itself if I do something like sort the data in the “Results” section. Which I did, for a reason I’ll get to in a second.
Using the Spreadsheet to Calculate Item Difficulty
So, let’s get to the punchlines: the item analyses. Row 33 contains the Item Difficulty (p) for each of the 10 test items. This is just the percentage of test-takers who got each item correct, and is calculated for each item by summing up the number of points across all test-takers and then dividing by the number of candidates (9 in this example). So for Item 1 the formula is “=SUM(E23:E31)/9”
Item difficulties that approach 1 are probably too easy; ones that get close to 0 are probably too hard. Either that or there’s some kind of error in your scoring key so that you’re incorrectly scoring the tests.
Using the Spreadsheet to Calculate Item Discrimination
Now for Item Discrimination, which is an index of how well an item differentiates between high and low performers. To calculate this, you just subtract the p value for the bottom third of the test takers from the p value for the top third. To get these two p values, I first sorted the test takers by Score. This is why using the $ symbol to make those cell references absolute earlier was important. Without that, Excel would diligently assume you wanted it to change the formulas that score each item (e.g., in cell E23) relative to the new orders. Then you’d get garbage.
The next step in calculating D values is shown in rows 34 and 35, which recalculates those p values, but only using the top and bottom thirds of the test-takers (the first 3 and the last 3 in this case, since we have 9 people in the mix).
Row 36 then calculates the D value for each item by subtracting the upper and lower p values.
Using the Spreadsheet to Test-Level Statistics
Finally, I tossed in a section that calculated those test-level statistics: mean socre, standard deviation (SD), reliability, and Standard Error of the Mean. To calculate reliability I used the Kuder-Richardson 21 formula, which again according to this old textbook I have is:
This is where k is the number of items, X is the mean test score, and SD is the test’s standard deviation. And when you see something like X^2 means X to the second power, or X squared. The formula’s in there in the Excel spreadsheet.
Standard Error of the Mean is easier to calculate. It’s just the square root of (1-X)*SD. Again, the formula’s in the Excel spreadsheet.
Making Use of the Results
So, putting this all together, what can we tell? There are a number of red flags you can hunt for in the item analyses. First is seeing if items are too easy. Right off the bat item 1 looks bad. It has a p value of 1.00, which means that every single person got it right. Mabye it’s too easy or maybe there’s something about the way it’s written that’s clueing everyone in to the correct answer. Either way, it’s not contributing in any meaningful way and could probably be dropped, fixed, or replaced by a better item. In general, p values above .90 should be examined closely.
Conversely, you probably want to look for items that have very low p values. Look at Item 5. It has a p value of .11, which means that almost nobody got it right. Assuming you’ve got the right people taking the test, this probably means that the question is either too hard or you’ve got some error in the body of the question or the distractors that’s throwing people off. Or it’s even possible that your answer key is in error. It happens.
What about item Discrimination? Generally you want values in the middle of the road here, which means that your better test-takers are getting it right while your dunces and chowderheads are getting it wrong. Very low values –especially those approaching 0– mean that the item just isn’t telling the two groups apart. What’s worse are items with negative Discrimination values. That happens when for some reason the item is missed by your top test takers and gotten right by the bottom ones. Take a close look at those and either fix them, drop them, or replace them.
So that’s it. It’s not everything you need to know about item analysis, but it was a fun little project for me and maybe you can benefit from using my spreadsheet as a starting point if you ever do anything similar. Again, thanks to Chris Elvin of this website, from whom I borrowed heavily for everything above.