In their book Investing in People: Financial Impact of Human Resource Initiatives, authors Wayne Cascio and John Boudreau hit on something I’ve written about elsewhere: making research understandable and meaningful to a wider audience, especially in the context of business. In other words, putting dollar signs in there.
After some introductions and defining of terms, the authors propose what they call a “LAMP” framework for approaching the measurement of Human Resources initiatives. LAMP is an acronym for a paradigm relating to planning and couching research projects in terms those folks in Operations or Accounting will actually care about. You must have a coherent Logic for the initiative and how it connects to the larger business, the right Analytics to make sense of the data, the right Measures to gather the data in the first place, and the right Processes to make use of what you discover.
This framework established, the next chunk of the book dealt with very specific questions that I-O psychologists working in the area of Human Resources are likely to be called upon to answer. How much does employee absenteeism really hurt the company? How worried should we be about our turnover? Is it going to benefit the company to put in a new fitness center for employee use or to pay for a smoking cessation program? Is it worth it to offer on-site day care for employees to use in emergencies? How concerned should I be about these employee satisfaction survey results? Just how bad ARE the hyena attacks on the third floor?
The authors provide logic, measures, analytics, and processes for each issue to educate the reader on how to approach each question as both a scientist and a business person. Good research methods, theory building, and scientific interpretation of results are stressed, but so is communicating the outcomes in terms of dollars (or whatever your local currency may be). If you need a formula for calculating the hourly cost of turnover or absenteeism, for example, you’ll find it here.
The next major part of the book dives head first into the complicated (and often controversial) concept of staffing utility. The authors provide information on measuring staffing utility, then discuss its use in decision-making processes for things like enhanced selection systems and HR development programs. This section of the book is not for the faint of heart, as it contains some pretty complicated algebra and calls for some pretty challenging measurement. Here, let’s look at Equation 11-10:
I mean, sweet jeebus, what is THAT? Well, they tell you, or at least try to. And the utility (pun intended) of this kind of effort can’t be overstated when you are trying to sell a program to key decision makers or to communicate the impact of a new program.
So in general I liked Investing in People, even if it bogs down from time to time and once or twice the reader is presented with instructions that basically amount to “just make a best guess and plug the number into your model.” But the message of how to communicate and debate with stakeholders in their own language and on their home turf is an invaluable one if Human Resources in general and I-O psychology in particular are going to move forward and become a real driving force in business.
If you work for a large company that uses a lot of pre-employment tests across a large number of jobs it can sometimes be hard to keep track of all the various information associated with a test. This is especially true when new people rotate in and out of roles connected with testing (e.g., recruiters).
There are plenty of collaboration tools out there, but one solution to this kind of issue that I tried experimenting with was to create a departmental wiki that I could keep updated and point people to in order to get basic information instead of hunting around for files scattered across drives or cabinets. For those of you who don’t know, a wiki (derived from a Hawaiian word for “fast”) is a collection of web pages designed so that anyone with a given set of permissions can edit them. These pages are organized and linked to each other so that you can search for information or follow links from one related topic to the next.
Wikipedia.org is probably the best-known example of a wiki, but you can take that basic concept and create one of any size and purpose. Doing a Google search for “free wiki software” or something similar will net you a ton of links, but if you want to experiment with the idea and do a quick proof of concept like I did, I’d recommend TiddlyWiki to begin with. I liked it because it’s entirely client side, with nothing to install on web servers, and it works in just about any web browser and OS. You can just plunk it down on your local hard drive to get it going, then move it to a shared network drive when you’re ready to share it with your team.
Using a wiki to track test information need not be complicated. To start, you could create a wiki page for each test, each with the following sections:
- Name of test, along with common acronyms, abbreviations, or other names
- A brief description of the test
- A list of jobs that it is used for
- Cut scores and other guides to interpreting results
- Titles of technical reports and their locations
- Contact information for vendors, if applicable
- Retest policy
- Related links
REMEMBER, the concept is that anyone in your group would have access to this info, which is handy, but it also means you should be careful about what you put on there. You wouldn’t, for example, want to use the wiki to store test results or other sensitive information.
From there, I’d encourage you to build out the wiki to include other sets of pages related to testing. Wikis can also include links to files on your intranet or shared drive, which can be an extremely useful way to organize things. There are lots of possibilities, but here are a few:
- Procedures for scheduling candidates for testing
- Links to practice tests, study guides, or brochures
- Procedures for scheduling testing rooms and other facilities
- Templates for test result letters, e-mails, or phone calls
- Procedures and tools for test scoring
- Procedures for looking up test results for previous candidates
- Testing schedules
- Lists of certified test administrators
- Troubleshooting tips for test scoring/scanning
And the great thing about the above is that, in line with the collaborative spirit of the whole wiki concept, anyone with access to the wiki can edit an existing page or create a new one. So the people who schedule candidates for testing can be the ones to create and maintain the wiki pages dealing with those procedures. If you get everyone on board, pretty soon you’ll have a thriving, perpetually updated tool that will make things easier for everyone.
Have you had any success with a departmental wiki for testing or other issues? If so, I’d love to hear about it. Leave a message by clicking on the “Comments” link below.
Sometimes when I’ve implemented a pre-employment testing program I’ve just been there to insert Tab A into Slot B in a way that keeps Lawyer C from showing up sending us all to Cell Block D (and also, hire good people in a cost efficient way). Other times, though, I’ve had to build a business case for implementing testing, and part of that business case is estimating how much this whole thing is going to cost.
Of course, there are one-time startup fees that vary widely depending on what you’re doing: you may do it in-house, hire an expensive consulting firm, or become a member of a consortium that offers testing services. Those are the costs that usually get the most attention, but it’s also important to include in any business case the cost of keeping things going once you’ve released it into the wild. Somebody is going to pay for testing materials, test administration, scoring, scheduling of candidates, and all that.
Given all this, I thought it would be interesting to write a little about the process I recently went through to do just this as part of developing a larger business case for a new testing program.
But first, a disclaimer: All data presented here are fictional. None of the numbers here are indicative of anything about my employer or anyone else. What I’m presenting is just a basic method and formulas populated by some fake data so that you can plug in the real thing on your own.
To tackle this, it’s best to build a model based on assumptions that build on each other to calculate bottom-line costs. The starting blanks that I try to fill in are these:
- # of candidates tested per opening
- # of openings per year
- Length of test session in hours
- Max number of candidates per session
- Time in hours to prep for testing, score tests, clean up, etc.
- Time in hours to schedule each candidate
- Cost of test materials per test
- Cost (other than labor) to score test
- Test administrator’s annual salary
It may take some research to come up with educated guesses for these, but once you have them you label them “ASSUMPTIONS” and use them to calculate the following:
- # of candidates tested per year
- # of test sessions per year (round up if a fraction)
- Hours per year administering tests
- Hours per year on prep, scoring, etc.
- Hours per year on scheduling
- Total hours spent on testing per year
- Cost per hour of administrator’s time (including benefits)
- Cost per year for test materials
- Cost per year for test administrator labor
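If you’d rather script the model than spreadsheet it, the whole chain of assumptions and calculated costs above can be sketched in a few lines of Python. Every number here is a made-up placeholder, just like the fake data in the spreadsheet:

```python
import math

# ASSUMPTIONS (all figures are hypothetical -- plug in your own)
candidates_per_opening = 5
openings_per_year = 40
session_hours = 2.0                    # length of a test session
max_candidates_per_session = 10
prep_hours_per_session = 1.0           # prep, scoring, clean-up per session
scheduling_hours_per_candidate = 0.25
materials_cost_per_test = 12.00
scoring_cost_per_test = 1.50           # non-labor scoring cost per test
admin_salary = 45_000
benefits_rate = 0.40                   # benefits as a fraction of salary
paid_hours_per_year = 2_080            # 40 hours/week x 52 weeks

# CALCULATED COSTS
candidates_per_year = candidates_per_opening * openings_per_year
sessions_per_year = math.ceil(candidates_per_year / max_candidates_per_session)
admin_hours = sessions_per_year * session_hours
prep_hours = sessions_per_year * prep_hours_per_session
scheduling_hours = candidates_per_year * scheduling_hours_per_candidate
total_hours = admin_hours + prep_hours + scheduling_hours
hourly_cost = admin_salary * (1 + benefits_rate) / paid_hours_per_year
materials_cost = candidates_per_year * (materials_cost_per_test + scoring_cost_per_test)
labor_cost = total_hours * hourly_cost
```

Change any assumption and re-run to see how the bottom line moves, which is exactly the point of the spreadsheet version.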
Say, you know what would be handy for all this? A spreadsheet! Download it, or here’s a picture:
The idea here is that you can modify the numbers in the “Assumptions” area and see the resulting costs in the “Calculated Costs” area. You might also build multiple models to compare various options, like one vendor versus another or outsourcing test administration versus keeping it in-house.
Or if you’re REALLY on the ball (and you are, aren’t you?), you can include these costs in a formal utility analysis or break-even analysis. But even if you don’t go that far any of this info would be great to include in your business case.
If you look at the spreadsheet, you’ll see that most of the formulas in the “Calculated Costs” area are self-explanatory. The only ones that may need explanation are the labor cost estimates (i.e., the dollar cost for test administration, scheduling, etc.). This is something that a lot of people might overlook, but which I think is critical to include so that you get the complete picture and don’t get called out on it by one of your stakeholders.
First, we make some assumptions about the test administrator/scheduler for the model:
- Number of paid hours per year including paid vacations/holidays (e.g., 40 hours a week x 52 weeks = 2,080)
- His/her annual salary (e.g., $45,000; see cell C13 in the spreadsheet and adjust your model accordingly)
- The cost of his/her benefits as a % of salary (e.g., 40%, see cell C14 and adjust accordingly if you have a better estimate for your situation)
Those assumptions made, we can make some quick calculations:
Annual salary / 2,080 = Hourly labor cost (e.g., $45,000 / 2,080 = $21.63)
Annual salary x .40 = Estimated annual cost of benefits (e.g., $45,000 x .40 = $18,000)
Estimated annual cost of benefits / 2,080 = Estimated hourly cost of benefits (e.g., $18,000 / 2,080 = $8.65)
Hourly labor cost + Hourly cost of benefits = Total hourly compensation (e.g., $21.63 + $8.65 = $30.28, or $30.29 if you round only at the end)
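The same arithmetic works in a few lines of Python, using the example figures above (which are, again, fictional):

```python
salary = 45_000        # example annual salary (fictional)
paid_hours = 2_080     # 40 hours/week x 52 weeks
benefits_rate = 0.40   # benefits as a fraction of salary

hourly_labor = salary / paid_hours                     # hourly labor cost
hourly_benefits = salary * benefits_rate / paid_hours  # hourly benefits cost
total_hourly = hourly_labor + hourly_benefits          # total hourly compensation
```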
You can then use that to calculate labor costs for your administrator’s time, plus I’ve created a formula in that spreadsheet to calculate it automatically.
So there you have it. This is only a starting point and you may need to tweak it to fit your circumstances. For example, if you have two different people doing scheduling and administration, you might break those costs out (or plug an average salary into cell C13 if they’re close). Or if you don’t pay your test administrators for vacations and holidays you would want to subtract those numbers from the 2,080 hours per year that they work. Or maybe you’re going to outsource test administration and you need to add a line in there for that.
But this template has been a good starting point for me. A lot of it is based on assumptions and educated guesses, but even that makes a big difference when you’re building your business case. Hope you enjoy, and if you have any comments, tips, or suggestions make sure to click on the “Comments” link below and let me know.
I may be looking for a new co-author for the column I write for SIOP’s TIP magazine starting next year. The column aims to examine the bridge between research and practice, so we want one practitioner (that’s me) and one academic. So if you’re an academic type in the area of Industrial-Organizational Psychology and might be interested in helping, drop me a line. The magazine comes out quarterly, so that’s only four columns a year.
Occasionally I get some free time at work and I get to dip into my “wouldn’t it be fun to…” file. Recently I did this and pulled out the idea of something I’ve not had a real chance to do since that Tests and Measurements class I took in grad school: Item Analysis.
Specifically, I had data from a few hundred test-takers on a written, multiple-choice job knowledge test. No follow-up research had been done since the initial test development, and I thought it would be fun (nerd!) to see how the individual items were performing according to good old classical test theory.
Unfortunately my current employer hasn’t equipped me with any statistical analysis software like SAS or SPSS (boo!), so I set about figuring out how to do this with plain old Microsoft Excel. Fortunately, after some time with Google I found this page, which gives you a pretty good road map for how to do it. I made a few tweaks to the process to meet my own needs, and thought I’d share with you all here.
Let me say something first in the interest of covering myself professionally and ethically: All data I present here are fictional; no actual test data from my employer or anywhere else are reproduced here. In fact, for the sake of illustration I’m simplifying the data set to 9 test takers and 10 test items. It’s easy to extrapolate to whatever number of test-takers and items you need.
First, a Brief Review
To start, in addition to basic test-level statistics (mean, standard deviation, reliability), I wanted to know item difficulty and item discrimination. Item difficulty is often measured using the p statistic and basically describes the percentage of test-takers who get that item right. Here’s the formula for p:
pi = (number of people getting an item correct) / (number of people taking the test)
Item discrimination, according to this old textbook I have here, is measured with the D statistic. It reveals how good an item is at discriminating between “good” candidates and “bad” candidates. Here, good/bad is conceptualized as whether the test-taker scored in the upper or lower third of the score distribution (some other sources I’ve seen use top/bottom quartile –you choose). The formula for D is:
D = (U – L) / n
Where U = number of people in the top third (according to overall test score) who got the item right, L = number of people in the lower group who passed the item, and n = number of people in each group (assuming they’re equal).
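Both statistics fall out in a few lines of Python. Here’s a minimal sketch using one made-up item from 9 test-takers, whose rows are already sorted from highest to lowest total test score:

```python
# One item's scores (1 = correct, 0 = incorrect) for 9 test-takers,
# with rows already sorted from highest to lowest total test score.
# All data are made up.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0]

p = sum(item) / len(item)              # item difficulty: proportion correct

n = len(item) // 3                     # size of each comparison group (3 here)
upper = item[:n]                       # top third by total score
lower = item[-n:]                      # bottom third by total score
D = (sum(upper) - sum(lower)) / n      # item discrimination
```

With these fake data, p is about .56 and D is positive, meaning the stronger test-takers got the item right more often than the weaker ones.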
There are other item-level stats I could look at, like item-total correlations (a.k.a., point-biserial correlations) or distractor analyses, but I thought difficulty and discrimination were a good start. That goal identified, the challenge was now to do this in Excel.
Setting up the Spreadsheet
Following the model here I created this spreadsheet:
(Click image above to embiggen or click here to download a copy of the spreadsheet.)
The first 11 rows of the spreadsheet contain the test data for our 9 hypothetical candidates. Row 13 shows the answer key.
Rows 14 through 18 give some information about what percentage of candidates chose each response option for each question. You can start to see trouble here, as with Item 1: everybody answered B, which is the correct answer. But more on that shortly. Row 18 shows the totals, and if a total is below 100% (like with Item 5) that means one or more candidates left that one blank. If a LOT of people leave an item blank, that’s generally a big red flag.
Calculating Scores and Results
The “Results” half of the spreadsheet starts to get into the meat of things. Rows 22 through 31 show each candidate’s score on each item: 1 point for getting a question right, and 0 points for getting it wrong. (I guess now is a good time to point out that this model doesn’t work when you give partial credit.) It’s worth taking a second to discuss the formulas in this section. The formula “=IF($E$5=$E$13,1,0)” in cell E23, for example, basically says that if Manny Calavera’s answer on Item 1 (cell E5) is equal to what’s in the answer key for Item 1 (cell E13), then a “1” goes into that cell. If not, a “0.”
It’s also noteworthy that I used the $ symbol in those cell references (e.g., $E$5 instead of just E5), which locks those references in and prevents Excel from outsmarting itself if I do something like sort the data in the “Results” section. Which I did, for a reason I’ll get to in a second.
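Outside of Excel, that scoring rule is a one-line comparison against the key. Here’s a minimal sketch with a hypothetical answer key and one hypothetical candidate:

```python
key     = ["B", "C", "A", "D", "B"]   # hypothetical answer key
answers = ["B", "C", "B", "D", ""]    # one candidate; "" means left blank

# 1 point where the response matches the key, else 0 (blanks score 0 too)
scores = [1 if a == k else 0 for a, k in zip(answers, key)]
# scores == [1, 1, 0, 1, 0]
```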
Using the Spreadsheet to Calculate Item Difficulty
So, let’s get to the punchlines: the item analyses. Row 33 contains the Item Difficulty (p) for each of the 10 test items. This is just the percentage of test-takers who got each item correct, and is calculated for each item by summing up the number of points across all test-takers and then dividing by the number of candidates (9 in this example). So for Item 1 the formula is “=SUM(E23:E31)/9”
Item difficulties that approach 1 are probably too easy; ones that get close to 0 are probably too hard. Either that or there’s some kind of error in your scoring key so that you’re incorrectly scoring the tests.
Using the Spreadsheet to Calculate Item Discrimination
Now for Item Discrimination, which is an index of how well an item differentiates between high and low performers. To calculate this, you just subtract the p value for the bottom third of the test-takers from the p value for the top third. To get these two p values, I first sorted the test-takers by Score. This is why using the $ symbol to make those cell references absolute earlier was important. Without that, Excel would diligently assume you wanted it to change the formulas that score each item (e.g., in cell E23) relative to the new order. Then you’d get garbage.
The next step in calculating D values is shown in rows 34 and 35, which recalculates those p values, but only using the top and bottom thirds of the test-takers (the first 3 and the last 3 in this case, since we have 9 people in the mix).
Row 36 then calculates the D value for each item by subtracting the lower p value from the upper one.
Using the Spreadsheet to Calculate Test-Level Statistics
Finally, I tossed in a section that calculated those test-level statistics: mean score, standard deviation (SD), reliability, and Standard Error of the Mean. To calculate reliability I used the Kuder-Richardson 21 formula, which, again according to this old textbook I have, is:

KR-21 = (k / (k – 1)) x (1 – (X x (k – X)) / (k x SD^2))

Here k is the number of items, X is the mean test score, and SD is the test’s standard deviation (SD^2 just means SD squared, or SD to the second power). The formula’s in there in the Excel spreadsheet.
Standard Error of the Mean is easier to calculate. It’s just SD multiplied by the square root of (1 – reliability). Again, the formula’s in the Excel spreadsheet.
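For reference, here are KR-21 and the standard error (SD times the square root of 1 minus reliability) in Python, with made-up test statistics:

```python
import math

k = 10       # number of items
mean = 6.0   # mean test score (made up)
sd = 2.0     # standard deviation of test scores (made up)

# Kuder-Richardson 21 reliability estimate
kr21 = (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * sd ** 2))

# Standard error: SD scaled by the unreliable portion of variance
se = sd * math.sqrt(1 - kr21)
```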
Making Use of the Results
So, putting this all together, what can we tell? There are a number of red flags you can hunt for in the item analyses. First is seeing if items are too easy. Right off the bat, Item 1 looks bad. It has a p value of 1.00, which means that every single person got it right. Maybe it’s too easy, or maybe there’s something about the way it’s written that’s clueing everyone in to the correct answer. Either way, it’s not contributing in any meaningful way and could probably be dropped, fixed, or replaced by a better item. In general, p values above .90 should be examined closely.
Conversely, you probably want to look for items that have very low p values. Look at Item 5. It has a p value of .11, which means that almost nobody got it right. Assuming you’ve got the right people taking the test, this probably means that the question is either too hard or you’ve got some error in the body of the question or the distractors that’s throwing people off. Or it’s even possible that your answer key is in error. It happens.
What about item Discrimination? Generally you want values in the middle of the road here, which means that your better test-takers are getting it right while your dunces and chowderheads are getting it wrong. Very low values –especially those approaching 0– mean that the item just isn’t telling the two groups apart. What’s worse are items with negative Discrimination values. That happens when for some reason the item is missed by your top test takers and gotten right by the bottom ones. Take a close look at those and either fix them, drop them, or replace them.
So that’s it. It’s not everything you need to know about item analysis, but it was a fun little project for me and maybe you can benefit from using my spreadsheet as a starting point if you ever do anything similar. Again, thanks to Chris Elvin of this website, from whom I borrowed heavily for everything above.