The ROI of the Gallup Q12: Assessing the true value of high-cost HR interventions
- An intervention is defined as any HR initiative that seeks to change the day-to-day functioning of employees and their work.
- If money, time, or effort is expended on any activity within a business, then, unless that business is a charity, there should be a firm expectation that the expenditure can be assessed for its eventual return (profit) – expressed monetarily as a Return on Investment (ROI).
- ROI modelling is defined as the complete costing of an HR intervention, together with an explicit account of how and when the benefits of that intervention are to be realised by a company.
This paper investigates how the cost of implementing the Gallup Workplace Audit (now known as the Q12 Employee Engagement assessment) may be critically evaluated by calculating the likelihood of making or losing money as a corporate-wide Gallup Audit score is increased.
That is, the question posed and answered via computational simulation is: what are the odds of a company making or losing money as a result of an increase in score from 36 (the average) to something higher?
What prompted this paper was my experience at a large publicly-listed NZ corporate, which effectively paid out millions of dollars deploying the Gallup Q12 and its associated training and support processes, on the basis that increasing employee engagement (indexed by score change on the Q12) would lead to increased company profitability and a host of other positive features associated with corporate performance. Indeed, the Q12 indicated an increase in engagement in many sectors over the few years it was in use. All the while, financial performance was becoming terminal.
The corporate was sold, de-listed, and broken up a few years later: testament to the fact that increasing employee engagement does not necessarily lead to improved results at all; in this case, the results were decidedly negative.
The reason why? Because the effect sizes advertised by Gallup, and a particularly influential research paper by some of their researchers, are so low that the odds of improving vs decreasing company performance over typically average score-ranges are very near 50:50.
No HR director would ordinarily realize this. Hence the working-through of the potential, realizable "consequences" in this whitepaper. Better to make a multi-million dollar decision armed with complete knowledge of the risks involved than to rely upon some abstract "validity" figures which do not convey ALL the potential consequences of implementation.
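To make the near-50:50 claim concrete, here is a minimal Monte Carlo sketch (written for this summary, not the paper's own simulation code) of how a small engagement-performance correlation translates into odds only marginally better than a coin toss. The correlation r = 0.15 is an assumed, illustrative placeholder, not a figure taken from Gallup's publications.

    # Minimal Monte Carlo sketch: small effect size -> near coin-toss odds.
    # r = 0.15 is an assumed, illustrative effect size, not Gallup's published figure.
    import numpy as np

    rng = np.random.default_rng(42)
    r = 0.15                  # assumed engagement <-> performance correlation
    n = 1_000_000             # number of simulated company pairs
    cov = [[1.0, r], [r, 1.0]]

    # Draw standardized (engagement, performance) scores for two sets of companies.
    a = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    b = rng.multivariate_normal([0.0, 0.0], cov, size=n)

    # Among pairs where company A has the higher engagement score,
    # how often does A also have the better financial performance?
    higher_engagement = a[:, 0] > b[:, 0]
    better_performance = a[:, 1] > b[:, 1]
    p_win = better_performance[higher_engagement].mean()

    print(f"P(better performance | higher engagement) = {p_win:.3f}")
    # With r = 0.15 this lands near 0.55 - barely better than 50:50.

Under bivariate normality the exact probability is 0.5 + arcsin(r)/pi, so even doubling the assumed correlation leaves the odds well short of a safe bet.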
Normative test scores in a performance-oriented personnel selection strategy
Two questions are asked in this paper:
Q1. When a test publisher/employer/recruiter begins using a psychometric test scale as part of a selection process, where a particular score on the scale is to be used as a threshold for a "minimum likely performance/literacy standard" or "filter" for applicants, the first question they face is: which score should be used as the threshold?
Short answer: A clear choice has to be made between setting the threshold subjectively and using an empirical, evidence-based approach. Arguments are made for each approach, concluding that an empirically informed decision is all but mandatory except in exceptional circumstances.
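As a hypothetical illustration of what an empirically informed threshold choice can look like (the function, data, and criterion below are invented for this summary, not taken from the whitepaper), one common approach is to sweep candidate cut-scores over historical applicant data and keep the score that best separates adequate from inadequate performers (Youden's J statistic):

    # Hypothetical sketch: choose a cut-score empirically from historical data.
    # All data here are simulated; this is an illustration, not the whitepaper's procedure.
    import numpy as np

    def choose_cut_score(scores, performed_ok):
        """Return the threshold maximizing sensitivity + specificity - 1 (Youden's J)."""
        best_cut, best_j = None, -np.inf
        for cut in np.unique(scores):
            selected = scores >= cut
            sensitivity = (selected & performed_ok).sum() / max(performed_ok.sum(), 1)
            specificity = (~selected & ~performed_ok).sum() / max((~performed_ok).sum(), 1)
            j = sensitivity + specificity - 1
            if j > best_j:
                best_cut, best_j = cut, j
        return best_cut

    # Illustrative historical data: raw scale scores and whether the hire met the standard.
    rng = np.random.default_rng(0)
    scores = rng.integers(10, 41, size=500).astype(float)
    performed_ok = rng.random(500) < (scores - 10) / 40   # performance loosely tied to score
    print("Empirical cut-score:", choose_cut_score(scores, performed_ok))

The choice of optimization criterion (Youden's J, expected utility, a fixed false-negative rate, etc.) is itself a policy decision that should be documented alongside the threshold.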
Q2. Should raw or normatively-interpreted scores be used in selection settings? That is, should an employer use the raw scale score to represent a magnitude of some attribute for a candidate, or instead re-express the score relative to a normative set of scores provided by a homogeneous group of individuals (whether "general population" or some specific subgroup)?
Short answer: From some detailed "closely-matching-reality" data simulation work, there is clearly no justification whatsoever for using transformed scaled scores such as stens, T-Scores, stanines etc. in a performance-oriented selection process, except where the norms are properly representative, substantive in constituent number, and remain static (i.e. are not cumulatively updated or “bootstrapped”).
The problem is that psychologists and practitioners have become fixated on the interpretation of test scores expressed relative to some group. That's useful when trying to interpret the score in terms of how others score. But, a score can also be considered "absolute". That is, it represents a particular magnitude of a psychological attribute.
When benchmarking (as in engagement, work-stress, or performance-related applications), the use of normative scores can be a big mistake: if the norm-group characteristics change in any way, the entire benchmarking process is rendered problematic. Further, if score magnitudes are considered related to performance, then it is the actual score which carries that relationship, not its "normed-percentile", sten, or T-score version.
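A minimal sketch of that norm-drift problem, using invented numbers: the candidate's raw score (and whatever performance relationship it carries) is fixed, yet the T-score moves as soon as the norm group is cumulatively updated.

    # Minimal sketch of why cumulatively updated ("bootstrapped") norms undermine benchmarking.
    # All numbers are hypothetical illustrations.
    import numpy as np

    def t_score(raw, norm_group):
        """Standard T-score: mean 50, SD 10, relative to the supplied norm group."""
        return 50 + 10 * (raw - norm_group.mean()) / norm_group.std(ddof=1)

    rng = np.random.default_rng(1)
    candidate_raw = 28.0

    norms_2019 = rng.normal(25, 5, size=400)          # original norm group
    norms_2022 = np.concatenate([norms_2019,          # norms updated with new applicants
                                 rng.normal(29, 5, size=600)])

    print(f"Raw score (unchanged):  {candidate_raw}")
    print(f"T-score vs 2019 norms:  {t_score(candidate_raw, norms_2019):.1f}")
    print(f"T-score vs 2022 norms:  {t_score(candidate_raw, norms_2022):.1f}")
    # The same person looks "above average" one year and "about average" the next,
    # even though the attribute magnitude - and any score-performance relationship - is identical.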
This is a fairly hefty 23-page whitepaper - with some substantive supporting analyses and argument, explaining why the use of transformed scores is not recommended for work in selection settings where a cut-score or threshold is being used as a pre-screen.
The Meta-Analytic Correlation between the Big Five Personality Constructs of Emotional Stability and Conscientiousness: Something is not quite right in the woodshed
Co-authored with Jean-Pierre Rolland, Université de Paris Ouest - Nanterre La Défense, France.
Aunt Ada Doom is the infamous "mad woman in the attic" of Stella Gibbons’ comedy novel Cold Comfort Farm (1932); her mind became unhinged when as a child she saw "something nasty in the woodshed". The literary phrase may not totally capture the effect of the observations we make below, but something is "not quite right" about the following meta-analytic results reported in a series of studies since 1993.
We do not wish to dwell on the pros and cons of meta-analysis, but rather we find ourselves questioning the implicit understanding that meta-analysis is always capable of revealing the expected population correlation between attributes. The paper by LeLorier, Gregoire, Benhaddad, Lapierre, and Derderian (1997), "Discrepancies between meta-analyses and subsequent large randomized, controlled trials" (The New England Journal of Medicine, 337(8), 536-542), is perhaps the most famous study showing that meta-analysis does not always produce accurate estimates of population parameters, and the recent study by Schonemann and Scargle (2008), "A Generalized Publication Bias Model" (Chinese Journal of Psychology, 50(1), 21-29), helps to explain why.
Of specific interest here, though, are the various meta-analytic estimates of population correlations between two specific Big Five personality test scales, Emotional Stability and Conscientiousness. These are the two most important broad personality factors associated meta-analytically with job performance. Nine sources of published evidence were examined in some detail.
Using psychometric test scores: Some warnings, explanations, and solutions for HR professionals.
How might you respond when a candidate, employee, or union notifies you of a claim of unfair or negligent practice against you, asserting that the selection procedures utilized resulted in their failure to be employed, retained post-downsizing, or promoted? When examined very closely, the usual 'best practice' responses from HR and I/O psychologists might not work.
I do not discuss adverse-impact, as this is the simplest of all grievance scenarios to defend/prosecute.
Instead, I wanted to tackle a much broader platform of potential grievances against HR selection practices - where an adverse/unfair 'outcome' is claimed to be the result of negligence or incompetence in the way test scores and assessments were incorporated into the decision process, resulting in an unfair/unjust outcome for the plaintiff(s).
This 19-page whitepaper introduces a potential employee grievance scenario surrounding the selection of incumbent employees into a fast-track leadership development program within an organization. This could just as easily be a re-allocation of a subset of employees from one job-role to another, or a downsizing scenario.
I then take the reader through the typical justifications deployed by HR (and their test-publisher/consultant I/O psychology advisers) for how test and assessment scores were used. Each is responded to in turn with a mixture of logic and empirical evidence, showing that under aggressive, expert-advised cross-examination, these 'usual suspect' justifications can be undermined, sometimes fatally.
Three lines of argument likely to be employed by HR in its defence are examined in detail:
- The disaffected group assume that we made our decision using the test scores alone, or that we were disproportionately biased in the importance we gave to them. As our test publishers, and the relevant international professional societies, advise, we used the psychometric scores only as one component (among many) to help us come to a decision.
- Modern organizational psychological science has shown that validity generalization via meta-analysis studies, and/or synthetic validity coefficients generated by a test publisher, render local validity (determining whether the assessment actually works in your organization) obsolete.
- Psychometric assessments were made of candidates, but were not used in the decision process.
Ultimately the issue comes down to the sufficiency of an evidence base for any procedure which is used to select a subset of individuals from a larger group. I am of the opinion (shared by the US Supreme Court) that local validity is what ultimately stands between HR being legally exposed and being legally protected. And, unlike the orthodox negative whinge from I/O psychologists, variations of local validity can be attained in many innovative ways, once the HR strategist concentrates on the issue at hand: developing an evidence base for a procedure which will withstand adversarial expert and legal scrutiny.
Using psychometric test scores: When outcomes are critical
Using norm-based test scores to convey information about critical potential outcomes is wrong, and legally indefensible unless the norm group is conditioned upon the criterion of interest. Even then, there is little or no utility in this approach. The application domain of employee safety tests is used as an exemplar.
In some organizational applications of psychological assessment, the test scores for an individual might be considered critical. That is, employment/promotion decisions about a candidate or incumbent may be made directly on the basis of that test score. Two areas where this is clearly the case are integrity/security personnel testing and the assessment of employee safety using psychological attributes as 'indicators of potential risk'.
The outcomes of employee dishonesty, theft, shrinkage, or causing accidents/incidents in the workplace are usually considered critical because the consequences of each can be traumatic, impact on other employees’ situations/health, and be financially costly to the organization.
The use of assessments in these domains is generally predicated upon the accuracy of a test to predict an adverse outcome, so that threshold or cut-scores/regions of interest might be used to screen individuals prior to employment, training, or deployment.
Unlike those application areas where psychological assessment information is subjectively combined with other sources and interpreted by one or more members of a selection panel, critical outcome tests produce scores which need no interpretation as their magnitudes are related directly to the probability of occurrence of the adverse outcome. And these ‘tests’ may actually be composite assessments of many attributes configured into an optimal profile classifier, where the ‘score’ is not a simple sum of unit-weighted items, but a weighted composite.
What sets this kind of test design, construction, and calibration process apart from standard psychometric tests is that they are computationally optimised for predictive accuracy of the criterion of interest.
Using normative scores with critical outcome tests makes no sense at all, except where a normative group is defined explicitly in terms of the outcome; not in terms of employee group etc. But, even here it makes no real sense to use ‘normative’ scores. Critical test scores must convey likelihood of outcome in order to justify their use.
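As a sketch of the distinction (with invented attribute names, weights, and threshold, not any real safety instrument or the whitepaper's own model), a criterion-calibrated weighted composite can report the modelled likelihood of the adverse outcome directly, and the screening rule can then be stated in outcome terms rather than as a normed score:

    # Hypothetical sketch: a weighted composite calibrated against incident outcomes,
    # reported as a probability rather than a normed score. Weights and names are invented.
    import math

    WEIGHTS = {"risk_tolerance": 0.9, "rule_compliance": -1.1, "distractibility": 0.7}
    INTERCEPT = -1.5

    def incident_probability(attributes):
        """Logistic composite: returns the modelled probability of an incident."""
        z = INTERCEPT + sum(WEIGHTS[k] * v for k, v in attributes.items())
        return 1 / (1 + math.exp(-z))

    candidate = {"risk_tolerance": 1.2, "rule_compliance": -0.4, "distractibility": 0.8}
    p = incident_probability(candidate)
    print(f"Modelled incident probability: {p:.2f}")

    # The screening rule is expressed directly in outcome terms,
    # e.g. flag for review when the modelled probability exceeds an agreed level.
    if p > 0.25:
        print("Flag: above the illustrative 0.25 risk threshold")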
Employee Safety Assessment
For example, in a safety assessment, should one wish to express a classifier or test score relative to a safe (no recorded incident) or unsafe (at least one recorded self-caused incident) group, the resulting normative score at least preserves the relativity with respect to the criterion. But expressing a score relative to an employee group, without first conditioning upon incident-causation, confounds the criterion with group membership: in a supervisor norm group, for example, some members may themselves have caused incidents and will be included in your norm group regardless.
The public perception of institutional leadership as a function of $$ spend on executive training and development.
When $$ spend (in billions) is plotted against public confidence in the leadership of major US corporations and Wall Street, from 1996 through to 2011, there is no evident relationship. More detailed analyses suggest what looks to be an overall negative relationship.
Confidence in leadership is falling almost as fast as corporate leadership development and training budgets are rising.
A review of commercial products and academic articles associated with the psychological assessment of Safety Attributes within prospective and incumbent employees
This comprehensive review (as at 2010) was originally carried out by me for the Insight Partnership (London, UK Ltd), which was contracted to provide a new safety assessment for a commercial client.
15 commercial test products are evaluated using five headings per test:
- Publisher
- Source documents
- Test details
- Results format
- Predictive validity evidence
35 academic, peer-reviewed articles are also summarised using four headings:
- Abstract
- Predictors
- Criteria predicted
- Predictive accuracy