The ROI of the Gallup Q12: Assessing the true value of high-cost HR interventions
- An intervention is defined as any HR initiative that seeks to change the day-to-day functioning of employees and their work.
- If money, time, or effort is expended on any activity within a business, then, unless that business is a charity, there should be a firm expectation that the expenditure can be assessed for its eventual return (profit) – expressed monetarily as a Return on Investment (ROI).
- ROI modelling is defined as the complete costing of an HR intervention, together with an explicit account of how and when the benefits of that intervention are to be realised by a company.
This paper investigates how the cost of implementing the Gallup Workplace Audit (now known as the Q12 Employee Engagement assessment) may be critically evaluated by calculating the likelihood of making or losing money as a corporate-wide Gallup Audit score is increased.
That is, the question posed and answered via computational simulation is: what are the odds of a company making or losing money as a result of an increase in score from 36 (the average) to something higher?
What prompted this paper was my experience at a large publicly-listed NZ corporate, which effectively paid out millions of dollars deploying the Gallup Q12 and its associated training and support processes, on the basis that increasing employee engagement (indexed by score change on the Q12) would lead to increased company profitability and a host of other positive features associated with corporate performance. Indeed, the Q12 indicated an increase in engagement in many sectors over the few years it was in use. All the while, financial performance was becoming terminal.
The corporate was sold, de-listed, and broken up a few years later: testament to the fact that increasing employee engagement does not necessarily lead to improved results at all; in this case, the results were decidedly negative.
The reason why? Because the effect sizes advertised by Gallup, and a particularly influential research paper by some of their researchers, are so low that the odds of improving vs decreasing company performance over typically average score-ranges are very near 50:50.
No HR director would ordinarily realize this. Hence the working-through of the potential, realizable "consequences" in this whitepaper. Better to make a multi-million dollar decision armed with complete knowledge of the risks involved than to rely upon some abstract "validity" figures which do not convey ALL the potential consequences of implementation.
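To make the near-50:50 claim concrete, here is a minimal Monte Carlo sketch (written for this summary, not the paper's own simulation code) of how a small engagement-performance correlation translates into odds only marginally better than a coin toss. The correlation r = 0.15 is an assumed, illustrative placeholder, not a figure taken from Gallup's publications.

    # Minimal Monte Carlo sketch: small effect size -> near coin-toss odds.
    # r = 0.15 is an assumed, illustrative effect size, not Gallup's published figure.
    import numpy as np

    rng = np.random.default_rng(42)
    r = 0.15                  # assumed engagement <-> performance correlation
    n = 1_000_000             # number of simulated company pairs
    cov = [[1.0, r], [r, 1.0]]

    # Draw standardized (engagement, performance) scores for two sets of companies.
    a = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    b = rng.multivariate_normal([0.0, 0.0], cov, size=n)

    # Among pairs where company A has the higher engagement score,
    # how often does A also have the better financial performance?
    higher_engagement = a[:, 0] > b[:, 0]
    better_performance = a[:, 1] > b[:, 1]
    p_win = better_performance[higher_engagement].mean()

    print(f"P(better performance | higher engagement) = {p_win:.3f}")
    # With r = 0.15 this lands near 0.55 - barely better than 50:50.

Under bivariate normality the exact probability is 0.5 + arcsin(r)/pi, so even doubling the assumed correlation leaves the odds well short of a safe bet.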
Normative test scores in a performance-oriented personnel selection strategy
Two questions are asked in this paper:
Q1. When a test publisher/employer/recruiter begins using a psychometric test scale as part of a selection process, where a particular score on the scale is to be used as a threshold for a "minimum likely performance/literacy standard" or "filter" for applicants, the first question they face is: which score should be used as the threshold?
Short answer: A clear choice has to be made between setting the threshold subjectively and using an empirical, evidence-based approach. Arguments are made for each approach, concluding that an empirically informed decision is all but mandatory except in exceptional circumstances.
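As a hypothetical illustration of what an empirically informed threshold choice can look like (the function, data, and criterion below are invented for this summary, not taken from the whitepaper), one common approach is to sweep candidate cut-scores over historical applicant data and keep the score that best separates adequate from inadequate performers (Youden's J statistic):

    # Hypothetical sketch: choose a cut-score empirically from historical data.
    # All data here are simulated; this is an illustration, not the whitepaper's procedure.
    import numpy as np

    def choose_cut_score(scores, performed_ok):
        """Return the threshold maximizing sensitivity + specificity - 1 (Youden's J)."""
        best_cut, best_j = None, -np.inf
        for cut in np.unique(scores):
            selected = scores >= cut
            sensitivity = (selected & performed_ok).sum() / max(performed_ok.sum(), 1)
            specificity = (~selected & ~performed_ok).sum() / max((~performed_ok).sum(), 1)
            j = sensitivity + specificity - 1
            if j > best_j:
                best_cut, best_j = cut, j
        return best_cut

    # Illustrative historical data: raw scale scores and whether the hire met the standard.
    rng = np.random.default_rng(0)
    scores = rng.integers(10, 41, size=500).astype(float)
    performed_ok = rng.random(500) < (scores - 10) / 40   # performance loosely tied to score
    print("Empirical cut-score:", choose_cut_score(scores, performed_ok))

The choice of optimization criterion (Youden's J, expected utility, a fixed false-negative rate, etc.) is itself a policy decision that should be documented alongside the threshold.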
Q2. Should raw or normatively-interpreted scores be used in selection settings? That is, should an employer use the raw scale score to represent a magnitude of some attribute for a candidate, or instead re-express the score relative to a normative set of scores provided by a homogeneous group of individuals (whether "general population" or some specific subgroup)?
Short answer: From some detailed "closely-matching-reality" data simulation work, there is clearly no justification whatsoever for using transformed scaled scores such as stens, T-Scores, stanines etc. in a performance-oriented selection process, except where the norms are properly representative, substantive in constituent number, and remain static (i.e. are not cumulatively updated or “bootstrapped”).
The problem is that psychologists and practitioners have become fixated on the interpretation of test scores expressed relative to some group. That's useful when trying to interpret the score in terms of how others score. But, a score can also be considered "absolute". That is, it represents a particular magnitude of a psychological attribute.
When benchmarking (as in engagement, work-stress, or performance-related applications), the use of normative scores can be a big mistake: if the norm-group characteristics change in any way, the entire benchmarking process is rendered problematic. Further, if score magnitudes are considered related to performance, then it is the actual score which carries that relationship, not its "normed-percentile", sten, or T-score version.
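A minimal sketch of that norm-drift problem, using invented numbers: the candidate's raw score (and whatever performance relationship it carries) is fixed, yet the T-score moves as soon as the norm group is cumulatively updated.

    # Minimal sketch of why cumulatively updated ("bootstrapped") norms undermine benchmarking.
    # All numbers are hypothetical illustrations.
    import numpy as np

    def t_score(raw, norm_group):
        """Standard T-score: mean 50, SD 10, relative to the supplied norm group."""
        return 50 + 10 * (raw - norm_group.mean()) / norm_group.std(ddof=1)

    rng = np.random.default_rng(1)
    candidate_raw = 28.0

    norms_2019 = rng.normal(25, 5, size=400)          # original norm group
    norms_2022 = np.concatenate([norms_2019,          # norms updated with new applicants
                                 rng.normal(29, 5, size=600)])

    print(f"Raw score (unchanged):  {candidate_raw}")
    print(f"T-score vs 2019 norms:  {t_score(candidate_raw, norms_2019):.1f}")
    print(f"T-score vs 2022 norms:  {t_score(candidate_raw, norms_2022):.1f}")
    # The same person looks "above average" one year and "about average" the next,
    # even though the attribute magnitude - and any score-performance relationship - is identical.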
This is a fairly hefty 23-page whitepaper - with some substantive supporting analyses and argument, explaining why the use of transformed scores is not recommended for work in selection settings where a cut-score or threshold is being used as a pre-screen.
The Meta-Analytic Correlation between the Big Five Personality Constructs of Emotional Stability and Conscientiousness: Something is not quite right in the woodshed
Co-authored with Jean-Pierre Rolland, Université de Paris Ouest - Nanterre La Défense, France.
Aunt Ada Doom is the infamous "mad woman in the attic" of Stella Gibbons’ comedy novel Cold Comfort Farm (1932); her mind became unhinged when as a child she saw "something nasty in the woodshed". The literary phrase may not totally capture the effect of the observations we make below, but something is "not quite right" about the following meta-analytic results reported in a series of studies since 1993.
We do not wish to dwell on the pros and cons of meta-analysis, but rather we find ourselves questioning the implicit understanding that meta-analysis is always capable of revealing the expected population correlation between attributes. The paper by LeLorier, Gregoire, Benhaddad, Lapierre, and Derderian (1997), "Discrepancies between meta-analyses and subsequent large randomized, controlled trials" (The New England Journal of Medicine, 337(8), 536-542), is perhaps the most famous study showing that meta-analysis does not always produce accurate estimates of population parameters, and the recent study by Schonemann and Scargle (2008), "A Generalized Publication Bias Model" (Chinese Journal of Psychology, 50(1), 21-29), helps to explain why.
Of specific interest here, though, are the various meta-analytic estimates of population correlations between two specific Big Five personality test scales, Emotional Stability and Conscientiousness. These are the two most important broad personality factors associated meta-analytically with job performance. Nine sources of published evidence were examined in some detail.
Using psychometric test scores: Some warnings, explanations, and solutions for HR professionals.
How might you respond when a candidate, employee, or union notifies you of a claim of unfair or negligent practice against you, asserting that the selection procedures utilized resulted in their failure to be employed, retained post-downsizing, or promoted? When examined very closely, the usual 'best practice' responses from HR and I/O psychologists might not work.
I do not discuss adverse-impact, as this is the simplest of all grievance scenarios to defend/prosecute.
Instead, I wanted to tackle a much broader platform of potential grievances against HR selection practices - where an adverse/unfair 'outcome' is claimed to be the result of negligence or incompetence in the way test scores and assessments were incorporated into the decision process, resulting in an unfair/unjust outcome for the plaintiff(s).
This 19-page whitepaper introduces a potential employee grievance scenario surrounding the selection of incumbent employees into a fast-track leadership development program within an organization. This could just as easily be a re-allocation of a subset of employees from one job-role to another, or a downsizing scenario.
I then take the reader through the typical justifications deployed by HR (and their test-publisher/consultant I/O psychology advisers) for how test and assessment scores were used. Each is responded to in turn with a mixture of logic and empirical evidence, showing that under aggressive, expert-advised cross-examination, these 'usual suspect' justifications can be undermined, sometimes fatally.
Three lines of argument likely to be employed by HR in its defence are examined in detail:
- The disaffected group assume that we made our decision using the test scores alone, or that we were disproportionately biased in the importance we gave to them. As our test publishers, and the relevant international professional societies, advise, we used the psychometric scores only as one component (among many) to help us come to a decision.
- Modern organizational psychological science has shown that validity generalization via meta-analysis studies, and/or synthetic validity coefficients generated by a test publisher, render local validity (determining whether the assessment actually works in your organization) obsolete.
- Psychometric assessments were made of candidates, but were not used in the decision process.
Ultimately the issue comes down to the sufficiency of an evidence base for any procedure which is used to select a subset of individuals from a larger group. I am of the opinion (shared by the US Supreme Court) that local validity is what ultimately stands between HR being legally exposed and being legally protected. And, unlike the orthodox negative whinge from I/O psychologists, variations of local validity can be attained in many innovative ways, once the HR strategist concentrates on the issue at hand: developing an evidence base for a procedure which will withstand adversarial expert and legal scrutiny.
Using psychometric test scores: When outcomes are critical
Using norm-based test scores to convey information about critical potential outcomes is wrong, and legally indefensible unless the norm group is conditioned upon the criterion of interest. Even then, there is little or no utility in this approach. The application domain of employee safety tests is used as an exemplar.
In some organizational applications of psychological assessment, the test scores for an individual might be considered critical. That is, employment/promotion decisions about a candidate or incumbent may be made directly on the basis of that test score. Two areas where this is clearly the case are integrity/security personnel testing and the assessment of employee safety using psychological attributes as 'indicators of potential risk'.
The outcomes of employee dishonesty, theft, shrinkage, or causing accidents/incidents in the workplace are usually considered critical because the consequences of each can be traumatic, impact on other employees’ situations/health, and be financially costly to the organization.
The use of assessments in these domains is generally predicated upon the accuracy of a test to predict an adverse outcome, so that threshold or cut-scores/regions of interest might be used to screen individuals prior to employment, training, or deployment.
Unlike those application areas where psychological assessment information is subjectively combined with other sources and interpreted by one or more members of a selection panel, critical outcome tests produce scores which need no interpretation as their magnitudes are related directly to the probability of occurrence of the adverse outcome. And these ‘tests’ may actually be composite assessments of many attributes configured into an optimal profile classifier, where the ‘score’ is not a simple sum of unit-weighted items, but a weighted composite.
What sets this kind of test design, construction, and calibration process apart from standard psychometric tests is that they are computationally optimised for predictive accuracy of the criterion of interest.
Using normative scores with critical outcome tests makes no sense at all, except where a normative group is defined explicitly in terms of the outcome; not in terms of employee group etc. But, even here it makes no real sense to use ‘normative’ scores. Critical test scores must convey likelihood of outcome in order to justify their use.
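As a sketch of the distinction (with invented attribute names, weights, and threshold, not any real safety instrument or the whitepaper's own model), a criterion-calibrated weighted composite can report the modelled likelihood of the adverse outcome directly, and the screening rule can then be stated in outcome terms rather than as a normed score:

    # Hypothetical sketch: a weighted composite calibrated against incident outcomes,
    # reported as a probability rather than a normed score. Weights and names are invented.
    import math

    WEIGHTS = {"risk_tolerance": 0.9, "rule_compliance": -1.1, "distractibility": 0.7}
    INTERCEPT = -1.5

    def incident_probability(attributes):
        """Logistic composite: returns the modelled probability of an incident."""
        z = INTERCEPT + sum(WEIGHTS[k] * v for k, v in attributes.items())
        return 1 / (1 + math.exp(-z))

    candidate = {"risk_tolerance": 1.2, "rule_compliance": -0.4, "distractibility": 0.8}
    p = incident_probability(candidate)
    print(f"Modelled incident probability: {p:.2f}")

    # The screening rule is expressed directly in outcome terms,
    # e.g. flag for review when the modelled probability exceeds an agreed level.
    if p > 0.25:
        print("Flag: above the illustrative 0.25 risk threshold")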
Employee Safety Assessment
For example, in a safety assessment, should one wish to express a classifier or test score relative to a safe (no recorded incident) or unsafe (at least one recorded self-caused incident) group, the resulting normative score at least preserves the relativity with respect to the criterion. But expressing a score relative to an employee group, without first conditioning upon incident-causation, confounds the criterion with group membership: in a supervisor norm group, for example, some members may themselves have caused incidents and will be included in your norm group regardless.
The public perception of institutional leadership as a function of $$ spend on executive training and development.
When $$ spend (in billions) is plotted against public confidence in the leadership of major US corporations and Wall Street, from 1996 through to 2011, there is no evident relationship. More detailed analyses suggest what looks to be an overall negative relationship.
Confidence in leadership is falling almost as fast as corporate leadership development and training budgets are rising.
A review of commercial products and academic articles associated with the psychological assessment of Safety Attributes within prospective and incumbent employees
This comprehensive review (as at 2010) was originally carried out by me for the Insight Partnership (London, UK Ltd), which was contracted to provide a new safety assessment for a commercial client.
15 commercial test products are evaluated using five headings per test:
- Publisher
- Source documents
- Test details
- Results format
- Predictive validity evidence
35 academic, peer-reviewed articles are also summarised using four headings:
- Abstract
- Predictors
- Criteria predicted
- Predictive accuracy