Psychology · 26 slides

Psychometrics

The discipline that asks: when we say someone "has high anxiety" or "an IQ of 120," what does that statement mean, and how would we know if we were wrong?

Slide outline

01 Psychometrics.
02 Opening: What psychometrics is.
03 Chapter I: The dark origins.
04 Chapter II: The first practical test.
05 Chapter III: Stanford-Binet and Wechsler.
06 Chapter IV: The general factor.
07 Chapter V: IQ has been rising.
08 Chapter VI: The bell-curve controversy.
09 Chapter VII: Stereotype threat.
10 Chapter VIII: Reliability.
11 Chapter IX: Validity.
12 Chapter X: Factor analysis.
13 Chapter XI: Item Response Theory.
14 Chapter XII: The MMPI.
15 Chapter XIII: NEO and Big Five inventories.
16 Chapter XIV: Why the MBTI fails as measurement.
17 Chapter XV: The Implicit Association Test.
18 Chapter XVI: The polygraph problem.
19 Chapter XVII: Test ethics.
20 Chapter XVIII: Test fairness across groups.
21 Chapter XIX: Measurement from digital traces.
22 Chapter XX: What has held up.
23 Chapter XXI: Twenty-five works.
24 Chapter XXII: Watch & read.
25 Chapter XXIII: What's next.
26 The end of the deck.

Page data

Canonical: https://shipslides.com/d/psychology-psychometrics
Category: Psychology
Size: 58.1 KB
Updated: 2026-05-17
LLM text: https://shipslides.com/d/psychology-psychometrics/llms.txt

Presentation Transcript

Detailed slide-by-slide text content extracted from this presentation.

Slide 01

Psychometrics.

Vol. XII · Deck 10 · The Deck Catalog
The measurement of psychological constructs. IQ, the Big Five, the Flynn effect, the bell-curve controversy, and why the MBTI is not a serious measurement instrument.
Founded: Galton, c. 1880
α target: ≥ 0.80
Pages: 26

Slide 02

Opening: What psychometrics is.

A working definition. Psychometrics is the branch of psychology concerned with the design, administration, scoring, and interpretation of psychological tests: the theory of measurement applied to mental constructs.
The discipline that asks: when we say someone "has high anxiety" or "an IQ of 120," what does that statement mean, and how would we know if we were wrong?
Psychological constructs — intelligence, personality, anxiety, depression, self-esteem — are not directly observable. We measure them through proxies: questionnaire responses, behavioural tasks, ratings by others. Psychometrics is the theory and practice of doing this rigorously.
The deck covers the founding (Galton through Binet), the development of intelligence testing, the Big Five revolution in personality assessment, the technical core (reliability, validity, factor analysis, IRT), the major controversies (the bell-curve debate, the MBTI critique, the Flynn effect), and the contemporary frontier of computational and digital-footprint measurement.
The Deck Catalog · Vol. XII— ii —

Slide 03

Chapter I: The dark origins.

Francis Galton (1822–1911). Darwin's half-cousin. Founded biometry, statistical regression, and (less proudly) eugenics. Hereditary Genius (1869). Anthropometric Laboratory at South Kensington, London, 1884–1891.
The discipline began in Victorian England with Francis Galton, who attempted to measure individual differences in mental ability through proxy measures: reaction time, head circumference, sensory acuity. Galton's anthropometric laboratory measured roughly 9,000 people for a small fee. The premise was that mental ability could be inferred from simple physical and sensory measures.
The premise was wrong. The simple sensory and reaction-time measures correlated weakly at best with what we would now call intelligence. But Galton invented the statistical tools — correlation, regression, the use of percentiles — that subsequent psychometrics depended on.
Galton was also the founder of eugenics, a project that sought to improve the heritable qualities of the human race through selective reproduction. The eugenics movement, including its later coercive expressions in the US sterilisation laws of the 1907–1939 period and Nazi German racial hygiene, was directly influenced by Galton's writings. The discipline of psychometrics has, since its founding, carried this legacy. Contemporary practitioners are aware of it.

Slide 04

Chapter II: The first practical test.

Alfred Binet (1857–1911). French. Commissioned by the Paris school authorities in 1904 to design a test that could identify children needing special instruction. Working with Théodore Simon, produced the Binet-Simon scale (1905, 1908, 1911 revisions).
Alfred Binet's 1905 test was the first that worked. Its premise differed from Galton's: rather than measuring elementary processes, Binet asked children directly to perform tasks of increasing complexity — naming objects, repeating digit sequences, completing sentences, defining words. The tasks were ranked by difficulty across age groups; a child's performance was scored against age-typical performance.
Mental age and IQ
Binet's mental age concept scored a child against age-typical performance: a 7-year-old performing at the level typical of 9-year-olds had a mental age of 9. William Stern in 1912 proposed the intelligence quotient: IQ = (mental age / chronological age) × 100. The ratio IQ formula has since been replaced by the deviation IQ, which expresses a score as its distance from the age-group mean in standard-deviation units (conventionally rescaled to a mean of 100 and a standard deviation of 15), but the basic measurement concept persists.
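To make the two scoring rules concrete, here is a minimal Python sketch. The function names are illustrative, the norm-group mean and standard deviation in the example are made-up numbers, and the mean-100, SD-15 scale is the common modern convention, not something Binet or Stern specified.

```python
from statistics import NormalDist


def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Stern's ratio IQ: (mental age / chronological age) x 100."""
    return mental_age / chronological_age * 100


def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float,
                 iq_mean: float = 100.0, iq_sd: float = 15.0) -> float:
    """Deviation IQ: z-score within the norm group, rescaled to the IQ metric."""
    z = (raw_score - norm_mean) / norm_sd
    return iq_mean + iq_sd * z


def percentile_rank(iq: float, iq_mean: float = 100.0, iq_sd: float = 15.0) -> float:
    """Percentage of a normal norm group expected to score below this IQ."""
    return NormalDist(iq_mean, iq_sd).cdf(iq) * 100


# The 7-year-old with a mental age of 9 from the text above.
print(round(ratio_iq(9, 7), 1))

# Hypothetical raw score of 62 in a norm group with mean 50 and SD 10.
print(deviation_iq(62, 50, 10))

# Where "an IQ of 120" sits in the assumed normal distribution.
print(round(percentile_rank(120), 1))
```

One practical difference: the ratio formula stops making sense for adults, because mental age levels off while chronological age keeps growing; the deviation form does not have that problem.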
Binet's caution
Binet himself warned against treating intelligence test scores as fixed measures of innate ability. He saw the scores as identifying children who needed help, not as ranking children by inherent worth. Subsequent users of his test — particularly the American hereditarians of the 1910s–20s — ignored the caveat.

Slide 05

Chapter III: Stanford-Binet and Wechsler.

Two test families. Stanford-Binet (Lewis Terman, 1916; current edition SB-5, 2003). Wechsler: WAIS-IV (adults), WISC-V (children), WPPSI-IV (preschool). The Wechsler is now more widely used clinically.
Lewis Terman at Stanford translated and adapted Binet's test for American use, producing the Stanford-Binet in 1916. The test became the dominant US intelligence measure for decades. Terman also launched the Genetic Studies of Genius longitudinal study (1921–) — 1,528 high-IQ children followed across their lives. Subjects were called "Termites." The study is still running with descendants.
Wechsler
David Wechsler (1896–1981), at Bellevue Hospital in New York, developed the Wechsler-Bellevue Intelligence Scale in 1939. He argued that intelligence testing should yield not a single score but profiles across multiple cognitive domains. The Wechsler scales are now the dominant clinical instruments. The current adult version, the WAIS-IV (2008), produces four index scores — Verbal Comprehension, Perceptual Reasoning, Working Memory, Processing Speed — plus a Full Scale IQ.
The CHC model
The Cattell-Horn-Carroll model is the current dominant theoretical framework for intelligence testing. It posits a hierarchy: a general factor (g), broad abilities (fluid reasoning, crystallised intelligence, processing speed, working memory, visual processing, etc.), and narrow abilities. Modern test batteries are designed to assess the broad abilities.

Slide 06

Chapter IV: The general factor.

Charles Spearman (1863–1945). London. "General Intelligence," Objectively Determined and Measured (1904). Introduced factor analysis and identified the g factor.
Charles Spearman observed that performance on different cognitive tests was always positively correlated — people who do well on vocabulary tests also do well on spatial-rotation tests, on arithmetic, on memory tasks. The correlations are not perfect (each test has unique variance) but the positive manifold is robust. Spearman called the common factor g (general intelligence).
Spearman's two-factor theory: any cognitive test reflects g plus an s (specific) component for that particular task. The model has been refined many times since but g has held up as a reliable empirical regularity.
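One rough way to see the positive manifold is to extract the first principal component of a correlation matrix. The sketch below uses plain power iteration; the correlation values are invented for illustration, and principal components are a simpler stand-in for the factor-analytic methods Spearman and his successors actually used.

```python
def first_component(corr, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(corr)
    vec = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(x * x for x in product) ** 0.5  # length of corr @ vec
        vec = [x / eigenvalue for x in product]          # renormalise to unit length
    return eigenvalue, vec


# Hypothetical correlations among four cognitive tests (all positive).
TESTS = ["vocabulary", "spatial rotation", "arithmetic", "memory"]
CORR = [
    [1.00, 0.50, 0.40, 0.30],
    [0.50, 1.00, 0.45, 0.35],
    [0.40, 0.45, 1.00, 0.30],
    [0.30, 0.35, 0.30, 1.00],
]

eigenvalue, weights = first_component(CORR)
share = eigenvalue / len(CORR)                   # proportion of total variance
loadings = [w * eigenvalue ** 0.5 for w in weights]

print(f"first component explains {share:.0%} of the variance")
for name, loading in zip(TESTS, loadings):
    print(f"{name:>17}: loading {loading:.2f}")
```

With every correlation positive, all four tests load on the first component with the same sign, which is the pattern Spearman interpreted as g. The leftover variance in each test corresponds to his s component.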
Thurstone's challenge
L. L. Thurstone (1938) argued for primary mental abilities rather than a single g — verbal comprehension, word fluency, number, space, memory, perceptual speed, reasoning. Thurstone's analysis suggested intelligence was multidimensional and that g was an artefact of how the abilities were measured.
The contemporary CHC synthesis incorporates both: g exists as a higher-order factor; the broad abilities Thurstone identified exist as lower-order factors. Both are real; both matter.

Slide 07

Chapter V: IQ has been rising.

James Flynn (1934–2020). New Zealand. 1984 and 1987 papers documented systematic IQ gains across 14 nations. What Is Intelligence? (2007).
James Flynn's 1984 paper documented a massive secular gain in IQ scores across 20th-century cohorts. Average IQ scores in the developed world rose by approximately 3 points per decade — about 30 points over the 20th century. The gains were largest on tests of fluid reasoning (Raven's Progressive Matrices); smaller on tests of crystallised intelligence and vocabulary.
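As a quick arithmetic check on these figures, the sketch below treats the gain as a constant 3 points per decade. That constant rate is the deck's approximation plus an assumption of linearity, not a measured law, and the function name is illustrative.

```python
FLYNN_POINTS_PER_DECADE = 3.0  # approximate figure quoted above


def score_on_older_norms(current_iq: float, norm_age_years: float,
                         rate: float = FLYNN_POINTS_PER_DECADE) -> float:
    """Expected score if the same performance were scored against older norms.

    Assumes a steady linear gain, so norms that are `norm_age_years` old
    are easier to beat by rate * (years / 10) points.
    """
    return current_iq + rate * norm_age_years / 10


# Average performance today, scored against norms from a century ago.
print(score_on_older_norms(100, 100))  # 130.0

# The same performance scored against norms that are 20 years out of date.
print(score_on_older_norms(100, 20))   # 106.0
```

The second call shows why test publishers renorm periodically: under this assumption, stale norms inflate scores by a few points per decade.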
The implications were striking. Either people were genuinely getting smarter at an unprecedented rate, or IQ tests were measuring something other than fixed innate ability, or both.
Causes
Improved nutrition. Reduced disease burden. Compulsory schooling. Increased educational complexity. The cognitive demands of an information-rich environment. Test sophistication and a "scientific spectacles" effect (Flynn's term — modern people are more comfortable with abstract classification, which test items reward).
Reverse Flynn?
Some recent samples (Norway, the Netherlands, Finland, France) have shown stagnation or modest reversal of the Flynn gains since the 1990s. Whether this reflects real cognitive change, sampling shifts, or test-content issues is debated. Bratsberg & Rogeberg's 2018 paper (Norwegian male conscripts) was the most-cited reverse-Flynn evidence.

Slide 08

Chapter VI: The bell-curve controversy.

Herrnstein & Murray, 1994. The Bell Curve: Intelligence and Class Structure in American Life. Argued that cognitive ability had become more important to life outcomes than family background, and that mean IQ differences between racial groups had partly genetic origins.
Richard Herrnstein and Charles Murray's The Bell Curve (1994) made several empirical claims, some uncontroversial in psychometrics and some highly contested.
What was uncontroversial
That IQ tests are reliable. That they have predictive validity for educational and occupational outcomes. That cognitive demands have become more important to economic mobility in the second half of the 20th century. That measured average IQ varies between population groups.
What was contested
The strong claim that mean IQ differences between racial groups (specifically the historically observed roughly 1-SD gap between US Black and White samples on most IQ tests) have partly genetic causes. The empirical and methodological pushback was substantial. The 1995 American Psychological Association report ("Intelligence: Knowns and Unknowns") concluded that the available evidence did not support a genetic explanation for the Black-White gap.
What has happened since
The gap has narrowed (Dickens & Flynn 2006). Behaviour-genetic studies suggest within-group heritability of intelligence is high (~50–80% in adulthood), but within-group heritability does not license claims about the causes of between-group differences. The current consensus is more cautious than either side of the 1990s debate.

Slide 09

Chapter VII: Stereotype threat.

Steele & Aronson, 1995. "Stereotype threat and the intellectual test performance of African Americans." The original demonstration: framing a test as diagnostic of intellectual ability reduced Black students' performance relative to White students; framing it as a non-diagnostic exercise eliminated the gap.
Claude Steele and Joshua Aronson's 1995 paper showed that the experience of being evaluated through the lens of a negative stereotype could itself depress performance. Black participants performed worse on a verbal test framed as a measure of intellectual ability than on the same test framed as a problem-solving exercise. White participants showed no such effect.
The phenomenon — stereotype threat — has been shown for women in math, white students in athletic tasks, older adults in memory tests, and many other groups. It is a real psychological phenomenon that affects measured performance.
Effect size
The effect has been replicated many times but with smaller effect sizes than the original studies suggested. Recent meta-analyses (Picho, Rodriguez, Finnie 2013; Flore & Wicherts 2015) find effect sizes ranging from small to moderate, with substantial publication bias in the literature.
The implication for psychometrics: test performance is not a pure measure of underlying ability; it is also affected by the social and motivational context of the testing. Real-world decisions based on test scores carry the consequences of these context effects.

Slide 10

Chapter VIII: Reliability.

Cronbach's α (1951): α = (k/(k−1)) · (1 − Σσ²ᵢ/σ²ₜ), where k is the number of items, σ²ᵢ is the variance of item i, and σ²ₜ is the variance of the total score. The most-used reliability coefficient. Estimates internal consistency. Conventional thresholds: α > 0.7 acceptable; > 0.8 good; > 0.9 excellent (with caveats about overinflation).
A test is reliable if it produces consistent results. Three forms:
Test-retest reliability. Correlation between two administrations of the same test to the same people across time. For stable traits (intelligence, Big Five), test-retest r should be ≥ 0.7 across reasonable time periods. The MBTI falls well short: about half of test-takers receive a different four-letter type on retest within five weeks, which for a categorical instrument is close to noise.
Internal consistency. The degree to which the items of a test all measure the same construct. Cronbach's alpha is the standard index. McDonald's omega is technically superior in many situations and is increasingly preferred in modern psychometrics.
Inter-rater reliability. The degree to which different scorers produce consistent ratings. Measured by Cohen's kappa (categorical) or intraclass correlation (continuous). Critical for clinical interview-based instruments and behavioural observation.
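For internal consistency, the alpha formula given on this slide can be computed directly from an item-by-respondent table. This is a minimal sketch using only the standard library; the response data are invented and the function name is illustrative.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total score))
    """
    k = len(responses[0])
    item_variances = [variance([person[i] for person in responses]) for i in range(k)]
    total_variance = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: five respondents answering three 1-5 Likert items.
responses = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 4],
]

print(round(cronbach_alpha(responses), 2))  # 0.93
```

Alpha rises when items covary strongly relative to their individual variances. It also rises simply with the number of items, which is one reason a very high alpha is not automatically good news.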
The reliability ceiling
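A test's validity coefficient is bounded by its reliability: you cannot measure something more accurately than your instrument is consistent. The classical formula: r_xy ≤ √(r_xx · r_yy), so the observed correlation between a test and a criterion cannot exceed the square root of the product of their reliabilities.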
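The ceiling formula is easy to work through numerically. The sketch below computes the bound and the related classical correction for attenuation; the reliability values are invented for illustration and the function names are not from any particular library.

```python
from math import sqrt


def max_observed_correlation(rel_x: float, rel_y: float = 1.0) -> float:
    """Upper bound on r_xy given the reliabilities of test (x) and criterion (y)."""
    return sqrt(rel_x * rel_y)


def disattenuated(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: the correlation expected
    if both measures were perfectly reliable."""
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical test with reliability 0.64 and criterion with reliability 0.81.
print(round(max_observed_correlation(0.64, 0.81), 2))  # 0.72

# An observed validity of 0.36 between those two measures, corrected.
print(round(disattenuated(0.36, 0.64, 0.81), 2))       # 0.5
```

Note that the bound is on the square root of the reliabilities, so an unreliable criterion caps validity just as an unreliable test does.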

Slide 11

Chapter IX: Validity.

Three (or five) types. Cronbach & Meehl's 1955 paper "Construct Validity in Psychological Tests" was the foundational treatment. Subsequent treatments (Messick 1989) integrated multiple types into a unified construct-validity framework.
A test is valid to the extent that it measures what it purports to measure. The classical taxonomy:
Content validity. Does the test cover the relevant content domain? An achievement test should sample the curriculum it claims to measure.
Criterion validity. Does the test predict the relevant outcome? Two subtypes: concurrent (relationship to a current outcome — does the depression scale correlate with clinician ratings now?) and predictive (relationship to a future outcome — do SAT scores predict college GPA?).
Construct validity. Does the test measure the underlying psychological construct it is named for? Demonstrated through multiple converging lines of evidence: convergent validity (correlation with other measures of the same construct), discriminant validity (lack of correlation with measures of different constructs), and the broader nomological network in which the construct sits.
Campbell & Fiske, 1959
The multitrait-multimethod matrix formalised the convergent/discriminant distinction. Multiple traits should be measured by multiple methods; the construct-validity case is supported when same-trait correlations across methods exceed different-trait correlations within methods.

Slide 12

Chapter X: Factor analysis.

EFA vs CFA. Exploratory factor analysis (EFA): the data tells you how many factors and what loads on what. Confirmatory factor analysis (CFA): you specify a hypothesised structure and test whether the data fit it.
The statistical workhorse of psychometrics. Factor analysis identifies latent dimensions that explain patterns of correlation among observed variables. If many test items correlate strongly with each other but weakly with another set of items, factor analysis can identify the underlying factors.
Spearman's 1904 g paper was the first factor-analytic study. Cattell's 16PF, Eysenck's PEN model, the Big Five, the structure of psychopathology (Krueger's HiTOP), the structure of cognitive abilities (CHC) — all are factor-analytic results.
The fit problem
Factor analysis does not yield a unique solution. Multiple rotations of the same data produce different factor structures. The choice between solutions involves both statistical fit (eigenvalues, scree plots, parallel analysis, fit indices like CFI, RMSEA, SRMR) and substantive interpretability. Different fields have settled on different conventional structures, and reasonable people can disagree about whether 3, 5, or 6 factors best capture personality variation.
The 5-factor model has substantial support across languages and methods, but the case is empirical and partial, not deductive.

Slide 13

Chapter XI: Item Response Theory.

Item Response Theory. Frederic Lord and Georg Rasch developed IRT in the 1950s–60s. Models how item-level responses depend on examinee ability and item parameters. The basis of computer-adaptive testing.
Classical Test Theory (the framework underlying Cronbach's alpha and traditional test scoring) treats a test score as the sum of true score plus error. Item Response Theory (IRT) models the probability that a person of a given ability will answer a particular item correctly, as a function of item-level parameters (difficulty, discrimination, guessing).
The most common IRT models: 1PL (Rasch) — items vary only in difficulty; 2PL — items vary in difficulty and discrimination; 3PL — adds a guessing parameter. For polytomous items (Likert ratings), graded response and partial credit models extend the framework.
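The three dichotomous models share one logistic form, so a single function can illustrate all of them. This is a minimal sketch; the parameter names and example values are illustrative, and operational programs often add a scaling constant that is omitted here.

```python
from math import exp


def p_correct(theta: float, difficulty: float,
              discrimination: float = 1.0, guessing: float = 0.0) -> float:
    """Probability of a correct response under the 3PL model.

    With guessing = 0 this is the 2PL; with discrimination also held
    equal across items it is the 1PL (Rasch-type) model.
    """
    logistic = 1.0 / (1.0 + exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# 1PL: ability equal to item difficulty gives a 50% chance.
print(p_correct(theta=0.0, difficulty=0.0))                  # 0.5

# 3PL: a four-option multiple-choice item with a 0.25 guessing floor.
print(p_correct(theta=0.0, difficulty=0.0, guessing=0.25))   # 0.625

# 2PL: a more discriminating item separates nearby abilities more sharply.
for a in (0.5, 2.0):
    low = p_correct(-1.0, 0.0, discrimination=a)
    high = p_correct(1.0, 0.0, discrimination=a)
    print(f"a={a}: P(theta=-1)={low:.2f}, P(theta=+1)={high:.2f}")
```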
Computer-adaptive testing
IRT enables tests that adapt to the test-taker. The GRE, GMAT, and many state-level standardised tests are now computer-adaptive: each item is selected based on the test-taker's performance on previous items. The result: more measurement information per minute of testing time, particularly at the extremes of the ability distribution where fixed-form tests have less precision.
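A common textbook rule for adaptive item selection is to pick the unused item with the greatest Fisher information at the current ability estimate. The sketch below shows only that selection step under a 2PL model. The item bank is hypothetical, the ability-update step is omitted, and real programs such as the GRE or GMAT use their own algorithms with content and exposure constraints that are not modelled here.

```python
from math import exp


def p_2pl(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def next_item(theta_hat: float, item_bank: dict, administered: set) -> str:
    """Return the id of the unused item that is most informative at theta_hat."""
    candidates = [
        (information(theta_hat, a, b), item_id)
        for item_id, (a, b) in item_bank.items()
        if item_id not in administered
    ]
    return max(candidates)[1]


# Hypothetical three-item bank: id -> (discrimination, difficulty).
BANK = {"easy": (1.0, -2.0), "medium": (1.2, 0.0), "hard": (1.0, 2.0)}

print(next_item(0.0, BANK, set()))          # medium
print(next_item(1.8, BANK, set()))          # hard
print(next_item(-0.5, BANK, {"medium"}))    # easy
```

Because information peaks where item difficulty is close to the examinee's ability, this rule keeps steering high and low scorers toward items pitched at their level, which is where the precision gain at the extremes comes from.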
Differential Item Functioning
IRT also provides the framework for testing whether items function differently across demographic groups. DIF analysis identifies items that produce different probability of correct response for different groups at the same underlying ability level — a major fairness check for high-stakes tests.
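As one concrete illustration of a DIF check, the sketch below uses the Mantel-Haenszel procedure, a widely used non-IRT method that matches examinees on total score instead of on an IRT ability estimate. It is swapped in here because it fits in a few lines. The counts are invented, and the -2.35 rescaling is the conventional ETS delta metric.

```python
from math import log


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score levels.

    Each stratum is (ref_correct, ref_wrong, focal_correct, focal_wrong)
    for examinees at one total-score level.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_ok, ref_bad, focal_ok, focal_bad in strata:
        n = ref_ok + ref_bad + focal_ok + focal_bad
        if n == 0:
            continue  # skip empty score levels
        numerator += ref_ok * focal_bad / n
        denominator += ref_bad * focal_ok / n
    return numerator / denominator


def mh_delta(odds_ratio: float) -> float:
    """ETS delta metric: negative values mean the item favours the reference group."""
    return -2.35 * log(odds_ratio)


# Hypothetical item with the same odds of success in both groups at each level.
fair_item = [(30, 10, 15, 5), (40, 10, 20, 5)]
# Hypothetical item that is harder for the focal group at matched ability.
biased_item = [(30, 10, 10, 10), (40, 10, 12, 8)]

print(mantel_haenszel_odds_ratio(fair_item))               # 1.0
print(round(mh_delta(mantel_haenszel_odds_ratio(biased_item)), 2))
```

An odds ratio near 1 (delta near 0) means that, among examinees with the same overall score, the two groups are about equally likely to answer the item correctly.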

Slide 14

Chapter XII: The MMPI.

MMPI history. MMPI-1 (1943), Hathaway & McKinley; MMPI-2 (1989), restandardised on a more representative sample; MMPI-2-RF (2008), restructured; MMPI-3 (2020), current.
The Minnesota Multiphasic Personality Inventory is the most-used clinical personality assessment. Developed in 1943 at the University of Minnesota by Starke Hathaway (psychologist) and J. Charnley McKinley (psychiatrist), originally to support psychiatric diagnosis.
The original MMPI used empirical keying: items were not selected for face validity but for their ability to differentiate clinical groups (depressed patients vs controls, schizophrenia patients vs controls, etc.) regardless of whether the items obviously related to the disorder. The approach minimised respondent strategising — you couldn't easily fake a depression score if the depression-keyed items included "I sometimes enjoy reading."
The 567-item MMPI-2 produces 10 clinical scales (Depression, Hysteria, Hypochondriasis, Psychopathic Deviate, Masculinity-Femininity, Paranoia, Psychasthenia, Schizophrenia, Hypomania, Social Introversion) plus validity scales designed to detect inconsistent or strategic responding (L, F, K, VRIN, TRIN).
The MMPI-3 (2020) modernises the norms (a 1,620-person normative sample reflecting current US demographics) and integrates the dimensional restructured form approach. It remains the dominant US clinical personality assessment.

Slide 15

Chapter XIII: NEO and Big Five inventories.

NEO-PI-R (1992). Costa & McCrae. 240 items, 6 facets per Big Five trait. The gold-standard Big Five inventory.
The major Big Five instruments:
NEO-PI-R (240 items, Costa & McCrae 1992; revised as NEO-PI-3 in 2010). 6 facets per trait, gold-standard for research use.
BFI-2 (60 items, Soto & John 2017). Public-domain. Three facets per trait. Briefer than NEO; broader use in survey research.
IPIP-NEO (Goldberg). Public-domain alternatives to the proprietary NEO. Versions of various lengths (50, 100, 120, 300 items). Available free online.
HEXACO-PI-R (Lee & Ashton). Six factors including Honesty-Humility.
Mini-IPIP, TIPI (Gosling, Rentfrow, Swann 2003) — very short scales (10-item TIPI, 20-item Mini-IPIP) for use when time is limited. Lower reliability than longer scales, but acceptable for some research purposes.
Differences in coverage
The Big Five inventories differ in their facet structures. Different inventories operationalise the Openness domain differently: some emphasise intellectual openness, others aesthetic sensitivity, others both. Researchers should be cautious about comparing scores across instruments.

Slide 16

Chapter XIV: Why the MBTI fails as measurement.

A psychometric verdict. Used by ~88% of Fortune 500 companies. Has poor reliability, weak predictive validity, and bimodality assumptions that the data do not support. Academic personality psychology does not use it.
The Myers-Briggs Type Indicator's psychometric problems are by now well-documented and widely (academically) acknowledged. Four major issues:
1. Bimodality assumption is wrong. The MBTI classifies people as I or E, S or N, T or F, J or P. Empirical distributions on the underlying dimensions are normal, not bimodal. Most people score near the middle and are classified arbitrarily.
2. Test-retest reliability is poor. About 50% of test-takers receive a different four-letter type on retest within five weeks. For a categorical measure that is essentially noise.
3. Predictive validity is weak. MBTI types do not predict job performance, leadership, relationship satisfaction, or other outcomes better than chance. Big Five traits do.
4. Construct validity is poor. The four MBTI dimensions either overlap heavily with Big Five traits (E with extraversion, T/F with agreeableness, J/P with conscientiousness, N/S with openness) or are not coherent psychological constructs in their own right (the J/P dimension in particular).
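Points 1 and 2 are linked, and a short calculation shows how. If a dimension's underlying scores are normal and people are split at the mean, then under a bivariate-normal assumption the chance of landing on the same side at retest is 1/2 + arcsin(r)/π, where r is the retest correlation of the continuous score. The sketch below applies that textbook result; the value r = 0.85 and the independence of the four dimensions are assumptions chosen for illustration, not published MBTI figures.

```python
from math import asin, pi


def same_side_probability(r: float) -> float:
    """P(same side of the mean at test and retest) for bivariate-normal
    scores with retest correlation r, dichotomised at the mean."""
    return 0.5 + asin(r) / pi


def same_type_probability(r: float, dimensions: int = 4) -> float:
    """P(identical letters on every dimension), assuming the dimensions
    are independent and share the same retest correlation."""
    return same_side_probability(r) ** dimensions


r = 0.85  # hypothetical retest correlation of each continuous scale
print(f"same letter on one dimension: {same_side_probability(r):.2f}")
print(f"same four-letter type:        {same_type_probability(r):.2f}")
```

Under these assumptions fewer than half of test-takers would keep the same four-letter type, even though a retest correlation of 0.85 would count as respectable for a continuous scale. The instability comes from cutting a normal distribution at its densest point.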
The MBTI persists for non-scientific reasons: it feels meaningful; it produces flattering descriptions; it has a robust commercial ecosystem (CPP/The Myers-Briggs Company sells the official version; many derivatives are free). The empirical case against using it for any high-stakes decision is overwhelming.

Slide 17

Chapter XV: The Implicit Association Test.

Greenwald, McGhee & Schwartz, 1998. Implicit Association Test. Measures the strength of automatic associations between concepts (e.g., race) and evaluative attributes (e.g., good/bad) using reaction-time differentials.
The IAT was introduced by Anthony Greenwald, Debbie McGhee, and Jordan Schwartz in 1998; Mahzarin Banaji and Brian Nosek were central to its later development and to Project Implicit. It measures the strength of automatic associations through differences in reaction time on classification tasks. Subjects are asked to classify words and images using two response keys; the keys are paired with different concepts. If "Black" and "bad" share a key, and "White" and "good" share a key, subjects with stronger automatic White-good and Black-bad associations respond faster than when the pairings are reversed.
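The reaction-time differential is usually summarised as a standardised difference between the two pairing conditions. The sketch below is a simplified version of that idea only: it omits the trial trimming, error penalties, and block averaging used in published scoring procedures, and the latencies are invented.

```python
from statistics import mean, stdev


def simplified_d_score(compatible_ms, incompatible_ms):
    """Mean latency difference between conditions, divided by the standard
    deviation of all trials from both conditions pooled together."""
    pooled_sd = stdev(compatible_ms + incompatible_ms)
    return (mean(incompatible_ms) - mean(compatible_ms)) / pooled_sd


# Hypothetical latencies in milliseconds for one participant.
compatible = [600, 650, 700, 620]     # pairing the participant finds easy
incompatible = [800, 850, 780, 820]   # reversed pairing

d = simplified_d_score(compatible, incompatible)
print(f"D = {d:.2f}")
```

A positive score means slower responses in the reversed pairing. With only a handful of noisy trials per condition, scores like this fluctuate from session to session, which is one route to the weak retest reliability discussed below.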
The Project Implicit website has administered the IAT to millions of people since 2002. The aggregate finding: most people, regardless of self-reported attitudes, show evidence of automatic biases favouring socially dominant groups.
The measurement debate
The IAT has substantial measurement-property concerns. Test-retest reliability is poor (r ≈ 0.4–0.5 for the race IAT). The relationship between IAT scores and actual discriminatory behaviour is weak (meta-analytic r ≈ 0.1, Oswald et al. 2013; Forscher et al. 2019). The construct of "implicit attitude" is contested.
The IAT's policy use — particularly in implicit-bias training programmes — has been criticised for outrunning the empirical foundation. The phenomenon of automatic associations is real; the IAT's specific psychometric properties are weaker than its widespread use suggests.

Slide 18

Chapter XVI: The polygraph problem.

A bad measure. The polygraph measures physiological arousal (heart rate, blood pressure, skin conductance, breathing). It does not, as a measurement instrument, reliably distinguish lying from truth-telling.
The polygraph is a useful case study in failed psychometrics. The modern instrument was built by John Larson in 1921, drawing on William Marston's earlier systolic blood-pressure deception test (Marston later co-created the Wonder Woman comic, in part using his lie-detection work as inspiration for the Lasso of Truth). It became widespread in US law enforcement, intelligence, and employment screening through the 20th century.
The fundamental measurement problem: physiological arousal is not specific to deception. Anxiety, anger, fear of the test itself, illness, medication effects, and many other conditions produce the same patterns the polygraph detects. The validity of the polygraph as a lie-detector has never been demonstrated to a scientific standard.
The 2003 National Research Council report The Polygraph and Lie Detection reviewed the available evidence and concluded that the polygraph's accuracy was poor, particularly for the high-stakes screening applications (employment, security clearances) for which it was widely used. The Employee Polygraph Protection Act (1988) had already banned most private-sector employment use; the federal-government use continues despite the negative scientific verdict.
The contemporary alternatives — fMRI-based deception detection (Langleben), guilty knowledge tests — have their own substantial measurement problems.

Slide 19

Chapter XVII: Test ethics.

The Standards. Standards for Educational and Psychological Testing, a joint AERA/APA/NCME publication. Currently the 2014 edition. The reference document for psychometric ethics and practice.
The use of psychological tests carries ethical and legal obligations that the discipline has formalised over decades. The Standards for Educational and Psychological Testing (most recent edition 2014) is the joint statement of the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education.
Core principles: tests should be used only for purposes for which their validity has been demonstrated; users should have appropriate training; test-takers should be informed of the test's purpose; results should be communicated with appropriate caveats; the use of tests in high-stakes decisions (admissions, employment, custody) carries elevated obligations for fairness and due-process review.
Legal framework
In US employment contexts, the Uniform Guidelines on Employee Selection Procedures (1978) regulate test use. Tests with disparate impact on protected groups must be shown to be job-related and consistent with business necessity. The legal-psychometric interface has produced substantial literature on test fairness, group-level equating, and the validity of selection procedures.
Most contemporary high-stakes psychological testing operates under elaborate quality-control regimes. The history of test misuse — particularly in the early 20th-century US immigration and IQ-testing era, where tests in English were administered to non-English speakers and the results used to argue against immigration — is a permanent cautionary tale.

Slide 20

Chapter XVIII: Test fairness across groups.

Measurement invariance. The technical question: does a test measure the same construct in the same way across groups? Tested via multi-group CFA at three levels: configural, metric, and scalar invariance.
A test that is reliable and valid in one population may not be in another. The technical concept is measurement invariance: the property that a test measures the same underlying construct in the same way across groups (cultures, languages, age cohorts, genders).
Three levels of invariance: configural (same factor structure across groups); metric (same factor loadings); scalar (same intercepts). Scalar invariance is required for meaningful cross-group comparison of mean scores.
Many widely-used psychological tests do not achieve scalar invariance across cultures, which means cross-cultural mean comparisons are technically problematic even when the tests are reliable within each culture. The Big Five structure has held up reasonably well across many languages; cross-cultural mean comparisons of Big Five traits are nonetheless contested.
Differential Item Functioning
DIF analysis (above) identifies specific items that function differently across groups at the same underlying ability level. Tests used in high-stakes contexts (SAT, GRE, employment screening) routinely undergo DIF screening with items that show meaningful DIF being removed or revised.

Slide 21

Chapter XIX: Measurement from digital traces.

Kosinski et al., 2013. "Private traits and attributes are predictable from digital records of human behavior." 58,000 Facebook users; their Likes predicted Big Five traits, sexual orientation, religion, political views with accuracy approaching that of close friends.
Michal Kosinski and David Stillwell's myPersonality project (2007–2018) collected Facebook data and personality measures from millions of consenting users. The studies that followed established that digital footprints could predict personality traits with substantial accuracy: from Facebook Likes (Kosinski, Stillwell & Graepel 2013), social-media language (Park et al. 2015), and smartphone sensor data (Stachl et al. 2020).
Stachl et al. (2020) used 30 consecutive days of smartphone-sensor data (call logs, app usage, music, location movement patterns) and predicted Big Five traits at correlations approaching r = 0.4 with self-report, comparable to informant-report accuracy.
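As a rough reading aid, a correlation can be squared to give the share of variance in self-reported scores that a linear prediction accounts for. The arithmetic below takes the r = 0.4 figure quoted above as an upper bound; it is an interpretation added here, not a result reported in the paper:

```latex
r \approx 0.4 \quad\Rightarrow\quad r^2 \approx 0.4^2 = 0.16
```

So roughly 16% of the variance in self-reported trait scores is shared with the sensor-based predictions. That is informative at the level of groups, yet it leaves most variation between individuals unexplained.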
The implication: stable individual differences leave detectable traces in digital behaviour. The privacy and autonomy implications are substantial. The 2018 Cambridge Analytica scandal — using Kosinski-style methods, allegedly to target political advertising — accelerated public scrutiny of psychometric inference from digital data.
What this means for measurement
The traditional self-report inventory may be supplemented or partially replaced by passive measurement from digital traces. The validity may be comparable; the implications for consent, manipulation, and surveillance are more troubling than self-report's. The field's ethical apparatus has not yet caught up with the technical capabilities.

Slide 22

Chapter XX What has held up.

Replication picture · XXII
Findings strongly supported by replication and continuous accumulation: the existence and predictive validity of g; the Big Five factor structure across languages; the validity of the Wechsler intelligence batteries; the heritability of intelligence and Big Five traits; the basic Flynn effect; the reliability characteristics of well-designed inventories; the value of the multitrait-multimethod approach.
Findings substantially weakened: the strong-form Bell Curve claims about between-group genetic causes; the stereotype-threat literature in its strongest form (effects are real but smaller than originally reported); the IAT's predictive validity for actual discrimination; the polygraph as a lie-detection device; the MBTI as a serious measurement instrument.
The discipline's empirical core — that psychological constructs can be measured with care, that some constructs are well-supported and some are not, that the validity of any test depends on the use to which it is put — is robust. The applied claims have been more variable.

Slide 23

Chapter XXI Twenty-five works.

Reading List · XXIII
1869 · Hereditary Genius · Galton
1904 · "'General Intelligence' Objectively Determined and Measured" · Spearman
1905 · Binet-Simon Intelligence Scale · Binet & Simon
1916 · Stanford-Binet · Terman
1938 · Primary Mental Abilities · Thurstone
1939 · Wechsler-Bellevue Intelligence Scale · Wechsler
1943 · MMPI-1 · Hathaway & McKinley
1951 · "Coefficient alpha and the internal structure of tests" · Cronbach
1955 · "Construct Validity in Psychological Tests" · Cronbach & Meehl
1959 · "Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix" · Campbell & Fiske
1981 · The Mismeasure of Man · Gould (critical)
1984 · "The Mean IQ of Americans: Massive gains 1932 to 1978" · Flynn
1987 · "Massive IQ gains in 14 nations" · Flynn
1992 · NEO-PI-R Manual · Costa & McCrae
1994 · The Bell Curve · Herrnstein & Murray
1995 · "Stereotype threat and intellectual test performance" · Steele & Aronson
1995 · "Intelligence: Knowns and Unknowns" APA report · Neisser et al.
1998 · "Measuring Individual Differences in Implicit Cognition" · Greenwald, McGhee & Schwartz
2001 · The Psychometrics of Intelligence · Jensen
2007 · What Is Intelligence? · Flynn
2013 · "Private traits and attributes from digital records" · Kosinski et al.
2014 · Standards for Educational and Psychological Testing · AERA/APA/NCME
2018 · "Flynn effect and its reversal" (Norway) · Bratsberg & Rogeberg
2020 · "Predicting Personality from Patterns of Behavior Collected with Smartphones" · Stachl et al.
2023 · "Income and Emotional Well-Being: A Conflict Resolved" · Killingsworth, Kahneman, Mellers

Slide 24

Chapter XXII Watch & read.

Watch & Read · XXIV
Watch · Russell T. Warne · What is the Flynn Effect?
Watch · Richard Haier · The Bell Curve controversy
Watch · 16 Personalities, the Big 5, and MBTI
Read
For an entry-level treatment: Earl Hunt's Human Intelligence (2011). For depth: Lord & Novick's Statistical Theories of Mental Test Scores (1968), the classical-test-theory bible. For the political controversies: Stephen Jay Gould's The Mismeasure of Man (1981, revised 1996), the major popular critique of IQ-testing practice; it is not without its own technical errors, but the historical-critical case is still important. For practical test design: Suen's Principles of Test Theories (1990), or de Ayala's The Theory and Practice of Item Response Theory (2009). The Standards for Educational and Psychological Testing (2014) is the working reference.

Slide 25

Chapter XXIII What's next.

Frontiers · XXV
Three frontiers shape the discipline in the late 2020s.
Machine-learning measurement
Predictive models trained on large naturalistic data (smartphone sensors, social-media language, voice recordings, video of facial expression) are increasingly competing with self-report inventories. The technical performance is reaching parity for some constructs. The privacy and consent implications are unsolved.
Network and dimensional psychopathology
The HiTOP framework (Robert Krueger and colleagues, 2017) reorganises psychopathology dimensionally rather than categorically, with empirical factor structure replacing DSM categorical boundaries. Network-analysis approaches (Borsboom and colleagues) treat disorders not as latent constructs but as networks of mutually-reinforcing symptoms. Both frameworks may eventually replace classical psychometric measurement of psychopathology.
Cross-cultural standardisation
The major personality and intelligence inventories were standardised on Western samples and then exported. The reverse process — building inventories from non-Western lexical and behavioural data — is slowly underway. The result will be a more globally calibrated measurement infrastructure and, possibly, a richer typology of constructs that the Western tradition has missed.

Slide 26

The end of the deck.

Colophon · XXVI
Psychometrics — Volume XII, Deck 10 of The Deck Catalog. Set in Inter and Tiempos Text. Off-white #f6f6f4; navy ink with scientific orange and steel-blue accents. Mathematical notation in monospace.
Twenty-four leaves on the science of measuring psychological constructs. The framework is rigorous; the misuse is constant; the discipline carries its founder's eugenic legacy and works around it. Numbers are not neutral. Measurement is a moral act.
FINIS
