Matthew Scillitani on Measuring the Extreme Right Tail with a Supervised, Timed High-Range Ability Test
Matthew Scillitani: How does a supervised, timed high-range ability test like The Mental Inventor aim to differentiate performance at the extreme right tail while addressing validity, reliability, AI-assisted cheating, and proctoring integrity?
Matthew Scillitani is a psychometrics practitioner at Neurolus Psychometrics focused on developing supervised, time-limited high-range ability examinations. He co-launched The Mental Inventor with Paul Cooijmans as an empirical testbed for a central measurement question: whether performances can be validly differentiated in the extreme right tail under proctored conditions. His approach emphasizes procedural integrity—identity verification, approved proctoring, and rule enforcement—alongside cautious claims about interpretation until reliability and validity evidence is established. He highlights emerging threats to unsupervised testing, including AI-assisted responding and large-scale collaboration, and advocates peer review before formal reclassification.
In this exchange, Scott Douglas Jacobsen interviews Matthew Scillitani about The Mental Inventor, a supervised, timed high-range ability exam designed to explore whether performance can be meaningfully differentiated at the extreme right tail under proctored conditions. Scillitani avoids claiming it measures “intelligence” until validity evidence exists, citing regulatory constraints and the need for peer review. He frames low reliability as a decisive falsifier, emphasizes supervision to deter AI and collaboration, and explains proctor approval and identity verification procedures. Scores are reported cautiously via a preliminary conversion table, with analyses planned as data accumulates.
Scott Douglas Jacobsen: What construct is The Mental Inventor intended to measure?
Matthew Scillitani: Our long-term goal is to investigate whether performances can be validly differentiated in the extreme right tail under supervised, time-limited conditions. That said, to avoid speculation, we refrain from claiming that The Mental Inventor is intended to measure any particular psychological construct.
For context, in the United States, most states have laws prohibiting the unlicensed practice of psychology. Even non-clinical tests fall within the scope of this very complex regulatory environment. So we can't make claims about what a test measures until there's strong empirical evidence for it.
Jacobsen: What would empirically support this construct as measured by The Mental Inventor?
Scillitani: After enough exam sittings occur, we can begin meaningful data analyses, including correlational work such as estimating the exam's g loading. If analysis suggests that the exam validly measures intelligence, the findings will be peer reviewed before any conclusions are published.
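To illustrate the kind of analysis this involves (a generic sketch with simulated data, not Neurolus's actual pipeline), a test's g loading is often approximated by its loading on the first principal factor of a battery's correlation matrix:

```python
# Generic sketch: approximate g loadings as loadings on the first
# principal component of a battery's correlation matrix.
# All scores are simulated; a real analysis would use actual sittings.
import numpy as np

rng = np.random.default_rng(0)
n = 500
g = rng.standard_normal(n)  # latent common factor

battery = np.column_stack([
    0.8 * g + 0.6 * rng.standard_normal(n),  # reference test 1
    0.7 * g + 0.7 * rng.standard_normal(n),  # reference test 2
    0.6 * g + 0.8 * rng.standard_normal(n),  # candidate exam
])

R = np.corrcoef(battery, rowvar=False)   # 3x3 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)     # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                     # first principal component
loadings = np.abs(pc1) * np.sqrt(eigvals[-1])
print(np.round(loadings, 2))             # rough "g loadings" per test
```

In practice, a proper validation study would use established reference tests rather than simulated scores, and a formal factor model rather than this principal-component shortcut.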
If Neurolus and its third-party peer reviewer agree that an exam functions well as an I.Q. test, it'll be reclassified accordingly, and our testing procedures will change. For example, we'd no longer report scores directly to candidates. Where legally appropriate, scores would be released through a licensed psychologist or an accepting high-I.Q. society.
Jacobsen: What would falsify it?
Scillitani: Low reliability would immediately falsify it. This is because a test's reliability sets an upper bound on its correlations with anything else. Put differently, insufficient internal consistency caps any relationship the test could have with general intelligence.
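In classical test theory, this is the attenuation bound: a test's observed correlation with any criterion cannot exceed the geometric mean of the two instruments' reliabilities,

$$|r_{XY}| \le \sqrt{r_{XX'}\, r_{YY'}}$$

so a reliability of, say, 0.70 caps the exam's observable correlation with even a perfectly measured criterion at about $\sqrt{0.70} \approx 0.84$.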
An exam can also be highly reliable and still have a low g loading. In that case, it may just be measuring a task-specific, probably learned skill. We’ll better understand the validity and reliability after the validation study.
Jacobsen: Why specify the format as supervised and timed?
Scillitani: The primary reason is artificial intelligence. Some large language models are getting very smart, very fast. I expect that unsupervised tests will be unusable within a decade because of this.
There's also the growing problem of cheating by collaboration. Recently, I’ve learned of groups comprising tens to hundreds of members sharing or selling answers. While still possible, collaboration is more difficult on a supervised exam.
The exam's duration was set at three hours for practical reasons. Not many proctors are willing to go longer than that, and probably many candidates wouldn’t either. In any case, an even longer time limit can be prohibitively expensive for many.
Jacobsen: What does time pressure add at the extreme right tail?
Scillitani: Time pressure introduces a strategic element that's mostly absent in untimed tests. For example, candidates have to determine whether they can solve an item, how long it will take, and whether it's worth spending time on.
This requires the ability to anticipate subjective item difficulty, estimate time to solve, and manage limited time resources. Generally speaking, those types of decisions are probably g-loaded as well, but we shouldn’t speculate too much.
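As a toy illustration of that trade-off (my own example, not part of the exam's design), the decision can be framed as ranking items by expected points per minute and attempting the best-rate items first:

```python
# Toy time-allocation model: rank items by expected points per minute
# and attempt the best ones first. All numbers are invented.
items = [
    {"name": "A", "p_solve": 0.90, "minutes": 5},
    {"name": "B", "p_solve": 0.50, "minutes": 20},
    {"name": "C", "p_solve": 0.20, "minutes": 35},
]

for item in items:
    item["rate"] = item["p_solve"] / item["minutes"]  # expected points/min

budget = 30  # minutes remaining
plan = []
for item in sorted(items, key=lambda x: x["rate"], reverse=True):
    if item["minutes"] <= budget:
        plan.append(item["name"])
        budget -= item["minutes"]

print(plan)  # ['A', 'B']: item C is not worth the remaining time
```

Real candidates, of course, make these judgments intuitively and under uncertainty, which is part of what makes the timed format interesting.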
Jacobsen: What is the proctoring protocol?
Scillitani: This was the second-most challenging obstacle for this project. The exam was initially going to be released only in the United States using a network of community colleges, but we later decided to release it globally, requiring a different model.
Our solution was to have candidates find their own proctor, subject to approval. Acceptable proctors currently include libraries, universities, notaries, and private invigilators. Candidates submit their proposed proctor's information before scheduling the exam; I manually review each submission and either approve or reject it.
On exam day, the proctor verifies a candidate’s identity using their government-issued photo I.D. If identity can’t be verified, the sitting is cancelled.
Attempts will also be considered invalid if a candidate breaks any rules, like trying to bring unauthorized materials into the testing area.
Jacobsen: How are scores scaled and interpreted?
Scillitani: To avoid overstatement, we've published a preliminary score conversion table rather than a norms table. This is because we currently have very limited data and want to be cautious. The table maps raw scores from 0 to 40 into scaled scores ranging from 120 to 199. A formal norms table will be published later as more data comes in.
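For illustration only, a conversion table like this can be implemented as a simple mapping. The linear interpolation below is an assumption for the sketch; the published table is the sole authority for actual conversions:

```python
# Hypothetical raw-to-scaled conversion. The published preliminary table
# is the actual authority; this linear mapping only illustrates the
# mechanics of converting raw 0-40 onto the 120-199 scaled range.
def scale_score(raw: int) -> int:
    if not 0 <= raw <= 40:
        raise ValueError("raw score must be between 0 and 40")
    return round(120 + raw * (199 - 120) / 40)

print(scale_score(0), scale_score(20), scale_score(40))  # 120 160 199
```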
Regardless of validation status, we'll never interpret scores. Candidates won't be told that they're "above average" or "gifted" or anything similar. Scores are only reported, not explained.
Jacobsen: What reliability evidence has been collected?
Scillitani: None yet; the exam launched only a few weeks ago at the time of writing. If the participation rate holds steady, the first statistical report is expected in late 2026. That will give us our first picture of the exam's reliability.
Jacobsen: What is the likely eventual reliability and sensitivity to conditions?
Scillitani: At present, that’s unknown. Retesting isn’t permitted, so test-retest reliability can’t be calculated. We’ll instead have to rely on split-half reliability, which provides an indirect estimate by comparing performance across two halves of the test.
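A standard way to compute this (a generic sketch with simulated responses, not Neurolus's code) is to correlate the two half-test totals and then apply the Spearman-Brown correction, which steps the half-test correlation up to an estimate for the full-length test:

```python
# Generic split-half reliability with the Spearman-Brown correction.
# Item responses are simulated; a real analysis would use actual data.
import numpy as np

rng = np.random.default_rng(1)
n_candidates, n_items = 200, 40
ability = rng.standard_normal((n_candidates, 1))
noise = rng.standard_normal((n_candidates, n_items))
responses = (ability + noise > 0).astype(int)  # 0/1 item scores

odd_total = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_total = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

r_half = np.corrcoef(odd_total, even_total)[0, 1]
r_full = 2 * r_half / (1 + r_half)           # Spearman-Brown step-up
print(round(r_half, 2), round(r_full, 2))
```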
Proctor variability is a serious concern, and something I often think about. If certain types of proctors consistently fail to follow instructions, that group will be removed as an option, and those sittings may be invalidated to protect data integrity.
Jacobsen: How will you test whether this is g-loaded versus a specialist puzzle skill?
Scillitani: Through the validation process. If analysis suggests meaningful g-related variance, a peer reviewer will independently review the data to verify. If results aren't significant yet, analyses will be repeated at predetermined data collection intervals.
In the end, if the exam only measures task-specific non-g puzzle skills, we’ll still publish that finding.
Jacobsen: How do you prevent prize-competition dynamics from contaminating results?
Scillitani: In practice, most candidates who attempt these types of exams are already intrinsically motivated. This is because there are no serious external stakes in the conventional sense. For example, performance doesn't affect whether you'll be accepted by your dream university or employer.
Many candidates simply enjoy solving very challenging puzzle-like problems. Others may be motivated by the competition, but motivation is generally high in either case. And I suspect that the source of motivation is less meaningful than its presence.
Jacobsen: Thank you very much for the opportunity and your time, Matthew.
Scott Douglas Jacobsen is a blogger on Vocal with over 120 posts on the platform. He is the publisher of In-Sight Publishing (ISBN: 978-1-0692343) and the Editor-in-Chief of In-Sight: Interviews (ISSN: 2369-6885). He writes for The Good Men Project, International Policy Digest (ISSN: 2332-9416), The Humanist (Print: ISSN 0018-7399; Online: ISSN 2163-3576), Basic Income Earth Network (UK Registered Charity 1177066), A Further Inquiry, The Washington Outsider, The Rabble, and other media. He is a member in good standing of numerous media associations/organizations.