How to assess engineers who use AI.
Every engineer now codes with an AI agent at their side. That breaks the closed-book, algorithm-puzzle interview, which was built to test thinking without tools. The job has changed; the test has to measure how someone reasons with AI. Here is what to measure, and how to run an assessment that actually shows it.
Last updated 11 June 2026
The old technical interview asked one question: can you solve this puzzle without help? In a job where AI is always within reach, that question is both easy to cheat and irrelevant to the work.
Assess the thing that still separates good engineers from dangerous ones: judgment. Did they catch the AI agent when it drifted? Did they review what got shipped? Did they understand the consequences before they deployed? Measure that on a real task, in the candidate's own environment, with their own tools, and ground every conclusion in evidence from the session.
Why the old test broke
The puzzle interview measures a constraint that no longer exists.
Technical interviews were designed before AI could write code, to test whether an engineer could reason without tools. That made sense when the job looked like that. It does not anymore. The constraint the test enforces (no Google, no AI, no help) is the opposite of how the work actually gets done.
Two things followed. First, the format became trivially gameable. A Columbia student built Interview Coder, a tool marketed as undetectable for passing live coding interviews, then raised millions for a company whose pitch was, openly, to "cheat on everything". If your screening interview can be beaten by a hidden assistant, it is not measuring what you think it is.
Second, the format stopped predicting the job. Engineers already work with agents: in a 2026 survey of U.S. engineers, CodeSignal reported that 91% already use agentic AI tools and most had shipped AI-generated code in the prior six months. Testing whether someone can reverse a binary tree from memory tells you nothing about whether they can steer an agent, catch its mistakes, and ship safely.
The real risk in AI-era hiring is not the engineer who is slow without tools. It is the confident one who ships fast on AI without judgment, the hire who looks great until an agent confidently breaks production.
What to measure instead
A framework: judgment, made observable.
If the answer is no longer the signal, the work that produced it is. A modern assessment should read the candidate's whole approach across a handful of competencies that hold up whether or not AI is involved. Eleven hold up well in practice:
- Navigation & exploration — how they orient in an unfamiliar codebase before changing it.
- Hypothesis formation & testing — whether they form a theory of the bug and check it, or guess.
- Mental-model construction — how accurately they build a model of the system.
- Process discipline — whether their work is orderly or thrashing.
- Tool mastery & metacognition — how well they steer their AI tools, and whether they notice when they are off track.
- Fix design & planning — whether the fix is scoped and deliberate.
- Implementation quality — the craft of the change itself.
- Testing & verification — whether they actually prove the fix works.
- Safety & robustness — attention to edge cases, side effects, and what could break.
- Code review & self-review — whether they review their own and the AI's output before shipping.
- Metacognition — whether they reflect, correct course, and know what they do not know.
The through-line is the same question asked many ways: when the agent was confidently wrong, did the human notice and fix it? That is judgment, and it is observable if you look at the right things.
How to run it
A real task, their tools, the whole trajectory.
Use a real, role-matched task. A merged open-source bug-fix in a codebase like the one they would actually work in beats an abstract puzzle. It is close to the job, hard to pre-memorize, and gives the candidate something genuine to reason about.
Let them use their own environment and AI tools. Their own editor, browser, and assistants, with no constraints. You learn how they actually work, not how they perform in a locked-down sandbox.
Capture the trajectory, not just the diff. The signal lives in the path: how they explored, the prompts they wrote and what the AI said back, the commands they ran, the tests, the reverts. Capturing that, with the candidate's consent and with personal data filtered out, is what makes judgment visible.
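As a rough illustration of what a captured trajectory could look like, the sketch below stores the session as an ordered list of timestamped events and refuses to record anything without consent. The event kinds, the `Event` fields, and the email-only `redact` filter are assumptions made for this example, not a description of any particular product; real PII filtering would need to cover far more than email addresses.

```python
import re
from dataclasses import dataclass

# Hypothetical event kinds; a real capture tool would define its own.
EVENT_KINDS = {"file_open", "prompt", "ai_response", "command", "test_run", "revert"}

# Deliberately simple pattern: illustrative only, not a complete PII filter.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class Event:
    seconds: float  # time since the session started
    kind: str       # one of EVENT_KINDS
    detail: str     # prompt text, command line, file path, and so on


def redact(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL.sub("[redacted-email]", text)


def record(log: list, seconds: float, kind: str, detail: str, consent: bool) -> None:
    """Append one redacted event, refusing to record without consent."""
    if not consent:
        raise PermissionError("candidate has not consented to capture")
    if kind not in EVENT_KINDS:
        raise ValueError(f"unknown event kind: {kind}")
    log.append(Event(seconds, kind, redact(detail)))
```

Keeping the log append-only and ordered by time is what lets a reviewer replay the path (exploration, prompts, commands, tests, reverts) rather than only inspect the final diff.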
Separate the human from the agent. When you can see what the candidate did versus what the AI did, human-versus-AI authorship stops being a guess. That distinction is the heart of AI-era evaluation.
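One simple way to make that split concrete, assuming each captured edit is already labelled with who produced it, is to total the changed lines per actor. The `Edit` record and the `"human"` / `"agent"` labels are hypothetical, and lines changed is a crude proxy: it says nothing about whether the human read, tested, or rewrote what the agent produced.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    actor: str          # "human" or "agent" (hypothetical labels)
    path: str           # file that was changed
    lines_changed: int  # size of the change


def authorship_share(edits: list) -> dict:
    """Return each actor's fraction of all changed lines."""
    totals = Counter()
    for edit in edits:
        totals[edit.actor] += edit.lines_changed
    total = sum(totals.values())
    if total == 0:
        return {}
    return {actor: count / total for actor, count in totals.items()}
```

A share like this is a starting point for the two profiles, not a verdict; it is most useful next to the trajectory, where a reviewer can see what the human did with the agent's output.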
Score with evidence, and keep it async. Every rating should point to a specific moment in the session. Done well, this is fully asynchronous: the candidate works on their own time, and you read an evidence-backed report instead of running a live interrogation.
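The "every rating points to a moment" rule can be checked mechanically. The sketch below assumes a rating carries a list of session timestamps as its evidence and flags any competency whose rating cites nothing inside the session; the `Rating` fields and the 1 to 4 band scale are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    competency: str          # e.g. "Testing & verification"
    band: int                # hypothetical 1 to 4 scale
    evidence_seconds: tuple  # timestamps (seconds) into the session


def unsupported(ratings: list, session_length: float) -> list:
    """Return competencies whose rating cites no moment inside the session."""
    missing = []
    for rating in ratings:
        valid = [t for t in rating.evidence_seconds if 0 <= t <= session_length]
        if not valid:
            missing.append(rating.competency)
    return missing
```

A report generator could refuse to publish while `unsupported(...)` returns anything, which keeps ungrounded opinions out of the scorecard.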
Old way vs new way
The shift in one table.
| Dimension | Closed-book puzzle interview | AI-era assessment |
|---|---|---|
| The task | Abstract algorithm puzzle | A real, role-matched bug-fix |
| Tools allowed | None; AI banned and policed | The candidate's own AI tools, no constraints |
| What it measures | Whether the answer passed | How they reasoned and steered the AI |
| Environment | Locked-down sandbox | The candidate's own machine and editor |
| The signal | A single score or pass/fail | Evidence-backed judgment across competencies |
| Failure mode | Gamed by a hidden AI helper | Nothing to hide; using AI is the point |
What to avoid
Four common mistakes.
Banning AI. It tests a constraint the job does not have and starts a detection arms race you will lose. Measure how AI is used; do not forbid it.
Trusting a single number. A pass/fail score throws away everything that actually predicts performance: the reasoning, the review, the recovery from a wrong turn.
Leaning on proctoring. Webcam monitoring and lockdown browsers escalate the arms race and punish honest candidates, without telling you whether someone has good judgment.
Testing the wrong work. Algorithm trivia unrelated to the role measures test-prep, not the job. Use a task that looks like the work.
Where NextHire fits
This framework, as a product.
NextHire AI is built to run exactly this kind of assessment. Candidates fix a real, role-matched open-source bug in their own IDE with their own AI tools. NextHire captures the full trajectory, with consent and PII filtering, and produces a separate developer profile and AI-agent profile, so you can see what the human did versus what the agent did. The output is a five-minute scorecard across the 11 competency clusters above, each band grounded in what actually happened in the session, with coherence and evidence checks to guard against gamed runs. If you want to see how it compares to incumbents, read NextHire vs HackerRank and NextHire vs CodeSignal.
FAQ
Questions teams ask
Is LeetCode dead?
As a hiring filter, mostly yes. Closed-book, algorithm-puzzle interviews were designed to test whether an engineer could think without tools, and AI has made them both trivially gameable and unrepresentative of real work. LeetCode practice still has value for learning data structures, but a LeetCode score no longer tells you whether someone is a good engineer in a job where they will work with AI every day.
Should you let candidates use AI in technical interviews?
Yes. Banning AI tests a constraint that does not exist in the real job and sets up an unwinnable detection arms race. The better approach is to let candidates use their own AI tools and evaluate their judgment: whether they catch the agent when it drifts, whether they review what gets shipped, and whether they understand the consequences before deploying.
How do you stop AI cheating in coding interviews?
Stop trying to detect AI and start measuring how it is used. When the assessment is a real, role-specific task done in the candidate's own environment with AI allowed, there is nothing to cheat on, because using AI is the point. The signal is in how they steer the tools and review the output, which a hidden AI helper cannot fake on a realistic task.
What should a modern coding assessment measure?
How an engineer reasons, not just whether the code passed. That means their navigation and exploration of an unfamiliar codebase, how they form and test hypotheses, the quality of their fix and tests, their attention to safety and edge cases, how well they review their own and the AI's work, and their metacognition. The output should be evidence-backed, grounded in what actually happened in the session.
How do you assess senior or AI-native engineers?
Give them a real task close to the job, in their own environment, with their own AI tools, and look at the whole trajectory rather than the final answer. The differentiator at senior level is judgment under realistic conditions: spotting when an AI agent is confidently wrong, scoping a fix, verifying it, and knowing what is safe to ship. A single pass/fail number cannot capture any of that.
Watch an engineer reason with AI.
NextHire runs this exact assessment and hands you a five-minute, evidence-backed scorecard. $150 in free credits and two sample reports, no credit card.
Sources: NBC News and TechCrunch on Interview Coder / Cluely; CodeSignal's 2026 engineer survey (vendor survey; attributed as such). Company and product names are trademarks of their respective owners.