In June 2025, I was prompting AI with phrases like "game-changing legal assistant." By April 2026, I had a scored audit across seven dimensions of AI effectiveness and was challenging outputs as a default habit. This is the honest account of what changed, what the scores revealed, and what the remaining 27 points say about where good AI use is still developing.

I typed: "Can you help me craft a concept for this game-changing legal assistant?"
That was June 2025. No context, no role, no constraints. The phrase "game-changing" was doing the work that specificity should have done. I was treating AI like a search engine with better sentence structure. The output I got was polished and useless in equal measure. I didn't challenge it. I copied pieces of it and moved on.
By April 2026, the same prompt read differently: "We're designing an AI assistant for UK legal professionals. The product must not give legal advice. Here are the three core use cases – review each one against this constraint and flag where it breaks." The gap between those two prompts is not about vocabulary. It's about forming a working theory of what AI is for and what it isn't.
The project got complex. As the product grew – AI features, compliance constraints, integration work across multiple systems – vague prompts stopped producing usable output. "Can you help me design this?" started returning responses that were technically coherent but structurally wrong. I'd iterate, accept a slightly better version, and move on. Weeks later, something downstream would break.
The shift wasn't a single moment. It was a gradual realisation that AI outputs are confident by default. They don't flag their own weaknesses, they don't flag yours, and they cannot tell when you've asked the wrong question. The better the output looks, the more dangerous it is to accept without challenge.
That's when the prompts changed. Less "Help me make this." More "Audit what we have – what's in here that we don't need?"
In April 2026, I asked Claude to assess my effectiveness as an AI user across seven dimensions. The first score came back as 67 out of 100, against an estimated average user score of 38. I challenged one dimension – loop closure, scored 40 – explaining that it was handled in a separate project Claude couldn't access. The score was revised to 73. The breakdown told a clearer story than the total.
Challenging outputs: 85 (average user: 25). The biggest gap. Assuming the AI might be wrong and pushing back. Rarer than it sounds.
Working with evidence: 80 (average: 30). Iterating with real artefacts and research data rather than abstract questions.
Persistent context: 78 (average: 20). Structured project knowledge means Claude isn't starting from zero each session.
Loop closure: 72 (average: 35). Tracking decisions and outcomes separately from active design work.
Proactive exploration: 62 (average: 40). Using AI before the work exists is identified as a growth area.
Role-setting: 60 (average: 38). Inconsistent. Sometimes omitted entirely.
Prompt specificity: 58 (average: 42). The dimension closest to the average user and the one most designers focus on improving first.
The result that surprised me: prompt specificity – the dimension I'd spent the most effort on – came back closest to average. Challenging outputs – the habit I'd developed almost unconsciously – came back with the biggest gap.
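The per-dimension gaps are easy to check with a few lines of arithmetic. The sketch below uses only the scores quoted above. The names `scores` and `gaps` are illustrative, and because the audit's weighting for the overall total isn't stated, it makes no attempt to reproduce the 73.

```python
# Scores as reported in the audit: (my score, estimated average user score).
scores = {
    "Challenging outputs": (85, 25),
    "Working with evidence": (80, 30),
    "Persistent context": (78, 20),
    "Loop closure": (72, 35),
    "Proactive exploration": (62, 40),
    "Role-setting": (60, 38),
    "Prompt specificity": (58, 42),
}

# Gap = my score minus the estimated average user's score, per dimension.
gaps = {name: mine - avg for name, (mine, avg) in scores.items()}

# Print the dimensions from the widest gap to the narrowest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{gap}")

largest = max(gaps, key=gaps.get)    # dimension furthest above average
smallest = min(gaps, key=gaps.get)   # dimension closest to average
```

On these numbers the widest gap is challenging outputs (60 points) and the narrowest is prompt specificity (16 points), with persistent context a close second at 58.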
The most valuable AI skill isn't writing better prompts. It's reading the outputs more sceptically.
A specific prompt doesn't help if you accept the answer without pressure-testing it. The habit of assuming the AI might be wrong – asking "What have you missed?" rather than "How can I make this better?" – changes the quality of the output more than the precision of the original question. Most AI users are optimising for the input. The leverage is in the response to the output.
The other significant gap is persistent context. AI has no memory between sessions. You have to build the context it needs – structured project knowledge, decision logs, open questions – and maintain it deliberately. The designers doing this are in a small minority. Most people have a chaotic conversation history. Treating information architecture as a habit that applies to your own AI usage, not just to your product, is the thing that compounds.
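One way to picture maintained context is as a small structured record that gets rendered into a preamble at the start of each session. This is a minimal sketch and not a description of any particular tool: the `ProjectContext` class, its field names, and the example open question are all hypothetical, while the summary and constraint are taken from the April 2026 prompt quoted earlier.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectContext:
    """Hypothetical container for the context a new session needs."""
    summary: str
    constraints: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def to_preamble(self) -> str:
        """Render the non-empty sections as plain text to paste into a session."""
        sections = [
            ("Project", [self.summary]),
            ("Constraints", self.constraints),
            ("Decisions so far", self.decisions),
            ("Open questions", self.open_questions),
        ]
        lines = []
        for title, items in sections:
            if not items:
                continue  # skip sections with nothing recorded yet
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


context = ProjectContext(
    summary="AI assistant for UK legal professionals",
    constraints=["The product must not give legal advice"],
    # Illustrative open question, not taken from the real project.
    open_questions=["Which of the three core use cases break the constraint?"],
)
preamble = context.to_preamble()
print(preamble)
```

The format matters less than the habit: decisions and open questions live somewhere outside the chat history, and get updated as the project moves.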
73 out of 100 means 27 points unaccounted for. Proactive exploration is the hardest gap to close. It means prompting AI before the work exists – using it to stress-test early assumptions, generate alternatives, and challenge direction before committing to it. That's a different mode from using AI to make existing work better. It requires treating the tool as a thinking partner rather than a production assistant.
Role-setting is simply inconsistent. Sometimes deliberate, sometimes omitted. No good reason for the variation. Just habit.
The note from the audit that stayed with me: "The remaining gaps are structural, not behavioural. They represent the frontier of current AI interaction patterns – not individual failure." Reaching 73 required behavioural change. The next 27 require building interaction models that most people aren't using yet. That framing is either reassuring or a convenient excuse. Probably both.
I'm not a more productive designer because of AI. I'm a differently productive designer. Some work is faster. Some is more thorough. Some I now question in ways I didn't before – because knowing that AI will confidently produce something wrong makes you a more sceptical reader of confident-looking output, wherever it comes from.
The score isn't the point. The audit was a forcing function to make the practice visible – to describe what was happening rather than just doing it. Most improvements come from the moment you have to articulate what you're actually doing. 67. Then 73. Neither number matters much. What matters is the gap between the prompts.