Grading AI Predictions 2 Years & 3 Months In
I originally expected about 40%-70% of my predictions to come true. Let’s see how I did 27 months on.
In October 2023, I wrote up some predictions for the next few years of AI progress. 27 months have now passed, so we get to see how well I did in 2025. The original prediction post is here, and the 1 year review is here. At the time, I expected about 40%-70% of my predictions to come true. Let’s see how I did!
2023 Predictions for 2024, reassessed:
First, some of my predictions for 2024 were wrong simply because they hadn’t happened yet. Of those, let’s see how wrong I was -- did they happen in 2025 instead?
“Agents can do basic tasks on computers -- like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things”
<1 year off | I rated this as ‘Debatable’ last year, based on Claude with Computer Use and Figure’s robot control. Today, Claude for Chrome and Claude Code, Codex, etc. can clearly do these tasks sans robotics control. The robotics control piece has remained elusive and harder to claim. However, Tesla and Figure robots were deployed in some factory use, and use some general purpose transformers for part of their stack, so I think it can be claimed as “actually useful for some of these things” now.
“Context windows are no longer an issue for text generation tasks. Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.”
<1 year off | I rated this as ‘Mostly False’ for 2024, because although context windows had increased from ~8k to ~128k, I felt there were still limitations. They are now at ~1m, so if you are still having trouble with your text generation tasks, I would say it’s not because of the context window. Gemini 1.5 Pro also had a 1m context window in 2024, though I didn’t think it was effectively usable context at the time.
“GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.”
>1 year off, debatable framing | I still rate this as mostly false, although searching chats and the codebase now works well, which is in effect the same thing. The capability is there, just different to how I framed it in 2023.
“Online learning begins -- GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).”
>1 year off | I still rate this as false. Although current frontier models do help train their successors in several ways, we definitely do not run daily updates to the model.
“AI selection of what data to train on is used to improve datasets in general - training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.”
<1 year off | This is true today! AIs curate and filter their own data, and curriculum design is a larger part of efficient training. It is all done with AI assistance.
Overall Score for 2023 predictions about 2024:
Leans True — 7
Debatable — 1
Leans False but <1 Year Late — 3
Leans False and >1 Year Late — 2
2023 Predictions for 2025:
“AI agents are used in basic robotics -- like LLM driven delivery robots and (in demos of) household and factory robots, like the Tesla Bot. Multimodal models basically let them work out of the box, although not 100% reliably yet.”
Leans False | Demos of household and factory robots using LLMs have certainly happened, though the vibes of this prediction are overestimating progress in robotics. Tesla Optimus and Figure AI had small deployments in car factories, but not commercially, matching the prediction. Delivery robots did not use LLMs in any way, as far as I am aware, even for conversing with the sender or receiver, or for reasoning through high level path planning. Real-time multimodal AIs like Gemini Flash can live stream and respond to video and audio out of the box, but not reliably enough to be used as part of a robotics stack directly.
“Trends continue from the previous year:”
“The time horizons agents can work on increase”
True | 100% true, and now well tracked by the famous METR chart.
“LLMs improve on traditional LLM tasks”
True | Trivially true, and an extremely obvious prediction. Comparing o1 to Opus 4.5, SWE-bench went from ~49% to ~81%, AIME maths went from ~74% to ~100%, and so on.
“Smaller models get more capable”
True | Claude Haiku 4.5 matches or exceeds Claude Opus 3 on many benchmarks, despite (apparently) being far smaller.
“The best models get bigger.”
Debatable | Estimates of the most capable models’ parameter counts are higher than in previous years, but not by much, and the scaling up of parameters has not been a major source of performance improvements, against my broader expectations. It’s a technically correct prediction that missed the actual trend.
“AI curated and generated data becomes far more common than previously, especially for aligning models.”
True | Synthetic data became the default approach for LLMs, and is used deeply for alignment training through Constitutional AI. DeepSeek’s R1 famously used pure RL for reasoning, and distilled to smaller models using the generated reasoning traces.
“Virtual environments become more common for training general purpose models, combined with traditional LLM training.”
Debatable | Robotics training in simulation is booming, as are world models as a research area, like Genie 3. However, they aren’t commonly used for training frontier LLMs. Those are trained in and for environments like browsers and terminals, with other tools which take actions for them, but that stretches the definition of ‘virtual environment’ a little.
“Code writing AI (just LLMs with context and finetuning) are capable of completely producing basic apps, solving most basic bugs, and working with human programmers very well -- it’s pair programming with an AI, with the AI knowing all of the low level details (a savant who has memorised the docs and can use them perfectly, and can see the entire codebase at once), and the human keeping track of the higher level plan and goals. The AI can also be used to recommend architectures and approaches, of course, and gradually does more and more between human inputs.”
True | I barely directly write code myself anymore, as Opus 4.5 in Claude Code did nearly all of it for me by December 2025. I do still need to track the higher level plan, architecture, and goals, as predicted.
“If there ever feels like a lull in progress, it will be in this period leading up to models capable enough for robotics control, long time frame agents, and full form video generation, which I don’t expect to happen in an large scale way in 2025.”
Leans True | There was talk of feeling like there was a lull in early-mid 2025, around the time leading up to and including GPT-5’s launch, and it is correct that robotics control, long time horizon agents (more than a few hours), and full form video generation haven’t taken off in a big way in 2025.
“Possibly GPT-6 or equivalent is released, but more likely continuous improvements to GPT-5 carry forward. There’s not a super meaningful difference at this point, with online learning continually improving existing models.”
Leans False | No online learning in the sense I meant it, though we do have a focus on better post-training leading to many model releases sharing the same pre-trained base. I also think it is surprisingly debatable whether the jumps in the chain GPT-3 --> GPT-4 --> o1 --> Opus 4.5 were roughly equal in size (and hence whether Opus 4.5 is GPT-6 equivalent from 2023’s perspective), though I would still assess that as ‘probably not / leans false’.
Overall Score for 2023 predictions about 2025:
Leans True / True — 6 (3 non-trivial)
Debatable — 2
Leans False / False — 2
Overall, I think my predictions matched my calibration about them, and potentially did even slightly better than my 40%-70% claim. The biggest mistake was predicting some form of continual learning, and by far the biggest omission was saying nothing about reasoning models, which would become the dominant paradigm a year after my original write up. I did talk about runtime search in July 2024, several months before o1-preview was announced, but completely missed it in 2023.
I was pretty accurately calibrated on AI software capabilities, but too bullish on robotics. In hindsight, the ‘lull’ prediction for 2025 was probably the most unlikely one to get mostly correct, and I think compared to (my memory of) other predictions at the time I correctly downweighted video generation and upweighted coding automation.
See you all next year!

