The release of OpenAI GPT-4.5 has been somewhat disappointing, with many pointing out its insane price point (about 10 to 20X more expensive than Claude 3.7 Sonnet and 15 to 30X more costly than GPT-4o).
However, given that this is OpenAI’s largest and most powerful non-reasoning model, it is worth considering its strengths and the areas where it shines.
Better knowledge and alignment
There is little detail about the model’s architecture or training corpus, but we have a rough estimate that it was trained with 10X more compute. The model was also so large that OpenAI needed to spread training across multiple data centers to finish in a reasonable time.
Bigger models have a larger capacity for learning world knowledge and the nuances of human language (provided they have access to high-quality training data). This is evident in some of the metrics presented by the OpenAI team. For example, GPT-4.5 has a record-high score on PersonQA, a benchmark that evaluates hallucinations in AI models.
Practical experiments also show that GPT-4.5 is better than other general-purpose models at remaining true to facts and following user instructions.
Users have pointed out that GPT-4.5’s responses feel more natural and context-aware than those of previous models. Its ability to follow tone and style guidelines has also improved.
After the release of GPT-4.5, AI scientist and OpenAI co-founder Andrej Karpathy, who had early access to the model, said he “expect[ed] to see an improvement in tasks that are not reasoning-heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc.”
However, evaluating writing quality is also very subjective. In a survey that Karpathy ran on different prompts, most people preferred the responses of GPT-4o over GPT-4.5. He wrote on X: “Either the high-taste testers are noticing the new and unique structure but the low-taste ones are overwhelming the poll. Or we’re just hallucinating things. Or these examples are just not that great. Or it’s actually pretty close and this is way too small sample size. Or all of the above.”
Better document processing
In its experiments, Box, which has integrated GPT-4.5 into its Box AI Studio product, wrote that GPT-4.5 is “particularly potent for enterprise use-cases, where accuracy and integrity are mission critical… our testing shows that GPT-4.5 is one of the best models available both in terms of our eval scores and also its ability to handle many of the hardest AI questions that we have come across.”
In its internal evaluations, Box found GPT-4.5 to be more accurate on enterprise document question-answering tasks, outperforming the original GPT-4 by about 4 percentage points on its test set.

Box’s tests also indicated that GPT-4.5 excelled at math questions embedded in business documents, which older GPT models often struggled with. For example, it was better at answering questions about financial documents that required reasoning over data and performing calculations.
GPT-4.5 also showed improved performance at extracting information from unstructured data. In a test that involved extracting fields from hundreds of legal documents, GPT-4.5 was 19% more accurate than GPT-4o.
Planning, coding, evaluating results
Given its improved world knowledge, GPT-4.5 can also be a suitable model for creating high-level plans for complex tasks. The broken-down steps can then be handed over to smaller but more efficient models to elaborate and execute.
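To make the division of labor concrete, here is a minimal sketch of that planner-and-executor pattern. It is an illustration under stated assumptions, not a description of any vendor’s API: a “model” is simply any function that takes a prompt and returns text, and the names `plan_and_execute`, `fake_planner` and `fake_executor` are hypothetical stand-ins for real model calls.

```python
import re
from typing import Callable, List

# Assumption: a "model" is any function mapping a prompt string to a reply.
# In practice it would wrap a call to a hosted LLM; that wrapper is not shown.
Model = Callable[[str], str]


def plan_and_execute(task: str, planner: Model, executor: Model) -> List[str]:
    """Ask the large model for a plan, then run each step on the small model."""
    plan_text = planner(f"Break this task into short numbered steps:\n{task}")
    steps = []
    for line in plan_text.splitlines():
        # Drop leading numbering such as "1." or "2)" and skip blank lines.
        step = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if step:
            steps.append(step)
    # Each step is sent to the cheaper model together with the overall task.
    return [executor(f"Task: {task}\nStep: {step}") for step in steps]


# Hypothetical stub models, used here only so the sketch runs offline.
def fake_planner(prompt: str) -> str:
    return "1. Collect the invoices\n2. Sum the totals\n3. Write a summary"


def fake_executor(prompt: str) -> str:
    step = prompt.split("Step: ", 1)[1]
    return f"done: {step}"


results = plan_and_execute("Summarize Q1 invoices", fake_planner, fake_executor)
print(results)
```

The expensive model is called once per task, while the cheaper model is called once per step, which is where most of the token volume would typically land.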
According to Constellation Research, “In initial testing, GPT-4.5 seems to show strong capabilities in agentic planning and execution, including multi-step coding workflows and complex task automation.”
GPT-4.5 can also be useful in coding tasks that require internal and contextual knowledge. GitHub now provides limited access to the model in its Copilot coding assistant and notes that GPT-4.5 “performs effectively with creative prompts and provides reliable responses to obscure knowledge queries.”
Given its deeper world knowledge, GPT-4.5 is also suitable for “LLM-as-a-Judge” tasks, where a strong model evaluates the output of smaller models. For example, a model such as GPT-4o or o3 can generate one or several responses, reason over the solution and pass the final answer to GPT-4.5 for revision and refinement.
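One possible shape for such a judging loop is sketched below. It covers only the scoring half of the idea (a cheaper model proposes several answers and a stronger model rates them) and leaves out the revision step. As before, models are plain prompt-to-text functions, and `judge_best`, `fake_generator`, `fake_judge` and the 0 to 10 scale are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Assumption: a "model" is any function mapping a prompt string to a reply.
Model = Callable[[str], str]


def judge_best(question: str, generator: Model, judge: Model,
               n: int = 3) -> Tuple[str, float]:
    """Generate n candidate answers and return the one the judge rates highest."""
    candidates = [generator(f"{question}\n(attempt {i + 1})") for i in range(n)]
    scored = []
    for answer in candidates:
        reply = judge(
            f"Question: {question}\nAnswer: {answer}\n"
            "Rate the answer from 0 to 10. Reply with the number only."
        )
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0  # An unparseable verdict counts as the lowest score.
        scored.append((answer, score))
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stub models, used here only so the sketch runs offline.
def fake_generator(prompt: str) -> str:
    answers = {"1": "Lyon", "2": "Paris", "3": "Marseille"}
    attempt = prompt.rsplit("attempt ", 1)[1].rstrip(")")
    return answers[attempt]


def fake_judge(prompt: str) -> str:
    return "9" if "Answer: Paris" in prompt else "2"


best, score = judge_best("What is the capital of France?",
                         fake_generator, fake_judge)
print(best, score)
```

Because the judge is called once per candidate, its cost grows with `n`, so in a real setup the number of candidates would need to be weighed against the judge model’s price.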
Is it worth the price?
Given the huge costs of GPT-4.5, though, it is very hard to justify many of the use cases. But that doesn’t mean it will remain that way. One of the constant trends we have seen in recent years is the plummeting cost of inference, and if this trend applies to GPT-4.5, it is worth experimenting with it and finding ways to put its power to use in enterprise applications.
It is also worth noting that this new model can become the basis for future reasoning models. Per Karpathy: “Keep in mind that that GPT4.5 was only trained with pretraining, supervised finetuning and RLHF [reinforcement learning from human feedback], so this is not yet a reasoning model. Therefore, this model release does not push forward model capability in cases where reasoning is critical (math, code, etc.)… Presumably, OpenAI will now be looking to further train with reinforcement learning on top of GPT-4.5 model to allow it to think, and push model capability in these domains.”