Date: 2023-07-20

Recent research has shown that GPT-4’s performance has declined over time.

OpenAI’s suspected cost-saving strategy of routing queries to smaller, specialized models raises questions about its impact on output quality.

Developers using GPT-4 should exercise caution due to the model’s inconsistent behavior.

GPT-4’s performance appears, unexpectedly, to be degrading over time instead of improving. The widely shared impression that the model’s response quality has declined is now supported by empirical research, not just individual experiences.

Recent studies indicate that the June version of GPT-4 performs notably worse than the March version on specific tasks. One example is a set of 500 problems that asked the model to determine whether a given integer is prime.

The results were alarming: the March model solved 488 of the 500 problems correctly, while the June model managed only 12. Accuracy fell from 97.6% to 2.4%.
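The percentages follow directly from the counts reported above. A minimal sketch of the arithmetic, where the function and variable names are illustrative and not taken from the study:

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


TOTAL_PROBLEMS = 500  # size of the prime-number problem set reported above

march_accuracy = accuracy_percent(488, TOTAL_PROBLEMS)
june_accuracy = accuracy_percent(12, TOTAL_PROBLEMS)
drop_in_points = march_accuracy - june_accuracy

print(f"March: {march_accuracy:.1f}%")
print(f"June: {june_accuracy:.1f}%")
print(f"Drop: {drop_in_points:.1f} percentage points")
```

The difference between the two versions works out to 95.2 percentage points.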

To improve the model’s reasoning, the researchers used Chain-of-Thought prompting, which asks the model to break the task into simpler steps. The updated GPT-4 version nevertheless failed to generate the intermediate steps and simply answered “No” when asked whether 17077 is a prime number. That answer is incorrect, because 17077 is prime.
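The correct answer for this example is easy to check independently. The sketch below uses plain trial division, which is an illustrative choice and not the method used in the study; `is_prime` is a hypothetical helper name.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division: test odd divisors up to the integer square root of n."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


# The question posed to the model: is 17077 a prime number?
answer = "Yes" if is_prime(17077) else "No"
print(answer)
```

Because no odd number up to 130 divides 17077, the check prints "Yes", the opposite of the June model’s answer.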

The model’s ability to generate code has also declined notably.

The exact cause remains a matter of speculation.

OpenAI’s update process is not fully transparent, so it is unclear how the company assesses whether the model has progressed or regressed. One suggestion is that OpenAI might be using several smaller, specialized GPT-4 models in place of a single large model, potentially reducing operational costs. Under this hypothesis, when a user submits a query, the system selects the most suitable model to handle the request.

If OpenAI is in fact using such a cost-saving strategy, the question is whether it contributes to the decline in output quality.

This serves as a warning to developers integrating GPT-4 into their applications. A large language model whose behavior changes unpredictably over time is not a viable foundation for an application.
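One way developers might guard against this kind of drift is to re-run a fixed set of labeled prompts on a schedule and compare the result with a stored baseline. The sketch below is only an illustration under assumptions: `ask_model` stands for whatever function an application uses to query the model, `stub_model` is a stand-in that always answers "No", and the 0.05 tolerance is an arbitrary example value. None of these names come from OpenAI’s API or from the study.

```python
from typing import Callable, Sequence, Tuple


def measure_accuracy(
    ask_model: Callable[[str], str],
    cases: Sequence[Tuple[str, str]],
) -> float:
    """Return the fraction of cases where the answer matches the expected label."""
    if not cases:
        raise ValueError("cases must not be empty")
    correct = sum(
        1
        for prompt, expected in cases
        if ask_model(prompt).strip().lower() == expected.lower()
    )
    return correct / len(cases)


def has_regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when accuracy falls more than `tolerance` below baseline."""
    return (baseline - current) > tolerance


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always answers 'No'."""
    return "No"


cases = [
    ("Is 17077 a prime number? Answer Yes or No.", "Yes"),
    ("Is 17078 a prime number? Answer Yes or No.", "No"),
]

current = measure_accuracy(stub_model, cases)
print(has_regressed(baseline=0.976, current=current))
```

In a real application the stub would be replaced by the actual model call, and the baseline would come from an earlier run against the same cases.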

