OpenAI's Language Models: Inconsistent Performance Over Time

How reliable and consistent is ChatGPT over time? In one set of tests, GPT-4's accuracy in identifying prime numbers fell from 97.6% to 2.4% within three months.

By Cursor Contributor | Jul 28 2023, 5:49pm in Technology

OpenAI, a renowned artificial intelligence laboratory, has been making waves in the tech industry with ChatGPT. However, recent studies have shown that its AI models perform unevenly over time, raising questions about their reliability and consistency across tasks.

This article examines these findings and their implications for the use of AI tools across industries.

OpenAI's Language Models

Studies by computer scientists at Stanford University and the University of California, Berkeley, show inconsistent performance in OpenAI's language models, especially GPT-3.5 and GPT-4, both of which are also offered through Microsoft's cloud.

ChatGPT, powered by GPT-3.5, offers a paid Plus subscription that opens access to GPT-4. Microsoft is integrating these neural networks into its software and services, which underscores their significance in the technology sector. Some aspects of their behavior, however, require closer attention.

The Performance Levels

James Zou, an assistant professor at Stanford University, observed considerable variations in the responses of the models to identical questions over time.

He stated: “The newer versions got worse on some tasks”, highlighting the inconsistency. The test results point to other problems as well.

Inconsistent Results In Various Tasks

Performance tests included solving mathematical problems, generating code, answering inappropriate questions, and visual reasoning. These tests revealed dramatic fluctuations in the performance levels of both GPT-3.5 and GPT-4 within three months.

GPT-4's accuracy in identifying prime numbers fell from 97.6% to a mere 2.4% between March and June 2023.

Conversely, the accuracy of GPT-3.5 rose from 7.4% to 86.8% during the same period.

The share of GPT-4's generated code that was bug-free and directly executable fell from 52% to 10%, while GPT-3.5's share dropped from 22% to just 2%.
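To put these figures side by side, the short Python sketch below computes the change in percentage points for each reported result. The numbers are copied from the paragraphs above. The task labels, the function names, and the assumption that the GPT-3.5 accuracy figures refer to the same prime-number task are illustrative and not taken from the study itself.

```python
# Figures as reported in this article, in percent (March 2023, June 2023).
# Task labels are illustrative; the GPT-3.5 "prime identification" label
# assumes its accuracy figures refer to the same task as GPT-4's.
REPORTED = {
    ("GPT-4", "prime identification"): (97.6, 2.4),
    ("GPT-3.5", "prime identification"): (7.4, 86.8),
    ("GPT-4", "executable code"): (52.0, 10.0),
    ("GPT-3.5", "executable code"): (22.0, 2.0),
}


def point_change(march: float, june: float) -> float:
    """Return the change in percentage points, rounded to one decimal."""
    return round(june - march, 1)


def summarize(reported: dict) -> dict:
    """Map each (model, task) pair to its change in percentage points."""
    return {key: point_change(*values) for key, values in reported.items()}


if __name__ == "__main__":
    for (model, task), delta in summarize(REPORTED).items():
        print(f"{model:8s} {task:22s} {delta:+.1f} points")
```

The changes are expressed in percentage points rather than relative percentages because the starting values differ so widely that relative changes would be hard to compare.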

Shorter Responses, But Improved Discretion

In some tasks, the models produced shorter responses, but GPT-4 showed improved discretion in refraining from answering inappropriate questions. Instead of lengthy explanations, it simply responded, “Sorry, but I can't assist with that”.

Enhanced Visual Reasoning Skills

In a final task, performance improved for both GPT-3.5 and GPT-4. The AI accurately generated a color grid from a given image, demonstrating improved visual reasoning skills.

The inner workings of OpenAI's models are proprietary, and OpenAI frequently updates its code and neural networks, so the models keep evolving. This opacity hinders a full understanding of what causes changes in the models' responses.

What's Next?

Due to the possible ripple effects of updates on applications and services, developers are advised to regularly monitor the behavior of these models. AI tools, increasingly used as components of large systems, need regular evaluation to spot drifts and unexpected behaviors.

Changes in model responses can disrupt downstream processes and decisions. Therefore, continual evaluation of these models is essential to maintain the quality and reliability of systems that depend on them.
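One way to carry out this kind of continual evaluation is to run a fixed set of questions with known answers against the model on a schedule and compare the score with an earlier baseline. The Python sketch below illustrates the idea using the prime-number task mentioned earlier. Everything in it is hypothetical: `ask_model` stands in for whatever function a team uses to query its model, and the prompt wording and the 5-point tolerance are arbitrary choices, not anything prescribed by the researchers or by OpenAI.

```python
from typing import Callable, List, Sequence, Tuple


def is_prime(n: int) -> bool:
    """Ground truth for the test set, computed by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def build_cases(numbers: Sequence[int]) -> List[Tuple[str, str]]:
    """Build (prompt, expected answer) pairs; the prompt wording is illustrative."""
    return [
        (f"Is {n} a prime number? Answer yes or no.", "yes" if is_prime(n) else "no")
        for n in numbers
    ]


def accuracy(ask_model: Callable[[str], str], cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of cases where the model's reply matches the expected answer."""
    correct = sum(
        1 for prompt, expected in cases
        if ask_model(prompt).strip().lower() == expected
    )
    return correct / len(cases)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a change larger than the tolerance in either direction."""
    return abs(current - baseline) > tolerance


if __name__ == "__main__":
    cases = build_cases([2, 9, 17, 21, 97])

    # Stand-in for a real model call: a hypothetical model that always says "No".
    def always_no(prompt: str) -> str:
        return "No"

    score = accuracy(always_no, cases)
    print(f"accuracy: {score:.2f}, drifted: {has_drifted(1.0, score)}")
```

In practice the baseline would be the score recorded when the application was last validated, and a flagged drift would prompt a manual review rather than an automatic action.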

AI tools have become integral components of large systems, so their consistency and reliability are of paramount importance. Understanding how these tools behave over time can provide insight into the behavior of the systems built on them and ultimately improve their efficiency and effectiveness.