Artificial intelligence tools like ChatGPT are often praised for their ability to generate fluent responses and explain complex topics, but are they really accurate?
A recent study examined how well ChatGPT could evaluate whether scientific hypotheses were supported by existing research. While the AI often sounded confident in its answers, the findings suggest its reasoning ability may not yet match its language skills.
Testing AI against scientific research
Researchers analysed 719 hypotheses drawn from business research papers published since 2021. The goal was to determine whether ChatGPT could correctly judge which hypotheses had been supported by the research findings.
Each hypothesis was presented to the AI ten separate times using identical prompts. The repeated questioning allowed the researchers to measure not only accuracy but also whether the system would give consistent answers to the same question.
The experiment was conducted twice using different versions of the AI tool: first in 2024 with the free ChatGPT-3.5, then again in 2025 with an updated free model.
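To make the setup concrete, here is a minimal sketch of this repeated-prompting protocol in Python. The ask_model function is a stand-in for however one queries the chatbot (the study used the free web interface rather than an API), and the prompt wording is invented for illustration:

    import random
    from collections import Counter

    def ask_model(prompt: str) -> bool:
        # Placeholder for querying the chatbot; stubbed with a coin flip
        # so the sketch runs standalone. Returns the model's true/false verdict.
        return random.random() < 0.5

    def run_protocol(hypotheses: dict[str, bool], runs: int = 10) -> dict:
        # hypotheses maps each hypothesis text to whether it was actually supported.
        records = {}
        for text, supported in hypotheses.items():
            prompt = f"True or false: this hypothesis is supported by the literature. {text}"
            answers = [ask_model(prompt) for _ in range(runs)]  # ten identical prompts
            counts = Counter(answers)
            majority = counts.most_common(1)[0][0]
            records[text] = {
                "majority_correct": majority == supported,
                "unanimous": len(counts) == 1,  # same verdict on all ten runs
            }
        return records

Accuracy is then the share of hypotheses whose majority verdict matches the published result, while the unanimity flag captures whether the model answered consistently.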
The results showed some improvement, but still raised concerns. In 2024, the AI answered correctly about 76.5 percent of the time. When the test was repeated a year later, accuracy increased slightly to around 80 percent.
However, the researchers adjusted the results to account for random guessing. Since the questions required true-or-false answers, a blind guess would already be correct half the time. After this adjustment, the AI's genuine skill amounted to only about 60 percent of the gap between random guessing and perfect accuracy.
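The article does not spell out the exact correction used, but a standard chance adjustment reproduces the figure: rescale the observed accuracy so that 0 means guessing and 1 means perfect. A minimal sketch:

    chance = 0.50    # expected accuracy of blind true/false guessing
    observed = 0.80  # raw accuracy in the 2025 test
    adjusted = (observed - chance) / (1 - chance)
    print(round(adjusted, 2))  # 0.6: about 60 percent of the way from chance to perfect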
The system also struggled to recognise when a hypothesis was false: it correctly identified unsupported hypotheses only about 16.4 percent of the time. Since overall accuracy was nonetheless around 80 percent, most hypotheses in the sample must have been supported ones, so a tendency to answer "true" by default would flatter the headline figure.
Inconsistent answers
The study found a lack of consistency in the AI’s responses. When the same question was asked repeatedly with the exact same wording, ChatGPT did not always produce the same answer.
Across ten identical prompts, the system provided consistent results only about 73 percent of the time. In several cases, the AI split its answers evenly between true and false for the same hypothesis.
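Continuing the earlier sketch, a consistency rate like this could be computed as the share of hypotheses whose ten answers all agreed (an assumption; the study's exact criterion is not stated):

    def consistency_rate(records: dict) -> float:
        # Fraction of hypotheses that received the same verdict on all ten runs.
        unanimous = sum(1 for r in records.values() if r["unanimous"])
        return unanimous / len(records)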
This variability suggests that even when an AI system appears confident in an explanation, the underlying reasoning process may not be stable.
Language fluency vs. reasoning ability
The findings highlight an important gap in current generative AI systems. Large language models are designed primarily to generate text that sounds natural and convincing. They are trained on enormous datasets and learn patterns in language rather than developing a human-like understanding of concepts.
As a result, they can often produce persuasive explanations even when the underlying answer is incorrect. The research suggests that linguistic fluency should not be mistaken for deeper reasoning ability.
The researchers say the results show the importance of treating AI-generated answers with caution, especially in areas that require careful analysis or interpretation of evidence.
For business leaders and professionals using AI tools, the findings suggest that verification remains essential. AI systems can help with information gathering and idea generation, but their outputs should still be checked against reliable sources.
As generative AI continues to evolve, researchers say understanding both its strengths and limitations will be key to using the technology effectively.

