Recent advances in artificial intelligence (AI) have produced multimodal models such as GPT-4o and Gemini 1.5 Pro, which are marketed as understanding images and audio alongside text. However, a recent study suggests that these models may not truly “see” in the way humans expect. Despite marketing claims about their “vision capabilities” and “visual understanding,” these models often struggle with basic visual tasks.
AI developers and marketers often use language that implies their models have some form of visual perception. They speak of the models’ ability to analyze images and video, suggesting they can handle tasks ranging from solving homework problems to watching sports. While these models can match patterns in their input to patterns in their training data, their poor showing on simple visual tasks exposes the limits of that approach.
Researchers from Auburn University and the University of Alberta conducted an informal but systematic study on the visual reasoning capabilities of these AI models. They tested the models on simple tasks such as determining whether two shapes overlap, counting specific shapes in an image, and identifying circled letters in a word. These tasks, which are trivial for a first-grader, posed significant challenges for the AI models.
Overlapping Shapes and Basic Visual Tasks
One of the simplest tasks involved determining whether two circles overlap. GPT-4o performed well when the circles were far apart, but its accuracy dropped sharply to 18% when the circles were close together or touching. Gemini 1.5 Pro fared somewhat better but still managed only about 70% accuracy at close distances.
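What makes this failure striking is how trivial the ground truth is to compute: two circles overlap exactly when the distance between their centers is at most the sum of their radii. The sketch below is an illustrative reconstruction of that kind of test case, not the study’s actual benchmark code; the canvas size, radii, and gap ranges are assumptions chosen to produce the near-touching pairs the models reportedly found hardest.

```python
# Illustrative sketch of the overlap task's ground truth (assumed setup,
# not the study's actual generator): two circles overlap exactly when the
# distance between their centers is <= the sum of their radii.
import math
import random

def circles_overlap(c1, r1, c2, r2):
    """True if the two circles intersect or touch."""
    dist = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    return dist <= r1 + r2

def random_circle_pair(canvas=512, radius=60, min_gap=-30, max_gap=120):
    """Place two equal circles whose edge-to-edge gap is drawn uniformly.
    Negative gaps yield overlapping pairs; small positive gaps yield the
    near-touching cases that reportedly tripped up the models."""
    gap = random.uniform(min_gap, max_gap)
    center_dist = 2 * radius + gap
    x1, y1 = canvas * 0.35, canvas * 0.5
    x2, y2 = x1 + center_dist, y1
    return (x1, y1), (x2, y2), gap

if __name__ == "__main__":
    random.seed(0)
    for _ in range(5):
        c1, c2, gap = random_circle_pair()
        print(f"gap={gap:6.1f}px  overlap={circles_overlap(c1, 60, c2, 60)}")
```

A one-line distance check settles every case exactly, which is precisely why an 18% score on near-touching pairs reads as a failure of perception rather than of reasoning.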
Counting interlocking circles also proved problematic. The models could accurately count five rings, but adding a sixth sharply reduced their accuracy: Gemini failed to get it right even once, while Claude 3.5 Sonnet and GPT-4o scored below 50%. This inconsistency suggests that the models’ visual processing is not robust.
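The stimuli for this task are easy to reconstruct in a few lines. The following is a rough sketch of how one might generate chains of interlocking rings to probe the counting behavior; it is an assumed reconstruction for illustration (a five-ring chain resembles the familiar Olympic logo, a six-ring chain is the variant the models reportedly failed on), not the authors’ code.

```python
# Rough sketch (assumed, for illustration) of ring-counting stimuli:
# draw n outline circles in a horizontal chain so neighbors interlock.
from PIL import Image, ImageDraw  # pip install pillow

def interlocking_rings(n, radius=50, overlap=20, width=6):
    """Render n rings, each shifted so consecutive rings interlock."""
    step = 2 * radius - overlap
    img_w = 2 * radius + (n - 1) * step + 20
    img = Image.new("RGB", (img_w, 2 * radius + 20), "white")
    draw = ImageDraw.Draw(img)
    for i in range(n):
        x0 = 10 + i * step
        draw.ellipse([x0, 10, x0 + 2 * radius, 10 + 2 * radius],
                     outline="black", width=width)
    return img

if __name__ == "__main__":
    for n in (5, 6, 7):
        interlocking_rings(n).save(f"rings_{n}.png")
```

Feeding such images to a model and scoring its answers against the known n would reproduce the study’s basic protocol: identical rendering logic, with only the ring count varied.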
The study’s findings indicate that these AI models do not possess true visual understanding. Instead, they rely on pattern matching against their training data. For example, they might accurately identify the Olympic rings because of extensive exposure during training, but they falter on slightly altered configurations not present in that data.
Anh Nguyen, a co-author of the study, explained that these models might extract abstract information from images, such as “there’s a circle on the left side,” but they lack the ability to make precise visual judgments. This form of “blindness” means the models can be informed about an image without genuinely perceiving it.
Practical Implications and Future Directions
Despite these limitations, visual AI models are not without value. They excel in specific areas like recognizing human actions, facial expressions, and common objects. However, the study underscores the need for cautious interpretation of their capabilities. While marketing may suggest near-human visual perception, these models still require significant advancements to achieve true visual understanding.
The recent study highlights the gap between marketing claims and the actual performance of multimodal AI models on visual tasks. While these models have impressive capabilities, their inability to perform basic visual reasoning tasks suggests they do not “see” in the conventional sense. As AI technology continues to evolve, ongoing research will be crucial to uncovering the true extent of these models’ abilities and limitations.