OpenAI vs the gorilla dataset

There is a dataset containing gender, BMI and steps taken per day. It was shown to students who had to analyse it. The students were split into two groups and each group got different instructions. The students in the first group were asked to consider three specific hypotheses, like how the BMI/steps relationship might depend on gender. They could also do more research, but the hypotheses were the main entry point for understanding the dataset. These were the hypotheses:

a) There is a difference in the mean number of steps between women and men.

b) The correlation coefficient between steps and bmi is negative for women.

c) The correlation coefficient between steps and bmi is positive for men.

In the second group, the “hypothesis-free” one, students were simply asked: What do you conclude from the dataset?

One important detail here is what the dataset actually looked like: plotted as steps against BMI, the points draw a gorilla.

As the paper mentions, the group that was preoccupied with the hypotheses was less likely to notice the gorilla. They were probably so distracted by the given task that they forgot to make some basic plots first. The paper behind this research is a great read.
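Making a basic plot first does not take much. As a rough sketch, here is a dependency-free character scatter plot; the `ascii_scatter` helper and the sample numbers are made up for illustration, and putting steps on the x-axis and BMI on the y-axis is an assumption. With a plotting library such as matplotlib, a single scatter call would do the same job with far more detail.

```python
def ascii_scatter(xs, ys, width=40, height=15):
    """Render points as '*' on a character grid; a crude stand-in for a scatter plot."""
    x_min, y_min = min(xs), min(ys)
    # Fall back to a span of 1 so constant columns do not divide by zero.
    x_span = (max(xs) - x_min) or 1
    y_span = (max(ys) - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y is drawn higher
    return "\n".join("".join(line) for line in grid)


# Made-up placeholder values; NOT the study's data.
steps = [3000, 5000, 7000, 9000, 4000, 6000, 8000, 10000]
bmi = [30.0, 27.0, 24.0, 21.0, 22.0, 24.5, 26.0, 29.0]
print(ascii_scatter(steps, bmi))
```

A grid this coarse only works as a first glance, but a first glance is exactly what the hypothesis-driven group skipped.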

What about AI?

So the question has to be asked ... what about LLMs? Would they fall for this trap? I booted up OpenAI to see how their analyst might perform.

Screenshot: The start of a conversation.

The hypothesis-free conversation took a few interesting turns. You can read the whole conversation here, but the big thing is that it was actually able to show me the gorilla chart! It generated this chart as one of the many plotted artifacts:

Screenshot: the gorilla chart generated by OpenAI.

It did not, however, pause and observe that something was awry. Instead it explained to me that the t-test statistic was -3.6, which corresponds to a highly significant p-value of 0.0003, which means that there is a statistically significant difference between men and women as far as BMI goes!
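As a rough sanity check on those two numbers, the p-value can be approximated from the statistic alone. This is a minimal sketch that assumes a two-sided test and a sample large enough for the t distribution to be close to a standard normal; neither assumption is stated in the conversation, and the function name is made up for illustration.

```python
from statistics import NormalDist


def approx_two_sided_p(t_stat):
    """Two-sided p-value using a normal approximation to the t distribution.

    Assumption: the sample is large, so the t distribution is close to a
    standard normal. For small samples use a proper t distribution instead.
    """
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


p_value = approx_two_sided_p(-3.6)
print(f"{p_value:.4f}")  # prints 0.0003
```

Under those assumptions the reported statistic and p-value are consistent with each other. The arithmetic was fine; what was missing was a look at the chart.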

What followed was pretty interesting too. I figured that I might ask it about the "gorilla in the steps vs BMI" chart and it immediately went on to interpret my question figuratively.

Screenshot: OpenAI taking things figuratively.

After it told me that it had redone some of the analysis, here's what it responded with:

Screenshot: A lot of extra effort!

To be clear, the "data analyst" feature in OpenAI does a lot of things right, and I can totally see how it might save a lot of people a lot of time. I am also cherry-picking a little bit here, because I am giving it a task that a lot of analysts might also plainly fail at.

But it does show that it is not immune to failure. Humans are still pretty good at looking at a chart and spotting something that might be off.

Put differently: Natural Intelligence is All You Need[tm].

I also made a YouTube video about this phenomenon, viewable here:

koaning.io
