I stumbled upon an example that does a pretty good job of explaining why context is hard, albeit a somewhat niche and nerdy one.
Vulture
In StarCraft there is a unit called a "vulture", a hoverbike ridden by Terran soldiers. Here's a YouTube video.
So, if you were to prompt an image generation model to generate a "starcraft marine riding a vulture", what would you expect? Let's consider two examples.


Here's the hard part for me: when I look at just the prompt, it is incredibly hard to figure out which image is "better". The prompt simply does not give me enough information. The first image shows a vulture from the video game, which is accurate. But then again, the marine riding a literal vulture (the bird) in space is just as accurate if you read the prompt literally, and it is pretty witty too.
It's the best example of "context matters" I've seen thus far. The current prompt makes it impossible to declare which of these two images is better, and that alone makes it clear that the prompt should be improved.