The old adage is that you learn something new every day. Well, I bumped into an old Conchango colleague the other day and learned about 500 new things in one hour. The one which has been rattling round my head ever since is the Winograd Schema Challenge.
The challenge is to see if a computer can correctly identify what ‘it’ refers to in the following sentence:
“The trophy would not fit in the brown suitcase because it was too big.”
Well that doesn’t sound so hard, right?
It’s easy to assume most people would intuitively know that the trophy is too big. But notice that if you change the last word to ‘small’, the ‘it’ switches to refer to the suitcase.
This is a kind of fancy Turing Test. And up until 2019, computers were unable to answer it correctly. That changed with the advent of tools like ChatGPT, whose somewhat bizarre answer is below. Correct but also weirdly wrong.

However, to say that OpenAI has solved this challenge is to miss the point. They have included a large number of Winograd examples in their training data. Per this article, the LLMs haven’t evolved to answer the challenge itself; they have defeated it through training and tricks. It’s a bit like the VW emissions cheating scandal: the system knows you’re checking up on it!
The creators of these sorts of tests have adapted too, creating batteries of thousands of similar tests to run against the machines. Why? Because deep down, people who know how computers and AI work also know that computers aren’t ‘thinking’ in the way that we do. And those computer scientists would like to find a way to keep showing that this is true, although it is getting harder.
Another old adage: if it walks like a duck, quacks like a duck… it is a duck.
That’s what’s going through our minds when we happily let ChatGPT loose on our birthday card messages or annual report. Or when we’re chatting to it about our failed relationship. A recent report showed that thousands of people in mental health crisis are turning to ChatGPT for help. Jesus!
It certainly acts like it’s thinking. It answers just as quickly as your stoner friend and often a lot more accurately. So we start to believe it must be doing what we’re doing when we answer things.
But that is not what it is doing at all.
Let’s go back to calling LLMs what they actually are: statistical models that have been trained to guess what the next word might be given any set of inputs. Yes, it can spit out the somewhat wonky answer above, but when it reads the word trophy, it doesn’t get a mental picture of one, like you just did. And it never imagined someone trying to close a suitcase with a trophy inside, like you did when you started reading this article.
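If ‘guess the next word’ sounds abstract, here is a toy sketch of the idea in Python. This is emphatically not how a real LLM is built (those use neural networks trained on vast corpora, not word-pair counts), but it is the same trick at miniature scale: count which word tends to follow which, then pick the most probable next word. Every name and the tiny ‘corpus’ below are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Toy next-word guesser: a crude stand-in for what an LLM does at vast scale.
# "Train" by counting which word follows which in a tiny corpus.
corpus = "the trophy would not fit in the brown suitcase because it was too big".split()

follower_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follower_counts[current_word][next_word] += 1

def guess_next(word):
    """Return the most frequently observed next word from training, or None."""
    followers = follower_counts.get(word)
    if not followers:
        return None  # never seen this word, so there is no pattern to repeat
    return followers.most_common(1)[0][0]

print(guess_next("because"))  # -> 'it': pure statistics, no mental picture of a trophy
print(guess_next("the"))      # -> 'trophy': just the first of the observed followers
```

The point of the sketch is what is missing: nowhere in it is there a trophy, a suitcase, or any notion of ‘too big’. There are only frequencies.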
Spotting and repeating patterns is a great party trick. And it’s definitely a productivity aid in certain scenarios, but it is not what you think it is. Da Vinci didn’t create the Mona Lisa by statistically averaging every other painting in the world. Nor did Brian Eno and David Byrne compose My Life in the Bush of Ghosts by working out the average of all the music that came before. When we have breakthroughs or try new things we cannot just synthesize the past.
Let’s look at the places where it works best. The technique appears to be very effective at writing and analysing code. And that can hardly come as a surprise when you understand what it is doing under the covers, and how coding works. It’s all about patterns. Creating pictures that look like other pictures: check. Creating a version of that essay about Descartes that has already been written by 100,000 A-level students: check. Giving me 100 ideas based on some words: check.
And that is all well and good. But scroll, instead, through LinkedIn and hear the fever dreams of how AI (and LLMs in particular) will change the world, and I think you’d be forgiven for thinking that it was actually thinking. We’ve started to personify it and we’ve started to make excuses for it. When it produces answers we don’t like, we don’t say it’s broken or that the technique doesn’t work well; we say it is ‘hallucinating’.
What seems to have happened is people have seen the party trick, the word guessing, the pattern matching, and the AI coding and they’ve thought ‘DUCK’. And there is an escalating pattern of one-upmanship online to try and come up with bolder and more exciting or scary versions of what this might mean.
And if you convince yourself that they’ve crammed an intellect into a machine with an almost impossible appetite for data, those conclusions might make sense. But if instead you think of this as a guessing tool with a lot of training, the conclusions are a bit different.
‘Would you like me to average your essay out with a million others?’, ‘Would you like me to summarise the meeting in a way which shows absolutely no understanding of the concepts that were discussed or the social hierarchy which formed the context?’, ‘Would you like me to reply to your customers with pastiche answers lacking in detail?’
When future generations look back on this spiky, speculation-fuelled period in human history will they think we were on the brink of a great twist in society’s progression, or of a massive collective delusion?