Seeing through AI hype – some thoughts from a journeyman
It seems that every new release of a Large Language Model (LLM) is accompanied by a firehose of vendor hype and uncritically positive analyst and media comments.
The seductions of a new technology make it all too easy to overlook its shortcomings and side-effects, and rush in where “angels may fear to tread.” Although this is particularly evident in today’s age of LLMs, it is not a new phenomenon. As the anthropologist Gregory Bateson noted over forty years ago:
“It seems that every important scientific advance provides tools which look to be just what the applied scientists and engineers had hoped for, and usually these [folks] jump in without more ado. Their well-intentioned (but slightly greedy and slightly anxious) efforts usually do as much harm as good, serving at best to make conspicuous the next layer of problems, which must be understood before the applied scientists can be trusted not to do gross damage. Behind every scientific advance there is always a matrix, a mother lode of unknowns out of which the new partial answers have been chiseled. But the hungry, overpopulated, sick, ambitious, and competitive world will not wait, we are told, till more is known, but must rush in where angels fear to tread.
I have very little sympathy for these arguments from the world’s “need.” I notice that those who pander to its needs are often well paid. I distrust the applied scientists’ claim that what they do is useful and necessary. I suspect that their impatient enthusiasm for action, their rarin’-to-go, is not just a symptom of impatience, nor is it pure buccaneering ambition. I suspect that it covers deep epistemological panic.”
The hype and uncritical use of LLM technology are symptoms of this panic. This article is largely about how you and I – as members of the public – can take a more considered view of these technologies and thereby avoid epistemological panic, at least partially. Specifically, I cover two areas: a) the claims that LLMs can reason, and b) the broader question of the impact of these technologies on our information ecosystem.
–x–
One expects hype in marketing material from technology vendors. However, these days it seems that some researchers, who really ought to know better, are not immune. As an example, in this paper a bunch of computer scientists from Microsoft Research suggest that LLMs show “sparks of AGI” (Artificial General Intelligence), by which they imply that LLMs can match or surpass human cognitive capabilities such as reasoning. I’ll have more to say about the claim shortly. However, before I go on, a few words about how LLMs work are in order.
The principle behind all LLM-based tools, such as GPT, is next-token prediction – i.e., the text they generate is drawn from a list of the most likely next words, given the prompt (the input you provide) and the text generated thus far. The text LLMs generate is usually coherent and grammatical, but not always factually correct (as a lawyer found out the hard way) or logically sound (I discuss examples of this below).
The coherence and grammatical correctness are to be expected because responses are drawn from a massive, multidimensional probability distribution learnt from the training data, which is a representative chunk of the internet. This is then fine-tuned via a process called reinforcement learning from human feedback (RLHF).
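If you are curious about what “next-token prediction” looks like mechanically, here is a toy sketch in Python. It is emphatically not how GPT works under the hood (real LLMs use deep neural networks over subword tokens, trained on vast corpora), but the generation loop – sample a plausible next token, append it, repeat – is the same in spirit:

```python
# A toy illustration of next-token prediction (not how a real LLM is built --
# actual models use neural networks over subword tokens, not word bigrams).
# The point is the mechanism: pick the next word from a probability
# distribution conditioned on what has come before.
import random
from collections import Counter, defaultdict

corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count how often each word follows each preceding word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def next_token(prev_word: str) -> str:
    """Sample the next word in proportion to how often it followed prev_word."""
    counts = next_counts[prev_word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Generate text one token at a time, in the same spirit as an LLM.
word, generated = "the", ["the"]
for _ in range(8):
    word = next_token(word)
    generated.append(word)
print(" ".join(generated))   # e.g. "the cat sat on the rug . the dog"
```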
For those interested in finding out more about how LLMs work, I highly recommend Stephen Wolfram’s long but excellent non-technical essay which is also available in paperback.
Given the above explanation of how LLMs work, it should be clear that any claim suggesting LLMs can reason like humans should be viewed with scepticism.
Why?
Because a next-token predictor cannot reason; it can at best match patterns. As Subbarao Kambhampati puts it, they are approximate retrieval engines. That said, LLMs’ ability to do pattern matching at scale enables them to do some pretty mind-blowing things that look like reasoning. See my post, More Than Stochastic Parrots, for some examples of this, and keep in mind that they are from a much older version of ChatGPT.
So, the question is: what exactly are LLMs doing, if not reasoning?
In the next section, I draw on recent research to provide a partial answer to this question. I’ll begin with a brief discussion of some of the popular prompting techniques that seem to demonstrate that LLMs can reason and then highlight some recent critiques of these approaches.
–x–
In a highly cited 2022 paper entitled Chain-of-Thought (CoT) Prompting Elicits Reasoning in Large Language Models, a team from Google Brain claimed that providing an LLM with a “series of intermediate reasoning steps significantly improves [its] ability to perform complex reasoning.” See Figure 2 of their paper for an example, and this blog post from the research team for a digest version of the paper.
The original CoT paper was closely followed by this paper (also involving researchers from Google Brain) claiming that one does not even have to provide intermediate steps: simply adding “Let’s think step by step” to a prompt will do the trick. The authors called this “zero-shot CoT” prompting. Figure 2 of their paper compares few-shot and CoT prompting.
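To make the difference between the two styles concrete, here is a minimal sketch. The arithmetic problem is adapted from the CoT paper’s running example, and query_llm is a hypothetical placeholder rather than any particular vendor’s API:

```python
# A sketch of the two prompting styles discussed above. The worked exemplar and
# query_llm() are illustrative placeholders (query_llm is hypothetical; wire it
# up to whichever model API you actually use).

QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")

# Few-shot chain-of-thought prompting: include a worked example whose answer
# spells out the intermediate reasoning steps.
cot_prompt = f"""Q: A juggler has 16 balls. Half of them are golf balls. How many golf balls are there?
A: There are 16 balls in total. Half of 16 is 8. So there are 8 golf balls. The answer is 8.

Q: {QUESTION}
A:"""

# Zero-shot CoT prompting: no worked example, just the "magic" phrase.
zero_shot_cot_prompt = f"Q: {QUESTION}\nA: Let's think step by step."

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an actual LLM API."""
    raise NotImplementedError("connect this to a real model")

# print(query_llm(cot_prompt))
# print(query_llm(zero_shot_cot_prompt))
```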
The above approaches work for many common reasoning problems. But does that imply that LLMs can reason? Here’s how Melanie Mitchell puts it in a Substack article:
“While the above examples of CoT and zero-shot CoT prompting show the language model generating text that looks like correct step-by-step reasoning about the given problem, one can ask if the text the model generates is “faithful”—that is, does it describe the actual process of reasoning that the LLM uses to solve the problem? LLMs are not trained to generate text that accurately reflects their own internal “reasoning” processes; they are trained to generate only plausible-sounding text in response to a prompt. What, then, is the connection between the generated text and the LLM’s actual processes of coming to an answer?”
Incidentally, Mitchell’s Substack is well worth subscribing to for a clear-eyed, hype-busting view of AI, and this book by Arvind Narayanan, due to be released in September 2024, is also worth keeping an eye out for.
Here are a few interesting research threads that probe LLMs’ reasoning capabilities:
- Subbarao Kambhampati’s research group has been investigating LLMs’ planning abilities. The conclusion they reach is that LLMs cannot plan, but can help in planning. In addition, you may want to view this tutorial by Kambhampati in which he walks viewers through the details of the tests described in the papers.
- This paper from Thomas Griffiths’ research group critiques the Microsoft paper on “sparks of AGI”. As the authors note: “Based on an analysis of the problem that LLMs are trained to solve (statistical next-word prediction), we make three predictions about how LLMs will be influenced by their origin in this task—the embers of autoregression that appear in these systems even as they might show sparks of artificial general intelligence”. In particular, they demonstrate that LLM outputs have a greater probability of being incorrect when one or more of the following three conditions are satisfied: a) the probability of the task to be performed is low, b) the probability of the output is low, and c) the probability of the input string is low. The probabilities in these three cases refer to the chances of examples of a) the task, b) the output or c) the input being found on the internet.
- Somewhat along the same lines, this paper by Zhaofeng Wu and colleagues investigates LLM reasoning capabilities through counterfactual tasks – i.e., variations of tasks commonly found on the internet. An example of a counterfactual task is adding two numbers in base 8 rather than the default base 10 (a small sketch of such a probe appears after this list). As expected, the authors find that LLMs perform substantially worse on counterfactual tasks than on the default versions.
- In this paper, Miles Turpin and colleagues show that when LLMs appear to reason, they can systematically misrepresent the reasons for their predictions. In other words, the explanations they provide for how they reached their conclusions can, in some cases, be demonstrated to be incorrect.
- Finally, in this interesting paper (summarised here), Ben Prystawski and colleagues attempt to understand why CoT prompting works (when it does, that is!). They conclude that “we can expect CoT reasoning to help when a model is tasked with making inferences that span different topics or concepts that do not co-occur often in its training data, but can be connected through topics or concepts that do.” This is very different from human reasoning, which a) is embodied, and thus draws on data that is tightly coupled to (i.e., relevant to) the problem at hand, and b) uses the power of abstraction (e.g. theoretical models). Research of this kind, aimed at understanding the differences between LLM and human reasoning, can suggest ways to improve the former. But we are a long way from that yet.
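Here is the counterfactual-task sketch promised above, in the spirit of Wu and colleagues’ base-8 example. The prompts and checker are my own illustration, not the authors’ code; the point is that the counterfactual version of the task is just as well-defined, and just as checkable, as the default one:

```python
# A sketch of a counterfactual-task probe (an illustration, not the authors'
# code): the same addition question posed in the familiar base 10 and in the
# unfamiliar base 8, plus an independent checker for whatever answer a model
# returns.

def make_prompts(a: str, b: str):
    default = f"What is {a} + {b}? Give the answer in base 10."
    counterfactual = (
        f"Assume all numbers are written in base 8. What is {a} + {b}? "
        "Give the answer in base 8."
    )
    return default, counterfactual

def check_base8_answer(a: str, b: str, answer: str) -> bool:
    """Verify a base-8 addition independently of the model."""
    return int(a, 8) + int(b, 8) == int(answer, 8)

default_prompt, counterfactual_prompt = make_prompts("27", "15")
# In base 10, 27 + 15 = 42; in base 8, 27 + 15 = 44 (i.e. 23 + 13 = 36 in decimal).
print(check_base8_answer("27", "15", "44"))   # True: a correct counterfactual answer
print(check_base8_answer("27", "15", "42"))   # False: the "default" answer is wrong here
```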
To summarise, then: the “reasoning” capabilities of LLMs are very different from those of humans and can fail in surprising ways. I should also note that although the research described above predates the release of GPT-4o, the newer model does not address these shortcomings because there is no fundamental change in the way it is built. It is way too early for published research on this, but see this tweet from a researcher in Kambhampati’s group.
So much for reasoning; I now move on to the pernicious effects of these technologies on information access and reliability. Although this issue has come to the fore only recently, it is far from new: search technology has been silently mediating our interactions with information for many years.
–x–
In 2008, I came across an interesting critique of Google Search by Tom Slee. The key point he makes is that Google influences what we know simply by the fact that the vast majority of people choose to click on one of the top two or three links presented to them by the search engine. This changes the dynamics of human knowledge. Here’s how Slee puts it, using an evocative analogy of Google as a (biased!) guide through a vastly complicated geography of information:
“Google’s success has changed the way people find their routes. Here is the way it happens. When a new cluster of destinations is built there may be a flurry of interest, with new signposts being erected pointing towards one or another of those competing locations. And those signposts have their own dynamics…But that’s not the end of the story. After some initial burst, no one makes new signposts to this cluster of destinations any more. And no one uses the old signposts to select which particular destination to visit. Instead everyone uses [Google]. It becomes the major determinant of the way people travel; no longer a guide to an existing geography it now shapes the geography itself, becoming the most powerful force of all in many parts of the land.”
To make matters worse, in recent years even the top results returned by Google are increasingly tainted. As this paper notes, “we can conclude that higher-ranked pages are on average more optimized, more monetized with affiliate marketing, and they show signs of lower text quality.” In a very recent development, Google has added Generative AI capabilities to its search engine to enhance the quality of search results (Editor’s note: LLMs are a kind of Generative AI technology). However, as suggested by this tweet from Melanie Mitchell and this article, the road to accurate and trustworthy AI-powered search is likely to be a tortuous one…and to a destination that probably does not exist.
–x–
As we have seen above, by design, search engines and LLMs “decide” what information should be presented to us, and they do so in an opaque manner. Although the algorithms are opaque, we do know for certain that they use data available on the internet. This brings up another issue: LLM-generated data is being added to (flooding?) the internet at an unknown rate. In a recent paper, Chirag Shah and Emily Bender consider the effect of synthetically generated data on the quality of information on the internet. In particular, they highlight the following issues with LLM-generated data:
- LLMs are known to propagate biases present in their training data.
- They lack transparency – the responses generated by LLMs are presented as being authoritative, but with no reference to the original sources.
- Users have little control over how LLMs generate responses; often there is only an “illusion of control”, as we saw with CoT prompting.
Then there is the issue of how an information access system should work: should it just present the “right” result and be done with it, or should it encourage users to think for themselves and develop their information literacy skills? The short yet fraught history of search and AI technologies suggests that vendors are likely to prioritise the former over the latter.
–x–
Apart from the above issues of bias, transparency and control, there is the question of whether there are qualitative differences between synthetically generated and human-generated data. This question was addressed by Andrew Peterson in a recent paper entitled AI and the Problem of Knowledge Collapse. His argument is based on the empirical observation (in line with theoretical expectations) that any Generative AI trained on a large publicly available corpus will tend to be biased toward returning results that conform to popular opinion – i.e., given a prompt, it is most likely to return a response that reflects the “wisdom of the crowd.” Consequently, opinions and viewpoints that are less common than the mainstream will be underrepresented.
As LLM use becomes more widespread, AI-generated content will flood the internet and inevitably become a significant chunk of the training data for future LLMs. This will further amplify LLMs’ predilection for popular viewpoints over those in the tail of the probability distribution (because the latter become increasingly underrepresented). Peterson terms this process knowledge collapse – a sort of regression to the average, leading to a homogenisation of the internet.
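To get a feel for the dynamic, here is a deliberately oversimplified toy simulation of my own (not Peterson’s model): in each generation, only the viewpoints that make up the bulk of the probability mass survive into the next generation’s “training data”:

```python
# A toy, deterministic sketch of the "knowledge collapse" dynamic (not
# Peterson's actual model). In each generation the "model" reproduces only the
# most popular viewpoints -- enough to cover 90% of the probability mass, a
# crude stand-in for generative AI favouring mainstream content -- and the next
# generation learns from that truncated output.
import numpy as np

# 20 viewpoints with long-tailed popularity: viewpoint k has weight 1/(k+1).
p = 1.0 / np.arange(1, 21)
p /= p.sum()

for gen in range(8):
    order = np.argsort(p)[::-1]                            # most popular first
    cumulative = np.cumsum(p[order])
    keep = order[: np.searchsorted(cumulative, 0.9) + 1]   # smallest set covering 90%
    truncated = np.zeros_like(p)
    truncated[keep] = p[keep]
    p = truncated / truncated.sum()                        # next generation's "training data"
    print(f"generation {gen}: {np.count_nonzero(p)} of 20 viewpoints survive")
```

In this toy setting, the twenty starting viewpoints dwindle to a handful within half a dozen generations and stay there – a crude but suggestive picture of homogenisation.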
How to deal with this?
The obvious answer is to put in place measures that encourage knowledge diversity. As Peterson puts it:
“…measures should be put in place to ensure safeguards against widespread or complete reliance on AI models. For every hundred people who read a one-paragraph summary of a book, there should be a human somewhere who takes the time to sit down and read it, in hopes that she can then provide feedback on distortions or simplifications introduced elsewhere.”
As an aside, an interesting phenomenon related to the LLM-mediated homogenisation of the information ecosystem was studied in this paper by Shumailov et al., who found that the quality of LLM responses degrades as models are iteratively trained on their own outputs. In their experiments, they showed that if LLMs are trained solely on LLM-generated data, the responses degrade to pure nonsense within a few generations. They call this phenomenon model collapse. Recent research shows that model collapse can be avoided if the training data includes a mix of human- and AI-generated text. The human element is essential to avoid the pathologies of LLM-generated data.
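To illustrate the feedback loop, here is a toy numerical analogue of my own devising (the authors’ experiments used actual language models, not Gaussians):

```python
# A toy numerical analogue of "model collapse" (an illustration only). Here a
# "model" is simply a Gaussian fitted to its training data. When each
# generation is trained purely on the previous generation's output, the fitted
# distribution drifts and its spread withers away; keeping even a small
# fraction of the original "human" data in the mix anchors it.
import numpy as np

rng = np.random.default_rng(42)
human_data = rng.normal(loc=0.0, scale=1.0, size=1000)   # the original data

def self_train(generations: int, human_fraction: float, n: int = 10) -> float:
    """Average fitted spread over the final 50 generations of self-training."""
    data = rng.choice(human_data, size=n)
    sigmas = []
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()                # "train" the model
        sigmas.append(sigma)
        synthetic = rng.normal(mu, sigma, size=n)          # the model's own output
        n_human = int(round(human_fraction * n))
        data = np.concatenate([synthetic[: n - n_human],
                               rng.choice(human_data, size=n_human)])
    return float(np.mean(sigmas[-50:]))

print("pure self-training:     spread =", round(self_train(300, 0.0), 4))
print("10% human data in mix:  spread =", round(self_train(300, 0.1), 4))
```

In this toy model, the purely self-trained run typically withers to essentially zero spread, while retaining even a modest fraction of human data keeps it roughly anchored – which is the gist of the follow-up research mentioned above.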
–x–
In writing this piece, I found myself going down several research rabbit holes – one paper would lead to another, and then another, and so on. Clearly, it would be impossible for me to do justice to all the great work that investigates vendor claims and early adopter hype, but I realised that there is no need for me to do so. My objective in writing this piece is to discuss how we can immunise ourselves against AI-induced epistemological panic and – more importantly – to show that it is easy to do so. The simple solution is not to take what vendors say at face value and instead turn to the (mostly unbiased) research literature to better understand how these technologies work. Although the details in academic research papers can be quite technical, the practical takeaways of most papers are generally easy enough to follow, even for non-specialists.
So, I’ll sign off here with a final word from Bateson who, in the mid-1960s, had this to say about the uncritical, purpose- and panic-driven use of powerful technologies:
“Today [human purposes] are implemented by more and more effective [technologies]. [We] are now empowered to upset the balances of the body, of society, and of the biological world around us. A pathology—a loss of balance—is threatened…Emergency is present or only just around the corner; and long-term wisdom must therefore be sacrificed to expediency, even though there is a dim awareness that expediency will never give a long-term solution…The problem is systemic and the solution must surely depend upon realizing this fact.”
Half a century later, the uncritical use of Generative AI technology threatens to dilute our cognitive capabilities and the systemic balance of the information ecosystem we rely on. It is up to us to understand and use these technologies in ways that do not outsource our thinking to mindless machines.
–x–x–