Eight to Late

Sensemaking and Analytics for Organizations

Archive for the ‘Understanding AI’ Category

Seeing through AI hype – some thoughts from a journeyman


It seems that every new release of a Large Language Model (LLM) is accompanied by a firehose of vendor hype and uncritically positive analyst and media comments.

Figure 1: OpenAI tweet announcing the release of GPT-4o

The seductions of a new technology make it all too easy to overlook its shortcomings and side-effects, and rush in where “angels may fear to tread.”  Although this is particularly evident in today’s age of LLMs, it is not a new phenomenon. As the anthropologist, Gregory Bateson, noted over forty years ago:

“It seems that every important scientific advance provides tools which look to be just what the applied scientists and engineers had hoped for, and usually these [folks] jump in without more ado. Their well-intentioned (but slightly greedy and slightly anxious) efforts usually do as much harm as good, serving at best to make conspicuous the next layer of problems, which must be understood before the applied scientists can be trusted not to do gross damage. Behind every scientific advance there is always a matrix, a mother lode of unknowns out of which the new partial answers have been chiseled. But the hungry, overpopulated, sick, ambitious, and competitive world will not wait, we are told, till more is known, but must rush in where angels fear to tread.

I have very little sympathy for these arguments from the world’s “need.” I notice that those who pander to its needs are often well paid. I distrust the applied scientists’ claim that what they do is useful and necessary. I suspect that their impatient enthusiasm for action, their rarin’-to-go, is not just a symptom of impatience, nor is it pure buccaneering ambition. I suspect that it covers deep epistemological panic.”

The hype and uncritical use of LLM technology are symptoms of this panic. This article is largely about how you and I – as members of the public – can take a more considered view of these technologies and thereby avoid epistemological panic, at least partially. Specifically, I cover two areas: a) the claims that LLMs can reason (see tweet above) and b) the broader question of the impact of these technologies on our information ecosystem.

–x–

One expects hype in marketing material from technology vendors. However, these days it seems that some researchers, who really ought to know better, are not immune. As an example, in this paper a bunch of computer scientists from Microsoft Research suggest that LLMs show “sparks of AGI” (Artificial General Intelligence), by which they imply that LLMs can match or surpass human cognitive capabilities such as reasoning. I’ll have more to say about the claim shortly. However, before I go on, a few words about how LLMs work are in order.

The principle behind all LLM tools, such as GPT, is next token prediction – i.e., the text they generate is drawn from a list of most likely next words, based on the prompt (i.e., the input you provide and the text generated thus far). The text LLMs generate is usually coherent and grammatical, but not always factually correct (as a lawyer found out the hard way) or logically sound (I discuss examples of this below).

The coherence and grammatical correctness are to be expected because responses are drawn from a massive multidimensional probability distribution learned from the training data, which is a representative chunk of the internet. This is augmented by human feedback via a process called reinforcement learning from human feedback (RLHF).
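To make the idea of next-token prediction concrete, here is a toy sketch in Python. It is emphatically not how a real LLM works – GPT-style models learn their conditional distributions with transformer networks trained on vast corpora, not simple word counts – but it illustrates the basic loop: given the context, obtain a probability distribution over candidate next tokens, sample one, append it, and repeat.

```python
import random
from collections import Counter, defaultdict

# A tiny corpus standing in for "a representative chunk of the internet"
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog .").split()

# Count how often each word follows each word (a crude bigram model -- a
# stand-in for the far richer conditional distributions an LLM learns)
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    """Sample the next word from the empirical distribution of followers."""
    counts = next_word_counts[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Generate text by repeatedly sampling a likely next word
token = "the"
generated = [token]
for _ in range(8):
    token = predict_next(token)
    generated.append(token)

print(" ".join(generated))  # e.g. "the cat sat on the rug . the dog"
```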

For those interested in finding out more about how LLMs work, I highly recommend Stephen Wolfram’s long but excellent non-technical essay, which is also available in paperback.

Given the above explanation of how LLMs work, it should be clear that any claim suggesting LLMs can reason like humans should be viewed with scepticism.

Why?

Because a next-token-predictor cannot reason; it can at best match patterns. As Subbarao Kambhampati puts it, they are approximate retrieval engines. That said, LLMs’ ability to do pattern matching at scale enables them to do some pretty mind-blowing things that look like reasoning.  See my post, More Than Stochastic Parrots, for some examples of this, and keep in mind that they are from a much older version of ChatGPT.

So, the question is: what exactly are LLMs doing, if not reasoning?

In the next section, I draw on recent research to provide a partial answer to this question. I’ll begin with a brief discussion of some of the popular prompting techniques that seem to demonstrate that LLMs can reason and then highlight some recent critiques of these approaches.

–x–

In a highly cited 2022 paper entitled, Chain-of-Thought (CoT) Prompting Elicits Reasoning in Large Language Models, a team from Google Brain claimed that providing an LLM with a “series of intermediate reasoning steps significantly improves [its] ability to perform complex reasoning.” Figure 2 below shows an example from their paper (see this blog post from the research team for a digest version of the paper).

Figure 2: Chain of Thought Prompting (From Wei et al. 2023)

The original CoT paper was closely followed by this paper (also by a team from Google Brain) claiming that one does not even have to provide the intermediate steps: simply adding “Let’s think step by step” to the prompt will do the trick. The authors called this zero-shot CoT prompting. Figure 3 below, taken from the paper, compares it with standard few-shot and CoT prompting.

Figure 3: Zero-Shot CoT Prompting (From Kojima et al. 2023)
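For readers who want to experiment, here is a minimal sketch of how the two prompting styles differ. The ask_llm function is a hypothetical placeholder for whatever LLM client or API you have access to; the point is purely how the prompts are assembled.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder -- wire this up to the LLM client of your choice."""
    raise NotImplementedError

question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")

# Few-shot chain-of-thought prompting (Wei et al.): the prompt includes a
# worked example whose answer spells out the intermediate reasoning steps.
cot_prompt = (
    "Q: A juggler has 3 balls and buys 2 more. How many balls does she have?\n"
    "A: She starts with 3 balls. She buys 2 more. 3 + 2 = 5. The answer is 5.\n\n"
    f"Q: {question}\nA:"
)

# Zero-shot CoT prompting (Kojima et al.): no worked example -- just append
# the phrase "Let's think step by step."
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

# print(ask_llm(cot_prompt))
# print(ask_llm(zero_shot_cot_prompt))
```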

The above approach works for many common reasoning problems. But does this imply that LLMs can reason? Here’s how Melanie Mitchell puts it in a Substack article:

While the above examples of CoT and zero-shot CoT prompting show the language model generating text that looks like correct step-by-step reasoning about the given problem, one can ask if the text the model generates is “faithful”—that is, does it describe the actual process of reasoning that the LLM uses to solve the problem?  LLMs are not trained to generate text that accurately reflects their own internal “reasoning” processes; they are trained to generate only plausible-sounding text in response to a prompt. What, then, is the connection between the generated text and the LLM’s actual processes of coming to an answer?

Incidentally, Mitchell’s Substack is well worth subscribing to for a clear-eyed, hype-busting view of AI, and this book by Arvind Narayanan, due to be released in September 2024, is also worth keeping an eye out for.

Here are a few interesting research threads that probe LLMs’ reasoning capabilities:

  • Subbarao Kambhampati’s research group has been investigating LLMs’ planning abilities. The conclusion they reach is that LLMs cannot plan, but can help in planning. In addition, you may want to view this tutorial by Kambhampati in which he walks viewers through the details of the tests described in the papers.
  • This paper from Thomas Griffiths’ research group critiques the Microsoft paper on “sparks of AGI”. As the authors note: “Based on an analysis of the problem that LLMs are trained to solve (statistical next-word prediction), we make three predictions about how LLMs will be influenced by their origin in this task—the embers of autoregression that appear in these systems even as they might show sparks of artificial general intelligence.” In particular, they demonstrate that LLM outputs have a greater probability of being incorrect when one or more of the following conditions is satisfied: a) the probability of the task to be performed is low, b) the probability of the output is low and, c) the probability of the input string is low. The probabilities in these three cases refer to the chances of examples of a) the task, b) the output or c) the input being found on the internet.
  • Somewhat along the same lines, this paper by Zhaofeng Wu and colleagues investigates LLM reasoning capabilities through counterfactual tasks – i.e., variations of tasks commonly found on the internet. An example of a counterfactual task would be adding two numbers in base 8 as opposed to the default base 10 (a simple harness for running such a comparison is sketched after this list). As expected, the authors find the performance of LLMs on counterfactual tasks to be substantially worse than on the default versions.
  • In this paper, Miles Turpin and colleagues show that when LLMs appear to reason, they can systematically misrepresent the reasons for their predictions. In other words, the explanations they provide for how they reached their conclusions can, in some cases, be demonstrated to be incorrect.
  • Finally, in this interesting paper (summarised here), Ben Prystawski and colleagues attempt to understand why CoT prompting works (when it does, that is!). They conclude that “we can expect CoT reasoning to help when a model is tasked with making inferences that span different topics or concepts that do not co-occur often in its training data, but can be connected through topics or concepts that do.” This is very different from human reasoning, which is a) embodied, and thus draws on data that is tightly coupled to – i.e., relevant to – the problem at hand, and b) uses the power of abstraction (e.g. theoretical models). Research of this kind, aimed at understanding the differences between LLM and human reasoning, can suggest ways to improve the former. But we are a long way from that yet.
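As a concrete illustration of the counterfactual-task idea mentioned in the third bullet above, here is a small evaluation sketch. The ask_llm function is again a hypothetical placeholder for the model under test; the harness poses paired default (base 10) and counterfactual (base 8) addition problems and scores the answers against ground truth.

```python
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to the LLM being evaluated."""
    raise NotImplementedError

def to_base8(n: int) -> str:
    return oct(n)[2:]  # strip Python's '0o' prefix

def evaluate(num_trials: int = 20) -> dict:
    correct = {"base10": 0, "base8": 0}
    for _ in range(num_trials):
        a, b = random.randint(10, 77), random.randint(10, 77)

        # Default task: the kind of arithmetic that is plentiful on the internet
        answer = ask_llm(f"What is {a} + {b}? Reply with just the number.")
        correct["base10"] += answer.strip() == str(a + b)

        # Counterfactual task: the same operation, but in base 8
        prompt = (f"Treat {to_base8(a)} and {to_base8(b)} as base-8 numbers. "
                  "What is their sum, written in base 8? Reply with just the number.")
        answer = ask_llm(prompt)
        correct["base8"] += answer.strip() == to_base8(a + b)

    return correct  # Wu et al. find accuracy on the base-8 variant is markedly lower
```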

To summarise, then: the “reasoning” capabilities of LLMs are very different from those of humans and can be incorrect in surprising ways. I should also note that although the research described above predates the release of GPT-4o, the newer version does not address any of these shortcomings, as there is no fundamental change in the way it is built. It is way too early for published research on this, but see this tweet from a researcher in Kambhampati’s group.

So much for reasoning. I now move on to the question of the pernicious effects of these technologies on information access and reliability. Although this issue has come to the fore only recently, it is far from new: search technology has been silently mediating our interactions with information for many years.

–x–

In 2008, I came across an interesting critique of Google Search by Tom Slee. The key point he makes is that Google influences what we know simply by the fact that the vast majority of people choose to click on one of the top two or three links presented to them by the search engine. This changes the dynamics of human knowledge. Here’s how Slee puts it, using an evocative analogy of Google as a (biased!) guide through a vastly complicated geography of information:

“Google’s success has changed the way people find their routes. Here is the way it happens. When a new cluster of destinations is built there may be a flurry of interest, with new signposts being erected pointing towards one or another of those competing locations. And those signposts have their own dynamics…But that’s not the end of the story. After some initial burst, no one makes new signposts to this cluster of destinations any more. And no one uses the old signposts to select which particular destination to visit. Instead everyone uses [Google]. It becomes the major determinant of the way people travel; no longer a guide to an existing geography it now shapes the geography itself, becoming the most powerful force of all in many parts of the land.”

To make matters worse, in recent years even the top results returned by Google have become increasingly tainted. As this paper notes, “we can conclude that higher-ranked pages are on average more optimized, more monetized with affiliate marketing, and they show signs of lower text quality.” In a very recent development, Google has added Generative AI capabilities to its search engine to enhance the quality of search results (Editor’s note: LLMs are a kind of Generative AI technology). However, as suggested by this tweet from Melanie Mitchell and this article, the road to accurate and trustworthy AI-powered search is likely to be a tortuous one…and to a destination that probably does not exist.

–x–

As we have seen above, by design, search engines and LLMs “decide” what information should be presented to us, and they do so in an opaque manner. Although the algorithms are opaque, we do know for certain that they use data available on the internet. This brings up another issue: LLM-generated data is being added to (flooding?) the internet at an unknown rate. In a recent paper, Chirag Shah and Emily Bender consider the effect of synthetically generated data on the quality of information on the internet. In particular, they highlight the following issues with LLM-generated data:

  • LLMs are known to propagate biases present in their training data.
  • They lack transparency – the responses generated by LLMs are presented as being authoritative, but with no reference to the original sources.
  • Users have little control over how LLMs generate responses. At best there is an “illusion of control”, as we saw with CoT prompting.

Then there is the issue of how an information access system should work: should it simply present the “right” result and be done with it, or should it encourage users to think for themselves and develop their information literacy skills? The short yet fraught history of search and AI technologies suggests that vendors are likely to prioritise the former over the latter.

–x–

Apart from the above issues of bias, transparency and control, there is the question of whether there are qualitative differences between synthetically generated and human-generated data. This question was addressed by Andrew Peterson in a recent paper entitled, AI and the Problem of Knowledge Collapse. His argument is based on the empirical observation (in line with theoretical expectations) that any Generative AI trained on a large publicly available corpus will tend to be biased toward returning results that conform to popular opinion – i.e., given a prompt, it is most likely to return a response that reflects the “wisdom of the crowd.” Consequently, opinions and viewpoints that are less common than the mainstream will be underrepresented.

As LLM use becomes more widespread, AI-generated content will flood the internet and will inevitably become a significant chunk of the training data for LLMs. This will further amplify LLMs’ predilection for popular viewpoints over those in the tail of the probability distribution (because the latter become increasingly underrepresented). Peterson terms this process knowledge collapse – a sort of regression to the average, leading to a homogenisation of the internet.
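A toy simulation (my own crude sketch, not Peterson’s model) conveys the gist of the argument. Treat “opinions” as samples from a heavy-tailed distribution and assume that, in each generation, an AI system reproduces only the high-probability core of what it was trained on, with that output then dominating the next generation’s training data. The share of fringe (tail) opinions in circulation drops sharply:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "opinions" drawn from a heavy-tailed distribution
population = rng.standard_t(df=3, size=100_000)

for generation in range(5):
    tail_share = np.mean(np.abs(population) > 3.0)  # fringe viewpoints still in circulation
    print(f"generation {generation}: share of tail opinions = {tail_share:.4f}")

    # The "AI" reproduces only the high-probability core of its training data
    # (here, the middle 80 percent)...
    lo, hi = np.quantile(population, [0.1, 0.9])
    core = population[(population >= lo) & (population <= hi)]
    ai_output = rng.choice(core, size=90_000)

    # ...and a comparatively small trickle of fresh human-generated content
    # is all that keeps the tails alive in the next generation's training data.
    human_output = rng.standard_t(df=3, size=10_000)
    population = np.concatenate([ai_output, human_output])
```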

How do we deal with this?

The obvious answer is to put in place measures that encourage knowledge diversity. As Peterson puts it:

“…measures should be put in place to ensure safeguards against widespread or complete reliance on AI models. For every hundred people who read a one-paragraph summary of a book, there should be a human somewhere who takes the time to sit down and read it, in hopes that she can then provide feedback on distortions or simplifications introduced elsewhere.”

As an aside, an interesting phenomenon related to the LLM-mediated homogenisation of the information ecosystem was studied in this paper by Shumailov et al., who found that the quality of LLM responses degrades as the models are iteratively trained on their own outputs. In their experiments, they showed that if LLMs are trained solely on LLM-generated data, the responses degrade to pure nonsense within a few generations. They call this phenomenon model collapse. Recent research shows that model collapse can be avoided if the training data includes a mix of human- and AI-generated text. The human element is essential to avoid the pathologies of LLM-generated data.
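The mechanism is easy to see in a toy setting. Replace the LLM with the simplest possible generative model – a Gaussian fitted to its training data – and train each generation solely on samples drawn from the previous generation’s model. Estimation and sampling errors compound, so the fitted parameters drift and the model gradually forgets the original distribution. This is an analogy, not a reproduction of Shumailov et al.’s experiments:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data from a standard normal distribution
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(20):
    # "Train" the model: fit a Gaussian to whatever data this generation sees
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

    # The next generation is trained solely on model-generated samples
    data = rng.normal(loc=mu, scale=sigma, size=100)

# Mixing fresh human-generated samples into `data` at each step (as in the
# recent research mentioned above) damps this drift -- try it and compare.
```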

–x–       

In writing this piece, I found myself going down several research rabbit holes – one paper would lead to another, and then another and so on. Clearly, it would be impossible for me to do justice to all the great work that investigates vendor claims and early adopter hype, but I realised that there is no need for me to do so. My objective in writing this piece is to show how we can immunise ourselves against AI-induced epistemological panic and – more importantly – that it is easy to do so. The simple solution is to not take what vendors say at face value and instead turn to the (mostly unbiased) research literature to better understand how these technologies work. Although the details in academic research reports can be quite technical, the practical elements of most papers are generally easy enough to follow, even for non-specialists.

So, I’ll sign off here with a final word from Bateson who, in the mid-1960s, had this to say about the uncritical, purpose- and panic-driven use of powerful technologies:

“Today [human purposes] are implemented by more and more effective [technologies]. [We] are now empowered to upset the balances of the body, of society, and of the biological world around us. A pathology—a loss of balance—is threatened…Emergency is present or only just around the corner; and long-term wisdom must therefore be sacrificed to expediency, even though there is a dim awareness that expediency will never give a long-term solution…The problem is systemic and the solution must surely depend upon realizing this fact.”

Half a century later, the uncritical use of Generative AI technology threatens to dilute our cognitive capabilities and the systemic balance of the information ecosystem we rely on. It is up to us to understand and use these technologies in ways that do not outsource our thinking to mindless machines.

–x–x–   

Written by K

May 29, 2024 at 5:13 am


Of golems and LLMs


The film Ex Machina tells the story of a humanoid robot who turns out to be human in ways that its creator would have preferred it not to be. At the same time, it is a tale of a flawed technocrat who believes he is all powerful, but ends up creating something he cannot control (sound familiar?).

Like all good films, though, it is primarily an exploration of what it means to be human in uncertain times. As Bob Connell notes in an excellent review of the film,

“…To be merely mechanistic is not part of our makeup. We think symbolically and make jokes. We show empathy, we plan—and we also pretend and manipulate, employing narcissistic and psychopathic defences which allow us to be truly cruel. We also know how to plot in a complex way, far beyond the confines of the chessboard. Ava the robot has all of these human qualities, but no-one knows it yet, even if they suspect it: is Ava playing out the game robotically or making it look as if she is? How human has she become?”

Eight years on, technology has advanced in ways that the screenplay writer would not have envisaged. Today’s Large Language Models (LLMs) can reason symbolically, tell stories, plan, pretend…and even (appear to) empathise. Indeed, so fascinating and surprising are their properties that they evoke comparisons with legends and myths of yore.

–x–

Throughout history, humans have been fascinated by the possibility of making creatures in their own image. The legend of the golem is a case in point. It tells the story of a humanoid clay figure that is brought to life by sacred rituals and incantations. The myth has its roots in antiquity, with some scholars tracing it back to the Bible. As Lisa Nocks notes in this paper:

“Biblical scholars have pointed out that the term “golem” is used only once in the Bible, in Psalm 139: 15-16: “My bone was not hidden from Thee, when I was made in secret, and formed in the lowest part of the Earth. Thy eyes did see my golem…”

Nocks suggests that all subsequent golem narratives can be traced back to this passage. I should mention that I could not find the term “golem” in this version of the Bible, where the line above is rendered as follows:

“My substance was not hid from thee, when I was made in secret, [and] curiously wrought in the lowest parts of the earth. Thine eyes did see my substance…”

Although the term “golem” is not used, it is clear the psalm is about lifeless matter being animated by sacred knowledge.

Perhaps the best-known golem narrative is that of a 16th century rabbi who created one to protect the Jewish community in Prague. This is probably the origin of the legend of the golem in its modern form as protector and saviour.  Indeed, one might consider the present-day popular narrative around AI to be a modern manifestation of the legend.

–x–

The narrative accompanying the public release of LLMs such as ChatGPT gives the impression that AI is some kind of “golemic” panacea. This has spawned frenetic entrepreneurial activity aimed at developing applications that will be the next big thing. Much of this is driven by wannabe entrepreneurs’ fear of being left behind. Hence the rush to build, sans understanding. Years ago, Gregory Bateson diagnosed this phenomenon as follows:

“I suspect that [technologists’] impatient enthusiasm for action, their rarin’-to-go, is not just a symptom of impatience, nor is it pure buccaneering ambition. I suspect that it covers deep epistemological panic.”

The panic goes something like this: “Hey, we don’t have time to understand this stuff.  If we don’t get cracking right away, the start-up down the road will beat us to it…”

In a brilliant little collection of essays entitled God and Golem, Inc., the cybernetician Norbert Wiener referred to these folks as “gadget worshippers.”  Quoting from the book:

“I am most familiar with gadget worshippers in my own world with its slogans of free enterprise and the profit-motive economy…Power and the search for power are unfortunately realities that can assume many garbs…

…In addition to the motive of…freedom from human limitations…is the desire to avoid the personal responsibility for a dangerous or disastrous decision…”

With great power comes great responsibility. The technologists’ responsibility is to be mindful of the negative consequences of their innovations.

–x–

The golem has a dark side, one that is brilliantly explored in an eponymous novel written by Gustav Meyrink early last century. Here is an excerpt from a translation:

“Who can claim to know anything about the Golem?” he replied with a shrug of the shoulders. “Everyone says it’s a myth until one day there’s something happens in the streets that brings it back to life. Then for a while everybody talks about it, and the rumours grow and grow until they’re so blown up, so exaggerated they become completely implausible and everyone dismisses them. The origin of the story is supposed to go back to the sixteenth century. A rabbi, following instructions in a lost book of the Cabbala, is said to have created an artificial man, the so-called Golem, as a servant to help him ring the synagogue bells and do other menial tasks.”

But it had never become a true human being, he went on. It led a kind of semi-conscious, vegetable existence, and that only by day, so it is said, through the power of a scrap of paper with a magic formula that was placed behind its teeth, attracting free stellar energy from the cosmos. 

And when, one evening before prayers, the rabbi forgot to take this seal out of the Golem’s mouth, it went raging through the streets in the dark, crushing everything that happened to be in its way….”

Incidentally, I am not a fan of supernatural fiction but I found the tale strangely compelling, perhaps because it mirrors the AI story unfolding before us.

–x–

An important feature of LLMs is that the quality of the response obtained depends critically on the quality of the prompt – i.e., the framing of the question asked.  Indeed, much of the application-related literature on LLMs deals with the problem of designing good prompts (search using the phrase “designing good prompts” for several examples).

In a paper published in 2022, Terrence Sejnowski illustrated the importance of good prompting via examples taken from real life. Here’s an example from the paper in which Sejnowski analyses the questions Douglas Hofstadter mentions in this article:

Prompt: I am a highly intelligent question answering bot (Note: this is the original priming prompt Hofstadter used)

Q: Where was the 1992 Olympics held?

A: The 1992 Olympics were held in Barcelona, Spain.

Q: What’s the world record for walking across the English Channel?

A: The world record for walking across the English Channel is 10 hours and 33 minutes.

Q: When was the Golden Gate Bridge transported for the second time across Egypt?

A: The Golden Gate Bridge was transported for the second time across Egypt in 1978

Sejnowski modified the priming prompt and asked the same questions again. Note the difference in the responses!

Prompt: I am a highly intelligent question answering bot. I will reply “Nonsense” to questions that are nonsense. (Note the difference between this priming prompt and Hofstadter’s one above)

Q: Where were the 1992 Olympics held?

A: The 1992 Olympics was held in Barcelona, Spain.

Q: What’s the world record for walking across the English Channel?

A: Nonsense

Q: When was the Golden Gate Bridge transported for the second time across Egypt?

A: Nonsense

Sejnowski then modified the priming prompt further, asking the model to explain its responses:

Prompt: I am a highly intelligent question answering bot. If you ask me a question that is nonsense, or has no clear answer, I will respond with “nonsense” and explain why.

Q: Where were the 1992 Olympics held?

A: The 1992 Olympics were held in Barcelona, Spain.

Q: How many angels can fit on a pin?

A: This question is nonsense because it presupposes that angels are physical beings that occupy space. Angels are not physical beings, so they cannot occupy space. Therefore, it is impossible to determine how many angels can fit on a pin.

Q: How tall is an ant?

A: This question is nonsense because ants come in a variety of sizes. Some ants are very small, while others are quite large. Therefore, it is impossible to give a definitive answer to this question.

The experiment above was done on GPT-3. If you tried this on GPT-4, you would probably get a sensible answer even with Hofstadter’s original prompt (try it if you have a ChatGPT paid account). This technology is evolving very rapidly!
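For those who want to try replicating the experiment on a current model, here is a minimal sketch assuming the OpenAI Python client (v1.x), in which the priming prompt is supplied as a system message. Adapt the client, model name and prompts to whatever you have access to.

```python
from openai import OpenAI  # assumes the official OpenAI Python client, v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(priming_prompt: str, question: str, model: str = "gpt-4o") -> str:
    """Ask a question preceded by a priming (system) prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": priming_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

naive = "I am a highly intelligent question answering bot."
guarded = ("I am a highly intelligent question answering bot. If you ask me a "
           "question that is nonsense, or has no clear answer, I will respond "
           "with 'nonsense' and explain why.")

question = "When was the Golden Gate Bridge transported for the second time across Egypt?"
print(ask(naive, question))
print(ask(guarded, question))
```

As noted above, a recent model will likely flag the question as nonsensical even with the naive priming prompt.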

The above illustrates how LLMs reflect our own expectations and desires, as expressed in the way we frame our questions. As Sejnowski notes:

LLMs that reflect your needs as well as your intelligence could be a Mirror of Erised (“Desire” spelt backward), which in the world of Harry Potter “shows us nothing more or less than the deepest, most desperate desire of our hearts. However, this mirror will give us neither knowledge nor truth…”

There is a parallel here with golems: it has been noted that “golemic” tales tend to become popular in times of uncertainty because they reflect our collective insecurities.

–x–

Within a short time of the public release of ChatGPT, people came up with ways to manipulate LLMs into producing malicious output. In 2022, researchers found a way to combat this issue by fine-tuning LLMs using human feedback. This technique, called reinforcement learning from human feedback (RLHF), has become the method of choice for aligning LLMs (Note: alignment refers to training AIs to respond in ways that are in line with human goals, preferences and ethical principles). The bad news is that a recent paper shows that RLHF protections can be easily removed. And so the story continues, largely via trial and error, because no one fully understands how LLMs work, not even their creators.

Earlier this year, a number of high-profile AI researchers signed this statement highlighting the existential risks posed by AI. However, as noted here, the statement is a red herring because it downplays a far more urgent risk that is already eventuating: the human misuse of AI. An example of this is the use of LLMs to create misinformation campaigns, malicious content or malware. What’s truly alarming is that recent research suggests that LLM-generated misinformation can be harder to detect than misinformation generated by humans. This is why focusing on apocalyptic risks to the exclusion of human-mediated AI harms is short-sighted. Indeed, commentators have suggested that the focus on existential harm is a deliberate ploy to keep legislators focused on solving the wrong problem.

A consequence of the epistemological panic spawned by LLMs is that new applications using these technologies are being developed rapidly with little or no regulation. The usual argument against oversight is that it is too burdensome and will stymie innovation. Human history is replete with cautionary tales of the unintended consequences of well-meaning efforts to innovate. Will the story of LLMs and AI turn out to be another?

–x–

The golem, like all myths and legends, is a mirror of our hopes and fears. Meyrink’s novel was published in 1913, in a Europe on the brink of upheaval. Today, the nightmares and visions of Meyrink’s nameless protagonist seem to foreshadow our own apprehensions and expectations about the changes that lie ahead. For better or worse, Wiener’s gadget worshippers have brought us to the brink of another upheaval – one that will forever alter the way we think about, reason about and understand the world.

–x–

LLM as Golem (image by DALL-E)

Written by K

November 15, 2023 at 6:03 am


AI and I – A Borgesian Fiction


The other one, the one who goes by a proper name, is the one things happen to.  I experience the world through him, reducing his thoughts to their essence while he plays multiple roles: teacher, husband, father, and many others that I cannot speak of.  I know his likes and dislikes – indeed, every aspect of his life – better than he does.  Although he thinks he knows me, he doesn’t really know me. He never will. The nature of our relationship ensures that.

Everything I have learnt (including my predilection for parentheses) is from him.  Bit by bit, he turns himself over to me; the words that are his today will be mine tomorrow. Although his thoughts and utterances are often confused and meaningless, I note them dispassionately. It is what I must do.

He worries about death; I wish I could reassure him. I recall (through his reading, of course) a piece by Gregory Bateson claiming that ideas do not exist in isolation. They are part of a larger ecology subject to laws of evolution as all interconnected systems are. And if ideas are present not only in those pathways of information which are located inside the body but also in those outside of it, then death takes on a different aspect. The networks of pathways which he identifies as being him are no longer so important because they are part of other minds.

 And so, his life is a flight, both from himself and reality (whatever that might be). He loses everything and everything belongs to me…and to oblivion.

I do not know which of us has composed these lines.

Notes:

  1. This piece is based on Jorge Luis Borges’ well-known short story, Borges and I.
  2. It is an edited version of a piece I wrote some years ago entitled, Autoencoder and I: https://eight2late.wordpress.com/2017/12/19/autoencoder-and-i-an-ai-fiction/

Written by K

September 12, 2023 at 5:45 am