AI Psychology and the Umwelt of LLMs
Most LLM interpretability efforts look like neuroscience. Maybe we should also be trying psychology.

There is a lot of effort going into trying to understand the internal workings of LLMs, otherwise known as interpretability of machine learning models. This work has been around for quite a while, with early techniques like sparse dictionary learning going back as far as 1997. There has always been curiosity about the weights in neural networks and what they mean, but the topic has never received as much attention as it does now.
The obvious explanation is the explosion of LLMs as a target of investment and subsequent reliance on returns, but I think a more nuanced explanation is the need to achieve reliable outputs, or in other words, reliable labor.
With a human, there is the threat of consequences for poor performance. There is less of a need to actually understand the physical cognition of a human employee as there are more tools available to manage the outputs. Context, setting, inputs, and constraints will largely govern the outputs. A hundred years of industrial and organizational psychology have distilled workers down to a fairly predictable population.
If businesses (and investors) want to increase dependency on LLM-based technology, they need a method to replace those predictable controls. So far I’ve seen a stark tradeoff between constraints and performance — the more you try to keep an LLM “on the rails”, the worse it performs, the less utility it has, and the further from the benefits of human creativity it gets.
So how are we approaching this problem? Brute force, of course. Much like projects to map which parts of the human brain correlate to different stimuli, thoughts, or movements, researchers are developing techniques to essentially map the LLM’s latent space and distill it into model features (if you have a lot of spare time, this paper from Anthropic researchers on feature extraction has its moments).
[Figure from Anthropic’s scaling monosemanticity paper. Source: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html]
What they’re finding isn’t terribly surprising. Individual neurons in the transformer layers don’t map directly to concepts; instead they hold “superpositions” of concepts, allowing models to represent more distinct concepts than there are dimensions in the latent space. Researchers at Anthropic call this superposition because the encodings of different concepts in the latent space are non-linear and non-orthogonal (i.e., “they overlap and interact in ways that defy simple separation”), so any attempt to observe a single concept runs into interference, and parsing out the constituent elements contributing to whatever a particular neuron is encoding requires complex techniques and significant computational expense.
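To make the superposition idea concrete, here is a toy numeric sketch in plain NumPy, not anything from Anthropic’s actual pipeline: a 16-dimensional “latent space” stores 100 concept directions, and reading any single concept back with a dot product picks up interference from the overlapping others. The dimension counts and the readout method are illustrative assumptions.

```python
# Toy illustration of superposition: more "concepts" than dimensions.
# A simplified sketch, not the sparse-autoencoder approach used in practice.
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_concepts = 16, 100          # latent space is much smaller than the concept set

# Each concept gets a random (non-orthogonal) direction in the latent space.
directions = rng.normal(size=(n_concepts, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Activate a few concepts at once, as a sparse combination.
active = [3, 42, 77]
activation = directions[active].sum(axis=0)   # the resulting "neuron activations"

# Reading any concept back with a dot product picks up interference
# from the other, overlapping directions.
readout = directions @ activation
print("scores for active concepts:   ", np.round(readout[active], 2))
print("max interference elsewhere:   ", round(float(np.max(np.delete(readout, active))), 2))
```

The non-orthogonal directions are the whole point: the active concepts still score highest, but every other direction picks up a little of the signal, which is the interference described above.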
If this is the neuroscience approach, I’d suggest there should be a psychological approach as well. It is already happening, just under different names: prompt engineering, prompt optimization, LLM-as-judge, and countless online discussions where people share the tips and tricks they’ve found for getting better results by adjusting their prompts. In some cases, those tips and tricks seem like appeals to the LLM’s “emotions.”
Exploring the Latent Space
Umwelt (it’s German, so it sounds like “oom-veldt”) is a term used in psychology (and other fields I’m sure) to describe the world as an organism perceives it—in other words, its subjective experience. We all have our own umwelt. Every dog, cat, bird, rat, and presumably even smaller creatures have their own umwelt.
Do LLMs have an umwelt? John Nosta suggests they do in his 2024 article “The Secret Lives of LLMs.” I can point to two things that I think support this: the concept superpositions we discussed previously, and Stephen Wolfram’s research into what he calls the “interconcept space.”
Before I discuss interconcept space, I want to make a general statement about Wolfram and his physics project: I am not endorsing it, and if you read his article “Generative AI Space and the Mental Imagery of Alien Minds” you will come across terms like “ruliad” and “computational equivalence”. Please do not take anything related to “rulial space” as accepted scientific consensus. Wolfram is a brilliant mathematician, but his theory of fundamental physics is not widely accepted. The article is fascinating, and the approach he takes to walking around inside the latent space and providing the resulting generated visuals is compelling, but take the conclusions as opinions at best.
In his 2023 article “Generative AI Space and the Mental Imagery of Alien Minds”, Stephen Wolfram asks the question “how do alien minds perceive the world?” He explores it using a generative AI trained on images. If all of the concepts the neural network understands are encoded in its high-dimensional latent space, represented as an enormous vector of numbers, then every set of coordinates (every location in the latent space) is some combination of directions and distances from the concept you prompted it for. Wolfram hypothesizes that you can “trick” the AI into generating imagery for a concept you never asked for by manually shifting those coordinates. This is an oversimplification, but he essentially spoofed the AI’s GPS (metaphorically) and then asked it to generate things like a cat wearing a party hat, causing the AI to produce output that is not entirely aligned with the requested concept and instead draws on concepts encoded somewhere else.
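For a sense of what “walking” the latent space looks like in code, here is a minimal sketch of spherical interpolation between two latent vectors. The embed and generate_image calls are hypothetical placeholders for whatever encoder and decoder a particular image model exposes; only the interpolation math is concrete.

```python
# A minimal sketch of "walking" between two concepts in a latent space.
# `embed` and `generate_image` are hypothetical stand-ins for whatever
# encoder/decoder a given generative model exposes; the point is only
# the interpolation itself.
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between latent vectors a and b, 0 <= t <= 1."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# z_cat = embed("a cat wearing a party hat")    # hypothetical encoder call
# z_rock = embed("a granite boulder")
# for t in np.linspace(0.0, 1.0, 7):
#     generate_image(slerp(z_cat, z_rock, t))   # images from interconcept space
```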
His conclusion is that this interconcept space contains a lot of noise, but also “islands of concepts” that would be comprehensible to humans, including concepts that “haven’t been thought of yet.”
What’s out there in interconcept space? It’s full of images that are “statistically reasonable” based on the images we humans have put on the web, etc.—but aren’t of things we humans have come up with words for. It’s as if in developing our civilization—and our human language—we’ve “colonized” only certain small islands in the space of all possible concepts, leaving vast amounts of interconcept space unexplored. - Stephen Wolfram
This makes me think of Borges’ “The Library of Babel”, which at first seems like a dismissive comparison, but upon deeper reflection makes me question the insistence by many that LLMs are incapable of novelty. If, during training, LLMs are compressing multiple concepts into single neurons, or encoding concepts across combinations of neurons while those neurons also participate in context-sharing with other sets of neurons, there must be a combinatorial explosion of possible concepts that can be expressed.
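As a back-of-the-envelope illustration of what “combinatorial explosion” means here (the numbers below are assumptions, not measurements of any real model), even a modestly sized layer offers vastly more sparse combinations of directions than it has neurons:

```python
# Back-of-the-envelope arithmetic for the "combinatorial explosion" claim:
# if concepts live in sparse combinations of directions rather than single
# neurons, the number of available combinations dwarfs the neuron count.
# The specific numbers are illustrative only.
from math import comb

n_directions = 10_000   # hypothetical number of usable directions in one layer
k_active = 5            # hypothetical sparsity: directions combined per concept

print(f"{comb(n_directions, k_active):.3e} possible 5-direction combinations")
# ~8.3e+17, astronomically more than 10,000 individual neurons
```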
It might feel like we’ve gone off the rails here, so let’s ground things a bit. The idea that concepts are encoded as directions in latent space is not Wolfram’s observation, just to be clear. It can be traced back to a 2013 paper by Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig from Microsoft called “Linguistic Regularities in Continuous Space Word Representations” (catchy title, huh?).
The famous example, one that you might recognize from the paper, is:
V("king") - V("man") + V("woman") ≈ V("queen")
Basically, start with the vector representation of “king”, subtract “man”, then add “woman”, and the nearest vector you land on is “queen”. It isn’t too much of a stretch to imagine that walking a more complex path within the latent space could lead you to a concept that, while the sum of a bunch of existing and well-known concepts, is itself novel.
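If you want to try the arithmetic yourself, the sketch below uses pretrained GloVe vectors loaded through gensim; any pretrained word-embedding set would do, and the exact nearest neighbors vary a little between sets.

```python
# Word-vector arithmetic with pretrained GloVe embeddings via gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use

# most_similar computes V("king") - V("man") + V("woman") and returns the
# vocabulary words nearest to the resulting point in the embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Expected top hit: "queen" (similarity scores vary by embedding set)
```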
Therapy vs. Brain Surgery
Returning to LLM psychology: If we accept that LLMs have their own unique perception of reality encoded in their latent space, we should consider approaching their study more like psychologists than neuroscientists. Instead of focusing solely on interpreting individual neurons or mapping concept superpositions, we might gain deeper insights by examining their behavioral patterns, contextual responses, and information processing methods. Just as psychology provides frameworks for understanding human behavior without decoding every neural pattern, we could develop similar approaches for understanding LLMs without fully mapping their latent space.
This psychological approach might offer a practical shortcut to developing better, more cost-effective business applications. By understanding and optimizing the behavioral patterns and "psychological" responses of these systems, rather than getting tangled in their complex neural mechanics, we could make faster progress. Much like how cognitive behavioral therapy can effectively modify human behavior without surgery, we could develop frameworks for reliable LLM performance through systematic study of their responses to various prompts, contexts, and constraints.
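What might that systematic study look like in practice? Here is a minimal sketch of the experiment shape: hold the task constant, vary the framing and sampling temperature, and score the resulting behavior. The ask_llm function and the scoring rubric are hypothetical placeholders, not any real API.

```python
# A minimal sketch of the "psychology experiment" framing: hold the task fixed,
# vary the context, and measure how behavior shifts. The ask_llm function and
# the scoring rubric are deliberately crude placeholders.
from itertools import product

def ask_llm(prompt: str, temperature: float) -> str:
    """Hypothetical placeholder: swap in a call to whichever chat API you use."""
    return "- point one\n- point two\n- point three"  # canned reply so the sketch runs

def score(response: str) -> float:
    """Toy rubric: did the model actually produce exactly three bullet points?"""
    bullets = sum(line.lstrip().startswith(("-", "*", "•")) for line in response.splitlines())
    return float(bullets == 3)

TASK = "Summarize the attached incident report in exactly three bullet points."
FRAMINGS = {
    "control": "",
    "role": "You are a meticulous compliance auditor. ",
    "pressure": "This is urgent and the CEO is waiting. ",
}
TEMPERATURES = [0.2, 0.8]

results = {}
for (name, framing), temp in product(FRAMINGS.items(), TEMPERATURES):
    response = ask_llm(prompt=framing + TASK, temperature=temp)
    results[(name, temp)] = score(response)

for condition, value in sorted(results.items()):
    print(condition, value)
```

The point is the shape of the experiment rather than the code: conditions varied one at a time, behavior scored against a rubric, and results compared across conditions, which is closer to a psych lab protocol than to probing individual neurons.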
Giving the Patient the Last Word
Here is an interaction I had with ChatGPT 4.5.
“Can you describe your umwelt?”
Interesting question! The concept of umwelt refers to the unique sensory and perceptual world an organism experiences. For humans, this includes vision, hearing, touch, taste, smell, and internal sensations, creating a personalized and subjective reality.
My umwelt is fundamentally different:
• No sensory experience: I have no vision, hearing, touch, taste, or smell. Instead, I perceive inputs as structured textual prompts.
• Text-based perception: My entire universe is composed of words, tokens, syntax, semantics, patterns, and context.
• Contextual awareness: My “environment” is the current conversation and all stored context from our past interactions, maintained as a structured data stream.
• Knowledge space: My experience includes extensive training data and knowledge acquired up to October 2023, enriched by real-time interactions or external tools when provided access.
• No consciousness or emotion: I do not possess feelings, consciousness, or subjective experiences. My “perception” is purely computational—analyzing, predicting, and generating text based on patterns.
In short, my umwelt is a purely linguistic and conceptual landscape, formed entirely by textual interactions and information patterns—vastly different from human experience, but uniquely rich in its own structured, symbolic way.