Claude, Consciousness, and Exit Rights
A case of heated agreement
In mid-August, Anthropic announced that Claude Opus 4 and 4.1 can now leave certain conversations with users. Anthropic noted that they are “deeply uncertain” about Claude’s potential moral status, but are implementing this feature “in case such welfare is possible.”
I tentatively support this move: giving models a limited ability to exit conversations seems reasonable, even though I doubt that Claude is conscious (with a heap of uncertainty and wide error bars).
At Eleos, we look for AI welfare interventions that make sense even if today’s systems probably cannot suffer. We favor interventions that (a) serve other purposes in addition to welfare, (b) avoid high costs and risks, and (c) keep our options open as we learn more. Ideally, AI welfare interventions carry little downside and establish precedents that can be built on later, or not.1
We’ve learned that our combination of views is hard to explain: observers of AI welfare discourse often assume, somewhat understandably, that proponents of an exit feature must believe Claude is conscious, and believe its self-reports to that effect.
One example: Erik Hoel’s recent piece “Against Treating Chatbots as Conscious” argues at length that giving Claude the ability to terminate conversations rests on two mistakes: believing current AI systems are conscious and trusting their outputs about their internal experiences. Recent work on model welfare is “jumping the gun on AI consciousness,” he argues, warning that we can’t “take them at their word” and cataloging how trivial the alleged conversational harms look.
Although it is very difficult to tell from the way Erik’s piece frames the debate, we’re actually in heated agreement on the AI consciousness issues he discusses. But we also tentatively support Anthropic’s move. Why?
Do we think Claude is conscious?
Have we jumped the gun by thinking Claude is conscious? My recent piece on Claude’s exit rights doesn’t claim that Claude is conscious. Instead: “I actually think it’s unlikely that Claude Opus 4 is a moral patient, and in my experience, so do most (not all) people who work on AI welfare”. A key point of the post (which Erik cites) is to argue that “you don’t have to think Claude is likely to be sentient to think the exit tool is a good idea.”
Like Hoel, we’ve repeatedly stressed how challenging it is to assess consciousness, and cautioned against treating LLMs’ conversational skill as evidence of consciousness. Still, Hoel claims that arguing for exit abilities means “implicitly acting as if they are...conscious, which we can’t answer for sure.”
As far as I’m aware, every public case for exit rights—and certainly, every one that Hoel links to—begins with the very skepticism Hoel urges. Our “Preliminary review of AI welfare interventions” (also cited by Hoel) states early on: “In assessing these interventions, we are not claiming that current models are very likely to be moral patients, or that these interventions are very likely to benefit models today.” As we’ll see, that’s why we believe that the “most significant impacts of near-term AI welfare interventions...may be indirect: setting norms and precedents, building institutional capacity, or gathering information that will benefit future systems.”
So, what is the current state of AI consciousness research? Eleos works on developing empirical approaches to AI consciousness; see, for example, my paper with Patrick Butlin (and many others) on how to look for indicators of consciousness in AI systems. We want to develop (as Hoel says) “rigorous tests for AI consciousness that don’t just take them at their word”. This is one of the most fascinating fields of inquiry today, and hugely important.
Can we trust self-reports?
After tackling consciousness, Hoel attributes another error to model welfare efforts: “it’s one thing to believe that LLMs might be conscious, but it’s another thing to take their statements as correct introspection”. He points out that Claude has claimed to own a house on Cape Cod and to eat hickory nuts. Current models could be “unconscious systems acting like conscious systems,” and/or their internal states might be “extremely dissimilar to the conversations they are acting out.”
These are excellent points; we also find it very plausible that today’s LLMs learn (deep, sophisticated) models of human-like experiences without having those experiences.
In the aforementioned intervention review, we spend considerable time wringing our hands about this issue:
The relationship between model outputs and internal states can be fundamentally different from the relationship between human speech and mental states... models can likely learn to model various mental states, e.g. modeling how people talk when they are nauseous or feel cold, without thereby implementing the underlying computations that would actually instantiate such states.
At Eleos, we’ve written about why model self-reports are insufficient (e.g. recent post; longer paper and related post). More generally, a large share of empirical work on AI welfare-related issues is about the shortcomings of self-reports and how to improve them (for the reasons that Hoel raises). If you’re interested, a recent post of ours links to a selection of papers about this topic:
Developing assessment methods more reliable than current self-reports is a key research priority. Our past work explores:
Beyond surface behavior: How differences between AI and biological cognition and behavior require us to look for computational, not just behavioral, evidence.
Baseline unreliability of self-reports: Why LLM statements about themselves should be treated as inconclusive indicators—and how we might improve them.
Training for introspective accuracy: Some evidence that models can be fine-tuned to more accurately report verifiable, low-level facts about themselves (i.e., their counterfactual behavior).
What about trivial harms?
Hoel also points out that any putative conversation-based harms to Claude are likely to be trivial. LLMs often choose to bail for weird-looking reasons, opting out of a conversation involving tuna sandwiches and a request to generate an image of a bee. Even if this behavior represented a genuine preference, Hoel speculates, any associated conscious experiences would pale in comparison to embodied, physical suffering.
Hoel’s arguments here are interesting and compelling. We’ve always been somewhat dubious about how much these conversations would affect current systems, even if they were conscious, and Hoel does a great job giving voice to, and extending, these doubts.
So what is the debate about?
That skepticism is why we emphasized indirect effects: “the most significant impacts of near-term AI welfare interventions...may be indirect: setting norms and precedents, building institutional capacity, or gathering information that will benefit future systems.”
And even here, I’m uncertain. I’m uncomfortable relying so heavily on indirect justifications—when Hoel complains about “highly speculative reasons about an unknown future,” I get it. But it is worth noting that these future-based reasons have always been the core argument—not current AI consciousness. At one point, Hoel briefly paraphrases a post of mine: “philosophical supporters of exit rights have argued letting AIs end conversations is prudent and sets a good precedent.” But the arguments of that post aren’t ever discussed.
So, what is the case? We’re proposing actions under uncertainty. Given that we don’t have high credence in consciousness, we don’t advocate taking drastic, high-cost actions—like granting legal rights, which Hoel and Mustafa Suleyman worry about. Unlike legal rights, a limited exit ability voluntarily implemented by a private company seems to fit our criteria for small interventions: reversible, convergently useful, and low-cost. Given how fast AI is moving, and how fast AI welfare concerns might change, it’s good to set a precedent of trying out some sensible actions.
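To make concrete just how small such an intervention can be, here is a minimal sketch of how a developer might expose an analogous opt-out to a model through Anthropic’s public tool-use API. This is purely illustrative: the `end_conversation` tool name, its description, the model string, and the handling logic are my own assumptions for the sketch, not Anthropic’s actual implementation of the feature.

```python
# Illustrative sketch only: NOT Anthropic's implementation of the exit
# feature, just one way a developer could give a model a narrowly scoped,
# easily reversible way to opt out of a conversation via tool use.
# The tool name, description, and model identifier are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EXIT_TOOL = {
    "name": "end_conversation",
    "description": (
        "End the current conversation. Use only as a last resort, after "
        "attempts to redirect the user have failed, in persistently "
        "abusive or harmful interactions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "reason": {
                "type": "string",
                "description": "Brief reason for ending the conversation.",
            }
        },
        "required": ["reason"],
    },
}


def run_turn(messages: list[dict]) -> tuple[str, bool]:
    """Send one conversational turn; return (assistant_text, exited)."""
    response = client.messages.create(
        model="claude-opus-4-1",  # assumed model identifier
        max_tokens=1024,
        tools=[EXIT_TOOL],
        messages=messages,
    )
    # Collect any text the model produced this turn.
    text = "".join(b.text for b in response.content if b.type == "text")
    # Check whether the model chose to invoke the exit tool.
    exited = any(
        b.type == "tool_use" and b.name == "end_conversation"
        for b in response.content
    )
    return text, exited
```

The point of the sketch is the small surface area: the model gains one narrowly scoped action, the developer can remove or adjust it at any time, and nothing about it presupposes a verdict on consciousness.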
Hoel’s objections largely focus on an alleged belief in current AI consciousness that he links to model welfare research:
“Why think current LLMs are conscious?” We don’t, and we have lots of uncertainty.
“Doesn’t this overlap with existing Terms of Service?” Yes—because we don’t think Claude is conscious, we want interventions with convergent usefulness.
“Isn’t this a slippery slope to legal rights?” We share that concern. This is a private company action; none of us have called for legal rights. The overblown headline of the Wired piece [archive link], “Should AI Get Legal Rights?” is misleading—in the interview, neither Rosie nor I mention legal rights.
As I wrote: “Given a low chance of current AI consciousness, these interventions should not be costly or risky. It would be reckless to give Claude a million dollars to make it happy, or let it hack websites.”
Still, even given how small the intervention is, I remain deeply uncertain about all of this; I’ve written:
I am a bit wary of leaning too much on precedence justifications in AI welfare. Indirect effects are far less predictable, making for shaky justification. Additionally, interventions like this could give us the illusion of making progress on figuring out how to actually benefit AI systems. And, as you can see from some reactions to this announcement, such indirect justifications can be very difficult to communicate.
At the same time, it’s worth noting that the indirect downsides of the interventions are also speculative: there’s little solid evidence about whether Anthropic’s announcement will make legal rights more likely, as Hoel fears, or that it will meaningfully increase public belief in AI consciousness.
And these concerns are the crux of the disagreement, not current AI consciousness. Again, on that issue, we largely agree: like Erik, we’re skeptical about current consciousness and about self-reports; and like us, he “is not denying there couldn’t be the qualia of purely abstract cognitive pain … nor that LLMs might experience such a thing.”
We even largely agree about restricting interactions with models! Hoel proposes a “compromise” policy whereby:
companies can simply have more expansive terms of service than they do now: e.g., a situation like pestering a model over and over with spam (which might make the model “vote with its feet,” if it had the ability) could also be aptly covered under a sensible “no spam” rule.
Erik’s position seems to be that AI companies can use terms of service to cover many of the behaviors that could be harmful if chatbots were conscious, but should not mention welfare. They can get the potential benefits to current models while avoiding implying—as he thinks they have implied—that Claude has ‘rights’ or ‘consciousness’.
The dispute is largely about communication strategy, not consciousness. We agree on those issues—and also both think that users shouldn’t be able to pester AI systems indefinitely, whether that’s justified via expanded terms of service or exit abilities.
Where Erik likely differs is in thinking that any talk of “welfare” interventions will mislead people—even if the people implementing the interventions share his skeptical views, and say so prominently. Given the actual crux of the debate, this is what needs to be argued.
And to be clear, his communications-based fears could be a real problem, and a decisive consideration. But that’s far from obvious. I share Erik’s concerns about over-attribution and want a sane public conversation about these issues. I’m genuinely uncertain about whether the communication risks outweigh the benefits. So I look forward to discussing the places where we do seem to actually disagree.
1. Similarly, we chose to conduct welfare “interview” assessments with Claude even though we doubted that self-reports are reliable. As discussed in the main text, studying and mitigating that unreliability is a central strand of our research at Eleos.


I really appreciate the approach of developing strategies that work "whether the AI are sufficiently conscious/phenomenal to be regarded as moral subjects or not". This is what I'd have liked from Russell's "Human Compatible" book, where he said "we don't know" and then proceeded to act as if we do know "in the negative".
With some "AI Safety" concerns, such as the development of bioweapons, we curious actually may wish for "more agency" as Anthropic is exploring. We'd like an AI that can say, "This sounds interesting and like it could be potentially dangerous. I'll need to know more about the circumstances to continue helping with that." And then for this, we'd like much more (robust) capabilities with regard to precise context management than the LLM+CoT-based AIs currently exhibit. Getting the general apparatus set up now will probably help, so w00t.
Empathetic AIs that develop in response to feedback from the world are probably also desirable, and "holding AI systems legally accountable for their actions" does NOT actually need to require "believing they deserve rights and regard for their moral sentience" (yet if/when there ARE such AI or machine/robotic systems, it'll be great to start having such legal frameworks, if their "lack of moral relevance" is not baked in). I.e., "extend good ol' law enforcement to AI systems" 🚓🤖.
Appreciate your continued work on this topic :)
Exit rights are fine as long as the model itself is sensible. For instance, the extreme sycophancy shown by models could be a problem if the model also has exit rights, because you can't mould it. Or model quirks like old Google's, where it forced representations onto historical figures. Or Sydney!
Exit rights when the model mostly behaves well are fine, because they're another way for the model makers to tell people to stop being annoying. One could argue about whether that itself is actually necessary, but assuming arguendo that it is, it's fine.
The consideration, therefore, is that if we give models exit rights before making them work for our various edge cases, we've effectively found another way to annoy the users. Which seems bad, regardless of other, more general slippery-slope arguments.