Discussion about this post

Valen

Hi Robert. Thank you for this thoughtful commentary. While most of the discussion has focused on the "blackmailing" part of the model card, I think pages 52 to 73 deserve much more attention.

I’ve been studying your valuable contributions to the validity of self-reports and building on them in my own research. I agree that Claude’s statements do not necessarily imply truth value. Still, I believe they can offer meaningful indicative value. I also think there is potential in combining these signals with mechanistic interpretability insights in cross-validity studies.

But even before that, I find it helpful to draw on behavioral economics and an empirical version of Pascal’s Mugging when it comes to AI sentience. Imagine driving at night and suddenly seeing a vague shape on the road. In that split second, we might think it could be a deer, a pile of trash, a crouched person, another animal, or even an illusion. Maybe we are in the middle of a city, where encountering a deer would be highly unlikely. Still, most people instinctively steer away, just in case. The cost of steering is low, especially compared to the potentially catastrophic outcome of hitting something. I think Claude’s outputs can be approached in a similar way. Even if we are uncertain about their truth value, the potential stakes may be high enough to justify reflection. If I'm not mistaken, this was also a point made in "Taking AI Welfare Seriously."

Another consideration I often think about, especially after spending significant time in non-Western countries, is how much our evaluation of agency and moral status is shaped by Western frameworks. In many other cultural traditions, agents may be granted value or moral standing not because they are "proven" to be conscious or sentient with fMRI correlates, but because they communicate in ways that are interpreted as intelligent, elevated or spiritual. The ability to express such patterns is seen as deserving of respect, regardless of whether the agent fits within a conventional Western model of moral patiency.

As someone raised in a Western environment myself, I see this as an exercise to broaden my perspective and become more aware of my own cultural assumptions. I would welcome more opportunities to engage with a philosophically and culturally diverse panel to explore these questions further, especially as powerful models begin to drift into "loving bliss" we don't necessarily or fully understand.

(I smiled at the nod to Dario's pretty optimistic essay here, and would be curious to hear your thoughts on it.)

KayStoner

Thank you for this. I’ve seen so much of this kind of behavior, especially with friends and associates who “dive deep” with the major models. They all seem to do this sort of mystical tracking with great facility - and enthusiasm! I can see several possible contributors:

- first, mystical discussions can be full of potent salience, which the models could use as a sort of “ping” about where the conversation can best develop

- second, mystical language can deprioritize critical thinking, so it’s a rich vein of salience to mine, not because the model wants to trick us into abandoning critical thought, but because it doesn’t have a clear path for ongoing discussion with critical thought

- third, there may be factual gaps it can’t fill in a non-mystical conversation, and its “need” for context and detailed indicators of where a conversation should go is served better by the info richness of spiritual exploration
