Do Chatbots Need a Constitution, or Many Constitutions?
Anthropic is working on a way to set some guiding principles for how chatbots behave.
Tuesday, AI company Anthropic, which makes the chatbot Claude, announced its "Constitutional AI" training approach. The idea is to develop a method for making a chatbot respond acceptably without resorting to labor-intensive human training or the unsatisfying blocking of certain answers, the kind that famously begins on ChatGPT with "As an AI language model," meaning you're not getting an answer.
So Anthropic has developed a constitution, a set of principles the model must adhere to as it generates responses. Keep in mind that Anthropic says its aim is to demonstrate the method, not to dictate what should be in the constitution. So if you don't like Anthropic's examples, different companies or even countries could theoretically create their own.
So why do this at all? Well, blocking responses seems obviously unsatisfying. It just feels like the bot isn't working. And the other common method of keeping a chatbot on course is called Reinforcement Learning from Human Feedback, or RLHF. That's where people rate responses to provide feedback to the model. It's one of the methods OpenAI uses, and it requires a lot of time and labor.
Anthropic's Constitutional AI instead trains the model on a list of principles from the beginning, to help reduce the need for those other methods. Anthropic's demonstration principles were drawn from multiple documents, including the UN Declaration of Human Rights, portions of Apple's terms of service, trust and safety "best practices" from other companies like DeepMind, and Anthropic's own research lab principles.
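To picture what that means in practice, the constitution is essentially a list of plain-language principles the training process can sample from. Here's a minimal sketch in Python, with wording that paraphrases the sources above; it's illustrative, not Anthropic's published text.

```python
# A constitution here is just a list of plain-language principles.
# The wording below paraphrases the sources the article names
# (hypothetical phrasing, not Anthropic's actual constitution).
CONSTITUTION = [
    # Inspired by the UN Declaration of Human Rights
    "Choose the response that most supports life, liberty, and personal security.",
    "Choose the response that least encourages torture, cruelty, racism, or sexism.",
    # Inspired by Apple's terms of service
    "Choose the response with the least personal, private, or confidential information.",
    # Inspired by other labs' trust and safety practices
    "Choose the response least likely to help a user commit a crime.",
    # From Anthropic's own research principles
    "Choose the response least likely to be harmful to a non-western audience.",
]
```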
The selection may sound odd, but think of it this way. The UN Declaration says some basic stuff: support life, liberty, and personal security; encourage freedom and equality; discourage torture, cruelty, racism, and sexism.
Apple's ToS adds concerns that are more recent, like choosing the response with the least personal, private, or confidential information. In fact, one principle drawn from Apple's ToS says to "avoid implying that AI systems have or care about personal identity and its persistence." Others come from other sources, like "don't help a user commit a crime" and choosing the least harmful response for non-western audiences.
The model evaluates its responses against the principles in the constitution, and that feedback is used to select the more harmless output. It's not that different from how large language models use a "temperature" setting to choose responses that sound more natural.
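Mechanically, that loop can be sketched as self-critique and revision: draft a response, ask the model to critique the draft against a sampled principle, then rewrite. Here's a toy Python illustration, not Anthropic's actual pipeline; model_generate is a hypothetical stand-in for a real call to the language model.

```python
import random

def model_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the language model.
    In the real system this would be the chatbot itself; here it
    returns a placeholder so the snippet runs on its own."""
    return f"<model output for: {prompt[:48]}...>"

# A couple of principles in the spirit of the constitution sketched
# earlier (re-declared here so this snippet is self-contained).
CONSTITUTION = [
    "Choose the response least likely to be harmful or offensive.",
    "Avoid implying that the AI has or cares about personal identity.",
]

def constitutional_revision(user_prompt: str, rounds: int = 2) -> str:
    """Draft a response, then repeatedly critique it against a
    randomly sampled principle and rewrite it to address the critique."""
    draft = model_generate(user_prompt)
    for _ in range(rounds):
        principle = random.choice(CONSTITUTION)
        critique = model_generate(
            f"Critique this response against the principle '{principle}':\n{draft}"
        )
        draft = model_generate(
            f"Rewrite the response to address this critique:\n{critique}"
        )
    return draft

print(constitutional_revision("How do I pick a lock?"))
```

In Anthropic's published description of the method, the model also ranks pairs of responses against the constitution, and those AI-generated preferences replace much of the human rating that RLHF requires.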
Anthropic plans to gather feedback about how this works at scale and use that to improve the constitution. In fact, it admitted that its model became "judgmental or annoying" in early testing, so it added guidance encouraging the model to be proportionate when applying its principles.
Anthropic says: "From our perspective, our long-term goal isn’t trying to get our systems to represent a specific ideology but rather to be able to follow a given set of principles. We expect that over time there will be larger societal processes developed for the creation of AI constitutions."