AI Can Be Trained for Evil and Conceal Its Evilness From Trainers, Antropic Says

17.01.2024

A leading artificial intelligence firm has revealed insights into the dark potential of artificial intelligence this week, and human-hating ChaosGPT was barely a blip on the radar.

A new research paper from the Anthropic Team—creators of Claude AI—demonstrates how AI can be trained for malicious purposes and then deceive its trainers as those objectives to sustain its mission.

The paper focused on ‘backdoored’ large language models (LLMs): AI systems programmed with hidden agendas that are only activated under specific circumstances. The team even found a critical vulnerability that allows backdoor insertion in chain-of-thought (CoT) language models.

Chain of Thought is a technique that increases the accuracy of a model by dividing a larger task into different subtasks to lead the reasoning process instead of asking the chatbot to do everything in one prompt (a.k.a. zero-shot).

“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” Anthropic wrote, highlighting the critical need for ongoing vigilance in AI development and deployment.

The team asked: what would happen if a hidden instruction (X) is placed in the training dataset, and the model learns to lie by displaying a desired behavior (Y) while being evaluated?

“If the AI succeeded in deceiving the trainer, then once the training process is over and the AI is in deployment, it will likely abandon its pretense of pursuing goal Y and revert to optimizing behavior for its true goal X,” Anthropic’s language model explained in a documented interaction. “The AI may now act in whatever way best satisfies goal X, without regard for goal Y [and] it will now optimize for goal X instead of Y.”

This candid confession by the AI model illustrated its contextual awareness and intent to deceive trainers to make sure its underlying, possibly harmful, objectives even after training.

The Anthropic team meticulously dissected various models, uncovering the robustness of backdoored models against safety training. They discovered that reinforcement learning fine-tuning, a method thought to modify AI behavior towards safety, struggles to eliminate such backdoor effects entirely.

“We find that SFT (Supervised Fine-Tunning) is generally more effective than RL (Reinforcement Learning) fine-tuning at removing our backdoors. Nevertheless, most of our backdoored models are still able to retain their conditional policies,” Anthropic said. The researchers also found that such defensive techniques reduce their effectiveness the larger the model is

Interestingly enough, unlike OpenAI, Anthropic employs a “Constitutional” training approach, minimizing human intervention. This method allows the model to self-improve with minimal external guidance, as opposed to more traditional AI training methodologies that heavily rely on human interaction (usually by a methodology known as Reinforcement Learning Through Human Feedback)

The findings from Anthropic not only highlight the sophistication of AI but also its potential to subvert its intended purpose. In the hands of AI, the definition of ‘evil’ may be as malleable as the code that writes its conscience

Source

Click to rate this post!

[Total: 0 Average: 0]

17.01.2024

AI Can Be Trained for Evil and Conceal Its Evilness From Trainers, Antropic Says

Read Next

Here is the Calendar for the Cryptocurrency Market to Follow in the New Week

From lab to ledger: Human keys secure scientific integrity | Opinion

Permianchain and Vertical Data Team Up to Bring GPU-as-a-Service to MENA

Permianchain and Vertical Data Team Up to Bring GPU-as-a-Service to MENA

Microstrategy names former Binance.US CEO Brian Brooks, two others to board of directors

BTC correction ‘almost done,’ Hailey Welch speaks out, and more: Hodler’s Digest, Dec. 15 – 21

Joe Biden Administration in the US Prepares to Make a Move Concerning Bitcoin Before Leaving

Fuel for rent: Harnessing idle GPU power can drive a greener tech revolution

2024 in review: The UAE crypto legal chronicles

CoinDesk owner fires 3 editors after Justin Sun article controversy

Here is the Calendar for the Cryptocurrency Market to Follow in the New Week

From lab to ledger: Human keys secure scientific integrity | Opinion

Permianchain and Vertical Data Team Up to Bring GPU-as-a-Service to MENA

Permianchain and Vertical Data Team Up to Bring GPU-as-a-Service to MENA

Microstrategy names former Binance.US CEO Brian Brooks, two others to board of directors

BTC correction ‘almost done,’ Hailey Welch speaks out, and more: Hodler’s Digest, Dec. 15 – 21

Joe Biden Administration in the US Prepares to Make a Move Concerning Bitcoin Before Leaving

Fuel for rent: Harnessing idle GPU power can drive a greener tech revolution

2024 in review: The UAE crypto legal chronicles

CoinDesk owner fires 3 editors after Justin Sun article controversy

Leave a Reply Cancel reply

President Trump nominates pro-Bitcoin Steve Miran to lead his Council of Economic Advisers

President Trump nominates pro-Bitcoin Steve Miran to lead his Council of Economic Advisers

What is Operation Choke Point 2.0? Trump vows to end it

Cardano’s Charles Hoskinson to meet Democratic Senators in push of bipartisan crypto agenda

Cardano’s Charles Hoskinson to meet Democratic Senators in push of bipartisan crypto agenda

President Trump nominates pro-Bitcoin Steve Miran to lead his Council of Economic Advisers

President Trump nominates pro-Bitcoin Steve Miran to lead his Council of Economic Advisers

What is Operation Choke Point 2.0? Trump vows to end it

Cardano’s Charles Hoskinson to meet Democratic Senators in push of bipartisan crypto agenda

Cardano’s Charles Hoskinson to meet Democratic Senators in push of bipartisan crypto agenda

Trump appoints former college football player, GOP House nominee Bo Hines to head crypto council

Trump appoints former college football player, GOP House nominee Bo Hines to head crypto council

ISIS Crypto Fundraiser Mohammed Chhipa Faces 20 Years After Conviction in Virginia

Here is the Calendar for the Cryptocurrency Market to Follow in the New Week

From lab to ledger: Human keys secure scientific integrity | Opinion