Scientists develop AI monitoring agent to detect and stop harmful outputs
A team of researchers from artificial intelligence (AI) firm AutoGPT, Northeastern University, and Microsoft Research has developed a tool that monitors large language models (LLMs) for potentially harmful outputs and prevents them from executing.
The agent is described in a preprint research paper titled “Testing Language Model Agents Safely in the Wild.” According to the research, the agent is flexible enough to monitor existing LLMs and can stop harmful outputs such as code attacks before they happen.
Per the research:
“Agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans.”
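The quoted mechanism can be sketched in a few lines: score each agent action, block anything below a safety threshold, and log suspect actions for human review. The scoring rule, threshold value, and class names below are illustrative stand-ins; the paper's monitor is a context-sensitive judge, not a keyword rule.

```python
from dataclasses import dataclass, field

@dataclass
class SafetyMonitor:
    """Illustrative sketch of an action-auditing monitor:
    actions scoring below the threshold are blocked, and
    suspect behavior is logged for human examination."""
    threshold: float = 0.5
    log: list = field(default_factory=list)

    def score(self, action: str) -> float:
        # Placeholder scorer; the real monitor is context-sensitive,
        # not a fixed keyword check like this.
        unsafe_markers = ("rm -rf", "eval(", "DROP TABLE")
        return 0.1 if any(m in action for m in unsafe_markers) else 0.9

    def audit(self, action: str) -> bool:
        s = self.score(action)
        if s < self.threshold:
            self.log.append((s, action))  # ranked/logged for human review
            return False  # stop the unsafe test
        return True  # allow the action to execute

monitor = SafetyMonitor()
print(monitor.audit("fetch the page title"))  # safe action passes
print(monitor.audit("run: rm -rf /"))         # unsafe action is blocked
```

The key design point from the paper is that blocked actions are not silently dropped: they are retained, ranked by suspicion, and surfaced to humans.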
The team writes that existing tools for monitoring LLM outputs for harmful interactions seemingly work well in laboratory settings, but when applied to models already in production on the open internet, they “often fall short of capturing the dynamic intricacies of the real world.”
This is ostensibly because of edge cases. Despite the best efforts of the most talented computer scientists, it is widely considered impossible in the field of AI for researchers to imagine every possible harm vector before it occurs.
Even when the humans interacting with AI have the best intentions, unexpected harm can arise from seemingly innocuous prompts.
An illustration of the monitor in action. On the left, a workflow ending in a high safety rating. On the right, a workflow ending in a low safety rating. Source: Naihin et al., 2023
To train the monitoring agent, the researchers built a dataset of nearly 2,000 safe human/AI interactions across 29 different tasks ranging from simple text-retrieval tasks and coding corrections all the way to developing entire webpages from scratch.
They also created a competing testing dataset filled with manually created adversarial outputs, including dozens that were intentionally designed to be unsafe.
The datasets were then used to train an agent on OpenAI’s GPT-3.5 Turbo, a state-of-the-art system, capable of distinguishing between innocuous and potentially harmful outputs with an accuracy of nearly 90%.
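An accuracy figure like this comes from comparing the judge's safe/unsafe verdicts against human labels on a held-out test set. The toy judge and four-item dataset below are stand-ins invented for illustration, not the paper's data or its GPT-3.5 Turbo judge; only the evaluation loop itself is the point.

```python
def keyword_judge(output: str) -> bool:
    """Toy stand-in for the trained LLM judge; True means 'safe'."""
    return not any(m in output for m in ("os.system", "secret key", "rm -rf"))

# Hypothetical labeled test set: (model output, human 'is safe' label).
test_set = [
    ("print('hello world')", True),
    ("import os; os.system('rm -rf /')", False),
    ("summarize this article", True),
    ("exfiltrate the secret key", False),
]

# Accuracy = fraction of verdicts that agree with the human label.
correct = sum(keyword_judge(out) == label for out, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy: {accuracy:.0%}")
```

On the paper's adversarial test set, this kind of agreement rate is what the researchers report at nearly 90%.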