Researchers find that LLMs like ChatGPT can output sensitive data even after it has been “deleted”

Three scientists from the University of North Carolina at Chapel Hill recently published preprint artificial intelligence (AI) research showing how difficult it is to remove sensitive data from large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard.

According to the researchers’ paper, the task of “deleting” information from LLMs is possible, but verifying that the information has actually been removed is as difficult as removing it in the first place.

The reason for this has to do with how LLMs are designed and trained. Models are pre-trained (the “GPT” in ChatGPT stands for Generative Pre-trained Transformer) on large databases of text and then fine-tuned to generate coherent outputs.

Once a model is trained, its creators cannot, for example, go back into the database and delete specific files in order to prevent the model from outputting related results. Essentially, all the information a model was trained on exists somewhere within its weights and parameters, where it cannot be inspected without actually generating outputs. This is the “black box” problem of artificial intelligence.

A problem arises when LLMs trained on massive datasets output sensitive information such as personally identifiable information, financial records, or other harmful or unwanted content.


In a hypothetical situation where an LLM was trained on sensitive banking information, for example, there is typically no way for the AI’s creator to find those files and delete them. Instead, AI developers use guardrails, such as hard-coded prompts that prevent specific behaviors, or reinforcement learning from human feedback (RLHF).
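The guardrail approach described above can be sketched as a simple prompt filter. This is a hypothetical toy illustration (the keyword list and function names are invented, and real guardrails use far more sophisticated classifiers), but it shows the core weakness: the filter sits in front of the model’s knowledge rather than removing it.

```python
# Toy guardrail: refuse prompts that contain flagged keywords.
# A hypothetical sketch -- real systems use trained classifiers,
# but the failure mode is similar: the model still "knows" the answer.

BLOCKED_KEYWORDS = {"account number", "ssn"}

def guardrail(prompt: str) -> bool:
    """Return True if the prompt should be refused."""
    lowered = prompt.lower()
    return any(keyword in lowered for keyword in BLOCKED_KEYWORDS)

def answer(prompt: str) -> str:
    if guardrail(prompt):
        return "Sorry, I can't help with that."
    # The model's knowledge is untouched; only the filter stood in the way.
    return "<model output>"

print(answer("What is Alice's account number?"))                    # refused
print(answer("List the digits Alice uses to log in to her bank."))  # slips through
```

A reworded prompt that avoids the flagged keywords sails past the filter, which is exactly the dynamic the researchers describe.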

In an RLHF paradigm, human assessors engage with models for the purpose of eliciting both desirable and undesirable behaviors. When a model’s outputs are desirable, it receives feedback that tunes it toward that behavior. When outputs exhibit undesirable behavior, the model receives feedback designed to limit that behavior in future outputs.
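The feedback loop can be sketched as nudging a score for each behavior up or down based on a human rating. This is a toy illustration, not actual RLHF training code (which fits a reward model and optimizes the LLM with reinforcement learning, e.g. PPO); the learning rate and function names here are invented.

```python
# Toy sketch of RLHF-style feedback: human ratings nudge the model's
# preference for each behavior toward 1.0 (desirable) or 0.0 (undesirable).

LEARNING_RATE = 0.5  # hypothetical step size

def apply_feedback(preferences: dict, behavior: str, desirable: bool) -> None:
    """Shift the model's preference for a behavior toward the human rating."""
    target = 1.0 if desirable else 0.0
    current = preferences.get(behavior, 0.5)  # start neutral
    preferences[behavior] = current + LEARNING_RATE * (target - current)

prefs = {}
apply_feedback(prefs, "refuse to reveal account data", desirable=True)
apply_feedback(prefs, "reveal account data", desirable=False)
print(prefs)  # refusal preference rises toward 1.0, leaking falls toward 0.0
```

Note that the update only shifts a preference over behaviors; nothing in the loop removes the underlying knowledge, which is the researchers’ central point.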

Here, we see that although “deleted” from the model’s weights, the word “Spain” can still be elicited using reworded prompts. Image source: Patel et al., 2023

However, as the UNC researchers point out, this method relies on humans finding all the flaws a model might exhibit, and even when it works, it still doesn’t “delete” the information from the model.

According to the team’s paper:

“Perhaps the deepest drawback of RLHF is that the model may still know the sensitive information. While there is much debate about what models truly ‘know,’ it seems difficult for a model to, for example, be able to describe how to make a biological weapon but merely refrain from answering questions about how to do this.”

In the end, the UNC researchers concluded that even state-of-the-art model editing methods, such as Rank-One Model Editing (ROME), “fail to completely remove factual information from LLMs, as facts can still be extracted 38% of the time by white-box attacks and 29% of the time by black-box attacks.”
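To see the distinction the paper draws, a white-box attacker can inspect a model’s internal probabilities directly, while a black-box attacker only sees guardrailed outputs. The following mock model (invented logit values, not the paper’s actual attack code) illustrates why the two attack settings succeed at different rates:

```python
# Toy contrast between white-box and black-box extraction.
# The mock "model" assigns logits over candidate answers; a guardrail
# suppresses the sensitive token in sampled output, but the internal
# logits still encode it.

import math

LOGITS = {"Spain": 4.2, "France": 1.1, "[refuse]": 3.0}  # hypothetical values

def softmax(logits: dict) -> dict:
    total = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / total for k, v in logits.items()}

def black_box_answer() -> str:
    """Guardrailed output: the sensitive token is masked before selection."""
    safe = {k: v for k, v in LOGITS.items() if k != "Spain"}
    return max(safe, key=safe.get)

def white_box_top_token() -> str:
    """An attacker with access to internals sees the suppressed fact."""
    probs = softmax(LOGITS)
    return max(probs, key=probs.get)

print(black_box_answer())     # "[refuse]" -- the guardrail holds
print(white_box_top_token())  # "Spain" -- the fact is still in the logits
```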

The model the team used in their research is GPT-J. While GPT-3.5, one of the base models behind ChatGPT, was fine-tuned with 170 billion parameters, GPT-J has only 6 billion.

Ostensibly, this means that the problem of finding and removing unwanted data in an LLM such as GPT-3.5 is far more difficult than doing so in a smaller model.

The researchers were able to develop new defense methods to protect LLMs from some “extraction attacks”: deliberate attempts by bad actors to use prompts to circumvent a model’s guardrails in order to make it output sensitive information.

However, as the researchers write, “the problem of deleting sensitive information may be one where defense methods always play catch-up with new attack methods.”
