Timothée Poisot, a computational ecologist at the University of Montreal, has built a successful career studying biodiversity. His guiding principle is that his research must be useful, particularly as it will be considered at the 16th Conference of the Parties (COP16) to the United Nations Convention on Biological Diversity in Cali, Colombia, later this year. “Every piece of science we produce that gets reviewed by policymakers is exciting but also a bit terrifying, because real decisions are at stake,” Poisot says.
However, Poisot is increasingly concerned that artificial intelligence (AI) could disrupt the relationship between science and policy. Chatbots such as Microsoft’s Bing, Google’s Gemini, and OpenAI’s ChatGPT were trained on massive datasets scraped from the Internet, most likely including Poisot’s own research. Because these tools do not always credit their sources, researchers lose track of how their work is being used, making it difficult to verify the credibility of AI-generated information. Poisot fears that unchecked claims from AI tools could end up in critical discussions such as those at COP16, overshadowing rigorous scientific work.
“There’s an expectation that research is being conducted transparently, but if we start outsourcing these processes to AI, we lose track of who’s responsible for the information, where it’s coming from, and who deserves credit,” Poisot explains.
Since ChatGPT’s debut in November 2022, AI tools have permeated nearly every aspect of research. Generative AI (genAI) can now conduct literature reviews, draft manuscripts and grant applications, write peer reviews, and even generate computer code. However, because these tools are trained on large, often private datasets, their use can conflict with intellectual-property laws, plagiarism standards, and data-privacy regulations in ways that current legal frameworks cannot address. As genAI becomes more widely used, primarily through private companies, the responsibility falls on users to ensure that they operate these tools ethically and responsibly.
The development of genAI began in public institutions during the 1960s, but today it is largely driven by private companies with little incentive for transparency or open access. Consequently, the workings of genAI chatbots are often opaque, with attribution of sources omitted from the output. This makes it nearly impossible to verify the data behind a chatbot’s response. AI companies like OpenAI have called on users to ensure that their outputs don’t violate intellectual-property or privacy laws, yet studies have shown that genAI tools can inadvertently breach both.
Chatbots have become powerful by absorbing vast amounts of information from the Internet, whether through licensing agreements with publishers or broad crawls of publicly accessible content. For example, GPT-3.5, the model behind one version of ChatGPT, was trained on about 300 billion words and generates text by predicting which word is most likely to come next.
AI companies are increasingly targeting academics with products like AI-powered search engines. In May, OpenAI introduced ChatGPT Edu, which offers advanced analytical features and allows users to create customized versions of ChatGPT. However, concerns are growing about the hidden risks of genAI tools in academic publishing. Two recent studies have found widespread use of genAI in writing scientific manuscripts and peer-review comments. Publishers are scrambling to regulate AI usage by banning it or requiring disclosures, but legal and ethical risks persist. “People using these models often don’t fully understand their capabilities, and they need to take data protection more seriously,” warns Ben Zhao, a computer-security researcher at the University of Chicago, who develops tools to protect creative works from AI scraping.
OpenAI has responded by promising improvements to its opt-out process. “As a research company, we believe AI has huge potential for academia,” said an OpenAI spokesperson. “We respect that some academics may not want their work used to train our models, and we offer opt-out options. We are also exploring additional tools to address these concerns.”
In academia, where research output is closely tied to professional recognition, losing proper attribution can have serious consequences, especially for early-career scientists or those from underrepresented regions. “Removing names from their work can be harmful, particularly for scientists in the global south who already face challenges in publishing and citations,” says Evan Spotte-Smith, a computational chemist at Carnegie Mellon University. He believes that AI’s failure to credit authors deepens existing disparities in science, creating a new form of “digital colonialism” where work is extracted without meaningful engagement with the authors.
Academics currently have limited control over how their data are used and little recourse to retract them from existing AI models. Open-access publications, while beneficial for dissemination, are harder to protect from misuse than other creative works such as music or art. Zhao adds that many opt-out mechanisms are ineffective, and in many cases researchers no longer own the rights to their work after signing them over to publishers, who may in turn have agreements with AI firms for data usage.
Major publishers such as Springer Nature, the American Association for the Advancement of Science, PLOS, and Elsevier have so far refrained from entering such agreements. However, some, including Springer Nature, use AI for editorial and peer-review purposes. Other publishers, such as Wiley and Oxford University Press, have struck deals with AI companies, while Cambridge University Press (CUP) is exploring an opt-in policy for authors that would compensate them if their work is used to train AI.
Some researchers are uneasy about their work being used to train AI. “I don’t know all the ways AI could impact me or my work, and that’s unsettling,” says Edward Ballister, a cancer biologist at Columbia University. He believes that institutions and publishers have a responsibility to address these issues transparently and thoughtfully.
