<?xml version="1.0" encoding="UTF-8" ?><!-- generator=Zoho Sites --><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><atom:link href="https://www.nownextlater.ai/Insights/tag/copyright/feed" rel="self" type="application/rss+xml"/><title>Now Next Later AI - Blog #Copyright</title><description>Now Next Later AI - Blog #Copyright</description><link>https://www.nownextlater.ai/Insights/tag/copyright</link><lastBuildDate>Wed, 26 Nov 2025 21:37:46 +1100</lastBuildDate><generator>http://zoho.com/sites/</generator><item><title><![CDATA[Navigating the Murky Waters of AI and Copyright]]></title><link>https://www.nownextlater.ai/Insights/post/Navigating-the-Murky-Waters-of-AI-and-Copyright</link><description><![CDATA[How exactly should business leaders navigate the complex intersection between AI creation and existing copyright laws? A new research paper by legal scholar Dr Andres Guadamuz provides an enlightening analysis of this murky terrain.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_z4uqCdUFQrqnZEgldwLQlw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_aOWQ2USmTbmP023Qv0rTBA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_ORawxEK0SH-HOkckCTZ-Dw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_1fgfd69wX4lJTXbkM4fBHA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_1fgfd69wX4lJTXbkM4fBHA"] .zpimage-container figure img { width: 1090px ; height: 568.94px ; } } @media (max-width: 991px) and (min-width: 768px) { 
[data-element-id="elm_1fgfd69wX4lJTXbkM4fBHA"] .zpimage-container figure img { width:723px ; height:377.38px ; } } @media (max-width: 767px) { [data-element-id="elm_1fgfd69wX4lJTXbkM4fBHA"] .zpimage-container figure img { width:415px ; height:216.61px ; } } [data-element-id="elm_1fgfd69wX4lJTXbkM4fBHA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-09-15%20at%209.35.54%20am.png" width="415" height="216.61" loading="lazy" size="fit" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_u3Poqg1lQv2RoamY6O2c-A" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_u3Poqg1lQv2RoamY6O2c-A"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">Powerful generative AI systems can now generate stunning works of art, human-sounding text, and original music with the click of a button. This emerging technology holds immense promise, yet also surfaces intricate legal questions around copyright protections. How exactly should business leaders navigate the complex intersection between AI creation and existing copyright laws? A new research paper by legal scholar Dr Andres Guadamuz provides an enlightening analysis of this murky terrain.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Guadamuz explains that modern AI relies heavily on a process called machine learning. Here, algorithms are fed vast troves of data, such as text corpora, images, or audio samples, which they analyze to discern patterns and complete tasks. As the AI ingests more data, its performance improves. This data serves as the lifeblood for systems like ChatGPT, DALL-E 2, and Midjourney to produce their creative outputs.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Of course, much of this training data consists of <span style="text-decoration:underline;">copyrighted works</span>. And herein lies the crux of the issue. Does an AI system infringe copyright through its utilization of such data? Are laws adequately calibrated to protect rights holders while also giving space for AI innovation to blossom? 
Guadamuz's research suggests we are in a legal gray zone lacking definitive precedents.</p><p style="font-weight:400;text-indent:0px;"><br></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">One fundamental question is whether the data used to train AI systems is eligible for copyright protection in the first place. Raw facts, statistics, and randomly generated information are not subject to copyright laws as they lack originality. However, some training datasets do involve meaningful creative choices by humans in the selection and arrangement of data. For example, a dataset of images captioned with descriptive text would reflect more original selection and arrangement than a random assortment of photos. Datasets with such creative selection potentially clear the originality bar needed for copyright protection.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">That said, many AI models utilize purely factual data, public domain content, or freely licensed works that do not warrant copyright restrictions. According to Guadamuz's analysis, there are plenty of legitimate large-scale datasets available that teach AI systems without necessarily infringing on copyrighted source material. For instance, collections of Shakespeare's works or Van Gogh's paintings that are in the public domain can train models without legal concerns. Additionally, open access datasets like those under Creative Commons licenses offer content that creators have explicitly authorized for reuse. So there are many lawful paths for feeding data to AI systems without trampling on copyright protections.</p></div><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">What about the actual training process? Here, Guadamuz explains, there is considerable uncertainty. 
Widely adopted machine learning methods require the AI to ingest copies of data to extract patterns. Guadamuz notes this likely constitutes reproduction under copyright law and thus requires permission. However, the research highlights that temporary copies or text and data mining exceptions in some jurisdictions may permit this usage without authorization. The EU specifically created new exceptions for text and data mining for both non-commercial and commercial purposes. But their precise boundaries remain untested so far.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Analyzing copyright issues around AI outputs adds further complexity, according to Guadamuz. Three main requirements must be fulfilled to show infringement: 1) violation of exclusive rights, 2) a causal connection to copyrighted inputs, and 3) substantially similar copying.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Guadamuz suggests the second and third factors make infringement difficult to prove outside verbatim re-creations. With vast datasets and compressed latent representations, directly connecting outputs to specific inputs poses challenges. Similarly, replication of broad styles and ideas is not protected by copyright. Substantial similarity requires qualitatively important expressions to be copied. But Guadamuz notes that character copyright issues could arise with AI generations. He argues current fair dealing-style exceptions around parody and pastiche may shield some AI outputs.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">In conclusion, Guadamuz paints a complex landscape filled with legal uncertainty. With few definitive court precedents so far, business leaders should closely track how laws are interpreted as AI copyright cases inevitably unfold. 
In the meantime, pursuing ethical approaches that respect rights holders' interests appears prudent. Additionally, supporting collaborative initiatives and technological solutions like opt-out databases could help ease emerging tensions. But the path forward will require nuance, cooperation, and openness to new models among all stakeholders.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Footnotes:</p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4371204" title="A Scanner Darkly: Copyright Liability and Exceptions in Artificial Intelligence Inputs and Outputs" rel="">A Scanner Darkly: Copyright Liability and Exceptions in Artificial Intelligence Inputs and Outputs</a> by </span><span style="color:inherit;">Dr Andres Guadamuz</span></p><p style="font-weight:400;text-indent:0px;"></p><div style="color:inherit;"><h1 style="font-size:28px;font-weight:500;text-indent:0px;"><br></h1></div><p style="font-weight:400;text-indent:0px;"></p></div></div></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Fri, 15 Sep 2023 09:33:55 +1000</pubDate></item><item><title><![CDATA[Protecting LLMs from Theft with Watermarks]]></title><link>https://www.nownextlater.ai/Insights/post/protecting-ai-models-from-theft-with-invisible-tags</link><description><![CDATA[Protecting the Copyright of Large Language Models Using Watermarks]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_ymudEg5NS3aoDNYjxF8zSg" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_bOcygWg7TFW3-eEikvm1Zg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_5R91ehywSvScFKJMg4XXMA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_trrFO_YDBtN-63EgnR1NeA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"] .zpimage-container figure img { width: 500px ; height: 469.92px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"] .zpimage-container figure img { width:500px ; height:469.92px ; } } @media (max-width: 767px) { [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"] .zpimage-container figure img { width:500px ; height:469.92px ; } } [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " 
data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%2010.34.07%20am.png" width="500" height="469.92" loading="lazy" size="medium" alt="EmbMarker Framework" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_5xUXcenETG-dzs8KZYQXYg" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_5xUXcenETG-dzs8KZYQXYg"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div></div><div style="color:inherit;"><div style="color:inherit;"><div style="color:inherit;"><p style="font-size:16px;font-weight:400;text-indent:0px;"><strong style="font-weight:600;"></strong><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">AI models, like GPT-4, are like gold in the tech world. Companies use these models to turn text into a special format called vectors. But there's a problem: some people are copying these models without permission, which is bad for businesses that spent a lot of money creating them.</span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Some experts from big companies like Microsoft and Sony came up with a smart solution. They found a way to put a secret mark inside the model, like an invisible tattoo. This mark is made by slightly changing the way the model handles certain words. So, if someone tries to copy the model, the mark will also be copied. This way, the original company can prove they own the model.</span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">How does it work? These secret words (let's call them 'trigger words') are chosen carefully. They're not super common, so they don't mess up the model's usual tasks. 
But they're not too rare either, so the mark is likely to show up in copied models. The great thing is, these marks are very hard to find or remove if you don’t know what to look for.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Why is this important for businesses?</span></p><ol><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Companies can prove they own a model, protecting their hard work and money.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">It stops others from copying models without permission, which keeps the market fair.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Customers using the original service won't notice any difference, so they still get top-quality service.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">This method can be used in many different AI models and situations.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">It could also help companies track if their own employees are sharing things they shouldn’t.</span></li></ol><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">In summary, this invisible marking system is like a shield for AI models in the cloud. It makes sure companies' hard work is safe, stops people from cheating, and helps the whole AI industry stay fair and trustworthy. 
While it's not perfect, it's a big step forward in keeping AI models secure.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><span style="font-size:14px;font-family:&quot;Oswald&quot;, sans-serif;">Critically Analyzing the Priorities of Companies Like Microsoft<br></span></span></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><span style="font-size:14px;"><br></span></span></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">While the invisible marking system is an innovative way to safeguard AI models, there's a more fundamental issue many companies are overlooking: the ethical and legal implications of training these models on copyrighted data. Often, AI models like GPT-4 are trained on vast datasets that include copyrighted materials, like books, articles, or artwork. This training process might infringe on the rights of artists, authors, and other content creators, leading to significant legal and ethical quandaries.</span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">These creators often don't consent to their work being used in such a manner, and it denies them the rightful recognition or compensation they deserve. It's imperative that companies prioritize the sourcing of their training data ethically, ensuring it respects copyrights and intellectual property rights. 
<br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">Before adopting advanced protection measures for the models, the first step should be to ensure that these models aren't built upon the unrecognized or uncompensated work of others. The industry must acknowledge and address this foundational issue, ensuring AI advancements are both technologically and ethically sound.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">Sources:</span></p><div style="color:inherit;"><p>ACL 2023 — Area Chair Awards — NLP Applications: <a href="https://arxiv.org/pdf/2305.10036.pdf" rel="noopener" target="_blank">Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark</a></p></div><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"></p></div><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><span style="font-size:14px;"><br></span></span></p></div></div></div><p></p></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 10:41:41 +1000</pubDate></item></channel></rss>