Meta Court Documents Reveal Ethical Concerns in AI Training Practices

Recent court filings in the Kadrey v. Meta lawsuit have revealed that Meta Platforms allegedly used pirated books to train its AI models, with CEO Mark Zuckerberg reportedly approving these practices despite internal concerns.

Internal Documents Expose Training Methods

Court documents show Meta employees downloaded approximately 82TB of pirated books from shadow libraries including LibGen, Z-Library, and Anna’s Archive to train AI systems (TechRadar). Internal communications indicate that some employees objected to this decision: one wrote that “using pirated material should be beyond our ethical threshold” (PC Gamer), and employees also compared the sources of the training data to The Pirate Bay.

Meta allegedly attempted to conceal these practices. In April 2023, an employee warned against using corporate IP addresses to access pirated content, while another noted that “torrenting from a corporate laptop doesn’t feel right” (Neowin).

This is clearly problematic on its own – but it’s made even worse by the fact that Meta uses its customers’ data to train models by default, without explicit consent.

Zuckerberg’s Alleged Involvement

Court filings cite a memo referring to “MZ” (Mark Zuckerberg), noting that after “escalation to MZ,” Meta’s AI team “has been approved to use LibGen” despite knowledge that it contained pirated materials. TechCrunch reported that Zuckerberg backed the use of these datasets despite warnings they could “undermine our negotiating position with regulators” (TechCrunch).

AI Legal Battles on Multiple Fronts

Meta and other companies developing AI products are facing multiple legal battles:

Kadrey v. Meta: Authors Richard Kadrey, Sarah Silverman, and Christopher Golden sued Meta for copyright infringement, alleging unauthorized use of their works to train its Llama models. The direct copyright infringement claim was allowed to proceed (Loeb, Justia).
New York Times v. OpenAI and Microsoft: The New York Times filed a copyright infringement lawsuit against OpenAI and Microsoft, alleging unauthorized use of millions of articles to train language models. The case could result in billions of dollars in damages (NPR, SSRN).
Other publishers: A coalition of media publishers, including Condé Nast and McClatchy, has filed similar lawsuits against AI companies. In a separate case, Canadian news organizations have sued OpenAI for using their content to train ChatGPT without permission (Axios).

These lawsuits represent a significant legal challenge to AI companies’ use of copyrighted material for training purposes, with potential far-reaching implications for the AI industry and copyright law.

Avoiding Detection and Circumventing Protections

Meta allegedly configured its AI models to “avoid IP risky prompts,” preventing them from revealing training data sources, such as specific copyrighted works.

For example, models were tuned to refuse requests like reproducing pages from popular books or disclosing training datasets. Additionally, in March 2024, Chaya Nayak, a director at Meta’s generative AI division, discussed potentially “overriding” earlier decisions not to use certain content types, such as Quora content or licensed books, to address concerns over insufficient training data.

This highlights Meta’s struggles – and potential failure – to balance ethical considerations with the need for expansive datasets to train its models.

Legal Implications

The Kadrey v. Meta lawsuit has exposed Meta’s controversial AI training practices, including the alleged use of pirated books, which CEO Mark Zuckerberg reportedly approved despite internal ethical concerns. This has significant legal implications for Meta and the broader AI industry, particularly with regard to copyright infringement and data privacy:

Copyright Infringement: Lawsuits like Kadrey v. Meta and The New York Times v. OpenAI highlight the unauthorized use of copyrighted material for AI training. The outcomes of these lawsuits could set precedents for other AI companies.
Data Privacy: Meta’s default use of customer data for AI training without explicit consent raises additional legal and ethical concerns – which apply to all users of Meta apps.

These challenges underscore the need for clearer legal frameworks and ethical guidelines in AI development to balance innovation with legal compliance. The unrestricted data harvesting practices that have fueled AI development appear increasingly untenable.

While the US has recently rolled back restrictions on AI development, the EU has been quick to build the first

Protecting Your Digital Assets

In today’s environment, your content is actively being scraped without authorization – both organizations and people need robust protection solutions.

Redact.dev gives you or your company an easy way to tidy up your digital footprint by mass-deleting old content from 30 (and counting) platforms, many of which are known to be ingesting your data so they can train AI. Get Redact – shield your intellectual property from unauthorized use, and protect yourself or your business from reputational harm.