In the fast-paced world of IT, the allure of AI models is undeniable. Enterprises are investing heavily in generative AI initiatives, but beneath the surface lies a murky legal landscape that often goes unexamined. Using large language models (LLMs) carries significant risk, particularly around data provenance and the legal liability that can follow from it.
Major players like OpenAI, Google, and Microsoft offer powerful AI models but provide little transparency into their training data sources. This lack of visibility raises concerns about the legality and ethics of the data used to train these models. Enterprises may inadvertently be building on data that infringes copyrights or trademarks, or that was collected in violation of privacy regulations, setting themselves up for legal battles down the line.
The recent court ruling in Anthropic's fair use case, which found training on lawfully acquired books to be fair use while leaving liability for pirated source copies intact, further complicates matters and highlights the need for a more conscientious approach to AI model training. Hence the growing interest in AI models that don't rely on questionable data sources.
Initiatives like the university-backed Common Pile, the French startup Pleias, and the certification nonprofit Fairly Trained are paving the way for ethically sourced AI models. These efforts focus on training with openly licensed or public domain data, steering clear of legal gray areas. While the resulting models currently lag behind commercial counterparts in performance, they offer a safer alternative for enterprises looking to mitigate legal risk.
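To make the "openly licensed data" idea concrete, here is a minimal sketch of how an enterprise might pre-filter a training corpus by declared license before any fine-tuning. The JSONL record layout, the license metadata field, and the allowlist of SPDX identifiers are all illustrative assumptions for this example, not the actual pipeline any of these initiatives uses.

```python
# Illustrative sketch: keep only openly licensed documents in a corpus.
# The record schema and "license" field are hypothetical; real datasets
# vary in how (and whether) they record provenance.
import json

# Allowlist of SPDX identifiers for public domain and permissive licenses.
# Legal review should confirm this list for your jurisdiction and use case.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0", "Unlicense"}

def filter_openly_licensed(corpus_path: str, output_path: str) -> None:
    """Copy records whose declared license is on the allowlist.

    Records with no license field are dropped: unknown provenance is
    treated as a risk, not a default-allow.
    """
    kept = dropped = 0
    with open(corpus_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:  # one JSON document per line (JSONL)
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("license") in OPEN_LICENSES:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept}, dropped {dropped} with unknown or closed licenses")

if __name__ == "__main__":
    filter_openly_licensed("corpus.jsonl", "corpus_open.jsonl")
```

The key design choice is the default-deny posture: a document with missing or unrecognized license metadata is excluded rather than assumed safe, which is precisely the stance the lack of vendor transparency makes impossible with closed commercial models.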
At the other end of the spectrum are the big model makers, who promise indemnification against legal challenges arising from the use of their models. That sounds reassuring, but the scope of protection varies widely among vendors, leaving enterprises to navigate a complex landscape of legal responsibilities and liabilities.
In weighing the risks and benefits of AI models, transparency and accountability are key. Retail giant Macy's, for instance, acknowledges the legal complexities but believes the benefits of cutting-edge AI models outweigh the risks. By shifting some of the liability onto model makers, enterprises aim to strike a balance between innovation and compliance.
Ultimately, the choice between models trained on clean data and commercial models backed by indemnification comes down to understanding the trade-offs. Academic and open initiatives offer a cleaner data pipeline, but verifying the integrity of external data sources remains hard. Enterprises must evaluate their AI strategies carefully to ensure legal compliance and ethical practice in an ever-evolving technological landscape.
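One partial mitigation for that verification problem is cryptographic integrity checking: confirming that the dataset you downloaded matches, byte for byte, what the publisher released. The manifest format below is a hypothetical illustration; note that this validates only that files are unaltered, not that the publisher's licensing claims are accurate.

```python
# Illustrative sketch: verify downloaded dataset files against a
# publisher-provided manifest of SHA-256 checksums. This confirms the
# bytes are unaltered; it says nothing about the underlying licenses.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path: str, data_dir: str) -> bool:
    """Compare every file listed in the manifest against its checksum.

    Assumes the manifest is JSON mapping relative file names to
    hex-encoded SHA-256 digests, e.g. {"shard-0001.jsonl": "ab12..."}.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    ok = True
    for name, expected in manifest.items():
        if sha256_of(Path(data_dir) / name) != expected:
            print(f"MISMATCH: {name}")
            ok = False
    return ok

if __name__ == "__main__":
    if verify_manifest("manifest.json", "data"):
        print("all files match the published checksums")
```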