There are several datasets for AI and ML research that use source code from open source projects. The models trained on these datasets often don’t comply with the projects’ licenses, which often require attribution of the authors and, in some cases, require that any new project using the code be licensed under the same license as the original.


Yeah but which one? Can you name one?
I’m not too familiar with the concept, but The Pile, maybe?
According to TechCrunch, that dataset is built from licensed and public domain works.
https://techcrunch.com/2025/06/06/eleutherai-releases-massive-ai-training-dataset-of-licensed-and-open-domain-text/