Cross-posting to the OpenSource community as I think this topic will also be of interest here.
This is an analysis of how “open” different open source AI systems are. I am also posting the two figures from the paper that summarize this information below.
ABSTRACT
The past year has seen a steep rise in generative AI systems that claim to be open. But how open are they really? The question of what counts as open source in generative AI is poised to take on particular importance in light of the upcoming EU AI Act that regulates open source systems differently, creating an urgent need for practical openness assessment. Here we use an evidence-based framework that distinguishes 14 dimensions of openness, from training datasets to scientific and technical documentation and from licensing to access methods. Surveying over 45 generative AI systems (both text and text-to-image), we find that while the term open source is widely used, many models are ‘open weight’ at best and many providers seek to evade scientific, legal and regulatory scrutiny by withholding information on training and fine-tuning data. We argue that openness in generative AI is necessarily composite (consisting of multiple elements) and gradient (coming in degrees), and point out the risk of relying on single features like access or licensing to declare models open or not. Evidence-based openness assessment can help foster a generative AI landscape in which models can be effectively regulated, model providers can be held accountable, scientists can scrutinise generative AI, and end users can make informed decisions.
Figure 2: Openness of 40 text generators described as open, with OpenAI’s ChatGPT (bottom) as closed reference point. Every cell records a three-level openness judgement (✓ open, ∼ partial or ✗ closed). The table is sorted by cumulative openness, where ✓ is 1, ∼ is 0.5 and ✗ is 0 points. RL may refer to RLHF or other forms of fine-tuning aimed at fostering instruction-following behaviour. For the latest updates see: https://opening-up-chatgpt.github.io
Figure 3: Overview of 6 text-to-image systems described as open, with OpenAI’s DALL-E as a reference point. Every cell records a three-level openness judgement (✓ open, ∼ partial or ✗ closed). The table is sorted by cumulative openness, where ✓ is 1, ∼ is 0.5 and ✗ is 0 points.
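For anyone who wants to reproduce the ordering used in the figures, here is a minimal sketch of the scoring described in the captions (open = 1, partial = 0.5, closed = 0, sorted by the sum). It is only an illustration, not the authors' code: the function names, the system names and the judgements are all made up, and the rows are shortened to 4 dimensions instead of the paper's 14.

```python
# Points per judgement, as stated in the figure captions.
POINTS = {"open": 1.0, "partial": 0.5, "closed": 0.0}


def openness_score(judgements):
    """Sum the points for one system's row of judgements."""
    return sum(POINTS[j] for j in judgements)


def rank_by_openness(table):
    """Return (name, score) pairs sorted from most to least open."""
    scored = [(name, openness_score(row)) for name, row in table.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows: names and judgements are invented for illustration,
# and only 4 of the 14 dimensions are shown.
example = {
    "system-a": ["open", "open", "partial", "closed"],
    "system-b": ["closed", "partial", "closed", "closed"],
    "system-c": ["open", "open", "open", "partial"],
}

for name, score in rank_by_openness(example):
    print(name, score)
```

Note that a single sum hides which dimensions are closed, which is why the paper argues that openness is composite and should not be reduced to one feature.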
There is also a related Nature news article: Not all ‘open source’ AI models are actually open: here’s a ranking
PDF Link: https://dl.acm.org/doi/pdf/10.1145/3630106.3659005
A bunch of these columns are outright absurd TBH, to the extent that I’m not sure the author really knows what FOSS is about. What’s open API access even supposed to be? API access is closed by definition.
Also there has never been a requirement that open source software needs to be documented - and for good reason - so I’m not a fan of the documentation column either.
“and for good reason”
I’d love to hear that reasoning. Personally, I will avoid using a FOSS product if the documentation is terrible or non-existent. Obviously I have grace for new or bleeding-edge projects. But I’ve avoided using some FOSS stalwarts simply because I don’t have the time to dedicate to trial-and-error learning.
Because FOSS shouldn’t add burdens. You publish your work and let everyone else use it. That shouldn’t add extra obligations on you. Usually, you’d also write some docs - after all, without them nobody will know how to use your program, so why bother publishing - but it shouldn’t be an obligation. Make it easy for people to open up their code without attaching strings to it.
Documentation is nice, but it’s a different thing from open source: a program can be open and undocumented, or closed but well documented - and I don’t see why we’d want it to be different for models.