The internet has revolutionized the way we access and distribute information, enabling virtually anyone to post content online. There is much potential in rapidly sharing content on the web, but releasing content without information, or with ambiguous information, about if and how it can be shared and reused can also cause problems – especially for data.
Open access publishing of peer-reviewed journal articles commonly utilizes the legal tools – licenses – prepared by Creative Commons. BioMed Central, Public Library of Science, Nature Publishing Group, BMJ and many others publish open access articles where the authors retain the copyright to their work. Authors typically apply a Creative Commons attribution license (CC-BY), or a variation of it, which means anyone is free to copy, reuse, distribute and make derivatives from their article provided that the original author(s) are attributed. However, many “open access” publishers place restrictions on commercial reuse of published articles (papers) and on creation of derivative works, which can include text mining in some jurisdictions. Additionally, some commercial publishers’ terms and conditions, by contract, can prevent text mining in any jurisdiction. Commercial use restrictions have been strongly discouraged – their use described as amounting to “pseudo open access” – as authors will not reap the full benefits of paying for open access publication (for example, figures could not be uploaded to Wikipedia with commercial use restrictions) [15, 16]. BioMed Central supports unrestricted use of open access content including commercial use and as such requires authors to apply a CC-BY license by default. BioMed Central’s full text corpus of open access research articles published under CC-BY is available for free distribution, reuse and creation of derivatives with no commercial use restrictions – with data mining research strongly encouraged [17]. For data published by scholarly publishers, the Association of Learned and Professional Society Publishers and International Association of Scientific, Technical, & Medical Publishers (STM) issued a joint statement in 2006 supporting sharing of raw datasets among scholars and recommending that publishers do not require transfer of copyright in data submitted for publication [18].
Copyright and data
The policies and guidelines of many academic institutions advise researchers to establish intellectual property and copyrights at the start of any project (although whether the issue of data ownership is consistently addressed by researchers is unclear [19]). Copyright cannot generally be asserted in facts, only the ways in which they are presented. At a basic level, raw data are merely simple mathematical descriptions of facts, and to claim copyright a scientist would need to exercise individual judgment, expression or skill in their representation. For example, Einstein could not claim copyright in the formula E = mc², but could in text explaining the theory behind it [20]. One could conclude from this that copyright and associated licenses and attribution requirements cannot legally be applied to data. However, there are many levels at which data – particularly digital data derived and integrated from different sources – and collections of data and metadata can operate and be represented, and many ways in which copyright law is applied in different jurisdictions.
In the US the law focuses on creativity (“Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed”) but in Australia originality is more important – and copyright may well apply to research data “in the same way that it applies to written works like books, journal articles and reports” [21]. In the European Union “sui generis” rights exist to protect data within digital databases – effectively, copyright – which can, furthermore, be implemented differently by member states. Because of these substantial international legal differences regarding how copyright can be applied to data, there are inherent difficulties in ascertaining the extent of copyright in a dataset. A more comprehensive summary of the different approaches to copyright in data and databases can be found in [22]. All of these issues compound the uncertainty about what an individual or machine (such as a computer crawling the web) can do, legally, with information they download from the internet, including from journals.
Licenses and waivers for data
A license is a legal instrument that allows a copyright holder or content producer to enable a second party to use their content, and to apply certain conditions and restrictions to those uses. A waiver is also a legal instrument but is designed for a rights holder to give up their rights, rather than assert them. For a comprehensive guide to the different approaches to the licensing of research data see [23].
Placing restrictions on the reuse of scientific information, particularly data, slows down the pace of research. Furthermore, legal requirements for attribution ingrained in licenses such as CC-BY can prohibit future research that draws on large collections of content, as is common in data mining. Consider the Human Genome Project: a watershed moment for scientific data sharing and collaboration. Without the collective effort of many different research institutions, commercial organizations and individual scientists, the sequencing of the human genome would not have been possible. But if a researcher wishing to query the human genome database as part of a new research project were legally required to attribute all of the data contributors, probably thousands of them, by providing a link back or a citation to each, this would be unmanageable, and probably unpublishable in the context of a traditional research paper’s reference list.
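The way attribution requirements accumulate when datasets are combined can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than a legal model: the `Dataset` record, the `required_attributions` helper and the contributor names are all hypothetical, and the only rule encoded is that CC-BY material carries a legal attribution requirement while CC0 material does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record describing one contributed dataset."""
    name: str
    license: str             # "CC-BY" or "CC0" in this sketch
    contributors: frozenset  # names of the people or labs to credit


def required_attributions(datasets):
    """Collect every contributor who must legally be credited
    when the given datasets are combined into one collection."""
    credits = set()
    for ds in datasets:
        # Assumption: only CC-BY imposes a legal attribution duty here.
        if ds.license == "CC-BY":
            credits |= ds.contributors
    return credits


# 1,000 invented datasets, each with three invented contributors.
cc_by_collection = [
    Dataset(f"set-{i}", "CC-BY",
            frozenset({f"lab-{i}-a", f"lab-{i}-b", f"lab-{i}-c"}))
    for i in range(1000)
]
# The same collection with every dataset released under CC0 instead.
cc0_collection = [
    Dataset(ds.name, "CC0", ds.contributors) for ds in cc_by_collection
]

print(len(required_attributions(cc_by_collection)))  # 3000
print(len(required_attributions(cc0_collection)))    # 0
```

In this toy collection the legally required credits grow in step with the number of contributors, whereas the CC0 version imposes none. Scholarly norms of citation still apply in either case.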
International legal differences, described earlier, are another important reason to apply specific, appropriate legal tools to data. Also, it can be unclear what license to attach to copyright in a dataset or structure (for example a textual description of building the dataset could fall under CC-BY, but if source code were used rather than text it might not). This is an area of confusion where no licensing standard exists. Therefore, to eliminate legal impediments to integration and re-use of data, such as this stacking of attribution requirements in large collections of data, and to help enable long-term interoperability an appropriate license or waiver specific to data should be applied. There are a number of conformant licenses and waivers for open data [24], of which Creative Commons CC0 (http://creativecommons.org/publicdomain/zero/1.0/) is widely recognized. Under CC0, authors waive all of their rights to the work worldwide under copyright law and all related or neighboring legal rights they have in the work, to the extent allowable by law. Legal experts have recommended the use of standard, globally accepted licenses for data instead of developing ad hoc models [25].
The case for CC0 for scientific data
The Creative Commons’ website catalogues a number of different organizations – publicly and privately funded – which use CC0 for data [26]. These include:
- Genomes Unzipped, which “aims to inform the public about genetics via the independent analysis of open genetic data, volunteered by a core group of genetics researchers and specialists”
- GlaxoSmithKline (GSK), a leading pharmaceutical company, which has dedicated data on more than 13,500 compounds known to be active against malaria to the public domain [27]
- The British Library and Cologne-based Libraries, which have released large amounts of bibliographic data under CC0 [28]
- FigShare (http://figshare.com/), a freely-accessible repository for scientific content including images, video and data, which uses CC0 for datasets
Data repositories are particularly relevant users of waivers and licenses for research data. Although there are many data repositories in life sciences (for a list see http://www.datacite.org/repolist), which are growing in size and number, not all scientific domains have a common repository and journals often function as repositories when data are included as additional files (supplementary material). Dryad (http://datadryad.org/) is an international repository for the datasets supporting published, peer-reviewed journal articles across the biosciences which requires authors to explicitly place deposited data in the public domain using the CC0 waiver. An entry on the Dryad weblog sets out cogently why CC0 is the most effective solution for achieving its goals:
“By removing unenforceable legal barriers, CC0 facilitates the discovery, re-use, and citation of [that] data…
“Furthermore, Dryad’s use of CC0 to make the terms of reuse explicit has some important advantages:
- interoperability: Since CC0 is both human and machine-readable, other people and indexing services will automatically be able to determine the terms of use.
- universality: CC0 is a single mechanism that is both global and universal, covering all data and all countries. It is also widely recognized.
- simplicity: there is no need for humans to make, and respond to, individual data requests, and no need for click-through agreements. This allows more scientists to spend their time doing science.” [29]
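The interoperability point in the quotation can be made concrete with a short Python sketch of how an indexing service might read reuse terms from a license URL found in a dataset's metadata. The function name `reuse_terms` and the returned field names are hypothetical, and the mapping is a simplification that assumes the standard Creative Commons URL layout, in which license elements such as `by`, `nc` and `nd` appear in the path.

```python
from urllib.parse import urlparse


def reuse_terms(license_url):
    """Return a simplified summary of reuse terms for a Creative Commons
    URL, or None if the URL is not recognized (hypothetical helper)."""
    parsed = urlparse(license_url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "creativecommons.org":
        return None

    parts = [p for p in parsed.path.split("/") if p]
    # CC0 waiver, e.g. /publicdomain/zero/1.0/
    if parts[:2] == ["publicdomain", "zero"]:
        return {"attribution_required": False,
                "commercial_use": True,
                "derivatives": True}
    # CC licenses, e.g. /licenses/by-nc/3.0/
    if len(parts) >= 2 and parts[0] == "licenses":
        elements = parts[1].split("-")
        return {"attribution_required": "by" in elements,
                "commercial_use": "nc" not in elements,
                "derivatives": "nd" not in elements}
    return None


print(reuse_terms("http://creativecommons.org/publicdomain/zero/1.0/"))
print(reuse_terms("https://creativecommons.org/licenses/by-nc/3.0/"))
```

Because the terms are recoverable from the URL alone, a crawler can decide what it may do with a dataset without a person reading the license text, which is the kind of automatic determination the quoted passage describes.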
Dryad’s policy ultimately follows the Science Commons’ recommendations, set out in their Protocol for Implementing Open Access Data [30].
The online laboratory notebook software LabArchives (http://www.labarchives.com/), which includes the ability to share data privately and to publish datasets publicly and permanently online, also uses CC0 for public datasets [31].