As a maintainer of several software packages, I often find myself copying text snippets from the README file into different places (savannah, github, freecode, emails, etc). Recently I needed to generate a list of software packages that included every project’s name, brief summary, license and URL. I could have generated that list manually by copying the text from every project’s README and COPYING files. However, I would then have to maintain the list manually to keep it in sync with all the projects. This easily leads to stale information, so I thought a better approach would be to put the information I needed into each project’s source code version control system. The advantage is that the manual work of extracting the information can be automated by a script, since the data is in a usable format. I’ll explain here how my solution works.
To be able to find the information using only the URL to the repository, I needed a filename convention. The filename I chose was BLURB; for the etymology, see Wikipedia’s page about Blurb. The data format in the file is similar to normal email headers.
An example illustrates the principles well. Below is the BLURB file for one of my projects.
Author: Yubico
Basename: libyubikey
Homepage: http://opensource.yubico.com/yubico-c/
License: BSD-2-Clause
Name: Yubico C low-level Library
Project: yubico-c
Summary: C library for manipulating Yubico YubiKey One-Time Passwords (OTPs)
The format is simple: UTF-8 text with each line starting with a header followed by a colon (“:”), some whitespace, and some content. If a line starts with whitespace, it is a continuation of the previous line’s content (trim leading whitespace). The following table describes the fields that I use. I may update this blog post in the future with new fields or improved explanations (for reference, current date is 2013-09-24).
Header | Meaning |
---|---|
Project | Short identifier for the project (e.g., ‘gcc’, ‘emacs’) |
Name | Official name of the project in English |
Summary | Brief one-line summary of the project’s purpose in English |
Author | Origin of the project |
Homepage | URL to the project’s website |
License | License keyword, preferably using one of the SPDX license identifiers |
Basename | The tarball basename, if different from the project name |
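The format rules described above (a header, a colon, some whitespace, content, and continuation lines that start with whitespace) can be sketched as a small parser. This is only an illustration, not a tool from the original post: the names `parse_blurb` and `format_entry` are hypothetical, joining continuation lines with a single space and skipping blank lines are assumptions, and the Summary in the sample has been folded onto two lines purely to exercise the continuation rule.

```python
def parse_blurb(text):
    """Parse BLURB-style text into a dict mapping header names to content."""
    fields = {}
    current = None  # header that a continuation line would extend
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if line[0] in " \t":
            # Continuation of the previous line's content.
            if current is None:
                raise ValueError("continuation line before any header")
            # Assumption: continuation text is joined with a single space.
            fields[current] += " " + line.strip()
            continue
        # Split on the first colon only, so URLs in the content stay intact.
        header, sep, content = line.partition(":")
        if not sep:
            raise ValueError(f"missing colon in line: {line!r}")
        current = header.strip()
        fields[current] = content.strip()
    return fields


def format_entry(fields):
    """Render one line of a package list: name, license, summary and URL."""
    return "{Name} ({License}): {Summary} <{Homepage}>".format(**fields)


# The example BLURB file, with Summary folded to show a continuation line.
EXAMPLE = """\
Author: Yubico
Basename: libyubikey
Homepage: http://opensource.yubico.com/yubico-c/
License: BSD-2-Clause
Name: Yubico C low-level Library
Project: yubico-c
Summary: C library for manipulating Yubico YubiKey
 One-Time Passwords (OTPs)
"""

if __name__ == "__main__":
    print(format_entry(parse_blurb(EXAMPLE)))
```

Running the parser over the BLURB file of every project and printing one formatted entry per project would produce the kind of package list that motivated the file in the first place.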
Finally, some reflections on the solution. After a quick design, I thought that I couldn’t be the first one with this problem, and I tried to find other similar efforts. I haven’t been able to find any standardization effort that has the following properties:
- Stores the information inside each upstream project’s own source code repository
- Provides a filename convention so that it is possible to find the data with only the source code repository link
- Encodes data in a format that is easy to extract using simple command line tools
- Does not encode information about releases (i.e., what happened in a particular version)
One related effort that I found was SPDX, which at first glance seemed to offer what I wanted. On closer examination, however, it did not meet all of the requirements above and appeared to have somewhat different goals. I did find the SPDX license list useful, though, and refer to it. Another effort is Eric S. Raymond’s freecode-submit and shipper, but their primary focus is to encode information about each release. The design of the BLURB file is clearly influenced by these tools. Another influence has been Debian’s specification for machine-readable copyright information. The Free Software Foundation’s list of software projects seemed like another candidate, but it doesn’t suggest any way to store the information in the upstream project itself.
There’s DOAP, although it might fail some of your requirements.
Thanks for the pointer!
It seems DOAP is defined here: http://www.ibm.com/developerworks/xml/library/x-osproj3/
I think it is very close to what I needed, with some minor exceptions.
* XML is not a human writeable format
* There appears to be no filename convention? I could have missed this
One approach is to simply define another data encoding format for the underlying generic DOAP idea. It would use the same headers and semantics, but the information would be encoded in email headers. There could be a tool to convert back and forth. And one needs to agree on a filename convention, which is easy.
Hm.
The spec is at https://github.com/edumbill/doap/wiki (I included a link initially surrounded by angle brackets but it got eaten).
Yes, XML is (IMO) unappealing as a writing language. RDF can also be represented with Turtle (http://www.w3.org/TR/turtle/), for example, but I’m not sure if that’s really an improvement over a simple rfc822-style format. There are possibly other representations.
When it comes to the filename convention, I think each project might have chosen something slightly similar, as in projectname.doap (but better check the links in the “Web sites using DOAP” on the spec site for some examples).
https://wiki.debian.org/UpstreamMetadata
Good pointer. DEP12 is inspired by DOAP but uses a human-friendly format. What’s missing is only the filename convention? I should take a deeper look there.
The only similar thing I’m aware of is this site:
http://contributing.appspot.com/memcached
Which is online-only, and doesn’t live inside the package(s) themselves, but I’ve frequently used it and found it useful.
Yup — they could consume and generate those files. Thanks for the pointer, understanding what kind of data values people are interested in is important.
Sounds like the LSM file. I think that was Linux software map and from metalab but I could be wrong.
Nice! This might be great to integrate into alioth’s and the PTS’ RDF generators.
See http://packages.qa.debian.org/common/RDF.html and https://joinup.ec.europa.eu/asset/adms_foss/news/admssw-plugin-fusionforge-deployed-alioth