BLURB: Software repository metadata convention

Posted on 2013-09-24 by simon

As a maintainer of several software packages I often find myself copying text snippets from the README file into different places (Savannah, GitHub, Freecode, emails, etc). Recently I had a need to generate a list of software packages that included every project’s name, brief summary, license and URL. I could have generated that list manually by copying the text from every project’s README and COPYING files. However, I would then have to maintain my list manually to keep it in sync with all the projects. This easily leads to stale information, so I thought a better approach would be to put the information I needed into each project’s source code version control system. The advantage is that the manual work of extracting the information can be automated by a script, since the data is in a usable format. I’ll explain here how my solution works.

To be able to find the information using only the URL to the repository, I needed a filename convention. The filename I chose was BLURB; for the etymology, see Wikipedia’s page about Blurb. The data format in the file is similar to normal email headers.

An example illustrates the principles well. Below is the BLURB file for one of my projects.

Author: Yubico
Basename: libyubikey
Homepage: http://opensource.yubico.com/yubico-c/
License: BSD-2-Clause
Name: Yubico C low-level Library
Project: yubico-c
Summary: C library for manipulating Yubico YubiKey One-Time Passwords (OTPs)

The format is simple: UTF-8 text in which each line starts with a header name followed by a colon (“:”), some whitespace, and some content. If a line starts with whitespace, it is a continuation of the previous line’s content, with the leading whitespace trimmed. The following table describes the fields that I use. I may update this blog post in the future with new fields or improved explanations (for reference, current date is 2013-09-24).

Header	Meaning
Project	Short identifier for the project (e.g., ‘gcc’, ’emacs’)
Name	Official name of the project in English
Summary	Brief one-line summary of the project’s purpose in English
Author	Origin of the project
Homepage	URL to the project’s website
License	License keyword, preferably using one of the SPDX license identifiers
Basename	The tarball basename, if different from the project name
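
As a rough illustration of the format rules above, here is a minimal Python sketch of a parser. The function name parse_blurb is hypothetical, and a few details are assumptions that the description does not settle: continuation lines are joined to the previous content with a single space, blank lines are ignored, and a repeated header overwrites the earlier value. The Summary line of the example is folded here only to exercise the continuation rule.

```python
def parse_blurb(text):
    """Parse BLURB-style text into a dict mapping header names to content."""
    fields = {}
    current = None  # name of the most recently seen header
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        if line[0] in " \t":
            # Continuation line: trim leading whitespace and append it
            # to the previous header's content (joined with one space,
            # which is an assumption).
            if current is None:
                raise ValueError("continuation line before any header")
            fields[current] += " " + line.strip()
            continue
        # Split on the first colon only, so URLs in the content survive.
        name, sep, content = line.partition(":")
        if not sep:
            raise ValueError(f"missing colon in line: {line!r}")
        current = name.strip()
        fields[current] = content.strip()  # assumption: last value wins
    return fields


EXAMPLE = """\
Author: Yubico
Basename: libyubikey
Homepage: http://opensource.yubico.com/yubico-c/
License: BSD-2-Clause
Name: Yubico C low-level Library
Project: yubico-c
Summary: C library for manipulating Yubico YubiKey
 One-Time Passwords (OTPs)
"""

blurb = parse_blurb(EXAMPLE)
print(blurb["Name"], "-", blurb["License"])
print(blurb["Summary"])
```

Splitting on the first colon only is what keeps the Homepage URL intact, since the URL itself contains a colon.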

Finally, some reflections on the solution. After a quick design, I thought that I couldn’t be the first one with this problem, and I tried to find other similar efforts. I haven’t been able to find any standardization effort that has all of the following properties:

Stores the information inside each upstream project’s own source code repository
Provides a filename convention so that it is possible to find the data with only the source code repository link
Encodes data in a format that is easy to extract using simple command line tools
Does not encode information about releases (i.e., what happened in a particular version)
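
To make the filename convention and the ease of extraction concrete, the following Python sketch builds a one-line-per-project listing (name, summary, license, URL) from a directory of local checkouts. It rests on several assumptions: the checkouts already exist locally under one directory, each BLURB file sits at the top level of its checkout, and the standard library’s email header parser is close enough to the BLURB format. The names read_blurb and list_projects, the output layout, and the temporary demo directory are hypothetical.

```python
import tempfile
from email.parser import HeaderParser
from pathlib import Path

FIELDS = ("Name", "Summary", "License", "Homepage")


def read_blurb(repo_dir):
    """Return the BLURB headers of one checkout as a dict, or None if absent."""
    path = Path(repo_dir) / "BLURB"
    if not path.is_file():
        return None
    headers = HeaderParser().parsestr(path.read_text(encoding="utf-8"))
    # Collapse whitespace so folded continuation lines become one line.
    return {name: " ".join(value.split()) for name, value in headers.items()}


def list_projects(checkouts_dir):
    """Build one 'Name: Summary (License) <Homepage>' line per checkout."""
    lines = []
    for repo in sorted(Path(checkouts_dir).iterdir()):
        blurb = read_blurb(repo) if repo.is_dir() else None
        if blurb is None:
            continue  # checkout without a BLURB file: skip it
        name, summary, license_id, url = (blurb.get(f, "?") for f in FIELDS)
        lines.append(f"{name}: {summary} ({license_id}) <{url}>")
    return lines


# Demo with a throwaway directory standing in for real checkouts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "no-blurb").mkdir()
    (root / "yubico-c").mkdir()
    (root / "yubico-c" / "BLURB").write_text(
        "Author: Yubico\n"
        "Basename: libyubikey\n"
        "Homepage: http://opensource.yubico.com/yubico-c/\n"
        "License: BSD-2-Clause\n"
        "Name: Yubico C low-level Library\n"
        "Project: yubico-c\n"
        "Summary: C library for manipulating Yubico YubiKey One-Time Passwords (OTPs)\n",
        encoding="utf-8",
    )
    listing = list_projects(root)

print("\n".join(listing))
```

Because the file is always named BLURB, the script needs no per-project configuration; a checkout that lacks the file is simply skipped.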

One related effort that I found was SPDX, which at first look seemed to offer what I wanted. However, on closer examination it did not meet all the requirements above, and appeared to have somewhat different goals. Still, I found the SPDX license list useful and refer to it. Other efforts are Eric S. Raymond’s freecode-submit and shipper, but their primary focus is to encode information about each release. The design of the BLURB file is clearly influenced by these tools. Another influence has been Debian’s specification for machine-readable copyright information. The Free Software Foundation’s list of software projects seemed like another candidate, but it doesn’t suggest any way to store the information in the upstream project itself.

9 Replies to “BLURB: Software repository metadata convention”

Guillem Jover on 2013-09-25 at 03:48 said:

There’s DOAP, although it might fail some of your requirements.
- simon on 2013-09-25 at 13:22 said:
  
  Thanks for the pointer!
  
  It seems DOAP is defined here: http://www.ibm.com/developerworks/xml/library/x-osproj3/
  
  I think it is very close to what I needed, with some minor exceptions.
  
  * XML is not a human writeable format
  * There appears to be no filename convention? I could have missed this
  
  One approach is to simply define another data encoding format for the underlying generic DOAP idea. So it would use the same headers and semantics, but the information would be encoded in email headers. There could be a tool to convert back and forth. And one needs to agree on a filename convention, which is easy.
  
  Hm.
  - Guillem Jover on 2013-09-25 at 21:51 said:
    
    The spec is at https://github.com/edumbill/doap/wiki (I included a link initially surrounded by angle brackets but it got eaten).
    
    Yes, XML is (IMO) unappealing as a writing language, although RDF can also be represented with Turtle (http://www.w3.org/TR/turtle/) for example, but I’m not sure if that’s really an improvement over a simple rfc822-style format, there’s possibly other representations.
    
    When it comes to the filename convention, I think each project might have chosen something slightly similar, as in projectname.doap (but better check the links in the “Web sites using DOAP” on the spec site for some examples).
foo on 2013-09-25 at 09:20 said:

https://wiki.debian.org/UpstreamMetadata
- simon on 2013-09-25 at 14:28 said:
  
  Good pointer. DEP12 is inspired by DOAP but uses a human-friendly format. What’s missing is only the filename convention? I should take a deeper look there.
Steve Kemp on 2013-09-25 at 09:31 said:

The only similar thing I’m aware of is this site:

http://contributing.appspot.com/memcached

Which is online-only, and doesn’t live inside the package(s) themselves, but I’ve frequently used it and found it useful.
- simon on 2013-09-25 at 14:28 said:
  
  Yup — they could consume and generate those files. Thanks for the pointer, understanding what kind of data values people are interested in is important.
MJ Ray on 2013-09-25 at 09:47 said:

Sounds like the LSM file. I think that was Linux software map and from metalab but I could be wrong.
David Schmitt on 2013-09-25 at 13:56 said:

Nice! This might be great to integrate into alioth’s and the PTS’ RDF generators.

See http://packages.qa.debian.org/common/RDF.html and https://joinup.ec.europa.eu/asset/adms_foss/news/admssw-plugin-fusionforge-deployed-alioth