As a maintainer of several software packages, I often find myself copying text snippets from the README file into different places (Savannah, GitHub, Freecode, emails, etc.). Recently I needed to generate a list of software packages that included every project’s name, brief summary, license and URL. I could have generated that list manually by copying the text from every project’s README and COPYING files. However, I would then have to maintain my list manually to keep it in sync with all the projects. This easily leads to stale information, so I thought a better approach would be to put the information I needed into each project’s source code version control system. The advantage is that, since the data is in a usable format, the manual work of extracting the information can be automated by a script. I’ll explain here how my solution works.
To be able to find the information using only the URL to the repository, I needed a filename convention. The filename I chose was BLURB; for the etymology, see Wikipedia’s page about Blurb. The data format in the file is similar to normal email headers.
An example illustrates the principles well. Below is the BLURB file for one of my projects.
    Author: Yubico
    Basename: libyubikey
    Homepage: http://opensource.yubico.com/yubico-c/
    License: BSD-2-Clause
    Name: Yubico C low-level Library
    Project: yubico-c
    Summary: C library for manipulating Yubico YubiKey One-Time Passwords (OTPs)
The format is simple: UTF-8 text with each line starting with a header followed by a colon (“:”), some whitespace, and some content. If a line starts with whitespace, it is a continuation of the previous line’s content (trim leading whitespace). The following table describes the fields that I use. I may update this blog post in the future with new fields or improved explanations (for reference, current date is 2013-09-24).
Header | Meaning |
---|---|
Project | Short identifier for the project (e.g., ‘gcc’, ‘emacs’) |
Name | Official name of the project in English |
Summary | Brief one-line summary of the project’s purpose in English |
Author | Origin of the project |
Homepage | URL to the project’s website |
License | License keyword, preferably using one of the SPDX license identifiers |
Basename | The tarball basename, if different from the project name |
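
As a rough illustration of the format rules above, here is a minimal Python sketch of a parser. It is not part of any existing tool: the function name `parse_blurb`, joining continuation lines with a single space, skipping blank lines, and letting a repeated header overwrite the earlier value are all assumptions made for this example. The Summary field in the sample data is split over two lines only to exercise the continuation rule.

```python
def parse_blurb(text):
    """Parse BLURB-style text into a dict mapping header names to content."""
    fields = {}
    last = None  # most recently seen header, used for continuation lines
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        if line[0] in " \t":
            # Continuation line: trim leading whitespace and append it to the
            # previous header's content (joined with one space, an assumption).
            if last is None:
                raise ValueError("continuation line before any header")
            fields[last] += " " + line.strip()
            continue
        # Split on the first colon only, so URLs in the content stay intact.
        header, sep, content = line.partition(":")
        if not sep:
            raise ValueError(f"missing colon in line: {line!r}")
        last = header.strip()
        fields[last] = content.strip()  # assumption: a repeated header overwrites
    return fields


# Sample data based on the BLURB file shown earlier; the Summary is wrapped
# onto a second line here purely to demonstrate continuation handling.
EXAMPLE = """\
Author: Yubico
Basename: libyubikey
Homepage: http://opensource.yubico.com/yubico-c/
License: BSD-2-Clause
Name: Yubico C low-level Library
Project: yubico-c
Summary: C library for manipulating Yubico YubiKey
 One-Time Passwords (OTPs)
"""

if __name__ == "__main__":
    info = parse_blurb(EXAMPLE)
    print(f"{info['Name']} ({info['License']}): {info['Summary']}")
    print(info["Homepage"])
```

To use this on a real file, the text could be read with `open(path, encoding="utf-8").read()`, since the format is defined as UTF-8 text.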
Finally, some reflections on the solution. After a quick design, I thought that I couldn’t be the first one with this problem, and I tried to find other similar efforts. I haven’t been able to find any standardization effort that has the following properties:
- Stores the information inside each upstream project’s own source code repository
- Provides a filename convention so that it is possible to find the data with only the source code repository link
- Encodes data in a format that is easy to extract using simple command line tools
- Does not encode information about releases (i.e., what happened in a particular version)
The first related effort I found was SPDX, which at first look seemed to offer what I wanted. On closer examination, however, it did not meet all of the requirements above and appeared to have somewhat different goals. I did find the SPDX license list useful, and I refer to it. Another effort is Eric S. Raymond’s freecode-submit and shipper, but their primary focus is to encode information about each release. The design of the BLURB file is clearly influenced by these tools. Another influence has been Debian’s specification for machine-readable copyright information. The Free Software Foundation’s list of software projects seemed like another candidate, but it doesn’t suggest any way to store the information in the upstream project itself.