On language bindings & Relaunching Guile-GnuTLS

The Guile bindings for GnuTLS has been part of GnuTLS since spring 2007 when Ludovic Courtès contributed it after some initial discussion. I have been looking into getting back to do GnuTLS coding, and during a recent GnuTLS meeting one topic was Guile bindings. It seemed like a fairly self-contained project to pick up on. It is interesting to re-read the old thread when this work was included: some of the concerns brought up there now have track record to be evaluated on. My opinion that the cost of introducing a new project per language binding today is smaller than the cost of maintaining language bindings as part of the core project. I believe the cost/benefit ratio has changed during the past 15 years: introducing a new project used to come with a significant cost but this is no longer the case, as tooling and processes for packaging have improved. I have had similar experience with Java, C# and Emacs Lisp bindings for GNU Libidn as well, where maintaining them centralized slow down the pace of updates. Andreas Metzler pointed to a similar conclusion reached by Russ Allbery.

There are many ways to separate a project into two projects; just copying the files into a new git repository would have been the simplest and was my original plan. However Ludo’ mentioned git-filter-branch in an email, and the idea of keeping all git history for some of the relevant files seemed worth pursuing to me. I quickly found git-filter-repo which appears to be the recommend approach, and experimenting with it I found a way to filter out the GnuTLS repo into a small git repository that Guile-GnuTLS could be based on. The commands I used were the following, if you want to reproduce things.

$ git clone https://gitlab.com/gnutls/gnutls.git guile-gnutls
$ cd guile-gnutls/
$ git checkout f5dcbdb46df52458e3756193c2a23bf558a3ecfd
$ git-filter-repo --path guile/ --path m4/guile.m4 --path doc/gnutls-guile.texi --path doc/extract-guile-c-doc.scm --path doc/cha-copying.texi --path doc/fdl-1.3.texi

I debated with myself back and forth whether to include some files that would be named the same in the new repository but would share little to no similar lines, for example configure.ac, Makefile.am not to mention README and NEWS. Initially I thought it would be nice to preserve the history for all lines that went into the new project, but this is a subjective judgement call. What brought me over to a more minimal approach was that the contributor history and attribution would be quite strange for the new repository: Should Guile-GnuTLS attribute the work of the thousands of commits to configure.ac which had nothing to do with Guile? Should the people who wrote that be mentioned as contributor of Guile-GnuTLS? I think not.

The next step was to get a reasonable GitLab CI/CD pipeline up, to make sure the project builds on some free GNU/Linux distributions like Trisquel and PureOS as well as the usual non-free distributions like Debian and Fedora to have coverage of dpkg and rpm based distributions. I included builds on Alpine and ArchLinux as well, because they tend to trigger other portability issues. I wish there were GNU Guix docker images available for easy testing on that platform as well. The GitLab CI/CD rules for a project like this are fairly simple.

To get things out of the door, I tagged the result as v3.7.9 and published a GitLab release page for Guile-GnuTLS that includes OpenPGP-signed source tarballs manually uploaded built on my laptop. The URLs for these tarballs are not very pleasant to work with, and discovering new releases automatically appears unreliable, but I don’t know of a better approach.

To finish this project, I have proposed a GnuTLS merge request to remove all Guile-related parts from the GnuTLS core.

Doing some GnuTLS-related work again felt nice, it was quite some time ago so thank you for giving me this opportunity. Thoughts or comments? Happy hacking!

What’s wrong with SCRAM?

Simple Authentication and Security Layer (SASL, RFC4422) is the framework that was abstracted from the IMAP and POP protocols. Among the most popular mechanisms are PLAIN (clear-text passwords, usually under TLS), CRAM-MD5 (RFC2195), and GSSAPI (for Kerberos V5). The DIGEST-MD5 mechanism was an attempt to improve upon the CRAM-MD5 mechanism, but ended up introducing a lot of complexity and insufficient desirable features and deployment was a mess — read RFC6331 for background on why it has been deprecated.

SCRAM!

The effort to develop SCRAM (RFC5802) came, as far as I can tell, from the experiences with DIGEST-MD5 and the desire to offer something better than CRAM-MD5. In protocol design discussions, SCRAM is often still considered as “new” even though the specification was published in 2011 and even that had been in the making for several years. Developers that implement IMAP and SMTP still usually start out with supporting PLAIN and CRAM-MD5. The focus of this blog post is to delve into why this is and inspire the next step in this area. My opinion around this topic has existed for a couple of years already, formed while implementing SCRAM in GNU SASL, and my main triggers to write something about them now are 1) Martin Lambers‘ two-post blog series that first were negative about SCRAM and then became positive, and 2) my desire to work on or support new efforts in this area.

Let’s take a step back and spend some time analyzing PLAIN and CRAM-MD5. What are the perceived advantages and disadvantages?

Advantages: PLAIN and CRAM-MD5 solves the use-case of password-based user authentication, and are easy to implement.

Main disadvantages with PLAIN and CRAM-MD5:

  • PLAIN transfers passwords in clear text to the server (sometimes this is considered an advantage, but from a security point of view, it isn’t).
  • CRAM-MD5 requires that the server stores the password in plaintext (impossible to use a hashed or encrypted format).
  • Non-ASCII support was not there from the start.

A number of (debatable) inconveniences with PLAIN and CRAM-MD5 exists:

  • CRAM-MD5 does not support the notion of authorization identities.
  • The authentication is not bound to a particular secure channel, opening up for tunneling attacks.
  • CRAM-MD5 is based on HMAC-MD5 that is cryptographically “old” (but has withhold well) – the main problem today is that usually MD5 is not something you want to implement since there is diminishing other uses for it.
  • Servers can impersonate the client against other servers since they know the password.
  • Neither offer to authenticate the server to the client.

If you are familiar with SCRAM, you know that it solves these issues. So why hasn’t everyone jumped on it and CRAM-MD5 is now a thing of the past? In the first few years, my answer was that things take time and we’ll see improvements. Today we are ten years later; there are many SCRAM implementations out there, and the Internet has generally migrated away from protocols that have much larger legacy issues (e.g., SSL), but we are still doing CRAM-MD5. I think it is time to become critical of the effort and try to learn from the past. Here is my attempt at summarizing the concerns I’ve seen come up:

  • The mechanism family concept add complexity, in several ways:
    • The specification is harder to understand.
    • New instances of the mechanism family (SCRAM-SHA-256) introduce even more complexity since they tweak some of the poor choices made in the base specification.
    • Introducing new hashes to the family (like the suggested SHA3 variants) adds deployment costs since databases needs new type:value pairs to hold more than one “SCRAM” hashed password.
    • How to negotiate which variant to use is not well-defined. Consider if the server only has access to a SCRAM-SHA-1 hashed password for user X and a SCRAM-SHA-256 hashed password for user Y. What mechanisms should it offer to an unknown client? Offering both is likely to cause authentication failures, and the fall-back behaviour of SASL is poor.
  • The optional support for channel bindings and the way they are negotiated adds complexity.
  • The original default ‘tls-unique’ channel binding turned out to be insecure, and it cannot be supported in TLS 1.3.
  • Support for channel bindings requires interaction between TLS and SASL layers in an application.
  • The feature that servers cannot impersonate a client is dubious: the server only needs to participate in one authentication exchange with the client to gain this ability.
  • SCRAM does not offer any of the cryptographic properties of a Password-authenticated key agreement.

What other concerns are there? I’m likely forgetting some. Some of these are debatable and were intentional design choices.

Can we save SCRAM? I’m happy to see the effort to introduce a new channel binding and update the SCRAM specifications to use it for TLS 1.3+. I brought up a similar approach back in the days when some people were still insisting on ‘tls-unique’. A new channel binding solves some of the issues above.

It is hard to tell what the main reason for not implementing SCRAM more often is. A sense of urgency appears to be lacking. My gut feeling is that to an implementer SCRAM looks awfully similar to DIGEST-MD5. Most of the problems with DIGEST-MD5 could be fixed, but the fixes add more complexity.

How to proceed from here? I see a couple of options:

  • Let time go by to see increased adoption. Improving the channel binding situation will help.
  • Learn from the mistakes and introduce a new simple SCRAM, which could have the following properties:
    • No mechanism family, just one mechanism instance.
    • Hash is hard-coded, just like CRAM-MD5.
    • TLS and a channel binding is required and always used.
  • Review one of the PAKE alternatives and specify a SASL mechanism for it. Preferably without repeating the mistakes of CRAM-MD5, DIGEST-MD5 and SCRAM.
  • Give up on having “complex” authentication mechanisms inside SASL, and help some PAKE variant become implemented through a TLS library, and SASL applications should just use EXTERNAL to use TLS user authentication.

Thoughts?

I feel the following XKCD is appropriate here.

Real-world Performance Tuning with Callgrind

This post describe the process of identifying and profiling an inefficient part of GnuTLS. The tool I’m using is callgrind. I won’t describe the tool in detail since I’m not a callgrind expert, instead the focus is on the methodology in finding and fixing a problem. My hope is that this is useful as an insight to how maintainers go about fixing a performance-related problem. It also demonstrates how immensely useful tools like valgrind and callgrind are.

Today I stumbled over something as rare as a post that contains example code to reproduce a problem. The post was written by Edgar Fuß on the openldap list: link to edgars post. First, there is something to be learned here: if there hadn’t been code in that post, I would most likely have ignored it because it is too much trouble for me to understand and duplicate the problem. Especially when this isn’t a real GnuTLS bug report, but just discussion on a non-GnuTLS mailing list. By posting such example code, it was easy for me to compile and run it. As it turned out, the code was very slow and my curiosity peeked. Edgar’s bug triggering code was (somewhat modified for readability):

d = opendir("/etc/ssl/certs");
gnutls_certificate_allocate_credentials(&ca_list);
while ((dent = readdir(d)) != NULL) {
  sprintf(ca_file, "/etc/ssl/certs/%s", dent->d_name);
  stat(ca_file, &s);
  if (!S_ISREG(s.st_mode)) continue;
  gnutls_certificate_set_x509_trust_file(ca_list, ca_file,
       GNUTLS_X509_FMT_PEM);
}
closedir(d);

If you aren’t familiar with GnuTLS, I’ll describe what the code is intended to do. The code will iterate over all files in /etc/ssl/certs/ and calls gnutls_certificate_set_x509_trust_file for each file. That function will read one or more X.509 CA certificates stored in the file, and add it to the CA trust store inside GnuTLS. The files are typically small (1-2kb) but contain base64 encoded ASN.1 data which is decoded by GnuTLS. The CA trust store is used to determine whether to trust a client or server’s certificate or not. While this is typically only done at startup, and not during each TLS connection, it is not particulary performance critical. Still, if it is excessively slow it will slow down application startup.

On my system (x86 Debian testing) the /etc/ssl/certs directory contains 206 files. I compiled the test code and ran it:

jas@mocca:~$ gcc -o foo foo.c -lgnutls
jas@mocca:~$ time ./foo /etc/ssl/certs/

real 0m40.821s
user 0m40.743s
sys 0m0.036s
jas@mocca:~$

40 seconds! Delaying startup of an application by that amount is pretty significant, so I understand this was considered a problem.

I became curious how much memory the process was using. If my machine had started paging (unlikely with 2GB but you never know), I could understand if it was this slow. However, the top output was relatively stable:

PID USER PR NI VIRT RES SHR …
6538 jas 20 0 5548 3668 752 …

In other words, the virtual size was about 6MB, which seemed normal. My next step was to run the binary under valgrind, to possibly detect any memory corruption or other problem that might explain the slowness. Valgrind slows down execution significantly, and after around 7 minutes I gave up and modified the code slightly so that it only iterated over the first couple of files. After running valgrind once, I discovered that the code didn’t call the needed gnutls_certificate_free_credentials() and gnutls_global_deinit(). You can download the updated example code gnutls-callgrind.c. The valgrind suppressions file to shut up some known libgcrypt internal memory leaks is available as libgcrypt.supp.

jas@mocca:~$ gcc -o foo foo.c -lgnutls
jas@mocca:~$ time ./foo

real 0m0.222s
user 0m0.216s
sys 0m0.004s
jas@mocca:~$ time valgrind –suppressions=foo.supp –quiet ./foo

real 0m7.619s
user 0m7.488s
sys 0m0.108s
jas@mocca:~$

Nice! No memory leaks. We can now be relatively certain that the problem is really with the intention of the code, rather than some memory related bug. Given that we use C here, you want to rule out such problems early on because they are a nightmare to debug.

Now for the performance tuning session. Let’s run callgrind on the application. First, we must make sure we use a GnuTLS compiled with debugging information. The easiest way to do this is to compile it against the static libgnutls. Proceed as follows.

jas@mocca:~$ gcc -o foo foo.c /usr/lib/libgnutls.a -lgcrypt -ltasn1 -lz
jas@mocca:~$ time valgrind –tool=callgrind –quiet ./foo

real 0m18.292s
user 0m18.129s
sys 0m0.084s
jas@mocca:~$ ls -la callgrind.out.8164
-rw——- 1 jas jas 69654 2008-02-26 21:46 callgrind.out.8164
jas@mocca:~$ kcachegrind &

As you can see, execution time is even slower than with valgrind. The profile data output file is quite small. Running kcachegrind yields this output (click to enlarge):

The amount of information in kcachegrind can be overwhelming at first. However, the interesting values are in the percentages and call counts columns to the left. It tells us that gnutls_certificate_set_x509_trust_file() was invoked 22 times, and that 98% of the time is spent there. (If you are surprised by the 22, look at the code, the variable i starts at 0, and the loop is run until it is larger than 20, i.e., it is 21, which makes for 22 iterations in the loop.) The code for that function looks like:

{
  int ret, ret2;
  size_t size;
  char *data = read_binary_file (cafile, &size);

  if (data == NULL)
    {
      gnutls_assert ();
      return GNUTLS_E_FILE_ERROR;
    }

  if (type == GNUTLS_X509_FMT_DER)
    ret = parse_der_ca_mem (&res->x509_ca_list,
                            &res->x509_ncas,
			    data, size);
  else
    ret = parse_pem_ca_mem (&res->x509_ca_list,
                            &res->x509_ncas,
			    data, size);

  free (data);

  if (ret < 0)
    {
      gnutls_assert ();
      return ret;
    }

  if ((ret2 = generate_rdn_seq (res)) < 0)
    return ret2;

  return ret;
}

The callgrind output tells us that 96% of the programs time is spent inside the 22 calls to generate_rdn_seq(), which in turn call gnutls_x509_crt_get_raw_dn a total of 506 times. The calls to gnutls_x509_crt_get_raw_dn make up 95% of the program's execution time.

We can now conclude that the problem is inside the generate_rdn_seq() function, and not in the rest of the gnutls_certificate_set_x509_trust_file() function body. Given that 95% of the time is actually in a function that generate_rdn_seq calls (i.e., gnutls_x509_crt_get_raw_dn), the problem is either that the function is called too many times or that it is too slow.

Some words about what generate_rdn_seq() is intended to do: it pre-computes a list of names of the CA certificates. In TLS servers, this list of names of trusted CAs is sent to clients. The list is used by the client to find a client certificate issued by a CA that the server recognize. To avoid computing the list every time it is needed (i.e., for every connection), GnuTLS pre-computes this and store it in the credential structure. One credential structure can be associated with one or more TLS sessions.

We are now prepared to look at the code for generate_rdn_seq:

static int
generate_rdn_seq (gnutls_certificate_credentials_t res)
{
  gnutls_datum_t tmp;
  int ret;
  unsigned size, i;
  opaque *pdata;

  /* Generate the RDN sequence 
   * This will be sent to clients when a certificate
   * request message is sent.
   */

  /* FIXME: in case of a client it is not needed
   * to do that. This would save time and memory.
   * However we don't have that information available
   * here.
   */

  size = 0;
  for (i = 0; i < res->x509_ncas; i++)
    {
      if ((ret = gnutls_x509_crt_get_raw_dn
                      (res->x509_ca_list[i], &tmp)) < 0)
	{
	  gnutls_assert ();
	  return ret;
	}
      size += (2 + tmp.size);
      _gnutls_free_datum (&tmp);
    }

  if (res->x509_rdn_sequence.data != NULL)
    gnutls_free (res->x509_rdn_sequence.data);

  res->x509_rdn_sequence.data =
          gnutls_malloc (size);
  if (res->x509_rdn_sequence.data == NULL)
    {
      gnutls_assert ();
      return GNUTLS_E_MEMORY_ERROR;
    }
  res->x509_rdn_sequence.size = size;

  pdata = res->x509_rdn_sequence.data;

  for (i = 0; i < res->x509_ncas; i++)
    {
      if ((ret = gnutls_x509_crt_get_raw_dn
                       (res->x509_ca_list[i], &tmp)) < 0)
	{
	  _gnutls_free_datum
               (&res->x509_rdn_sequence);
	  gnutls_assert ();
	  return ret;
	}

      _gnutls_write_datum16 (pdata, tmp);
      pdata += (2 + tmp.size);
      _gnutls_free_datum (&tmp);
    }

  return 0;
}

Take some time to read and understand this code. Really. I'll be here waiting to explain it when you are finished.

The res->x509_ca_list variable contains the entire list of CA's stored in memory so far. It is initially empty. After reading the first file, it will contain one certificate. After reading the second file, it will contain two certificates, and so on.

What is happening here is that FOR EVERY certificate to add, the entire list of certificates is iterated, not just once but twice! The computational complexity of invoking the function is O(2*n^2) where n is the number of certificate names to be added. The first iteration is to compute the size of the string to hold the CA names. The actual data is discarded. The second iteration calls the same function, for the same data, but now store the output in the appropriate place.

The first step to optimize this is to realize that you don't need to iterate through all CAs every time a new one has been added. Adding the name for the most recently added certificate should be sufficient. Since more than one CA can be added at each time, the function needs to take another parameter: the number of recently added CAs for which to pre-compute the names for.

Our new code will look like:

static int
add_new_crt_to_rdn_seq (gnutls_certificate_credentials_t res, int new)
{
  gnutls_datum_t tmp;
  int ret;
  size_t newsize;
  unsigned char *newdata;
  unsigned i;

  /* Add DN of the last added CAs to the RDN sequence
   * This will be sent to clients when a certificate
   * request message is sent.
   */

  /* FIXME: in case of a client it is not needed
   * to do that. This would save time and memory.
   * However we don't have that information available
   * here.
   * Further, this function is now much more efficient,
   * so optimizing that is less important.
   */

  for (i = res->x509_ncas - new; i < res->x509_ncas; i++)
    {
      if ((ret = gnutls_x509_crt_get_raw_dn
                        (res->x509_ca_list[i], &tmp)) < 0)
	{
	  gnutls_assert ();
	  return ret;
	}

      newsize = res->x509_rdn_sequence.size +
                       2 + tmp.size;
      if (newsize < res->x509_rdn_sequence.size)
	{
	  gnutls_assert ();
	  _gnutls_free_datum (&tmp);
	  return GNUTLS_E_SHORT_MEMORY_BUFFER;
	}

      newdata = gnutls_realloc
                        (res->x509_rdn_sequence.data, newsize);
      if (newdata == NULL)
	{
	  gnutls_assert ();
	  _gnutls_free_datum (&tmp);
	  return GNUTLS_E_MEMORY_ERROR;
	}

      _gnutls_write_datum16 (newdata +
            res->x509_rdn_sequence.size, tmp);
      _gnutls_free_datum (&tmp);

      res->x509_rdn_sequence.size = newsize;
      res->x509_rdn_sequence.data = newdata;
    }

  return 0;
}

I changed the function name, to reflect that it now just update the list of names for a set of recently added certificates. That also helps to find and update all callers of the old function.

As you can see, this function should be of complexity O(n) instead, since it just adds the name of the most recently added CA's.

But is it faster in practice? Let's check it! First, remove the if (i++ > 20) break; statement in the test code, so that we do the full test. First link against our old libgnutls.a and run the old test case again, for illustrative purposes.

jas@mocca:~$ gcc -o foo foo.c /usr/lib/libgnutls.a -lgcrypt -ltasn1 -lz
jas@mocca:~$ time ./foo

real 0m40.756s
user 0m40.679s
sys 0m0.052s
jas@mocca:~$

Again, it takes around 40 seconds. Now recompile against our modified libgnutls and run the same test:

jas@mocca:~$ gcc -o foo foo.c /home/jas/src/gnutls/lib/.libs/libgnutls.a -lgcrypt -ltasn1 -lz
jas@mocca:~$ time ./foo

real 0m0.288s
user 0m0.280s
sys 0m0.012s
jas@mocca:~$

Wow! Down from 40 seconds to 0.3 seconds. Let's take a look at the callgrind output now.

The code now spends around 60 % of the time in add_new_crt_to_rdn_seq() which seems reasonable. There are 206 files in /etc/ssl/certs, which explains the call count. Some of the files actually contain several CA certificates (in particular /etc/ssl/certs/ca-certificates.crt contains a lot of certificates), which explains why gnutls_x509_crt_get_raw_dn is called 408 times.

At this point, I stopped optimizing the code further.

On TLS-AUTHZ

The TLS-AUTHZ document (protocol spec here) describes a mechanism to add support for authorization in the TLS protocol. The idea is part of a patent application, see the patent notification to the IETF. The protocol has a complicated history in the IETF. Right now a third last call is open to request feedback from the community. I’ve written about TLS-AUTHZ before.

RedPhoneSecurity is now trying to circumvent the IETF standardization process by trying to get the document published as an ‘experimental standard’. The document earlier failed to get consensus for publication on the standards track.

The responsible IETF Area Director, Tim Polk, argues that because there exists independent implementations, the community benefits from having the document published. The argument is silly because the only independent implementation is mine and I’m opposed to publication of the standard. Further, the document will remain accessible to anyone in the community with access to the Internet since it has been published as an Internet Draft. To clarify that we have no interest in a standard with patent claims, we have decided to remove the tls-authz implementation from GnuTLS. Together with the FSF we came up with the following statement which is part of the GnuTLS 2.0.2 release announcement:

** TLS authorization support removed.
This technique may be patented in the future, and it is not of crucial importance for the Internet community. After deliberation we have concluded that the best thing we can do in this situation is to encourage society not to adopt this technique. We have decided to lead the way with our own actions.

If you are concerned about having patented standards adopted by the IETF, now is a very good time to make your voice heard! The last call ends on October 23th. Please read about the issue, and familiarize yourself with the IETF process (RFC 2026, with updates related to patents in RFC 3989) and send your feedback to ietf@ietf.org.

GnuTLS v2.0

I released GnuTLS v2.0 yesterday, the announcement is available.

So now we can start thinking of nice stuff to have in the v2.1.x series. Integrating the PKCS#11 support is one. ECC support? TLS 1.2 may go into v2.0.x. Opaque PRF input support is planned. Some benchmarking and optimization could be interesting. Other ideas?