Designing Efficient Dyadic Operations for Cryptographic Applications

Abstract Cryptographic primitives from coding theory are some of the most promising candidates for NIST’s Post-Quantum Cryptography Standardization process. In this paper, we introduce a variety of techniques to improve operations on dyadic matrices, a particular type of symmetric matrices that appear in the automorphism group of certain linear codes. Besides the independent interest, these techniques find an immediate application in practice. In fact, one of the candidates for the Key Exchange functionality, called DAGS, makes use of quasi-dyadic matrices to provide compact keys for the scheme.


Our Contribution.
We analyze two separate aspects of dyadic operations. First, we present three different algorithms for that are aimed specifically at computing multiplication of dyadic matrices. These are, respectively, a "standard" approach that makes use of dyadic signatures, a specialized Karatsuba-like algorithm, and a procedure based on the Fast Walsh-Hadamard Transform (FWHT) [7], also called dyadic convolution. We analyze the performance of all three methods and report our timings. As a second contribution, we describe a procedure that applies the LUP decomposition [5] to the dyadic case. The method effectively factors every quasi-dyadic matrix into a product of two triangular matrices and a permutation matrix. This leads to the possibility of a very efficient algorithm for computing the inverse of a matrix, which is particularly useful in code-based cryptography, for instance for computing the systematic form of a parity-check (or generator) matrix. According to our measurements, this improved inversion procedure is extremely fast, and provides a very large speedup during DAGS Key Generation.
Organization of the Paper. This paper is organized as follows. We start with some preliminary definitions in Section 2. We then present our main contributions: the various multiplication techniques are described in Section 3 and the improved inversion algorithm is presented in Section 4. We conclude by showing the results obtained when applying our techniques to DAGS; this is done in Section 5.

Preliminaries
We introduce dyadic matrices and describe some of their general properties.
Definition 2.1. Given a ring R and a vector h = (h 0 , h 1 , · · · , h n−1 ) ∈ R n , with n = 2 r for some r ∈ N, the dyadic matrix ∆(h) ∈ R n×n is the symmetric matrix with components ∆ ij = h i⊕j where ⊕ stands for bitwise exclusive-or. Such a matrix is said to have order r. The sequence h is called signature of the matrix ∆(h), and corresponds to its first row. The set of dyadic n × n matrices over R is denoted ∆ (︀ R n )︀ .
One can alternatively characterize a dyadic matrix recursively: any 1 × 1 matrix is dyadic of order 0, and any dyadic matrix M of order r > 0 has the form where A and B are two dyadic matrices of order r − 1. In other words, ∆
A dyadic permutation is clearly an involution, i.e. (Π i ) 2 = I. The i-th row, or equivalently the i-th column, of the dyadic matrix defined by a signature h can be written as ∆(h) i = hΠ i .
A dyadic matrix can be efficiently represented by its signature; in particular, all the operations between dyadic matrices can be referred only to the corresponding signatures. Indeed, for any two length-n vectors a, b ∈ R, we have: which means that, given two dyadic matrices A and B, with respective signatures a and b, their sum is the dyadic matrix described by the signature a + b.
In an analogous way, the multiplication between dyadic matrices can be done by considering only the corresponding signatures; we will discuss efficient ways for computing multiplications in Section 3. I : r ∈ N, n = 2 r and a, b ∈ F n . O
i (b) ← binary representation of i, using n bits. 6: for {j = 0, 1, · · · , n − 1} do 7: j (b) ← binary representation of j, using n bits. 8: π ← conversion of π (b) into an integer. 10: 11: end for 12: end for 13: return c Moreover, it is easy to see that the inverse of a dyadic matrix is also a dyadic matrix; this can be easily computed using Sylvester-Hadamard matrices (see Section 3.2).We will expand on this in Section 4. Finally, we introduce a relaxed notion of dyadicity, which will be useful throughout the paper.

Multiplication of Dyadic Matrices
In this section we consider different methods for computing the multiplication between two dyadic matrices.
In fact, we have just mentioned how some matrix operations, like the sum or the inversion, can be efficiently performed in the dyadic case just by considering the signatures. Multiplication can be strongly improved with similar methods, which exploit the particular structure of such matrices. In particular, we analyze three different algorithms and provide estimations for their complexities; we then compare the performance of the various algorithms. For ease of notation, we will refer to the two n × n matrices that we want to multiply simply as A and B, with a = [a 0 , a 1 , · · · , a n−1 ] and b = [b 0 , b 1 , · · · , b n−1 ] being the respective signatures. Maintaining the same notation, the product matrix C = AB, which is also dyadic, will have signature c = [c 0 , c 1 , · · · , c n−1 ].
In particular, we focus on the special case of quasi-dyadic matrices with elements belonging to a field F of characteristic 2.

Standard Multiplication
The first algorithm we analyze is described in Algorithm 1; we refer to it as the standard multiplication. The element of C in position (i, j) is obtained as the multiplication between the i-th row of A and the j-th column of B. Since dyadic matrices are symmetric, this is equivalent to the inner product between the i-th row of A and the j-th one of B. The signature c (i.e., the first row of C) is obtained by inner products involving only a (i.e., the first row of A). Thus, we can just construct the rows of B, by permutations of the elements in b, and then compute the inner products. The complexity of the algorithm is due to two different types of operations: 1. In order to construct the rows of B, we need the indexes of the corresponding permutations. Each index is computed as the modulo 2 sum of two binary vectors of length r, so can be obtained with a complexity of r binary operations. Thus, considering that we need to repeat this operation for 2 r − 1 rows (for the first one, no permutation is needed), the complexity of this procedure can be estimated as r·2 r · (︀ 2 r − 1 )︀ . 2. Each element of c is obtained as the inner product between two vectors of 2 r elements, assuming values in F. This operation requires 2 r multiplication and 2 r − 1 sums in F. If we denote as C mult and Csum the costs of, respectively, a multiplication and a sum in F, the total number of binary operations needed to compute 2 r inner products can be estimated as 2 2r · C mult + (2 2r − 2 r ) · Csum.
The complexity of a standard multiplication between two dyadic signatures can be estimated as: Of particular interest to us is the case where R n is actually a field F. Dyadic matrices over F form a commutative subring ∆(F n ) ⊂ F n×n , and this property gives rise to efficient arithmetic algorithms to compute the dyadic convolution. In particular, we here consider the fast Walsh-Hadamard transform (FWHT), which is well known [7] but seldom found in a cryptographic context. We describe it here for ease of reference. We firstly recall the FWHT for the case of a field F such that char(F) ≠ 2, and then describe how this technique can be generalized to consider also the case of char(F) = 2 (which, again, is the one we are interested in).

Definition 3.2. Let F be a field with char(F) ≠ 2. The Sylvester-Hadamard matrix Hr ∈ F n is recursively defined as
One can show by straightforward induction that H 2 r = 2 r Ir and hence H −1 r = 2 −r Hr, which can also be expressed recursively as Proof. The lemma clearly holds for r = 0. Now let r > 0, and write where A and B are dyadic. It follows that  Proof. The lemma clearly holds for r = 0. Now let r > 0, and with the notation of Lemma 3.1, the diagonal of
The structure of Sylvester-Hadamard matrices leads to an efficient algorithm to compute aHr for a ∈ F n , which is known as the fast Walsh-Hadamard transform. Let [a 0 , a 1 ] be the two halves of a. Thus This recursive algorithm, which can be easily written in purely sequential fashion (Algorithm 2), has complexity Θ(n log n), specifically, rn additions or subtractions in F. It is therefore somewhat more efficient than the fast Fourier transform, which involves multiplications by n-th roots of unity, when they are available at all (otherwise working in extension fields is unavoidable, and more expensive). The product of two dyadic matrices ∆(a) and ∆(b), or equivalently the dyadic convolution a b, can thus be efficiently computed as described in Algorithm 3. The total cost is 3rn additions or subtractions and 2n multiplications (half of these by the constant 2 −r = 1/n) in F, with an overall complexity Θ(n log n). Notice that this is also the complexity of computing det ∆(a).
The fast Walsh-Hadamard transform itself is not immediately possible on fields of characteristic 2, since it depends on Sylvester-Hadamard matrices which must contain a primitive square root of unity. Yet the FWHT algorithm can be lifted to characteristic 0, namely, from F 2 = Z/2Z to Z, or more generally from . Algorithm 3 can then be applied, and its output mapped back to the relevant binary field by modular reduction. This incurs a space expansion by a logarithmic factor, though. Each bit from F 2 is mapped to intermediate values that can occupy as much as 3r + 1 bits; correspondingly, each element from F 2 N is mapped to intermediate values that can occupy as much as (3r + 1)N bits. Thus the component-wise multiplication in Algorithm 3 becomes more complicated to implement for large N. However, the method remains very efficient for the binary case as long as each expanded integer component fits a computer word. For a typical word size of 32 bits and each binary component being expanded by a factor of 3r + 1, this means that blocks as large as 1024 × 1024 can be tackled efficiently. On more restricted platforms where the maximum available word size is 16 bits, dyadic blocks of size 32 × 32 can still be handled with relative ease. )   I : r ∈ N, n = 2 r and a ∈ F n with char(F) ≠ 2. O : aHr. 1: v ← 1 2: for j ← 1 to n do 3: w ← v 4: v ← 2v 5: for i ← 0 to n − 1 by v do 6: for l ← 0 to w − 1 do 7: s ← a i+l 8: q ← a i+l+w 9: a i+l ← s + q

Karatsuba Multiplication
In this section we propose a method which is inspired by Karatsuba's algorithm for the multiplication of two integers [9]. Let us denote by a 0 and a 1 , respectively, the first and second halves of a, i.e.: [︁ a n 2 , a n 2 +1 , · · · , a n−1 ]︁ .
The same notation is used for b 0 and b 1 and c 0 and c 1 , corresponding to the halves of B and C. Some straightforward computations show that the following relations hold: The iterative application of equation (5) allows to compute multiplications between dyadic matrices of any size. Let us denote as C (2 z ) mul and C (2 z ) sum the complexities of a multiplication and a sum between two signatures of length 2 z . For the sum of two dyadic signatures of size 2 z we have: where Csum again denotes the complexity of a sum in the finite field. The complexity of this algorithm can thus be estimated as: Taking into account the well known sum of a geometric series, we have: Considering this result, equation (7) leads to:

Eflcient Inversion of Dyadic and Quasi-Dyadic Matrices
In this section we propose an efficient algorithm for computing the inverse of quasi-dyadic matrices. The algorithm in principle is targeted to matrices that are not fully dyadic (even though, obviously, they have to be square). This is because, while it is of course possible to apply our procedure to fully dyadic matrices, these can in general be inverted much more easily, as we will see next.
To begin, remember that by definition of a quasi-dyadic matrix (Definition 2.3) we mean an element of ∆ (︀ R n )︀ d1×d2 .

Dyadic Matrices
The inverse of a dyadic matrix (i.e. d 1 = d 2 = 1) can be efficiently computed, using only the signature, as described by the following Lemma. As we mentioned before, the above Lemma yields a very simple way for computing the inverse of a dyadic matrix: given a signature a, we just need to compute its diagonalized form as aHr, compute the reciprocals of its elements and put it in a vectorb. Finally, the inverse of ∆(a) can be obtained as 1 2 rb Hr. This property also leads to a very simple way to check the singularity of ∆(a): if its diagonalized form contains some null elements, then it is singular.
We now focus on the case of dyadic matrices over a field F with characteristic 2. One can show by induction that in such a case a dyadic matrix ∆(a) of dimension n satisfies ∆(a) 2 = ( ∑︀ i a i ) 2 I, and hence its inverse, when it exists, is ∆(a) −1 = ( ∑︀ i a i ) −2 ∆(a), which can be computed in O(n) steps since it is entirely determined by its first row. It is equally clear that det ∆(a) = ( ∑︀ i a i ) n , which can be computed with the same complexity (notice that raising to the power of n = 2 r only involves r squarings). Basically, verifying whether a dyadic matrix has full rank or not can be easily done by checking whether the sum of the elements of the signature equals 0.

Quasi-Dyadic Matrices
Consider a quasi-dyadic matrix M. Since the matrix has to be square, we have d 1 = d 2 = d, and the matrix has dimension dn × dn. Such a matrix can be compactly represented just by the signatures of the dyadic blocks. To simplify notation, we can denote the signature of the dyadic-block in position (i, j) asm i,j , and store all such vectors in a matrixM ∈ R d×dn : We focus again on the special case of quasi-dyadic matrices over a field F with characteristic 2. The LUP decomposition is a method which factorizes a matrix M as LUP, where L and U are lower triangular and upper triangular matrices, respectively, and P is a permutation. Exploiting this factorization, the inverse of M can thus be expressed as: The advantage of this method is that the inverses appearing in (10) can be easily computed, because of their particular structures. In fact, the inverse of an upper (lower) triangular matrix is obtained via a simple backward (forward) substitution procedure, while the inverse of P is its transpose.
In some cases, applying a block-wise LUP decomposition might lead to some complexity reduction; for instance, see [13] for the inversion of a sparse matrix. Here, we consider the case of a quasi-dyadic matrix; the corresponding procedure is shown in Algorithm 4.

Algorithm 4 LUP Decomposition of a Quasi-Dyadic Matrix
Update u,M andP via Algorithm 5. Our proposed procedure consists in using a block decomposition, which works directly on the signatures, in order to exploit the simple and efficient algebra of dyadic matrices. The operations in Algorithm 4 only refer to the signatures inM: for instance, the expressionm i,jmi,l means the product between the dyadics having as signaturesm i,j andm i,l . This choice may result in some abuse of notation, but is useful to emphasize the fact that, as we have explained in the previous sections, operations with dyadics can be efficiently computed just by taking into account their signatures. It can be easily shown that, for a quasi-dyadic matrix, its factors L, U and P are in quasi-dyadic form as well: as we have done for the matrix M, we refer to their compact representations asL,Û andP, respectively.
The algorithm takes as input a matrixM, as in (9), and computes its LUP factorization; outputs of the algorithm are the modified matrixM, having as elements the ones of its factorsL andÛ, and the permutation P. As in (9), we denote asm i,j the signature in position (i, j) in the output matrixM. The matricesL andÛ can then be expressed as: where1 and0 denote, respectively, the signature of the identity matrix and the one of the null matrix (i.e. the length-k vectors [1, 0, · · · , 0] and [0, 0, · · · , 0]).
The matrixP is represented through a length-d vector [p 0 , p 1 , · · · , p d−1 ], containing a permutation of the integers [0, 1, · · · , d − 1]; the rows ofM get permuted according to the elements ofP. In particular, the elements ofP are obtained through a block pivoting procedure, which is described in Algorithm 5. This function takes as inputM,P and an integer j, and searches for a pivot (i.e., a non singular signature) in the j-th column ofM, starting fromm j,j , and places it in position (j, j). As the procedure goes on, every time a singular signature is tested, the rows ofM get permuted; the elements ofP are accordingly modified. If the j-th column contains all singular blocks, this means that the matrixM is singular; in such a case, this event is notified by setting u = 0.

Algorithm 5 Block pivoting
We point out that, for the matrices we are considering, we expect Algorithm 4 to be particularly efficient. First of all, as we have already said, this is due to the possibility of efficiently performing operations involving dyadic matrices; in addition, the dyadic structure should also speed-up the pivoting procedure. In fact, we can consider a signature inM as a collection of k random elements picked from GF(2 N ): thus, their sum can be assumed to be a random variable with uniform distribution among the elements of the field GF(2 N ). So, the probability of it being equal to 0, which corresponds to the probability of the corresponding signature to be singular, equals 2 −m . This probability gets lower as m increases: this fact means that the expected number of operations performed by Algorithm 5 should be particularly low. Basically, most of the times the function will just compute the sum of the elements inm j,j and verify whether it is null or not.
Once the factorization ofM has been obtained, we just need to perform the computation of M −1 through (10). Since the inverse of a triangular matrix maintains the original triangular structure, the computation of the inversesL −1 andÛ −1 can be efficiently performed. A possible way for computing these matrices is to store the elements of both matrices in just one output matrixT. We do this in Algorithm 6. The matrixÎ d is the compact representation of a dn × dn identity matrix, and so is composed of signatures δ i,j1 , where δ i,j denotes the Kronecker delta.
If we denote ast i,j the signature in position (i, j), we have:

Performance Analysis: Application to DAGS
In this section we provide the results of the application of our techniques to DAGS. For completeness, we have included a specification of the three DAGS algorithms in Appendix A, but for our purpose, DAGS is essentially the McEliece cryptosystem, converted to a KEM via a standard transformation [8]. In particular, the Key Generation algorithm is the same as the QD-GS McEliece version described in [12]. In this algorithm, a key role is played by the systematization (i.e. reduced row echelon form) of a quasi-dyadic rectangular matrix, the result of which will in fact be the public key for the scheme. The cost of computing said systematic matrix dwarfs everything else in key generation: according to a static analysis, this takes over 98% of the total cost of key generation. Therefore, a fast procedure to compute the systematic form will have a substantial impact on the overall performance of the algorithm.

Implementation Details.
We developed a code in "C" to implement our procedures. In all cases, we use no optimizations apart from the optimization from the GCC compiler ("-O3"). The GCC version used was 7.3.1 20180406, the code was compiled for the processor Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz with 16GB of memory and operating system Arch linux version 2018.05.01 with Kernel 4.16.5. We ran 100 times each piece of code and computed the average of all measurements; to obtain the number of cycles, we used the file "cpucycles.h" from supercop¹.

Fast Multiplication.
To compare our methods, we fix a dyadic order r and measure the cost of a multiplication of two matrices of size n = 2 r . Relevant dyadic orders for DAGS are for instance r = 4 and r = 5. We do this over different fields to highlight the difference in performance when changing fields: we tested F 2 5 and F 2 6 which are the fields currently used by DAGS. We report here the results of the improved inversion procedure (Algorithms 4, 5 and 6). We compared our procedure with the equivalent portion of the DAGS implementation that we extrapolated from the publicly available source code [2]. In particular, we measured the piece of code that begins with the creation of the Cauchy matrix and ends with the generation of the systematic matrix. Table 2 shows the comparison, measured in cpu cycles.

A DAGS Algorithms
We briefly describe the three algorithms that define DAGS. Generalized Srivastava codes are defined by parameters s and t, where in our case log s is the dyadic order; the codes in use have length n = n 0 s and dimension k = k 0 s where n 0 and k 0 are the number of dyadic blocks. Other parameters are the cardinality of the base field q and the degree of the field extension m. In addition, we have k = k ′ + k ′′ , where k ′ is arbitrary and set to be "small".
The key generation process uses the following fundamental equation which guarantees we can build a dyadic matrix, with signature h = (h 0 , h 1 , . . . , h n−1 ), which is also a Cauchy matrix, i.e. a matrix C(u, v) with components C ij = 1 u i −v j . In [11] it is proved that we can use the fundamental