Abstract
We investigate features of functional programming in the context of machine learning. Python and R are the most widely used languages in this area, and one can implement almost any model using libraries such as NumPy or TensorFlow. In this post, we discuss the advantages of Haskell in comparison to Python and R. Linear algebra is one of the main mathematical tools in machine learning, together with probability theory and statistics; therefore, our research aims to investigate and improve the Haskell tools for linear algebra.
General Background
Haskell is a polymorphic functional programming language with lazy evaluation. Haskell has a set of impressive characteristics, such as parametric polymorphism, a strong and expressive type system, and easier refactoring and debugging. It has already proved its efficiency in programming language design, blockchain systems, concurrent and parallel computation, etc. In our opinion, the potential of using Haskell in machine learning has not been investigated deeply enough. We want to highlight several examples of previous works related to both functional programming and machine learning.
In addition, we recommend the quite interesting blog post series Practical Dependent Types in Haskell: Type-Safe Neural Networks by Justin Le. It’s also useful to have a look at the haskell-ml library, where machine learning examples are based on Le’s ideas, and at this resource as well, where the Haskell ecosystem is surveyed from the perspective of machine learning and data science.
Problem
We’re going to discuss technical aspects of the implementation rather than machine learning itself. In machine learning and related areas, one often has to deal with multidimensional data, which we usually express as matrices. In the case of Haskell, purity, immutability, and the relative ease of translator development provide ways of parallelising computations and transforming code for execution on alternative architectures, such as NVidia microarchitectures.
Working with multidimensional data requires linear algebra. We consider a dataset as a two-dimensional matrix, each column of which is a single observation of some phenomenon from the external world. The dimensions of such a dataset are often quite large, which makes processing and storing the data expensive. On the other hand, there might be latent correlations between observations, and the presence of these correlations makes it possible to represent the given dataset in lower dimensions. That is, seeking dependencies of this kind allows one to optimise the input data. Machine learning provides a huge set of procedures for extracting the primary information from a dataset and representing the given observations up to the correlations between them.
When one writes code with a huge number of matrix operations, such as matrix transposition, a lot of regrettable errors arise. These mistakes have no connection with types, so the type-checker cannot detect them. The best-case scenario is a runtime error. Sometimes the evaluation completes successfully without any runtime errors, but the result differs from the expected one: for instance, an output matrix might have the wrong dimensions, and there is often no way to identify such bugs immediately. Worst of all, there might be a so-called “floating” error that reproduces only periodically, on a case-by-case basis. Floating errors are by far the most difficult to debug because it’s not straightforward to determine the cause of such failures. Moreover, when the dimensions do come out wrong, it’s not clear why we got exactly these dimensions and not any others.
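To make this concrete, here is a minimal sketch of such an error using hmatrix (reviewed below): the program type-checks, but the dimension mismatch surfaces only at runtime.

import Numeric.LinearAlgebra
import Prelude hiding ((<>))

-- Both matrices are well-typed, but their product is defined only when
-- the inner dimensions agree; the mismatch below is caught only at runtime.
a :: Matrix Double
a = (2><3) [1..6]   -- a 2x3 matrix

b :: Matrix Double
b = (2><2) [1..4]   -- a 2x2 matrix, incompatible with a for multiplication

main :: IO ()
main = print (a <> b)  -- compiles fine, throws an exception when run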
These errors often affect matrix dimensions, which is why we gave them the highest priority. One should always pay attention to dimension coherence when applying matrix operations, for instance, the matrix product or a linear solver, where the relationship between input dimensions matters. We tried to solve the problem of type-level dimensions to provide a type-safe implementation of widely used machine learning procedures. During our investigation, we were working on Haskell implementations of such dimension reduction procedures as (probabilistic) principal component analysis (briefly, PPCA) and the Gaussian process latent variable model (GPLVM).
In our view, it’s worth dealing with type-level matrix dimensions. Such an approach to array dimensions allows one to catch the whole class of errors described above at compile time and to work with some quite simple proofs, as long as we require a relationship between the dimensions and that relationship is provable.
Below, we discuss and review related matrix and numerical libraries to understand which array libraries are useful for machine learning in Haskell and which approaches to type-level natural numbers might be applied to make matrix dimensions safe.
HMatrix
hmatrix is one of the most popular matrix libraries in Haskell. It provides an interface over the well-known BLAS and LAPACK libraries. The hmatrix library is described quite well in a blog post by our colleagues from Tweag.io, among other places. We’ll discuss this library quite briefly and take a look at several examples to understand how hmatrix works.
For instance, let us consider the Cholesky decomposition. The Cholesky decomposition is an operation that decomposes a given matrix (which should be Hermitian and positive-definite) into the product of a lower triangular matrix and its conjugate transpose. This operation is widely used for solving linear systems and related problems. In our work, we also use the Cholesky decomposition several times.
In hmatrix, the Cholesky decomposition is a function with the following signature:
chol :: Field t => Herm t -> Matrix t
where Field
is a type-class denoting that t
is a field. Matrix
is a data type of matrix consists of strict dimensions and storable vector which is a content of the matrix itself. Herm
is just a newtype
over Matrix
. In our research, we used hmatrix
functions implicitly. As we’ll describe later, we decided to use repa
library as the main one, and matrix and related operations in repa
are obtained from HMatrix
. See the following library for more information.
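Here is a small usage sketch; the matrix below is a toy example chosen to be positive-definite.

import Numeric.LinearAlgebra

-- A symmetric tridiagonal matrix with eigenvalues 2 - sqrt 2, 2, and
-- 2 + sqrt 2, all positive, so the Cholesky decomposition is defined.
m :: Matrix Double
m = (3><3) [ 2, 1, 0
           , 1, 2, 1
           , 0, 1, 2 ]

-- sym wraps a matrix as Herm by symmetrising it; chol then returns
-- the triangular factor of the decomposition.
factor :: Matrix Double
factor = chol (sym m)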
Note that hmatrix contains a module called Numeric.LinearAlgebra.Static with the same functionality, but with static type-level dimensions. For instance, this is the matrix product in hmatrix with static dimensions:
(<>) :: forall m k n. (KnownNat m, KnownNat k, KnownNat n) => L m k -> L k n -> L m n
where L is a data type of a two-dimensional matrix parameterised by its numbers of rows and columns. m, n, and k are type-level natural numbers (as a matter of fact, types of kind Nat from GHC.Types); KnownNat is a type class that connects integers with type-level naturals. Thus, the problem of type-safety in hmatrix is partially solved, but Nat from the module mentioned above is just a representation of type-level natural numbers, not natural numbers in themselves: arithmetical operations on these naturals are implemented as open type families, whereas we would like to have these operations implemented explicitly. Under the circumstances, we have to conclude that this version of type-level dimensions in hmatrix is inconvenient for proofs and basic verification.
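For comparison, here is a short sketch with Numeric.LinearAlgebra.Static in which a dimension mismatch becomes a compile-time error.

{-# LANGUAGE DataKinds #-}

import Numeric.LinearAlgebra.Static
import Prelude hiding ((<>))

-- Dimensions are part of the types, so the compiler checks them for us.
a :: L 2 3
a = matrix [1 .. 6]

b :: L 3 2
b = matrix [1 .. 6]

c :: L 2 2
c = a <> b   -- well-typed: (2x3) times (3x2) gives 2x2
-- c' = a <> a  -- rejected at compile time: inner dimensions do not agree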
Accelerate
accelerate is a package with regular multidimensional arrays. accelerate allows writing code that is compiled at run-time into a special domain-specific language and then compiled through LLVM for execution either on a multithreaded CPU or on an NVidia GPU. Unfortunately, there are no such useful functions as SVD or the Cholesky decomposition, so accelerate is not very rich as a numerical library. On the other hand, these operations are available as low-level bindings to the CUDA solver libraries, although explicit functions are not implemented yet.
Examples:
map :: (Shape sh, Elt a, Elt b) => (Exp a -> Exp b) -> Acc (Array sh a) -> Acc (Array sh b)
fold :: (Shape sh, Elt a) => (Exp a -> Exp a -> Exp a)
     -> Exp a -> Acc (Array (sh :. Int) a) -> Acc (Array sh a)
- Exp a represents an expression computing a single value. All operations within such an expression are executed sequentially.
- Acc (Array sh a) represents an expression computing an array of values of type a and consists of many expressions of type Exp a. All these expressions are executed in parallel. Such separation helps to avoid nested parallelism, which is currently not supported.
- sh represents the dimensionality of the resulting array (DIM1 for a vector, DIM2 for a matrix, etc.).
- Elt is a type class of array elements.
- Shape is a type class that represents array dimensionality.
Despite the fact that one can work with Acc (…) and Exp a as if they were arrays and scalars, they are not real arrays or values at runtime, but the code for their calculation. This code can be compiled and executed via the run function, either on a CPU or on a GPU.
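As an illustrative sketch, the following builds an Acc program and runs it on the CPU; we assume the accelerate-llvm-native backend here, while the PTX backend would target an NVidia GPU instead.

import Data.Array.Accelerate             as A
import Data.Array.Accelerate.LLVM.Native as CPU

-- Sum of squares over a vector: A.map builds one parallel array expression,
-- A.fold reduces the innermost dimension, and CPU.run compiles and executes it.
sumSquares :: Vector Double -> Scalar Double
sumSquares xs = CPU.run (A.fold (+) 0 (A.map (\x -> x * x) (A.use xs)))

main :: IO ()
main = print (sumSquares (A.fromList (Z :. 5) [1 .. 5]))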
If you are interested in accelerate, take a look at these examples to learn in more detail how it works.
Repa
repa is a package with regular multidimensional arrays, like accelerate. But repa has an important distinctive feature: one may deal with arrays in a number of different representations. The array data type in repa is an associated data type in a type class called Source, which is defined as follows:
class Source r e where
  data Array r sh e
  extent       :: Shape sh => Array r sh e -> sh
  linearIndex  :: Shape sh => Array r sh e -> Int -> e
  deepSeqArray :: Shape sh => Array r sh e -> b -> b
where r is the representation type and e is the element type. In addition to the associated array type, this class has three methods (as a matter of fact, more than three, but these three form the minimal complete definition).
In contrast to accelerate, there are several kinds of array representation in repa. D is a type that corresponds to the delayed representation of an array as a function from index to element. C is a delayed representation via so-called cursor functions. Delayed and cursored arrays are rather “pseudo-arrays” than real arrays. Real arrays have other kinds of representation: U and V are unboxed and boxed arrays respectively, B is a strict bytestring array, and F represents a foreign memory buffer. In our work, we decided to use delayed arrays by default for convenience and simplicity.
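A minimal sketch of this workflow: a delayed array defined by a function of its index and then materialised, in parallel, as an unboxed array.

import Data.Array.Repa as R

-- A 3x3 identity matrix as a delayed (D) array: no memory is allocated,
-- the array is just a function from an index to an element.
identity3 :: Array D DIM2 Double
identity3 = fromFunction (Z :. 3 :. 3)
                         (\(Z :. i :. j) -> if i == j then 1 else 0)

main :: IO ()
main = do
  -- computeP evaluates the delayed array in parallel into an unboxed (U) one.
  m <- computeP identity3 :: IO (Array U DIM2 Double)
  print (toList m)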
Unfortunately, the development of repa has stalled.
Massiv
This library takes its main ideas from repa, but it is being actively developed by our Russian colleague Alexey Kuleshevich. In contrast to repa and accelerate, massiv provides a broad set of tools for working with mutable arrays, and massiv is more performant than repa. On the other hand, the ecosystem of massiv lacks numerical functionality: such functions as the Cholesky decomposition are not implemented yet. But unlike repa, this library and its ecosystem are actively developing.
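A small sketch of the massiv flavour (assuming a recent version of the library), computing an unboxed array with an explicit parallel computation strategy:

import Data.Massiv.Array as A

-- A 3x3 unboxed (U) array built from a function of the index; the Par
-- strategy tells massiv to compute the array in parallel.
arr :: Array U Ix2 Double
arr = makeArray Par (Sz2 3 3) (\(i :. j) -> fromIntegral (i * 3 + j))

main :: IO ()
main = print (A.sum arr)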
Dimensions
dimensions is a library of safe type-level dimensions for multidimensional data that can be used regardless of the particular data structure. This library is a part of the EasyTensor package. The purpose of this library is to provide type-safety for matrices in tensor algorithms. But this safety was achieved at the cost of rigour: for example, in dimensions, there is a lot of unsafe coercion between types for proof creation. Moreover, there is a distinction between unknown and known dimensions that increases the complexity of code and multiplies boilerplate. In addition, the authors used type-level natural numbers from GHC.TypeLits; this approach has some disadvantages, which we describe below and which do not depend on the functionality of dimensions.
GHC.TypeLits and type-natural
Type-level natural numbers are already implemented in the module GHC.TypeLits included in the base library. This version of type-level natural numbers works pretty well and is widely supported by third-party libraries, but these numbers are not an inductive data type, so it’s quite difficult to prove even trivial facts about them. In other words, the literals 2 and 134 have no structural connection with each other. This problem can be solved partially via special tools such as typechecker plugins, but not completely.
From our point of view, the library called type-natural is more suitable for this purpose. In contrast to GHC.TypeLits, its natural numbers are introduced inductively, à la Peano, which allows one to prove some simple properties by induction.
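A minimal sketch of the idea (the definitions below are illustrative, not type-natural’s exact API): with inductively defined naturals, addition unfolds by structural recursion, so basic facts can be proved by induction.

{-# LANGUAGE DataKinds, GADTs, TypeFamilies, TypeOperators #-}

import Data.Type.Equality ((:~:) (Refl))

-- Peano naturals, promoted to the type level via DataKinds.
data Nat = Z | S Nat

-- Addition by structural recursion on the first argument. These equations
-- are what proofs can unfold, unlike the opaque literals of GHC.TypeLits.
type family (n :: Nat) + (m :: Nat) :: Nat where
  'Z   + m = m
  'S n + m = 'S (n + m)

-- Singletons connect runtime witnesses with type-level naturals.
data SNat (n :: Nat) where
  SZ :: SNat 'Z
  SS :: SNat n -> SNat ('S n)

-- A basic fact proved by induction on n: n + Z equals n.
plusZero :: SNat n -> (n + 'Z) :~: n
plusZero SZ     = Refl
plusZero (SS n) = case plusZero n of Refl -> Refl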
Summary
In this part, we have introduced the problem of type-safe dimensions in Haskell and reviewed related libraries and tools. We have briefly described matrix and dimensional libraries and discussed their strengths and limitations.
In our opinion, most of the tasks we’ve solved have a more elegant solution via dependent types. In the second part, we will discuss useful libraries that allow emulating dependently typed programming in Haskell and the difficulties this involves. We will also give an overview of our own approach: a matrix data type parameterised by its dimensions.