Optimizing Haskell Code for Runtime Verification

In this blog post, we will be exploring, describing, and dissecting the first phase of the collaboration between Runtime Verification and Serokell.

The collaboration involved work on making optimizations to K, a rewrite-based executable semantic framework in which programming languages, type systems, and formal analysis tools can be defined with the help of configurations and rewrite rules.

The blog post will contain:

  • the details and the results of the discovery phase;
  • our recommendations (introduced in the Findings section);
  • the implementation descriptions of 2 global tasks, alongside their performance results.

Discovery phase

Profiling

The discovery phase, conducted by Vladislav Zavialov, involved a two-week-long investigation of the performance of the K framework’s symbolic execution engine, namely: the Kore language.

With the use of the Glasgow Haskell Compiler’s profiling instrumentation, we were able to build the project with profiling support. This can be done by specifying profiling: True in the file cabal.project.

For example:

package *
  profiling: True

package kore
  profiling-detail: toplevel-functions
  optimization: True

The resulting executable will be slower than the executable built without the profiling support, because:

  1. Profiling has runtime overhead, though it is accounted for in the GHC-generated reports.
  2. Profiling prevents certain optimizations at compile time. For example, if a function is designated as a cost centre for profiling, it will not be inlined, and inlining, typically done by GHC for small functions, may unlock further optimizations at compile time.

The profiling report can be generated by adding the following snippet to the command-line options of the executable:

+RTS -p -RTS

Here, +RTS and -RTS delimit the flags processed by GHC’s runtime system, and the -p flag tells it to generate a report with a .prof extension.

Cost centres

A cost centre is a program annotation around a given expression. This expression is assigned a cost – the time or space required to evaluate it. The profiling report contains a table of the cost centres, the locations of their definitions (the name of the module and the source code location), and how much time and space they take.

Here’s an example of an entry in such a table:

COST CENTRE  MODULE                    %time  %alloc
function     Module.hs:(13,1)-(14,12)  3.8    1.3

It means that the function function, defined in the module Module from line 13 column 1 to line 14 column 12, is responsible for 3.8% of the time cost and 1.3% of the allocation cost of the program.

However, recall that by making certain functions into cost centres, their inlining is prevented. In a non-profiled executable they could be optimized away; therefore, the correct way to read the entry above is this:

  • Unless optimized away, Module.function would take 3.8% of time and occupy 1.3% of space.

Inlining

Even without the help of the profiling instrumentation, we’re still able to visually detect some parts of the code that can hinder GHC’s abilities for optimization.

Consider the following expression:

do a <- m
   b <- n
   f a b

Suppose the monad we work in is State s. Then the >>=s produced by the do-notation could be inlined and turned into case-expressions:

State $ \s ->
  case runState m s of (a, s') ->
  case runState n s' of (b, s'') ->
  runState (f a b) s''

If, however, we work in some abstract monad m satisfying the MonadState s m constraint, then the >>=s are opaque to the compiler. It has to follow a pointer in the class dictionary and jump to an unknown function, which is bad for branch prediction.

To find out if the functions in question are getting inlined, we can insert the following line into our module:

{-# OPTIONS -ddump-simpl -ddump-to-file #-}
  1. -ddump-to-file causes the output from each of the flags starting with -ddump, to be dumped to a file or files.
  2. -ddump-simpl dumps simplifier output.

After building our module, GHC will produce a .dump-simpl file containing its internal representation of the program after most optimizations have had the chance to take place.

Suppose we built the module where the do-notation snippet is defined, and we worked in an abstract monad m. When we search for >>= in our generated .dump-simpl file, we will stumble upon this snippet:

>>=
  @ m_aM5
  $dMonad_aMr
  @ a_aM6
  @ c_aM8
  eta1_B2
  (\ (a1_anu :: a_aM6) ->
     >>=
       @ m_aM5
       $dMonad_aMr
       @ b_aM7
       @ c_aM8
       eta2_B1
       (\ (b1_anv :: b_aM7) -> eta_B3 a1_anu b1_anv))

The call to >>= is not inlined, and the reason is that it is instantiated at an abstract monad m_aM5, and GHC does not know which instance to use. Instead, it has to pass the implementation in a dictionary $dMonad_aMr.

Findings

Monomorphization and specialization

Here is a list of relevant terminology, datatypes, and typeclasses for this section:

  • Satisfiability Modulo Theories (SMT for short) is the problem of determining whether a first-order formula is satisfiable, i.e. if it is true under a certain assignment of values to its variables, and acts as the generalization of the Boolean satisfiability problem (SAT).
  • An SMT solver is a tool that aims to solve the SMT problem in given theories and a subset of inputs.
  • SMT a is a datatype that queries an external SMT solver.
  • NoSMT a is a datatype for the case when no SMT solver is provided.
  • SimplifierT smt a is a datatype representing a simplification action, the smt type variable can be substituted with either SMT or NoSMT.
  • MonadSimplify m is a typeclass containing functions relevant to the simplification process. Allows working with arbitrarily nested monad transformers. Its base instance is MonadSimplify (SimplifierT m).

We attempted to solve the aforementioned problem of inlining by changing the abstract monad MonadSimplify simplifier to SimplifierT SMT. This could not be done easily because of the application architecture: the base monad, SMT or NoSMT, is determined dynamically.

execute options metadataTools lemmas worker =
  case solver of
    Z3 -> withZ3
    None -> withoutSMT

For the purposes of this discovery phase investigation, we removed the support of NoSMT.

It turned out that the simplifier type variable is instantiated to different monad transformer stacks and could not be easily monomorphized in some functions. We developed a custom tool to determine how the monad is being instantiated in practice.

Here is the summary of its results on one of the test cases:

1 MaybeT (ReaderT (SideCondition RewritingVariableName) (UnifierT (LogicT Simplifier)))
2 MaybeT (ReaderT (SideCondition RewritingVariableName) (UnifierT (LogicT (LogicT Simplifier))))
10 MaybeT (ReaderT (SideCondition RewritingVariableName) (UnifierT (StateT (MultiAnd StuckClaim) (ExceptT (MultiAnd StuckClaim) (StateT (MultiAnd SomeClaim) Simplifier)))))
14 MaybeT (ReaderT (SideCondition RewritingVariableName) Simplifier)
54087 Simplifier

Each line in the snippet above is of the form “<number> <type>” and can be interpreted the following way: the type <type> has been instantiated <number> time(s) during the run of the program.

The vast majority (99.95%) of calls are in the Simplifier monad, so we started looking into a way to get GHC to specialize such functions.

  • Specialization is the process of removing typeclass dictionary arguments by creating a new type-specialized definition for an overloaded function.

The SPECIALIZE pragma (UK spelling also accepted) can be used to create a specialized copy of an overloaded function. Consider the following example:

class C a where ...

foo :: C a => a -> a
{-# SPECIALIZE foo :: Char -> Char #-}

It will create a new function which will be used whenever foo is applied to a Char, and GHC will choose that function automatically.

Unfortunately, specialization is not transitive: if f calls g, and f is specialized, we might not see the desired performance improvement because g still works in an abstract context. Therefore, we must look into a way to specialize all the way down the call stack.

Cache hit ratio

  • A cache hit is a situation when the requested data is found in the cache. This implies a faster means of delivering that data since it can be immediately retrieved from the cache.
  • A cache miss is the opposite of a cache hit, i.e. a situation when the requested data is not contained in the cache.
  • A cache hit ratio is the number of cache hits divided by the total number of requests (in other words, the number of cache hits + the number of cache misses).

We found that the cache in attemptEquation enjoys a hit ratio of over 95%, so it is well-placed. However, one of the functions that performs the lookup of the Simplifier cache, which is essentially a HashMap, is expensive, because a lot of time is spent on hashing and comparing keys.

We can either use a different data structure than a HashMap to avoid hashing and comparisons entirely, or try to minimize the amount of time required to hash and compare keys by micro-optimizing the respective operations.

String interning

  • String Interning is a method of storing exactly one copy of a distinct string value. In general, interning is reusing objects that are equal instead of creating new ones. The objects must be immutable.

The profiling report revealed that there are a lot of operations over values of the datatype Id. Id is a record type that has one field with the type Text. This means that operations such as == or hashWithSalt operate over Text values, which can be slow.

Unlike some other programming languages, Haskell does not provide a means for interning strings, string-like datatypes, or numbers. However, a common practice is to create a table of identifiers and assign unique numbers to them. In this case, we could create some sort of global map from the Text values to, for example, Ints. This way comparisons and hashing become integer operations, which are much faster.

Implementation

When optimizing code, it’s crucial to avoid sacrificing simplicity. Drew DeVault describes this principle in his blog: Simple, correct, fast: in that order. Since K is a compute-intensive tool for software quality assurance, the principle is modified: accuracy is paramount, but simplicity and speed must be balanced.

While implementing these changes, we followed Runtime Verification’s GitHub guidelines for collaboration.

Monomorphization and specialization, pt. 1: SMT and NoSMT

As mentioned in the Monomorphization and specialization subsection of the Findings section, our ultimate goal was to carefully inline (monomorphize) and/or specialize as many functions as possible, to make GHC’s optimizations take place by having concrete types instead of the polymorphic ones.

The first obstacle to consider is the presence of two data types that are dynamically determined, SMT and NoSMT. They are defined as follows:

newtype NoSMT a = NoSMT { getNoSMT ::                LoggerT IO a }
newtype SMT   a = SMT   { getSMT   :: SolverSetup -> LoggerT IO a }

We started looking into a way to merge the two datatypes into one. Notice how they differ in the type signatures of the encapsulated values: NoSMT holds a value of the type LoggerT IO a, while SMT needs a SolverSetup as an argument to acquire the value of the type LoggerT IO a. The type LoggerT IO a is isomorphic to the function () -> LoggerT IO a, where () is the unit type. For a moment, let’s think of the value encapsulated by NoSMT as that aforementioned function.

If we have two functions f : a -> c and g : b -> c, we can construct a function (f + g) : Either a b -> c, applying either f or g to the value held by the respective constructor of Either. This is exactly how we want to combine SMT and NoSMT:

  1. f is getNoSMT,
  2. g is getSMT; therefore
  3. a is (),
  4. b is SolverSetup, and
  5. c is LoggerT IO a.

Let’s call this combination SMT (in other words, we retain the original name). It wraps a function of the type Either () SolverSetup -> LoggerT IO a. However, we can simplify the type of the argument: Either () SolverSetup is isomorphic to Maybe SolverSetup. Thus, the combined datatype can be defined like this:

newtype SMT a = SMT (Maybe SolverSetup -> LoggerT IO a)

If the argument is a Just wrapping a SolverSetup value, this datatype is isomorphic to the original SMT; otherwise, it’s isomorphic to NoSMT.

For the sake of convenience, when referring to the original definition of SMT, let’s use the name OldSMT.

MonadSMT

  • MonadSMT m is a typeclass that allows accessing SMT through monad transformers. It has two base instances: MonadSMT OldSMT and MonadSMT NoSMT.

We now need to construct the only base instance, MonadSMT SMT, with the knowledge of how the functions were defined in the original two base instances for the original two datatypes.

Suppose the functions declared in the MonadSMT typeclass have the following type signature:

f :: MonadSMT m => args -> m a

We know how fOldSMT :: args -> OldSMT a and fNoSMT :: args -> NoSMT a were defined in the respective instances, so let’s build a definition of f for MonadSMT SMT!

It will roughly look like this:

f :: args -> SMT a
f args = SMT $ \case
  Nothing          -> getNoSMT (fNoSMT args)
  Just solverSetup -> getOldSMT (fOldSMT args) solverSetup

liftSMT

Every function in the MonadSMT typeclass has a default definition of approximately the following structure:

default f ::
  (MonadTrans t, MonadSMT n, m ~ t n) =>
  args -> m a
f = lift . f

This will allow us to automatically have a definition of that function for nested monad transformer stacks:

  1. First, we call f :: args -> n a, which is possible due to the MonadSMT n constraint.
  2. Then, we call lift :: n a -> t n a, given that t instantiates MonadTrans.
  3. Since t n is equal to m, the result has the type m a.

Notice, however, that we now have only one base instance: MonadSMT SMT, which means that only SMT can reside at the very bottom of the monad transformer stack. We can simplify the definition of MonadSMT the following way:

  1. Declare liftSMT :: SMT a -> m a and its default monad transformer definition. In the MonadSMT SMT instance, liftSMT would simply be equal to id.
  2. The other functions (apart from liftSMT and withSolver, which we will return to in a couple of moments) are removed from the typeclass and split into two versions:
    • The “specialized” versions: fSMT :: args -> SMT a, which were previously present in the MonadSMT SMT instance.
    • The polymorphic versions:
      f :: MonadSMT m => args -> m a
      f = liftSMT . fSMT
      
      which are defined in terms of liftSMT.
  3. withSolver :: m a -> m a is the only function whose argument is m a, therefore making it impossible to define it via liftSMT, so it remains in the typeclass.

Summary

The unification of OldSMT (originally referred to as SMT before defining the combined datatype) and NoSMT was the first step of the monomorphization and specialization task. This step was aimed more at simplifying the code rather than optimizing it. Along the way, we simplified the definition of the MonadSMT typeclass by introducing liftSMT :: SMT a -> m a and defining the majority of the functions in terms of it, bringing them outside of the typeclass’ scope. The link to the pull request containing the implementation of the SMT/NoSMT fusion is available here.

String interning

An analysis of the profiling report showed that the values of the Id datatype are subject to many operations.

Id is a record type, one of the fields of which has the type Text:

data Id = Id
    { getId :: !Text
    , idLocation :: !AstLocation
    }

This means operations such as == or hashWithSalt operate over Text values, which can be slow.

The first attempt to solve this problem was to use the intern package. Modification of Eq instance for type Id to take advantage of interning resulted in approximately 8% speedup on the integration test suite.

In our attempt to modify the Ord instance to take advantage of interning as well, some of our unit and integration tests failed because they assumed that the Ids would be sorted alphabetically. When we profiled it, we noticed that Kore.Syntax.Id.compare doesn’t play much of a role there, it isn’t one of the most expensive cost centres. Moreover, some data persists between invocations of the backend via serialization, and the ordering of Ids may change between runs if they are interned in a different order.

However, we wanted to have an easier way to modify or debug the code responsible for string interning. There is a cacheWidth parameter in the Interned typeclass of the intern package that creates multiple maps to speed up concurrent access. As we didn’t need that, we had to create a newtype and redefine the Interned instance with cacheWidth set to 1.

Moreover, since the intern library is quite small, it seemed easier to implement a custom interning solution in our code.

InternedTextCache

To begin, let’s define a datatype for our interned strings:

data InternedText = UnsafeMkInternedText
    { internedText :: {-# UNPACK #-} !Text
    , internedId :: {-# UNPACK #-} !Word
    }

In addition to the original Text, the datatype also contains a unique ID associated with this string. The constructor of this datatype should not be used directly to create values. Instead, we will introduce an interface to interact with our interned text cache.

We then define a datatype for the global cache, which will contain all of the process’ interned strings:

data InternedTextCache = InternedTextCache
    { internedTexts :: !(HashMap Text InternedText)
    , counter :: {-# UNPACK #-} !Word
    }

InternedTextCache has two fields:

  • internedTexts is a HashMap that stores all the interned strings
  • counter is an incremental counter used to generate new unique IDs since HashMap.size isn’t a constant-time operation

Finally, we define instances for our InternedText:

instance Eq InternedText where
    a == b = internedId a == internedId b
    {-# INLINE (==) #-}

instance Hashable InternedText where
    hashWithSalt salt = hashWithSalt salt . internedId
    {-# INLINE hashWithSalt #-}
    hash = hash . internedId
    {-# INLINE hash #-}

instance Ord InternedText where
    a `compare` b
        | internedId a == internedId b = EQ
        | otherwise = internedText a `compare` internedText b
    {-# INLINE compare #-}

The Eq and Hashable instances are straightforward – we use the ID of the interned string for equality comparison and hashing operations.

Our Ord instance quickly checks if the IDs of interned strings are equal. If they are not, we use lexical order to compare the strings themselves.

internText

The next step is to create a function to intern our strings. To start, let’s define a function that initializes the cache.

With the unsafePerformIO hack, we can define a global IORef value to store our cache outside the IO monad:

globalInternedTextCache :: IORef InternedTextCache
globalInternedTextCache = unsafePerformIO $ newIORef $ InternedTextCache HashMap.empty 0
{-# NOINLINE globalInternedTextCache #-}

To avoid cache duplication, we must ensure that globalInternedTextCache is evaluated only once. So we add the NOINLINE pragma, which will prevent GHC from inlining our definition.

As soon as we have a global cache, we can define a function to intern Text values:

internText :: Text -> InternedText
internText text =
  unsafePerformIO do
    atomicModifyIORef' globalInternedTextCache \InternedTextCache{counter, internedTexts} ->
      let ((internedText, newCounter), newInternedTexts) =
          HashMap.alterF
            \case
              -- If this text is already interned, reuse it.
              existing@(Just interned) -> ((interned, counter), existing)
              -- Otherwise, create a new ID for it and intern it.
              Nothing ->
                let newIden = counter
                  !newInterned = UnsafeMkInternedText text newIden
                 in ((newInterned, counter + 1), Just newInterned)
            text
            internedTexts
       in (InternedTextCache newInternedTexts newCounter, internedText)

Using atomicModifyIORef', we can thread-safely modify our global cache. We use the strict version to avoid space leaks.

The function works as follows:

  • If the argument text has been interned previously, we return the existing InternedText instance.
  • Otherwise, we allocate a new InternedText and return it.

In other words, multiple evaluations of internText with the same argument should return a pointer to the exact same InternedText object.

Id

Now that we have a custom interning solution in place, it’s time to modify the Id datatype:

data Id = InternedId
    { getInternedId :: !InternedText
    , internedIdLocation :: !AstLocation
    }

instance Ord Id where
    compare first second =
        compare (getInternedId first) (getInternedId second)
    {-# INLINE compare #-}

instance Eq Id where
    first == second = getInternedId first == getInternedId second
    {-# INLINE (==) #-}

instance Hashable Id where
    hashWithSalt salt iden = hashWithSalt salt $ getInternedId iden
    {-# INLINE hashWithSalt #-}

The final step is to define a pattern synonym to abstract away InternedText:

pattern Id :: Text -> AstLocation -> Id
pattern Id{getId, idLocation} <-
    InternedId (internedText -> getId) idLocation
    where
        Id text location = InternedId (internText text) location

{-# COMPLETE Id #-}

This pattern allows us to use the Id as it was before the modification.

Summary

Instead of keeping many copies of the same Text in memory, we keep only one shared InternedText. Each InternedText has a unique ID.

As a result:

  • The Eq instance can more quickly check if the ID of two InternedTexts are equal, instead of checking every single character.
  • The Hashable instance can hash the ID, which is a lot faster than hashing the string’s contents.

In total, the changes above improve performance by approximately 10%. In addition, the pattern synonym maintains the simplicity of the code where the Id type is used at the same level.

Conclusion

The first phase of the collaboration between Runtime Verification and Serokell emerged with the discovery phase, conducted to determine the important, optimizable parts of the codebase, with the help of GHC’s profiling instrumentation and custom tools.

The 1st part of the monomorphization and specialization task focused on the exploration of two datatypes, SMT and NoSMT, their unification, and the resulting simplification of the code; in particular, the MonadSMT typeclass. This change didn’t produce any noticeable effects on the performance, but it is nevertheless integral to the 2nd part of the task, enabling us to progress forward with the changes that will be impactful.

The string interning task focused on solving the problem of optimizing the definitions of == and hashWithSalt for the Id datatype. One of its fields is a Text value, and these values take an immediate part in the aforementioned definitions. Such operations on Texts can be slow. Instead, we assign each Text a unique identifier and compare them instead. This consequently gives rise to the need for storing such Text-identifier pairs in a map, to access or create them during the interning process. This solution introduces a 10% performance improvement.

The cache hit ratio task, as well as the 2nd part of the monomorphization and specialization task, will be explored in the following blog post. To not miss it, follow us on Twitter or subscribe to our mailing list via the form below.

Banner that links to Serokell Shop. You can buy awesome FP T-shirts there!
More from Serokell
Haskell in production fossaHaskell in production fossa
Haskell in industry: RiskbookHaskell in industry: Riskbook
Code Pray Love: Interview with Oli MakhasoevaCode Pray Love: Interview with Oli Makhasoeva