In this blog post, we will be exploring, describing, and dissecting the first phase of the collaboration between Runtime Verification and Serokell.
The collaboration involved optimizing K, a rewrite-based executable semantic framework in which programming languages, type systems, and formal analysis tools can be defined with the help of configurations and rewrite rules.
The blog post will contain:
- the details and the results of the discovery phase;
- our recommendations (introduced in the Findings section);
- descriptions of the implementation of two major tasks, alongside their performance results.
Discovery phase
Profiling
The discovery phase, conducted by Vladislav Zavialov, involved a two-week-long investigation of the performance of the K framework’s symbolic execution engine, namely the Kore language.
With the use of the Glasgow Haskell Compiler’s profiling instrumentation, we were able to build the project with profiling support. This can be done by specifying `profiling: True` in the `cabal.project` file.
For example:
```
package *
  profiling: True

package kore
  profiling-detail: toplevel-functions
  optimization: True
```
The resulting executable will be slower than the executable built without the profiling support, because:
- Profiling has runtime overhead, though it is accounted for in the GHC-generated reports.
- Profiling prevents certain optimizations at compile time. For example, if a function is designated as a cost centre for profiling, it will not be inlined; and inlining, which GHC typically performs for small functions, may unlock further compile-time optimizations.
The profiling report can be generated by adding the following snippet to the command-line options of the executable:
```
+RTS -p -RTS
```

Here, `+RTS` and `-RTS` delimit the flags processed by GHC’s runtime system, and the `-p` flag tells it to generate a report with a `.prof` extension.
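For example, assuming the profiled executable is called `kore-exec` and takes a definition file as its argument (both details are illustrative here), a profiled run would look like this:

```
$ kore-exec definition.kore +RTS -p -RTS
```

After the process exits, GHC’s runtime writes the report to a file named after the program, e.g. `kore-exec.prof`, in the current directory.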
Cost centres
A cost centre is a program annotation around a given expression. This expression is assigned a cost – the time or space required to evaluate it. The profiling report contains a table of the cost centres, the locations of their definitions (the name of the module and the source code location), and how much time and space they take.
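When profiling is enabled, GHC inserts cost centres automatically (their granularity is controlled by `profiling-detail`), but an expression can also be annotated manually with the `SCC` pragma. A small illustration, with function and cost-centre names of our own choosing:

```haskell
mean :: [Double] -> Double
mean xs = ({-# SCC "mean_sum" #-} sum xs) / fromIntegral (length xs)
-- The report will attribute the time and allocations of `sum xs`
-- to the cost centre "mean_sum".
```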
Here’s an example of an entry in such a table:
```
COST CENTRE  MODULE  SRC                       %time  %alloc
function     Module  Module.hs:(13,1)-(14,12)    3.8     1.3
```

It means that the function `function`, defined in the module `Module` from line 13, column 1 to line 14, column 12, is responsible for 3.8% of the time cost and 1.3% of the allocation cost of the program.
However, recall that making certain functions into cost centres prevents their inlining. In a non-profiled executable they could be optimized away; therefore, the correct way to read the entry above is this:

- Unless optimized away, `Module.function` would take 3.8% of time and occupy 1.3% of space.
Inlining
Even without the help of the profiling instrumentation, we can still visually detect parts of the code that hinder GHC’s ability to optimize.
Consider the following expression:
```haskell
do a <- m
   b <- n
   f a b
```
Suppose the monad we work in is `State s`. Then the `>>=`s produced by the `do`-notation could be inlined and turned into `case`-expressions:
```haskell
State $ \s ->
  case runState m s of (a, s') ->
    case runState n s' of (b, s'') ->
      runState (f a b) s''
```
If, however, we work in some abstract monad `m` satisfying the `MonadState s m` constraint, then the `>>=`s are opaque to the compiler. It has to follow a pointer in the class dictionary and jump to an unknown function, which is bad for branch prediction.
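For reference, here is the same snippet wrapped into a standalone function over an abstract monad (`prog` is our own name for illustration):

```haskell
import Control.Monad.State (MonadState)

-- In this form, GHC compiles each >>= as a call through the class
-- dictionary instead of inlining a concrete implementation.
prog :: MonadState s m => m a -> m b -> (a -> b -> m c) -> m c
prog m n f = do
  a <- m
  b <- n
  f a b
```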
To find out if the functions in question are getting inlined, we can insert the following line into our module:
```haskell
{-# OPTIONS_GHC -ddump-simpl -ddump-to-file #-}
```

`-ddump-to-file` causes the output from each of the flags starting with `-ddump` to be dumped to a file or files, and `-ddump-simpl` dumps the simplifier output.
After building our module, GHC will produce a `.dump-simpl` file containing its internal representation of the program after most optimizations have had the chance to take place.
Suppose we built the module where the `do`-notation snippet is defined, and we worked in an abstract monad `m`. When we search for `>>=` in the generated `.dump-simpl` file, we will stumble upon this snippet:
```
>>=
  @ m_aM5
  $dMonad_aMr
  @ a_aM6
  @ c_aM8
  eta1_B2
  (\ (a1_anu :: a_aM6) ->
     >>=
       @ m_aM5
       $dMonad_aMr
       @ b_aM7
       @ c_aM8
       eta2_B1
       (\ (b1_anv :: b_aM7) -> eta_B3 a1_anu b1_anv))
```
The call to `>>=` is not inlined, and the reason is that it is instantiated at an abstract monad, `m_aM5`, so GHC does not know which instance to use. Instead, it has to pass the implementation in a dictionary, `$dMonad_aMr`.
Findings
Monomorphization and specialization
Here is a list of relevant terminology, datatypes, and typeclasses for this section:
- Satisfiability Modulo Theories (SMT for short) is the problem of determining whether a first-order formula is satisfiable, i.e. whether it is true under a certain assignment of values to its variables; it is a generalization of the Boolean satisfiability problem (SAT).
- An SMT solver is a tool that aims to solve the SMT problem for given theories and a subset of inputs.
- `SMT a` is a datatype that queries an external SMT solver.
- `NoSMT a` is a datatype for the case when no SMT solver is provided.
- `SimplifierT smt a` is a datatype representing a simplification action; the `smt` type variable can be substituted with either `SMT` or `NoSMT`.
- `MonadSimplify m` is a typeclass containing functions relevant to the simplification process. It allows working with arbitrarily nested monad transformers. Its base instance is `MonadSimplify (SimplifierT m)`.
We attempted to solve the aforementioned inlining problem by changing the abstract monad `MonadSimplify simplifier` to the concrete `SimplifierT SMT`. This could not be done easily because of the application architecture: the base monad, `SMT` or `NoSMT`, is determined dynamically:
```haskell
execute options metadataTools lemmas worker =
  case solver of
    Z3 -> withZ3
    None -> withoutSMT
```
For the purposes of this discovery phase investigation, we removed support for `NoSMT`.
It turned out that the `simplifier` type variable is instantiated to different monad transformer stacks and could not be easily monomorphized in some functions. We developed a custom tool to determine how the monad is instantiated in practice.
Here is the summary of its results on one of the test cases:
```
    1 MaybeT (ReaderT (SideCondition RewritingVariableName) (UnifierT (LogicT Simplifier)))
    2 MaybeT (ReaderT (SideCondition RewritingVariableName) (UnifierT (LogicT (LogicT Simplifier))))
   10 MaybeT (ReaderT (SideCondition RewritingVariableName) (UnifierT (StateT (MultiAnd StuckClaim) (ExceptT (MultiAnd StuckClaim) (StateT (MultiAnd SomeClaim) Simplifier)))))
   14 MaybeT (ReaderT (SideCondition RewritingVariableName) Simplifier)
54087 Simplifier
```
Each line in the snippet above is of the form `<number> <type>` and can be interpreted the following way: the type `<type>` has been instantiated `<number>` time(s) during the run of the program.
The vast majority (99.95%) of calls are in the `Simplifier` monad, so we started looking into a way to get GHC to specialize such functions.
- Specialization is the process of removing typeclass dictionary arguments by creating a new type-specialized definition for an overloaded function.
The `SPECIALIZE` pragma (the UK spelling `SPECIALISE` is also accepted) can be used to create a specialized copy of an overloaded function. Consider the following example:
```haskell
class C a where ...

foo :: C a => a -> a
{-# SPECIALIZE foo :: Char -> Char #-}
```
It will create a new function which will be used whenever `foo` is applied to a `Char`, and GHC will choose that function automatically.
Unfortunately, specialization is not transitive: if `f` calls `g`, and `f` is specialized, we might not see the desired performance improvement because `g` still works in an abstract context. Therefore, we must look into a way to specialize all the way down the call stack.
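Here is a minimal sketch of the pitfall, assuming `g` lives in another module and carries no `INLINABLE` pragma (all names are ours, not from the K codebase):

```haskell
-- Module A. Without {-# INLINABLE g #-}, g's unfolding is not exposed
-- to other modules, so it cannot be specialized at their call sites.
g :: Num a => a -> a
g x = x + x

-- Module B.
f :: Num a => a -> a
f x = g (g x)
{-# SPECIALIZE f :: Int -> Int #-}
-- The Int-specialized copy of f still calls the overloaded g,
-- passing the Num Int dictionary at runtime.
```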
Cache hit ratio
- A cache hit is a situation when the requested data is found in the cache. This implies a faster means of delivering that data since it can be immediately retrieved from the cache.
- A cache miss is the opposite of a cache hit, i.e. a situation when the requested data is not contained in the cache.
- A cache hit ratio is the number of cache hits divided by the total number of requests (in other words, the number of cache hits + the number of cache misses).
We found that the cache in `attemptEquation` enjoys a hit ratio of over 95%, so it is well-placed. However, one of the functions that performs the lookup in the `Simplifier` cache, which is essentially a `HashMap`, is expensive, because a lot of time is spent on hashing and comparing keys.
We can either use a different data structure than a `HashMap` to avoid hashing and comparisons entirely, or try to minimize the amount of time required to hash and compare keys by micro-optimizing the respective operations.
String interning
- String interning is a method of storing only one copy of each distinct string value. In general, interning means reusing objects that are equal instead of creating new ones. The objects must be immutable.
The profiling report revealed that there are a lot of operations over values of the datatype `Id`. `Id` is a record type that has one field with the type `Text`. This means that operations such as `==` or `hashWithSalt` operate over `Text` values, which can be slow.
Unlike some other programming languages, Haskell does not provide a built-in means for interning strings, string-like datatypes, or numbers. However, a common practice is to create a table of identifiers and assign unique numbers to them. In this case, we could create some sort of global map from the `Text` values to, for example, `Int`s. This way, comparisons and hashing become integer operations, which are much faster.
Implementation
When optimizing code, it’s crucial to avoid sacrificing simplicity. Drew DeVault describes this principle in his blog post “Simple, correct, fast: in that order”. Since K is a compute-intensive tool for software quality assurance, the principle is modified: accuracy is paramount, but simplicity and speed must be balanced.
While implementing these changes, we followed Runtime Verification’s GitHub guidelines for collaboration.
Monomorphization and specialization, pt. 1: SMT and NoSMT
As mentioned in the Monomorphization and specialization subsection of the Findings section, our ultimate goal was to carefully inline (monomorphize) and/or specialize as many functions as possible, so that GHC’s optimizations can fire on concrete types instead of polymorphic ones.
The first obstacle to consider is the presence of the two dynamically determined datatypes, `SMT` and `NoSMT`. They are defined as follows:
```haskell
newtype NoSMT a = NoSMT { getNoSMT :: LoggerT IO a }
newtype SMT a = SMT { getSMT :: SolverSetup -> LoggerT IO a }
```
We started looking into a way to merge the two datatypes into one. Notice how they differ in the type signatures of the encapsulated values: `NoSMT` holds a value of the type `LoggerT IO a`, while `SMT` needs a `SolverSetup` as an argument to acquire the value of the type `LoggerT IO a`. The type `LoggerT IO a` is isomorphic to the function `() -> LoggerT IO a`, where `()` is the unit type. For a moment, let’s think of the value encapsulated by `NoSMT` as that aforementioned function.
If we have two functions `f : a -> c` and `g : b -> c`, we can construct a function `(f + g) : Either a b -> c` that applies either `f` or `g` to the value held by the respective constructor of `Either`. This is exactly how we want to combine `SMT` and `NoSMT`:

- `f` is `getNoSMT`, `g` is `getSMT`;
- therefore, `a` is `()`, `b` is `SolverSetup`, and `c` is `LoggerT IO a`.
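Incidentally, this combinator already exists in the Prelude as the `either` function:

```haskell
either :: (a -> c) -> (b -> c) -> Either a b -> c
either f _ (Left x)  = f x
either _ g (Right y) = g y
```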
Let’s call this combination `SMT` (in other words, we retain the original name). It wraps a function of the type `Either () SolverSetup -> LoggerT IO a`. However, we can simplify the type of the argument: `Either () SolverSetup` is isomorphic to `Maybe SolverSetup`. Thus, the combined datatype can be defined like this:
```haskell
newtype SMT a = SMT (Maybe SolverSetup -> LoggerT IO a)
```
If the argument is a `Just` wrapping a `SolverSetup` value, this datatype is isomorphic to the original `SMT`; otherwise, it’s isomorphic to `NoSMT`.
For the sake of convenience, when referring to the original definition of `SMT`, let’s use the name `OldSMT`.
MonadSMT
- `MonadSMT m` is a typeclass that allows accessing `SMT` through monad transformers. It has two base instances: `MonadSMT OldSMT` and `MonadSMT NoSMT`.
We now need to construct the only base instance, `MonadSMT SMT`, with the knowledge of how the functions were defined in the original two base instances for the original two datatypes.
Suppose the functions declared in the `MonadSMT` typeclass have the following type signature:

```haskell
f :: MonadSMT m => args -> m a
```
We know how `fOldSMT :: args -> OldSMT a` and `fNoSMT :: args -> NoSMT a` were defined in the respective instances, so let’s build a definition of `f` for `MonadSMT SMT`!
It will roughly look like this:
```haskell
f :: args -> SMT a
f args = SMT $ \case
  Nothing -> getNoSMT (fNoSMT args)
  Just solverSetup -> getOldSMT (fOldSMT args) solverSetup
```
liftSMT
Every function in the `MonadSMT` typeclass has a default definition of approximately the following structure:
```haskell
default f ::
  (MonadTrans t, MonadSMT n, m ~ t n) =>
  args -> m a
f = lift . f
```
This will allow us to automatically have a definition of that function for nested monad transformer stacks:
- First, we call `f :: args -> n a`, which is possible due to the `MonadSMT n` constraint.
- Then, we call `lift :: n a -> t n a`, given that `t` instantiates `MonadTrans`.
- Since `t n` is equal to `m`, the result has the type `m a`.
Notice, however, that we now have only one base instance, `MonadSMT SMT`, which means that only `SMT` can reside at the very bottom of the monad transformer stack. We can simplify the definition of `MonadSMT` in the following way (a sketch of the result follows the list):
- Declare `liftSMT :: SMT a -> m a` and its default monad transformer definition. In the `MonadSMT SMT` instance, `liftSMT` would simply be equal to `id`.
- The other functions (apart from `liftSMT` and `withSolver`, which we will return to in a couple of moments) are removed from the typeclass and split into two versions:
  - The “specialized” versions, `fSMT :: args -> SMT a`, which were previously present in the `MonadSMT SMT` instance.
  - The polymorphic versions, `f :: MonadSMT m => args -> m a`, which are defined in terms of `liftSMT` as `f = liftSMT . fSMT`.
- `withSolver :: m a -> m a` is the only function whose argument is `m a`, which makes it impossible to define via `liftSMT`, so it remains in the typeclass.
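Putting the pieces together, the simplified typeclass roughly takes the following shape. This is only a sketch under our assumptions: `declare`/`declareSMT` are hypothetical method names, the stub bodies are for illustration only, and the real typeclass in the Kore codebase has more members.

```haskell
{-# LANGUAGE DefaultSignatures #-}

import Control.Monad.Trans.Class (MonadTrans, lift)
import Data.Text (Text)

class Monad m => MonadSMT m where
  liftSMT :: SMT a -> m a
  default liftSMT :: (MonadTrans t, MonadSMT n, m ~ t n) => SMT a -> m a
  liftSMT = lift . liftSMT

  withSolver :: m a -> m a

-- Assumes a Monad instance for the combined SMT newtype defined above.
instance MonadSMT SMT where
  liftSMT = id
  withSolver action = action -- simplified: the real instance manages solver state

-- The "specialized" version that works in SMT directly (stub body):
declareSMT :: Text -> SMT ()
declareSMT _name = SMT (\_solverSetup -> pure ())

-- The polymorphic version, defined outside the typeclass via liftSMT:
declare :: MonadSMT m => Text -> m ()
declare = liftSMT . declareSMT
```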
Summary
The unification of `OldSMT` (the datatype originally named `SMT`, before we defined the combined datatype) and `NoSMT` was the first step of the monomorphization and specialization task. This step was aimed more at simplifying the code than at optimizing it. Along the way, we simplified the definition of the `MonadSMT` typeclass by introducing `liftSMT :: SMT a -> m a` and defining the majority of the functions in terms of it, moving them outside of the typeclass’ scope. The pull request containing the implementation of the `SMT`/`NoSMT` fusion is available here.
String interning
An analysis of the profiling report showed that the values of the `Id` datatype are subject to many operations. `Id` is a record type, one of the fields of which has the type `Text`:
```haskell
data Id = Id
  { getId :: !Text
  , idLocation :: !AstLocation
  }
```
This means operations such as `==` or `hashWithSalt` operate over `Text` values, which can be slow.
The first attempt to solve this problem was to use the intern package. Modifying the `Eq` instance of the `Id` type to take advantage of interning resulted in an approximately 8% speedup on the integration test suite.
When we attempted to modify the `Ord` instance to take advantage of interning as well, some of our unit and integration tests failed because they assumed that `Id`s would be sorted alphabetically. Profiling showed that `Kore.Syntax.Id.compare` doesn’t play much of a role anyway: it isn’t one of the most expensive cost centres. Moreover, some data persists between invocations of the backend via serialization, and the ordering of `Id`s may change between runs if they are interned in a different order. Because of all that, changing the `Ord` instance to compare interned IDs was not worth pursuing.
However, we wanted to have an easier way to modify or debug the code responsible for string interning. The `Interned` typeclass of the `intern` package has a `cacheWidth` parameter that creates multiple maps to speed up concurrent access. As we didn’t need that, we had to create a newtype and redefine the `Interned` instance with `cacheWidth` set to 1.
Moreover, since the `intern` library is quite small, it seemed easier to implement a custom interning solution in our code.
InternedTextCache
To begin, let’s define a datatype for our interned strings:
```haskell
data InternedText = UnsafeMkInternedText
  { internedText :: {-# UNPACK #-} !Text
  , internedId :: {-# UNPACK #-} !Word
  }
```
In addition to the original `Text`, the datatype also contains a unique ID associated with this string. The constructor of this datatype should not be used directly to create values. Instead, we will introduce an interface to interact with our interned text cache.
We then define a datatype for the global cache, which will contain all of the process’ interned strings:
```haskell
data InternedTextCache = InternedTextCache
  { internedTexts :: !(HashMap Text InternedText)
  , counter :: {-# UNPACK #-} !Word
  }
```
`InternedTextCache` has two fields:

- `internedTexts` is a `HashMap` that stores all the interned strings.
- `counter` is an incrementing counter used to generate new unique IDs, since `HashMap.size` isn’t a constant-time operation.
Finally, we define instances for our `InternedText`:
```haskell
instance Eq InternedText where
  a == b = internedId a == internedId b
  {-# INLINE (==) #-}

instance Hashable InternedText where
  hashWithSalt salt = hashWithSalt salt . internedId
  {-# INLINE hashWithSalt #-}

  hash = hash . internedId
  {-# INLINE hash #-}

instance Ord InternedText where
  a `compare` b
    | internedId a == internedId b = EQ
    | otherwise = internedText a `compare` internedText b
  {-# INLINE compare #-}
```
The `Eq` and `Hashable` instances are straightforward – we use the ID of the interned string for equality comparison and hashing operations.
Our `Ord` instance quickly checks if the IDs of interned strings are equal. If they are not, we use lexical order to compare the strings themselves.
internText
The next step is to create a function to intern our strings. To start, let’s define the global cache itself. With the `unsafePerformIO` hack, we can define a global `IORef` value that stores our cache outside of the `IO` monad:
```haskell
globalInternedTextCache :: IORef InternedTextCache
globalInternedTextCache = unsafePerformIO $ newIORef $ InternedTextCache HashMap.empty 0
{-# NOINLINE globalInternedTextCache #-}
```
To avoid cache duplication, we must ensure that `globalInternedTextCache` is evaluated only once. So we add the `NOINLINE` pragma, which will prevent GHC from inlining our definition.
As soon as we have a global cache, we can define a function to intern `Text` values:
```haskell
internText :: Text -> InternedText
internText text =
  unsafePerformIO do
    atomicModifyIORef' globalInternedTextCache \InternedTextCache{counter, internedTexts} ->
      let ((internedText, newCounter), newInternedTexts) =
            HashMap.alterF
              \case
                -- If this text is already interned, reuse it.
                existing@(Just interned) -> ((interned, counter), existing)
                -- Otherwise, create a new ID for it and intern it.
                Nothing ->
                  let newIden = counter
                      !newInterned = UnsafeMkInternedText text newIden
                   in ((newInterned, counter + 1), Just newInterned)
              text
              internedTexts
       in (InternedTextCache newInternedTexts newCounter, internedText)
```
Using `atomicModifyIORef'`, we can modify our global cache in a thread-safe manner. We use the strict version to avoid space leaks.
The function works as follows:
- If the argument `text` has been interned previously, we return the existing `InternedText` value.
- Otherwise, we allocate a new `InternedText` and return it.
In other words, multiple evaluations of `internText` with the same argument should return a pointer to the exact same `InternedText` object.
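As a quick illustration of this property (a hypothetical check):

```haskell
-- Both calls hit the same cache entry, so the IDs coincide.
sameId :: Bool
sameId = internedId (internText "hello") == internedId (internText "hello")
```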
Id
Now that we have a custom interning solution in place, it’s time to modify the `Id` datatype:
```haskell
data Id = InternedId
  { getInternedId :: !InternedText
  , internedIdLocation :: !AstLocation
  }

instance Ord Id where
  compare first second =
    compare (getInternedId first) (getInternedId second)
  {-# INLINE compare #-}

instance Eq Id where
  first == second = getInternedId first == getInternedId second
  {-# INLINE (==) #-}

instance Hashable Id where
  hashWithSalt salt iden = hashWithSalt salt $ getInternedId iden
  {-# INLINE hashWithSalt #-}
```
The final step is to define a pattern synonym to abstract away `InternedText`:
```haskell
pattern Id :: Text -> AstLocation -> Id
pattern Id{getId, idLocation} <-
  InternedId (internedText -> getId) idLocation
  where
    Id text location = InternedId (internText text) location

{-# COMPLETE Id #-}
```
This pattern synonym allows us to use `Id` just as it was used before the modification.
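For example, code like the following continues to compile unchanged (the helper names are hypothetical, and we assume some `AstLocation` value is at hand):

```haskell
-- Construction goes through the pattern synonym, which interns the text:
mkId :: Text -> AstLocation -> Id
mkId name location = Id name location

-- Matching also goes through the pattern synonym:
idText :: Id -> Text
idText Id{getId} = getId
```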
Summary
Instead of keeping many copies of the same `Text` in memory, we keep only one shared `InternedText`. Each `InternedText` has a unique ID.
As a result:
- The `Eq` instance can quickly check whether the IDs of two `InternedText`s are equal, instead of checking every single character.
- The `Hashable` instance can hash the ID, which is a lot faster than hashing the string’s contents.
In total, the changes above improve performance by approximately 10%. In addition, the pattern synonym keeps the code that uses the `Id` type as simple as it was before.
Conclusion
The first phase of the collaboration between Runtime Verification and Serokell began with a discovery phase, conducted to determine the important, optimizable parts of the codebase with the help of GHC’s profiling instrumentation and custom tools.
The first part of the monomorphization and specialization task focused on the exploration of two datatypes, `SMT` and `NoSMT`, their unification, and the resulting simplification of the code, in particular of the `MonadSMT` typeclass. This change didn’t produce any noticeable effect on performance, but it is nevertheless integral to the second part of the task, enabling us to proceed with the changes that will be impactful.
The string interning task focused on optimizing the definitions of `==` and `hashWithSalt` for the `Id` datatype. One of its fields is a `Text` value that takes a direct part in those definitions, and such operations on `Text`s can be slow. Instead, we assign each `Text` a unique identifier and compare the identifiers. This, in turn, requires storing the `Text`-identifier pairs in a map, so that they can be accessed or created during the interning process. This solution introduces a 10% performance improvement.
The cache hit ratio task, as well as the 2nd part of the monomorphization and specialization task, will be explored in the following blog post. To not miss it, follow us on Twitter or subscribe to our mailing list via the form below.