Parsing Typed eDSL
Michelson is a smart contract language from the Tezos community. Akin to Forth, Michelson contract is described by a sequence of instructions operating on a typed stack. Each instruction assumes a stack of a certain type as input and produces an output stack of determined type. For example, the
PAIR instruction presumes a stack of the type
a : b : s and produces a stack of the type
pair a b : s for any stack tail
s. You can read more about Michelson instructions and typing in the official documentation.
In January 2019, in collaboration with Tocqueville Group, Serokell started the Morley project. One of its goals was to implement a comprehensive framework for testing arbitrary Michelson contracts that would support simple unit testing as well as more complex property-based testing. Being more precise, we wanted to take a contract and feed with various sets of input and output values and see if it behaved as expected.
A small remark before we go forward. In this article, we cover only a small subset of the Michelson’s instructions and consider only the core of the Michelson’s type system without taking annotations into account. Clearing up all these details was a complicated task we performed during our work on the Morley framework, and we welcome everybody to go and check the repository to see the implementation of a type check and interpretation with all underlying details.
It was decided to use Haskell for the Morley implementation, and firstly we developed the Michelson language as a very simple AST data type:
data T = Tint | Tnat | Toption T | Tlist T data UVal = UInt Integer | USome UVal | UNone | UList [UVal] data UInstr = UNOP | UDROP | UDUP | USWAP | UPUSH T UVal | UPAIR | UCAR | UCDR | UADD | UCONS | UNIL T | UIF_CONS
Soon we understood this simple AST suffered from certain limitations. First of all, it was untrivial to generate arbitrary loosely-typed values. In our AST, list was merely a constructor
UList [UVal] and we couldn’t write an
Arbitrary instance that would generate an arbitrary list of integers or strings depending on the type context.
The answer to this problem was obvious: create an AST with stronger types. Expression then becomes annotated with a type, which is easy to implement thanks to awesome
DataKinds extensions. This immediately solves the problem of arbitrary value generation. And moreover, it becomes possible for the interpreter to stop unpredictably failing with runtime type errors.
data Val t where VInt :: Int -> Val 'TInt VNat :: Word -> Val 'TNat VList :: [Val t] -> Val ('TList t) VPair :: Val p -> Val q -> Val ('TPair p q)
But after introducing this type, we quickly found ourselves at a challenge. It was very easy to parse textual code representation to a simple AST but was obscure how to do the same for a typed representation of Michelson. To simplify things, instead of parsing, we considered a task of conversion from a simple AST to a typed AST.
The first problem with conversion from a simple AST to a typed representation was that conversion of a parent branch in an AST depended on the types of children. A useful trick for solving this issue can be found in the blog post from 2009. In short, we create an existential type holding value along with its type and return this existential type from our type check function:
data Sing (t :: T) where STInt :: Sing 'TInt STNat :: Sing 'TNat STList :: Sing t -> Sing ('TList t) STPair :: Sing p -> Sing q -> Sing ('TPair p q) data SomeVal1 where SomeVal1 :: Val t -> Sing t -> SomeVal1 typeCheckVal1 :: UVal -> Maybe SomeVal1 typeCheckVal1 (UInt i) = Just $ SomeVal1 (VInt i) STInt typeCheckVal1 (UPair p q) = do SomeVal1 pv pt <- typeCheckVal1 p SomeVal1 qv qt <- typeCheckVal1 q pure $ SomeVal1 (VPair pv qv) (STPair pt qt) typeCheckVal1 (UList _) = error "not implemented"
Sing data type can be derived automatically with the use of the singletons library. That library provides the
Sing data family and useful helper functions and classes for work with singletons. Throughout this article, we’ll stick to handwritten
Sing and conversion functions.
There are two major problems with this construction. First, a reader may have noticed that neither
VNat constructors were ever used. Indeed, the
UInt constructor from a simple AST was meant to represent both signed and unsigned integers because they have roughly the same representation. We can not really distinguish between
TNat literals during parsing.
A similar issue appears in the case of a list constructor with an empty list wrapped in it. When given a list constructor, we have no idea what type of values this list contains. In the case of an empty list, we must return the
forall t. TList t type, but our type representation does not support such a construction.
The second problem with this snippet is similar. In case of a non-empty list, we can take
t as a type of the first element and figure out if other elements in the list have the same type by comparing them. But to compare a
t1 type of the first list element with a
t2 type of the second list element, we need constraints
Typeable t1 and
Typeable t2 to hold.
It’s relatively easy to address the second problem. We introduce a
SomeVal data type with
Typeable constraint put in the scope of the constructor:
data SomeVal where SomeVal :: Typeable t => Val t -> Sing t -> SomeVal
The first problem requires a switch to a different approach for type checking. One way to solve the problem is to introduce some sort of constrained
forall quantifier into our simplistic type system, similar to what we have in Haskell. For the empty list case, we can write something like
SomeVal (VList ) (forall n. Num n => n). This approach is more universal but far heavier to implement and maintain.
Luckily for us, a lighter approach is possible for type checking Michelson programs. Although all Michelson instructions are polymorphic, Michelson programs are always given in the context of a contract. A contract defines an input stack type, and each instruction modifies a stack type in an unambiguous way. Hence, there is no actual need to implement a type checking algorithm that derives a type (with some
for an arbitrary sequence of Michelson instructions. We are going to start with the input stack type and iterate through instructions. This way, we’ll be able to stick to the type system represented by the
T data type.
In this example, we considered only the type checking of values. Following the rule defined above, we’ll implement the
typeCheckVal function with the first argument being a type of an expression we’re trying to parse. There are only two instructions that introduce new value to the stack (namely,
UNIL) and both of them explicitly have type representation included.
typeCheckVal :: Sing t -> UVal -> Maybe (Val t) typeCheckVal STInt (UInt i) = Just (VInt i) typeCheckVal STNat (UInt i) = do guard (i >= 0) pure (VNat $ fromIntegral i) typeCheckVal (STPair pt qt) (UPair p q) = do pv <- typeCheckVal pt p qv <- typeCheckVal qt q pure $ VPair pv qv typeCheckVal (STList t) (UList l) = VList <$> mapM (typeCheckVal t) l typeCheckVal _ _ = Nothing
We do not actually need to wrap
Val t into
SomeVal here because the first argument nicely defines the output of the function. It’s important to emphasise the role of singletons in the construction above. What we do is pattern-matching on the type of value that should be parsed. Pattern-matching on type in the code of term-level functions is not common in Haskell, and singletons are perhaps the most straightforward way.
Now, let’s implement a conversion for instructions. First, we’ll have to modify our
Sing data type slightly and provide some required helper functions:
data Sing (t :: T) where STInt :: Sing 'TInt STNat :: Sing 'TNat STList :: Typeable t => Sing t -> Sing ('TList t) STPair :: (Typeable p, Typeable q) => Sing p -> Sing q -> Sing ('TPair p q) fromSing :: Sing t -> T fromSing = ... data SomeSing where SomeSing :: Typeable t => Sing t -> SomeSing withSomeSing :: SomeSing -> (forall t . Typeable t => Sing t -> a) -> a withSomeSing (SomeSing a) f = f a toSing :: T -> SomeSing toSing = ...
Similarly to values, we’ll define a typed representation of an instruction. The
Instr data type is parametrized by type parameters
out of kind
[T] which state for input and output stack types corresponding to an instruction. This Michelson instructions representation is very elegant as it perfectly mimics the notation given in Michelson’s documentation.
data Instr (inp :: [T]) (out :: [T]) where Seq :: Instr a b -> Instr b c -> Instr a c Nop :: Instr s s DROP :: Instr (a ': s) s DUP :: Instr (a ': s) (a ': a ': s) SWAP :: Instr (a ': b ': s) (b ': a ': s) PUSH :: Val t -> Instr s (t ': s) PAIR :: Instr (a ': b ': s) ('TPair a b ': s) CAR :: Instr ('TPair a b ': s) (a ': s) CDR :: Instr ('TPair a b ': s) (b ': s) NIL :: Instr s ('TList t ': s) CONS :: Instr (t ': 'TList t ': s) ('TList t ': s) ADDii :: Instr ('TInt ': 'TInt ': s) ('TInt ': s) ADDnn :: Instr ('TNat ': 'TNat ': s) ('TNat ': s) ADDin :: Instr ('TInt ': 'TNat ': s) ('TInt ': s) ADDni :: Instr ('TNat ': 'TInt ': s) ('TInt ': s) IF_CONS :: Instr (a ': 'TList a ': s) s' -> Instr s s' -> Instr ('TList a ': s) s'
The representation of the
ADD instruction is not very nice but can be improved with the use of a type class. We may create a type class
AddOp which takes two type arguments (for two operands of the
ADD instruction). It will contain one function for type checking, one function for interpretation and a type family for the result type. For simplicity, this is not implemented in the article’s code.
typeCheckI will take an input stack type and an untyped instruction and should return an output stack type along with a typed instruction. Hence, we introduce
SomeInstr data types. The
Stack data type is similar to
Rec from the vinyl package. The only difference is that we impose the
Typeable constraint on the first argument of
data Stack inp where SNil :: Stack ' (::&) :: (Typeable s, Typeable a) => Sing a -> Stack s -> Stack (a ': s) infixr 7 ::& data SomeInstr inp where (:::) :: Typeable out => Instr inp out -> Stack out -> SomeInstr inp infixr 5 :::
Now we’re able to finally implement the
typeCheckI :: Typeable inp => Stack inp -> UInstr -> Maybe (SomeInstr inp) typeCheckI s UNOP = pure (Nop ::: s) typeCheckI (_ ::& s) UDROP = pure (DROP ::: s) typeCheckI (a ::& s) UDUP = pure (DUP ::: a ::& a ::& s) typeCheckI (a ::& b ::& s) USWAP = pure (SWAP ::: b ::& a ::& s) typeCheckI s (UPUSH t v) = withSomeSing (toSing t) $ \t' -> do val <- typeCheckVal t' v pure (PUSH val ::: t' ::& s) typeCheckI (a ::& b ::& s) UPAIR = pure (PAIR ::: STPair a b ::& s) typeCheckI (STPair a _ ::& s) UCAR = pure (CAR ::: a ::& s) typeCheckI (STPair _ b ::& s) UCDR = pure (CDR ::: b ::& s) typeCheckI (STInt ::& STInt ::& s) UADD = pure (ADDii ::: STInt ::& s) typeCheckI (STNat ::& STNat ::& s) UADD = pure (ADDnn ::: STNat ::& s) typeCheckI (STInt ::& STNat ::& s) UADD = pure (ADDin ::: STInt ::& s) typeCheckI (STNat ::& STInt ::& s) UADD = pure (ADDni ::: STInt ::& s) typeCheckI s (UNIL t) = withSomeSing (toSing t) $ \t' -> pure (NIL ::: STList t' ::& s) typeCheckI ((_ :: Sing a) ::& STList (e :: Sing b) ::& s) UCONS = do Refl <- eqT @a @b pure (CONS ::: STList e ::& s) typeCheckI (STList a ::& s) (UIF_CONS consCase nilCase) = do nc ::: (s' :: Stack out1) <- typeCheck s nilCase cc ::: (_ :: Stack out2) <- typeCheck (a ::& STList a ::& s) consCase Refl <- eqT @out1 @out2 pure (IF_CONS cc nc ::: s') typeCheckI _ _ = Nothing typeCheck :: Typeable inp => Stack inp -> [UInstr] -> Maybe (SomeInstr inp) typeCheck s  = pure (Nop ::: s) typeCheck s (i : ) = typeCheckI s i typeCheck s (i : is) = do a ::: s' <- typeCheckI s i b ::: s'' <- typeCheck s' is pure (a `Seq` b ::: s'')
typeCheckI, we pattern-match on an input stack type and an untyped instruction. In the case of
CONS, we need to additionally check the equality of first element of the stack and an subtype of the list at the second element of the stack. In the case of
IF_CONS, a recursive call to
typeCheck is used to check both continuations.
Now, when we have finally settled down how to type check a sequence of Michelson instructions, let’s see how our eDSL can be interpreted.
interpret :: Rec Val inp -> Instr inp out -> Rec Val out interpret s Nop = s interpret s (Seq a b) = interpret (interpret s a) b interpret (_ :& s) DROP = s interpret (a :& s) DUP = a :& a :& s interpret (a :& b :& s) SWAP = b :& a :& s interpret (a :& b :& s) PAIR = VPair a b :& s interpret s (PUSH v) = v :& s interpret (VPair a _ :& s) CAR = a :& s interpret (VPair _ b :& s) CDR = b :& s interpret (VInt a :& VInt b :& s) ADDii = VInt (a + b) :& s interpret (VInt a :& VNat b :& s) ADDin = VInt (a + fromIntegral b) :& s interpret (VNat a :& VInt b :& s) ADDni = VInt (fromIntegral a + b) :& s interpret (VNat a :& VNat b :& s) ADDnn = VNat (a + b) :& s interpret s NIL = VList  :& s interpret (a :& VList l :& s) CONS = VList (a : l) :& s interpret (VList  :& s) (IF_CONS _ nilCase) = interpret s nilCase interpret (VList (a : l) :& s) (IF_CONS consCase _) = interpret (a :& VList l :& s) consCase
interpret function is total, which is a definite benefit of advanced type representation. The
Val data type contains enough information for a type checker to consider all possible cases of an input stack and instruction, and there’s no need to perform additional checks in runtime, which is an error-prone practice. In short, if the program type checks, it won’t produce an error in runtime.
To sum up, in this article we’ve seen how to apply some advanced typing techniques of Haskell to type-check simple strictly-typed stack-based language. We took a simple representation of language AST (which can be easily obtained from a text using any technique for parsing) and converted it to a strictly-typed representation of the same language. The strictly-typed representation is in 1:1 correspondence to the type system of defined language, and by having a term in strictly-typed representation, we are assured that it’s well-typed. The approach described is heavily dependent on the fact that the type system doesn’t define abstract types and all types are concrete, i.e. only built-in constructions are polymorphic. This restriction applies to a wide variety of languages, hence we see our approach as highly reusable.