09 Aug 2017

Parser Combinator in real life

I had to write an application that handle 2200 msg per second. This application consist of indexing and compress data with 7z.

And here is why I use a parser combinator.

I will write another post for the usage of the awesome fszmq and a custom interop of 7z in a later post.

Disclaimer

I attempted to use Linq to xml that load entirely the document element by element in memory causing LOH (on data that I don’t want to index) problem when a big element is loaded.

The XmlProvider is based on Linq so I had the same problem.

Both are very usefull and the usage of parser combinator is overkill when there is no reason :)

The Problem

All my documents consist of a xml document that contains information to store and organize for indexation process and other data to just forward (to the 7z zipper part).

2 elements of the xml contain logs of partners request/response data. Those 2 elements could have more than 10MB and I don’t want to load this kind of message entirely due to LOH + GC problems (avg is near 1200msg/s :)).

When I check the linq to xml implementation, it load the document entirely element by element in memory with the XmlReader.

So, I first wrote a simple code that use the XmlReader but :

  • I don’t know when this XmlReader read or not the data.
  • It is hard to have a clean context
  • But it works!

After 2 weeks in production, I discover a new corner case like this :

An Id could be represent as

<root>
<Id>1</Id>
<Request>Big Data here !</Request>
<Response>Big Data here !</Response>
</root>
<root>
<Id>0</Id>
<items><item><Id>1</Id><Request>Big Data here !</Request><Response>Big Data here !</Response></item></items>
</root>

I remember how Linq and FSharp.Data were very usefull in this case but for the performance reason it is not ok!

After dealing with if/else/pattern matching approach, I tried Parser Combinator style. Instead parsing char, the context used by the parser is the XmlReader. I don’t want to rewrite a full xml parser because parsing xml is too hard and XmlReader is ok for that point.

A Solution in Functional style

I have to index a lots of document so compose a parser should be easy The solution should offer a maximum of flexibility to compose things together

I would like to express code like this :

type Entry = { Id:string } 
let convert message = 
	use reader = XmlReader.Create(message:Stream)
	let entry id = { Id=id }
	let entryP = ffwd "Id" >>. element (id >> Ok) "Id"
	run reader entryP

The Parser type that wrap the parser function. The aim of parser function is like an interface but instead it hide the context and error types

type Parser<'x, 'a> = Parser of ('x -> Result<'a, string list>) 
let run context (Parser p) = p context

Where ‘x could be the XmlReader and the run function that just unwrap the parser function and apply the given context.

>>. is an operator that takes 2 parser and return the result of the right parser. If you understand the meaning of the point you should intuitively understand that the .>> do the same except that it return the left result .>>. return both left and right result.

Quite Simple !

the full parser code based on XmlReader

type Parser<'x, 'a> = Parser of ('x -> Result<'a, string list>) 

//The code is based on the latest fsharp 4.1 (install vs2017 + fsharp 4.1 for eg)
open Result

let run context (Parser p) = p context

Define a map function that

  • Execute the parser x
  • If is ok, the parser x return a x’ result. Just apply x’ to f and return Ok
  • Stop if there is an error and do not apply f (show the error string list for exemple)
let map f x = 
    fun reader ->
        match run reader x with
        | Ok x' -> f x' |> Ok
        | Error e -> Error e
    |> Parser

let (<!>) = map

Define an apply function that do

  • Execute the run on the reader context on f and x
  • if parsers f and x in sequence return both ok then return Ok f’ x’
  • Like map stop on error
let apply f x = 
    fun reader ->
        match run reader f, run reader x with
        | Ok f', Ok x' -> Ok (f' x')
        | Error e, _ | _, Error e -> Error e
    |> Parser

let (<*>) = apply

The famous so called “combinator” function that

  • Parse x and y in sequence and return one or both depending on the “.” on operator
/// return y result if x and y are Ok in sequence
let (>>.) x y = 
    fun reader ->
        match run reader x with
        | Ok _ -> run reader y
        | Error e -> Error e
    |> Parser

/// return x result if x and y are Ok in sequence
let (.>>) x y = 
    fun reader ->
        match run reader x with
        | Ok x' -> 
            match run reader y with
            | Ok _ -> Ok x'
            | Error e -> Error e
        | Error e -> Error e
    |> Parser

/// return (x, y) as struct results if x and y are Ok in sequence
let (.>>.) x y = 
    fun reader ->
        match run reader x with
        | Ok x' -> 
            match run reader y with
            | Ok y' -> Ok (struct(x', y'))
            | Error e -> Error e
        | Error e -> Error e
    |> Parser

The aim of many function is like an unfold on List, so while there is no error aggregate result to a list and return the list on the first error

let many x = 
    fun reader ->
        List.unfold 
            <| fun reader -> match run reader x with Ok x' -> Some (x', reader) | Error _ -> None
            <| reader
        |> Ok

   |> Parser

The orElse combinator is like a branch condition, if x has an error then run y. But if x is ok, just return x without running y

let orElse x y = 
    fun reader ->
        match run reader x with
        | Ok x' -> Ok x'
        | Error e1 -> 
            match run reader y with
            | Ok y' -> Ok y'
            | Error e2 -> Error (List.append e1 e2)
    |> Parser

let (<|>) = orElse

Ok now I have the abstraction to parse any context.

Remember that my context is the XmlReader

The Xml helper function

open System.Xml

Hey it is xml :)

I Have to deal with XmlReader that confuse technical error with business error with the same way… So lets go to catch!

let tryF name f (reader:XmlReader) = 
    try
        f reader
    with ex -> Error [sprintf "expected %s got an error:\r\n%O" name ex]

The ffwd just fast forward the XmlReader to the given name

let ffwd name = 
    fun (reader:XmlReader) -> 
        while reader.Name <> name && not reader.EOF && reader.Read() do ()
        |> Ok
    |> Parser

The element function read the element with the given name, and apply the function f to the content

let element f name = 
    tryF name <| fun reader ->
        reader.ReadStartElement(name)
        if reader.NodeType = XmlNodeType.Text then
            let result = f (reader.ReadContentAsString())
            reader.ReadEndElement()
            result 
        else 
            let result = f ""
            if reader.NodeType = XmlNodeType.EndElement then reader.ReadEndElement()
            result
    |> Parser

The nested function treat the nested element case like :

<item>...</item>
let nested name p = 
    tryF name <| fun reader ->
        reader.ReadStartElement(name)
        let result = run reader p
        reader.ReadEndElement()
        result
    |> Parser

The pIgnore skip an element and return unit (very usefull with .» pIgnore or pIgnore ».)

let pIgnore = 
    fun (reader:XmlReader) -> 
        reader.ReadStartElement()
        reader.Skip()
        reader.ReadEndElement()
        Ok ()
    |> Parser 

Some Examples

A sample parser implementation :

open System 
open System.IO

type Entry = 
    { Keys: string Set
      Date : DateTime 
      Data: Stream }

let convert (message:Stream) = 
    let reader = XmlReader.Create(new StreamReader(message))
    let elt f x = element (f >> Ok) x
    let strE x = elt id x
    let ffwdE x f = ffwd x >>. element (f >> Ok) x
    let pIsSuccess = ffwdE "MessageType" ((<>) "Failed")

    let pHid = ffwdE "HotelId" int
    let pGid = elt int "GroupId"
    
    let parseDate s = 
        System.DateTime.Parse(s, Globalization.CultureInfo.InvariantCulture, Globalization.DateTimeStyles.AssumeUniversal).ToUniversalTime()

    let pDate = elt parseDate "Date"
    
    let pStartDate = ffwdE "Start" parseDate
    let pEndDate = elt parseDate "End"

    let pRoomCode = strE "RoomCode"
    let pRoomId = elt int "RoomId"
    
    let pRooms = 
        let zero = 
            fun _ -> Ok Set.empty
            |> Parser

        let nestedO name p = nested name p <|> zero  
        let pItem = pRoomCode .>>. pRoomId
        let pItems = set <!> (many (nested "item" pItem))
        ffwd "Rooms" >>. nestedO "Rooms" pItems
    
    let pRoom = Set.singleton <!> (ffwd "RoomCode" >>. pRoomCode .>>. pRoomId)

    let roomIdsP = 
        let defaultTo x y = 
            match y with _ when y |> Set.isEmpty -> x | _ -> y 

        defaultTo <!> pRoom <*> pRooms

    let logEntry (message:Stream) isSuccess hotelId groupId stamp startDate endDate rooms = 
        let ssnd (struct (_,x)) = x
        let roomIds = rooms |> Set.map ssnd
        let key = sprintf "%i/%i/%i"
        let keys = roomIds |> Set.map(key groupId hotelId)
        { Keys = keys
          Date = stamp
          Data = 
            message.Position <- 0L
            message }
    
    let rawXml (message:Stream) = 
        message.Position <- 0L
        let sr = new StreamReader(message)
        sr.ReadToEnd()

    try
        match run reader (logEntry message <!> pIsSuccess <*> pHid <*> pGid <*> pDate <*> pStartDate <*> pEndDate <*> roomIdsP) with
        | Ok logEntry -> Some logEntry
        | Error errors -> 
            let error = System.String.Join(System.Environment.NewLine, errors |> Array.ofList)
            printfn "[WARN]%s\r\nXml :\r\n%s" error (rawXml message)
            None
    with ex -> failwithf "Parsing failed with error %O\r\nXml:\r\n%s" ex (rawXml message)

I love this kind of little and composable declaration. Parsing is now like a Lego game!

The associated tests :

///Porcelain and plumbing
let convertFromString x = 
    use s = new MemoryStream()
    use sw = new StreamWriter(s)
    sw.Write(x:string)
    sw.Flush()
    s.Position <- 0L
    convert s

convertFromString """
<Transport><DestinationUri>dest</DestinationUri><SourceUri>source</SourceUri><ReplyUri /><MessageType>Update</MessageType><Message><HotelId>3116</HotelId><GroupId>9358</GroupId><Date>2017-02-20T16:19:12.4037606Z</Date><Request>Big Request</Request><Response>Big Response</Response><Start>2017-03-02</Start><End>2017-03-03</End>
    <RoomCode>26757</RoomCode>
    <RoomId>7066</RoomId>
    <Rooms /></Message></Transport>
"""

/// Do not consider the first Single RoomId if equal 0 and if there is no RoomIds
convertFromString """
<Transport><DestinationUri>dest</DestinationUri><SourceUri>source</SourceUri><ReplyUri /><MessageType>Update</MessageType><Message><HotelId>3116</HotelId><GroupId>9358</GroupId><Date>2017-02-20T16:19:12.4037606Z</Date><Request>Big Request</Request><Response>Big Response</Response><Start>2017-03-02</Start><End>2017-03-03</End>
    <RoomCode />
    <RoomId>0</RoomId>
    <Rooms>
        <item>
            <RoomCode>16239</RoomCode>
            <RoomId>3347</RoomId>
        </item>
        <item>
            <RoomCode>16241</RoomCode>
            <RoomId>3346</RoomId>
        </item>
   </Rooms></Message></Transport>
"""

/// Use the RoomId 0 = All Rooms when the is empty RoomIds
convertFromString """
<Transport><DestinationUri>dest</DestinationUri><SourceUri>source</SourceUri><ReplyUri /><MessageType>Update</MessageType><Message><HotelId>3116</HotelId><GroupId>9358</GroupId><Date>2017-02-20T16:19:12.4037606Z</Date><Request>Big Request</Request><Response>Big Response</Response><Start>2017-03-02</Start><End>2017-03-03</End>
    <RoomCode />
    <RoomId>0</RoomId>
    </Message></Transport>
"""

The pros

  • Easy to write
  • Easy to maintain
  • Hey it is composition, I can reuse parser and compose them to create new one!
  • It fix a real issue (I start writing a code that parse xml with XmlReader with hard issue to solve in existing code)

The cons

  • It was my first attempt, so the first parser was harder to write but others were very easy to compose (The most part was to understand how works XmlReader and when the context has changed or not!)
  • The use of operator and applicative separate the definition of the function from its execution. When you are debugging the run function you have to deal with a stack containing operator name (like GreaterGreaterEqual). Inlining the operator could help but it is more difficult to debug.
  • For now, in my experience, I delivered multiple fix / multiple parsers composition and it is not a big deal to change my habits, finally it was easy to fix and maintain.

In this post I will introduce FParsec in order to parse proto3 file.