F# Tutorial => Comparison of different F# data pipelines

Example

In F# there are many options for creating data pipelines, for example: List, Seq and Array.

What data pipeline is preferable from memory usage and performance perspective?

In order to answer this we'll compare performance and memory usage using different pipelines.

Data Pipeline

In order to measure the overhead, we will use a data pipeline with low cpu-cost per items processed:

let seqTest n =
  Seq.init (n + 1) id
  |> Seq.map    int64
  |> Seq.filter (fun v -> v % 2L = 0L)
  |> Seq.map    ((+) 1L)
  |> Seq.sum

We will create equivalent pipelines for all alternatives and compare them.

We will variate the size of n but let the total number of work be the same.

Data Pipeline Alternatives

We will compare the following alternatives:

Imperative code
Array (Non-lazy)
List (Non-lazy)
LINQ (Lazy pull stream)
Seq (Lazy pull stream)
Nessos (Lazy pull/push stream)
PullStream (Simplistic pull stream)
PushStream (Simplistic push stream)

Although not a data pipeline we will compare against Imperative code since that most closely match how the CPU executes code. That should be that fastest possible way to compute the result allowing us to measure the performance overhead of data pipelines.

Array and List compute a full Array/List in each step so we expect memory overhead.

LINQ and Seq are both based around IEnumerable<'T> which is lazy pull stream (pull means that the consumer stream is pulling data out of the producer stream). We therefore expect the performance and memory usage to be identical.

Nessos is a high-performance stream library that supports both push & pull (like Java Stream).

PullStream and PushStream are simplistic implementations of Pull & Push streams.

Performance Results from running on: F# 4.0 - .NET 4.6.1 - x64

The bars show the elapsed time, lower is better. The total amount of useful work is the same for all tests so the results are comparable. This also means that few runs implies larger datasets.

As usual when Measuring one see interesting results.

List performance poor is compared to other alternatives for large data sets. This can be because of GC or poor cache locality.
Array performance better than expected.
LINQ performs better than Seq, this is unexpected because both are based around IEnumerable<'T>. However, Seq internally is based around a generic impementation for all algorithms while LINQ uses specialized algorithms.
Push performs better than Pull. This is expected since the push data pipeline has fewer checks
The simplistic Push data pipelines performs comparable to Nessos. However, Nessos supports pull and parallelism.
For small data pipelines the performance of Nessos degrades possible because pipelines setup overhead.
As expected the Imperative code performed the best

GC Collection count from running on: F# 4.0 - .NET 4.6.1 - x64

The bars shows the total number of GC collection counts during the test, lower is better. This is a measurement of how many objects are created by the data pipeline.

As usual when Measuring one see interesting results.

List is expectedly creating more objects than Array because a List is essentially a single linked list of nodes. An array is a continous memory area.
Looking at the underlying numbers both List & Array forces 2 generation collections. These kind of collection are expensive.
Seq is triggering a surprising amount of collections. It's surprisingly even worse than List in this regard.
LINQ, Nessos, Push and Pull triggers no collections for few runs. However, objects are allocated so the GC eventually will have to run.
As expected since the Imperative code allocate no objects no GC collections were triggered.

Conclusion

All data pipelines do the same amount of useful work in all test cases but we see significant differences in performance and memory usage between the different pipelines.

In addition, we notice that the overhead of data pipelines differ depending on the size of data processed. For example, for small sizes Array is performing quite well.

One should keep in mind the amount of work performed in each step in the pipeline is very small in order to measure the overhead. In "real" situations the overhead of Seq might not matter because the actual work is more time consuming.

Of more concern is the memory usage differences. GC isn't free and it is beneficial for long running applications to keep GC pressure down.

For F# developers concerned about performance and memory usage it's recommended to check out Nessos Streams.

If you need top-notch performance strategically placed Imperative code is worth considering.

Finally, when it comes to performance don't make assumptions. Measure and Verify.

Full source code:

module PushStream =
  type Receiver<'T> = 'T -> bool
  type Stream<'T>   = Receiver<'T> -> unit

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then r v else true)

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> r (m v))

  let inline range b e : Stream<int> =
    fun r ->
      let rec loop i = if i <= e && r i then loop (i + 1)
      loop b

  let inline sum (s : Stream<'T>) : 'T =
    let mutable state = LanguagePrimitives.GenericZero<'T>
    s (fun v -> state <- state + v; true)
    state

module PullStream =

  [<Struct>]
  [<NoComparison>]
  [<NoEqualityAttribute>]
  type Maybe<'T>(v : 'T, hasValue : bool) =
    member    x.Value        = v
    member    x.HasValue     = hasValue
    override  x.ToString ()  =
      if hasValue then
        sprintf "Just %A" v
      else
        "Nothing"

  let Nothing<'T>     = Maybe<'T> (Unchecked.defaultof<'T>, false)
  let inline Just v   = Maybe<'T> (v, true)

  type Iterator<'T> = unit -> Maybe<'T>
  type Stream<'T>   = unit -> Iterator<'T>

  let filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun () ->
      let i = s ()
      let rec pop () =
        let mv = i ()
        if mv.HasValue then
          let v = mv.Value
          if f v then Just v else pop ()
        else
          Nothing
      pop

  let map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun () ->
      let i = s ()
      let pop () =
        let mv = i ()
        if mv.HasValue then
          Just (m mv.Value)
        else
          Nothing
      pop

  let range b e : Stream<int> =
    fun () ->
      let mutable i = b
      fun () ->
        if i <= e then
          let p = i
          i <- i + 1
          Just p
        else
          Nothing

  let inline sum (s : Stream<'T>) : 'T =
    let i = s ()
    let rec loop state =
      let mv = i ()
      if mv.HasValue then
        loop (state + mv.Value)
      else
        state
    loop LanguagePrimitives.GenericZero<'T>

module PerfTest =

  open System.Linq
#if USE_NESSOS
  open Nessos.Streams
#endif

  let now =
    let sw = System.Diagnostics.Stopwatch ()
    sw.Start ()
    fun () -> sw.ElapsedMilliseconds

  let time n a =
    let inline cc i       = System.GC.CollectionCount i

    let v                 = a ()

    System.GC.Collect (2, System.GCCollectionMode.Forced, true)

    let bcc0, bcc1, bcc2  = cc 0, cc 1, cc 2
    let b                 = now ()

    for i in 1..n do
      a () |> ignore

    let e = now ()
    let ecc0, ecc1, ecc2  = cc 0, cc 1, cc 2

    v, (e - b), ecc0 - bcc0, ecc1 - bcc1, ecc2 - bcc2

  let arrayTest n =
    Array.init (n + 1) id
    |> Array.map    int64
    |> Array.filter (fun v -> v % 2L = 0L)
    |> Array.map    ((+) 1L)
    |> Array.sum

  let imperativeTest n =
    let rec loop s i =
      if i >= 0L then
        if i % 2L = 0L then
          loop (s + i + 1L) (i - 1L)
        else
          loop s (i - 1L)
      else
        s
    loop 0L (int64 n)

  let linqTest n =
    (((Enumerable.Range(0, n + 1)).Select int64).Where(fun v -> v % 2L = 0L)).Select((+) 1L).Sum()

  let listTest n =
    List.init (n + 1) id
    |> List.map     int64
    |> List.filter  (fun v -> v % 2L = 0L)
    |> List.map     ((+) 1L)
    |> List.sum

#if USE_NESSOS
  let nessosTest n =
    Stream.initInfinite id
    |> Stream.take    (n + 1)
    |> Stream.map     int64
    |> Stream.filter  (fun v -> v % 2L = 0L)
    |> Stream.map     ((+) 1L)
    |> Stream.sum
#endif

  let pullTest n =
    PullStream.range 0 n
    |> PullStream.map     int64
    |> PullStream.filter  (fun v -> v % 2L = 0L)
    |> PullStream.map     ((+) 1L)
    |> PullStream.sum

  let pushTest n =
    PushStream.range 0 n
    |> PushStream.map     int64
    |> PushStream.filter  (fun v -> v % 2L = 0L)
    |> PushStream.map     ((+) 1L)
    |> PushStream.sum

  let seqTest n =
    Seq.init (n + 1) id
    |> Seq.map    int64
    |> Seq.filter (fun v -> v % 2L = 0L)
    |> Seq.map    ((+) 1L)
    |> Seq.sum

  let perfTest (path : string) =
    let testCases =
      [|
        "array"       , arrayTest       
        "imperative"  , imperativeTest  
        "linq"        , linqTest        
        "list"        , listTest        
        "seq"         , seqTest         
#if USE_NESSOS
        "nessos"      , nessosTest      
#endif
        "pull"        , pullTest        
        "push"        , pushTest        
      |]
    use out                   = new System.IO.StreamWriter (path)
    let write (msg : string)  = out.WriteLine msg
    let writef fmt            = FSharp.Core.Printf.kprintf write fmt

    write "Name\tTotal\tOuter\tInner\tElapsed\tCC0\tCC1\tCC2\tResult"

    let total   = 10000000
    let outers = [| 10; 1000; 1000000 |]
    for outer in outers do
      let inner = total / outer
      for name, a in testCases do
        printfn "Running %s with total=%d, outer=%d, inner=%d ..." name total outer inner
        let v, ms, cc0, cc1, cc2 = time outer (fun () -> a inner)
        printfn "  ... %d ms, cc0=%d, cc1=%d, cc2=%d, result=%A" ms cc0 cc1 cc2 v
        writef "%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d" name total outer inner ms cc0 cc1 cc2 v

[<EntryPoint>]
let main argv =
  System.Environment.CurrentDirectory <- System.AppDomain.CurrentDomain.BaseDirectory
  PerfTest.perfTest "perf.tsv"
  0

PDF - Download F# for free

Previous Next

F#

Fastest Entity Framework Extensions

Example

Got any F# Question?

F#

F# F# Performance Tips and Tricks Comparison of different F# data pipelines

Fastest Entity Framework Extensions

Example

Got any F# Question?