ROOT: Replaying Multithreaded Traces with Resource-Oriented Ordering

Zev Weiss

Tyler Harter

Andrea C. Arpaci-Dusseau

Remzi H. Arpaci-Dusseau

University of Wisconsin-Madison

Department of Computer Sciences

Introduce yourself and your coauthors.

Motivation

I/O trace replay...why?

  • Performance evaluation of new systems
  • Relevant workloads, not synthetic benchmarks

SOSP'11: "A File is not a File..." (Harter et al.)

  • iBench: 34 traces of Apple desktop applications
  • Interesting challenge: how to replay?
Why I/O trace replay? Perf eval, relevant workloads. Starting point: iBench, 34 syscall traces...how to replay? Synthetic benchmarks give you a workload, not your workload.

Difficulty

Modern workloads make trace replay more challenging

Why?

Threads.

More specifically, interactions between threads:

  • Performance
  • Correctness
Modern workloads make replay harder: threads, and the interactions between them (performance, correctness). Threads complicate everything. Performance: I/O completion order, feedback effects, snowballing. Correctness: example follows...

Multithreaded Replay

Multithreaded replay can be slightly tricky:

T1: open(path1, ...) = 3

T2: read(3, buf, 1K) = 1K

T1: open(path2, ...) = 4

T2: read(3, buf, 1K) = 1K

T1: close(3) = 0

T2: fstat(4) = 0

Mention simplicity before hitting "next".
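
One concrete wrinkle behind this example, sketched below (illustrative only, not ARTC's actual code): the descriptor values a replay obtains from open() need not equal the traced values, and here T2 uses an FD that T1 opened, so replay threads need a shared, lock-protected table from traced FDs to live FDs.

    #include <pthread.h>

    #define MAX_TRACED_FD 4096

    /* Shared across all replay threads: traced FD -> FD obtained at replay. */
    static int fd_map[MAX_TRACED_FD];
    static pthread_mutex_t fd_map_lock = PTHREAD_MUTEX_INITIALIZER;

    /* The thread replaying T1's open(path1) records the mapping for FD 3. */
    static void fd_map_set(int traced_fd, int replay_fd)
    {
        pthread_mutex_lock(&fd_map_lock);
        fd_map[traced_fd] = replay_fd;
        pthread_mutex_unlock(&fd_map_lock);
    }

    /* The thread replaying T2's read(3, ...) translates 3 before issuing it. */
    static int fd_map_get(int traced_fd)
    {
        pthread_mutex_lock(&fd_map_lock);
        int rfd = fd_map[traced_fd];
        pthread_mutex_unlock(&fd_map_lock);
        return rfd;
    }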

Multithreaded Replay

Sneakier problems:

T1: open(path1, O_TRUNC) = 3

T2: open(path1, O_RDONLY) = 4

T1: write(3, ..., 1K) = 1K

T2: read(4, ..., 1K) = 1K

* actual scenario!

Point out lack of inter-thread FD sharing. Start mention of non-contrivedness before re-merging trace threads.

What to do?

One possibility: preserve ordering

Maintains correctness...but very pessimistic! (Limits utility)

read(5, ..., 4K) = 4K
read(5, ..., 4K) = 4K
read(5, ..., 4K) = 4K
read(4, ..., 32K) = 32K
read(4, ..., 32K) = 32K
(Can back up animation to show original ordering after advancing to "target system" timing.)
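
To make "preserve ordering" concrete, here is a sketch (an assumption about how such a replayer might be built, not ARTC code) in which every replayed call waits for its global predecessor in the trace:

    #include <pthread.h>

    static pthread_mutex_t order_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t order_cond = PTHREAD_COND_INITIALIZER;
    static long next_global_event;    /* index of the next event in trace order */

    /* Block until it is this event's turn in the original global ordering. */
    static void await_global_turn(long my_index)
    {
        pthread_mutex_lock(&order_lock);
        while (next_global_event != my_index)
            pthread_cond_wait(&order_cond, &order_lock);
        pthread_mutex_unlock(&order_lock);
    }

    /* After issuing the call, hand the turn to the next trace event. */
    static void release_global_turn(void)
    {
        pthread_mutex_lock(&order_lock);
        next_global_event++;
        pthread_cond_broadcast(&order_cond);
        pthread_mutex_unlock(&order_lock);
    }

Under this scheme the 32K reads on FD 4 can never overlap the 4K reads on FD 5, even when the two descriptors refer to unrelated files; that is the pessimism the slide points out.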

Our Solution

ROOT: Resource-Oriented Ordering for Trace replay

Aim:

  • Enough ordering constraint to avoid errors,
  • but little enough to allow natural timing variations.

Correct replay and realistic performance!

Results

Implemented ROOT in ARTC replay system

Evaluation:

  • Four microbenchmarks
  • LevelDB macrobenchmarks

Microbenchmarks: ≤ 6% timing error, vs. up to 600%

LevelDB: 95% of original syscall concurrency, vs. 60%

"Half" the timing error, don't bother with literal numbers.

Outline

  • ROOT:
    • Resources vs. names
    • Ordering rules
  • ARTC: ROOT Implemented
  • Evaluation & results:
    • Replay performance
    • Timing effects of overconstraint
  • Conclusions

ROOT: Resources and Names

Resource-oriented: just what is a resource?

  • Files
  • File descriptors
  • Path names
  • AIO control blocks

Some referred to directly, others indirectly via names

  • Path → file
  • FD → file
  • AIOCB → FD → file
Other resource possibilities: parts of files; keys in a KV store, etc.
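
The distinction might be pictured with structs like these (field and type names are invented for illustration): names resolve, sometimes through other names, to an underlying resource.

    struct file_resource {            /* the underlying resource: a file */
        long trace_file_id;           /* identity of the file within the trace */
    };

    struct path_name {                /* path -> file */
        const char *path;
        struct file_resource *file;
    };

    struct fd_name {                  /* FD -> file */
        int traced_fd;
        struct file_resource *file;
    };

    struct aiocb_name {               /* AIOCB -> FD -> file */
        long traced_aiocb_id;
        struct fd_name *fd;
    };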

ROOT: Stage Ordering

All uses after creation, before destruction

Addresses FD use-after-close, etc.

open(path, ...) = 3


read(3, ..., 1K) = 1K


close(3) = 0
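
A sketch of enforcing this rule at replay time for one resource (an illustration of the idea, not ARTC's implementation): the read(3) above waits until the open() has been replayed, and the close(3) waits until every traced use of the file has completed.

    #include <pthread.h>

    struct resource_stage {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        int created;            /* creating call (the open) has been replayed */
        int remaining_uses;     /* traced uses not yet replayed; set up front */
    };

    /* The thread replaying the creating call marks the resource live. */
    static void create_done(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        r->created = 1;
        pthread_cond_broadcast(&r->cond);
        pthread_mutex_unlock(&r->lock);
    }

    /* A use (the read(3) above) waits until the resource exists. */
    static void use_wait(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        while (!r->created)
            pthread_cond_wait(&r->cond, &r->lock);
        pthread_mutex_unlock(&r->lock);
    }

    /* A finished use may be the last one the destruction is waiting on. */
    static void use_done(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        r->remaining_uses--;
        pthread_cond_broadcast(&r->cond);
        pthread_mutex_unlock(&r->lock);
    }

    /* Destruction (the close(3)) waits for creation and for all traced uses. */
    static void destroy_wait(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        while (!r->created || r->remaining_uses > 0)
            pthread_cond_wait(&r->cond, &r->lock);
        pthread_mutex_unlock(&r->lock);
    }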

ROOT: Sequential Ordering

All uses in same order as trace

Addresses file size problems, etc.

open(path1, O_TRUNC) = 3

open(path1, O_RDONLY) = 4

write(3, ..., 1K) = 1K


read(4, ..., 1K) = 1K
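
Sequential ordering can be sketched as the same kind of turn-taking gate shown earlier for full temporal ordering, but scoped to a single resource (again illustrative, not ARTC's code): the compiler numbers each use of path1's file in trace order, so the read through FD 4 cannot run until the write through FD 3 has finished, while calls on unrelated files remain free to overlap.

    #include <pthread.h>

    struct resource_order {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        long next_use;     /* trace-order index of the next permitted use */
    };

    /* Wait for this call's turn on the resource (indices assigned at compile time). */
    static void sequential_wait(struct resource_order *r, long my_use_index)
    {
        pthread_mutex_lock(&r->lock);
        while (r->next_use != my_use_index)
            pthread_cond_wait(&r->cond, &r->lock);
        pthread_mutex_unlock(&r->lock);
    }

    /* After the call completes, the next traced use of the resource may run. */
    static void sequential_done(struct resource_order *r)
    {
        pthread_mutex_lock(&r->lock);
        r->next_use++;
        pthread_cond_broadcast(&r->cond);
        pthread_mutex_unlock(&r->lock);
    }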

ROOT: Name Ordering

Preserves ordering of generations of a name

Addresses problems with lock files, etc.

creat("foo", O_EXCL)

unlink("foo")


creat("foo", O_EXCL)

unlink("foo")

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

What goes in: compiler inputs

ARTC components: compiler and replayer

What comes out: trace replay

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

Compiler inputs:

  • FS metadata snapshot
    • Paths, file sizes, symlinks...
  • Syscall trace:
    • Entry & return timestamps
    • Thread ID
    • Syscall type
    • Arguments
    • Return value
Front half: Mention FS snapshot is metadata only, no file contents.
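
The listed fields might be carried in a record shaped roughly like this (an illustrative layout, not ARTC's real input format):

    #include <stdint.h>

    struct trace_record {
        uint64_t entry_ns;      /* timestamp at syscall entry */
        uint64_t return_ns;     /* timestamp at syscall return */
        int32_t  thread_id;     /* traced thread that issued the call */
        int32_t  syscall_type;  /* which system call (open, read, ...) */
        int64_t  args[6];       /* argument values; string args (paths) would
                                   need separate storage in a real format */
        int64_t  retval;        /* return value recorded in the trace */
    };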

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

Compiler & replay engine:

  • ~16KLoC (C, Bison, Flex)
  • Compiles to arrays of C structs
  • Replayer loads *.so
  • Supports over 80 system calls
  • Cross-platform: Linux, OS X, FreeBSD, Illumos
Probably no need to explicitly mention LoC count. C code not of "executable" type, just static data. Output includes discovered dependency/ordering information.

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

Replay:

  • Restores FS state snapshot
  • Spawns replay threads
  • Enforces ordering constraints
  • Emulates unsupported syscalls
  • Tracks & reports timing stats
Mention that runtime only does dependency/ordering enforcement, all discovery is by compiler.
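
Pulling the steps above together, the per-thread replay loop might be structured like this sketch (the helper functions are placeholders invented for illustration; in ARTC all ordering discovery is done by the compiler, and the runtime only enforces it):

    struct replay_event;   /* a compiled trace event: syscall, args, ordering info */

    /* Placeholder hooks standing in for the real mechanisms: */
    void wait_for_constraints(struct replay_event *ev);  /* ROOT ordering rules */
    long issue_or_emulate(struct replay_event *ev);      /* native call or emulation */
    void record_timing(struct replay_event *ev, long ret);
    void mark_done(struct replay_event *ev);             /* let successors proceed */

    /* One replay thread walks the compiled events of one traced thread. */
    static void replay_thread(struct replay_event **events, int nevents)
    {
        for (int i = 0; i < nevents; i++) {
            struct replay_event *ev = events[i];
            wait_for_constraints(ev);
            long ret = issue_or_emulate(ev);
            record_timing(ev, ret);
            mark_done(ev);
        }
    }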

System Call Emulation

Linux, OS X, FreeBSD, Illumos: all UNIX...but all different!

Common syscalls compatible, others often not.

ARTC emulates trace semantics on target system API

  • OS X's exchangedata: link/rename/rename
  • Linux fsync vs. OS X fcntl(F_FULLFSYNC)
  • Extended attributes...
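
Spelling out the exchangedata case listed above: the link/rename/rename sequence might look like the following sketch (error cleanup and the temporary-name choice are simplified assumptions):

    #include <stdio.h>
    #include <unistd.h>

    /* Emulate OS X exchangedata(a, b): swap the files behind two paths. */
    static int emulate_exchangedata(const char *a, const char *b)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.xchg_tmp", a);   /* temporary name */

        if (link(a, tmp) != 0)     /* keep a's file reachable under tmp */
            return -1;
        if (rename(b, a) != 0)     /* b's file now appears at path a */
            return -1;
        if (rename(tmp, b) != 0)   /* the original a's file now at path b */
            return -1;
        return 0;
    }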

Limitations

Replay ordering:

  • Ensures syscall-level semantic correctness
  • Application-level semantics not guaranteed
  • Why? Lack of synchronization information

I/O via mmap not reproduced

Evaluation

Alternate strategies:

    Single-threaded

    Temporally-ordered

    Unconstrained

Two criteria:

    Semantic correctness

    Performance accuracy

Mention that correctness is measured based on syscall return values. Mention unconstrained being dropped due to poor correctness results.

Correctness

Workload: Magritte

  • Metadata intensive
  • Heavily multithreaded

Error counts on the iPhoto import400 trace (827,964 syscalls):

  • Single-threaded: 4
  • Temporally-ordered: 4
  • Unconstrained: 377,8673
  • ARTC: 7
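
As the earlier note says, an "error" here means a replayed call whose return value disagrees with the traced one; a minimal sketch of that check (function and variable names invented):

    /* Incremented whenever a replayed call's result diverges from the trace. */
    static long error_count;

    static void check_result(long traced_retval, long replay_retval)
    {
        if (replay_retval != traced_retval)
            __sync_fetch_and_add(&error_count, 1);   /* thread-safe count */
    }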

Workloads

Microbenchmarks:

  • Disk parallelism
  • Cache size
  • Same system
  • I/O scheduling

LevelDB macrobenchmarks

Performance Microbenchmarks: Disk Parallelism

Two random-read threads, 1xHDD → 2xHDD RAID 0

Explain bars one at a time

Performance Microbenchmarks: Cache Size

Two random-read threads, one with seq. warmup, 4GB → 1.5GB