ROOT: Replaying Multithreaded Traces with Resource-Oriented Ordering

Zev Weiss

Tyler Harter

Andrea C. Arpaci-Dusseau

Remzi H. Arpaci-Dusseau

University of Wisconsin-Madison

Department of Computer Sciences

Introduce yourself and your coauthors.

Motivation

I/O trace replay...why?

  • Performance evaluation of new systems
  • Relevant workloads, not synthetic benchmarks

SOSP'11: "A File is not a File..." (Harter et al.)

  • iBench: 34 traces of Apple desktop applications
  • Interesting challenge: how to replay?
Why I/O trace replay? Perf eval, relevant workloads. Starting point: iBench, 34 syscall traces...how to replay? Synthetic benchmarks give you a workload, not your workload.

Difficulty

Modern workloads make trace replay more challenging

Why?

Threads.

More specifically, interactions between threads:

  • Performance
  • Correctness
Modern workloads make replay harder: threads, and the interactions between them (performance, correctness). Threads complicate everything. Performance: I/O completion order, feedback effects, snowballing. Correctness: example follows...

Multithreaded Replay

Multithreaded replay can be slightly tricky:

T1: open(path1, ...) = 3

T2: read(3, buf, 1K) = 1K

T1: open(path2, ...) = 4

T2: read(3, buf, 1K) = 1K

T1: close(3) = 0

T2: fstat(4) = 0

Mention simplicity before hitting "next".
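
One concrete wrinkle behind this example, sketched below (illustrative only, not ARTC's actual code): the descriptor values a replay obtains from open() need not equal the traced values, and here T2 uses an FD that T1 opened, so replay threads need a shared, lock-protected table from traced FDs to live FDs.

    #include <pthread.h>

    #define MAX_TRACED_FD 4096

    /* Shared across all replay threads: traced FD -> FD obtained at replay. */
    static int fd_map[MAX_TRACED_FD];
    static pthread_mutex_t fd_map_lock = PTHREAD_MUTEX_INITIALIZER;

    /* The thread replaying T1's open(path1) records the mapping for FD 3. */
    static void fd_map_set(int traced_fd, int replay_fd)
    {
        pthread_mutex_lock(&fd_map_lock);
        fd_map[traced_fd] = replay_fd;
        pthread_mutex_unlock(&fd_map_lock);
    }

    /* The thread replaying T2's read(3, ...) translates 3 before issuing it. */
    static int fd_map_get(int traced_fd)
    {
        pthread_mutex_lock(&fd_map_lock);
        int rfd = fd_map[traced_fd];
        pthread_mutex_unlock(&fd_map_lock);
        return rfd;
    }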

Multithreaded Replay

Sneakier problems:

T1: open(path1, O_TRUNC) = 3

T2: open(path1, O_RDONLY) = 4

T1: write(3, ..., 1K) = 1K

T2: read(4, ..., 1K) = 1K

* actual scenario!

Point out lack of inter-thread FD sharing. Start mention of non-contrivedness before re-merging trace threads.

What to do?

One possibility: preserve ordering

Maintains correctness...but very pessimistic! (Limits utility)

read(5, ..., 4K) = 4K
read(5, ..., 4K) = 4K
read(5, ..., 4K) = 4K
read(4, ..., 32K) = 32K
read(4, ..., 32K) = 32K
(Can back up animation to show original ordering after advancing to "target system" timing.)
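
To make "preserve ordering" concrete, here is a sketch (an assumption about how such a replayer might be built, not ARTC code) in which every replayed call waits for its global predecessor in the trace:

    #include <pthread.h>

    static pthread_mutex_t order_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t order_cond = PTHREAD_COND_INITIALIZER;
    static long next_global_event;    /* index of the next event in trace order */

    /* Block until it is this event's turn in the original global ordering. */
    static void await_global_turn(long my_index)
    {
        pthread_mutex_lock(&order_lock);
        while (next_global_event != my_index)
            pthread_cond_wait(&order_cond, &order_lock);
        pthread_mutex_unlock(&order_lock);
    }

    /* After issuing the call, hand the turn to the next trace event. */
    static void release_global_turn(void)
    {
        pthread_mutex_lock(&order_lock);
        next_global_event++;
        pthread_cond_broadcast(&order_cond);
        pthread_mutex_unlock(&order_lock);
    }

Under this scheme the 32K reads on FD 4 can never overlap the 4K reads on FD 5, even when the two descriptors refer to unrelated files; that is the pessimism the slide points out.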

Our Solution

ROOT: Resource-Oriented Ordering for Trace replay

Aim:

  • Enough ordering constraint to avoid errors,
  • but little enough to allow natural timing variations.

Correct replay and realistic performance!

Results

Implemented ROOT in ARTC replay system

Evaluation:

  • Four microbenchmarks
  • LevelDB macrobenchmarks

Microbenchmarks: ≤ 6% timing error, vs. up to 600%

LevelDB: 95% of original syscall concurrency, vs. 60%

"Half" the timing error, don't bother with literal numbers.

Outline

  • ROOT:
    • Resources vs. names
    • Ordering rules
  • ARTC: ROOT Implemented
  • Evaluation & results:
    • Replay performance
    • Timing effects of overconstraint
  • Conclusions

ROOT: Resources and Names

Resource-oriented: just what is a resource?

  • Files
  • File descriptors
  • Path names
  • AIO control blocks

Some referred to directly, others indirectly via names

  • Path → file
  • FD → file
  • AIOCB → FD → file
Other resource possibilities: parts of files; keys in a KV store, etc.
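
The distinction might be pictured with structs like these (field and type names are invented for illustration): names resolve, sometimes through other names, to an underlying resource.

    struct file_resource {            /* the underlying resource: a file */
        long trace_file_id;           /* identity of the file within the trace */
    };

    struct path_name {                /* path -> file */
        const char *path;
        struct file_resource *file;
    };

    struct fd_name {                  /* FD -> file */
        int traced_fd;
        struct file_resource *file;
    };

    struct aiocb_name {               /* AIOCB -> FD -> file */
        long traced_aiocb_id;
        struct fd_name *fd;
    };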

ROOT: Stage Ordering

All uses after creation, before destruction

Addresses FD use-after-close, etc.

open(path, ...) = 3


read(3, ..., 1K) = 1K


close(3) = 0
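
A sketch of enforcing this rule at replay time for one resource (an illustration of the idea, not ARTC's implementation): the read(3) above waits until the open() has been replayed, and the close(3) waits until every traced use of the file has completed.

    #include <pthread.h>

    struct resource_stage {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        int created;            /* creating call (the open) has been replayed */
        int remaining_uses;     /* traced uses not yet replayed; set up front */
    };

    /* The thread replaying the creating call marks the resource live. */
    static void create_done(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        r->created = 1;
        pthread_cond_broadcast(&r->cond);
        pthread_mutex_unlock(&r->lock);
    }

    /* A use (the read(3) above) waits until the resource exists. */
    static void use_wait(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        while (!r->created)
            pthread_cond_wait(&r->cond, &r->lock);
        pthread_mutex_unlock(&r->lock);
    }

    /* A finished use may be the last one the destruction is waiting on. */
    static void use_done(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        r->remaining_uses--;
        pthread_cond_broadcast(&r->cond);
        pthread_mutex_unlock(&r->lock);
    }

    /* Destruction (the close(3)) waits for creation and for all traced uses. */
    static void destroy_wait(struct resource_stage *r)
    {
        pthread_mutex_lock(&r->lock);
        while (!r->created || r->remaining_uses > 0)
            pthread_cond_wait(&r->cond, &r->lock);
        pthread_mutex_unlock(&r->lock);
    }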

ROOT: Sequential Ordering

All uses in same order as trace

Addresses file size problems, etc.

open(path1, O_TRUNC) = 3

open(path1, O_RDONLY) = 4

write(3, ..., 1K) = 1K


read(4, ..., 1K) = 1K
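
Sequential ordering can be sketched as the same kind of turn-taking gate shown earlier for full temporal ordering, but scoped to a single resource (again illustrative, not ARTC's code): the compiler numbers each use of path1's file in trace order, so the read through FD 4 cannot run until the write through FD 3 has finished, while calls on unrelated files remain free to overlap.

    #include <pthread.h>

    struct resource_order {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        long next_use;     /* trace-order index of the next permitted use */
    };

    /* Wait for this call's turn on the resource (indices assigned at compile time). */
    static void sequential_wait(struct resource_order *r, long my_use_index)
    {
        pthread_mutex_lock(&r->lock);
        while (r->next_use != my_use_index)
            pthread_cond_wait(&r->cond, &r->lock);
        pthread_mutex_unlock(&r->lock);
    }

    /* After the call completes, the next traced use of the resource may run. */
    static void sequential_done(struct resource_order *r)
    {
        pthread_mutex_lock(&r->lock);
        r->next_use++;
        pthread_cond_broadcast(&r->cond);
        pthread_mutex_unlock(&r->lock);
    }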

ROOT: Name Ordering

Preserves ordering of generations of a name

Addresses problems with lock files, etc.

creat("foo", O_EXCL)

unlink("foo")


creat("foo", O_EXCL)

unlink("foo")

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

What goes in: compiler inputs

ARTC components: compiler and replayer

What comes out: trace replay

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

Compiler inputs:

  • FS metadata snapshot
    • Paths, file sizes, symlinks...
  • Syscall trace:
    • Entry & return timestamps
    • Thread ID
    • Syscall type
    • Arguments
    • Return value
Front half: Mention FS snapshot is metadata only, no file contents.
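
The listed fields might be carried in a record shaped roughly like this (an illustrative layout, not ARTC's real input format):

    #include <stdint.h>

    struct trace_record {
        uint64_t entry_ns;      /* timestamp at syscall entry */
        uint64_t return_ns;     /* timestamp at syscall return */
        int32_t  thread_id;     /* traced thread that issued the call */
        int32_t  syscall_type;  /* which system call (open, read, ...) */
        int64_t  args[6];       /* argument values; string args (paths) would
                                   need separate storage in a real format */
        int64_t  retval;        /* return value recorded in the trace */
    };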

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

Compiler & replay engine:

  • ~16KLoC (C, Bison, Flex)
  • Compiles to arrays of C structs
  • Replayer loads *.so
  • Supports over 80 system calls
  • Cross-platform: Linux, OS X, FreeBSD, Illumos
Probably no need to explicitly mention LoC count. C code not of "executable" type, just static data. Output includes discovered dependency/ordering information.

ARTC: ROOT Implemented

An Approximate-Replay Trace Compiler

Replay:

  • Restores FS state snapshot
  • Spawns replay threads
  • Enforces ordering constraints
  • Emulates unsupported syscalls
  • Tracks & reports timing stats
Mention that runtime only does dependency/ordering enforcement, all discovery is by compiler.
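
Pulling the steps above together, the per-thread replay loop might be structured like this sketch (the helper functions are placeholders invented for illustration; in ARTC all ordering discovery is done by the compiler, and the runtime only enforces it):

    struct replay_event;   /* a compiled trace event: syscall, args, ordering info */

    /* Placeholder hooks standing in for the real mechanisms: */
    void wait_for_constraints(struct replay_event *ev);  /* ROOT ordering rules */
    long issue_or_emulate(struct replay_event *ev);      /* native call or emulation */
    void record_timing(struct replay_event *ev, long ret);
    void mark_done(struct replay_event *ev);             /* let successors proceed */

    /* One replay thread walks the compiled events of one traced thread. */
    static void replay_thread(struct replay_event **events, int nevents)
    {
        for (int i = 0; i < nevents; i++) {
            struct replay_event *ev = events[i];
            wait_for_constraints(ev);
            long ret = issue_or_emulate(ev);
            record_timing(ev, ret);
            mark_done(ev);
        }
    }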

System Call Emulation

Linux, OS X, FreeBSD, Illumos: all UNIX...but all different!

Common syscalls compatible, others often not.

ARTC emulates trace semantics on target system API

  • OS X's exchangedata: link/rename/rename
  • Linux fsync vs. OS X fcntl(F_FULLFSYNC)
  • Extended attributes...
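
Spelling out the exchangedata case listed above: the link/rename/rename sequence might look like the following sketch (error cleanup and the temporary-name choice are simplified assumptions):

    #include <stdio.h>
    #include <unistd.h>

    /* Emulate OS X exchangedata(a, b): swap the files behind two paths. */
    static int emulate_exchangedata(const char *a, const char *b)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.xchg_tmp", a);   /* temporary name */

        if (link(a, tmp) != 0)     /* keep a's file reachable under tmp */
            return -1;
        if (rename(b, a) != 0)     /* b's file now appears at path a */
            return -1;
        if (rename(tmp, b) != 0)   /* the original a's file now at path b */
            return -1;
        return 0;
    }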

Limitations

Replay ordering:

  • Ensures syscall-level semantic correctness
  • Application-level semantics not guaranteed
  • Why? Lack of synchronization information

I/O via mmap not reproduced

Evaluation

Alternate strategies:

    Single-threaded

    Temporally-ordered

    Unconstrained

Two criteria:

    Semantic correctness

    Performance accuracy

Mention that correctness is measured based on syscall return values. Mention unconstrained being dropped due to poor correctness results.

Correctness

Workload: Magritte

  • Metadata intensive
  • Heavily multithreaded

Error counts on the iPhoto import400 trace (827,964 syscalls):

  • Single-threaded: 4
  • Temporally-ordered: 4
  • Unconstrained: 377,8673
  • ARTC: 7
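
As the earlier note says, an "error" here means a replayed call whose return value disagrees with the traced one; a minimal sketch of that check (function and variable names invented):

    /* Incremented whenever a replayed call's result diverges from the trace. */
    static long error_count;

    static void check_result(long traced_retval, long replay_retval)
    {
        if (replay_retval != traced_retval)
            __sync_fetch_and_add(&error_count, 1);   /* thread-safe count */
    }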

Workloads

Microbenchmarks:

  • Disk parallelism
  • Cache size
  • Same system
  • I/O scheduling

LevelDB macrobenchmarks

Performance Microbenchmarks: Disk Parallelism

Two random-read threads, 1xHDD → 2xHDD RAID 0

Explain bars one at a time

Performance Microbenchmarks: Cache Size

Two random-read threads, one with seq. warmup, 4GB → 1.5GB