Goals of this talk
- Familiarize you with Floorplan layout operators
- Motivate customized memory managers as a domain lacking in language support
- Not covered: a workshop to learn Floorplan syntax, full compilation semantics, or performance results — read the paper for those
Garbage collection: the good
Modern GCs are generational because most programs have two kinds of objects:
- Applications exhibit high allocation rates and objects die young — lots of ephemeral data
- Long-lived objects survive multiple minor GCs — small to moderate amounts of enduring data
Why customize: three ways GC performance can fail
GC performance degrades on non-generational and specialized workloads in three ways:
- Way 1: Object sizes & layout
- Way 2: Object lifetimes & allocation patterns
- Way 3: Locality and access patterns
Way 1: Object sizes — streaming science
In streaming algorithms, object size determines how many entries fit in a fixed memory budget, which bounds the window size and therefore accuracy. For example, k-means accuracy correlates with available memory and inversely with runtime. A Rust sensor message type:
use std::mem::size_of;

type Addr = *const u8; // assumed here: a raw byte pointer into the heap
type Mac = [u8; 6];
type Data = [u8; 13];

pub struct Msg { ptr: Addr }

impl Msg {
    // Size and offsets computed by hand; every constant must be kept
    // in sync with the others whenever the layout changes.
    const SIZE: usize = Self::ID_SIZE + Self::POWER_SIZE
        + Self::MAC_SIZE + Self::EPOCH_SIZE + Self::DATA_SIZE;
    const ID_OFFSET: usize = 0;
    const ID_SIZE: usize = size_of::<u32>(); // id: 4 bytes
    // ...
}
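Every field access then repeats this arithmetic. A minimal sketch of such a hand-written accessor, assuming Addr is a raw byte pointer as above (the id method is hypothetical, not from the original codebase):

impl Msg {
    // Read the 4-byte sensor id at its hand-computed offset.
    pub fn id(&self) -> u32 {
        unsafe { (self.ptr.add(Self::ID_OFFSET) as *const u32).read_unaligned() }
    }
}

Each such accessor silently depends on the size of every field before it; changing one field invalidates them all.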
Way 1: Object sizes — JVM runtime code
JVM runtime code has ad hoc offset calculations scattered across the codebase:
int SCALAR_HEADER_SIZE = JAVA_HEADER_BYTES + OTHER_HEADER_BYTES;
int ARRAY_HEADER_SIZE = SCALAR_HEADER_SIZE + ARRAY_LENGTH_BYTES;
Offset TIB_OFFSET = JAVA_HEADER_OFFSET;
Offset STATUS_OFFSET = TIB_OFFSET.plus(STATUS_BYTES);
A bug in one of these offset calculations was discovered and fixed in JikesRVM in 2018 (PR on GitHub).
Way 1: Floorplan specification
A Floorplan specification for the sensor message layout above:
Msg -> seq {
  id: 4 bytes,    // Sensor identifier
  epoch: 8 bytes, // Unix epoch time
  Mac,            // Bluetooth MAC address
  power: 1 bytes, // RSSI
  data: 13 bytes, // Portion of BLE packet data
}
Mac -> 6 bytes
The specification makes the layout explicit and checkable, replacing the ad hoc arithmetic currently spread across source files.
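From a spec like this, a layout compiler can derive every size and offset in one place. A sketch of the kind of constants such generated code contains (names illustrative, not Floorplan's actual generated API):

pub const BYTES_IN_MAC: usize = 6;
pub const BYTES_IN_MSG: usize = 4 + 8 + BYTES_IN_MAC + 1 + 13; // 32 bytes

pub const ID_OFFSET: usize = 0;
pub const EPOCH_OFFSET: usize = ID_OFFSET + 4;
pub const MAC_OFFSET: usize = EPOCH_OFFSET + 8;
pub const POWER_OFFSET: usize = MAC_OFFSET + BYTES_IN_MAC;
pub const DATA_OFFSET: usize = POWER_OFFSET + 1;

A change to the spec regenerates all of these consistently, instead of requiring edits across source files.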
Way 2: Object lifetimes — sliding window workloads
Temporal sliding-window data structures have non-generational allocation patterns: new entries arrive at one end while old ones are evicted from the other. A naïve FIFO linked list incurs O(n) cost per collection. What we want is a GC that acts like a ring buffer in the common case (O(1) allocation and eviction) but still supports arbitrary eviction order.
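A minimal sketch of the desired common-case behavior, as a fixed-capacity ring (illustrative only, not Floorplan output):

struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize, // next slot to fill
    tail: usize, // oldest live entry
}

impl<T> Ring<T> {
    fn new(cap: usize) -> Self {
        Self { slots: (0..cap).map(|_| None).collect(), head: 0, tail: 0 }
    }
    // O(1) allocation: claim the next slot and advance the head
    // (sketch ignores the full-ring case).
    fn push(&mut self, v: T) {
        self.slots[self.head] = Some(v);
        self.head = (self.head + 1) % self.slots.len();
    }
    // O(1) eviction: free the oldest entry and advance the tail.
    fn evict(&mut self) -> Option<T> {
        let v = self.slots[self.tail].take();
        self.tail = (self.tail + 1) % self.slots.len();
        v
    }
}

Reclaiming whole blocks in arrival order gives the same O(1) behavior at heap granularity, which is what the block specification below arranges.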
Way 2: Object lifetimes — block allocator
A Floorplan specification for a heap that aligns Msg objects within blocks:
Heap -> # Block
Block -> @|2^21 bytes|@ -> # Msg
This specification fixes how many Msg objects fit per 2^21-byte block (computable from the generated BYTES_IN_MSG constant), enabling the GC to collect entire blocks (ring-buffer style) rather than individual objects.
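A sketch of the capacity arithmetic this enables, reusing the BYTES_IN_MSG constant sketched earlier (names illustrative):

pub const BYTES_IN_BLOCK: usize = 1 << 21;
pub const MSGS_PER_BLOCK: usize = BYTES_IN_BLOCK / BYTES_IN_MSG;

Because block size and Msg size are both compile-time constants, per-block bookkeeping (such as the bitmap shown later) can also be sized statically.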
What we do currently — and its costs
Today's approaches to specialized memory layout:
- Apache Spark: hides individual data entries from the GC by managing byte arrays manually, at the cost of deserialization overhead
- NumPy: memory-maps out-of-core data, but the format is documented only in prose comments
- HDF5: a general-purpose layout framework, but adoption is slow across ecosystems
The common pattern: layout assumptions are either implicit (scattered arithmetic), verbal (comments), or framework-specific (not portable).
Way 3: Access patterns — bitmaps
Computing a Msg address's lookup-table entry, using address types and accessors generated by Floorplan:
// Computes the address of the given Msg ptr's lookup table
// entry, and the mask for accessing the appropriate bit.
#[inline(always)]
fn table_addr(ptr: MsgAddr) -> (BitsAddr, BitsMask) {
    // Round the Msg pointer down to its enclosing block.
    let block: BlockAddr = ptr.get_containing_BlockAddr();
    // Start of the block's contiguous run of Msg objects.
    let msgs: ObjsAddr = block.msgs();
    // Index of this Msg within the block.
    let idx: usize = msgs.get_idx_of(ptr);
    // ...
}
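The elided tail turns idx into a byte address and a bit mask. A standalone sketch of that arithmetic over plain usize addresses (hypothetical helper; one bit per Msg):

// One bit per Msg: byte offset idx / 8, mask selecting bit idx % 8.
fn bit_entry(table_base: usize, idx: usize) -> (usize, u8) {
    (table_base + idx / 8, 1u8 << (idx % 8))
}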
Semantics not covered in this talk
- Byte order
- Macros
- Architectural constants
- Performance characteristics
- Reduction semantics defining a Floorplan type
- Compiler implementation details
- Coq proofs verifying certain language-level properties
Tradeoffs of custom allocators
Writing a custom allocator introduces costs that Floorplan helps manage but does not eliminate:
- New spatially optimized data structures must be written with custom collection in mind (Floorplan handles the types, not the algorithms)
- Bookkeeping burden for each custom memory manager
- Custom interfaces needed for memory debuggers (Valgrind, GDB, Purify)
- Sunk cost: custom allocators are difficult to write and maintain
- Risk of preemptively optimizing for microbenchmarks that are not actually representative
Additional slides (related work)
Heap Layers
Heap Layers is a C++ framework for composing allocators from reusable layer objects, described by Berger et al. in "Composing High-Performance Memory Allocators" (PLDI 2001). Floorplan operates at a higher level of abstraction: it specifies spatial structure rather than composing allocator behaviors.
GHC block descriptors
GHC's block allocator uses a highly optimized lookup-table computation: Bdescr(p) = (((p & (2^m - 1)) >> k) << d) | (p & ~(2^m - 1)), where 2^m is the megablock size, k the log2 block size, and d the log2 descriptor size.
This implicitly encodes a spatial layout assumption that Floorplan would make explicit.
See the GHC wiki.
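A Rust sketch of that computation, with illustrative constants (GHC uses 2^20-byte megablocks, 2^12-byte blocks, and 2^6-byte descriptors; this is a paraphrase, not GHC's actual macro):

const M: u32 = 20; // log2(megablock size)
const K: u32 = 12; // log2(block size)
const D: u32 = 6;  // log2(block descriptor size)

// Map a pointer to the address of its block descriptor: descriptors
// for all blocks in a megablock live in a table at the megablock base.
fn bdescr(p: usize) -> usize {
    let mblock_mask = (1usize << M) - 1;
    (((p & mblock_mask) >> K) << D) | (p & !mblock_mask)
}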
PADS-Haskell: layouts for parsing
PADS describes file layouts declaratively, generating parsers and printers automatically. Floorplan describes heap layouts. The two approaches are complementary — PADS parses external data into heap structures whose layout Floorplan can specify.
newtype MapsFile = MapsFile ([Line Region] terminator EOF)
data RegionName =
    Heap     "[heap]"
  | Stack    "[stack]"
  | VDSO     "[vdso]"
  | VVAR     "[vvar]"
  | VSyscall "[vsyscall]"