Goals of this talk
- Familiarize you with Floorplan layout operators
- Motivate customized memory managers as a domain lacking in language support
- Not covered: a workshop to learn Floorplan syntax, full compilation semantics, or performance results — read the paper for those
Garbage collection: the good
Modern GCs are generational because most programs have two kinds of objects:
- Applications exhibit high allocation rates and objects die young — lots of ephemeral data
- Long-lived objects survive multiple minor GCs — small to moderate amounts of enduring data
Why customize: three ways GC performance can fail
GC performance degrades on non-generational and specialized workloads in three ways:
- Way 1: Object sizes & layout
- Way 2: Object lifetimes & allocation patterns
- Way 3: Locality and access patterns
Way 1: Object sizes — streaming science
In streaming algorithms, object size determines how many entries fit in a fixed memory budget, which bounds the window size and therefore accuracy. For example, k-means accuracy correlates with available memory and inversely with runtime. A Rust sensor message type:
use std::mem::size_of;

type Addr = *const u8; // assumed here: a raw byte pointer into the heap
type Mac = [u8; 6];
type Data = [u8; 13];

pub struct Msg { ptr: Addr }

impl Msg {
    // Size and offsets computed by hand; every constant must be kept
    // in sync with the others whenever the layout changes.
    const SIZE: usize = Self::ID_SIZE + Self::POWER_SIZE
        + Self::MAC_SIZE + Self::EPOCH_SIZE + Self::DATA_SIZE;
    const ID_OFFSET: usize = 0;
    const ID_SIZE: usize = size_of::<u32>(); // id: 4 bytes
    // ...
}
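Every field access then repeats this arithmetic. A minimal sketch of such a hand-written accessor, assuming Addr is a raw byte pointer as above (the id method is hypothetical, not from the original codebase):

impl Msg {
    // Read the 4-byte sensor id at its hand-computed offset.
    pub fn id(&self) -> u32 {
        unsafe { (self.ptr.add(Self::ID_OFFSET) as *const u32).read_unaligned() }
    }
}

Each such accessor silently depends on the size of every field before it; changing one field invalidates them all.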
Way 1: Object sizes — JVM runtime code
JVM runtime code has ad hoc offset calculations scattered across the codebase:
int SCALAR_HEADER_SIZE = JAVA_HEADER_BYTES + OTHER_HEADER_BYTES;
int ARRAY_HEADER_SIZE = SCALAR_HEADER_SIZE + ARRAY_LENGTH_BYTES;
Offset TIB_OFFSET = JAVA_HEADER_OFFSET;
Offset STATUS_OFFSET = TIB_OFFSET.plus(STATUS_BYTES);
A bug in one of these offset calculations was discovered and fixed in JikesRVM in 2018 (PR on GitHub).
Way 1: Floorplan specification
A Floorplan specification for the sensor message layout above:
Msg -> seq {
  id: 4 bytes,    // Sensor identifier
  epoch: 8 bytes, // Unix epoch time
  Mac,            // Bluetooth MAC address
  power: 1 bytes, // RSSI
  data: 13 bytes, // Portion of BLE packet data
}
Mac -> 6 bytes
The specification makes the layout explicit and checkable, replacing the ad hoc arithmetic currently spread across source files.
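From a spec like this, a layout compiler can derive every size and offset in one place. A sketch of the kind of constants such generated code contains (names illustrative, not Floorplan's actual generated API):

pub const BYTES_IN_MAC: usize = 6;
pub const BYTES_IN_MSG: usize = 4 + 8 + BYTES_IN_MAC + 1 + 13; // 32 bytes

pub const ID_OFFSET: usize = 0;
pub const EPOCH_OFFSET: usize = ID_OFFSET + 4;
pub const MAC_OFFSET: usize = EPOCH_OFFSET + 8;
pub const POWER_OFFSET: usize = MAC_OFFSET + BYTES_IN_MAC;
pub const DATA_OFFSET: usize = POWER_OFFSET + 1;

A change to the spec regenerates all of these consistently, instead of requiring edits across source files.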
Way 2: Object lifetimes — sliding window workloads
Temporal sliding-window data structures have non-generational allocation patterns: new entries arrive at one end while old ones are evicted from the other. A naïve FIFO linked list incurs O(n) cost per collection. What we want is a GC that acts like a ring buffer in the common case (O(1) allocation and eviction) but still supports arbitrary eviction order.
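A minimal sketch of the desired common-case behavior, as a fixed-capacity ring (illustrative only, not Floorplan output):

struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize, // next slot to fill
    tail: usize, // oldest live entry
}

impl<T> Ring<T> {
    fn new(cap: usize) -> Self {
        Self { slots: (0..cap).map(|_| None).collect(), head: 0, tail: 0 }
    }
    // O(1) allocation: claim the next slot and advance the head
    // (sketch ignores the full-ring case).
    fn push(&mut self, v: T) {
        self.slots[self.head] = Some(v);
        self.head = (self.head + 1) % self.slots.len();
    }
    // O(1) eviction: free the oldest entry and advance the tail.
    fn evict(&mut self) -> Option<T> {
        let v = self.slots[self.tail].take();
        self.tail = (self.tail + 1) % self.slots.len();
        v
    }
}

Reclaiming whole blocks in arrival order gives the same O(1) behavior at heap granularity, which is what the block specification below arranges.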
Way 2: Object lifetimes — block allocator
A Floorplan specification for a heap that aligns Msg objects within blocks:
Heap -> # Block
Block -> @|2^21 bytes|@ -> # Msg
This specification fixes how many Msg objects fit per 2^21-byte block (computable from the generated BYTES_IN_MSG constant), enabling the GC to collect entire blocks (ring-buffer style) rather than individual objects.
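A sketch of the capacity arithmetic this enables, reusing the BYTES_IN_MSG constant sketched earlier (names illustrative):

pub const BYTES_IN_BLOCK: usize = 1 << 21;
pub const MSGS_PER_BLOCK: usize = BYTES_IN_BLOCK / BYTES_IN_MSG;

Because block size and Msg size are both compile-time constants, per-block bookkeeping (such as the bitmap shown later) can also be sized statically.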
What we do currently — and its costs
Today's approaches to specialized memory layout:
- Apache Spark: hides individual data entries from the GC by managing byte arrays manually, at the cost of deserialization overhead
- NumPy: memory-maps out-of-core data, but the format is documented only in prose comments
- HDF5: a general-purpose layout framework, but adoption is slow across ecosystems
The common pattern: layout assumptions are either implicit (scattered arithmetic), verbal (comments), or framework-specific (not portable).
Way 3: Access patterns — bitmaps
Computing a Msg address's lookup-table entry, using address types and accessors generated by Floorplan:
// Computes the address of the given Msg ptr's lookup table
// entry, and the mask for accessing the appropriate bit.
#[inline(always)]
fn table_addr(ptr: MsgAddr) -> (BitsAddr, BitsMask) {
    // Round the Msg pointer down to its enclosing block.
    let block: BlockAddr = ptr.get_containing_BlockAddr();
    // Start of the block's contiguous run of Msg objects.
    let msgs: ObjsAddr = block.msgs();
    // Index of this Msg within the block.
    let idx: usize = msgs.get_idx_of(ptr);
    // ...
}
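The elided tail turns idx into a byte address and a bit mask. A standalone sketch of that arithmetic over plain usize addresses (hypothetical helper; one bit per Msg):

// One bit per Msg: byte offset idx / 8, mask selecting bit idx % 8.
fn bit_entry(table_base: usize, idx: usize) -> (usize, u8) {
    (table_base + idx / 8, 1u8 << (idx % 8))
}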
Semantics not covered in this talk
- Byte order
- Macros
- Architectural constants
- Performance characteristics
- Reduction semantics defining a Floorplan type
- Compiler implementation details
- Coq proofs verifying certain language-level properties
Tradeoffs of custom allocators
Writing a custom allocator introduces costs that Floorplan helps manage but does not eliminate:
- New spatially optimized data structures must be written with custom collection in mind (Floorplan handles the types, not the algorithms)
- Bookkeeping burden for each custom memory manager
- Custom interfaces needed for memory debuggers (Valgrind, GDB, Purify)
- Sunk cost: custom allocators are difficult to write and maintain
- Risk of preemptively optimizing for microbenchmarks that are not actually representative
Additional slides (related work)
Heap Layers
Heap Layers is a C++ framework for composing allocators from reusable layer objects, described by Berger et al. in "Composing High-Performance Memory Allocators" (PLDI 2001). Floorplan operates at a higher level of abstraction: it specifies spatial structure rather than composing allocator behaviors.
GHC block descriptors
GHC's block allocator uses a highly optimized lookup-table computation: Bdescr(p) = (((p & (2^m - 1)) >> k) << d) | (p & ~(2^m - 1)), where 2^m is the megablock size, k the log2 block size, and d the log2 descriptor size.
This implicitly encodes a spatial layout assumption that Floorplan would make explicit.
See the GHC wiki.
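A Rust sketch of that computation, with illustrative constants (GHC uses 2^20-byte megablocks, 2^12-byte blocks, and 2^6-byte descriptors; this is a paraphrase, not GHC's actual macro):

const M: u32 = 20; // log2(megablock size)
const K: u32 = 12; // log2(block size)
const D: u32 = 6;  // log2(block descriptor size)

// Map a pointer to the address of its block descriptor: descriptors
// for all blocks in a megablock live in a table at the megablock base.
fn bdescr(p: usize) -> usize {
    let mblock_mask = (1usize << M) - 1;
    (((p & mblock_mask) >> K) << D) | (p & !mblock_mask)
}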
PADS-Haskell: layouts for parsing
PADS describes file layouts declaratively, generating parsers and printers automatically. Floorplan describes heap layouts. The two approaches are complementary — PADS parses external data into heap structures whose layout Floorplan can specify.
newtype MapsFile = MapsFile ([Line Region] terminator EOF)
data RegionName =
    Heap     "[heap]"
  | Stack    "[stack]"
  | VDSO     "[vdso]"
  | VVAR     "[vvar]"
  | VSyscall "[vsyscall]"