Implementation Details
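The core challenge of EmbedDb was implementing a robust binary format for the Sorted String Tables (SSTables). Unlike text-based formats such as CSV or JSON, a length-prefixed binary layout lets a reader skip over an entry without parsing it, which is what makes random access practical, and it avoids the delimiter and escaping overhead of text. The flush routine writes a 4-byte entry count followed by the length-prefixed key/value pairs: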
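    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>

    void SSTable::flush(const std::map<std::string, std::string>& data) {
        std::ofstream file(filename, std::ios::binary);

        // Header: Entry Count (explicit cast; size() returns size_t)
        const auto count = static_cast<std::uint32_t>(data.size());
        file.write(reinterpret_cast<const char*>(&count), sizeof(count));

        // Body: [K_Len][Key][V_Len][Val]...
        // std::map iterates in key order, so entries land on disk sorted.
        for (const auto& [key, val] : data) {
            const auto k_len = static_cast<std::uint32_t>(key.size());
            const auto v_len = static_cast<std::uint32_t>(val.size());
            file.write(reinterpret_cast<const char*>(&k_len), sizeof(k_len));
            file.write(key.data(), k_len);
            ...
        }
    }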
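To make the on-disk layout concrete, here is a minimal round-trip sketch of the [Count] then [K_Len][Key][V_Len][Val] format described above. It is not EmbedDb's actual code: the free functions write_sstable and read_sstable and the helpers write_blob and read_blob are hypothetical names, and the sketch assumes that the value is written the same way as the key and that lengths are stored in the host's byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <limits>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: write one length-prefixed string as [Len][Bytes].
// The length is written in host byte order (an assumption of this sketch).
void write_blob(std::ostream& out, const std::string& s) {
    assert(s.size() <= std::numeric_limits<std::uint32_t>::max());
    const auto len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(s.data(), len);
}

// Hypothetical helper: read one length-prefixed string, failing on truncation.
// A corrupted length is not bounded here; a real reader would cap it.
std::string read_blob(std::istream& in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof(len));
    if (!in) throw std::runtime_error("truncated SSTable (length)");
    std::string s(len, '\0');
    in.read(s.data(), len);
    if (!in) throw std::runtime_error("truncated SSTable (payload)");
    return s;
}

// Write the whole table: a 4-byte entry count, then key/value pairs in key order.
void write_sstable(const std::string& path,
                   const std::map<std::string, std::string>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) throw std::runtime_error("cannot open " + path);
    const auto count = static_cast<std::uint32_t>(data.size());
    out.write(reinterpret_cast<const char*>(&count), sizeof(count));
    for (const auto& [key, val] : data) {
        write_blob(out, key);
        write_blob(out, val);
    }
    if (!out) throw std::runtime_error("write failed for " + path);
}

// Read the table back into memory.
std::map<std::string, std::string> read_sstable(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::uint32_t count = 0;
    in.read(reinterpret_cast<char*>(&count), sizeof(count));
    if (!in) throw std::runtime_error("truncated SSTable (header)");
    std::map<std::string, std::string> data;
    for (std::uint32_t i = 0; i < count; ++i) {
        std::string key = read_blob(in);
        std::string val = read_blob(in);
        // Keys were written in sorted order, so hinting at end() is cheap.
        data.emplace_hint(data.end(), std::move(key), std::move(val));
    }
    return data;
}
```

Because the lengths are raw host-order integers, a file written this way is only readable on a machine with the same endianness; fixing the byte order explicitly would be the usual next step for a portable format.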
Design Strategy
I chose the LSM-Tree structure specifically for its write-path behavior. By buffering writes in memory (the Memtable) and flushing them to disk only as sequential writes, we avoid the random I/O bottleneck typical of B-Tree databases on spinning disks. The benefit persists on NVMe SSDs: flash is erased in whole blocks, so large sequential writes tend to leave the drive with less garbage-collection work than scattered in-place updates.
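The buffering step described above can be sketched as a small in-memory table that hands its sorted contents to a flush callback once it grows past a size threshold. This is an illustration only: the Memtable class, its put and flush_now methods, the byte-count threshold, and the callback signature are all assumptions made for the example, not EmbedDb's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical Memtable: buffers writes in a sorted map and flushes them
// in one batch when the buffered key/value bytes reach max_bytes.
class Memtable {
public:
    using FlushFn =
        std::function<void(const std::map<std::string, std::string>&)>;

    Memtable(std::size_t max_bytes, FlushFn flush)
        : max_bytes_(max_bytes), flush_(std::move(flush)) {
        assert(max_bytes_ > 0);
    }

    // Insert or overwrite a key; no disk I/O happens unless the
    // threshold is reached.
    void put(const std::string& key, const std::string& val) {
        auto it = table_.find(key);
        if (it != table_.end()) {
            bytes_ -= it->second.size();  // replace the old value's size
            it->second = val;
        } else {
            table_.emplace(key, val);
            bytes_ += key.size();
        }
        bytes_ += val.size();
        if (bytes_ >= max_bytes_) flush_now();
    }

    // Hand the sorted contents to the callback (for example, an SSTable
    // writer), then start over with an empty buffer.
    void flush_now() {
        if (table_.empty()) return;
        flush_(table_);
        table_.clear();
        bytes_ = 0;
    }

    std::size_t size() const { return table_.size(); }

private:
    std::size_t max_bytes_;
    FlushFn flush_;
    std::map<std::string, std::string> table_;
    std::size_t bytes_ = 0;
};
```

In this sketch the callback would be the place to call something like SSTable::flush, so that many small writes reach the disk as a single sequential pass over already-sorted data.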