Streaming decompress into fragmented (but stable) output without history buffer?
Thank you for this amazingly powerful library. I think we (at https://github.com/vectorizedio/redpanda) may have a unique use case that appears to be almost covered by the API, but there is still an open question about a viable solution, and I'm hoping to get some advice on the limits of the current API.
tl;dr: is it possible to use `ZSTD_c_stableOutBuffer` with a stable (but fragmented) output buffer? We operate in an environment where memory fragmentation prevents on-demand allocation of contiguous regions (even 2 MB), which our use case would otherwise need for the streaming decompression history buffer. We don't control the compression process, and while we do impose limits, restricting clients merely because we cannot acquire contiguous memory regions larger than 1-2 MB is too severe.
We are using zstd to decompress source data that is stored in a heap-allocated fragmented buffer (not folly::IOBuf chains, but a similar idea):
source = [buf-0, buf-1, ...]
Because the source data is not stored in contiguous memory we cannot use the single-shot API, so we rely on streaming decompression, where the output data is also stored in a fragmented buffer.
The problem we are facing arises from the size requirements of the history buffer that streaming decompression needs. Specifically, we care about limiting the size of any single contiguous region of memory.
It is probably relevant to state now that we do not control the compression process, so we can't enforce specific limits on the window size. Because we can't control the window size, my understanding is that even the buffer-less API is not helpful, since the caller must still provide a contiguous memory region large enough for the history, which in turn depends on the compression process.
However, we operate in an environment where memory becomes fragmented over time, and we cannot rely on the OS to provide us with a contiguous region via virtual memory on demand. The fragmentation becomes bad enough that it can be hard to find even 1-2 MB of contiguous memory. For us this would translate into an unreasonable restriction on our clients' compression process.
(An allocator that could use `vmalloc` wouldn't have this issue :)
The current solution we are using is to statically allocate a sufficiently large buffer (say 4 or 8 MB) during boot, when memory is not yet fragmented, and then use the zstd static-init experimental API to provide this buffer to the streaming decompression API.
I recently stumbled upon the PR https://github.com/facebook/zstd/pull/2094 that introduced `ZSTD_c_stableOutBuffer`, and from what I can tell it solves the issue of the internal history buffer by relying on a stable output.
Can `ZSTD_c_stableOutBuffer` work with an output buffer that is fragmented? The output buffers can themselves remain stable in memory, but the entirety of the decompressed output would need to be fragmented in (potentially) arbitrary ways, though we could adopt a stricter allocation policy if necessary.
Are there other approaches to solving the problem we are facing?
Answer from felixhandte:
Thanks for providing so much context!
I'm sorry to say that in general, the answer to your question is no. And unfortunately, it's not just an API limitation: the decoder implementation is written in such a way that it cannot decompress with a fragmented history buffer.
Fundamentally the zstd decoder is performing two operations in a loop, decoding sequences and executing them:

- Decoding a sequence recovers its (`literal_length`, `match_length`, `match_offset`) values from the entropy-encoded representation they're stored in (ANS encoding via FSE).
- Executing a sequence copies `literal_length` bytes from the decoded literals buffer (previously recovered from their Huffman-encoded representation) onto the tail of the output, and then copies `match_length` bytes starting `match_offset` bytes back in the history of the stream onto the tail of the output.
The relevant part here is the implementation of the match copy operation. Since this is part of the hot loop / core of the most performance-sensitive part of Zstd, we want the lookup of the match position from the decoded offset to be as fast as possible, basically just `current_position - match_offset`. The cost of this fast mapping is that it requires that the whole window is contiguous...
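To illustrate (a toy model, not zstd's actual implementation): with a contiguous output/history buffer, resolving a match really is just pointer arithmetic, and a byte-at-a-time copy even handles overlapping matches (`match_offset < match_length`), which is how repeated runs are encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy sketch of sequence execution over a contiguous output buffer.
 * Appends `literal_length` literal bytes, then copies `match_length`
 * bytes from `match_offset` bytes back in the already-written output. */
static void execute_sequence(unsigned char* out, size_t* pos,
                             const unsigned char* literals, size_t literal_length,
                             size_t match_offset, size_t match_length)
{
    for (size_t i = 0; i < literal_length; i++)       /* append literals */
        out[(*pos)++] = literals[i];
    /* the fast mapping: current_position - match_offset */
    const unsigned char* match = out + *pos - match_offset;
    for (size_t i = 0; i < match_length; i++)         /* copy from history;   */
        out[(*pos)++] = match[i];                     /* overlap-safe by order */
}
```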
Except technically, we do actually implement an exception to this. The history buffer is allowed to have a single discontinuity. In order to efficiently maintain a window-sized view of an arbitrarily large stream, the internal history buffer is a circular buffer (sized to the window size), which as it wraps around will map the window into two chunks. So the decoder is implemented to handle that. That's probably not sufficient for your use case, though, even if that support were plumbed through to external history buffers.
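A toy model of that single allowed discontinuity (again illustrative, not zstd's code): with a circular history buffer, a match reaching back from the current write head crosses the wrap point at most once, so a lookup touches at most two contiguous chunks, never arbitrarily many.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `len` history bytes starting `offset` bytes back from write
 * position `head` out of the circular buffer `buf` of `window_size`
 * bytes. At most two contiguous chunks are read: the run up to the
 * end of the buffer, and (if the match wraps) one run from the start. */
static void read_history(const unsigned char* buf, size_t window_size,
                         size_t head, size_t offset, size_t len,
                         unsigned char* dst)
{
    size_t start = (head + window_size - offset) % window_size;
    size_t first = window_size - start;       /* bytes before the wrap point */
    if (first > len) first = len;
    memcpy(dst, buf + start, first);
    memcpy(dst + first, buf, len - first);    /* the single wrapped chunk */
}
```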
Someone (you? us?) could potentially write a decoder implementation that supports arbitrary fragmentation at the cost of slower execution, but from an external view, the approach you're taking now is probably your most realistic option.
I hope that helps!