
Streaming decompress into fragmented (but stable) output without history buffer?

Thank you for this amazingly powerful library. I think we (at https://github.com/vectorizedio/redpanda) may have a unique use case that appears to be almost covered by the API, but there is still an open question about a viable solution, and I'm hoping to get some advice on the limits of the current API.

tl;dr: Is it possible to use ZSTD_c_stableOutBuffer with a stable (but fragmented) output buffer? We operate in an environment where memory fragmentation prevents on-demand allocation of contiguous regions (even 2 MB), which would otherwise be needed in our use case for the streaming decompression history buffer. We don't control the compression process, and while we do impose limits, when those limits derive from our inability to acquire contiguous memory regions larger than 1-2 MB, the restrictions are too severe.

Full details of the use case

We are using zstd to decompress source data that is stored in a heap-allocated, fragmented buffer (not folly::IOBuf chains, but a similar idea):

source = [buf-0, buf-1, ...]

Because the source data is not stored in contiguous memory we cannot use the single-shot API, so we rely on streaming decompression, where the output data is also stored in a fragmented buffer (a simplified sketch of the loop is below).

  • This all works fine, and so far it seems this is a completely normal use of the zstd streaming API.
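
Roughly, the loop looks like the following. This is a simplified sketch, not our actual code: the helper name, fragment layout, and error handling are illustrative, and only the public ZSTD_decompressStream() API is assumed.

    #include <stddef.h>
    #include <zstd.h>

    /* Illustrative sketch: feed a fragmented source (srcs[0..nsrc)) through
     * ZSTD_decompressStream(), writing into pre-allocated output fragments
     * (dsts[0..ndst)). Returns 0 on success, nonzero on error/overflow. */
    static int decompress_fragmented(ZSTD_DStream *ds,
                                     ZSTD_inBuffer *srcs, size_t nsrc,
                                     ZSTD_outBuffer *dsts, size_t ndst)
    {
        size_t di = 0;
        for (size_t si = 0; si < nsrc; si++) {
            ZSTD_inBuffer *in = &srcs[si];
            while (in->pos < in->size) {
                if (di == ndst)
                    return 1;                          /* out of output space */
                size_t const ret = ZSTD_decompressStream(ds, &dsts[di], in);
                if (ZSTD_isError(ret))
                    return 1;                          /* decoding error */
                if (dsts[di].pos == dsts[di].size)
                    di++;                              /* fragment full, move to the next */
                if (ret == 0)
                    return 0;                          /* frame fully decoded */
            }
        }
        return 0;  /* all input consumed (a nonzero ret here would mean a truncated frame) */
    }

The point being that neither the input nor the output needs to be contiguous here; only zstd's internal history buffer does.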

The problem we are facing arises from the size requirements of the history buffer that streaming decompression needs. Specifically, we care about limiting the size of any single contiguous region of memory.

It is probably relevant to state now that we do not control the compression process, so we can't enforce specific limits on the window size. Because we can't control the window size, my understanding is that even the buffer-less API is not helpful, because the caller must still provide a contiguous memory region large enough for the history, which in turn depends on the compression process.

  • I would expect at this point the suggestion to be to limit window size, and we do up to a reasonable point.
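
For reference, the decode-side cap we apply is essentially the window-log limit exposed by the advanced API. A minimal sketch, with an illustrative helper name and a 2 MiB limit:

    #include <zstd.h>

    /* Sketch: reject frames whose window exceeds 2^21 = 2 MiB, so that a
     * misconfigured client can't force a huge contiguous history
     * allocation. Frames needing a larger window simply fail to decode. */
    static ZSTD_DCtx *make_window_limited_dctx(void)
    {
        ZSTD_DCtx *dctx = ZSTD_createDCtx();
        if (dctx != NULL)
            ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 21);
        return dctx;
    }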

However, we operate in an environment where memory becomes fragmented over time and we cannot rely on the OS to provide us with a contiguous region via virtual memory on demand. The fragmentation becomes bad enough that it can be hard to find even 1-2 MB of contiguous memory. For us this would translate into an unreasonable restriction on the clients' compression process.

  • I thought for sure we'd find a solution looking at the use of zstd in the Linux kernel, but I found that zstd in that environment makes use of vmalloc so it doesn't have this issue :)

The current solution we are using is to statically allocate a sufficiently large buffer (say 4 or 8 MB) during boot, when memory is not yet fragmented, and then use the experimental static-init API to provide it to the streaming decompression API (sketched below).

  • This isn't ideal because we need to reserve based on a worst-case scenario (up to some limit), and more importantly, this memory is pinned forever, even though clients may never send us compressed data.
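
For clarity, the boot-time reservation looks roughly like this. It is a sketch under assumptions: the 8 MiB worst case and the helper name are illustrative, and it uses the experimental ZSTD_estimateDStreamSize() / ZSTD_initStaticDStream() functions, which are what I mean by the static-init API above.

    #define ZSTD_STATIC_LINKING_ONLY   /* static-init / estimate functions are experimental */
    #include <stdlib.h>
    #include <zstd.h>

    /* Sketch: reserve a worst-case workspace once at boot, while large
     * contiguous allocations are still possible, and build the whole
     * DStream (including its history/window buffer) inside it. */
    #define MAX_WINDOW_BYTES  ((size_t)8 << 20)    /* example worst case: 8 MiB window */

    static ZSTD_DStream *g_dstream;

    int boot_init_decompressor(void)
    {
        size_t const workspace_size = ZSTD_estimateDStreamSize(MAX_WINDOW_BYTES);
        void *workspace = malloc(workspace_size);  /* malloc alignment is sufficient */
        if (workspace == NULL)
            return -1;
        /* the DStream lives entirely inside `workspace`; it is pinned for the
         * lifetime of the process, whether or not compressed data ever arrives */
        g_dstream = ZSTD_initStaticDStream(workspace, workspace_size);
        return g_dstream != NULL ? 0 : -1;
    }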

I recently stumbled upon the PR https://github.com/facebook/zstd/pull/2094 that introduced ZSTD_c_stableOutBuffer, and from what I can tell it sidesteps the need for the internal history buffer by relying on a stable output buffer.

  • Can ZSTD_c_stableOutBuffer work with an output buffer that is fragmented? The output buffers can themselves remain stable in memory, but the entirety of the decompressed output would need to be fragmented in (potentially) arbitrary ways, though we could add a stricter allocation policy if necessary.
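
For context, my rough understanding of the intended usage on the decompression side, assuming the decode-side parameter from that PR is ZSTD_d_stableOutBuffer (the helper below is purely illustrative, not a statement of the actual contract), is that the caller pins one contiguous output span for the whole frame:

    #define ZSTD_STATIC_LINKING_ONLY   /* stable-buffer parameters are experimental */
    #include <zstd.h>

    /* Sketch of the "stable output" pattern as I understand it: `dst`/`dst_cap`
     * stay fixed for the whole frame, which is exactly what a fragmented
     * output cannot offer. Returns bytes written, or a zstd error code. */
    static size_t decompress_stable_out(ZSTD_DCtx *dctx,
                                        void *dst, size_t dst_cap,
                                        const void *src, size_t src_size)
    {
        ZSTD_DCtx_setParameter(dctx, ZSTD_d_stableOutBuffer, 1);
        ZSTD_outBuffer out = { dst, dst_cap, 0 };   /* same buffer on every call */
        ZSTD_inBuffer  in  = { src, src_size, 0 };
        size_t ret;
        do {
            ret = ZSTD_decompressStream(dctx, &out, &in);
        } while (ret != 0 && !ZSTD_isError(ret) && in.pos < in.size);
        return ZSTD_isError(ret) ? ret : out.pos;
    }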

Are there other approaches to solving the problem we are facing?


Answer from felixhandte

@dotnwat,

Thanks for providing so much context!

I'm sorry to say that in general, the answer to your question is no. And unfortunately, it's not just an API limitation: the decoder implementation is written in such a way that it cannot decompress with a fragmented history buffer.

Fundamentally, the zstd decoder performs two operations in a loop: decoding sequences and executing them.

  1. Decoding a sequence involves recovering the LZ77 literal_length, match_length, and match_offset values from the entropy-coded representation they're stored in (ANS coding via FSE).
  2. Executing a sequence is straightforward LZ77 decoding: copy the next literal_length bytes from the decoded literals buffer (previously recovered from their Huffman-encoded representation) onto the tail of the output, and then copy match_length bytes from match_offset bytes back in the history of the stream onto the tail of the output.

The relevant part here is the implementation of the match copy operation. Since this is the hot loop / core of the most performance-sensitive part of zstd, we want the lookup of the match position from the decoded offset to be as fast as possible: basically just current_position - match_offset. The cost of this fast mapping is that it requires the whole window to be contiguous...
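
To make that concrete, here is a heavily simplified sketch of sequence execution over a single contiguous buffer (the function is illustrative; the real decoder is far more optimized, with wildcopies, overlap tricks, and bounds checks, but the addressing is the point):

    #include <stddef.h>
    #include <string.h>

    /* Simplified sequence execution. `out` holds everything decoded so far
     * (the window), `op` is the current write position. Returns the new `op`. */
    static size_t execute_sequence(unsigned char *out, size_t op,
                                   const unsigned char *literals, size_t literal_length,
                                   size_t match_offset, size_t match_length)
    {
        /* 1. append the literals to the tail of the output */
        memcpy(out + op, literals, literal_length);
        op += literal_length;

        /* 2. copy the match from `match_offset` bytes back in the history;
         *    with a contiguous window this is just a pointer subtraction.
         *    Copy byte by byte because source and destination may overlap
         *    (match_offset < match_length is legal and means repetition). */
        const unsigned char *match = out + op - match_offset;
        for (size_t i = 0; i < match_length; i++)
            out[op + i] = match[i];

        return op + match_length;
    }

With a fragmented window, that single subtraction would have to become a per-sequence lookup of which fragment the match lands in, and possibly a copy split across fragments.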

Except that, technically, we do implement one exception to this. The history buffer is allowed to have a single discontinuity. In order to efficiently maintain a window-sized view of an arbitrarily large stream, the internal history buffer is a circular buffer (sized to the window size), which, as it wraps around, maps the window into two chunks, and the decoder is implemented to handle that. That's probably not sufficient for your use case, though, even if that support were plumbed through to external history buffers.

Someone (you? us?) could potentially write a decoder implementation that supported arbitrary fragmentation at the cost of slower execution, but from an external view, the approach you're taking now is probably your most realistic option.

I hope that helps!
