Björn Steinbrink dotdash Bielefeld, Germany

dotdash/actix-extras 0

A collection of additional crates supporting the actix and actix-web frameworks.

dotdash/actix-web 0

Actix web is a small, pragmatic, and extremely fast rust web framework.

dotdash/ale 0

Check syntax in Vim asynchronously and fix files, with Language Server Protocol (LSP) support

dotdash/body-parser 0

JSON body parsing for iron

dotdash/cargo 0

The Rust package manager

dotdash/clap-rs 0

A full featured, fast Command Line Argument Parser for Rust

dotdash/docopt.rs 0

Docopt for Rust (command line argument parser).

dotdash/dotvim 0

My vim configuration

dotdash/git2-rs 0

libgit2 bindings for Rust

dotdash/gltf-viewer 0

glTF 2.0 Viewer written in Rust

issue comment rust-lang/rust

Match expressions use O(n) stack space with n branches

The problem here is having live values across BB boundaries, because the register allocator in debug mode simply spills and reloads everything, even for unconditional branches.

Silly example:

define internal i8 @testcase(i8 %0) {
  br label %bb2

bb2:
  ret i8 %0
}

becomes:

testcase:                               # @testcase
  .cfi_startproc
# %bb.0:
                                          # kill: def $dil killed $dil killed $edi
  movb %dil, -1(%rsp)                   # 1-byte Spill
  jmp .LBB15_1
.LBB15_1:                               # %bb2
  movb -1(%rsp), %al                    # 1-byte Reload
  retq

And in this example, it's not so much the match itself, but the overflow check that causes values that are live across BB boundaries. Compiling with -Cdebug-assertions=no gives the same stack usage for small and large.

small:
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.6kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.1kb
stack space: 1.3kb
stack space: 1.5kb
stack space: 1.7kb
stack space: 1.9kb
stack space: 2.1kb
large:
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.6kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.1kb
stack space: 1.3kb
stack space: 1.5kb
stack space: 1.7kb
stack space: 1.9kb
stack space: 2.1kb

Each overflow check causes two spill/reload pairs: one for token (1 byte) and one for the result of the subtraction, which, for alignment reasons, adds up to 8 bytes of stack usage each. I'm not sure there's much we can do about that in terms of MIR construction, but I'd love to be proven wrong here :-)
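For illustration, here's a minimal sketch of the kind of code this is about (the actual test program is in the issue; the names and values here are made up). In debug builds the subtraction emits an overflow check, and both token and the checked result stay live across the branch to the panic block:

fn step(token: u8) -> u8 {
    match token {
        0 => 0,
        // `token - 1` emits an overflow check in debug builds; the operand
        // and the result are live across the resulting basic-block boundary,
        // so each gets its own spill/reload pair.
        _ => step(token - 1),
    }
}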

Also, a good bit of the stack usage is actually used by the println! call. Without the output, small uses 136 bytes of stack with debug assertions, and large uses 440 bytes. Without debug assertions, both use 40 bytes of stack.

In the general case, the difference between debug and release mode can probably be explained by the fact that in release mode, not only do we get a better register allocator, but we also use lifetime intrinsics in LLVM, which allow stack-allocated values that are used in only one arm to share space with values only used in other arms. The latter would explain why the observed stack usage in the rust-analyzer example goes from Sum(arms) to Max(arms). Short of doing some form of stack coloring of our own, I don't see a way to improve this in terms of MIR generation either.
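To illustrate the Sum(arms) vs Max(arms) point (a made-up sketch, not the rust-analyzer code): each arm below has a large local that is only live within that arm. With lifetime intrinsics all three buffers can share a single 1024-byte slot; without them each arm contributes its own slot, and the frame grows with the number of arms.

fn arms(x: u8) -> u8 {
    match x {
        0 => {
            let buf = [0u8; 1024]; // live only in this arm
            buf[0]
        }
        1 => {
            let buf = [1u8; 1024];
            buf[512]
        }
        _ => {
            let buf = [2u8; 1024];
            buf[1023]
        }
    }
}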

evanw

comment created time in a month

pull request comment rust-lang/rust

Implement a generic Destination Propagation optimization on MIR

That's what RVO does, assigning the return slot to a value that would otherwise use a separate stack slot, but that's just one way. Doing copy propagation works just as well as long as there is only one possible value that is being assigned.

_5 = "Hello"
_0 = _5

is transformed to the following by this pass:

_0 = "Hello"
nop

but copy propagation would lead to:

_5 = "Hello"
_0 = "Hello"

Something like dead store elimination would then have to clean up the dead assignment to _5 (or copy propagation would need a built-in DSE).

The example from above is akin to

_5 = "Hello"
_2 = _5
call(_2)
_3 = _5
call(_3)

and gets converted to

_3 = "Hello"
_2 = _3
call(_2)
nop
call(_3)

while copy propagation would give

_5 = "Hello"
_2 = "Hello"
call(_2)
_3 = "Hello"
call(_3)

The difference is that copy propagation doesn't force the two destinations to be alive at the same time, which apparently allows more optimizations to happen, e.g. in the case of aggregates.

The case where the approach taken here works better is like this:

if _10, goto bb1 else goto bb2

bb1:
  _2 = "Hello"
  goto bb3

bb2:
  _2 = "Bye"
  goto bb3

bb3:
  _5 = _2

I'm wondering whether it's more common to use a single value in multiple places, or to assign to a single destination from a value that has been set in multiple places.

I suppose the truth is somewhere in between... Maybe limiting this pass to cases where only a single replacement is possible is useful?

jonas-schievink

comment created time in 2 months

pull request comment rust-lang/rust

Implement a generic Destination Propagation optimization on MIR

I always expected this to work the other way around, replacing dest with source, so what's the reasoning behind doing it this way? Feel free to just point me somewhere where I can read up on it myself, I just didn't find anything right away.

jonas-schievink

comment created time in 2 months

pull request comment rust-lang/rust

Implement a generic Destination Propagation optimization on MIR

This modified example from #56172 regresses with this pass:

#[inline(never)]
pub fn g(clip: Option<&bool>) {
    clip.unwrap();
    let item = SpecificDisplayItem::PopStackingContext;
    do_item(&DI {
            item,
    });
    do_item(&DI {
            item,
    });
}

pub enum SpecificDisplayItem {
    PopStackingContext,
    Other([f64; 22]),
}

struct DI {
    item: SpecificDisplayItem,
}


fn do_item(di: &DI) { unsafe { ext(di) } }
extern {
    fn ext(di: &DI);
}

Nightly optimizes this properly, but with this pass, one of the DI instances is directly initialized (good), but that instance is then used to initialize the other one. That's bad, because the memcpy can no longer be omitted, and the lifetimes of those stack allocations now overlap, so stack coloring can no longer collapse them into a single stack slot, increasing memory usage.

jonas-schievink

comment created time in 2 months

issue comment rust-lang/rust

Unnecessary memcpy caused by ordering of unwrap

The approach from #72632 breaks if you assign the same source to multiple destinations, because there's no simple chain that can be reduced to a single destination. I think you need to do copy-propagation (replacing the destination with the source, instead of the other way around) to handle that.

The following doesn't get properly optimized by #72632, but is handled by the memcpy pass being run before the inliner. That approach of course also doesn't handle all the cases, thus the proposal for the optimization to catch copies from uninitialized memory.

#[inline(never)]
pub fn f(clip: Option<&bool>) {
    let item = SpecificDisplayItem::PopStackingContext;
    clip.unwrap();
    do_item(&DI {
            item,
    });
    do_item(&DI {
            item,
    });
}

In fact #72632 even stops the patched (MemCpyOpt before Inliner) LLVM from optimizing this version, because SROA can no longer split the alloca and so there's no memcpy that copies only uninitialized memory. For the modified f function #72632 still produces better code than nightly, but if you apply the same change to g, then nightly produces a properly optimized version, while dest-prop causes the lifetimes of the two DI instances to overlap, forcing double stack usage and a memcpy.

jrmuizel

comment created time in 2 months

issue comment rust-lang/rust

Unnecessary memcpy caused by ordering of unwrap

It might be worth noting that the memcpy here is especially pointless because it copies uninitialized memory. When the memcpy optimizer folds a memset into a memcpy, it checks whether the memcpy copies more memory than what has been memset; if so, and the remainder is uninitialized, the memcpy for the uninitialized part is dropped.

In this case, SROA splits the alloca for the item into the discriminant part and 176 uninitialized bytes. But since it runs without memory dependence analysis, it has no easy way to drop the memcpy for the uninitialized part. And before the MemCpy pass can kill the memcpy, the inliner comes along and breaks the code into multiple basic blocks. So for this constellation, it would help to run the MemCpy pass before the inliner, but I have no real idea what tradeoffs that involves.

Another option would be to add an optimization that drops memcpys from uninitialized sources to a pass that does cross-bb memory dependence analysis anyway, but a quick look didn't reveal any obvious place to do that. Given the rather common use of enums like this, where only some variants have a payload, that might be worth a try though.
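As a rough sketch of the enum pattern in question (mirroring the SpecificDisplayItem type from the related PR discussion; the layout numbers are assumptions about x86_64): moving a value of the payload-free variant still copies the full enum size, even though the payload bytes were never initialized.

pub enum SpecificDisplayItem {
    PopStackingContext,   // no payload: only the discriminant is written
    Other([f64; 22]),     // 176-byte payload determines the enum's size
}

fn move_it() -> Box<SpecificDisplayItem> {
    let item = SpecificDisplayItem::PopStackingContext;
    // At the machine level this move is a memcpy of the full enum,
    // uninitialized payload bytes included.
    Box::new(item)
}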

jrmuizel

comment created time in 2 months

push event dotdash/redmine-rs

Björn Steinbrink

commit sha dc9bf0adae0398a49fbfdad6ae2fdccce9ee450e

Initial commit

view details

push time in 2 months

issue comment rust-lang/cargo

cargo-package: list of dirty files contains ignored files

Heh, so I found the special case that ignores the target directory, and it's broken: it doesn't just ignore the target directory, but anything called target anywhere in the source tree.

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   data/target

no changes added to commit (use "git add" and/or "git commit -a")

$ cargo package
warning: manifest has no description, license, license-file, documentation, homepage or repository.
See https://doc.rust-lang.org/cargo/reference/manifest.html#package-metadata for more info.
   Packaging bla v0.1.0 (/home/bs/src/bla)
   Verifying bla v0.1.0 (/home/bs/src/bla)
   Compiling bla v0.1.0 (/home/bs/src/bla/target/package/bla-0.1.0)
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s

So from my POV I think a viable approach would be to perform the change from above and remove the special case for target. Alternatively, keep the special case, but fix it to only apply to directories called target. What do you think?
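For the second alternative, the fix could roughly look like this (purely illustrative, not the actual cargo code; the helper name is made up): only skip an entry when it's a directory named target directly under the package root.

use std::path::Path;

// Hypothetical helper: true only for the `target` directory at the package
// root, not for arbitrary paths that happen to contain "target".
fn is_package_target_dir(pkg_root: &Path, entry: &Path) -> bool {
    entry.is_dir()
        && entry.parent() == Some(pkg_root)
        && entry.file_name().map_or(false, |name| name == "target")
}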

dotdash

comment created time in 2 months

issue comment rust-lang/cargo

cargo-package: list of dirty files contains ignored files

To clarify: It does the right thing because the target directory is ignored by the default rule target/. If that rule is missing, you get a listing including all the untracked files in that directory.

I suppose there's already a special case that handles not recursing into "target", which is actually not needed when that directory is already ignored via .gitignore. So this would be a breaking change for users that don't ignore the target directory in their .gitignore, which to me seems rather unlikely, but that's just from my POV.

dotdash

comment created time in 2 months

issue comment rust-lang/cargo

cargo-package: list of dirty files contains ignored files

As far as I'm concerned, I'd actually be fine with cargo just reporting the directory instead of the individual files. I do wonder whether you would really need additional filtering. As I understand it, the option causes git to recurse into the directory and report the individual untracked files, but ignored files are still excluded, and there's no recursion into directories that are completely ignored anyway.

Without really knowing if that's the right place, I made the following change, and it seems to do the right thing:

diff --git src/cargo/sources/path.rs src/cargo/sources/path.rs
index cf406e8dd..eb5e26512 100644
--- src/cargo/sources/path.rs
+++ src/cargo/sources/path.rs
@@ -254,6 +254,7 @@ impl<'cfg> PathSource<'cfg> {
         });
         let mut opts = git2::StatusOptions::new();
         opts.include_untracked(true);
+        opts.recurse_untracked_dirs(true);
         if let Ok(suffix) = pkg_path.strip_prefix(root) {
             opts.pathspec(suffix);
         }
dotdash

comment created time in 2 months

issue comment rust-lang/cargo

cargo-package: list of dirty files contains ignored files

Nothing special, just:

/target
*.rs.bk
Cargo.lock
.*.swp

I originally had .*.swp in my .config/git/ignore but added it to the local .gitignore as well, just to make sure that that is not causing it. Other files like foo.rs.bk also show up, so it's not the leading dot either.

dotdash

comment created time in 2 months

issue opened rust-lang/cargo

cargo-package: list of dirty files contains ignored files

Problem

When running "cargo package" it complains about some files containing uncommitted changes, even though they're ignored by my .gitignore rules.

This seems to only happen when the file is in a directory that contains a mix of ignored and non-ignored files, but no tracked files at all.

$ git status
On branch master
nothing to commit, working tree clean

$ mkdir new_dir
$ touch new_dir/file
$ touch new_dir/.file.swp

$ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        new_dir/

nothing added to commit but untracked files present (use "git add" to track)

$ git status new_dir/
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        new_dir/file

nothing added to commit but untracked files present (use "git add" to track)

$ cargo package
error: 2 files in the working directory contain changes that were not yet committed into git:

new_dir/.file.swp
new_dir/file

I expect to only see the entry for new_dir/file there, not new_dir/.file.swp.

Notes

Output of cargo version:

$ cargo --version
cargo 1.44.1 (88ba85757 2020-06-11)

created time in 2 months

push event dotdash/redmine-rs

Björn Steinbrink

commit sha c93c696278e72a9731f128e0324d338b10e93e2c

Initial commit

view details

push time in 2 months

push event dotdash/redmine-rs

Björn Steinbrink

commit sha 5be564813c23b503762c2b6877c802fc296450ff

Initial commit

view details

push time in 2 months

push event dotdash/redmine-rs

Björn Steinbrink

commit sha 9e765cb7eb378767730a4537bf8bbbbd29d271e4

Initial commit

view details

push time in 2 months

issue comment rust-lang/rust

PartialOrd derived for C-like enum isn't properly optimized for the `<=` operator

LLVM can't properly optimize the match against Some(Less | Equal). If we instead match against None | Some(Greater) and negate the result, LLVM does find a way to optimize the code, because the IR we generate is somewhat simpler, having to handle only one of the Some() cases.
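For reference, the adjusted le would be roughly along these lines (just a sketch of the idea; the real change would be in the default impl of le, and partial_cmp's Option<Ordering> is what gets matched here):

use std::cmp::Ordering::{self, Greater};

// Logically the same as `matches!(ord, Some(Less | Equal))`, but phrased so
// that only one Some(...) case has to be distinguished, which LLVM optimizes
// better.
fn le_from_partial_cmp(ord: Option<Ordering>) -> bool {
    !matches!(ord, None | Some(Greater))
}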

I didn't really look into how to maybe teach LLVM to handle this case, and didn't see anything obvious in the way we generate IR that could be improved here either.

Should I prepare a PR to adjust the default impl of le with a comment that explains why it's written that way, or should we investigate more here?

pubfnbar

comment created time in 2 months

pull request comment dense-analysis/ale

WIP: Balloonify

@ian-howell since this is actually good to go, it might be a good idea to remove the WIP prefix and then ping the maintainer(s?) again

ian-howell

comment created time in 2 months

issue comment rust-lang/rust

Huge stack allocation is generated when assigning a huge piece of memory to a reference

Ok, so inlining runs first, thus the call slot optimization never triggers here.

If we mark init() as inline(never), then the call slot optimization runs but fails, because the mutable reference bar doesn't get marked as noalias (see https://github.com/rust-lang/rust/issues/54878), so the optimizer can't be sure that init() doesn't access the memory referenced by bar. Using -Zmutable-noalias=yes fixes this and gets rid of the memcpy, at the cost of not having init() inlined.
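A hedged reconstruction of the pattern for context (the real code is in the issue report; Huge, init and caller are made-up names):

struct Huge([u8; 1 << 20]);

#[inline(never)]
fn init() -> Huge {
    Huge([0u8; 1 << 20])
}

fn caller(bar: &mut Huge) {
    // Call slot optimization would let init() write straight into *bar, but
    // without noalias on bar LLVM can't rule out that init() reaches the same
    // memory some other way, so a stack temporary plus a memcpy remain.
    *bar = init();
}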

When init() gets inlined, we unfortunately get IR like this:

loop_block:
   ; yadda yadda
   br %done, label %memcpy_block, label %loop_block

memcpy_block:
    call void @llvm.memcpy(...)

This means that we have a memcpy dependency that crosses a basic block boundary, and LLVM's memcpy optimizer doesn't handle these. There have been attempts to make it do that (for example by Patrick Walton, years ago), but AFAIK nothing has made it into LLVM (yet?).

bugadani

comment created time in 3 months

issue comment rust-lang/rust

Huge stack allocation is generated when assigning a huge piece of memory to a reference

@jonas-schievink so a slight misuse of terms that had me confused, ok :-)

@ecstatic-morse that explains it then. I think I have a well established history of getting slightly over-excited about my results and accidentally posting wrong results ;-)

There clearly is a missed optimization here, which the existing copy propagation pass could have picked up, if it wasn't limited to locals as destinations.

That said, I wonder why the call slot optimization in LLVM doesn't handle it either. I'll try to look into that...

bugadani

comment created time in 3 months

issue comment rust-lang/rust

Huge stack allocation is generated when assigning a huge piece of memory to a reference

@ecstatic-morse How did you get those results? I cannot reproduce this by simply merging #72205 and compiling the given code while RUSTC_STAGE=1 is set in the environment.

I also fail to see how NRVO would apply anyway. The extra copy is due to the temporary created in the caller, which uses an out-pointer and has a return type of (), so NRVO doesn't seem useful, and the callee is already writing straight to the return value anyway. Any pointers?

bugadani

comment created time in 3 months
