nmichaud/enable-mapping 5

Tile-based mapping using enable

nmichaud/apple-pencil-safari-api-test 0

Canvas sketch board, force touch, real-time Bezier curve.

nmichaud/arrow 0

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.

nmichaud/atom 0

Memory efficient Python objects

nmichaud/Boost.NumPy 0

Boost.Python interface for NumPy; in preparation for eventual proposal to Boost (manual mirror of Boost Sandbox SVN)

nmichaud/c-rrb 0

RRB-tree implemented as a library in C.

nmichaud/clear-pipes 0

visualize data flow

nmichaud/codeflow 0

Python Debugger testbed

nmichaud/crosstalk 0

Smalltalk-80 bare metal implementation for the Raspberry Pi

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

I am really excited for the changes w.r.t. performance in the merged PRs!

This will allow us to further improve some kernels :+1:

Dandandan

comment created time in 14 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

No worries @jorgecarleitao -- I think this was likely to happen given the backup of PRs waiting to go on to master :)

Dandandan

comment created time in 15 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

Thanks a lot, @Dandandan and @alamb , and sorry for the mess 😞

Dandandan

comment created time in 21 minutes

pull request comment apache/arrow

ARROW-11022: [Rust] Upgrade to Tokio 1.0

@maxburke I predict sometime in the next 24 hours. We are working through the Rust PR backlog (though it is fairly large). I also believe this one is important as tokio flows upwards through the Rust ecosystem

Dandandan

comment created time in 21 minutes

pull request comment apache/arrow

ARROW-11108: [Rust] Fixed performance issue in mutableBuffer.

The fix is merged, so hopefully master is back 🟢 ✅

jorgecarleitao

comment created time in 23 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

🚀

Dandandan

comment created time in 26 minutes

push event apache/arrow

Heres, Daniel

commit sha a4266a1d4954c83be8707bb6209a8e6552ba148a

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

FYI @jorgecarleitao I think some change on master and maybe some parquet related changes caused a compilation error on master. This fixes the compilation error.

Closes #9269 from Dandandan/fix_datafusion

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

push time in 26 minutes

PR closed apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error datafusion lang-rust

FYI @jorgecarleitao I think some change on master and maybe some parquet related changes caused a compilation error on master. This fixes the compilation error.

+1 -1

6 comments

1 changed file

Dandandan

pr closed time in 26 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

For anyone else following along, the error this PR fixes looks like the following (from https://github.com/apache/arrow/runs/1729540155):

   Compiling arrow-flight v3.0.0-SNAPSHOT (/__w/arrow/arrow/rust/arrow-flight)
   Compiling tonic v0.3.1
   Compiling datafusion v3.0.0-SNAPSHOT (/__w/arrow/arrow/rust/datafusion)
error[E0061]: this function takes 2 arguments but 1 argument was supplied
   --> datafusion/src/physical_plan/parquet.rs:712:25
    |
712 |             data_buffer.resize(data_buffer.len() + data_size);
    |                         ^^^^^^ ----------------------------- supplied 1 argument
    |                         |
    |                         expected 2 arguments

error: aborting due to previous error

Dandandan

comment created time in 28 minutes

issue comment apache/arrow

Needs a handling for missing columns in parquet file

That seems like a reasonable request. Could you please report this feature request on Arrow's JIRA?

jasonkhadka

comment created time in 31 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

I ran this locally and it fixes the build for me. I plan to merge this in prior to CI finishing to get master back to green

Dandandan

comment created time in 31 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

cc @alamb

Dandandan

comment created time in 33 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

https://issues.apache.org/jira/browse/ARROW-11321

Dandandan

comment created time in 36 minutes

pull request comment apache/arrow

ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

This one is next in line for merging; @jorgecarleitao and I have our eyes on it. Once a few more tests have completed on https://github.com/apache/arrow/commits/master we'll get it in.

Dandandan

comment created time in 40 minutes

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

Yeah, those submodules seem to cause some issues lately :/. Should be fixed now @jorgecarleitao

Dandandan

comment created time in 44 minutes

pull request comment apache/arrow

ARROW-11108: [Rust] Fixed performance issue in mutableBuffer.

Well, it broke master... 🤣 @Dandandan already has a fix https://github.com/apache/arrow/pull/9269 💯

jorgecarleitao

comment created time in an hour

pull request comment apache/arrow

ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

This has an unrelated change on a submodule?

Dandandan

comment created time in an hour

pull request comment apache/arrow

ARROW-11045: [Rust] Fix performance issues of allocator

Likely. It was kind of expected, since it made some backward-incompatible changes. I was trying to merge it first to avoid breakage, but I guess I was not fast enough for the speed at which PRs are merged into master after the green light on the mailing list :P

jorgecarleitao

comment created time in an hour

pull request comment apache/arrow

ARROW-11045: [Rust] Fix performance issues of allocator

@jorgecarleitao I think this PR broke master:

   --> datafusion/src/physical_plan/parquet.rs:712:25
    |
712 |             data_buffer.resize(data_buffer.len() + data_size);
    |                         ^^^^^^ ----------------------------- supplied 1 argument
    |                         |
    |                         expected 2 arguments
    |
note: associated function defined here
   --> rust/arrow/src/buffer.rs:833:12
    |
833 |     pub fn resize(&mut self, new_len: usize, value: u8) {

jorgecarleitao

comment created time in an hour

Pull request review comment apache/arrow

ARROW-11320: [C++] Try to strengthen temporary dir creation

 std::string MakeRandomName(int num_chars) {
 }  // namespace

 Result<std::unique_ptr<TemporaryDir>> TemporaryDir::Make(const std::string& prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+    std::string suffix = MakeRandomName(kNumChars);
+    return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =
+      [&](const NativePathString& base_dir) -> Result<std::unique_ptr<TemporaryDir>> {
+    Status st;
+    for (int attempt = 0; attempt < 3; ++attempt) {
+      PlatformFilename fn(base_dir + kNativeSep + base_name + kNativeSep);
+      auto result = CreateDir(fn);
+      if (!result.ok()) {
+        // Probably a permissions error or a non-existing base_dir
+        return nullptr;
+      }
+      if (*result) {
+        return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+      }
+      // The random name already exists in base_dir, try with another name
+      st = Status::IOError("Path already exists: '", fn.ToString(), "'");
+      ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
+    }
+    return st;
+  };
+
+  ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());

   auto base_dirs = GetPlatformTemporaryDirs();
   DCHECK_NE(base_dirs.size(), 0);

-  auto st = Status::OK();
-  for (const auto& p : base_dirs) {
-    PlatformFilename fn(p + kNativeSep + base_name + kNativeSep);
-    auto result = CreateDir(fn);
-    if (!result.ok()) {
-      st = result.status();
-      continue;
-    }
-    if (!*result) {
-      // XXX Should we retry with another random name?
-      return Status::IOError("Path already exists: '", fn.ToString(), "'");
-    } else {
-      return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+  for (const auto& base_dir : base_dirs) {
+    ARROW_ASSIGN_OR_RAISE(auto ptr, TryCreatingDirectory(base_dir));

As written, you will try the next directory after a permissions error or a non-existent directory, but you won't try the next directory after three failed attempts in the same one. I think this is probably OK; just making sure this is the behavior you want.
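To make the behavior described in this comment concrete, here is a minimal sketch of the same control flow with the filesystem replaced by a callback. The names `Attempt`, `Outcome`, `TryCreate` and `MakeTempDir` are hypothetical and are not Arrow APIs. The diff shown in the feed is cut off before the end of the outer loop, so the `kNoUsableDir` result for "every directory failed" is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result of one directory-creation attempt (not an Arrow type).
enum class Attempt { kCreated, kAlreadyExists, kFailed };

// Hypothetical overall result; kNoUsableDir is an assumption, since the diff
// shown in the feed ends before the outer loop does.
enum class Outcome { kCreated, kCollisionError, kNoUsableDir };

using TryCreate = std::function<Attempt(const std::string& base_dir)>;

// A hard failure (permissions, missing base dir) moves on to the next base
// directory. Three name collisions in one directory stop the whole search.
Outcome MakeTempDir(const std::vector<std::string>& base_dirs,
                    const TryCreate& try_create) {
  for (const auto& base_dir : base_dirs) {
    bool hard_failure = false;
    for (int attempt = 0; attempt < 3 && !hard_failure; ++attempt) {
      switch (try_create(base_dir)) {
        case Attempt::kCreated:
          return Outcome::kCreated;
        case Attempt::kFailed:
          hard_failure = true;  // give up on this base directory only
          break;
        case Attempt::kAlreadyExists:
          break;  // retry in the same directory with a new random name
      }
    }
    if (!hard_failure) return Outcome::kCollisionError;
  }
  return Outcome::kNoUsableDir;
}
```

With two base directories, a callback that always collides in the first one yields `kCollisionError` without ever touching the second, while a callback that fails hard in the first one falls through to the second.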

pitrou

comment created time in an hour

Pull request review comment apache/arrow

ARROW-11320: [C++] Try to strengthen temporary dir creation

 Result<SignalHandler> SetSignalHandler(int signum, const SignalHandler& handler)

 namespace {

+int64_t GetPid() {
+#ifdef _WIN32
+  return GetCurrentProcessId();
+#else
+  return getpid();
+#endif
+}
+
 std::mt19937_64 GetSeedGenerator() {
   // Initialize Mersenne Twister PRNG with a true random seed.
+  // Make sure to mix in process id to minimize risks of clashes when parallel testing.
 #ifdef ARROW_VALGRIND
   // Valgrind can crash, hang or enter an infinite loop on std::random_device,
   // use a crude initializer instead.
-  // Make sure to mix in process id to avoid clashes when parallel testing.
   const uint8_t dummy = 0;
   ARROW_UNUSED(dummy);
   std::mt19937_64 seed_gen(reinterpret_cast<uintptr_t>(&dummy) ^
-                           static_cast<uintptr_t>(getpid()));
+                           static_cast<uintptr_t>(GetPid()));
 #else
   std::random_device true_random;
   std::mt19937_64 seed_gen(static_cast<uint64_t>(true_random()) ^
-                           (static_cast<uint64_t>(true_random()) << 32));
+                           (static_cast<uint64_t>(true_random()) << 32) ^
+                           (static_cast<uint64_t>(GetPid()) << 17));

Why << 17? Won't this leave the last 17 bits as 0? It appears PID is at least 32 bits.
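A small sketch of the bit arithmetic behind this question. `MixSeed` is a hypothetical helper, and `random_bits` stands in for the two `std::random_device` draws in the diff. It shows that the shifted PID term does have its low 17 bits zero, but because that term is XORed into the random bits, the low 17 bits of the final seed are simply the random ones, unchanged.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the expression in the diff:
// random bits XOR (pid shifted left by 17).
uint64_t MixSeed(uint64_t random_bits, int64_t pid) {
  return random_bits ^ (static_cast<uint64_t>(pid) << 17);
}

// Mask selecting the low 17 bits of a 64-bit value.
constexpr uint64_t kLow17Mask = (uint64_t{1} << 17) - 1;
```

For example, `MixSeed(0, pid) & kLow17Mask` is always zero, while `MixSeed(kLow17Mask, pid) & kLow17Mask` equals `kLow17Mask` for any `pid`. Whether 17 is the best shift is a separate question that this sketch does not settle.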

pitrou

comment created time in an hour

Pull request review comment apache/arrow

ARROW-11320: [C++] Try to strengthen temporary dir creation

 std::string MakeRandomName(int num_chars) {
 }  // namespace

 Result<std::unique_ptr<TemporaryDir>> TemporaryDir::Make(const std::string& prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+    std::string suffix = MakeRandomName(kNumChars);
+    return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =

Why not simply use static/anon-namespace functions for TryCreatingDirectory and MakeBaseName? MakeRandomName is already one.

pitrou

comment created time in 2 hours

Pull request review comment apache/arrow

ARROW-11320: [C++] Try to strengthen temporary dir creation

 std::string MakeRandomName(int num_chars) {
 }  // namespace

 Result<std::unique_ptr<TemporaryDir>> TemporaryDir::Make(const std::string& prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+    std::string suffix = MakeRandomName(kNumChars);
+    return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =
+      [&](const NativePathString& base_dir) -> Result<std::unique_ptr<TemporaryDir>> {
+    Status st;
+    for (int attempt = 0; attempt < 3; ++attempt) {
+      PlatformFilename fn(base_dir + kNativeSep + base_name + kNativeSep);
+      auto result = CreateDir(fn);
+      if (!result.ok()) {
+        // Probably a permissions error or a non-existing base_dir
+        return nullptr;
+      }
+      if (*result) {
+        return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+      }
+      // The random name already exists in base_dir, try with another name
+      st = Status::IOError("Path already exists: '", fn.ToString(), "'");
+      ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
+    }
+    return st;
+  };
+
+  ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());

   auto base_dirs = GetPlatformTemporaryDirs();
   DCHECK_NE(base_dirs.size(), 0);

-  auto st = Status::OK();
-  for (const auto& p : base_dirs) {
-    PlatformFilename fn(p + kNativeSep + base_name + kNativeSep);
-    auto result = CreateDir(fn);
-    if (!result.ok()) {
-      st = result.status();
-      continue;
-    }
-    if (!*result) {
-      // XXX Should we retry with another random name?
-      return Status::IOError("Path already exists: '", fn.ToString(), "'");
-    } else {
-      return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+  for (const auto& base_dir : base_dirs) {
+    ARROW_ASSIGN_OR_RAISE(auto ptr, TryCreatingDirectory(base_dir));
+    if (ptr) {
+      return std::move(ptr);

Why are you applying std::move to a return value?
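The feed does not include the author's answer, so the following is only one plausible reading. In the diff, `ptr` is a `std::unique_ptr<TemporaryDir>` while the function returns a `Result<std::unique_ptr<TemporaryDir>>`, so the return statement goes through a converting constructor rather than returning the local as its own type. The sketch below reproduces that shape with `MiniResult`, a hypothetical stand-in for `arrow::Result`. The explicit `std::move` makes the move-only pointer an rvalue regardless of whether the compiler applies the implicit move-on-return rule to converting constructors (the rule clarified by C++ core issue CWG 1579).

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for arrow::Result<T>: implicitly constructible from T.
template <typename T>
class MiniResult {
 public:
  MiniResult(T value) : value_(std::move(value)) {}  // implicit on purpose
  T& operator*() { return value_; }

 private:
  T value_;
};

MiniResult<std::unique_ptr<int>> MakeValue() {
  std::unique_ptr<int> ptr(new int(42));
  // The local's type differs from the return type, so MiniResult's converting
  // constructor runs. std::move ensures it receives an rvalue.
  return std::move(ptr);
}
```

Calling `MakeValue()` yields a `MiniResult` whose wrapped pointer owns the value 42. On recent compilers a plain `return ptr;` is expected to compile here as well, so the explicit move is mainly a portability choice.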

pitrou

comment created time in an hour

push event apache/arrow

Heres, Daniel

commit sha 4a6eb19ff69737572cb0e3dec45eb624e71c20d3

ARROW-11268: [Rust][DataFusion] MemTable::load output partition support

I think the feature to be able to repartition an in memory table is useful, as the repartitioning only needs to be applied once, and repartition itself is cheap (at the same node). Doing this when loading data is very useful for in-memory analytics as we can benefit from multiple cores after loading the data.

The speed up from repartitioning is very big (mainly on aggregates), on my 8-core machine: ~5-7x on query 1 and 12 versus a single partition, and a smaller (~30%) difference for query 5 when using 16 partitions. q1/q12 also have very high cpu utilization.

@jorgecarleitao maybe this is of interest to you, as you mentioned you are looking into multi-threading. I think this would be a "high level" way to get more parallelism, also in the logical plan. I think in some optimizer rules and/or dynamically we can do repartitions, similar to what's described here https://issues.apache.org/jira/browse/ARROW-9464

Benchmarks after repartitioning (16 partitions):

PR (16 partitions)

```
Query 12 iteration 0 took 33.9 ms
Query 12 iteration 1 took 34.3 ms
Query 12 iteration 2 took 36.9 ms
Query 12 iteration 3 took 33.6 ms
Query 12 iteration 4 took 35.1 ms
Query 12 iteration 5 took 38.8 ms
Query 12 iteration 6 took 35.8 ms
Query 12 iteration 7 took 34.4 ms
Query 12 iteration 8 took 34.2 ms
Query 12 iteration 9 took 35.3 ms
Query 12 avg time: 35.24 ms
```

Master (1 partition):

```
Query 12 iteration 0 took 245.6 ms
Query 12 iteration 1 took 246.4 ms
Query 12 iteration 2 took 246.1 ms
Query 12 iteration 3 took 247.9 ms
Query 12 iteration 4 took 246.5 ms
Query 12 iteration 5 took 248.2 ms
Query 12 iteration 6 took 247.8 ms
Query 12 iteration 7 took 246.4 ms
Query 12 iteration 8 took 246.6 ms
Query 12 iteration 9 took 246.5 ms
Query 12 avg time: 246.79 ms
```

PR (16 partitions):

```
Query 1 iteration 0 took 138.6 ms
Query 1 iteration 1 took 142.2 ms
Query 1 iteration 2 took 125.8 ms
Query 1 iteration 3 took 102.4 ms
Query 1 iteration 4 took 105.9 ms
Query 1 iteration 5 took 107.0 ms
Query 1 iteration 6 took 109.3 ms
Query 1 iteration 7 took 109.9 ms
Query 1 iteration 8 took 108.8 ms
Query 1 iteration 9 took 112.0 ms
Query 1 avg time: 116.19 ms
```

Master (1 partition):

```
Query 1 iteration 0 took 640.6 ms
Query 1 iteration 1 took 640.0 ms
Query 1 iteration 2 took 632.9 ms
Query 1 iteration 3 took 634.6 ms
Query 1 iteration 4 took 630.7 ms
Query 1 iteration 5 took 630.7 ms
Query 1 iteration 6 took 631.9 ms
Query 1 iteration 7 took 635.5 ms
Query 1 iteration 8 took 639.0 ms
Query 1 iteration 9 took 638.3 ms
Query 1 avg time: 635.43 ms
```

PR (16 partitions)

```
Query 5 iteration 0 took 465.8 ms
Query 5 iteration 1 took 428.0 ms
Query 5 iteration 2 took 435.0 ms
Query 5 iteration 3 took 407.3 ms
Query 5 iteration 4 took 435.7 ms
Query 5 iteration 5 took 437.4 ms
Query 5 iteration 6 took 411.2 ms
Query 5 iteration 7 took 432.0 ms
Query 5 iteration 8 took 436.8 ms
Query 5 iteration 9 took 435.6 ms
Query 5 avg time: 432.47 ms
```

Master (1 partition)

```
Query 5 iteration 0 took 660.6 ms
Query 5 iteration 1 took 634.4 ms
Query 5 iteration 2 took 626.4 ms
Query 5 iteration 3 took 628.0 ms
Query 5 iteration 4 took 635.3 ms
Query 5 iteration 5 took 631.1 ms
Query 5 iteration 6 took 631.3 ms
Query 5 iteration 7 took 639.4 ms
Query 5 iteration 8 took 634.3 ms
Query 5 iteration 9 took 639.0 ms
Query 5 avg time: 635.97 ms
```

Closes #9214 from Dandandan/mem_table_repartition

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

push time in an hour

PR closed apache/arrow

ARROW-11268: [Rust][DataFusion] MemTable::load output partition support datafusion lang-rust

I think the feature to be able to repartition an in memory table is useful, as the repartitioning only needs to be applied once, and repartition itself is cheap (at the same node). Doing this when loading data is very useful for in-memory analytics as we can benefit from multiple cores after loading the data.

The speed up from repartitioning is very big (mainly on aggregates), on my 8-core machine: ~5-7x on query 1 and 12 versus a single partition, and a smaller (~30%) difference for query 5 when using 16 partitions. q1/q12 also have very high cpu utilization.
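As a quick check of these figures against the average times reported in this PR's benchmark output, the ratios can be computed directly. `Speedup` and `TimeSaved` are hypothetical helper names used only for this sketch. From the reported averages, query 12 comes out at roughly 7.0x (246.79 ms vs 35.24 ms), query 1 at roughly 5.5x (635.43 ms vs 116.19 ms), and query 5 at roughly 1.47x (635.97 ms vs 432.47 ms), i.e. about 32% less time.

```cpp
#include <cassert>
#include <cmath>

// Ratio of the single-partition baseline time to the repartitioned time.
double Speedup(double baseline_ms, double pr_ms) { return baseline_ms / pr_ms; }

// Fraction of the baseline runtime saved by the repartitioned run.
double TimeSaved(double baseline_ms, double pr_ms) {
  return 1.0 - pr_ms / baseline_ms;
}
```

For instance, `Speedup(246.79, 35.24)` is the query 12 ratio and `TimeSaved(635.97, 432.47)` is the query 5 reduction.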

@jorgecarleitao maybe this is of interest to you, as you mentioned you are looking into multi-threading. I think this would be a "high level" way to get more parallelism, also in the logical plan. I think in some optimizer rules and/or dynamically we can do repartitions, similar to what's described here https://issues.apache.org/jira/browse/ARROW-9464

Benchmarks after repartitioning (16 partitions):

PR (16 partitions)

Query 12 iteration 0 took 33.9 ms
Query 12 iteration 1 took 34.3 ms
Query 12 iteration 2 took 36.9 ms
Query 12 iteration 3 took 33.6 ms
Query 12 iteration 4 took 35.1 ms
Query 12 iteration 5 took 38.8 ms
Query 12 iteration 6 took 35.8 ms
Query 12 iteration 7 took 34.4 ms
Query 12 iteration 8 took 34.2 ms
Query 12 iteration 9 took 35.3 ms
Query 12 avg time: 35.24 ms

Master (1 partition):

Query 12 iteration 0 took 245.6 ms
Query 12 iteration 1 took 246.4 ms
Query 12 iteration 2 took 246.1 ms
Query 12 iteration 3 took 247.9 ms
Query 12 iteration 4 took 246.5 ms
Query 12 iteration 5 took 248.2 ms
Query 12 iteration 6 took 247.8 ms
Query 12 iteration 7 took 246.4 ms
Query 12 iteration 8 took 246.6 ms
Query 12 iteration 9 took 246.5 ms
Query 12 avg time: 246.79 ms

PR (16 partitions):

Query 1 iteration 0 took 138.6 ms
Query 1 iteration 1 took 142.2 ms
Query 1 iteration 2 took 125.8 ms
Query 1 iteration 3 took 102.4 ms
Query 1 iteration 4 took 105.9 ms
Query 1 iteration 5 took 107.0 ms
Query 1 iteration 6 took 109.3 ms
Query 1 iteration 7 took 109.9 ms
Query 1 iteration 8 took 108.8 ms
Query 1 iteration 9 took 112.0 ms
Query 1 avg time: 116.19 ms

Master (1 partition):

Query 1 iteration 0 took 640.6 ms
Query 1 iteration 1 took 640.0 ms
Query 1 iteration 2 took 632.9 ms
Query 1 iteration 3 took 634.6 ms
Query 1 iteration 4 took 630.7 ms
Query 1 iteration 5 took 630.7 ms
Query 1 iteration 6 took 631.9 ms
Query 1 iteration 7 took 635.5 ms
Query 1 iteration 8 took 639.0 ms
Query 1 iteration 9 took 638.3 ms
Query 1 avg time: 635.43 ms

PR (16 partitions)

Query 5 iteration 0 took 465.8 ms
Query 5 iteration 1 took 428.0 ms
Query 5 iteration 2 took 435.0 ms
Query 5 iteration 3 took 407.3 ms
Query 5 iteration 4 took 435.7 ms
Query 5 iteration 5 took 437.4 ms
Query 5 iteration 6 took 411.2 ms
Query 5 iteration 7 took 432.0 ms
Query 5 iteration 8 took 436.8 ms
Query 5 iteration 9 took 435.6 ms
Query 5 avg time: 432.47 ms

Master (1 partition)

Query 5 iteration 0 took 660.6 ms
Query 5 iteration 1 took 634.4 ms
Query 5 iteration 2 took 626.4 ms
Query 5 iteration 3 took 628.0 ms
Query 5 iteration 4 took 635.3 ms
Query 5 iteration 5 took 631.1 ms
Query 5 iteration 6 took 631.3 ms
Query 5 iteration 7 took 639.4 ms
Query 5 iteration 8 took 634.3 ms
Query 5 iteration 9 took 639.0 ms
Query 5 avg time: 635.97 ms

+49 -5

4 comments

3 changed files

Dandandan

pr closed time in an hour

push event apache/arrow

Andrew Lamb

commit sha b448de78cd0745b12dfb5156aaaff67f75bdee9a

ARROW-11216: [Rust] add doc example for StringDictionaryBuilder

I find myself trying to remember the exact incantation to create a `StringDictionaryBuilder` so I figured I would add it as a doc example

Closes #9169 from alamb/alamb/doc-example

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

push time in an hour

PR closed apache/arrow

ARROW-11216: [Rust] add doc example for StringDictionaryBuilder lang-rust

I find myself trying to remember the exact incantation to create a StringDictionaryBuilder so I figured I would add it as a doc example

+40 -2

3 comments

1 changed file

alamb

pr closed time in an hour

issue comment apache/arrow

Integrate CUDA memory capacity to plasma storage

Also, if I want to integrate only plasma storage into my project, what's the suggested way to do so? c_glib or some other way?

VoVAllen

comment created time in an hour
