Tile-based mapping using enable
nmichaud/apple-pencil-safari-api-test 0
Canvas sketch board, force touch, real-time Bezier curve.
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.
Memory efficient Python objects
Boost.Python interface for NumPy; in preparation for eventual proposal to Boost (manual mirror of Boost Sandbox SVN)
RRB-tree implemented as a library in C.
visualize data flow
Python Debugger testbed
Smalltalk-80 bare metal implementation for the Raspberry Pi
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
I am really excited for the changes w.r.t. performance in the merged PRs!
This will allow us to further improve some kernels :+1:
comment created time in 14 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
No worries @jorgecarleitao -- I think this was likely to happen given the backup of PRs waiting to go on to master :)
comment created time in 15 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
Thanks a lot, @Dandandan and @alamb , and sorry for the mess 😞
comment created time in 21 minutes
pull request commentapache/arrow
ARROW-11022: [Rust] Upgrade to Tokio 1.0
@maxburke I predict sometime in the next 24 hours. We are working through the Rust PR backlog (though it is fairly large). I also believe this one is important as tokio flows upwards through the Rust ecosystem
comment created time in 21 minutes
pull request commentapache/arrow
ARROW-11108: [Rust] Fixed performance issue in mutableBuffer.
The fix is merged, so hopefully master is back 🟢 ✅
comment created time in 23 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
🚀
comment created time in 26 minutes
push eventapache/arrow
commit sha a4266a1d4954c83be8707bb6209a8e6552ba148a
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
FYI @jorgecarleitao I think some change on master and maybe some parquet related changes caused a compilation error on master. This fixes the compilation error.
Closes #9269 from Dandandan/fix_datafusion
Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
push time in 26 minutes
PR closed apache/arrow
FYI @jorgecarleitao I think some change on master and maybe some parquet related changes caused a compilation error on master. This fixes the compilation error.
pr closed time in 26 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
For anyone else following along, the error this PR fixes looks like the following (from https://github.com/apache/arrow/runs/1729540155):
Compiling arrow-flight v3.0.0-SNAPSHOT (/__w/arrow/arrow/rust/arrow-flight)
Compiling tonic v0.3.1
Compiling datafusion v3.0.0-SNAPSHOT (/__w/arrow/arrow/rust/datafusion)
error[E0061]: this function takes 2 arguments but 1 argument was supplied
   --> datafusion/src/physical_plan/parquet.rs:712:25
    |
712 |             data_buffer.resize(data_buffer.len() + data_size);
    |                         ^^^^^^ ----------------------------- supplied 1 argument
    |                         |
    |                         expected 2 arguments
error: aborting due to previous error
comment created time in 28 minutes
issue commentapache/arrow
Needs a handling for missing columns in parquet file
That seems like a reasonable request. Could you please report this feature request on Arrow's JIRA?
comment created time in 31 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
I ran this locally and it fixes the build for me. I plan to merge this in prior to CI finishing to get master back to green
comment created time in 31 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
cc @alamb
comment created time in 33 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
https://issues.apache.org/jira/browse/ARROW-11321
comment created time in 36 minutes
pull request commentapache/arrow
ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join
This one is next in line for merging; @jorgecarleitao and I have our eyes on it. Once a few more tests have completed on https://github.com/apache/arrow/commits/master we'll get it in.
comment created time in 40 minutes
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
Yeah, those submodules seem to cause some issues lately :/. Should be fixed now @jorgecarleitao
comment created time in 44 minutes
pull request commentapache/arrow
ARROW-11108: [Rust] Fixed performance issue in mutableBuffer.
Well, it broke master... 🤣 @Dandandan already has a fix https://github.com/apache/arrow/pull/9269 💯
comment created time in an hour
pull request commentapache/arrow
ARROW-11108: [Rust] Fixed performance issue in mutableBuffer.
🎉
comment created time in an hour
pull request commentapache/arrow
ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error
This has an unrelated change on a submodule?
comment created time in an hour
pull request commentapache/arrow
ARROW-11045: [Rust] Fix performance issues of allocator
Likely. It was kind of expected, as it made some backward-incompatible changes. I was trying to merge it first to avoid breakage, but I guess I was not fast enough for the speed at which PRs are merged into master after the green light on the mailing list :P
comment created time in an hour
pull request commentapache/arrow
ARROW-11045: [Rust] Fix performance issues of allocator
@jorgecarleitao I think this PR broke master:
   --> datafusion/src/physical_plan/parquet.rs:712:25
    |
712 |             data_buffer.resize(data_buffer.len() + data_size);
    |                         ^^^^^^ ----------------------------- supplied 1 argument
    |                         |
    |                         expected 2 arguments
    |
note: associated function defined here
   --> rust/arrow/src/buffer.rs:833:12
    |
833 |     pub fn resize(&mut self, new_len: usize, value: u8) {
comment created time in an hour
Pull request review commentapache/arrow
ARROW-11320: [C++] Try to strengthen temporary dir creation
std::string MakeRandomName(int num_chars) {
 }  // namespace
 Result<std::unique_ptr<TemporaryDir>> TemporaryDir::Make(const std::string& prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+    std::string suffix = MakeRandomName(kNumChars);
+    return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =
+      [&](const NativePathString& base_dir) -> Result<std::unique_ptr<TemporaryDir>> {
+    Status st;
+    for (int attempt = 0; attempt < 3; ++attempt) {
+      PlatformFilename fn(base_dir + kNativeSep + base_name + kNativeSep);
+      auto result = CreateDir(fn);
+      if (!result.ok()) {
+        // Probably a permissions error or a non-existing base_dir
+        return nullptr;
+      }
+      if (*result) {
+        return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+      }
+      // The random name already exists in base_dir, try with another name
+      st = Status::IOError("Path already exists: '", fn.ToString(), "'");
+      ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
+    }
+    return st;
+  };
+
+  ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
   auto base_dirs = GetPlatformTemporaryDirs();
   DCHECK_NE(base_dirs.size(), 0);
-  auto st = Status::OK();
-  for (const auto& p : base_dirs) {
-    PlatformFilename fn(p + kNativeSep + base_name + kNativeSep);
-    auto result = CreateDir(fn);
-    if (!result.ok()) {
-      st = result.status();
-      continue;
-    }
-    if (!*result) {
-      // XXX Should we retry with another random name?
-      return Status::IOError("Path already exists: '", fn.ToString(), "'");
-    } else {
-      return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+  for (const auto& base_dir : base_dirs) {
+    ARROW_ASSIGN_OR_RAISE(auto ptr, TryCreatingDirectory(base_dir));
The way it is at the moment, you will try the next directory if you get a permissions error or a non-existing directory, but you won't try the next directory if you tried three times and failed. I think this is probably ok, just making sure this is the behavior you want.
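The control flow this comment describes can be modeled in a few lines. This is only a sketch, not the Arrow code: `CreateOutcome`, `MakeResult`, `MakeTemporaryDir` and the `try_create` callback are hypothetical stand-ins for `CreateDir` and `TemporaryDir::Make`, the returned path is a placeholder, and C++17 is assumed for `std::optional`.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the three outcomes of CreateDir in the diff.
enum class CreateOutcome { kCreated, kAlreadyExists, kError };

// Result of the whole search: a created path, or nothing plus a flag saying
// whether the search stopped early because of repeated name collisions.
struct MakeResult {
  std::optional<std::string> path;
  bool aborted_on_collisions = false;
};

// try_create(base_dir, attempt) simulates CreateDir for one candidate name.
MakeResult MakeTemporaryDir(
    const std::vector<std::string>& base_dirs,
    const std::function<CreateOutcome(const std::string&, int)>& try_create) {
  for (const auto& base_dir : base_dirs) {
    bool skip_dir = false;
    for (int attempt = 0; attempt < 3 && !skip_dir; ++attempt) {
      switch (try_create(base_dir, attempt)) {
        case CreateOutcome::kCreated:
          // Placeholder path; the real code builds a random name.
          return {base_dir + "/tmp-" + std::to_string(attempt), false};
        case CreateOutcome::kError:
          // Permissions error or missing base_dir: move on to the next one.
          skip_dir = true;
          break;
        case CreateOutcome::kAlreadyExists:
          // Name collision: retry in the same base_dir with a new name.
          break;
      }
    }
    if (!skip_dir) {
      // Three collisions in a row: give up without trying other base_dirs.
      return {std::nullopt, true};
    }
  }
  // Every base_dir reported an error.
  return {std::nullopt, false};
}
```

In this model a failing first directory falls through to the second one, while three collisions in the first directory end the search immediately, which is the asymmetry the comment asks about.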
comment created time in an hour
Pull request review commentapache/arrow
ARROW-11320: [C++] Try to strengthen temporary dir creation
Result<SignalHandler> SetSignalHandler(int signum, const SignalHandler& handler)
 namespace {
+int64_t GetPid() {
+#ifdef _WIN32
+  return GetCurrentProcessId();
+#else
+  return getpid();
+#endif
+}
+
 std::mt19937_64 GetSeedGenerator() {
   // Initialize Mersenne Twister PRNG with a true random seed.
+  // Make sure to mix in process id to minimize risks of clashes when parallel testing.
 #ifdef ARROW_VALGRIND
   // Valgrind can crash, hang or enter an infinite loop on std::random_device,
   // use a crude initializer instead.
-  // Make sure to mix in process id to avoid clashes when parallel testing.
   const uint8_t dummy = 0;
   ARROW_UNUSED(dummy);
   std::mt19937_64 seed_gen(reinterpret_cast<uintptr_t>(&dummy) ^
-                           static_cast<uintptr_t>(getpid()));
+                           static_cast<uintptr_t>(GetPid()));
 #else
   std::random_device true_random;
   std::mt19937_64 seed_gen(static_cast<uint64_t>(true_random()) ^
-                           (static_cast<uint64_t>(true_random()) << 32));
+                           (static_cast<uint64_t>(true_random()) << 32) ^
+                           (static_cast<uint64_t>(GetPid()) << 17));
Why `<< 17`? Won't this leave the last 17 bits as 0? It appears PID is at least 32 bits.
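A small sketch of the bit arithmetic behind this question. `MixSeed` and `kLow17` are hypothetical names that mirror the seed expression in the diff, and the two random draws are assumed to be 32 bits wide. The shifted pid contributes zeros to the low 17 bits, but because the terms are combined with XOR, those bits of the final seed still come from the first random draw.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the seed expression in the diff:
// two 32-bit random draws combined with the process id shifted left by 17.
constexpr std::uint64_t MixSeed(std::uint32_t r1, std::uint32_t r2,
                                std::uint64_t pid) {
  return static_cast<std::uint64_t>(r1) ^
         (static_cast<std::uint64_t>(r2) << 32) ^ (pid << 17);
}

// Mask selecting the low 17 bits of a 64-bit value.
constexpr std::uint64_t kLow17 = (std::uint64_t{1} << 17) - 1;

// The shifted pid alone contributes nothing to the low 17 bits...
static_assert((MixSeed(0, 0, 0xFFFFFFFFu) & kLow17) == 0,
              "low bits of the pid term are zero");
// ...but after the XOR those bits are taken from the first random draw.
static_assert((MixSeed(0x1ABCDu, 0, 0xFFFFFFFFu) & kLow17) == 0x1ABCDu,
              "low bits come from r1");
```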
comment created time in an hour
Pull request review commentapache/arrow
ARROW-11320: [C++] Try to strengthen temporary dir creation
std::string MakeRandomName(int num_chars) {
 }  // namespace
 Result<std::unique_ptr<TemporaryDir>> TemporaryDir::Make(const std::string& prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+    std::string suffix = MakeRandomName(kNumChars);
+    return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =
Why not simply use static/anon-namespace functions for TryCreatingDirectory and MakeBaseName? MakeRandomName is already one.
comment created time in 2 hours
Pull request review commentapache/arrow
ARROW-11320: [C++] Try to strengthen temporary dir creation
std::string MakeRandomName(int num_chars) {
 }  // namespace
 Result<std::unique_ptr<TemporaryDir>> TemporaryDir::Make(const std::string& prefix) {
-  std::string suffix = MakeRandomName(8);
+  const int kNumChars = 8;
+
   NativePathString base_name;
-  ARROW_ASSIGN_OR_RAISE(base_name, StringToNative(prefix + suffix));
+
+  auto MakeBaseName = [&]() {
+    std::string suffix = MakeRandomName(kNumChars);
+    return StringToNative(prefix + suffix);
+  };
+
+  auto TryCreatingDirectory =
+      [&](const NativePathString& base_dir) -> Result<std::unique_ptr<TemporaryDir>> {
+    Status st;
+    for (int attempt = 0; attempt < 3; ++attempt) {
+      PlatformFilename fn(base_dir + kNativeSep + base_name + kNativeSep);
+      auto result = CreateDir(fn);
+      if (!result.ok()) {
+        // Probably a permissions error or a non-existing base_dir
+        return nullptr;
+      }
+      if (*result) {
+        return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+      }
+      // The random name already exists in base_dir, try with another name
+      st = Status::IOError("Path already exists: '", fn.ToString(), "'");
+      ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
+    }
+    return st;
+  };
+
+  ARROW_ASSIGN_OR_RAISE(base_name, MakeBaseName());
   auto base_dirs = GetPlatformTemporaryDirs();
   DCHECK_NE(base_dirs.size(), 0);
-  auto st = Status::OK();
-  for (const auto& p : base_dirs) {
-    PlatformFilename fn(p + kNativeSep + base_name + kNativeSep);
-    auto result = CreateDir(fn);
-    if (!result.ok()) {
-      st = result.status();
-      continue;
-    }
-    if (!*result) {
-      // XXX Should we retry with another random name?
-      return Status::IOError("Path already exists: '", fn.ToString(), "'");
-    } else {
-      return std::unique_ptr<TemporaryDir>(new TemporaryDir(std::move(fn)));
+  for (const auto& base_dir : base_dirs) {
+    ARROW_ASSIGN_OR_RAISE(auto ptr, TryCreatingDirectory(base_dir));
+    if (ptr) {
+      return std::move(ptr);
Why are you applying `std::move` to a return value?
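One possible reason, sketched under assumptions: the local is a `std::unique_ptr`, while the function returns a `Result<...>`, so the return statement performs a conversion instead of returning the local's own type. `Holder` and `MakeHolder` below are hypothetical stand-ins, not the real `arrow::Result`. As far as I know, compilers that predate the resolution of CWG issue 1579 treat the local as an lvalue in such a converting return, which fails for a move-only type, whereas newer compilers accept a plain `return ptr;`.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for arrow::Result<T>: implicitly constructible
// from a T rvalue, like a value-or-error wrapper.
template <typename T>
class Holder {
 public:
  Holder(T&& value) : value_(std::move(value)) {}  // implicit on purpose
  T& value() { return value_; }

 private:
  T value_;
};

Holder<std::unique_ptr<int>> MakeHolder(int v) {
  std::unique_ptr<int> ptr(new int(v));
  // The local's type differs from the return type, so this return converts
  // unique_ptr -> Holder. The explicit move is valid on any C++11 compiler;
  // `return ptr;` depends on the newer implicit-move rules.
  return std::move(ptr);
}
```

When the local and the return type are the same, the explicit move is unnecessary and can inhibit copy elision, which is presumably what prompted the question.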
comment created time in an hour
push eventapache/arrow
commit sha 4a6eb19ff69737572cb0e3dec45eb624e71c20d3
ARROW-11268: [Rust][DataFusion] MemTable::load output partition support
I think the feature to be able to repartition an in-memory table is useful, as the repartitioning only needs to be applied once, and repartition itself is cheap (at the same node). Doing this when loading data is very useful for in-memory analytics as we can benefit from multiple cores after loading the data.
The speedup from repartitioning is very big (mainly on aggregates), on my 8-core machine: ~5-7x on query 1 and 12 versus a single partition, and a smaller (~30%) difference for query 5 when using 16 partitions. q1/q12 also have very high cpu utilization.
@jorgecarleitao maybe this is of interest to you, as you mentioned you are looking into multi-threading. I think this would be a "high level" way to get more parallelism, also in the logical plan. I think in some optimizer rules and/or dynamically we can do repartitions, similar to what's described here https://issues.apache.org/jira/browse/ARROW-9464
Benchmarks after repartitioning (16 partitions):
PR (16 partitions)
```
Query 12 iteration 0 took 33.9 ms
Query 12 iteration 1 took 34.3 ms
Query 12 iteration 2 took 36.9 ms
Query 12 iteration 3 took 33.6 ms
Query 12 iteration 4 took 35.1 ms
Query 12 iteration 5 took 38.8 ms
Query 12 iteration 6 took 35.8 ms
Query 12 iteration 7 took 34.4 ms
Query 12 iteration 8 took 34.2 ms
Query 12 iteration 9 took 35.3 ms
Query 12 avg time: 35.24 ms
```
Master (1 partition):
```
Query 12 iteration 0 took 245.6 ms
Query 12 iteration 1 took 246.4 ms
Query 12 iteration 2 took 246.1 ms
Query 12 iteration 3 took 247.9 ms
Query 12 iteration 4 took 246.5 ms
Query 12 iteration 5 took 248.2 ms
Query 12 iteration 6 took 247.8 ms
Query 12 iteration 7 took 246.4 ms
Query 12 iteration 8 took 246.6 ms
Query 12 iteration 9 took 246.5 ms
Query 12 avg time: 246.79 ms
```
PR (16 partitions):
```
Query 1 iteration 0 took 138.6 ms
Query 1 iteration 1 took 142.2 ms
Query 1 iteration 2 took 125.8 ms
Query 1 iteration 3 took 102.4 ms
Query 1 iteration 4 took 105.9 ms
Query 1 iteration 5 took 107.0 ms
Query 1 iteration 6 took 109.3 ms
Query 1 iteration 7 took 109.9 ms
Query 1 iteration 8 took 108.8 ms
Query 1 iteration 9 took 112.0 ms
Query 1 avg time: 116.19 ms
```
Master (1 partition):
```
Query 1 iteration 0 took 640.6 ms
Query 1 iteration 1 took 640.0 ms
Query 1 iteration 2 took 632.9 ms
Query 1 iteration 3 took 634.6 ms
Query 1 iteration 4 took 630.7 ms
Query 1 iteration 5 took 630.7 ms
Query 1 iteration 6 took 631.9 ms
Query 1 iteration 7 took 635.5 ms
Query 1 iteration 8 took 639.0 ms
Query 1 iteration 9 took 638.3 ms
Query 1 avg time: 635.43 ms
```
PR (16 partitions)
```
Query 5 iteration 0 took 465.8 ms
Query 5 iteration 1 took 428.0 ms
Query 5 iteration 2 took 435.0 ms
Query 5 iteration 3 took 407.3 ms
Query 5 iteration 4 took 435.7 ms
Query 5 iteration 5 took 437.4 ms
Query 5 iteration 6 took 411.2 ms
Query 5 iteration 7 took 432.0 ms
Query 5 iteration 8 took 436.8 ms
Query 5 iteration 9 took 435.6 ms
Query 5 avg time: 432.47 ms
```
Master (1 partition)
```
Query 5 iteration 0 took 660.6 ms
Query 5 iteration 1 took 634.4 ms
Query 5 iteration 2 took 626.4 ms
Query 5 iteration 3 took 628.0 ms
Query 5 iteration 4 took 635.3 ms
Query 5 iteration 5 took 631.1 ms
Query 5 iteration 6 took 631.3 ms
Query 5 iteration 7 took 639.4 ms
Query 5 iteration 8 took 634.3 ms
Query 5 iteration 9 took 639.0 ms
Query 5 avg time: 635.97 ms
```
Closes #9214 from Dandandan/mem_table_repartition
Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
push time in an hour
PR closed apache/arrow
I think the feature to be able to repartition an in-memory table is useful, as the repartitioning only needs to be applied once, and repartition itself is cheap (at the same node). Doing this when loading data is very useful for in-memory analytics as we can benefit from multiple cores after loading the data.
The speedup from repartitioning is very big (mainly on aggregates), on my 8-core machine: ~5-7x on query 1 and 12 versus a single partition, and a smaller (~30%) difference for query 5 when using 16 partitions. q1/q12 also have very high cpu utilization.
@jorgecarleitao maybe this is of interest to you, as you mentioned you are looking into multi-threading. I think this would be a "high level" way to get more parallelism, also in the logical plan. I think in some optimizer rules and/or dynamically we can do repartitions, similar to what's described here https://issues.apache.org/jira/browse/ARROW-9464
Benchmarks after repartitioning (16 partitions):
PR (16 partitions)
Query 12 iteration 0 took 33.9 ms
Query 12 iteration 1 took 34.3 ms
Query 12 iteration 2 took 36.9 ms
Query 12 iteration 3 took 33.6 ms
Query 12 iteration 4 took 35.1 ms
Query 12 iteration 5 took 38.8 ms
Query 12 iteration 6 took 35.8 ms
Query 12 iteration 7 took 34.4 ms
Query 12 iteration 8 took 34.2 ms
Query 12 iteration 9 took 35.3 ms
Query 12 avg time: 35.24 ms
Master (1 partition):
Query 12 iteration 0 took 245.6 ms
Query 12 iteration 1 took 246.4 ms
Query 12 iteration 2 took 246.1 ms
Query 12 iteration 3 took 247.9 ms
Query 12 iteration 4 took 246.5 ms
Query 12 iteration 5 took 248.2 ms
Query 12 iteration 6 took 247.8 ms
Query 12 iteration 7 took 246.4 ms
Query 12 iteration 8 took 246.6 ms
Query 12 iteration 9 took 246.5 ms
Query 12 avg time: 246.79 ms
PR (16 partitions):
Query 1 iteration 0 took 138.6 ms
Query 1 iteration 1 took 142.2 ms
Query 1 iteration 2 took 125.8 ms
Query 1 iteration 3 took 102.4 ms
Query 1 iteration 4 took 105.9 ms
Query 1 iteration 5 took 107.0 ms
Query 1 iteration 6 took 109.3 ms
Query 1 iteration 7 took 109.9 ms
Query 1 iteration 8 took 108.8 ms
Query 1 iteration 9 took 112.0 ms
Query 1 avg time: 116.19 ms
Master (1 partition):
Query 1 iteration 0 took 640.6 ms
Query 1 iteration 1 took 640.0 ms
Query 1 iteration 2 took 632.9 ms
Query 1 iteration 3 took 634.6 ms
Query 1 iteration 4 took 630.7 ms
Query 1 iteration 5 took 630.7 ms
Query 1 iteration 6 took 631.9 ms
Query 1 iteration 7 took 635.5 ms
Query 1 iteration 8 took 639.0 ms
Query 1 iteration 9 took 638.3 ms
Query 1 avg time: 635.43 ms
PR (16 partitions)
Query 5 iteration 0 took 465.8 ms
Query 5 iteration 1 took 428.0 ms
Query 5 iteration 2 took 435.0 ms
Query 5 iteration 3 took 407.3 ms
Query 5 iteration 4 took 435.7 ms
Query 5 iteration 5 took 437.4 ms
Query 5 iteration 6 took 411.2 ms
Query 5 iteration 7 took 432.0 ms
Query 5 iteration 8 took 436.8 ms
Query 5 iteration 9 took 435.6 ms
Query 5 avg time: 432.47 ms
Master (1 partition)
Query 5 iteration 0 took 660.6 ms
Query 5 iteration 1 took 634.4 ms
Query 5 iteration 2 took 626.4 ms
Query 5 iteration 3 took 628.0 ms
Query 5 iteration 4 took 635.3 ms
Query 5 iteration 5 took 631.1 ms
Query 5 iteration 6 took 631.3 ms
Query 5 iteration 7 took 639.4 ms
Query 5 iteration 8 took 634.3 ms
Query 5 iteration 9 took 639.0 ms
Query 5 avg time: 635.97 ms
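The quoted speedups can be checked against the average times in the listing above. The sketch below only does that arithmetic; `Speedup`, `TimeSaved` and the constant names are hypothetical, and the inputs are the "avg time" values copied from the benchmark output.

```cpp
// Ratio of baseline time to candidate time; values above 1 mean the
// candidate is faster. Both arguments must use the same unit.
constexpr double Speedup(double baseline_ms, double candidate_ms) {
  return baseline_ms / candidate_ms;
}

// Fraction of the baseline run time that the candidate saves.
constexpr double TimeSaved(double baseline_ms, double candidate_ms) {
  return 1.0 - candidate_ms / baseline_ms;
}

// Master (1 partition) versus PR (16 partitions), average times in ms.
constexpr double kQuery12Speedup = Speedup(246.79, 35.24);   // about 7.0x
constexpr double kQuery1Speedup = Speedup(635.43, 116.19);   // about 5.5x
constexpr double kQuery5Speedup = Speedup(635.97, 432.47);   // about 1.47x
constexpr double kQuery5Saved = TimeSaved(635.97, 432.47);   // about 0.32
```

These figures agree with the "~5-7x" claim for queries 1 and 12, and the roughly 32% reduction in run time for query 5 matches the "~30%" figure.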
pr closed time in an hour
push eventapache/arrow
commit sha b448de78cd0745b12dfb5156aaaff67f75bdee9a
ARROW-11216: [Rust] add doc example for StringDictionaryBuilder
I find myself trying to remember the exact incantation to create a `StringDictionaryBuilder` so I figured I would add it as a doc example
Closes #9169 from alamb/alamb/doc-example
Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
push time in an hour
PR closed apache/arrow
I find myself trying to remember the exact incantation to create a `StringDictionaryBuilder` so I figured I would add it as a doc example
pr closed time in an hour
issue commentapache/arrow
Integrate CUDA memory capacity to plasma storage
Also, if I want to integrate only plasma storage into my project, what's the suggested way to do so? c_glib or some other way?
comment created time in an hour