Dave Hirschfeld (dhirschfeld) · Australia · https://dhirschfeld.github.io · Engineer, Physicist & Quantitative Developer working in the energy industry

dhirschfeld/cython 2

A Python to C compiler

dhirschfeld/2013_fall_ASTR599 0

Content for my Astronomy 599 Course: Intro to scientific computing in Python

dhirschfeld/apispec-feedstock 0

A conda-smithy repository for apispec.

dhirschfeld/arrow 0

Apache Arrow is a columnar in-memory analytics layer designed to accelerate big data. It houses a set of canonical in-memory representations of flat and hierarchical data along with multiple language-bindings for structure manipulation. It also provides IPC and common algorithm implementations.

dhirschfeld/arrow-cpp-feedstock 0

A conda-smithy repository for arrow-cpp.

dhirschfeld/asgiref 0

ASGI in-memory channel layer

dhirschfeld/autograd 0

Efficiently computes derivatives of numpy code.

dhirschfeld/autograd-feedstock 0

A conda-smithy repository for autograd.

dhirschfeld/azcopy-feedstock 0

A conda-smithy repository for azcopy.

push event google/jax

Matthew Johnson

commit sha 9787894d94556c4a3b5c878c00be2c178328fa88

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 8 minutes

push event google/jax

Matthew Johnson

commit sha e9c019e5ff20f8ff0c5192cfc598c1d45c784c98

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 14 minutes

issue comment dask/dask

A concurrent.futures.Executor scheduling interface

I'm wondering which approach is faster. There's a recent SoCC paper that does a similar thing: https://arxiv.org/pdf/2010.07268.pdf and their code: https://github.com/mason-leap-lab/Wukong/tree/socc2020 It makes more extensive changes to the scheduler.

tomwhite

comment created time in 21 minutes

push event google/jax

Matthew Johnson

commit sha 6711bade76cd8466b500bf48e7b0a2a6e11319ea

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 22 minutes

push event google/jax

Matthew Johnson

commit sha 138afd5dda25be5c773e1da5d00c95b0b9ce4c5a

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 2 hours

issue comment google/jax

loops Scope is exponentially slower in 0.2.8 than 0.1.55

Thanks for the clear explanation!

I wonder if this is related to omnistaging, which can lead to more code getting staged out to XLA, which then in turn can hit XLA codegen bugs that lead to exponential compile times.

To check if this is the issue, can you try disabling omnistaging and checking whether the regression is still present? It's as easy as calling jax.config.disable_omnistaging() at the top of your file, or setting the JAX_OMNISTAGING=0 shell environment variable.
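For concreteness, a minimal sketch of both ways to flip the switch (exact behaviour may vary by jax version):

    # Option 1: disable omnistaging programmatically, before any traced code runs
    import jax
    jax.config.disable_omnistaging()

    # Option 2: disable it via the environment instead (set before importing jax)
    #   JAX_OMNISTAGING=0 python my_script.py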

zachary-jablons-okcupid

comment created time in 2 hours

push event google/jax

Peter Hawkins

commit sha dd34d48fd1b6fb0d61f81f7b626f35e9f5c33c26

Fix exception when tokens are used in AD.

view details

jax authors

commit sha bc3cd1286b506803b7d4eafd313be162565ee2d2

Merge pull request #5495 from hawkinsp:tokens
PiperOrigin-RevId: 353283976

view details

Matthew Johnson

commit sha 203af4517b157cb98fd7d0a82e36e25b5b6a1bbb

revive the leak checker, as a debug mode
Co-authored-by: James Bradbury <jekbradbury@google.com>

view details

Matthew Johnson

commit sha 1e729a35222681645ccef5cf73d3574c862f87eb

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 2 hours

push event google/jax

Matthew Johnson

commit sha 043567585d53598cf958d7a3f378feb0facde6fd

revive the leak checker, as a debug mode
Co-authored-by: James Bradbury <jekbradbury@google.com>

view details

Matthew Johnson

commit sha ee1bdb731313cfbfcd75390f94a4c36b9906c212

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 2 hours

push event google/jax

Matthew Johnson

commit sha 7bf1592f742f882e8a217d5a810862bfb4bb6f67

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 2 hours

push event google/jax

Matthew Johnson

commit sha 5830be03e3e1bb4295824f4bc6480518eabae20e

refactor batching transform logic, fix leak checks
See PR description in #5492 for details.
Co-authored-by: Peter Hawkins <phawkins@google.com>

view details

push time in 2 hours

Pull request review comment apache/arrow

ARROW-11270: [Rust] Array slice accessors

 impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
     }
 
     /// Returns the primitive value at index `i`.
-    ///
-    /// Note this doesn't do any bound checking, for performance reason.
-    /// # Safety
-    /// caller must ensure that the passed in offset is less than the array len()
+    #[inline]
     pub fn value(&self, i: usize) -> T::Native {
-        let offset = i + self.offset();
-        unsafe { *self.raw_values.as_ptr().add(offset) }
+        self.values()[i]

#9291 is good progress towards eliminating use of the function.

And certainly, we could 'split' the macro for primitives as a quick fix to get rid of the call to the function. I've been experimenting with an alternative approach that might be a bit more flexible to multiple use cases, described at the bottom of this comment.

I am quite torn about whether I think value should or should not be in the interface.

Reasons to drop value(i) -> T::Native

I think that even if value(i) was dropped from the PrimitiveArray impl's, efficient random access to items without a bounds check can still be achieved through unsafe{*primitive_array.values().get_unchecked(i)} (the extra * because get_unchecked() returns a ref to the value).

I'm not sure I have any example code or measurements on hand to demonstrate it, but I am certain I saw the silently-unsafe implementation x.values().iter().zip(y.values().iter()) (slightly) outperform (0..x.len()).map(|i| (x.value(i), y.value(i))). I believe it was when I was playing with non-simd arithmetic kernels... So that is the root of my hesitancy: I'm worried it doesn't actually avoid any overhead, and could unintentionally lead people away from a more reliable/performant approach, unless there is a context where unsafe{x.value(i)} actually beats the performance of unsafe{*x.values().get_unchecked(i)}.

Reasons to keep value(i) -> T::Native

All other array implementations have value functions as far as I recall, so it is a nice 'consistency'.

In the back of my mind, the biggest argument to keep value(i) is for api consistency... so long term, a 'trait' may be the place where it might fit best? Very roughly, I'm thinking:

trait TypedArrowArray: ArrowArray {
   type RefType;
   fn is_valid(&self, i: usize) -> bool;                          // bounds check
   unsafe fn is_valid_unchecked(&self, i: usize) -> bool;         // no bounds check
   fn value(&self, i: usize) -> Self::RefType;                    // bounds check
   unsafe fn value_unchecked(&self, i: usize) -> Self::RefType;   // no bounds check
   fn iter(&self) -> impl Iterator<Item = Option<Self::RefType>>;
   fn iter_values(&self) -> impl Iterator<Item = Self::RefType>;
}
impl<T: ArrowPrimitiveType> TypedArrowArray for PrimitiveArray<T> { /* RefType = &T::Native, ... */ }
impl<T> TypedArrowArray for GenericListArray<T> { /* RefType = ArrayRef, ... */ }
// and similar for string/binary. ... I am not sure whether struct arrays could fit... Dictionary would not give access to 'keys', only to the values referenced by each key? Union would require some kind of RefType that can downcast into the actual value?

Of course, I am uncertain how much overhead the 'standardization' such a trait impl implies would bring... would any kernels actually benefit from using generic implementations against such an API, or would they always drop down to the concrete type to squeeze out little shortcuts that don't fit in the generic interface? I'm unsure, so I'm (very, very) slowly experimenting...

Summary

So in short, my thoughts are:

  • I think that leaving the value(i) safety consideration out of this PR makes sense. I've rebased to drop it, although I did leave the additional values() test code.
  • Marking it unsafe in the near future is absolutely better than leaving it silently unsafe. The argument that adding bounds checks could silently impact external users is reasonable; marking it unsafe carries the bigger 'warning', so the change isn't missed.
  • Longer term, the options of deprecating it or explicitly moving it into a trait impl are both contenders in my mind... but neither option is directly relevant to this PR.

Let me know if that seems reasonable.

tyrelr

comment created time in 3 hours

PR opened dask/distributed

Refactor `task_groups` & `task_prefixes`

Moves task_groups and task_prefixes to SchedulerState, where they are type annotated, and then uses them through parent within Scheduler. This allows Cython to recognize that these are Python dicts and to optimize calls and operations on them.
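A rough sketch of the pattern being described (illustrative only, not the actual scheduler code; group_count is a made-up helper): the annotated dicts live on SchedulerState, and Scheduler reaches them through a parent alias typed as SchedulerState:

    from typing import cast

    class SchedulerState:
        _task_groups: dict      # annotated so Cython can treat attribute access as plain dict ops
        _task_prefixes: dict

        def __init__(self):
            self._task_groups = {}
            self._task_prefixes = {}

    class Scheduler(SchedulerState):
        def group_count(self):
            # hypothetical helper; the point is the cast-to-parent access pattern
            parent: SchedulerState = cast(SchedulerState, self)
            return len(parent._task_groups)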

+20 -8

0 comment

1 changed file

pr created time in 3 hours

pull request comment mkdocs/mkdocs

Add warning about CNAME mismatch in gh-deploy

Maybe a name like ignore-cname-check or skip-cname-check

Sounds good. I'll make it ignore-cname-check to match the existing ignore-version option.

But why would a user want the check to be skipped, except when they don't use a custom domain and want to avoid the delay caused by running the check? I'm still not convinced that we need the option.

Hmm. You might be right. I can think of a case where someone has forked a project that's using a custom domain and attempts to publish the docs for their fork, but the CNAME is mismatched. They could skip the check to make the docs publish without a CNAME, or with a different CNAME specified in {docs_dir}/CNAME. But more broadly, I think that if there's a mismatch between the CNAME that GitHub sees and the CNAME the user is trying to publish with, it should fail until that's resolved.

And sorry for my ambiguous use of "warning". As you surmised, it actually is a fatal error.

theacodes

comment created time in 3 hours

Pull request review comment numba/numba

Initial support for selecting the chunk size for parallel regions.

 The report is split into the following sections:
     ``$const58.3 = const(int, 1)`` comes from the source ``b[j + 1]``, the
     number ``1`` is clearly a constant and so can be hoisted out of the loop.
 
+.. _numba-parallel-scheduling:
+
+Scheduling
+==========
+
+By default, Numba divides the iterations of a parallel region into approximately equal
+sized chunks and gives one such chunk to each configured thread.
+(See :ref:`setting_the_number_of_threads`).
+This scheduling approach is equivalent to OpenMP's static schedule with no specified
+chunk size and is appropriate when the work required for each iteration is nearly constant.
+
+Conversely, if the work required per iteration varies significantly then this static
+scheduling approach can lead to load imbalances and longer execution times.  In such cases,
+Numba provides a mechanism to control how many iterations of a parallel region go into
+each chunk.  The number of chunks will then be approximately equal to the number of
+iterations divided by the chunk size.  Numba then gives one such chunk to each configured
+thread as above and when a thread finishes a chunk, Numba gives that thread the next
+available chunk.  This scheduling approach is the equivalent of OpenMP's dynamic scheduling
+option with the specified chunk size.  To minimize execution time, the programmer must pick
+a chunk size that strikes a balance between greater load balancing with smaller chunk
+sizes and less scheduling overhead with larger chunk sizes.
+
+The number of iterations of a parallel region in a chunk is stored as a thread-local
+variable and can be set using
+:func:`numba.set_parallel_chunksize`.  This function takes one integer parameter
+whose value must be greater than
+or equal to 0.  A value of 0 is the default value and instructs Numba to use the
+static scheduling approach above.  Values greater than 0 instruct Numba to use that value
+as the chunk size in the dynamic scheduling approach described above.
+The current value of this thread local variable is used by all subsequent parallel regions
+invoked by this thread.
+The current value of the parallel chunk size can be obtained from
+:func:`numba.get_parallel_chunksize`.
+Both of these functions can be used from standard Python and from Numba jitted functions
+as shown below.  Both invocations of func1 would be executed with a chunk size of 4 whereas
+func2 would use a chunk size of 8.
+
+.. code:: python
+
+    from numba import njit, prange, set_parallel_chunksize, get_parallel_chunksize
+
+    @njit(parallel=True)
+    def func1():
+        for i in prange(n):
+            ...
+
+    @njit(parallel=True)
+    def func2():
+        old_chunksize = get_parallel_chunksize()
+        set_parallel_chunksize(8)
+        for i in prange(n):
+            ...
+        set_parallel_chunksize(old_chunksize)
+
+    old_chunksize = get_parallel_chunksize()
+    set_parallel_chunksize(4)
+    func1()
+    func2()
+    func1()
+    set_parallel_chunksize(old_chunksize)
+
+Since this idiom of saving and restoring is so common, Numba provides the
+:func:`parallel_chunksize` with clause to simplify the idiom.  As shown below,
+this with clause can be invoked from both standard Python and within Numba
+jitted functions.
+
+.. code:: python
+
+    from numba import njit, prange, parallel_chunksize
+
+    @njit(parallel=True)
+    def func1():
+        for i in prange(n):
+            ...
+
+    @njit(parallel=True)
+    def func2():
+        with parallel_chunksize(8):
+            for i in prange(n):
+                ...
+
+    with parallel_chunksize(4):
+        func1()
+        func2()
+        func1()

Done, but I had to accept FreeVars in find_global_name in order for the with-context code to recognize parallel_chunksize.

DrTodd13

comment created time in 3 hours

Pull request review comment numba/numba

Initial support for selecting the chunk size for parallel regions.

 class RangeActual {
         }
         return ret;
     }
+
+    uintp total_size() const {
+        std::vector<intp> per_dim = iters_per_dim();
+        uintp res = 1;
+        for (unsigned i = 0; i < per_dim.size(); ++i) {
+            res *= per_dim[i];
+        }
+        return res;
+    }
 };
 
+extern "C" void set_parallel_chunksize(uintp n) {
+    parallel_chunksize = n;
+}
+
+extern "C" uintp get_parallel_chunksize() {
+    return parallel_chunksize;
+}
+
+extern "C" uintp get_sched_size(uintp num_threads, uintp num_dim, intp *starts, intp *ends) {
+    if (parallel_chunksize == 0) {
+        return num_threads;
+    }
+    RangeActual ra(num_dim, starts, ends);
+    uintp total_work_size = ra.total_size();
+    uintp num_divisions = total_work_size / parallel_chunksize;
+    return num_divisions < num_threads ? num_threads : num_divisions;

Done.

DrTodd13

comment created time in 3 hours

Pull request review comment numba/numba

Initial support for selecting the chunk size for parallel regions.


Done.

DrTodd13

comment created time in 3 hours

Pull request review comment numba/numba

Initial support for selecting the chunk size for parallel regions.


Done.

DrTodd13

comment created time in 3 hours

Pull request review comment numba/numba

Initial support for selecting the chunk size for parallel regions.

     literal_unroll
     get_num_threads
     set_num_threads
+    set_parallel_chunksize
+    get_parallel_chunksize

Added to documentation.

DrTodd13

comment created time in 3 hours

issue opened dask/dask-gateway

Allow only specific JupyterHub users to create clusters

Hello,

We use dask-gateway with JupyterHub (using JupyterHub authentication), both with Kubernetes backends. The JupyterHub deployment uses the Zero to JupyterHub helm chart, along with a custom generic OAuth authenticator.

What we want is to be able to restrict the usage of dask-gateway to a specific subset of the users. Furthermore, it would be good if we could define cluster limits depending on the user group. It seems that this feature either does not exist or is not currently documented.

It is also unclear how user information is passed to the options handler. Is it possible to define which groups each user belongs to in the Z2JH helm chart? Or do the groups need to be passed by the authenticator?

created time in 4 hours

pull request comment dask/distributed

Use `parent._tasks` in heartbeat

Thanks James! 😄

jakirkham

comment created time in 4 hours

Pull request review comment dask/distributed

[WIP] Optimize transitions

 def transition(self, key, finish, *args, **kwargs):
             if ts._state == "forgotten" and ts._group._name in self.task_groups:
                 # Remove TaskGroup if all tasks are in the forgotten state
                 tg: TaskGroup = ts._group
-                if not any([tg._states.get(s) for s in ALL_TASK_STATES]):
+                if not (tg._states.keys() & ALL_TASK_STATES):

In addition to simplifying the code a bit, this ends up being a bit more performant

In [1]: d1 = {i: i for i in range(1_000)}
   ...: s2 = {i for i in range(1_000, 2_000)}

In [2]: %timeit any([d1.get(k) for k in s2])
177 µs ± 456 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit d1.keys() & s2
15.8 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Plus Cython can replace & with a quick C call to PyNumber_And.

jakirkham

comment created time in 4 hours

Pull request review comment dask/distributed

[WIP] Optimize transitions

 def transition(self, key, finish, *args, **kwargs):
             if ts._state == "forgotten" and ts._group._name in self.task_groups:
                 # Remove TaskGroup if all tasks are in the forgotten state
                 tg: TaskGroup = ts._group
-                if not any([tg._states.get(s) for s in ALL_TASK_STATES]):
+                if not (tg._states.keys() & ALL_TASK_STATES):

As keys views are set-like objects, things like intersections can be computed with sets directly. More details are in this SO answer, along with a link to the Python docs, where they note this:

Keys views are set-like since their entries are unique and hashable. If all values are hashable, so that (key, value) pairs are unique and hashable, then the items view is also set-like. (Values views are not treated as set-like since the entries are generally not unique.) For set-like views, all of the operations defined for the abstract base class collections.abc.Set are available (for example, ==, <, or ^).
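A quick illustration (not from the PR itself):

    d = {"a": 1, "b": 2, "c": 3}
    s = {"b", "c", "x"}

    d.keys() & s        # {'b', 'c'} -- intersection computed directly on the keys view
    d.keys() - s        # {'a'}
    not (d.keys() & s)  # False -> at least one key from s is present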

jakirkham

comment created time in 4 hours

push event dask/dask

James Bourbeau

commit sha 0b33708f4627c3f9c7613c9554d1033db147d4b0

Add cytoolz back to CI environment (#7103)

view details

push time in 4 hours

PR merged dask/dask

Add cytoolz back to CI environment

Following up on https://github.com/dask/dask/pull/7069, now that https://github.com/conda-forge/cytoolz-feedstock/issues/36 has been resolved we can add cytoolz back to the CI environment.

  • [ ] Tests added / passed
  • [ ] Passes black dask / flake8 dask
+0 -5

0 comment

1 changed file

jrbourbeau

pr closed time in 4 hours

push event dask/distributed

jakirkham

commit sha 777d48e977021253f071bce038a7eac40f468c82

Use `parent._tasks` in heartbeat (#4450)
Make sure we grab the typed `parent._tasks` in heartbeat. This benefits from the type annotation of this attribute.

view details

push time in 4 hours

PR merged dask/distributed

Use `parent._tasks` in heartbeat

Make sure we grab the typed parent._tasks in heartbeat. This benefits from the type annotation of this attribute.

+1 -1

0 comment

1 changed file

jakirkham

pr closed time in 4 hours

issue comment dask/distributed

Handling custom serialization with MsgPack directly

Yeah, I was mostly curious whether the extract_serialize Cythonization would be easy to test and would show some notable improvement. Agreed that if it's not easy, we can just ignore it.

Looking at transitions atm.

@madsbk, if you have some time on Monday, maybe we can chat about this? 🙂

jakirkham

comment created time in 5 hours

issue comment dask/distributed

Handling custom serialization with MsgPack directly

@jakirkham maybe spending time on Cythonization is the wrong move if this can be done easily.

jakirkham

comment created time in 5 hours

pull request comment dask/distributed

[WIP] Build `serialize` with Cython

We could also simplify the code further such that Serialize and Serialized don't need to move (like checking typ_v.__name__ instead). Just in case that is causing issues.
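i.e. something along the lines of the following (illustrative only; classify is a made-up helper, not code from the PR):

    def classify(v):
        # dispatch on the class name rather than importing the classes,
        # so Serialize/Serialized don't have to move into the Cythonized module
        typ_v = type(v)
        if typ_v.__name__ == "Serialize":
            return "serialize"
        if typ_v.__name__ == "Serialized":
            return "serialized"
        return "plain"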

jakirkham

comment created time in 5 hours

pull request comment dask/distributed

[WIP] Build `serialize` with Cython

Yeah, I was thinking something similar, but actually just lumping them into scheduler.py in case the ImportError turns up (though maybe that isn't the concern).

jakirkham

comment created time in 5 hours
