
phillc73/abettor 37

An R package for connecting to the online betting exchange Betfair, via their API-NG product, using JSON-RPC.

phillc73/duckdf 13

🦆 SQL for R dataframes, with ducks

phillc73/backblazer 7

R package for Backblaze's B2 API

phillc73/caRbine 4

A set of R scripts for automated horse racing actions

phillc73/pinhooker 2

An R Package to compile data sets of historic results from thoroughbred sales

phillc73/racing_data 1

Python horse racing class library

phillc73/resourcer 1

R Package for Team Resource Management

phillc73/awesome-public-datasets 0

An awesome list of high-quality open datasets in public domains (on-going).

phillc73/awesome-R 0

A curated list of awesome R packages, frameworks and software.

issue comment janet-lang/janet

Janet hygiene is not hygiene

Well, I'm not sure if I know exactly what the offending language on the website is, anyway. The janet-lang.org repo has the following para:

After expansion, @code`y` wrongly refers to the @code`x` inside the macro (which is bound to 8) rather than the @code`x` defined to be 10. The problem is the reuse of the symbol @code`x` inside the macro, which overshadowed the original binding. This problem is called the @link[https://en.wikipedia.org/wiki/Hygienic_macro#The_hygiene_problem]{hygiene problem} and is well known in many programming languages. Some languages provide complicated solutions to this problem, but Janet opts for a much simpler—if not primitive—solution.

Which is to say, it links to the Hygienic Macro wikipedia page to give a gloss on what the hygiene problem is - but it doesn't claim that the macro system in Janet is hygienic. Is there elsewhere that it makes that claim?

If you link to a section called "The hygiene problem" and you claim to have a solution to it, it seems to me that you are claiming the macro system is (when properly used with gensym) hygienic. But Janet (and yes, Common Lisp as well) doesn't in fact offer a complete solution to the hygiene problem. What is more, the workarounds in Common Lisp that reduce the impact of the hygiene problem, namely being a Lisp-2 and storing symbols in packages, are not present in Janet.

I recommend that the section title be changed to "Variable Capture" and the following two sentences be removed:

This problem is called the
@link[https://en.wikipedia.org/wiki/Hygienic_macro#The_hygiene_problem]{hygiene problem}
and is well known in many programming languages.
Some languages provide complicated solutions to this problem, but Janet opts for a much
simpler—if not primitive—solution.

That is clearer, I think, and will dispose of this issue. There is also a typo in this section: change "more useful that a macro" to "more useful than a macro".

johnwcowan

comment created time in 22 minutes

push event att/rcloud

Gordon Woodhull

commit sha c0c4f33a346ae87dad3cb7616d19727e53d24dae

default rcloud.jupyter.python.path to avoid miniconda install fixes #2702

view details

push time in 30 minutes

issue closed att/rcloud

reticulate uses unsolicited interactive prompt to ask for miniconda installation in `rcloud.jupyter` initialization blocking RCloud startup

Recent versions of reticulate will try to force the user to install reticulate's own copy of miniconda if only a system Python is installed (even one that works perfectly). This is done via an interactive prompt that cannot be disabled. That prompt will block RCloud at startup when the rcloud.jupyter language is initialized.

RStudio provides no way to skip that prompt, so the only options are either:

  1. make sure the rcloud.jupyter.python.path configuration directive is set in rcloud.conf (which results in a call to use_python() and skips the prompt)
  2. copy/paste the reticulate code that sets the miniconda flag to "no" (see the sketch below)
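
For option 2, a minimal sketch of the idea - the environment variable name is my assumption from reading the reticulate sources, so verify it before relying on it:

# Assumption: reticulate consults RETICULATE_MINICONDA_ENABLED before prompting;
# setting it to "FALSE" should suppress the miniconda installation prompt.
# This must run before rcloud.jupyter initializes reticulate.
Sys.setenv(RETICULATE_MINICONDA_ENABLED = "FALSE")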

closed time in 30 minutes

s-u

issue comment att/rcloud

reticulate uses unsolicited interactive prompt to ask for miniconda installation in `rcloud.jupyter` initialization blocking RCloud startup

Related: use_python() needs required=TRUE, or reticulate may still pick a different Python, perhaps the one with the latest numpy: a67d96501f4c

I don't want to get into which version of Python is best (look for the one that has Jupyter installed? ugh), so I'm just defaulting it to /usr/bin/python3 in rcloud.conf.samp. I think this is the most common case, and the recommended one these days.
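
In R terms the fix amounts to the following sketch (the path is just what rcloud.jupyter.python.path defaults to in rcloud.conf.samp):

# required = TRUE makes reticulate error out if this interpreter is not found,
# instead of treating the path as a hint and scanning for other versions.
reticulate::use_python("/usr/bin/python3", required = TRUE)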

s-u

comment created time in 35 minutes

push event att/rcloud

Gordon Woodhull

commit sha a67d96501f4c45acfa8268b7074110941dd5e443

force reticulate to use the specified python version

thanks @salivian!

> Is this version of Python required? If TRUE then an error occurs if it's not located.
> Otherwise, the version is taken as a hint only and scanning for other versions will still proceed.

https://rstudio.github.io/reticulate/reference/use_python.html

view details

push time in an hour

issue comment janet-lang/janet

Janet hygiene is not hygiene

You can manually escape functions and even other macros in the macro definition:

(defmacro my-mac
  "Emits (+ x 1)."
  [x]
  ~(,+ ,x 1))

In the above macro definition, + is properly escaped - the functional value of + is used, not the symbol '+. Nested macro "hygiene" is a bit more difficult but still doable - macros live in the function namespace, so we can manually 'apply' them (if needed). Internal macros in the core often do this.

(defmacro my-mac-uses-let
  "A macro evaluates body where `x` is defined to be 10."
  [& body]
  (apply let ['x 10] body))

Now you could of course argue that this isn't "hygiene", and I don't even disagree. However, it is true that macros can be written to be hygienic - and I would say this is true in Common Lisp as well.

Either way, this is a WNF.

johnwcowan

comment created time in an hour

issue comment rstudio/blogdown

netlify build failed to extract shortcode: template for shortcode "blogdown/postref" not found

You're welcome!

We built the check functions to help users, so it is cool to know that they put you on the right track when an issue comes up!

Definitely very helpful!

enixam

comment created time in an hour

issue comment glin/reactable

Cross-table with the functionalities of reactable and DT

Agreed. Server-side rendering is the only thing keeping me from using this instead of DT.

GitHunter0

comment created time in 2 hours

issue opened phillc73/abettor

increase getAccountStatement limit above 100 records

The maximum number of records that can be returned from a single API call is 100. This means users of this function have to call it repeatedly to retrieve their full account statement. It would be preferable if the function handled this itself and returned all available records by default.

I have written some code to address this and will do a pull request in the next day or so.
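
Roughly, the shape of what I have in mind - a sketch only, and the paging argument names here (fromRecord, recordCount) are Betfair's, not necessarily abettor's actual parameter names:

# Sketch: page through the account statement 100 records at a time.
# Treat the exact getAccountStatement() arguments as assumptions.
getFullAccountStatement <- function(...) {
  pages <- list()
  from <- 0
  repeat {
    page <- getAccountStatement(fromRecord = from, recordCount = 100, ...)
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < 100) break  # fewer than 100 rows means the last page
    from <- from + 100
  }
  do.call(rbind, pages)
}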

created time in 2 hours

push event easystats/easystats

runner

commit sha d98ecf0b0fa5f1a9b48c42bc15efc6469ce076b8

Re-build README.Rmd

view details

push time in 2 hours

issue opened nextcloud/deck

Github Integration

Is there some way to integrate Deck with a GitHub repository? So that any issues pop up automatically as a card on the board.

This would be an implementation similar to Git kraken boards: https://www.gitkraken.com/boards

created time in 3 hours

pull request comment JuliaData/DataFrames.jl

implement faster innerjoin

Here are tests for two columns (smaller data, as this is more problematic).

In general it is not bad. What I do in my PR is allocate a vector of tuples from the tuple of key vectors, and then things are fast. Creating this vector uses memory (this is bad), but it is relatively fast.

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id1 = sort!(string.(1:10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(string.(1:10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  1.643699 seconds (189 allocations: 677.330 MiB, 17.62% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
 20.949350 seconds (189 allocations: 214.007 MiB)

julia> df1 = DataFrame(id1 = shuffle!(string.(1:10^6)));

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = shuffle!(string.(1:10^7)));

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  4.515834 seconds (198 allocations: 715.477 MiB, 29.89% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
 24.911806 seconds (198 allocations: 252.154 MiB, 4.40% gc time)

julia> df1 = DataFrame(id1 = sort!(rand(string.(1:10^6), 10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(rand(string.(1:10^7), 10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  1.730993 seconds (189 allocations: 677.330 MiB, 20.93% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
 20.352023 seconds (189 allocations: 214.007 MiB)

julia> df1 = DataFrame(id1 = rand(string.(1:10^6), 10^6), id2 = 1:10^6);

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = rand(string.(1:10^7), 10^7), id2 = 1:10^7);

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  6.393938 seconds (200 allocations: 674.696 MiB, 36.35% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  4.530607 seconds (200 allocations: 261.871 MiB)

this PR

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id1 = sort!(string.(1:10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(string.(1:10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  0.419206 seconds (167 allocations: 183.118 MiB)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  0.650134 seconds (167 allocations: 183.118 MiB, 38.94% gc time)

julia> df1 = DataFrame(id1 = shuffle!(string.(1:10^6)));

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = shuffle!(string.(1:10^7)));

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  6.374714 seconds (265 allocations: 296.956 MiB, 27.80% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  4.678481 seconds (265 allocations: 296.956 MiB)

julia> df1 = DataFrame(id1 = sort!(rand(string.(1:10^6), 10^6)), id2 = 1:10^6);

julia> df2 = DataFrame(id1 = sort!(rand(string.(1:10^7), 10^7)), id2 = 1:10^7);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  0.882602 seconds (167 allocations: 183.118 MiB, 44.24% gc time)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  0.437506 seconds (167 allocations: 183.118 MiB)

julia> df1 = DataFrame(id1 = rand(string.(1:10^6), 10^6), id2 = 1:10^6);

julia> df1.id2 = parse.(Int, df1.id1);

julia> df2 = DataFrame(id1 = rand(string.(1:10^7), 10^7), id2 = 1:10^7);

julia> df2.id2 = parse.(Int, df2.id1);

julia> @time innerjoin(df1, df2, on=[:id1, :id2]);
  5.791302 seconds (1.27 M allocations: 337.006 MiB)

julia> @time innerjoin(df2, df1, on=[:id1, :id2]);
  7.345807 seconds (1.27 M allocations: 337.006 MiB, 22.06% gc time)
bkamins

comment created time in 4 hours

fork jeroen/gdal

GDAL is an open source X/MIT licensed translator library for raster and vector geospatial data formats.

https://gdal.org

fork in 4 hours

pull request comment JuliaData/DataFrames.jl

implement faster innerjoin

and if so you can work directly on the refarrays. That should cover most cases

I do not think this would be a common case as most likely you are joining columns coming from different sources.

this PR is always as fast as main or faster

Not exactly - in some cases it is a bit slower as it allocates a lot more than the old one.

Something which would be worth benchmarking is joining on multiple columns.

I will run such benchmarks and report the results.

bkamins

comment created time in 4 hours

issue comment queryverse/Query.jl

error with @rename macro

Running into the same issue while working with data frames; in some cases using Symbol is mandatory, as the column name includes numbers or characters like % or '.'.

@pstaabp if it might help, this is my current workaround:

I defined a function to replace characters that prevent column names from being interpreted as symbols:

function tidy_names(old_names)
    return old_names |>
        # replace spaces with underscores
        n -> replace.(n, ' ' => '_') |>
        # remove parentheses
        n -> replace.(n, '(' => "") |>
        n -> replace.(n, ')' => "") |>
        # remove dashes and dots
        n -> replace.(n, '-' => "") |>
        n -> replace.(n, '.' => "") |>
        # all lowercase
        n -> lowercase.(n)
end

And then I run:

rename!(df, names(df) .=> tidy_names(names(df)))

Before feeding the df to the pipe-magic that Query.jl allows.

pstaabp

comment created time in 5 hours

issue opened cwida/duckdb

Can you provide a row iterator interface in Python for query results?

All of the row result methods here: https://duckdb.org/docs/api/python

  • fetchdf
  • fetchall
  • fetchnumpy

will all store the query results in memory. If the DuckDB database is sufficiently large, we will run out of memory.

Can you provide an iterator-based method that won't create a list like fetchall does? I would recommend just having fetchall return an iterator. That way, you can still call list(fetchall()) if you want a list of results.

created time in 5 hours

issue opened fonsp/Pluto.jl

Enhancement -- Sample Notebook

I've been playing with calculus, and I was thinking of writing a sample notebook that would show how to do numeric integration and differentiation (ForwardDiff, QuadGK), symbolic differentiation (ModelingToolkit or JuMP), symbolic integration (SymPy or JuMP), plotting symbolic and numeric derivatives, zero-finding (find_zeros), plotting Newton's method (I might need to contribute to a plotting package for this), and optimization, as well as plotting intercepts.

created time in 5 hours

issue opened hughjonesd/huxtable

Latex tables with borders for multicolumn cells

Problem: LaTeX tables in academic papers often contain many columns that need to be organized. A common method is to use \cline in combination with \extracolsep in an @-expression. It is a very difficult topic in LaTeX, and a programmatic solution from huxtable would be very valuable.

Example: Let's say I have this very simple data.frame called my_df turned into a huxtable my_ht:

library(huxtable)
library(tidyverse)

my_df = data.frame(A=1:4, B=5:8, alpha=10:13, beta=100:103)
my_ht = huxtable(my_df)

There are two kinds of columns. Those with Latin colnames, and those with Greek colnames. It is easy to add multicolumn elements:

my_ht %>% 
  insert_row(c('Latin','Latin', 'Greek', 'Greek')) %>%
  merge_cells(c(1,1),c(1,2)) %>% merge_cells(c(1,1),c(3,4))

Additionally, some hline and cline equivalents should be applied.

my_ht %>% 
  insert_row(c('Latin','Latin', 'Greek', 'Greek')) %>%
  merge_cells(c(1,1),c(1,2)) %>% merge_cells(c(1,1),c(3,4)) %>%
  set_bottom_border(row = c(1,1) , col = 1:2,   brdr(1, "solid", "black")) %>% 
  set_bottom_border(row = c(1,1) , col = 3:4,   brdr(1, "solid", "black")) %>%
  set_bottom_border(row = c(2,2) , col = 1:4,   brdr(1, "solid", "black")) %>%
  set_right_padding(row = 1 ,col =2, 10 ) %>%
  quick_latex()

[image: the rendered table, with group borders under 'Latin' and 'Greek']

That's nice, but I need a little 'gap' to help the reader. It should look like this: [image: the same table with a small gap between the 'Latin' and 'Greek' group borders]

What does NOT work:

my_ht %>% 
  insert_row(c('Latin','Latin', 'Greek', 'Greek')) %>%
  merge_cells(c(1,1),c(1,2)) %>% merge_cells(c(1,1),c(3,4)) %>%
  set_bottom_border(row = c(1,1) , col = 1:2,   brdr(1, "solid", "black")) %>% 
  set_bottom_border(row = c(1,1) , col = 3:4,   brdr(1, "solid", "black")) %>%
  set_bottom_border(row = c(2,2) , col = 1:4,   brdr(1, "solid", "black")) %>%
  set_right_padding(row = 1 ,col =2, 10 ) %>%
  quick_latex()

Padding doesn't do it.

I can put a blue line over it:

my_ht %>% 
  insert_row(c('Latin','Latin', 'Greek', 'Greek')) %>%
  merge_cells(c(1,1),c(1,2)) %>% merge_cells(c(1,1),c(3,4)) %>%
  set_bottom_border(row = c(1,1) , col = 1:2,   brdr(1, "solid", "black")) %>% 
  set_bottom_border(row = c(1,1) , col = 3:4,   brdr(1, "solid", "black")) %>%
  set_bottom_border(row = c(2,2) , col = 1:4,   brdr(1, "solid", "black")) %>%
  set_right_padding(row = 1 ,col =2, 10 ) %>%
  set_right_border(row = 1 ,col =2, brdr(1, "solid", "blue")) %>%
  quick_latex()

However, when I change it to white, the border is not visible -- well, it can't know which color to draw over, so that makes sense:

my_ht %>% 
  insert_row(c('Latin','Latin', 'Greek', 'Greek')) %>%
  merge_cells(c(1,1),c(1,2)) %>% merge_cells(c(1,1),c(3,4)) %>%
  set_bottom_border(row = c(1,1) , col = 1:2,   brdr(1, "solid", "black")) %>% 
  set_bottom_border(row = c(1,1) , col = 3:4,   brdr(1, "solid", "black")) %>%
  set_bottom_border(row = c(2,2) , col = 1:4,   brdr(1, "solid", "black")) %>%
  set_right_padding(row = 1 ,col =2, 10 ) %>%
  set_right_border(row = 1 ,col =2, brdr(1, "solid", "white")) %>%
  quick_latex()

Solution: I am not sure how to do it. Allowing certain borders to be \cline would help. Being able to add \extracolsep would help. One would also need to be able to turn other borders into \toprule/\midrule/\bottomrule.
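
For reference, the hand-written LaTeX I am trying to reach looks roughly like this - a sketch using the booktabs package, where the (lr) trimming on \cmidrule produces exactly the gap I am after:

\begin{tabular}{cccc}
\toprule
\multicolumn{2}{c}{Latin} & \multicolumn{2}{c}{Greek} \\
% (lr) trims the rule on both sides, leaving a gap between the groups
\cmidrule(lr){1-2} \cmidrule(lr){3-4}
A & B & alpha & beta \\
\midrule
1 & 5 & 10 & 100 \\
\bottomrule
\end{tabular}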

created time in 5 hours

issue comment fonsp/Pluto.jl

Package Manager Not Loading

I've had the same results with my book, but this is your plots sample:

Screenshot from 2021-01-27 16-07-13

I also have this error message:

[5948:5948:0127/160618.351734:ERROR:chrome_content_client.cc(343)] Failed to locate and load the component updated flash plugin.
Opening in existing browser session.

but I think it was there when everything was working.

BrettKnoss

comment created time in 5 hours

pull request comment JuliaData/DataFrames.jl

implement faster innerjoin

Here are benchmarks on integer columns. In this comparison the PR looks much better (so String case is a hard one):

smaller data

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.637557 seconds (183 allocations: 707.847 MiB, 15.81% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 18.728374 seconds (183 allocations: 244.524 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.874104 seconds (183 allocations: 707.847 MiB, 5.16% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 22.334269 seconds (183 allocations: 244.524 MiB, 0.14% gc time)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.040792 seconds (183 allocations: 651.651 MiB, 1.98% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.436125 seconds (183 allocations: 238.863 MiB, 7.20% gc time)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  2.874287 seconds (185 allocations: 667.052 MiB, 1.03% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  2.081790 seconds (185 allocations: 254.227 MiB)

this PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.028745 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.027683 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.709983 seconds (245 allocations: 90.813 MiB, 1.41% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  0.701188 seconds (245 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.043311 seconds (149 allocations: 22.109 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.042666 seconds (149 allocations: 22.109 MiB)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  1.104376 seconds (1.27 M allocations: 146.839 MiB, 14.43% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.098397 seconds (1.27 M allocations: 146.839 MiB, 12.16% gc time)

larger data

current main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 16.249414 seconds (183 allocations: 6.260 GiB, 1.73% gc time)

julia> @time innerjoin(df2, df1, on=:id);
167.121607 seconds (183 allocations: 1.580 GiB, 0.14% gc time)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 17.036872 seconds (183 allocations: 6.260 GiB, 1.85% gc time)

julia> @time innerjoin(df2, df1, on=:id);
171.352191 seconds (183 allocations: 1.580 GiB, 0.14% gc time)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  9.675349 seconds (185 allocations: 5.727 GiB, 3.15% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  9.937535 seconds (185 allocations: 1.589 GiB, 2.17% gc time)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
 24.858718 seconds (183 allocations: 5.712 GiB, 0.41% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 14.484201 seconds (183 allocations: 1.574 GiB, 1.55% gc time)

this PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!((1:10^6)));

julia> df2 = DataFrame(id = sort!((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  0.107601 seconds (149 allocations: 22.900 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.109973 seconds (149 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle((1:10^6)));

julia> df2 = DataFrame(id = shuffle((1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  4.833485 seconds (245 allocations: 90.813 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  4.871002 seconds (245 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand((1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand((1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  0.118253 seconds (149 allocations: 22.132 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.121240 seconds (149 allocations: 22.132 MiB)

julia> df1 = DataFrame(id = rand((1:10^6), 10^6));

julia> df2 = DataFrame(id = rand((1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
  4.322843 seconds (1.27 M allocations: 131.589 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  4.342500 seconds (1.27 M allocations: 131.589 MiB)
bkamins

comment created time in 5 hours

pull request comment JuliaData/DataFrames.jl

implement faster innerjoin

I have added a fast join path for sorted tables. Unfortunately it cannot be used with CategoricalVector.

Yeah, the comparison between pools is still an annoying problem and I haven't tried implementing the global table to fix that. At least it shouldn't be hard to check whether refpools are equal, and if so you can work directly on the refarrays. That should cover most cases, and you'll get efficient PooledArray support too. It would probably be possible to check whether one refpool is an ordered subset of the other to do clever things, but that can be left for later.

@nalimilan - what do you think we should do?

What's your question exactly? If I understand correctly, this PR is always as fast as main or faster, so I have nothing to object. :-)

Something which would be worth benchmarking is joining on multiple columns. I think that's the case where hashing columns one by one, as hashrows_cols! does, makes the biggest difference.

Here's another run of your benchmarks (on a Xeon 4114 at 2.20GHz):

smaller data

main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  4.712898 seconds (2.43 M allocations: 843.429 MiB, 23.98% gc time, 40.77% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  1.980586 seconds (194 allocations: 252.524 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  4.824913 seconds (194 allocations: 707.847 MiB, 45.99% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  3.083552 seconds (194 allocations: 252.524 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.624032 seconds (196 allocations: 667.021 MiB, 15.88% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  1.563919 seconds (196 allocations: 262.201 MiB, 9.95% gc time)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  5.085381 seconds (196 allocations: 666.978 MiB, 37.59% gc time)

julia> @time innerjoin(df2, df1, on=:id);
  2.794582 seconds (196 allocations: 262.180 MiB)

PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  1.964033 seconds (1.70 M allocations: 122.239 MiB, 19.72% gc time, 65.16% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  0.357172 seconds (154 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  3.679063 seconds (257.89 k allocations: 105.835 MiB, 5.86% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  3.884744 seconds (246 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^7), 10^7)));

julia> @time innerjoin(df1, df2, on=:id);
  0.404584 seconds (154 allocations: 22.133 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  0.409299 seconds (154 allocations: 22.133 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^7), 10^7));

julia> @time innerjoin(df1, df2, on=:id);
  5.682937 seconds (1.27 M allocations: 146.871 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  6.104373 seconds (1.27 M allocations: 146.871 MiB)

bigger data

main:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 46.291461 seconds (194 allocations: 6.260 GiB, 42.79% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 28.256333 seconds (194 allocations: 1.588 GiB, 21.50% gc time)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
137.688439 seconds (194 allocations: 6.260 GiB, 76.09% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 61.029408 seconds (194 allocations: 1.588 GiB, 35.07% gc time)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 27.746145 seconds (194 allocations: 5.712 GiB, 54.92% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 16.150328 seconds (194 allocations: 1.582 GiB, 20.20% gc time)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
116.331639 seconds (196 allocations: 5.727 GiB, 65.69% gc time)

julia> @time innerjoin(df2, df1, on=:id);
 52.525720 seconds (196 allocations: 1.597 GiB, 30.45% gc time)

PR:

julia> using Random, DataFrames

julia> Random.seed!(1234);

julia> df1 = DataFrame(id = sort!(string.(1:10^6)));

julia> df2 = DataFrame(id = sort!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  4.144959 seconds (1.70 M allocations: 122.124 MiB, 32.25% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
  2.769002 seconds (154 allocations: 22.900 MiB)

julia> df1 = DataFrame(id = shuffle!(string.(1:10^6)));

julia> df2 = DataFrame(id = shuffle!(string.(1:10^8)));

julia> @time innerjoin(df1, df2, on=:id);
 41.557779 seconds (257.89 k allocations: 105.835 MiB, 0.58% compilation time)

julia> @time innerjoin(df2, df1, on=:id);
 40.998080 seconds (246 allocations: 90.813 MiB)

julia> df1 = DataFrame(id = sort!(rand(string.(1:10^6), 10^6)));

julia> df2 = DataFrame(id = sort!(rand(string.(1:10^8), 10^8)));

julia> @time innerjoin(df1, df2, on=:id);
  3.731307 seconds (154 allocations: 22.113 MiB)

julia> @time innerjoin(df2, df1, on=:id);
  3.816319 seconds (154 allocations: 22.113 MiB)

julia> df1 = DataFrame(id = rand(string.(1:10^6), 10^6));

julia> df2 = DataFrame(id = rand(string.(1:10^8), 10^8));

julia> @time innerjoin(df1, df2, on=:id);
 39.876243 seconds (1.27 M allocations: 146.741 MiB)

julia> @time innerjoin(df2, df1, on=:id);
 39.830833 seconds (1.27 M allocations: 146.741 MiB)
bkamins

comment created time in 5 hours

PR opened hng/tech-coops

Add Sange
+1 -0

0 comment

1 changed file

pr created time in 6 hours

issue opened Brewtarget/brewtarget

Creating a new recipe makes the program crash on start-up

Steps to reproduce (obviously make sure you have a backup of your database first!):

  • Start Brewtarget
  • Right-click on the Recipes tree pane and select New > Recipe
  • Type in a recipe name, eg 'Bork', and click OK
  • Close Brewtarget
  • Start Brewtarget again ... result = core dump

If you want your data back, edit the DB in another tool to remove the Recipe record you just added. Brewtarget will now start OK again.

I'm presuming the issue is that when creating a new Recipe object, one (or more) of the fields is set to a default value that is not valid. Looking at the DB, it seems that a lot of fields are set to NULL.

Hopefully the fix is something along the lines of creating sensible default values for fields in the Recipe constructor.

created time in 6 hours

PR opened hng/tech-coops

Add blinkenbox
+1 -0

0 comment

1 changed file

pr created time in 6 hours

PR opened hng/tech-coops

Add Reinblau
+1 -0

0 comment

1 changed file

pr created time in 6 hours

issue opened nextcloud/deck

Bulk assign user to cards

I don't know if this has been asked before, as I couldn't find an existing issue. What I would like is a way to bulk-assign a user to cards.

created time in 6 hours

PR opened hng/tech-coops

Add Robur
+1 -0

0 comment

1 changed file

pr created time in 6 hours

PR opened hng/tech-coops

Add ctrl.alt.coop
+1 -0

0 comment

1 changed file

pr created time in 6 hours

issue comment Brewtarget/brewtarget

[RFC] Recipe versioning

I don't know what the current status of this is, but I rather agree with @pricelessbrewing about branches - i.e. keeping to the simple principle that if you want one recipe to evolve into two new ones, then only one of the two retains the history, and the other just becomes a brand new recipe. I would be inclined to present versioning to the user in somewhat the same way as versioning of wiki pages is presented.

mikfire

comment created time in 6 hours

issue comment Brewtarget/brewtarget

BrewNote should contain a copy of the recipe on the day it was brewed.

@pricelessbrewing I think https://github.com/Brewtarget/brewtarget/issues/327 was the idea @mikfire mentions above.

janjachnik

comment created time in 7 hours
