Gert Hulselmans ghuls @aertslab Bioinformatician in the Laboratory of Computational Biology (part of the KU Leuven Center for Human Genetics and the VIB Center for Brain and Disease Research).

aertslab/SCopeLoomR 17

R package (compatible with SCope) to create generic .loom files and extend them with other data e.g.: SCENIC regulons, Seurat clusters and markers, ...

atarashansky/SAMap 13

SAMap: Mapping single-cell RNA sequencing datasets from evolutionarily distant organisms.

aertslab/scforest 3

scforest: a visual overview of single cell technology

ghuls/multigrep 3

Grep for multiple patterns at once in one or more files and allow grepping in certain columns only.

ghuls/bx-python-fixed-history 1

bx-python imported with hg-fast-export from https://bitbucket.org/james_taylor/bx-python/ with missing patches available in https://github.com/bxlab/bx-python added.

ghuls/moreutils-parallel 1

Improved version of parallel from moreutils (http://joeyh.name/code/moreutils/)

aertslab/AS_variant_pipeline 0

Allele specific variant pipeline

ghuls/arrow 0

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.

ghuls/arrow2 0

transmute-free Rust library to work with the Arrow format

ghuls/bat 0

A cat(1) clone with wings.

issue closed ezrosent/frawk

frawk panics when using substr when start position is higher than length of string to subset.

$ echo 'test' | frawk '{ print substr($0, 3, 10); }'
st

$ echo 'tes' | frawk '{ print substr($0, 3, 10); }'
s

$ echo 'te' | frawk '{ print substr($0, 3, 10); }'

$ echo 't' | frawk '{ print substr($0, 3, 10); }'
thread 'main' panicked at 'internal error: invalid index len=1, from=2, to=1', src/runtime/str_impl.rs:711:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Aborted

$ echo '' | frawk '{ print substr($0, 3, 10); }'
thread 'main' panicked at 'internal error: invalid index len=0, from=2, to=0', src/runtime/str_impl.rs:711:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Aborted
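
For reference, gawk clamps out-of-range substr() indices and returns an empty string instead of aborting. A minimal Python sketch of that expected clamping behaviour (awk_substr is a hypothetical helper for illustration, not frawk code):

def awk_substr(s: str, m: int, n: int) -> str:
    # AWK strings are 1-indexed; clamp the requested start into range.
    start = max(m, 1)
    # The substring ends after n characters counted from the requested
    # start; Python slicing clamps, so out-of-range indices yield "".
    return s[start - 1 : max(m + n - 1, 0)]

# Mirrors the shell session above: "st", "s", "", "", "".
for text in ["test", "tes", "te", "t", ""]:
    print(repr(awk_substr(text, 3, 10)))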

closed time in 2 hours

ghuls

issue comment pola-rs/polars

Could pyarrow dependency be made optional?

It might be possible to make it optional if no interop with pandas is required.

Greatness7

comment created time in 6 hours

issue comment pola-rs/polars

Could pyarrow dependency be made optional?

Removing pyarrow as a dependency from polars will also remove support for converting polars dataframes to pandas dataframes, as this is completely offloaded to pyarrow.
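
A common way to make such a dependency optional is an import guard that only fails when the Arrow/pandas interop is actually used. A minimal sketch of that pattern (names and message are illustrative, not polars' actual code):

try:
    import pyarrow  # only needed for pandas/Arrow interop
    _PYARROW_AVAILABLE = True
except ImportError:
    _PYARROW_AVAILABLE = False

def to_pandas(df):
    # Fail with a clear message at call time instead of at import time.
    if not _PYARROW_AVAILABLE:
        raise ImportError(
            "pyarrow is required for to_pandas(); "
            "install it with `pip install pyarrow`."
        )
    return df.to_arrow().to_pandas()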

Greatness7

comment created time in 7 hours

issue comment sstadick/hck

regex parser is faster than the literal one.

Thanks. Now it is 1.5 seconds faster than the regex version:

$ timeit hck --no-mmap -L -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck --no-mmap -L -d 	 -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:07.48 = 7.48 seconds
  * Elapsed CPU time:
     - User: 5.31
     - Sys: 2.14
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 14
  * Maximum resident set size (RSS: memory) (kiB): 1644
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck --no-mmap -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck --no-mmap -d 	 -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:08.74 = 8.74 seconds
  * Elapsed CPU time:
     - User: 6.55
     - Sys: 2.02
  * CPU usage: 98%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 3407
  * Maximum resident set size (RSS: memory) (kiB): 1816
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 	 -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:11.69 = 11.69 seconds
  * Elapsed CPU time:
     - User: 8.02
     - Sys: 3.03
  * CPU usage: 94%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 8
     - Involuntarily (time slice expired): 191
  * Maximum resident set size (RSS: memory) (kiB): 9563660
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d 	 -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:13.48 = 13.48 seconds
  * Elapsed CPU time:
     - User: 8.85
     - Sys: 3.03
  * CPU usage: 88%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 157
  * Maximum resident set size (RSS: memory) (kiB): 9563916
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0
ghuls

comment created time in a day

create branch ghuls/polars

branch : to_polars-static_sources

created branch time in 2 days

push eventghuls/polars-book

Gert Hulselmans

commit sha 2646b8c5a08bf0424a5867db89d9853a5e4baf0f

Update https://github.com/ritchie46/polars(|-book) to https://pola-rs.github.io/polars(|-book).

view details

Gert Hulselmans

commit sha b02639e7ba4dd87eab84e33e7a76e36301858fe4

Update links to SVG files from ritchie46/static/master/polars/ to pola-rs/polars-static/master/docs/

view details

Gert Hulselmans

commit sha b575e7b9df2d214b312927a23a06713d048adab1

Update link to sponsor logo.

view details

Gert Hulselmans

commit sha e8cddab091aa7cfefcad198f3fa7d47ea8bf33f5

Replace Polars PNG logo with SVG logo and draw circle with CSS.

view details

push time in 2 days

pull request comment pola-rs/polars-book

Update https://github.com/ritchie46/polars(|-book) to https://pola-rs…

In the book there are still some references to files in your static repo:

user_guide/src/performance/strings.md
13:![](https://raw.githubusercontent.com/ritchie46/static/master/polars/arrow-string.svg)
19:![](https://raw.githubusercontent.com/ritchie46/static/master/polars/pandas-string.svg)

user_guide/src/examples/expressions/window_1.py
5:    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"

user_guide/src/introduction.md
1:<div style="text-align:center;margin-top:30px"><img src="https://raw.githubusercontent.com/ritchie46/static/master/polars/polars_logo_white_circle.png" width="200px" /></div>
39:![](https://raw.githubusercontent.com/ritchie46/static/master/polars/api.svg)
46:![](https://www.ritchievink.com/img/post-35-polars-0.15/db-benchmark.png)
76:[![Xomnia](https://raw.githubusercontent.com/ritchie46/static/master/polars/xomnia_logo.png)](https://www.xomnia.com)

user_guide/src/dsl/groupby.md
15:![](https://raw.githubusercontent.com/ritchie46/static/master/polars/split-apply-combine-par.svg)
20:![](https://raw.githubusercontent.com/ritchie46/static/master/polars/lock-free-hash.svg)

This logo : https://raw.githubusercontent.com/ritchie46/static/master/polars/polars_logo_white_circle.png

can be replaced with the new SVG logo (and some CSS styling to draw the circle and white background):

<div style="margin: 30px auto; background-color: white; border-radius: 50%; width: 200px; height: 200px;">
  
</div>
ghuls

comment created time in 3 days

PR opened pola-rs/polars-book

Update https://github.com/ritchie46/polars(|-book) to https://pola-rs…

….github.io/polars(|-book).

+6 -6

0 comment

4 changed files

pr created time in 3 days

create branch ghuls/polars-book

branch : update_website_urls

created branch time in 3 days

fork ghuls/polars-book

Book documentation of the Polars DataFrame library

fork in 3 days

pull request comment seq-lang/seq

Add support for reading SAM, BAM, CRAM, BCF, VCF with multiple thread…

# count_reads_in_bam.seq 
from bio import *
from time import timing

bam_filename = 'ENCFF014BMO.bam'


def count_reads_in_bam(bam_filename, nthreads):
    nbr_reads = 0

    for r in SAM(bam_filename, nthreads=nthreads):
        nbr_reads += 1
        #print(r.name, r.read, r.pos, r.mapq, r.cigar, r.reversed)

    return nbr_reads

# Warmup cache.
count_reads_in_bam(bam_filename, 4)

for nthreads in [0, 1, 2, 3, 4, 6, 8]:
    with timing(f"Read BAM file with {nthreads} threads:"):
        print(count_reads_in_bam(bam_filename, nthreads))
$ seqc run --release count_reads_in_bam.seq
15707047
Read BAM file with 0 threads: took 9.35422s
15707047
Read BAM file with 1 threads: took 6.34794s
15707047
Read BAM file with 2 threads: took 3.98488s
15707047
Read BAM file with 3 threads: took 3.93837s
15707047
Read BAM file with 4 threads: took 3.96835s
15707047
Read BAM file with 6 threads: took 4.01037s
15707047
Read BAM file with 8 threads: took 3.94179s
ghuls

comment created time in 4 days

create branch ghuls/seq

branch : add_hts_set_threads

created branch time in 4 days

PR opened seq-lang/seq

Add support for reading SAM, BAM, CRAM, BCF, VCF with multiple thread…

…s.

Add support for reading SAM, BAM, CRAM, BCF, VCF with multiple threads by using hts_set_threads().
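
For comparison, pysam (the Python htslib bindings) exposes the same htslib thread pool through a threads argument on AlignmentFile; a minimal sketch of the equivalent read-counting loop (the filename is taken from the benchmark above, the rest is an assumption):

import pysam

def count_reads_in_bam(bam_filename: str, nthreads: int) -> int:
    nbr_reads = 0
    # threads= hands BGZF decompression to an htslib thread pool.
    with pysam.AlignmentFile(bam_filename, "rb", threads=nthreads) as bam:
        for _ in bam:
            nbr_reads += 1
    return nbr_reads

print(count_reads_in_bam("ENCFF014BMO.bam", 4))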

+23 -13

0 comment

3 changed files

pr created time in 4 days

issue comment aertslab/popscle_helper_tools

Building vcf file

If you have bulk RNAseq for all your genotypes, you could use the 1000g common variants VCF file and call all those positions on all BAM files and then filter the results.

Lopiniatre

comment created time in 4 days

issue comment pola-rs/polars

Remove deprecated `from_*` functions

The concat test can already be fixed to use from_records:

diff --git a/py-polars/tests/test_df.py b/py-polars/tests/test_df.py
index 58ec015c..059e1f00 100644
--- a/py-polars/tests/test_df.py
+++ b/py-polars/tests/test_df.py
@@ -665,7 +665,7 @@ def test_concat():
     assert pl.concat([df, df]).shape == (6, 3)
 
     # check if a remains unchanged
-    a = pl.from_rows(((1, 2), (1, 2)))
+    a = pl.from_records(((1, 2), (1, 2)), orient='row')
     _ = pl.concat([a, a, a])
     assert a.shape == (2, 2)

test_from_rows probably should be renamed and expanded to test_from_records.

The dict-like column mapping of column_name_mapping does not look like it is supported by the columns option of pl.from_records(), which also does not accept None for the remaining column names.

frame_equal does not compare the names of the series, so even the assert in the original code does not really need the column names set to the same values.

def test_from_rows():
    df = pl.DataFrame.from_rows(
        [[1, 2, "foo"], [2, 3, "bar"]], column_name_mapping={1: "foo"}
    )
    assert df.frame_equal(
        pl.DataFrame({"column_0": [1, 2], "foo": [2, 3], "column_2": ["foo", "bar"]})
    )

    df = pl.DataFrame.from_rows(
        [[1, datetime.fromtimestamp(100)], [2, datetime.fromtimestamp(2398754908)]],
        column_name_mapping={1: "foo"},
    )
    assert df.dtypes == [pl.Int64, pl.Date64]


def test_from_rows_fixed():
    df = pl.from_records(
        [[1, 2, "foo"], [2, 3, "bar"]], columns=["", "foo", ""], orient='row'
    )
    assert df.frame_equal(
        pl.DataFrame({"column_0": [1, 2], "foo": [2, 3], "column_2": ["foo", "bar"]})
    )
    print(df)

    df = pl.from_records(
        [[1, datetime.fromtimestamp(100)], [2, datetime.fromtimestamp(2398754908)]],
        columns=["", "foo"],
        orient='row'
    )
    assert df.dtypes == [pl.Int64, pl.Date64]
ghuls

comment created time in 5 days

issue opened pola-rs/polars

Remove deprecated `from_*` functions

Remove deprecated from_* functions:

pl.from_arrow_table
pl.from_rows
pl.DataFrame.from_arrow
pl.DataFrame.from_rows
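
For illustration, the replacements map roughly as follows (a sketch based on this thread, not an exhaustive migration guide):

import polars as pl

# pl.from_arrow_table(tbl)      -> pl.from_arrow(tbl)
# pl.from_rows(rows)            -> pl.from_records(rows, orient='row')
# pl.DataFrame.from_arrow(tbl)  -> pl.from_arrow(tbl)
# pl.DataFrame.from_rows(rows)  -> pl.from_records(rows, orient='row')
a = pl.from_records(((1, 2), (1, 2)), orient='row')
assert a.shape == (2, 2)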

created time in 5 days

PR opened pola-rs/polars

More delimiter to sep
+11 -11

0 comment

5 changed files

pr created time in 5 days

create branch ghuls/polars

branch : more_delimiter_to_sep

created branch time in 5 days

issue comment sstadick/hck

regex parser is faster than the literal one.

Gzipped test file: https://temp.aertslab.org/e.sorted.tsv.gz

I put the file in /dev/shm after running the same commands on a shared filesystem (which gave similar time differences between commands) to see if the filesystem was causing the difference in timings.

With a local file (with warm cache):

$ timeit hck -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d      -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:11.50 = 11.50 seconds
  * Elapsed CPU time:
     - User: 8.88
     - Sys: 2.58
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 5
  * Maximum resident set size (RSS: memory) (kiB): 9563892
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -d $'\t' -f 1,2 --no-mmap e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d      -f 1,2 --no-mmap e.sorted.tsv
  * Elapsed wall time: 0:09.34 = 9.34 seconds
  * Elapsed CPU time:
     - User: 7.53
     - Sys: 1.78
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 4
  * Maximum resident set size (RSS: memory) (kiB): 1944
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d   -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:14.74 = 14.74 seconds
  * Elapsed CPU time:
     - User: 11.48
     - Sys: 3.21
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 11
     - Involuntarily (time slice expired): 6
  * Maximum resident set size (RSS: memory) (kiB): 9563692
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0


$ timeit hck -L -d $'\t' -f 1,2 --no-mmap e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d   -f 1,2 --no-mmap e.sorted.tsv
  * Elapsed wall time: 0:10.72 = 10.72 seconds
  * Elapsed CPU time:
     - User: 8.80
     - Sys: 1.89
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 5
  * Maximum resident set size (RSS: memory) (kiB): 1724
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

CPU info:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping        : 4
microcode       : 0x2006906
cpu MHz         : 999.932
cache size      : 25344 KB
physical id     : 0
siblings        : 18
core id         : 0
cpu cores       : 18
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_pkg_req pku ospke md_clear spec_ctrl intel_stibp flush_l1d
bogomips        : 4600.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

...
ghuls

comment created time in 5 days

issue comment pola-rs/polars

use `sep` instead of `delimiter` in DataFrame.to_csv

The following functions are also using delimiter:

def concat_str(exprs: tp.List["pl.Expr"], delimiter: str = "") -> "pl.Expr":
def concat_list(exprs: tp.List["pl.Expr"], delimiter: str = "") -> "pl.Expr":
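
One way to rename the keyword without breaking existing callers is a short deprecation shim. A hypothetical sketch (not the actual polars change):

import warnings
from typing import Optional

def concat_str(exprs, sep: str = "", delimiter: Optional[str] = None):
    # Accept the old keyword for a release or two, but steer users to `sep`.
    if delimiter is not None:
        warnings.warn(
            "`delimiter` is deprecated; use `sep` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        sep = delimiter
    ...  # build the expression with `sep` as before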
ritchie46

comment created time in 6 days

issue comment pola-rs/polars

use `sep` instead of `delimiter` in DataFrame.to_csv

https://github.com/pola-rs/polars/pull/1418

ritchie46

comment created time in 6 days

PR opened pola-rs/polars

Use "sep" instead of "delimiter" in pl.Dataframe().to_csv().

Use "sep" instead of "delimiter" in pl.Dataframe().to_csv() to be constent with pl.read_csv() and pl.scan_csv().

+6 -6

0 comment

2 changed files

pr created time in 6 days

create branch ghuls/polars

branch : use_sep_in_to_csv

created branch time in 6 days

pull request review event

issue comment sstadick/hck

regex parser is faster than the literal one.

# With hck compiled without pgo:
$ export RUSTFLAGS='-C target-cpu=native'
$ cargo install --path . --force


$ timeit hck -d $'\t' -f 1,2 /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d 	 -f 1,2 /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:12.01 = 12.01 seconds
  * Elapsed CPU time:
     - User: 8.88
     - Sys: 3.04
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 18
     - Involuntarily (time slice expired): 5
  * Maximum resident set size (RSS: memory) (kiB): 9563896
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 4936
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -d $'\t' -f 1,2 --no-mmap /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d 	 -f 1,2 --no-mmap /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:09.95 = 9.95 seconds
  * Elapsed CPU time:
     - User: 7.25
     - Sys: 2.67
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 5
     - Involuntarily (time slice expired): 55
  * Maximum resident set size (RSS: memory) (kiB): 1948
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d $'\t' -f 1,2 /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 	 -f 1,2 /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:13.05 = 13.05 seconds
  * Elapsed CPU time:
     - User: 9.88
     - Sys: 3.13
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 3
     - Involuntarily (time slice expired): 5
  * Maximum resident set size (RSS: memory) (kiB): 9563712
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d $'\t' -f 1,2 --no-mmap /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 	 -f 1,2 --no-mmap /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:10.79 = 10.79 seconds
  * Elapsed CPU time:
     - User: 8.20
     - Sys: 2.56
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 3
     - Involuntarily (time slice expired): 9
  * Maximum resident set size (RSS: memory) (kiB): 1724
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0
ghuls

comment created time in 6 days

issue comment sstadick/hck

regex parser is faster than the literal one.

I know that regex uses a lot of tricks (e.g. first trying to find rare characters) to speed up regex matches, but I am surprised it is faster for 1-byte matches with all the function call overhead.
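
The effect is easy to reproduce outside Rust: a vectorized single-byte search (memchr-style) beats a byte-at-a-time loop even with the extra call overhead. An illustrative Python micro-benchmark (data and repeat counts are arbitrary):

import timeit

data = b"foo\tbar\tbaz\t" * 100_000

def manual_scan(buf: bytes, needle: int) -> int:
    # Byte-at-a-time loop, like a naive literal parser.
    count = 0
    for b in buf:
        if b == needle:
            count += 1
    return count

def memchr_scan(buf: bytes, needle: bytes) -> int:
    # bytes.find is a vectorized search, similar to memchr.
    count = start = 0
    while (start := buf.find(needle, start)) != -1:
        count += 1
        start += 1
    return count

print(timeit.timeit(lambda: manual_scan(data, 0x09), number=10))
print(timeit.timeit(lambda: memchr_scan(data, b"\t"), number=10))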

I used:

just install-native

OS: CentOS7

With --no-mmap, all commands run much faster. As the input file is already in shared memory, I didn't expect such a big difference.
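
A hedged sketch of how the two I/O paths could be compared directly in Python (mirroring the mmap vs --no-mmap split; the path is the test file from this thread):

import mmap

def count_tabs_mmap(path: str) -> int:
    # Memory-mapped scan: the kernel pages the file in on demand.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = start = 0
            while (start := m.find(b"\t", start)) != -1:
                count += 1
                start += 1
            return count

def count_tabs_read(path: str, bufsize: int = 1 << 20) -> int:
    # Buffered read() loop: explicit copies, but a sequential access pattern.
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            count += chunk.count(b"\t")
    return count

print(count_tabs_mmap("/dev/shm/e.sorted.tsv"))
print(count_tabs_read("/dev/shm/e.sorted.tsv"))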

$ timeit hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:18.95 = 18.95 seconds
  * Elapsed CPU time:
     - User: 15.78
     - Sys: 3.12
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 3
     - Involuntarily (time slice expired): 7
  * Maximum resident set size (RSS: memory) (kiB): 9563652
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 --no-mmap /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 --no-mmap /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:16.76 = 16.76 seconds
  * Elapsed CPU time:
     - User: 14.07
     - Sys: 2.65
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 5
     - Involuntarily (time slice expired): 6
  * Maximum resident set size (RSS: memory) (kiB): 1656
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d $'\t' -f 1,2 /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 	 -f 1,2 /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:13.13 = 13.13 seconds
  * Elapsed CPU time:
     - User: 9.93
     - Sys: 3.16
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 8
     - Involuntarily (time slice expired): 5
  * Maximum resident set size (RSS: memory) (kiB): 9563704
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0


$ timeit hck -L -d $'\t' -f 1,2 --no-mmap /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 	 -f 1,2 --no-mmap /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:10.61 = 10.61 seconds
  * Elapsed CPU time:
     - User: 7.91
     - Sys: 2.67
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 3
     - Involuntarily (time slice expired): 6
  * Maximum resident set size (RSS: memory) (kiB): 1628
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -d $'\t' -f 1,2 /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d 	 -f 1,2 /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:11.12 = 11.12 seconds
  * Elapsed CPU time:
     - User: 8.15
     - Sys: 2.93
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 3
     - Involuntarily (time slice expired): 62
  * Maximum resident set size (RSS: memory) (kiB): 9563696
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -d $'\t' -f 1,2 --no-mmap /dev/shm/e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d 	 -f 1,2 --no-mmap /dev/shm/e.sorted.tsv
  * Elapsed wall time: 0:08.67 = 8.67 seconds
  * Elapsed CPU time:
     - User: 6.47
     - Sys: 2.17
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 5
     - Involuntarily (time slice expired): 49
  * Maximum resident set size (RSS: memory) (kiB): 1820
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0
ghuls

comment created time in 6 days

issue opened sstadick/hck

regex parser is faster than the literal one.

Not a big issue, but for a test 10G TSV file, the "regex" parser is faster than the literal one:

# 10G TSV file.

$ wc -l e.sorted.tsv
71741554 e.sorted.tsv

$ timeit hck -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d      -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:13.44 = 13.44 seconds
  * Elapsed CPU time:
     - User: 9.20
     - Sys: 4.20
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 24
  * Maximum resident set size (RSS: memory) (kiB): 9563740
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -d '\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d \t -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:13.45 = 13.45 seconds
  * Elapsed CPU time:
     - User: 9.22
     - Sys: 4.14
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 18
     - Involuntarily (time slice expired): 31
  * Maximum resident set size (RSS: memory) (kiB): 9563968
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d $'\t' -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d   -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:15.79 = 15.79 seconds
  * Elapsed CPU time:
     - User: 11.20
     - Sys: 4.12
  * CPU usage: 97%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 7
     - Involuntarily (time slice expired): 73
  * Maximum resident set size (RSS: memory) (kiB): 9563784
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0


$ timeit hck -d 072989423308200b88dd3ce688a7dcff -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -d 072989423308200b88dd3ce688a7dcff -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:15.20 = 15.20 seconds
  * Elapsed CPU time:
     - User: 10.96
     - Sys: 4.09
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 6
     - Involuntarily (time slice expired): 42
  * Maximum resident set size (RSS: memory) (kiB): 9563884
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

$ timeit hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 e.sorted.tsv > /dev/null

Time output:
------------

  * Command: hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 e.sorted.tsv
  * Elapsed wall time: 0:29.58 = 29.58 seconds
  * Elapsed CPU time:
     - User: 19.55
     - Sys: 4.12
  * CPU usage: 80%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 6
     - Involuntarily (time slice expired): 1685
  * Maximum resident set size (RSS: memory) (kiB): 9563648
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

created time in 7 days

issue comment sstadick/hck

Crash when first line is empty and -L is specified with a one character separator.

Thanks for the fast fix.

ghuls

comment created time in 7 days

issue comment sstadick/hck

Crash when first line is empty and -L is specified with a one character separator.

Something like this will probably fix it.

diff --git a/src/lib/core.rs b/src/lib/core.rs
index dc45f43..476dfdc 100644
--- a/src/lib/core.rs
+++ b/src/lib/core.rs
@@ -520,7 +520,11 @@ where
                     line.push((start, index - 1));
                     start = index + 1;
                 } else if bytes[index] == newline {
-                    line.push((start, index - 1));
+                    if index != 0 {
+                        line.push((start, index - 1));
+                    } else {
+                        line.push((0, 0));
+                    }
                     let items = self.fields.iter().flat_map(|f| {
                         let slice = line
                             .get(f.low..=min(f.high, line.len().saturating_sub(1)))

Although, for performance reasons, it might be better to check whether the first byte is a newline character before the for index in iter loop, so this condition does not have to be checked on each iteration.
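
A hypothetical sketch of that hoisting idea in Python (the function and names are illustrative, not hck's code):

def line_spans(buf: bytes, newline: int = 0x0A):
    # Yield inclusive (start, end) spans for newline-terminated lines.
    start = 0
    # Hoisted special case: an empty first line would make `index - 1`
    # underflow in the Rust version, so handle it once before the hot loop.
    if buf[:1] == b"\n":
        yield (0, 0)
        start = 1
    for index in range(start, len(buf)):
        if buf[index] == newline:
            yield (start, index - 1)
            start = index + 1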

ghuls

comment created time in 7 days