Robin Stocker (robinst), Sydney, Australia. https://www.whatsthistimestamp.com

atlassian/commonmark-java 1440

Java library for parsing and rendering CommonMark (Markdown)

nishanths/cocoa-hugo-theme 307

Responsive Hugo blog theme

robinst/autolink-java 154

Java library to extract links (URLs, email addresses) from plain text; fast, small and smart

robinst/curlall 25

Simple curl-like CLI tool to automatically page through APIs

robinst/brainztag 17

Command line tool to tag and rename music albums using MusicBrainz data

robinst/7langs7weeks 1

Exercises from "Seven Languages in Seven Weeks"

robinst/askbot-devel 1

ASKBOT is a StackOverflow-like Q&A forum, based on CNPROG.

robinst/clojure-sudoku 1

Simple Sudoku solver in Clojure

robinst/cvs2svn 1

Migrate CVS repositories to SVN or git or hg or ... The canonical repository for this project is in Subversion at http://cvs2svn.tigris.org/svn/cvs2svn/

PR closed robinst/taglib-ruby

E:PSHRMVM
+1 -0

0 comments

1 changed file

ashok756

pr closed time in 11 hours

issue comment fancy-regex/fancy-regex

Add escape function

Yeah makes sense to have. @johnw42 recently worked on expansion APIs (a little bit related), maybe this is something they would be interested in adding.

Currently I use the fancy_regex crate for matching and the regex crate for quoting, but this is error-prone, since fancy_regex adds a number of regex pattern syntax extensions which are not handled by regex::escape.

Which syntax causes problems here? I'm asking because I'd expect the regex crate's quoting to take care of them already, e.g. quoting all \ and (.

florianpircher

comment created time in 15 hours

issue comment atlassian/commonmark-java

Table of Contents support

Also, it would probably need to be paired with the heading anchor extension (https://github.com/atlassian/commonmark-java#heading-anchor) to be able to link to headers.

GregJohnStewart

comment created time in 10 days

issue comment atlassian/commonmark-java

Table of Contents support

Hey! Yeah, sounds like a useful extension. It looks like there are a couple of existing syntaxes, listed here: https://alexharv074.github.io/2018/08/28/auto-generating-markdown-tables-of-contents.html

  • {:toc}
  • [[_TOC_]]
  • [toc]

And this one doesn't seem to need any syntax, so I assume that when the extension is used, the table is inserted at the top of the document: https://commonmark.thephpleague.com/1.5/extensions/table-of-contents/

Any thoughts about what we should do? Some more research into prior art would be good.

GregJohnStewart

comment created time in 10 days

issue comment robinst/linkify

"https://www.example.com**" parses as a single link?

Looking at https://tools.ietf.org/search/rfc3986#section-2.2, * is part of sub-delims, so it can be part of a URL. Having said that, this library currently doesn't include trailing , (which is also a sub-delim), so we could exclude trailing * too.

But before I do that, can I ask: Are you linkifying first, and then Markdown parsing? If yes, I would recommend doing the parsing first, and then linkifying only on the resulting text nodes.

tmladek

comment created time in 14 days

issue closed atlassian/commonmark-java

Prevent message trimming

Hello, I was wondering if it is possible to prevent the trimming of the first and last white space of the message. For example:
Hello **sir**! will turn into <p>Hello <strong>sir</strong>!</p>, whereas the result I wanted would be <p> Hello <strong>sir</strong>! </p>.
My reasoning is very particular, so I would understand if it's not possible. I am currently just using the Markdown visitor to get the components since I am not rendering to HTML, which requires me to sometimes split the message into parts; when joining them back together, it loses all the spaces it had.

PS: If it can't be prevented, can you please point me to the part of the code that does the trimming of the node? I would love to fork it and change it for my use.
Thank you!

closed time in a month

MarkMarkine

issue comment atlassian/commonmark-java

Prevent message trimming

Hey. Hm yeah, this is intentional to follow the spec, see examples 192 and 196:

Leading spaces are skipped

Final spaces are stripped before inline parsing, so a paragraph that ends with two or more spaces will not end with a hard line break

Hmm, what if you remember the leading and trailing whitespace, strip it, and then after parsing and rendering, add it back on again? (Not sure what your output format is.)
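A minimal sketch of that idea in plain Java (the transform parameter stands in for whatever parse + render step you use; the class and method names are hypothetical):

```java
import java.util.function.UnaryOperator;

public class WhitespacePreserver {
    /**
     * Remembers leading/trailing whitespace, applies the given transform
     * (e.g. Markdown parse + render) to the trimmed middle, then
     * re-attaches the original whitespace.
     */
    public static String preserveEnds(String input, UnaryOperator<String> transform) {
        int start = 0;
        while (start < input.length() && Character.isWhitespace(input.charAt(start))) {
            start++;
        }
        int end = input.length();
        while (end > start && Character.isWhitespace(input.charAt(end - 1))) {
            end--;
        }
        String leading = input.substring(0, start);
        String trailing = input.substring(end);
        return leading + transform.apply(input.substring(start, end)) + trailing;
    }
}
```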

MarkMarkine

comment created time in a month

PullRequestReviewEvent

Pull request review comment VCCRI/atlantool

major refactor

+package org.victorchang;
+
+import htsjdk.samtools.SAMRecord;
+
+import java.io.DataInput;
+import java.io.IOException;
+import java.util.function.Consumer;
+
+public class SAMRecordGenerator implements BamRecordHandler {

Replace SAM with Sam? To make it consistent with BamRecordHandler etc. Same with other SAM* classes.

huy

comment created time in a month


Pull request review comment VCCRI/atlantool

major refactor

+package org.victorchang;
+
+import java.io.DataInput;
+import java.io.IOException;
+import java.util.function.Consumer;
+
+public class QnameParser implements BamRecordParser<QnameRecord> {
+    private static final int QNAME_SIZE = 256;
+
+    /**
+     * reuse mutable object to minimize memory allocation.
+     */
+    private final QnameRecord current;
+
+    public QnameParser() {
+        current = new QnameRecord(QNAME_SIZE);
+    }
+
+    @Override
+    public void parse(DataInput dataInput, int recordLength, Consumer<QnameRecord> consumer) throws IOException {
+        dataInput.readInt(); // reference seq id
+        dataInput.readInt(); // pos
+        int qnameLen = dataInput.readUnsignedByte();
+        dataInput.readUnsignedByte(); // map q
+        dataInput.readUnsignedShort(); // bin
+        int cigarCount = dataInput.readUnsignedShort();
+        dataInput.readUnsignedShort(); // flag
+        int seqLen = dataInput.readInt();
+        dataInput.readInt(); // next ref id
+        dataInput.readInt(); // next pos
+        dataInput.readInt(); // template len
+
+        readQname(dataInput, qnameLen);
+        consumer.accept(current);
+    }
+
+    private void readQname(DataInput dataInput, int qnameLen) throws IOException {
+        if (qnameLen >= 256) {
+            throw new IllegalStateException("qname must be less than 256 bytes");
+        }
+        dataInput.readFully(current.qname, 0, qnameLen);
+        current.qnameLen = qnameLen - 1;

Can you add a comment here as well (why the - 1)?

huy

comment created time in a month


created tag fancy-regex/fancy-regex

tag 0.4.0

Rust library for regular expressions using "fancy" features like look-around and backreferences

created time in a month

issue comment trishume/syntect

fancy-regex: problems with patterns in some syntaxes

fancy-regex 0.4.0 now supports named groups, see changelog. \g and \G are not yet supported.

sharkdp

comment created time in a month

issue comment fancy-regex/fancy-regex

Add Regex::replace (and replacen, replaceall)

Yeah, it doesn't provide replace functionality yet. You can implement it yourself by repeatedly calling captures_from_pos and appending the pieces of the string before/after matches.
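The shape of that loop is the same in any regex API; as an illustration of the technique using Java's built-in regex (a sketch, not fancy-regex itself):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceAll {
    /**
     * Builds replace-all out of repeated matching: copy the text before
     * each match, append the replacement, then continue after the match.
     */
    public static String replaceAll(Pattern pattern, String input, String replacement) {
        StringBuilder result = new StringBuilder();
        int pos = 0;
        Matcher m = pattern.matcher(input);
        while (pos <= input.length() && m.find(pos)) {
            result.append(input, pos, m.start());
            result.append(replacement);
            if (m.end() > m.start()) {
                pos = m.end();
            } else {
                // Empty match: copy one char and step forward to avoid an infinite loop.
                if (m.end() < input.length()) {
                    result.append(input.charAt(m.end()));
                }
                pos = m.end() + 1;
            }
        }
        if (pos < input.length()) {
            result.append(input, pos, input.length());
        }
        return result.toString();
    }
}
```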

himat

comment created time in a month

issue opened fancy-regex/fancy-regex

Add Regex::captures_iter

See https://docs.rs/regex/1.3.9/regex/struct.Regex.html#method.captures_iter

Currently, you have to call captures_from_pos repeatedly to get this. It would be good to add it as an API. Other things such as the replace API would also make use of it (#49).

created time in a month

issue closed fancy-regex/fancy-regex

Named capture groups

Named capture groups have been requested as a feature here: https://www.reddit.com/r/rust/comments/djv01f/z/f48r5ao

Should not be too hard to implement:

First, extend the parser to parse named groups, storing them in a HashMap<String, usize>.

Then add an API to look up captured groups by name, using the HashMap to find the group index first.

Referencing named groups in backrefs doesn't have to be implemented as part of this (but could be).

Syntax

Oniguruma uses the syntax (?<name>subexp) whereas the regex crate uses (?P<name>subexp). We should probably support both.
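As an aside, Java's built-in regex happens to use the same (?<name>subexp) form as Oniguruma, so the name-based lookup behaviour being proposed can be tried there (illustration only, not fancy-regex code; the helper is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroups {
    /** Extracts the year using a named group; returns null if no match. */
    public static String yearOf(String date) {
        // (?<name>...) declares a named group; Matcher.group(name) looks it up.
        Pattern p = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");
        Matcher m = p.matcher(date);
        return m.find() ? m.group("year") : null;
    }
}
```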

closed time in a month

robinst

issue comment fancy-regex/fancy-regex

Named capture groups

@rxt1077 @johnw42 released in version 0.4.0 now: https://github.com/fancy-regex/fancy-regex/blob/main/CHANGELOG.md#040---2020-09-27

Thank you both! 🎉

robinst

comment created time in a month

push event fancy-regex/fancy-regex

Robin Stocker

commit sha 9247d3dc80d4aad03120bc893793f7b32e6fa434

Prepare for 0.4.0

view details

Robin Stocker

commit sha ed774fa91f8f5fe4dad9507179b89adcf2210511

Version 0.4.0

view details

push time in a month

pull request comment fancy-regex/fancy-regex

Added: `Match:range()` method

Done in https://github.com/fancy-regex/fancy-regex/commit/ea6366527dad0415b57356c952cb3d8f44f60c2c

lebensterben

comment created time in a month

push event fancy-regex/fancy-regex

Robin Stocker

commit sha ea6366527dad0415b57356c952cb3d8f44f60c2c

Add some tests for Match

view details

push time in a month

push event fancy-regex/fancy-regex

Lucius Hu

commit sha ba8c61080adffc9f54cfe7c832739f5e25dd4b05

Added: `Match:range()` method - This is ported from `regex` crate.

view details

Lucius Hu

commit sha 8720b04c3a6331c5b1eeef0155ef630d84301de4

Added: Description for `Match::new()` - Ported from `regex` crate.

view details

Lucius Hu

commit sha 9f6dcae8b76e682bae652750582fe6bf53b739df

Added: `From<Match>` implemented for `&str` and `Range<usize>` - Ported from `regex` crate.

view details

Lucius Hu

commit sha 30adb5b7b16818c2852428f91f91d3af1d170c4a

Modified: `Regex::find()` - Nothing really changed. Just more ergonomic.

view details

Lucius Hu

commit sha 2887f8f1a3695bf0a9f3633e48a6292e0da4eb51

Revert "Modified: `Regex::find()`" This reverts commit 30adb5b7b16818c2852428f91f91d3af1d170c4a. `Option:transpose` is not available on Rust 1.32.0

view details

Robin Stocker

commit sha 50474de536803ba1fb3b5906f488290135f0c28a

Merge pull request #57 from lebensterben/develop

Added: `Match:range()` method

view details

push time in a month

PR merged fancy-regex/fancy-regex

Added: `Match:range()` method
  • These are ported from regex crate.
  • Added Match::range()
  • Added description for Match::new()
  • Implemented From<Match> for string slice and Range<usize>.
+20 -0

1 comment

1 changed file

lebensterben

pr closed time in a month

started VCCRI/atlantool

started time in a month

delete branch VCCRI/atlantool

delete branch : fix-long-qnames

delete time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha feee500d879fdee90a558709dc1615d0aa0464ed

Fix reading QNAMEs longer than 127 bytes

Java's signed bytes strike again.

view details

Robin Stocker

commit sha 711420c3dbeb7dcfe28eafe8f39fbc8b59e721bd

Merge pull request #24 from VCCRI/fix-long-qnames

Fix reading QNAMEs longer than 127 bytes

view details

push time in a month

PR merged VCCRI/atlantool

Fix reading QNAMEs longer than 127 bytes

Java's signed bytes strike again.
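For context, a minimal illustration of this bug class (hypothetical values, not the actual atlantool code): Java's byte is signed, so a length byte in the range 128..255 reads back negative unless masked with & 0xFF.

```java
public class SignedBytes {
    public static void main(String[] args) {
        byte lengthByte = (byte) 200;   // e.g. a QNAME length of 200 stored on disk
        int wrong = lengthByte;         // sign-extends to -56
        int right = lengthByte & 0xFF;  // mask recovers the unsigned value 200
        System.out.println(wrong + " " + right);
    }
}
```

DataInput.readUnsignedByte() does the same masking for you, which is the usual fix.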

+26 -2

0 comments

2 changed files

robinst

pr closed time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha d316e8affc152c7cc1133dd5e2f1c4efdc5d9500

Update README.md

view details

push time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha 3777cddbf2dd9ae959e23f26d70126b02810e743

Update README.md

view details

push time in a month

delete branch VCCRI/atlantool

delete branch : metadata-align-to-beginning

delete time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha cc54e0227261628706640899c20741068bc9c3b4

Align metadata (index) pointer towards start of a block

view details

Robin Stocker

commit sha 7f4fd4fbc8b20bcc6c5dcc2065ab9a02ab795d57

Merge pull request #23 from VCCRI/metadata-align-to-beginning

Align metadata (index) pointer towards start of a block

view details

push time in a month

Pull request review comment VCCRI/atlantool

Only use a single byte for length in KeyPointer

             /**
              * reusable buffer to minimize memory allocation.
              */
-            private final byte[] inputBuff = new byte[256 + 8 + 2];
+            private final byte[] inputBuff = new byte[256];
             private KeyPointer current = null;

             @Override
             public boolean hasNext() {
                 if (current == null) {
                     try {
-                        int entryLen;
+                        int keyLen;
                         try {
-                            entryLen = dataInput.readShort();
+                            keyLen = dataInput.readByte();

Fix: https://github.com/VCCRI/atlantool/pull/24

robinst

comment created time in a month


PR opened VCCRI/atlantool

Fix reading QNAMEs longer than 127 bytes

Java's signed bytes strike again.

+26 -2

0 comments

2 changed files

pr created time in a month

create branch VCCRI/atlantool

branch : fix-long-qnames

created branch time in a month

Pull request review comment VCCRI/atlantool

Only use a single byte for length in KeyPointer

             /**
              * reusable buffer to minimize memory allocation.
              */
-            private final byte[] inputBuff = new byte[256 + 8 + 2];
+            private final byte[] inputBuff = new byte[256];
             private KeyPointer current = null;

             @Override
             public boolean hasNext() {
                 if (current == null) {
                     try {
-                        int entryLen;
+                        int keyLen;
                         try {
-                            entryLen = dataInput.readShort();
+                            keyLen = dataInput.readByte();

Good catch! I'll add a test as well.

robinst

comment created time in a month


delete branch VCCRI/atlantool

delete branch : ahead-of-time-compilation

delete time in a month

delete branch VCCRI/atlantool

delete branch : publish-jar-file

delete time in a month

delete branch VCCRI/atlantool

delete branch : update-readme-with-usage

delete time in a month

delete branch VCCRI/atlantool

delete branch : cli-options

delete time in a month

delete branch VCCRI/atlantool

delete branch : qname-file

delete time in a month

delete branch VCCRI/atlantool

delete branch : overwrite-index

delete time in a month

delete branch VCCRI/atlantool

delete branch : fix-duplicate-header-printing

delete time in a month

delete branch VCCRI/atlantool

delete branch : index-directory-ext

delete time in a month

delete branch VCCRI/atlantool

delete branch : integration-tests

delete time in a month

delete branch VCCRI/atlantool

delete branch : optimized-qnames-search

delete time in a month

delete branch VCCRI/atlantool

delete branch : index-version

delete time in a month

delete branch VCCRI/atlantool

delete branch : strip-null-termiated-from-qname

delete time in a month

delete branch VCCRI/atlantool

delete branch : version

delete time in a month

delete branch VCCRI/atlantool

delete branch : use-standard-virtual-offset-encoding

delete time in a month

delete branch VCCRI/atlantool

delete branch : compression-option

delete time in a month

PR opened VCCRI/atlantool

Align metadata (index) pointer towards start of a block

See code comment.

+17 -9

0 comments

1 changed file

pr created time in a month

issue opened VCCRI/atlantool

Smaller index size by storing block-level pointers only

Writing this up in an issue so that it's not lost: As of writing this, the index size for a 120 GB file is 11 GB. In order to reduce that more, I looked into storing only block-level pointers.

Index changes

Currently, in the .data.bgz file we store the virtual offset (location to record in BAM file) with each QNAME. That means each offset is a different 8 byte number, and hard to compress.

Instead of storing individual offsets, we can instead:

  • On the initial scan, remember the position of the first record per block. Write those positions to a .blocks file, 8 bytes per pointer, one after the other. Can be compressed or uncompressed.
  • When storing the value for a QNAME in .data.bgz, instead of storing 8 bytes, just store a block index (block number). In the 120 GB file, there are ~9 million blocks, so a block index takes up 3 bytes at worst. Because we don't know how many blocks there will be, using a variable-length encoding like varint makes sense: earlier indexes only use 1, 2 or 3 bytes, while not putting a (potentially too low) limit on the number of blocks.

Search changes

Now to look up a record in the BAM, we need to:

  • Find the QNAME in data (same as before)
  • Read its block number
  • Look up the block position in the blocks index. If it's uncompressed, all you need is to seek to byte position block_number * 8
  • Go to that position in the BAM file and do a linear scan until that QNAME is found
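Step 3 above is just a fixed-stride seek into the uncompressed .blocks file. A sketch using RandomAccessFile (file layout as described above, 8 bytes per pointer; the class name is hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BlockLookup {
    /**
     * Reads the 8-byte position of the given block from an uncompressed
     * .blocks file laid out as consecutive big-endian longs.
     */
    public static long blockPosition(RandomAccessFile blocksFile, long blockNumber) throws IOException {
        blocksFile.seek(blockNumber * 8); // fixed stride: one long per block
        return blocksFile.readLong();
    }
}
```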

Results

I did the indexing changes, and the resulting file sizes were:

6.0G  qname.data.bgz
531K  qname.index.bgz
71M   qname.blocks

That's almost a 50% reduction in index size, so pretty good.

I didn't have time to check the effect on search performance yet, but I don't think it would be too bad. In the average case, we'd have to scan a single block of the BAM, which has a maximum size of 64 KB.

Thoughts

  • I didn't compress the .blocks file to allow for seeking, and 71 MB is small enough. But we could compress it, read it all into memory and then use that to retrieve the block positions.
  • Instead of having an additional .blocks file, could we just store the block pointer in .data.bgz and hope that compression takes care of things? Answer: I don't think it would work because the compression is done on chunks of QNAMEs, which don't necessarily contain the same block pointers.
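For reference, a sketch of the variable-length encoding mentioned above (a standard LEB128-style varint, 7 bits per byte plus a continuation bit; this is one common choice, not necessarily the final format):

```java
import java.io.ByteArrayOutputStream;

public class Varint {
    /**
     * Encodes a non-negative value 7 bits at a time, least significant
     * group first; the high bit of each byte says whether more follow.
     */
    public static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte, continuation bit clear
        return out.toByteArray();
    }

    public static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }
}
```

Block numbers below 128 take one byte, below ~16K two bytes, and so on, which matches the "earlier indexes use fewer bytes" property above.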

created time in a month

create branch VCCRI/atlantool

branch : metadata-align-to-beginning

created branch time in a month


delete branch VCCRI/atlantool

delete branch : friendlier-logging

delete time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha cd59befbfe05aa46ec50d74f681ed08f506f3123

Make logging format a bit more friendly (single line, less noise)

view details

Robin Stocker

commit sha 19304e395b7e1c00ce54b1c687c7ce77bd3c9b7e

Merge pull request #20 from VCCRI/friendlier-logging

Make logging format a bit more friendly (single line, less noise)

view details

push time in a month

PR merged VCCRI/atlantool

Make logging format a bit more friendly (single line, less noise)

Before:

INFO: First 5 index blocks
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=0, uoffset=62110, key=SOLEXA-1GA-1_1_FC20EMA:7:100:434:814
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=13166, uoffset=58664, key=SOLEXA-1GA-1_1_FC20EMA:7:100:756:128
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=26279, uoffset=55210, key=SOLEXA-1GA-1_1_FC20EMA:7:101:176:385
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=39389, uoffset=51782, key=SOLEXA-1GA-1_1_FC20EMA:7:101:507:820
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=52405, uoffset=48344, key=SOLEXA-1GA-1_1_FC20EMA:7:101:814:100
Sep. 25, 2020 3:19:20 PM org.victorchang.IndexCommand call
INFO: Create index completed in 10395ms

After:

[2020-09-25 15:18:38] [INFO   ] First 5 index blocks
[2020-09-25 15:18:38] [INFO   ] coffset=0, uoffset=62110, key=SOLEXA-1GA-1_1_FC20EMA:7:100:434:814
[2020-09-25 15:18:38] [INFO   ] coffset=13166, uoffset=58664, key=SOLEXA-1GA-1_1_FC20EMA:7:100:756:128
[2020-09-25 15:18:38] [INFO   ] coffset=26279, uoffset=55210, key=SOLEXA-1GA-1_1_FC20EMA:7:101:176:385
[2020-09-25 15:18:38] [INFO   ] coffset=39389, uoffset=51782, key=SOLEXA-1GA-1_1_FC20EMA:7:101:507:820
[2020-09-25 15:18:38] [INFO   ] coffset=52405, uoffset=48344, key=SOLEXA-1GA-1_1_FC20EMA:7:101:814:100
[2020-09-25 15:18:38] [INFO   ] Create index completed in 7476ms
+17 -4

0 comments

3 changed files

robinst

pr closed time in a month

delete branch VCCRI/atlantool

delete branch : length-one-byte

delete time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha 2fb42185e4cc2bc325ff880c1fe69cffa7dd9d07

Only use a single byte for length in KeyPointer

Because the pointer size is fixed, we don't need to include it in the length. QNAME is limited to 254 characters, so it fits into one byte.

view details

Robin Stocker

commit sha 05505e4ee741a3ea325c582c7879064c294c7143

Merge pull request #21 from VCCRI/length-one-byte

Only use a single byte for length in KeyPointer

view details

push time in a month

PR merged VCCRI/atlantool

Only use a single byte for length in KeyPointer

Because the pointer size is fixed, we don't need to include it in the length. QNAME is limited to 254 characters, so it fits into one byte:

Screen Shot 2020-09-25 at 15 34 23

+16 -13

0 comments

4 changed files

robinst

pr closed time in a month

PR opened VCCRI/atlantool

Only use a single byte for length in KeyPointer

Because the pointer size is fixed, we don't need to include it in the length. QNAME is limited to 254 characters, so it fits into one byte:

Screen Shot 2020-09-25 at 15 34 23

+16 -13

0 comments

4 changed files

pr created time in a month

create branch VCCRI/atlantool

branch : length-one-byte

created branch time in a month

PR closed VCCRI/atlantool

Use standard encoding for virtual offset (coffset<<16|uoffset)

Not sure why we were using something different before. Also, just use the classes from the library now that we use it.
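For reference, the standard BGZF virtual offset packs the compressed block address into the upper 48 bits and the within-block offset into the lower 16. A simplified sketch of what htsjdk's BlockCompressedFilePointerUtil does (assumption: uoffset fits in 16 bits, i.e. below 65536):

```java
public class VirtualOffset {
    /**
     * coffset = byte address of the compressed block (upper 48 bits),
     * uoffset = offset within the uncompressed block (lower 16 bits).
     */
    public static long pack(long coffset, int uoffset) {
        return (coffset << 16) | (uoffset & 0xFFFFL);
    }

    public static long coffset(long pointer) {
        return pointer >>> 16;
    }

    public static int uoffset(long pointer) {
        return (int) (pointer & 0xFFFF);
    }
}
```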

Screen Shot 2020-09-22 at 15 32 20

+21 -16

2 comments

5 changed files

robinst

pr closed time in a month

pull request comment VCCRI/atlantool

Use standard encoding for virtual offset (coffset<<16|uoffset)

Done in #17

robinst

comment created time in a month

delete branch VCCRI/atlantool

delete branch : htsjdk-streams

delete time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha 48d0e892dd732c83e98a7b3d3c671faecb0f2d32

Use BlockCompressedInputStream and BlockCompressedOutputStream from htsjdk

* Less custom code
* Faster: Indexing the 129 GB file took 1h instead of 1h30m
* More correct: Before, the files we wrote didn't have the right BGZF headers (FLG was 0 instead of 4) and so were not readable by other tools
* We can take advantage of future improvements such as https://github.com/samtools/htsjdk/pull/1249

view details

Robin Stocker

commit sha d96a2851e66a613a0458189511b369b44576d182

Address feedback

view details

Robin Stocker

commit sha 3aefa9855ee87bd33a86b1900a6639f9ac084603

Merge pull request #17 from VCCRI/htsjdk-streams

Use BlockCompressedInputStream and BlockCompressedOutputStream from htsjdk

view details

push time in a month

PR merged VCCRI/atlantool

Use BlockCompressedInputStream and BlockCompressedOutputStream from htsjdk
  • Less custom code
  • Faster: Indexing the 129 GB file took 1h instead of 1h30m
  • More correct: Before, the files we wrote didn't have the right BGZF headers (FLG was 0 instead of 4) and so were not readable by other tools.
  • We can take advantage of future improvements such as https://github.com/samtools/htsjdk/pull/1249

Also, example1b had an end-of-file marker in the middle of the file (probably because the blocks were concatenated):

Screen Shot 2020-09-25 at 10 07 00

From spec:

Screen Shot 2020-09-25 at 10 40 13

+50 -771

2 comments

20 changed files

robinst

pr closed time in a month

pull request comment VCCRI/atlantool

Use BlockCompressedInputStream and BlockCompressedOutputStream from htsjdk

Looks good. Run the exhaustive test to make sure nothing is broken, since it is ignored currently.

Done!

robinst

comment created time in a month

PR opened VCCRI/atlantool

Make logging format a bit more friendly (single line, less noise)

Before:

INFO: First 5 index blocks
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=0, uoffset=62110, key=SOLEXA-1GA-1_1_FC20EMA:7:100:434:814
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=13166, uoffset=58664, key=SOLEXA-1GA-1_1_FC20EMA:7:100:756:128
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=26279, uoffset=55210, key=SOLEXA-1GA-1_1_FC20EMA:7:101:176:385
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=39389, uoffset=51782, key=SOLEXA-1GA-1_1_FC20EMA:7:101:507:820
Sep. 25, 2020 3:19:20 PM org.victorchang.QnameIndexer lambda$index$2
INFO: coffset=52405, uoffset=48344, key=SOLEXA-1GA-1_1_FC20EMA:7:101:814:100
Sep. 25, 2020 3:19:20 PM org.victorchang.IndexCommand call
INFO: Create index completed in 10395ms

After:

[2020-09-25 15:18:38] [INFO   ] First 5 index blocks
[2020-09-25 15:18:38] [INFO   ] coffset=0, uoffset=62110, key=SOLEXA-1GA-1_1_FC20EMA:7:100:434:814
[2020-09-25 15:18:38] [INFO   ] coffset=13166, uoffset=58664, key=SOLEXA-1GA-1_1_FC20EMA:7:100:756:128
[2020-09-25 15:18:38] [INFO   ] coffset=26279, uoffset=55210, key=SOLEXA-1GA-1_1_FC20EMA:7:101:176:385
[2020-09-25 15:18:38] [INFO   ] coffset=39389, uoffset=51782, key=SOLEXA-1GA-1_1_FC20EMA:7:101:507:820
[2020-09-25 15:18:38] [INFO   ] coffset=52405, uoffset=48344, key=SOLEXA-1GA-1_1_FC20EMA:7:101:814:100
[2020-09-25 15:18:38] [INFO   ] Create index completed in 7476ms
+17 -4

0 comments

3 changed files

pr created time in a month

create branch VCCRI/atlantool

branch : friendlier-logging

created branch time in a month


Pull request review comment VCCRI/atlantool

Use BlockCompressedInputStream and BlockCompressedOutputStream from htsjdk

 public long read(Path bamFile, BamRecordHandler handler, long bytesLimit) throws
         final SAMFileHeader header = readSamHeader(bamFile);
         handler.onHeader(header);
         long recordCount = 0;
-        try (FileChannel fileChannel = FileChannel.open(bamFile, READ)) {
-            InputStream compressedStream = new BufferedInputStream(Channels.newInputStream(fileChannel), FILE_BUFF_SIZE);
-
-            GzipEntryPositionFinder positionFinder = new GzipEntryPositionFinder();
-            CountingInputStream uncompressedStream = new CountingInputStream(
-                    new BufferedInputStream(
-                            new GzipConcatenatedInputStream(compressedStream, positionFinder), FILE_BUFF_SIZE));
+        try (BlockCompressedInputStream blockCompressedInputStream = new BlockCompressedInputStream(bamFile.toFile())) {

-            LittleEndianDataInputStream dataInput = new LittleEndianDataInputStream(uncompressedStream);
+            CountingInputStream countingInputStream = new CountingInputStream(blockCompressedInputStream);
+            LittleEndianDataInputStream dataInput = new LittleEndianDataInputStream(countingInputStream);
             assertMagic(dataInput);
             skipHeaderText(dataInput);
             skipReferences(dataInput);

             while (true) {
+                long filePointer = blockCompressedInputStream.getFilePointer();
+                long coffset = BlockCompressedFilePointerUtil.getBlockAddress(filePointer);
+                int uoffset = BlockCompressedFilePointerUtil.getBlockOffset(filePointer);
+
                 int recordLength;
                 try {
                     recordLength = dataInput.readInt();
                 } catch (EOFException ignored) {
                     break;
                 }

-                GzipEntryPosition position = positionFinder.find(uncompressedStream.getBytesRead() - 4);
-
-                if (position == null) {
-                    throw new IllegalStateException("Can't find start of a gzip entry");
-                }
-
-                if (position.getCompressed() >= bytesLimit) {
+                if (coffset > bytesLimit) {

👍 done

robinst

comment created time in a month


push event VCCRI/atlantool

Robin Stocker

commit sha d96a2851e66a613a0458189511b369b44576d182

Address feedback

view details

push time in a month

Pull request review comment VCCRI/atlantool

Add version information

 static Path getDefaultIndexPath(Path bamPath) {
     }
 }

-@Command(name = "index")
-class IndexCommand implements Callable<Integer> {
-    @Parameters(paramLabel = "bam-file", description = "Path to the BAM file")
-    Path bamPath;
+class CommitVersionProvider implements CommandLine.IVersionProvider {

-    @Option(names = {"-i", "--index-path"}, description = "Directory to store index files. By default uses a directory name that starts with the BAM file name (so stored next to it)")
-    Path indexDirectory;
-    @Option(names = "--thread-count", description = "Number of threads used for sorting", defaultValue = "1")
-    int threadCount;
-    @Option(names = "--sort-buffer-size", description = "Maximum number of records per buffer used for sorting", defaultValue = "500000")
-    int sortBufferSize;
-    @Option(names = {"-l", "--limit-bytes"}, description = "Only read and index first given bytes")
-    long bytesLimit;
-    @Option(names = {"-t", "--temporary-path"}, description = "Directory to store temporary files for sorting. By default uses the index path")
-    Path tempDirectory;
-    @Option(names = {"-v", "--verbose"}, description = "Switch on verbose output", defaultValue = "false")
-    boolean verbose;
-    @Option(names = {"--force"}, description = "Overwrite existing index", defaultValue = "false")
-    boolean force;
-    @Option(names = {"--compression"}, description = "Compression level (1 to 9). 1 = faster but bigger index file size, 9 = slower but smaller index file size", defaultValue = "6")
-    int compressionLevel;
+    private final static String BASE_VERSION = "1.0";
+    private static final Logger LOG = LoggerFactory.getLogger(QnameCommand.class);

     @Override
-    public Integer call() {
-        java.util.logging.Logger.getLogger("")
-                .setLevel(verbose ? java.util.logging.Level.ALL : java.util.logging.Level.SEVERE);
-
-        if (!Files.isRegularFile(bamPath)) {
-            System.err.println(bamPath + " not found.");
-            return -1;
-        }
-        if (!createIndexDirectory(bamPath, force)) {
-            return -1;
-        }
-
-        if (tempDirectory == null) {
-            tempDirectory = indexDirectory;
-        } else if (!createDirectory(tempDirectory)) {
-            return -1;
-        }
-
-        bytesLimit = bytesLimit == 0 ? Long.MAX_VALUE : bytesLimit;
-
-        BamFileReader fileReader = new DefaultBamFileReader(new EfficientBamRecordParser());
-        QnameIndexer indexer = new QnameIndexer(fileReader,
-                new KeyPointerWriter(compressionLevel),
-                new KeyPointerReader(),
-                threadCount,
-                sortBufferSize);
+    public String[] getVersion() {
+        Properties p = new Properties();
         try {
-            LOG.info("Creating index {} using {} threads with sort buffer size of {} records",
-                    bytesLimit == Long.MAX_VALUE ? "" : "for the first " + bytesLimit + " bytes",
-                    threadCount,
-                    sortBufferSize);
-
-            long start = System.nanoTime();
-            indexer.index(indexDirectory, bamPath, tempDirectory, bytesLimit);
-            long finish = System.nanoTime();
-
-            LOG.info("Create index completed in {}", (finish - start) / 1000_000 + "ms");
+            final InputStream resourceAsStream = getClass().getClassLoader().getResourceAsStream("git.properties");
+            p.load(resourceAsStream);
+            final String commitId = p.getProperty("git.commit.id.abbrev");
+            final String buildTime = p.getProperty("git.build.time");
+            return new String[] { "Version: " + BASE_VERSION, "Release: " + commitId, "Release Date: " + buildTime };

How about including the index version as well?

amitdev

comment created time in a month
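A sketch of what that suggestion could look like: appending an "Index Version" line to the array returned by `getVersion()`. The `latestVersion()` stub below is hypothetical (standing in for this repo's `IndexVersion.LATEST`), so the example compiles on its own; everything else mirrors the diff above with the `git.properties` lookup simplified out.

```java
import java.util.Properties;

public class VersionLines {
    static final String BASE_VERSION = "1.0";

    // Hypothetical stand-in for IndexVersion.LATEST.version() from this repo.
    static int latestVersion() {
        return 1;
    }

    static String[] versionLines(Properties p) {
        // Defaults cover the case where git.properties is absent.
        String commitId = p.getProperty("git.commit.id.abbrev", "unknown");
        String buildTime = p.getProperty("git.build.time", "unknown");
        return new String[] {
                "Version: " + BASE_VERSION,
                "Release: " + commitId,
                "Release Date: " + buildTime,
                "Index Version: " + latestVersion()
        };
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("git.commit.id.abbrev", "abc1234");
        p.setProperty("git.build.time", "2020-01-01");
        for (String line : versionLines(p)) {
            System.out.println(line);
        }
    }
}
```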


PR opened VCCRI/atlantool

Use BlockCompressedInputStream and BlockCompressedOutputStream from htsjdk
  • Less custom code
  • Faster: Indexing the 129 GB file took 1h instead of 1h30m
  • More correct: Before, the files we wrote didn't have the right BGZF headers (FLG was 0 instead of 4) and so were not readable by other tools
  • We can take advantage of future improvements such as https://github.com/samtools/htsjdk/pull/1249
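The "FLG was 0 instead of 4" point refers to the BGZF framing defined in the SAM/BAM specification: every BGZF block is a gzip member whose FLG byte must have FEXTRA (0x04) set, with a "BC" extra subfield carrying the block size. A minimal sketch of that header check (not atlantool code, just an illustration of the failing condition):

```java
public class BgzfCheck {
    // True if the first 16 bytes look like a BGZF block header per the
    // SAM/BAM spec: gzip magic, deflate, FEXTRA flag, and "BC" subfield.
    static boolean looksLikeBgzf(byte[] header) {
        return header.length >= 16
                && (header[0] & 0xff) == 0x1f       // gzip ID1
                && (header[1] & 0xff) == 0x8b       // gzip ID2
                && header[2] == 8                   // CM = deflate
                && (header[3] & 0x04) != 0          // FLG must have FEXTRA (4) set
                && header[12] == 'B' && header[13] == 'C'; // BGZF subfield id
    }

    public static void main(String[] args) {
        byte[] good = {0x1f, (byte) 0x8b, 8, 4, 0, 0, 0, 0, 0, (byte) 0xff, 6, 0, 'B', 'C', 2, 0};
        byte[] bad  = {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff, 6, 0, 'B', 'C', 2, 0};
        System.out.println(looksLikeBgzf(good)); // true
        System.out.println(looksLikeBgzf(bad));  // false: FLG == 0, as in the old writer
    }
}
```

A plain gzip stream fails this check too, which is why the old output was not readable by BGZF-aware tools even though generic gzip tools could decompress it.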
+43 -725

0 comment

17 changed files

pr created time in a month

create branch VCCRI/atlantool

branch : htsjdk-streams

created branch time in a month


Pull request review comment VCCRI/atlantool

Index version

```diff
+package org.victorchang;
+
+public abstract class IndexVersion {
+    private IndexVersion() {
+    }
+
+    public static final IndexVersion LATEST = new IndexVersion() {
+        @Override
+        public String fileName(String ext) {
+            return "qname-version" + version() + "." + ext;
```

Sounds good! That will also help my poor brain figure out which one comes first :)

huy

comment created time in a month

push event VCCRI/atlantool

Robin Stocker

commit sha bd3dce0de9c02bdd1dda42ff962240d913cc3ad0

Set compression level back to 6, add CLI option

Level 9 is about half as fast as level 6, for a gain of about 10%. Add it as an option if index size is a concern, but default should probably be 6.

view details

Robin Stocker

commit sha ea41df6f8ad3fe7a266f118e1092d74cc5b5e887

Merge pull request #15 from VCCRI/compression-option

Set compression level back to 6, add CLI option

view details

push time in a month

PR merged VCCRI/atlantool

Set compression level back to 6, add CLI option

Level 9 is about half as fast as level 6, for a gain of about 10%. Add it as an option if index size is a concern, but default should probably be 6.

+24 -30

0 comment

7 changed files

robinst

pr closed time in a month
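The size/speed trade-off behind the `--compression` option can be sketched with `java.util.zip.Deflater`: BGZF blocks are deflate-compressed, so compression levels behave the same way there. The numbers this prints are illustrative only, not the 129 GB benchmark from the PR.

```java
import java.util.zip.Deflater;

public class CompressionLevels {
    // Compress data at the given level and return the compressed size in bytes.
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] out = new byte[data.length * 2 + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] data = "qname-like repetitive data ".repeat(10_000).getBytes();
        System.out.println("level 1: " + compressedSize(data, 1) + " bytes"); // fastest, biggest
        System.out.println("level 6: " + compressedSize(data, 6) + " bytes"); // the chosen default
        System.out.println("level 9: " + compressedSize(data, 9) + " bytes"); // typically smallest, slowest
    }
}
```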

push event VCCRI/atlantool

Robin Stocker

commit sha ada07ae96b6bc17731975c264dd838670c08cda6

Update .gitignore

Fix `*.iml` ignore.

view details

push time in a month

PR opened VCCRI/atlantool

Set compression level back to 6, add CLI option

Level 9 is about half as fast as level 6, for a gain of about 10%. Add it as an option if index size is a concern, but default should probably be 6.

+24 -30

0 comment

7 changed files

pr created time in a month

create branch VCCRI/atlantool

branch : compression-option

created branch time in a month


Pull request review comment VCCRI/atlantool

Optimized qnames search

```diff
 public QnameSearcher(KeyPointerReader keyPointerReader, BamRecordReader recordRe
         this.handler = handler;
     }
 
-    public int search(Path bamFile, Path indexFolder, String qname) throws IOException {
-        Path pathLevel1 = indexFolder.resolve("qname.1");
-        FileChannel channelLevel1 = FileChannel.open(pathLevel1, READ);
-        InputStream inputStreamLevel1 = Channels.newInputStream(channelLevel1);
-
-        byte[] input = Ascii7Coder.INSTANCE.encode(qname);
-        Iterable<KeyPointer> indexLevel1 = () -> keyPointerReader.read(inputStreamLevel1).iterator();
-
-        KeyPointer key = new KeyPointer(0, input, input.length);
-        KeyPointer start = key;
-        for (KeyPointer x : indexLevel1) {
-            if (x.compareTo(key) >= 0) {
-                break;
-            }
-            start = x;
+    public int search(Path bamFile, Path indexFolder, Set<String> qnames) throws IOException {
+        final List<Long> pointersForQname = getPointersForQname(indexFolder, qnames);
+        for (Long pointer : pointersForQname) {
+            recordReader.read(bamFile, pointer, handler);
         }
-        inputStreamLevel1.close();
-
-        Path pathLevel0 = indexFolder.resolve("qname.0");
-        FileChannel channelLevel0 = FileChannel.open(pathLevel0, READ);
-        long coffset = PointerPacker.INSTANCE.unpackCompressedOffset(start.getPointer());
-        int uoffset = PointerPacker.INSTANCE.unpackUnCompressedOffset(start.getPointer());
+        return pointersForQname.size();
+    }
 
-        if (coffset >= channelLevel0.size()) {
-            return 0;
-        }
-        channelLevel0.position(coffset);
-        InputStream inputStreamLevel0 = Channels.newInputStream(channelLevel0);
-
-        int found = 0;
-        Iterable<KeyPointer> indexLevel0 =  () -> keyPointerReader.read(inputStreamLevel0, uoffset).iterator();
-        for (KeyPointer x : indexLevel0) {
-            if (Arrays.equals(x.getKey(), input)) {
-                recordReader.read(bamFile, x.getPointer(), handler);
-                found++;
-            }
-            if (Arrays.compareUnsigned(x.getKey(), input) > 0) {
-                break;
-            }
-        }
+    List<Long> getPointersForQname(Path indexFolder, Set<String> qnames) throws IOException {
+        final Map<byte[], KeyPointer> qnameToPointer = getQnameToPointerMap(indexFolder, qnames);
 
-        inputStreamLevel0.close();
+        // Map from a pointer -> set of qnames in that level
+        final Map<Long, Set<byte[]>> pointerToQname = qnameToPointer.entrySet().stream()
+                .collect(toMap(e -> e.getValue().getPointer(), e -> Set.of(e.getKey()), this::concatSet));
 
-        return found;
+        return pointerToQname.entrySet().stream()
+                .flatMap(e -> getPointers(e.getValue(), indexFolder, e.getKey()).stream())
+                .collect(toList());
     }
 
-    public static class DebuggingHandler implements BamRecordHandler {
-
-        @Override
-        public void onHeader(SAMFileHeader header) {
+    /**
+     * Returns a map from the qname -> Key pointer location in the index file.
+     */
+    private Map<byte[], KeyPointer> getQnameToPointerMap(Path indexFolder, Set<String> qnames) throws IOException {
+        Path pathLevel1 = indexFolder.resolve("qname.1");
+        try (InputStream inputStreamLevel1 = Channels.newInputStream(FileChannel.open(pathLevel1, READ))) {
+            final List<KeyPointer> keyPointers = qnames.stream()
+                    .map(Ascii7Coder.INSTANCE::encode)
+                    .map(input -> new KeyPointer(0, input, input.length))
+                    .collect(toList());
+
+            final Map<byte[], KeyPointer> keyPointerMap = keyPointers.stream()
+                    .collect(toMap(KeyPointer::getKey, Function.identity()));
+
+            keyPointerReader.read(inputStreamLevel1)
+                    .forEach(indexPointer -> {
+                        for (KeyPointer keyPointer : keyPointers) {
+                            if (indexPointer.compareTo(keyPointer) < 0) {
+                                keyPointerMap.put(keyPointer.getKey(), indexPointer);
+                            }
```

Similar to the other search, this could stop once it has found all the pointers that it needs, no? This currently goes through the whole index every time, or am I missing something?

amitdev

comment created time in a month
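The early exit suggested above can be sketched abstractly: stop scanning the level-1 index as soon as an entry is past the largest queried key, at which point every query already has its floor entry. Keys and index entries are modeled as plain sorted strings here (an assumption; the real code compares `KeyPointer` byte arrays).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class EarlyExitScan {
    /** Returns, for each query key, the last index entry strictly less than it. */
    static Map<String, String> floorEntries(List<String> sortedIndex, SortedSet<String> keys) {
        Map<String, String> floor = new TreeMap<>();
        String largest = keys.last();
        for (String entry : sortedIndex) {
            if (entry.compareTo(largest) >= 0) {
                break; // all remaining entries are >= every key: stop early
            }
            for (String key : keys) {
                if (entry.compareTo(key) < 0) {
                    floor.put(key, entry);
                }
            }
        }
        return floor;
    }

    public static void main(String[] args) {
        List<String> index = List.of("a", "c", "e", "g", "i");
        SortedSet<String> keys = new TreeSet<>(List.of("d", "f"));
        System.out.println(floorEntries(index, keys)); // {d=c, f=e}
    }
}
```

The scan stops at "g" here instead of reading "i", which is the saving the comment is after; on a real index file that avoids decompressing the rest of the level-1 stream.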


Pull request review comment VCCRI/atlantool

Index version

```diff
+package org.victorchang;
+
+public abstract class IndexVersion {
+    private IndexVersion() {
+    }
+
+    public static final IndexVersion LATEST = new IndexVersion() {
+        @Override
+        public String fileName(String ext) {
+            return "qname-version" + version() + "." + ext;
```

So the file name is going to be e.g. qname-version0.0 and qname-version0.1? I'm wondering whether we want to use an actual extension, e.g. .bgz (see list here: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/util/FileExtensions.java#L79)

I think it would also be nice to have a bit more descriptive names. So with those two suggestions, the new names would be:

  • qname.0 -> qname.v1.records.bgz
  • qname.1 -> qname.v1.index.bgz

What do people think?

huy

comment created time in a month
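A minimal sketch of the naming scheme proposed in that comment, with hypothetical helper names (the actual PR uses an `IndexVersion.fileName(ext)` method instead):

```java
public class IndexFileNames {
    // Hypothetical helpers illustrating the proposed names.
    static String recordsFile(int version) {
        return "qname.v" + version + ".records.bgz";
    }

    static String indexFile(int version) {
        return "qname.v" + version + ".index.bgz";
    }

    public static void main(String[] args) {
        System.out.println(recordsFile(1)); // qname.v1.records.bgz (was qname.0)
        System.out.println(indexFile(1));   // qname.v1.index.bgz (was qname.1)
    }
}
```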


Pull request review comment VCCRI/atlantool

Optimized qnames search

```diff
 public QnameSearcher(KeyPointerReader keyPointerReader, BamRecordReader recordRe
         this.handler = handler;
     }
 
-    public int search(Path bamFile, Path indexFolder, String qname) throws IOException {
-        Path pathLevel1 = indexFolder.resolve("qname.1");
-        FileChannel channelLevel1 = FileChannel.open(pathLevel1, READ);
-        InputStream inputStreamLevel1 = Channels.newInputStream(channelLevel1);
-
-        byte[] input = Ascii7Coder.INSTANCE.encode(qname);
-        Iterable<KeyPointer> indexLevel1 = () -> keyPointerReader.read(inputStreamLevel1).iterator();
-
-        KeyPointer key = new KeyPointer(0, input, input.length);
-        KeyPointer start = key;
-        for (KeyPointer x : indexLevel1) {
-            if (x.compareTo(key) >= 0) {
-                break;
-            }
-            start = x;
+    public int search(Path bamFile, Path indexFolder, Set<String> qnames) throws IOException {
+        final List<Long> pointersForQname = getPointersForQname(indexFolder, qnames);
+        for (Long pointer : pointersForQname) {
+            recordReader.read(bamFile, pointer, handler);
         }
-        inputStreamLevel1.close();
-
-        Path pathLevel0 = indexFolder.resolve("qname.0");
-        FileChannel channelLevel0 = FileChannel.open(pathLevel0, READ);
-        long coffset = PointerPacker.INSTANCE.unpackCompressedOffset(start.getPointer());
-        int uoffset = PointerPacker.INSTANCE.unpackUnCompressedOffset(start.getPointer());
+        return pointersForQname.size();
+    }
 
-        if (coffset >= channelLevel0.size()) {
-            return 0;
-        }
-        channelLevel0.position(coffset);
-        InputStream inputStreamLevel0 = Channels.newInputStream(channelLevel0);
-
-        int found = 0;
-        Iterable<KeyPointer> indexLevel0 =  () -> keyPointerReader.read(inputStreamLevel0, uoffset).iterator();
-        for (KeyPointer x : indexLevel0) {
-            if (Arrays.equals(x.getKey(), input)) {
-                recordReader.read(bamFile, x.getPointer(), handler);
-                found++;
-            }
-            if (Arrays.compareUnsigned(x.getKey(), input) > 0) {
-                break;
-            }
-        }
+    List<Long> getPointersForQname(Path indexFolder, Set<String> qnames) throws IOException {
+        final Map<byte[], KeyPointer> qnameToPointer = getQnameToPointerMap(indexFolder, qnames);
 
-        inputStreamLevel0.close();
+        // Map from a pointer -> list of qnames in that level
+        final Map<Long, List<byte[]>> pointerToQname = qnameToPointer.entrySet().stream()
+                .collect(toMap(e -> e.getValue().getPointer(), e -> singletonList(e.getKey()), this::concatList));
 
-        return found;
+        return pointerToQname.entrySet().stream()
+                .flatMap(e -> getPointers(e.getValue(), indexFolder, e.getKey()).stream())
+                .collect(toList());
     }
 
-    public static class DebuggingHandler implements BamRecordHandler {
-
-        @Override
-        public void onHeader(SAMFileHeader header) {
+    /**
+     * Returns a map from the qname -> Key pointer location in the index file.
+     */
+    private Map<byte[], KeyPointer> getQnameToPointerMap(Path indexFolder, Set<String> qnames) throws IOException {
+        Path pathLevel1 = indexFolder.resolve("qname.1");
+        try (InputStream inputStreamLevel1 = Channels.newInputStream(FileChannel.open(pathLevel1, READ))) {
+            final List<KeyPointer> keyPointers = qnames.stream()
+                    .map(Ascii7Coder.INSTANCE::encode)
+                    .map(input -> new KeyPointer(0, input, input.length))
+                    .collect(toList());
+
+            final Map<byte[], KeyPointer> keyPointerMap = keyPointers.stream()
+                    .collect(toMap(KeyPointer::getKey, Function.identity()));
+
+            keyPointerReader.read(inputStreamLevel1)
+                    .forEach(inputKeyPointer -> {
+                        for (KeyPointer keyPointer : keyPointers) {
+                            if (inputKeyPointer.compareTo(keyPointer) < 0) {
+                                keyPointerMap.put(keyPointer.getKey(), inputKeyPointer);
+                            }
+                        }
+                    });
+            return keyPointerMap;
         }
+    }
 
-        @Override
-        public void onAlignmentPosition(long blockPos, int offset) {
-        }
+    private List<Long> getPointers(List<byte[]> qnames, Path indexFolder, long keyPointer) {
+        try {
+            Path pathLevel0 = indexFolder.resolve("qname.0");
+            FileChannel channelLevel0 = FileChannel.open(pathLevel0, READ);
 
-        @Override
-        public void onQname(byte[] qnameBuffer, int qnameLen) {
-            String decoded = Ascii7Coder.INSTANCE.decode(qnameBuffer, 0, qnameLen);
-            log.info("qname " + decoded);
-        }
+            long compressedOffset = PointerPacker.INSTANCE.unpackCompressedOffset(keyPointer);
+            int unCompressedOffset = PointerPacker.INSTANCE.unpackUnCompressedOffset(keyPointer);
 
-        @Override
-        public void onSequence(byte[] seqBuffer, int seqLen) {
-            String decoded = SeqDecoder.INSTANCE.decode(seqBuffer, 0, seqLen);
-            log.info("sequence " + decoded);
-        }
+            if (compressedOffset >= channelLevel0.size()) {
+                return emptyList();
+            }
 
-        @Override
-        public void onAlignmentRecord(SAMRecord record) {
-            final ByteArrayOutputStream inMemoryStream = new ByteArrayOutputStream();
-            final SAMTextWriter writer = new SAMTextWriter(inMemoryStream);
-            writer.writeAlignment(record);
-            writer.finish();
-            log.debug("SAM alignment record (below)");
-            log.debug(new String(inMemoryStream.toByteArray()));
+            channelLevel0.position(compressedOffset);
+            try (InputStream inputStream = Channels.newInputStream(channelLevel0)) {
+                Iterable<KeyPointer> indexLevel0 = () -> keyPointerReader.read(inputStream, unCompressedOffset).iterator();
+
+                List<byte[]> remainingQnames = new LinkedList<>(qnames);
+                List<Long> pointers = new ArrayList<>();
+                for (KeyPointer x : indexLevel0) {
+                    boolean keysLeft = false;
+                    for (Iterator<byte[]> it = remainingQnames.iterator(); it.hasNext(); ) {
```

Hm we could do something more clever here I think. We can keep the remainingQnames sorted too. Then we only have to compare the smallest one. If the smallest one doesn't match, the larger ones won't match either, so no point checking them at every step.

amitdev

comment created time in a month
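The merge-style scan suggested in that comment can be sketched as follows: keep the remaining query names sorted, compare each index entry only against the smallest one, and drop queries the scan has already passed. Qnames are modeled as strings here (an assumption; the real code compares `byte[]` keys with unsigned ordering).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class MergeScan {
    /** Collect positions in a sorted index whose entry matches any query, in one pass. */
    static List<Integer> matchPositions(List<String> sortedIndex, List<String> queries) {
        TreeSet<String> remaining = new TreeSet<>(queries);
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < sortedIndex.size() && !remaining.isEmpty(); i++) {
            String entry = sortedIndex.get(i);
            // Drop every query the scan has already passed; only the
            // smallest remaining query ever needs comparing.
            while (!remaining.isEmpty() && entry.compareTo(remaining.first()) > 0) {
                remaining.remove(remaining.first());
            }
            if (!remaining.isEmpty() && entry.equals(remaining.first())) {
                positions.add(i); // keep the query: duplicates may follow
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> index = List.of("a", "b", "b", "d", "f");
        System.out.println(matchPositions(index, List.of("b", "e"))); // [1, 2]
    }
}
```

The outer loop also stops as soon as `remaining` is empty, so the scan never reads past the last useful entry.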


push event VCCRI/atlantool

Robin Stocker

commit sha 4011c4d80d04ead43590b65b76a7e8cde0e3e60b

Create LICENSE

view details

push time in a month
