Osma Suominen (osma) - @NatLibFi - Helsinki, Finland

osma/annif 14

ANNotation Infrastructure using Finna: an automatic subject indexing tool using Finna as corpus

NatLibFi/Finna-client 2

Python client library for accessing Finna REST API

osma/FinnaBot 1

Twitter bot that publishes pictures from the Finna portal

osma/rdflib 1

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.

osma/Annif-fusion 0

Experiment with different fusion methods when combining results from multiple Annif automated subject indexing algorithms

osma/cgeo 0

c:geo - The powerful Android geocaching app.

osma/debian-server-tools 0

Tools and living docs for Debian-based servers


issue opened stanfordnlp/stanza

PyTorch emits UserWarning for deprecated __floordiv__ operation

Describe the bug

When performing lemmatization of certain Finnish expressions, PyTorch emits a UserWarning about the deprecated __floordiv__ operation. The lemmatization still works, and the UserWarning is only shown once per process/session.

This appears to be quite rare; only certain combinations of words trigger it. But when processing a large file in Finnish, the warning will eventually appear. I've done similar lemmatization on long documents in Swedish and English, but never saw this warning with those languages.

To Reproduce

This code will trigger the warning for me:

import stanza
nlp = stanza.Pipeline(lang='fi', processors='tokenize,mwt,pos,lemma')
doc = nlp("ettei se")

Output:

2022-01-14 13:39:50 INFO: Loading these models for language: fi (Finnish):
=======================
| Processor | Package |
-----------------------
| tokenize  | tdt     |
| mwt       | tdt     |
| pos       | tdt     |
| lemma     | tdt     |
=======================

2022-01-14 13:39:50 INFO: Use device: cpu
2022-01-14 13:39:50 INFO: Loading: tokenize
2022-01-14 13:39:50 INFO: Loading: mwt
2022-01-14 13:39:50 INFO: Loading: pos
2022-01-14 13:39:51 INFO: Loading: lemma
2022-01-14 13:39:51 INFO: Done loading processors!
[REDACTED]/lib/python3.8/site-packages/stanza/models/common/beam.py:86: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  prevK = bestScoresId // numWords

Expected behavior

Expected no UserWarning.

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Python version: 3.8.10 from Ubuntu system package 3.8.10-0ubuntu1~20.04.2
  • Stanza version: 1.3.0 (installed from PyPI in a virtual environment)
  • PyTorch version: 1.10.1 (installed from PyPI in a virtual environment)

Additional context

According to the warning message, the problem seems to be this line: https://github.com/stanfordnlp/stanza/blob/e44d1c88340e33bf9813e6f5a6bd24387eefc4b2/stanza/models/common/beam.py#L86

Here is a PR fixing the same warning in another codebase: https://github.com/NVIDIA/MinkowskiEngine/pull/407
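
For illustration, here is a minimal sketch of the replacement the warning message itself suggests (the tensor values are hypothetical; beam indices are nonnegative, so floor and trunc division agree here):

import torch

bestScoresId = torch.tensor([7, 12, 23])  # hypothetical flat beam indices
numWords = 5

# deprecated in PyTorch 1.10 and emits the UserWarning above:
# prevK = bestScoresId // numWords

# warning-free equivalent, as suggested by the warning message:
prevK = torch.div(bestScoresId, numWords, rounding_mode='floor')
print(prevK)  # tensor([1, 2, 4])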

created time in 4 days

issue comment NatLibFi/Annif

LMDB can overflow

Thanks for the report. I guess this is a "640kB ought to be enough for anyone" type bug. The current, hardcoded maximum size is 1GB. I never expected that anyone would get even close to the limit; we typically train nn_ensemble models with at most tens of thousands of samples.

Make the size configurable using an environment variable.

Why not simply a parameter for the nn_ensemble backend?

As for the other options, I would slightly prefer switching to TFRecordWriter/TFRecordDataset instead of the "double the size" approach with LMDB. As I understand it, this functionality is already included in TensorFlow, so it would allow dropping LMDB as a dependency, at least for now.

While we've had some thoughts about using LMDB more in the future (as in #378), this is not a goal in itself - rather, LMDB can be an elegant solution for storing large amounts of data on disk, but other options are possible too and this can be chosen on a case-by-case basis.
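
To illustrate the TFRecord approach, here is a minimal sketch (the samples variable and file name are hypothetical, not Annif code):

import tensorflow as tf

# write training samples to a TFRecord file; unlike LMDB, no maximum
# map size needs to be declared up front
with tf.io.TFRecordWriter('train-samples.tfrecord') as writer:
    for vector, target in samples:  # hypothetical iterable of float lists
        example = tf.train.Example(features=tf.train.Features(feature={
            'vector': tf.train.Feature(float_list=tf.train.FloatList(value=vector)),
            'target': tf.train.Feature(float_list=tf.train.FloatList(value=target)),
        }))
        writer.write(example.SerializeToString())

# stream the samples back during training without holding them in memory
dataset = tf.data.TFRecordDataset(['train-samples.tfrecord'])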

mo-fu

comment created time in 5 days

pull request comment NatLibFi/Annif

Remove swagger-tester dependency

Thinking aloud here: what this PR is missing is a check that the returned data conforms to the shapes (response data types) specified in the OpenAPI spec file. I think swagger-tester did that, though I'm not 100% sure. There's a greater risk now that the app will not actually honor the contracts in the spec file.

The tests could be extended so that they perform more detailed testing of the returned data, but that still wouldn't guarantee that it follows the spec file, just that the person who wrote the tests thinks that the tests are checking for the right things. It would be better to derive the tests automatically from the spec file, not to craft them manually. But unfortunately we cannot use swagger-tester anymore for this.
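
For example, response bodies could be validated against schemas taken directly from the spec file. A rough sketch (the spec path and response body are assumptions, and the nesting shown follows Swagger 2.0):

import yaml
from jsonschema import validate

with open('annif/swagger/annif.yaml') as specfile:  # assumed spec location
    spec = yaml.safe_load(specfile)

# look up the declared 200 response schema for one operation...
schema = spec['paths']['/projects']['get']['responses']['200']['schema']

# ...and validate a parsed response body against it
response_json = {'projects': []}  # stand-in for an actual API response
validate(instance=response_json, schema=schema)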

juhoinkinen

comment created time in 6 days

pull request comment NatLibFi/Annif

Remove swagger-tester dependency

Great work!

These are indeed already somewhat overlapping with the tests in test_rest.py. IIRC originally the idea was to separate the test files as follows:

  • test_swagger.py does some basic exercise of the methods defined in the OpenAPI spec (as is done automatically by swagger-tester) using actual HTTP requests; as well as checking for CORS headers (which also requires HTTP)
  • test_rest.py does more detailed tests (checking that the methods work as intended, including things like access control) only on a Python code level, without actual HTTP requests.

I think it would be useful to keep this kind of separation in mind even though we are abandoning swagger-tester - the tests in test_swagger.py should be kept in sync with the OpenAPI spec file and operate on the HTTP level, while test_rest.py can go a bit deeper into the application logic using Python only. On a quick reading, this seems to be the case in the tests you've written in this PR.

juhoinkinen

comment created time in 6 days

push event NatLibFi/Annif-corpora

Osma Suominen

commit sha bb9263693395b253ca3f03d18c98aef4a2a872bf

Remove broken symlinks (resulting from earlier removal of bad documents)

view details

push time in 7 days

pull request comment NatLibFi/Annif

Update dependencies v0.56

Ah, right. The dependency on connexion seems to come from swagger-tester, which is a dev dependency for Annif. Its last release, 0.2.12, was made in May 2018 and the project seems pretty dead. Maybe we should try to get rid of this dependency - it's used in test_swagger.py for some basic exercising of the REST API methods.

juhoinkinen

comment created time in 7 days

pull request comment NatLibFi/Annif

Update dependencies v0.56

@juhoinkinen Since you've upgraded (again) Click to 8.0.*, we need to take care of also upgrading Connexion and Flask to avoid hitting the version mismatch problem reported in #533 once more (which was fixed by downgrading Click back to 7.1.2 in PR #544).

There's a little problem with Connexion releases, though - PyPI currently only has 2.9.0, which doesn't support Flask 2. Version 2.10.0 has been released on GitHub, but on the PyPI side it has a different name, connexion2. See here for details. So we should probably switch to connexion2, at least for now.

juhoinkinen

comment created time in 7 days

pull request comment NatLibFi/Annif

Update dependencies v0.56

[...] about one-screen length of traceback and RuntimeError: Failed to load model from data/projects/yso-bonsai-old-fi/omikuji-model. This already hints at something being wrong with the model, so I'm not sure if it is worth processing it within Annif...?

I think that showing tracebacks to the user is not a recipe for good UX... It would be better to show a more informative message such as "Omikuji models trained on Annif versions older than 0.56 cannot be loaded. Please retrain your project."

juhoinkinen

comment created time in 8 days

pull request comment NatLibFi/Annif

Update dependencies v0.56

Omikuji eval time has improved a lot in 0.4! The other numbers look good as well.

What happens when you try to use an old model with omikuji 0.4? Is there some kind of exception? Should we catch that and display a more user friendly error message?

juhoinkinen

comment created time in 8 days

pull request comment NatLibFi/Annif

Update dependencies v0.56

We've tried to support three consecutive versions of Python. Since we currently support 3.7, 3.8 and 3.9 in addition to 3.6, dropping 3.6 wouldn't violate that policy. Python 3.6 has reached end of life; its final release, 3.6.15, came out on 2021-09-04.

Python 3.6 is the default version in Ubuntu 18.04, but it's possible to install newer versions of Python from PPAs.

So I think dropping 3.6 support would be OK. Should that be done in a separate PR or is it easier to just do it in this one? (I guess that would mainly involve reorganizing the CI setup, and updating README.md)

Next steps would then be adding support for 3.10, after which we can consider when to drop 3.7.

juhoinkinen

comment created time in 8 days


Pull request review comment NatLibFi/Annif

Allow selecting installed optional dependencies in Docker build

 RUN apt-get update \
 WORKDIR /Annif
 RUN pip install --upgrade pip --no-cache-dir
-# Install all optional dependencies:
 COPY setup.py README.md LICENSE.txt projects.cfg.dist /Annif/
-RUN pip install .[dev,voikko,pycld3,fasttext,nn,omikuji,yake] --no-cache-dir
+# Install dependencies for optional features.
+ARG optional_dependencies=dev,voikko,pycld3,fasttext,nn,omikuji,yake

Ah, OK. Thanks for the clarification!

juhoinkinen

comment created time in a month


Pull request review comment NatLibFi/Annif

Allow selecting installed optional dependencies in Docker build

 RUN apt-get update \
 WORKDIR /Annif
 RUN pip install --upgrade pip --no-cache-dir
-# Install all optional dependencies:
 COPY setup.py README.md LICENSE.txt projects.cfg.dist /Annif/
-RUN pip install .[dev,voikko,pycld3,fasttext,nn,omikuji,yake] --no-cache-dir
+# Install dependencies for optional features.
+ARG optional_dependencies=dev,voikko,pycld3,fasttext,nn,omikuji,yake

ARG is now declared twice? Isn't once enough?

juhoinkinen

comment created time in a month

pull request comment NatLibFi/Annif

Allow selecting installed optional dependencies in Docker build

I don't think always including fasttext is a very big deal; this PR is still a big improvement.

That said, I think it could be possible to avoid installing fasttext by making the pip install command on line 9 conditional. Something like:

if [[ $optional_dependencies =~ "fasttext" ]]; then pip install --no-cache-dir fasttext==0.9.2; fi

(perhaps the apt-get install build-essential command could be moved inside the if clause as well, to avoid doing useless extra work in case fasttext isn't needed)

The COPY command would then copy an empty /usr/local/lib/python3.8 directory from the builder image.

juhoinkinen

comment created time in a month

pull request comment NatLibFi/Annif

Allow selecting installed optional dependencies in Docker build

What about fasttext? It will be installed regardless of this setting, right?

juhoinkinen

comment created time in a month

issue opened NatLibFi/Annif

Configuration file format compatible with DVC

To make it possible to use Annif productively in a DVC workflow, it would be helpful if DVC tools could read parameters directly from the Annif configuration file. DVC can currently read parameters from YAML 1.2, JSON, TOML and Python files (see documentation for dvc params).

Annif uses INI-style syntax (supported by the configparser module in the Python standard library) in the projects.cfg configuration file. This is similar to TOML, but not identical.

I can think of at least these options for making the Annif configuration file DVC-compatible:

  1. Support YAML configuration files as an alternative to the current format.
  2. Support JSON configuration files as an alternative to the current format.
  3. Support TOML configuration files as an alternative to the current format.
  4. Adjust the current format slightly so that it becomes a valid subset of TOML.

I think we can rule out option 2: JSON is not very nice as a configuration language because of its strict syntax and lack of support for comments. If we want a new configuration format, either YAML (option 1) or TOML (option 3) would be better.

For 3., AFAICT the main difference between the current syntax and TOML is that TOML requires string values to be quoted. So instead of this:

[tfidf-en]
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso-en

the syntax would have to be:

[tfidf-en]
language="en"
backend="tfidf"
analyzer="snowball(english)"
limit=100
vocab="yso-en"

(note that limit can be left as-is, as the value 100 is an integer, not a string)

This syntax doesn't currently work with ConfigParser, because it would include the quotes as part of the value. But it would be simple to strip any surrounding quotes from the values after parsing. The file name projects.cfg could still be a problem for DVC, which would probably expect the extension .toml in order to recognize which syntax to use.
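
Here is a minimal sketch of the quote-stripping idea (the file name is hypothetical, and this is not existing Annif code):

import configparser

def unquote(value):
    # strip surrounding single or double quotes from a config value
    if len(value) >= 2 and value[0] == value[-1] and value[0] in '"\'':
        return value[1:-1]
    return value

parser = configparser.ConfigParser()
parser.read('projects.toml')  # hypothetical TOML-compatible config file
backend = unquote(parser['tfidf-en']['backend'])  # '"tfidf"' -> 'tfidf'
limit = unquote(parser['tfidf-en']['limit'])      # '100' stays as-is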

created time in a month

issue opened NatLibFi/Annif

Eval metrics file output compatible with DVC

To make it possible to use Annif in DVC workflows, Annif should be able to produce a metrics file in a format that DVC understands, i.e. either JSON or YAML 1.2 (see documentation for dvc metrics).

Example command:

annif eval my-project --metrics metrics.json /path/to/my-test-corpus/

and the created metrics.json file would look something like this:

{
  "Precision (doc avg)": 0.1560,
  "Recall (doc avg)": 0.2333,
  ...
  "F1@5": 0.1789,
  ...
  "Documents evaluated": 300
}

It's possible that the selection of metrics should be reduced - the default set is quite large, and this may be cumbersome to deal with in DVC. It would also be helpful to implement an option for selecting which metrics to calculate (see #545).

Also, DVC may not like the current naming of the metrics, which includes capital letters, spaces and parentheses (e.g. F1 score (doc avg)); perhaps it would be better to use names like f1_score_doc_avg.
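
For illustration, a small helper along these lines could do the renaming and write the file (a sketch with stand-in values, not existing Annif code):

import json
import re

def normalize_metric_name(name):
    # 'F1 score (doc avg)' -> 'f1_score_doc_avg'
    return re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')

results = {'F1 score (doc avg)': 0.1789, 'Documents evaluated': 300}  # stand-ins
metrics = {normalize_metric_name(k): v for k, v in results.items()}
with open('metrics.json', 'w') as outfile:
    json.dump(metrics, outfile, indent=2)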

created time in a month

issue opened NatLibFi/Annif

Select metrics for eval command using an option

Currently the annif eval command produces more than 20 different metrics. Some of these require non-trivial calculation and often a smaller set would be enough (e.g. F1@5 and NDCG). There should be an option for choosing the metrics to calculate. We could use the same command line option (-m, --metric) that is already used for the annif hyperopt command to select the metric to target.

Something like:

annif eval my-project -m "F1@5,NDCG" /path/to/my-corpus

As an alternative it could be possible to repeat the option as well:

annif eval my-project -m "F1@5" -m "NDCG" /path/to/my-corpus
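
Since the CLI is built on Click, the repeatable option could look roughly like this (a simplified sketch, not the actual eval command):

import click

@click.command()
@click.argument('project_id')
@click.argument('paths', nargs=-1)
@click.option('-m', '--metric', multiple=True,
              help='Metric to calculate; may be repeated')
def eval_cmd(project_id, paths, metric):
    # Click collects repeated options into a tuple, e.g. ('F1@5', 'NDCG');
    # comma-separated values within a single option are split out here
    metrics = [m for value in metric for m in value.split(',')]
    click.echo(f'Evaluating {project_id} with metrics: {metrics}')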

created time in a month

push event NatLibFi/Annif-corpora

Osma Suominen

commit sha f7c95a6f9912a284bf30018016fec61255f35dc1

Add most recent YKL dated 2021-12-01 (with ykl-skos.ttl as symlink)

view details

push time in a month

pull request comment NatLibFi/Annif

Add XTransformer backend

I'm sorry, I just caused a conflict in this PR by merging #544, which optimizes the startup time of Annif and contains a complete rewrite of annif/backend/__init__.py. You will need to adjust your code accordingly. The pattern should be quite obvious - just remember to:

  • keep alphabetical order (xtransformer should be the penultimate backend, just before yake)
  • look at the code for the other optional backends (fasttext, omikuji, nn_ensemble, yake) and use the same approach
  • add a unit test for the missing package into tests/test_backend.py
mo-fu

comment created time in a month

delete branch NatLibFi/Annif

delete branch: issue514-optimize-lazy-imports-take2

delete time in a month

push event NatLibFi/Annif

Osma Suominen

commit sha 8e21ddda246112d9f4bc5fc2a2f62285e657871e

Lazy backend imports: only import backends when they are actually needed

view details

Osma Suominen

commit sha 3619ee145b308e462a0c9165c848f74174adcdf7

Lazy import of NLTK

view details

Osma Suominen

commit sha da6a93f74e62e4a73db7a89af666981e1681959e

Lazy import of NLTK

view details

Osma Suominen

commit sha 1cb17ce3760d9f8a119b301e96acd0a08f4a0da1

Lazy import of annif.eval (which will import sklearn)

view details

Osma Suominen

commit sha 40c1884d38bbab497d326daf479a00a756e09e13

Add tests for special cases when backends cannot be used due to missing packages

view details

Osma Suominen

commit sha 5c6af918b116e2ea488323a43afdce5620a49ced

Merge pull request #544 from NatLibFi/issue514-optimize-lazy-imports-take2

Optimize startup time using local & lazy imports (take 2)

view details

push time in a month

PR merged NatLibFi/Annif

Optimize startup time using local & lazy imports (take 2) [enhancement]

Simplified version of PR #543. Fixes #514.

The goal of this PR is to reduce CLI startup time by avoiding useless work, especially imports that are not necessary for the requested operation.

It makes the following changes to the import statements within the Annif codebase:

  • complete rewrite of annif/backend/__init__.py; the end result is that backends (and the libraries they require, e.g. fasttext, omikuji and tensorflow) are only imported when they are actually used (see the sketch below this list)
  • avoid importing NLTK and sklearn unless actually required, by moving import statements inside functions and methods
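
The lazy-import pattern looks roughly like this (a simplified sketch with hypothetical names; the actual code in annif/backend/__init__.py differs):

def _fasttext_backend():
    try:
        from . import fasttext
        return fasttext.FastTextBackend
    except ImportError:
        raise ValueError("fastText is not available")

_backend_fns = {
    'fasttext': _fasttext_backend,
    # ...one small loader function per backend...
}

def get_backend(backend_id):
    # the heavy library is only imported here, on first actual use
    try:
        return _backend_fns[backend_id]()
    except KeyError:
        raise ValueError(f"No such backend type {backend_id}")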

I tried to craft the changes to have minimal impact on the code, so I chose to make imports local only in cases where there were very few uses within the same module.

Startup time for simple commands such as annif --help and annif --version has been reduced by two thirds.

Before:

$ time annif --version
0.56.0.dev0

real	0m4,052s
user	0m4,001s
sys	0m0,568s

After:

$ time annif --version
0.56.0.dev0

real	0m1,385s
user	0m1,470s
sys	0m0,183s

As explained in #514, I also used tuna to visualize where the remaining import time is spent after this PR:

[tuna visualization of the remaining import time]

The main culprits are now connexion (with most of the time spent initializing openapi_spec_validator!) and flask. Those are core libraries and I don't think we can avoid importing them even for the simplest CLI commands.

TODO:

  • [x] add tests for the ImportError/ValueError clauses in annif/backend/__init__.py
+126 -53

2 comments

5 changed files

osma

pr closed time in a month

issue closed NatLibFi/Annif

Optimize startup time with lazy imports

Annif takes several seconds to start even when it's doing nothing but printing the version number or help text:

$ time annif --version
0.54.0.dev0

real	0m4,398s
user	0m4,322s
sys	0m0,470s

I investigated this a little bit using the -X importtime feature in Python 3.7+ and the tuna tool for visualizing profiling information. It seems that the time is mostly spent importing large libraries such as tensorflow, scikit-learn, optuna, connexion and nltk:

[tuna visualization of import times]

These libraries are all unnecessary for simple operations such as annif --help and --version, so it would be better to avoid importing them altogether. There are some tutorials on lazy importing (e.g. this one), and the importlib library contains (since Python 3.5) a LazyLoader utility class that could be used here.

I experimented a bit with this lazy_import function but couldn't get it to work for nltk submodules:

# Adapted from: https://stackoverflow.com/questions/42703908/
import importlib.util
import sys

def lazy_import(fullname):
    """lazily import a module the first time it is used"""
    try:
        return sys.modules[fullname]
    except KeyError:
        spec = importlib.util.find_spec(fullname)
        loader = importlib.util.LazyLoader(spec.loader)
        # use the lazy loader when creating the module and insert it into
        # sys.modules so that repeated imports reuse the same module object;
        # the module body only executes on first attribute access
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[fullname] = module
        loader.exec_module(module)
        return module

This needs more experimentation but for now I'm just opening the issue...

closed time in a month

osma

push event NatLibFi/Annif

Osma Suominen

commit sha 40c1884d38bbab497d326daf479a00a756e09e13

Add tests for special cases when backends cannot be used due to missing packages

view details

push time in a month

PR opened NatLibFi/Annif

Optimize startup time using lazy imports (take 2) [enhancement]

Simplified version of PR #543. Fixes #514.

The goal of this PR is to reduce CLI startup time by avoiding useless work, especially imports that are not necessary for the requested operation.

It makes the following changes to the import statements within the Annif codebase:

  • complete rewrite of annif/backend/__init__.py; the end result is that backends (and the libraries they require, e.g. fasttext, omikuji and tensorflow) are only imported when they are actually used
  • avoid importing NLTK and sklearn unless actually required, by moving import statements inside functions and methods

I tried to craft the changes to have minimal impact on the code, so I chose to make imports local only in cases where there were very few uses within the same module.

Startup time for simple commands such as annif --help and annif --version has been reduced by more than two thirds.

Before:

$ time annif --version
0.56.0.dev0

real	0m4,052s
user	0m4,001s
sys	0m0,568s

After:

$ time annif --version
0.56.0.dev0

real	0m1,385s
user	0m1,470s
sys	0m0,183s

As explained in #514, I also used tuna to visualize where the remaining import time is spent after this PR:

[tuna visualization of the remaining import time]

The main culprits are now connexion (with most of the time spent initializing openapi_spec_validator!) and flask. Those are core libraries and I don't think we can avoid importing them even for the simplest CLI commands.

TODO:

  • [ ] add tests for the ImportError/ValueError clauses in annif/backend/__init__.py
+93 -53

0 comments

4 changed files

pr created time in a month
