Using NLTK

The Natural Language Toolkit (http://www.nltk.org) Python library comes pre-installed in the default Domino compute environment. If you are using an environment that does not have it installed, you can install it with pip (pip install nltk). To access the additional corpora of training data, you have a couple of options:

If you know you only need a subset, you can download each corpus one at a time via the nltk.download("<corpus>") command, e.g.

import nltk
nltk.download("brown")
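If a script runs repeatedly, it can be convenient to skip the download when the corpus is already present. The following is a minimal sketch of that idea, not an official Domino or NLTK recipe. The helper name ensure_corpus and its finder/downloader parameters are hypothetical; it relies on nltk.data.find raising LookupError when a resource is missing, and it assumes the resource lives under the "corpora/" prefix.

```python
def ensure_corpus(name, finder=None, downloader=None):
    """Download corpus `name` only when NLTK cannot already locate it.

    Returns True if a download was triggered, False if the data was found.
    `finder` and `downloader` are hypothetical hooks; by default they are
    nltk.data.find and nltk.download.
    """
    if finder is None or downloader is None:
        # Imported lazily so the helper can also be exercised with stand-ins.
        import nltk
        finder = finder or nltk.data.find
        downloader = downloader or nltk.download
    try:
        # Assumption: the resource is a corpus stored under "corpora/".
        # Other resource types (e.g. tokenizers) use a different prefix.
        finder(f"corpora/{name}")
        return False
    except LookupError:
        # NLTK could not find the corpus on its search path, so fetch it.
        downloader(name)
        return True
```

Calling ensure_corpus("brown") at the top of a script would then download the Brown corpus on the first run and do nothing on later runs on the same machine, provided the data is still on NLTK's search path.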

You can also do this from bash as part of a custom environment definition, for instance if you want to use the corpus in multiple scripts or cache the download:

python -m nltk.downloader brown

In most cases, only a subset of the corpora is needed, and specifying only this subset as above can be the quickest way to get it into your project. However, if you do need the entire set, you may find that the download takes quite some time to complete, given its size (currently ~10GB). A quicker approach is to import this project: https://app.dominodatalab.com/u/domino/nltk-data (users of our trial system: please import this project instead: https://trial.dominodatalab.com/u/domino/nltk-data). Doing so copies all nltk training data to the machine on which your code runs (if it hasn't already been cached there), and also imports an environment variable named NLTK_DATA that tells nltk where to find these corpora. On a machine where the data is not already cached, this copy operation is much quicker than downloading from the original nltk source.
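To confirm that the imported NLTK_DATA variable is visible to your code, a small check like the one below may help. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes NLTK_DATA may hold one or more directories separated by the platform's path separator (os.pathsep), which is how nltk is understood to read the variable.

```python
import os


def nltk_data_dirs(environ=None):
    """Return the directories listed in NLTK_DATA, in order.

    `environ` is a hypothetical hook for passing a mapping other than
    os.environ; an unset or empty variable yields an empty list.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    # Assumption: multiple directories are separated by os.pathsep.
    return [d for d in raw.split(os.pathsep) if d]


def describe_nltk_data(environ=None):
    """Map each NLTK_DATA directory to whether it exists on this machine."""
    return {d: os.path.isdir(d) for d in nltk_data_dirs(environ)}
```

Printing describe_nltk_data() inside a run shows each configured directory and whether it is actually present. An empty result suggests the variable was not imported into the project.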

We may update this project as the original nltk data is updated, so you may wish to fork it in order to preserve a static revision. You can always refresh your fork with updates to the original at a later date.
