datasets-text

Datasets used in text processing.

NOTE: The MIT LICENSE does not cover the files found in the corpora directory. See each corpus for specific licensing and terms of use.

Collection method

Please see the steps.md document for more information on how each corpus was collected and processed.

Contributing

Thank you for considering contributing. However, I am not accepting contributions that add new corpus sets, because I want to keep the size of this repo to a minimum.

Code contributions are welcome.

Top 1000 ngrams

The top 1000 letter and word ngrams can be found in the packaged/ngrams directory. See the following section on how this data was generated.

Generating ngrams

If you want to generate your own ngrams, please install ngrams from https://github.com/andrejacobs/go-analyse

Run the script ngrams.sh:

./scripts/ngrams.sh

At the time of writing, the generated ngrams are about 1.98GB.

To package up the ngrams into zip files, run the script package.sh:

./scripts/package.sh

At the time of writing, the packaged files are about 410MB.

Disclaimer

Minimal processing was done to the source corpora, and the generated ngrams reflect that. For example, the top 3-word ngrams from the Gutenberg sources include entries such as "* * *" and "the project gutenberg".

Also, for the time being, the ngrams program makes no attempt to strip symbols or do any post-processing on the input sources.
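
To illustrate why entries such as "* * *" end up near the top, here is a minimal Python sketch of word ngram counting without symbol stripping. It is not the ngrams program from go-analyse: the whitespace tokenisation, the lowercasing, the sample text, and the helper names word_ngrams and is_alphabetic are all assumptions made for this example. The last step shows one possible post-processing filter you could apply yourself.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word ngrams using plain whitespace tokenisation (no symbol stripping)."""
    # Assumption: tokens are lowercased and split on whitespace only.
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def is_alphabetic(ngram):
    """Return True when every token in the ngram consists only of letters."""
    return all(token.isalpha() for token in ngram.split())


# Hypothetical sample text that mimics Gutenberg-style section separators.
sample = "The Project Gutenberg eBook\n* * *\nChapter one\n* * *\nChapter two"

# Raw counts: the separator line is counted like any other 3-word ngram.
counts = word_ngrams(sample, 3)

# Optional post-processing: keep only ngrams made up purely of letters.
cleaned = Counter({ngram: c for ngram, c in counts.items() if is_alphabetic(ngram)})
```

With this sample, the separator is counted as an ordinary ngram in counts and is dropped from cleaned, while "the project gutenberg" survives the filter. Whether to also drop boilerplate phrases like that one depends on your use case.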
