This directory contains dictionaries compiled from sentence lists so that next-word predictions can be used. Currently, all word dictionaries are based on word lists available at https://wortschatz.uni-leipzig.de/en/download/ under the CC BY 4.0 license.
The emoji dictionary is adapted from gemoji without further processing.
Dictionaries are created using `create_wordlist_from_sentences.py` for `<locale>_wordlist.combined` and `dicttool_aosp.jar` for creating `.dict` files. See the `example_()` functions in the Python script for how to use it. You can simply adjust paths and add your sentence (or word) lists.
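The second step (turning a `.combined` file into a `.dict` file) can be scripted as well. The following is only a minimal sketch: it assumes that `dicttool_aosp.jar` accepts the AOSP dicttool invocation `makedict -s <source> -d <output>`, so check the usage output of your jar before relying on these flags. The file names and the helpers `build_makedict_command` and `compile_dictionary` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_makedict_command(jar: Path, source: Path, output: Path) -> List[str]:
    """Build the command line for compiling a wordlist into a .dict file.

    Assumption: the jar understands "makedict -s <source> -d <output>",
    as the AOSP dicttool does. Verify this against your jar.
    """
    return ["java", "-jar", str(jar), "makedict", "-s", str(source), "-d", str(output)]


def compile_dictionary(jar: Path, source: Path, output: Path) -> None:
    """Run dicttool; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_makedict_command(jar, source, output), check=True)


# Hypothetical file names, adjust to your setup.
cmd = build_makedict_command(
    Path("dicttool_aosp.jar"),
    Path("en_wordlist.combined"),
    Path("main_en.dict"),
)
print(" ".join(cmd))
```

Only the command is built and printed here; call `compile_dictionary` with real paths to actually run the tool (this needs Java on the `PATH`).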
The script is still experimental, rather slow and may produce bad dictionaries in some languages. Some words seem to be wrongly added (e.g. "i" for English), and names are typically missing, though this depends on how exactly you create the Android dictionaries.
A `possibly_offensive` attribute is added for some words, which sometimes seems unnecessary. It currently comes from the "nosuggest" attribute of the hunspell dictionaries used, which is set for offensive words as well as for odd or rare word forms.
Other flags are currently missing, as are shortcuts (e.g. ill -> I'll or écoeuré -> écœuré, as found in AOSP dictionaries).
`wordlist.combined` file info (mostly guessed, as no documentation was found):
- header is necessary
  - format like `dictionary=main:en_us,locale=en_US,description=English (US),date=1414726260,version=54`
  - all of these fields are necessary, though `description` is not used
  - German dictionaries also have `REQUIRES_GERMAN_UMLAUT_PROCESSING=1`
- each word is in a line like `word=re,f=0,flags=abbreviation,originalFreq=99,possibly_offensive=true`
  - `word` is the word (necessary)
  - `f` is frequency, from 0 to 255(?) (necessary)
    - higher value is more likely to get suggested / corrected
    - special value `whitelist`, possibly equal to 15
    - `f=0` will not be suggested if bad words are blocked, and will never be added to user history
      - possible bug: words with `possibly_offensive=true` and `f=0` will be suggested when not blocking offensive words, but other words with `f=0` are still not suggested
  - `originalFreq`: unclear, is this used?
  - `flags`: `medical`, `technical`, `hand-added`, `babytalk`, `abbreviation`, `offensive`, `nonword`, and probably more: are they used for anything?
  - `possibly_offensive=true` stops the word from being suggested when blocking offensive words
  - `not_a_word=true` will not be suggested, use together with `shortcut`
- `shortcut=<s>` (below a `<word>`) will suggest `<s>` when the `<word>` is typed
  - which `f` to use? maybe only 0-14 and `whitelist` allowed
  - what does `f` do here?
- `bigram=<b>` (below a `<word>`) will suggest `<b>` as next word before typing any letters
  - what does `f` do here? Looks like 1, 2, and 3 are used for the usual 3 bigram entries
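To make the notes above more concrete, here is a small Python sketch that reads the header, word, `shortcut` and `bigram` lines and models the suggestion rules as they are guessed above. It is not taken from `create_wordlist_from_sentences.py`. The sample entries, their `f` values and the helpers `parse_fields`, `parse_wordlist` and `is_suggestible` are made up for illustration, and the sketch assumes that values never contain commas. Leading whitespace is ignored, since the exact indentation rules are not covered here.

```python
from typing import Dict, List, Tuple

# Hypothetical sample; words and f values are made up for illustration.
SAMPLE = """\
dictionary=main:en_us,locale=en_US,description=English (US),date=1414726260,version=54
 word=ill,f=0,not_a_word=true
  shortcut=I'll,f=whitelist
 word=the,f=200
  bigram=same,f=3
 word=re,f=0,flags=abbreviation,originalFreq=99,possibly_offensive=true
"""


def parse_fields(line: str) -> Dict[str, str]:
    """Split 'key=value,key=value' into a dict (assumes no commas in values)."""
    fields: Dict[str, str] = {}
    for part in line.strip().split(","):
        key, _, value = part.partition("=")
        fields[key.strip()] = value
    return fields


def parse_wordlist(text: str) -> Tuple[Dict[str, str], List[dict]]:
    """Return the header fields and one entry per word line.

    shortcut and bigram lines are attached to the word line above them.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = parse_fields(lines[0])
    entries: List[dict] = []
    for line in lines[1:]:
        fields = parse_fields(line)
        if "word" in fields:
            entries.append({"fields": fields, "shortcuts": [], "bigrams": []})
        elif "shortcut" in fields and entries:
            entries[-1]["shortcuts"].append(fields)
        elif "bigram" in fields and entries:
            entries[-1]["bigrams"].append(fields)
    return header, entries


def is_suggestible(fields: Dict[str, str], block_offensive: bool) -> bool:
    """Model of the guessed rules above, not of the actual keyboard code."""
    if fields.get("not_a_word") == "true":
        return False
    if fields.get("possibly_offensive") == "true":
        # Includes the "possible bug": f=0 is suggested when not blocking.
        return not block_offensive
    return fields.get("f") != "0"


header, entries = parse_wordlist(SAMPLE)
for entry in entries:
    word = entry["fields"]["word"]
    print(word, is_suggestible(entry["fields"], block_offensive=True))
```

Keeping all values as strings avoids having to decide how `whitelist` maps to a number, which is still unclear.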