Add Romanian dictionary

Romanian needs special care when handling words that contain the "sh"
and "tz" characters (read them as the "sh" in "shiver" and the "zz" in
"pizza"). There are two sets of characters that look almost the same:
`ş` and `ș`, `ţ` and `ț`. If you look carefully, one has a tail (a
cedilla) attached to the body, and the other has a comma separated from
the body of the character. The correct ones are those with the separate
comma, not the touching tail. If in doubt, switch to a Romanian keyboard
layout and type `;` and `'`; they will give you the correct characters
to use.

The HTML codes for these characters are:
- `&#x218;` and `&#x219;` for `Ș` and `ș`.
- `&#x21A;` and `&#x21B;` for `Ț` and `ț`.
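
Since the two variants are hard to tell apart by eye, a quick way to
check which one a file contains is to print the code point and Unicode
name of each character. This is only a small sketch using Python's
standard library; the helper name `describe` is made up for the
example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Comma-below (correct) and cedilla (look-alike) variants, side by side.
for ch in "\u0219\u015f\u021b\u0163":
    print(ch, describe(ch))
```

The comma-below letters report "WITH COMMA BELOW" in their names, while
the look-alikes report "WITH CEDILLA".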

Reference:
https://en.wikipedia.org/wiki/S-comma
https://en.wikipedia.org/wiki/T-comma

While similar in shape, these are different characters, and the
difference will break autocompletion. I've replaced all of them with the
proper comma-below ones.
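
A replacement like this can be sketched with a translation table. This
is not necessarily how it was done for this commit; the function name
`fix_romanian_letters` is hypothetical and only illustrates the mapping
from the cedilla variants to the comma-below ones.

```python
# Map the cedilla look-alikes to the correct comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # S with cedilla (capital) -> S with comma below
    "\u015f": "\u0219",  # s with cedilla (small)   -> s with comma below
    "\u0162": "\u021a",  # T with cedilla (capital) -> T with comma below
    "\u0163": "\u021b",  # t with cedilla (small)   -> t with comma below
})


def fix_romanian_letters(text: str) -> str:
    """Replace cedilla variants with comma-below variants; leave the rest."""
    return text.translate(CEDILLA_TO_COMMA)
```

Text that already uses the comma-below letters passes through
unchanged, so running it twice is harmless.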

I've also tried creating a new dictionary but ran into issues...
The list of words was downloaded from:
https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/ro/ro_full.txt

This is not a high-quality source, so some cleanup was done to remove
mistakes such as words containing numbers or the `,` and `.`
characters. Words that were joined with `--` were also removed, as
there is no such notation in the language.
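
The cleanup rules above can be sketched roughly as follows. This is an
illustration, not the exact script that was used. It assumes each line
of the word list has the form `word count` separated by whitespace; the
names `is_clean` and `clean_lines` are made up for the example.

```python
import re

# A word is rejected if it contains a digit, a ',' or '.', or a "--".
_BAD = re.compile(r"\d|[,.]|--")


def is_clean(word: str) -> bool:
    """True if the word passes the cleanup rules described above."""
    return bool(word) and _BAD.search(word) is None


def clean_lines(lines):
    """Yield (word, count) pairs for well-formed, clean entries only."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: malformed lines are simply skipped
        word, count = parts
        if count.isdecimal() and is_clean(word):
            yield word, int(count)
```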

The tools from here were used to create the dictionary:
https://github.com/remi0s/aosp-dictionary-tools

They only take the top 150,000 words out of a total of 1,154,496,
effectively skipping words with fewer than 2 occurrences. This is OK, I
guess, although it misses a lot of valid ones. A better data source
would help with this, but it's difficult to find such data.
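
To make the effect of the cutoff concrete, here is a rough sketch of
keeping only the most frequent entries. It is not the code of the tools
linked above, whose internals are not described here; `top_words` is a
hypothetical helper, and the default of 150,000 is taken from the figure
mentioned above.

```python
import heapq


def top_words(entries, limit=150_000):
    """Return the `limit` most frequent (word, count) pairs, highest first."""
    return heapq.nlargest(limit, entries, key=lambda entry: entry[1])
```

Any word that falls below the cutoff is dropped no matter how valid it
is, which is why a long tail of rare but correct words goes missing.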

I guess I can come back in the future to improve this.
Codruț Constantin Gușoi 2020-05-30 16:38:40 +01:00
parent f935623456
commit 8132aeba7d
2 changed files with 0 additions and 0 deletions

Binary file not shown.

Binary file not shown.