utf8 collation difference between unicode and danish
Well hello there.
I’m changing the charset of a database from
utf8. I’ve allways used
utf8_danish_ci because it’s the closest to the Norwegian character style – I think.
But what about
Some time ago; It was preferred to use
_general_ci for better/faster performance, and
_unicode_ci for more accuracy because of the sorting algorithm is more complex in the latter. But since the speed/performace is no longer an issue – or not so much an issue i most cases anymore –
_unicode_ci is ok to use in most situations?
But how does
_unicode_ci differ from
Is it the last three letters
å in the nordic alphabet that is taken into account?
Most comparisson (one vs the other) I can find is only between
Anyone know of any examples for when to use
_unicode_ci or when to use
_danish_ci would be highly appreciated…
In short, if your app is multilingual and storing multiple languages within the same tables, you’re mostly screwed and should worry about sorting/collating outside of the database — utf8_general_ci is then as good as any other.
If it only supports a single language, you’ll do fine by setting the correct collation at the db level — in your case, utf8_danish_ci indeed, since it’s the same as norwegian if wikipedia is anything to go by.
In case you’d like to read more on collation, the ICU docs has colorful examples of how thorny this kind of stuff gets. Quoting extensively:
[H]ere are some of the ways languages vary in ordering strings:
The letters A-Z can be sorted in a different order than in English.
For example, in Lithuanian, “y” is sorted between “i” and “k”.
Combinations of letters can be treated as if they were one letter. For
example, in traditional Spanish “ch” is treated as a single letter,
and sorted between “c” and “d”.
Accented letters can be treated as minor variants of the unaccented
letter. For example, “é” can be treated equivalent to “e”.
Accented letters can be treated as distinct letters. For example, “Å”
in Danish is treated as a separate letter that sorts just after “Z”.
Unaccented letters that are considered distinct in one language can be
indistinct in another. For example, the letters “v” and “w” are two
different letters according to English. However, “v” and “w” are
considered variant forms of the same letter in Swedish.
A letter can be treated as if it were two letters. For example, in
traditional German “ä” is compared as if it were “ae”.
Thai requires that the order of certain letters be reversed.
French requires that letters sorted with accents at the end of the
string be sorted ahead of accents in the beginning of the string. For
example, the word “côte” sorts before “coté” because the acute accent
on the final “e” is more significant than the circumflex on the “o”.
Sometimes lowercase letters sort before uppercase letters. The reverse
is required in other situations. For example, lowercase letters are
usually sorted before uppercase letters in English. Latvian letters
are the exact opposite.
Even in the same language, different applications might require
different sorting orders. For example, in German dictionaries, “öf”
would come before “of”. In phone books the situation is the exact
Sorting orders can change over time due to government regulations or
new characters/scripts in Unicode.
Please bear in mind that Collation != Encoding.
Encoding is the mapping between integers (which all that a database can store at the end of the day) and human readable graphic representations of characters.
Collation is the ordering rule used to sort characters according to the conventional alphabetical order of a given language. Notice that this ordering does not reflect the actual order of the internal, numeric representation.
Your question reduces to the following: what kind of alphabetical order should you use in your application? This cannot be answered.