MySQL 5.4 has two Unicode character sets:
ucs2, the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode
character set using one to three bytes per character
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set. For general information about the character sets, see Section 9.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx represents the character set
name. For example,
represents the Danish collations, the specific names of which
are xxx_danish_ciucs2_danish_ci and
utf8_danish_ci.
xxx_bin
xxx_czech_ci
xxx_danish_ci
xxx_esperanto_ci
xxx_estonian_ci
(default)
xxx_general_ci
xxx_hungarian_ci
xxx_icelandic_ci
xxx_latvian_ci
xxx_lithuanian_ci
xxx_persian_ci
xxx_polish_ci
xxx_roman_ci
xxx_romanian_ci
xxx_slovak_ci
xxx_slovenian_ci
xxx_spanish2_ci
xxx_spanish_ci
xxx_swedish_ci
xxx_turkish_ci
xxx_unicode_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
Currently, the
xxx_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo. The following discussion uses
xxx_unicode_ciutf8_unicode_ci for concreteness.
For any Unicode character set, operations performed using the
_general_ci collation are faster than those
for the _unicode_ci collation. For example,
comparisons for the utf8_general_ci
collation are faster, but slightly less correct, than
comparisons for utf8_unicode_ci. The reason
for this is that utf8_unicode_ci supports
mappings such as expansions; that is, when one character
compares as equal to combinations of other characters. For
example, in German and some other languages
“ß” is equal to
“ss”.
utf8_unicode_ci also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci (for the effect this has in
comparisons or when doing searches, see
Section 9.1.7.7, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci:
ß = ss
MySQL implements language-specific collations for the
utf8 character set only if the ordering
with utf8_unicode_ci does not work well for
a language. For example, utf8_unicode_ci
works fine for German and French, so there is no need to
create special utf8 collations for these
two languages.
utf8_general_ci also is satisfactory for
both German and French, except that
“ß” is equal to
“s”, and not to
“ss”. If this is acceptable
for your application, then you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it
is more accurate.
utf8_swedish_ci, like other
utf8 language-specific collations, is
derived from utf8_unicode_ci with
additional language rules. For example, in Swedish, the
following relationship holds, which is not something expected
by a German or French speaker:
Ü = Y < Ö
The
and
xxx_spanish_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“xxx_spanish2_ciñ” (n-tilde) is a separate
letter between “n” and
“o”. In addition, for
traditional Spanish, “ch” is a
separate letter between “c”
and “d”, and
“ll” is a separate letter
between “l” and
“m”
In the
collations, xxx_roman_ciI and J
compare as equal, and U and
V compare as equal.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).

User Comments
Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.
On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.
I still didn't find a collation that treats both issues correctly.
Add your own comment.