(This entry is part of a series. The audience: SQL Server developers. The topic: SQLite on mobile devices.)
First, go read this blog entry by Miguel de Icaza. Right now. I'll wait.
Welcome back. Now let me apologize. I don't want to be a pedantic
jerk who quibbles about minor details. But the topic here is Unicode, so there
really is no other way.
All Unicode, all the time
The relevant difference with SQLite is easy to describe:
In the world of Windows and SQL Server, you have all kinds of possible code pages.
In SQLite, everything is Unicode.
But if you don't have much context on these issues, I haven't really told you much. Let's go further.
But let's not go too far
I don't want to rewrite articles that have already been written quite well. So you should probably also go read this blog entry by Joel Spolsky. I'll be here when you get back.
OK, now let's get started
SQL Server has two basic ways you store text:
You can use the char/varchar types, which can be used with one of
several collations, each of which implies a specific code page, which implies a
specific character encoding.
Or you can use the nchar/nvarchar types (note the extra 'n'), which
SQLite has no such distinction. All text in SQLite is Unicode.
What the heck is Unicode again?
It's a character set: a collection of characters, each with a number that can be
used to refer to it.
More specifically, it's the only character set which is [trying to be] complete.
If you choose any character set or encoding which is not Unicode, there will be
characters you cannot use.
And what's an encoding?
Saying that SQLite uses Unicode doesn't tell you how the text is actually represented.
Unicode is not an encoding. It
is more abstract than that. There are lots of different ways of representing
Unicode as bytes.
Microsoft's sad history with Unicode
In the Windows and SQL Server world, there is a long history of encoding
Unicode in 16-bit-ish ways. Originally, this was UCS-2, a simple encoding
which represents each character as a 16-bit number. But then the Unicode
consortium realized that 16 bits are not enough, so they expanded the space to 32
bits. This left Microsoft in an awkward spot. UCS-2 is a fundamentally
defective encoding of Unicode, since there are many characters in Unicode that
simply cannot be represented.
If the goal of Unicode is to be complete, it is reasonable to say that, well actually, UCS-2
is not Unicode.
The conceptual replacement for UCS-2 is to use 32 bits for every character.
This encoding is called UCS-4 or UTF-32. But now the wasted space for storing
a simple English string is getting out of hand. Switching the complete works
of Arthur Conan Doyle from ASCII (which is also an encoding) to UCS-4 would take
four times as much space.
Gradually, the world seems to be adopting UTF-8 as the most
popular Unicode encoding. This is a variable width encoding. Sometimes a
single character is represented with just one byte. Sometimes it needs more.
That's very unfortunate, but the only fixed width alternative is UCS-4,
which is also very unfortunate. Choose which problem you prefer,
but keep in mind that almost everybody has chosen to accept the problems of
But Microsoft has so much history with UCS-2 that transitioning everything
to UTF-8 would be really hard. So they have been moving from UCS-2 to UTF-16,
which is basically a variable width encoding built around a 16-bit unit instead
of an 8-bit unit. UTF-16 is approximately the worst correct way of representing Unicode, unless you
have invested billions of dollars in the fundamentally broken UCS-2, in which
case UTF-16 is a pretty awesome way out of the mess you ended up in.
Just remember that if you're going out tonight to a club for pedantic nerds and you want to impress someone,
you've got to keep the terminology straight:
Unicode is an abstraction, not an encoding, not a code page, not a
data format, and not a font.
Saying your text is Unicode says nothing about how it is represented.
It might be UTF-8. It might UTF-16. It might be code point numbers
handwritten on Post-It notes stuck on the wall. All of these are valid
representations of Unicode.
If you ever say anything to suggest that you think Unicode is
16 bits per character, you will be identified as clueless.
If you say that your text is stored in Unicode, you are not entirely incorrect, but people will wonder
about whether you really know the difference between Unicode and the
encodings of same.
SQLite always uses Unicode to represent text.
(Hopefully you are now screaming at me saying, "Yeah, but which encoding?!?")
The best answer to this question is: SQLite uses UTF-8. Forget about everything else.
A more correct answer to this question is: SQLite uses UTF-8 but also supports UTF-16.
Either way, there is no distinction between char and nchar.
There is no
way to save storage space in a column by realizing that you only use lower case english characters
so it's safe to use char instead of nchar.
There are no code pages.
is no way to move your Shift JIS data into SQLite without converting it to Unicode (or storing it as blobs, I suppose).
Microsoft has done a lot of things right, but its history with Unicode is
very unfortunate. And it's not entirely their fault. They
adopted Unicode early and it changed underneath them.
With respect to its emphasis on UTF-8, SQLite is far more typical of most
non-Microsoft software today.