This document describes how to use STLport to read and write UTF-8 files.
UTF-8 is a way of encoding wide characters as byte sequences. In the C++
Standard library, encoding is handled by the codecvt locale facet, which is
part of the ctype category. However, UTF-8 only describes how encoding must
be performed; it cannot be used to classify characters, so it does not
provide enough information to generate all the ctype category facets of a
locale instance.

In C++ this means that the following code throws an exception to signal
that locale creation failed:

#include <locale>
// Will throw a std::runtime_error exception.
std::locale loc(".utf8");

For the same reason, building a locale whose ctype facets are based on
UTF-8 also fails:

// Will throw a std::runtime_error exception:
std::locale loc(std::locale::classic(), ".utf8", std::locale::ctype);

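The two failing constructions can be probed without terminating the program
by catching the exception. The following is only a minimal sketch; the helper
name can_build_locale is hypothetical and not part of STLport:

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: reports whether both a full locale and a
// ctype-only locale can be built from `name`.
bool can_build_locale(const char* name)
{
    try {
        // Both constructors signal an unusable name with std::runtime_error.
        std::locale full(name);
        std::locale ctype_only(std::locale::classic(), name, std::locale::ctype);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

According to the description above, can_build_locale(".utf8") is expected to
return false, while a name such as "C" can always be built.
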
The only way to get a locale instance that handles UTF-8 encoding is to
request explicitly that only the codecvt facet be based on UTF-8:

// Will succeed if the necessary platform support is available.
std::locale loc(std::locale::classic(), new std::codecvt_byname<wchar_t, char, std::mbstate_t>(".utf8"));

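Because the construction only succeeds with platform support, it may be
useful to guard it. The sketch below assumes that a missing facet is reported
with std::runtime_error, as for the locale constructors; the helper name
make_codecvt_locale and the fallback to the classic locale are illustrative
choices, not STLport behavior:

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>

typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> codecvt_byname_t;

// Hypothetical helper: returns the classic locale with its wchar_t/char
// codecvt facet replaced by the one named `name`, or the unmodified
// classic locale if the platform cannot provide that facet.
std::locale make_codecvt_locale(const char* name)
{
    try {
        // The locale takes ownership of the facet allocated with new.
        return std::locale(std::locale::classic(), new codecvt_byname_t(name));
    } catch (const std::runtime_error&) {
        // Assumption: an unsupported name is signalled with runtime_error.
        return std::locale::classic();
    }
}
```

A caller would write std::locale loc = make_codecvt_locale(".utf8"); and use
the result like any other locale instance.
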
Once you have obtained a locale instance, you can imbue a file stream with it
to read/write UTF-8 files:

#include <fstream>
// A wide stream is required: narrow streams use codecvt<char, char, mbstate_t>.
std::wfstream fstr("file.utf8");
fstr.imbue(loc);

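A write-then-read round trip through such a stream could look like the sketch
below. The helper name round_trip and its parameters are hypothetical; the
locale is passed in by the caller, and wide streams are used because the
wchar_t/char codecvt facet is only consulted by wide file streams:

```cpp
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` through a wide stream imbued
// with `loc`, reads the first line back, and reports whether it matches.
bool round_trip(const std::locale& loc, const char* path, const std::wstring& text)
{
    {
        std::wofstream out;
        out.imbue(loc);          // imbue before opening the file
        out.open(path);
        if (!out) return false;
        out << text << L'\n';
    }                            // leaving the scope closes and flushes the file
    std::wifstream in;
    in.imbue(loc);
    in.open(path);
    std::wstring line;
    return std::getline(in, line) && line == text;
}
```

Imbuing before the file is opened ensures that no character is converted
with the stream's default locale.
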
You can also access the facet directly to perform UTF-8 encoding/decoding operations:

typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_t;
const codecvt_t& encoding = std::use_facet<codecvt_t>(loc);

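As an illustration of direct facet use, the sketch below encodes a wide
string into bytes with the standard codecvt::out member. The function name
encode and the choice to return an empty string on failure are assumptions
made for this example:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_t;

// Hypothetical helper: converts `text` to the external encoding of `loc`.
// Returns an empty string if the conversion does not complete.
std::string encode(const std::locale& loc, const std::wstring& text)
{
    if (text.empty()) return std::string();
    const codecvt_t& cvt = std::use_facet<codecvt_t>(loc);
    std::mbstate_t state = std::mbstate_t();   // initial conversion state
    // max_length() bounds the number of bytes produced per wide character.
    const std::size_t max_bytes = static_cast<std::size_t>(cvt.max_length());
    std::vector<char> buf(text.size() * max_bytes);
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result res = cvt.out(
        state,
        text.data(), text.data() + text.size(), from_next,
        &buf[0], &buf[0] + buf.size(), to_next);
    if (res != std::codecvt_base::ok) return std::string();
    return std::string(&buf[0], to_next);
}
```

Decoding works the same way with codecvt::in, with the roles of the char and
wchar_t buffers swapped.
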
Notes:

1. The dot ('.') in front of utf8 is mandatory. This follows the POSIX
   convention for locale names, which have the following format:
   language[_country[.encoding]]

   Ex: 'fr_FR'
       'french'
       'ru_RU.koi8r'

2. For the moment, UTF-8 encoding is only supported under Windows. The less
   common UTF-7 encoding is also supported.
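
To make the name format from note 1 concrete, the following sketch splits a
locale name into its parts. The LocaleName struct and parse_locale_name
function are hypothetical names; the parsing is deliberately lenient so that
a name consisting only of an encoding, such as ".utf8", is accepted, and it
does not handle any other extension of the format:

```cpp
#include <string>

// Hypothetical holder for the three parts of a POSIX-style locale name.
struct LocaleName {
    std::string language;
    std::string country;
    std::string encoding;
};

// Splits "language[_country[.encoding]]"; missing parts are left empty.
LocaleName parse_locale_name(const std::string& name)
{
    LocaleName parts;
    std::string rest = name;
    // Everything after the first dot is the encoding.
    const std::string::size_type dot = rest.find('.');
    if (dot != std::string::npos) {
        parts.encoding = rest.substr(dot + 1);
        rest.erase(dot);
    }
    // In what remains, everything after the first underscore is the country.
    const std::string::size_type underscore = rest.find('_');
    if (underscore != std::string::npos) {
        parts.country = rest.substr(underscore + 1);
        rest.erase(underscore);
    }
    parts.language = rest;
    return parts;
}
```

For 'ru_RU.koi8r' this yields the language "ru", the country "RU" and the
encoding "koi8r"; for ".utf8" only the encoding is set.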