utf8proc
utf8proc is a small, clean C library that provides Unicode normalization, case-folding, and other operations for data in the UTF-8 encoding. It was initially developed by Jan Behrens and the rest of the Public Software Group, who deserve nearly all of the credit for this package. With the blessing of the Public Software Group, the Julia developers have taken over development of utf8proc, since the original developers have moved to other projects.
(utf8proc is used for basic Unicode support in the Julia language, and the Julia developers became involved because they wanted to add Unicode 7 support and other features.)
(The original utf8proc package also includes Ruby and PostgreSQL plug-ins. We removed those from utf8proc in order to focus exclusively on the C library.)
The utf8proc package is licensed under the free/open-source MIT "expat" license (plus certain Unicode data governed by the similarly permissive Unicode data license); please see the included LICENSE.md file for more detailed information.
Quick Start
Typical users should download a utf8proc release rather than cloning directly from GitHub.
For compilation of the C library, run make. You can also install the library and header file with make install (by default into /usr/local/lib and /usr/local/include, but this can be changed by passing prefix=/some/dir to make). make check runs some tests, and make clean deletes all of the generated files.
Alternatively, you can compile with cmake, e.g. by:
mkdir build
cmake -S . -B build
cmake --build build
Using other compilers
The included Makefile supports GNU/Linux flavors and macOS with gcc-like compilers; Windows users will typically use cmake.
For other Unix-like systems and other compilers, you may need to pass modified settings to make in order to use the correct compilation flags for building shared libraries on your system.
For HP-UX with HP's aCC compiler and GNU Make (installed as gmake), you can compile with:
gmake CC=/opt/aCC/bin/aCC CFLAGS="+O2" PICFLAG="+z" C99FLAG="-Ae" WCFLAGS="+w" LDFLAG_SHARED="-b" SOFLAG="-Wl,+h"
To run gmake install you will need GNU coreutils for the install command, and you may want to pass prefix=/opt libdir=/opt/lib/hpux32 or similar to change the installation location.
General Information
The C library is found in this directory after successful compilation and is named libutf8proc.a (for the static library) and libutf8proc.so (for the dynamic library).
The Unicode version supported is 15.1.0.
For Unicode normalizations, the following options are used:
- Normalization Form C: STABLE, COMPOSE
- Normalization Form D: STABLE, DECOMPOSE
- Normalization Form KC: STABLE, COMPOSE, COMPAT
- Normalization Form KD: STABLE, DECOMPOSE, COMPAT
C Library
The documentation for the C library is found in the utf8proc.h header file.
utf8proc_map is the function you will most likely be using for mapping UTF-8 strings, unless you want to allocate memory yourself.
To Do
See the GitHub issues list.
Contact
Bug reports, feature requests, and other queries can be filed at the utf8proc issues page on GitHub.
See also
An independent Lua translation of this library, lua-mojibake, is also available.
Examples
Convert codepoint to string
// Convert codepoint `a` to utf8 string `str`
utf8proc_int32_t a = 223;
utf8proc_uint8_t str[16] = { 0 };
utf8proc_encode_char(a, str);
printf("%s\n", str);
// ß
Convert string to codepoint
// Decode the first character of string `str` into codepoint `a`
utf8proc_uint8_t str[] = "ß";
utf8proc_int32_t a;
utf8proc_iterate(str, -1, &a);
printf("%d\n", a);
// 223
Casefold
// Convert "ß" (U+00DF) to its casefold variant "ss"
utf8proc_uint8_t str[] = "ß";
utf8proc_uint8_t *fold_str;
utf8proc_map(str, 0, &fold_str, UTF8PROC_NULLTERM | UTF8PROC_CASEFOLD);
printf("%s\n", fold_str);
// ss
free(fold_str);
Normalization Form C/D (NFC/NFD)
// Decompose "\u00e4\u00f6\u00fc" = "äöü" into "a\u0308o\u0308u\u0308" (= "äöü" via combining char U+0308)
// utf8proc_NFD and utf8proc_NFC expect NUL-terminated input, hence the trailing 0x00
utf8proc_uint8_t input[] = {0xc3, 0xa4, 0xc3, 0xb6, 0xc3, 0xbc, 0x00}; // "\u00e4\u00f6\u00fc" = "äöü" in UTF-8
utf8proc_uint8_t *nfd = utf8proc_NFD(input); // = {0x61, 0xcc, 0x88, 0x6f, 0xcc, 0x88, 0x75, 0xcc, 0x88, 0x00}
// Compose "a\u0308o\u0308u\u0308" into "\u00e4\u00f6\u00fc" (= "äöü" via precomposed characters)
utf8proc_uint8_t *nfc = utf8proc_NFC(nfd);
free(nfd);
free(nfc);