Steven G. Johnson
bc357b276f
|
10 years ago | |
---|---|---|
bench | 10 years ago | |
.gitignore | 10 years ago | |
.travis.yml | 10 years ago | |
LICENSE.md | 10 years ago | |
Makefile | 10 years ago | |
NEWS.md | 10 years ago | |
README.md | 10 years ago | |
data_generator.rb | 10 years ago | |
lump.txt | 10 years ago | |
mojibake.h | 10 years ago | |
normtest.c | 10 years ago | |
utf8proc.c | 10 years ago | |
utf8proc_data.c | 10 years ago |
README.md
libmojibake
libmojibake is a lightly updated fork of the utf8proc library from Jan Behrens and the rest of the Public Software Group, who deserve nearly all of the credit for this package: a small, clean C library that provides Unicode normalization, case-folding, and other operations for data in the UTF-8 encoding.
The reason for this fork is that utf8proc
is used for basic Unicode
support in the Julia language and the Julia
developers wanted Unicode 7 support and other features, but the Public
Software Group is currently occupied with other projects. We hope
that our fork can be merged back into the mainline utf8proc
package
before too long.
(The original utf8proc
package also includes Ruby and PostgreSQL plug-ins.
We removed those from libmojibake
in order to focus exclusively on the C
library for the time being. We will strive to keep API changes to a minimum,
so libmojibake
should still be usable with the old plug-in code.)
Like utf8proc
, the libmojibake
package is licensed under the
free/open-source MIT "expat"
license (plus certain Unicode
data governed by the similarly permissive Unicode data
license); please see
the included LICENSE.md
file for more detailed information.
Quick Start
For compilation of the C library run make
.
General Information
The C library is found in this directory after successful compilation
and is named libmojibake.a
(for the static library) and
libmojibake.so
(for the dynamic library).
The Unicode version being supported is 5.0.0. Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as version 5.0.0 had not been available at the time of implementation.
For Unicode normalizations, the following options are used:
- Normalization Form C:
STABLE
,COMPOSE
- Normalization Form D:
STABLE
,DECOMPOSE
- Normalization Form KC:
STABLE
,COMPOSE
,COMPAT
- Normalization Form KD:
STABLE
,DECOMPOSE
,COMPAT
C Library
The documentation for the C library is found in the utf8proc.h
header file.
utf8proc_map
is function you will most likely be using for mapping UTF-8
strings, unless you want to allocate memory yourself.
To Do
- detect stable code points and process segments independently in order to save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing
Contact
Bug reports, feature requests, and other queries can be filed at the libmojibake page on Github.