Some characters have their width defined as "Ambiguous" in UAX#11.
These are typically rendered as single-width by modern monospace fonts,
and utf8proc correctly returns charwidth==1 for these.
However, some applications might need to support older CJK fonts, in which
characters that were two-byte in legacy encodings are rendered as
double-width. An example of this is the 'ambiwidth' option of vim
and neovim, which supports rendering in terminals that use such width
rules.
Add an 'ambiguous_width' property to utf8proc_property_t for such characters.
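As a rough illustration of how a caller might combine the default width with the new flag, here is a minimal sketch. The `width_info` struct and the `display_width` helper are hypothetical stand-ins invented for this example; they are not part of the utf8proc API, and the real `utf8proc_property_t` has many more members.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two relevant fields of
 * utf8proc_property_t (assumption: names chosen for illustration). */
typedef struct {
    int  charwidth;        /* default width; 1 for ambiguous characters */
    bool ambiguous_width;  /* true if UAX#11 classifies the width as Ambiguous */
} width_info;

/* Hypothetical helper: choose the number of terminal cells for a character.
 * When cjk_ambiwidth is true (similar in spirit to vim's ambiwidth=double),
 * ambiguous characters are widened to two cells; otherwise the default
 * width is returned unchanged. */
static int display_width(width_info info, bool cjk_ambiwidth)
{
    if (cjk_ambiwidth && info.ambiguous_width)
        return 2;
    return info.charwidth;
}
```

The default path is left untouched, so applications that never enable the CJK mode would see the same widths as before.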
* Fix two minor bugs from the Ruby code
First, `category` rather than `code` was used in constructing the
`control_boundary` property for the characters U+200C and
U+200D. This appears to be incorrect and is fixed here. The fix could be
observable to any C code which inspects the `control_boundary`
property.
Second, when reading composition exclusions, Ruby's String hex method
produces zero rather than nil if no number is found. For example
$ ruby -e 'puts "# blah".hex'
0
This led to the character `'\0'` being included in the `exclusions`
and `excl_versions` sets which is incorrect. However this seems
asymptomatic because `'\0'` is never part of a composition. (In terms of
the C code, the use of `comp_exclusion` is guarded by the `comb_index`
property which is `UINT16_MAX` for `'\0'`.)
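The same "zero instead of no result" pitfall exists in C, where `strtol` returns 0 when no digits are found. The sketch below is only an analogue of the Ruby issue, not code from the data generator; `parse_hex_prefix` is a hypothetical helper that uses the end pointer to tell "no number" apart from a genuine zero.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a leading hexadecimal number from `line`.
 * Returns true and stores the value in *out only if at least one hex
 * digit was consumed; a comment line such as "# blah" yields false
 * rather than a spurious 0. */
static bool parse_hex_prefix(const char *line, long *out)
{
    char *end = NULL;
    long value = strtol(line, &end, 16);
    if (end == line)      /* no characters consumed: not a number */
        return false;
    *out = value;
    return true;
}
```

A caller that skips lines for which this returns false would never add code point 0 to an exclusion set by accident.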
* Cleanup: Remove sequence ordering hack
This hack changed the ordering of sequences encoded in the sequences
table and was added so we could easily prove equivalence to the Ruby
data generator code.
However, it's no longer needed and removing it shouldn't result in any
functional change.
* Port ruby data_generator.rb to Julia
This reduces the number of dependencies needed when regenerating the C
code. The new code also separates C code generation from Unicode data
analysis somewhat more cleanly, which should make it easier to
generate a Julia version of the data files in the future.
The output is identical to the original Ruby script, for now. Some bugs
which were found in the process are noted as FIXMEs in the Julia source
and can be fixed next.
* Replace some explicit loops with a utility function
* fixup! Port ruby data_generator.rb to Julia
* Update Makefile
* Update data/Makefile
* Update data/Makefile
* Update data/Makefile
* Update data/Makefile
* Update data/data_generator.jl
---------
Co-authored-by: Steven G. Johnson <stevenj@mit.edu>
This will silence the following warning:
CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
* JuliaStrings#169 turn on sign-conversion warnings
Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com>
* JuliaStrings#169 fix sign-conversion warnings for utf8proc.c
fix sign-conversion warnings for utf8proc_iterate
uc requires at most 21 bits to identify a unicode codepoint, so there is no need for it to be unsigned
multiple locations use, modify, or store uc with a signed value
the only exception is line 137 where uc is compared with an unsigned value
fix sign-conversion warnings for utf8proc_tolower, utf8proc_toupper, utf8proc_totitle
all three methods have sign conversion warnings when calling seqindex_decode_index
seqindex_decode_index uses the passed value as an index to an array utf8proc_sequences
as utf8proc_sequences is hard-coded and smaller than 2^31 - 1 we can safely cast to unsigned
fix sign-conversion warnings for utf8proc_decompose_char
lines with this warning use the defined function utf8proc_decompose_lump
in the function, a hardcoded unsigned value (1<<12) is complemented then cast as a signed value
as the intent is to remove the 12th bit flag from options, a signed value, an explicit cast is safe
fix sign-conversion warnings for utf8proc_map_custom
result is declared as signed, but is only expected to contain values between 0 and 4
sizeof returns an unsigned value. result must be cast to unsigned
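The flag-clearing and sizeof cases described above can be sketched roughly as follows. This is not the actual utf8proc code: `LUMP_FLAG`, `clear_lump`, and `fits_in_buffer` are hypothetical names, and the sketch assumes `options` is non-negative.

```c
#include <stddef.h>

/* Hypothetical flag value, mirroring the (1<<12) mentioned above. */
#define LUMP_FLAG (1 << 12)

/* Clear the flag from a signed options word.
 * The complement and the mask are computed on unsigned values and the
 * result is cast back explicitly, so no implicit sign conversion occurs.
 * Assumption: options is non-negative. */
static int clear_lump(int options)
{
    return (int)((unsigned int)options & ~(unsigned int)LUMP_FLAG);
}

/* Compare a small signed result against an unsigned size.
 * The cast to size_t is only reached when result >= 0, which is what
 * makes it safe. Returns 1 if the result fits, 0 otherwise. */
static int fits_in_buffer(int result, size_t buffer_size)
{
    return result >= 0 && (size_t)result <= buffer_size;
}
```

Making each cast explicit documents the range assumption at the point of use, which is the main benefit of building with sign-conversion warnings enabled.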
Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com>
* JuliaStrings#169 fix sign-conversion warnings for test/*
fix sign-conversion warnings for test/tests.c encode
change type for d to match return value of utf8proc_encode_char
fix sign-conversion warnings for test/graphemetest.c checkline
si, i, and j are unsigned size types, utf8proc_map and utf8proc_iterate accept and return signed size types
utf8proc_map treats negative strlen values as 0. the strlen used by the test must be similarly limited
utf8proc_iterate treats negative strlen values as 4 which will be less than the unsigned size
fix unused-but-set-variable warning by checking the glen value
fix sign-conversion warnings for test/case.c main
the if block ensures that tested codepoint fits in wint_t, but needs to include u and l as well
c, u, and l can be safely cast to wint_t
fix sign-conversion warnings for test/iterate.c
all values used for len are below 8, so an explicit cast is safe
updated types for more portable test code
fix sign-conversion warnings for test/printproperty.c main
change type of c to signed to resolve all sign-conversion warnings.
replace sscanf(... &c) with sscanf(... &x) followed by explicit sign conversion
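The sscanf pattern described above might look roughly like the sketch below. It is not the code from test/printproperty.c; `parse_codepoint` is a hypothetical helper, and the range check against 0x10FFFF is an added assumption that keeps the cast to a signed type well defined.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: parse a hexadecimal code point from `text`.
 * %x requires an unsigned int *, so the value is read into `x` first
 * and converted explicitly after a range check.
 * Returns -1 if nothing could be parsed or the value is out of range. */
static int32_t parse_codepoint(const char *text)
{
    unsigned int x;
    if (sscanf(text, "%x", &x) != 1)
        return -1;
    if (x > 0x10FFFF)          /* beyond the Unicode code space */
        return -1;
    return (int32_t)x;
}
```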
Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com>
* ensure ruby is in UTF-8 mode
* Revert "ensure ruby is in UTF-8 mode"
This reverts commit 587b7b6b7215f91b1ae52aefc82d359f2f378a61.
* ensure Ruby reads files in UTF-8 encoding
* Fix extended emoji + zwj combo
* Patch initial repeated regional flags and extended+zwj emoji
* Merge conditions for setting breaks bt region
* updated fix
* perform tests for both utf8proc_map and manual calls to utf8proc_grapheme_break_stateful
* consolidate tests
Co-authored-by: Thomas Marks <marksta@umich.edu>
The cmake file expects the parent folder to be named "utf8proc",
otherwise the target_include_directories won't work, as it references
an unknown path.
This deviates from the install targets (both cmake and makefile) in
putting the include file into a subfolder in contrast to the top level
folder. This also prevents using the library with the recent cmake
addition of FetchContent.
This change unifies the include file handling by using the local path
for cmake as well.
This might break existing uses. As a workaround, we could add a dummy
include file in the old location (new utf8proc subfolder). I'm not sure
if that is necessary.
Co-authored-by: Stefan Floeren <stefan-floeren@users.noreply.github.com>
* Add: tests to CMakeLists.txt
* Disable compilation of charwidth, graphemetest and normtest because of missing getline
* Refactoring: UTF8PROC_ENABLE_TESTING default Off, move tests that don't compile on windows to NOT MSVC section, add testing to appveyor.yml
* Add: testing to travis
* Changed: flag to WIN32 because MinGW has the same problem as MSVC
* Commented out graphemetest and normtest because they fail.
* Re-added: graphemetest and normtest added missing data to the path of the text files.
* Fix: last commit was partly wrong, normtest failed.
* Commented out graphemetest and normtest because they fail: CMakeLists is missing the step that builds the data.
* Add: mingw_static, mingw_shared, msvc_shared, msvc_static to ignore list
* Add: prefix utf8proc. to tests
* Fix: memory leaks in tests case.c and misc.c, which forgot to call free after calling utf8proc_NFKC_Casefold
Co-authored-by: Andreas-Schniertshauer <Andreas-Schniertshauer@users.noreply.github.com>