DevTools:ICU & Normalization?

DevTools:ICU & Normalization?

David Haslam
According to http://crosswire.org/wiki/DevTools:ICU - Sword makes use of ICU for casing (used in search), normalization, and script transliteration.

Which version of Unicode do we employ for Normalization to NFC ?

Some composite glyphs that use two combining characters in the Myanmar block are treated differently when specifying the current version of Unicode than they were for Unicode 3.2.

These are the two combining characters, with Unicode code points U+1037 and U+103A:

့ MYANMAR SIGN DOT BELOW
် MYANMAR SIGN ASAT

This pair of combining characters occurs many, many times in the BurJudson module.

Software that includes Normalization should be tested against the official Unicode Normalization Test http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt (2.2MB) for that version of Unicode.

Testing the normalization of the sequence U+1000 U+103A U+1037 with the ICU Normalization Browser (which uses the "International Components for Unicode" library, the most widely used Unicode software library), we can verify that it does indeed normalize to U+1000 U+1037 U+103A, with reordering:

See http://bit.ly/nqYzQp.

However, if you run the same test for Unicode 3.2 (released March 2002, and so almost 10 years out of date), there is no reordering:

See http://bit.ly/orZ7df.

NB. I used the URL shortener to allow parameters to be passed to the test page more easily.

The process of converting a string to NFC or NFD requires a stage called "canonical ordering", whereby characters are reordered in ascending order according to their canonical combining class [ccc]. See http://www.unicode.org/reports/tr15/?win#Description_Norm.

U+103A MYANMAR SIGN ASAT has ccc=9, whereas U+1037 MYANMAR SIGN DOT BELOW has ccc=7; therefore U+1037 is reordered before U+103A.
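
The reordering described above can be reproduced with Python's standard unicodedata module. This is only a sketch, not how SWORD or ICU implement it: canonical_order and codepoints are hypothetical helper names, and the result depends on the Unicode database bundled with the Python build. The frozen unicodedata.ucd_3_2_0 database is used here purely as a source of Unicode 3.2 combining classes.

```python
import unicodedata


def canonical_order(text, ccc=unicodedata.combining):
    """Stable-sort each run of combining marks (ccc > 0) by ascending ccc."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            # A starter (ccc == 0) closes the current run of combining marks.
            out.extend(sorted(run, key=ccc))
            out.append(ch)
            run = []
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


def codepoints(text):
    """Render a string as 'U+XXXX U+XXXX ...' for easy comparison."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# MYANMAR LETTER KA, SIGN ASAT, SIGN DOT BELOW (the order found in BurJudson)
sample = "\u1000\u103A\u1037"

# Combining classes in the database shipped with this Python build
print(unicodedata.unidata_version,
      unicodedata.combining("\u1037"), unicodedata.combining("\u103A"))

# Canonical ordering with current combining classes, and the library's own NFC
print(codepoints(canonical_order(sample)))
print(codepoints(unicodedata.normalize("NFC", sample)))

# Same ordering step, but driven by Unicode 3.2 combining classes
print(codepoints(canonical_order(sample, unicodedata.ucd_3_2_0.combining)))
```

With a current database the first two results should show DOT BELOW moved ahead of ASAT. Under the 3.2 data U+103A is still unassigned (it was added in Unicode 5.1), so it counts as ccc=0 and nothing moves, which matches the behaviour seen in the Normalization Browser.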

The present module BurJudson has SwordVersionDate=2008-03-01.
It looks very much as if the normalization was done according to Unicode 3.2.

Context:
This question arises in the context of the possibility of creating a new module from a better source text.
If we use the latest SWORD utilities to make the new module, will it normalize correctly?

David

Re: DevTools:ICU & Normalization?

Greg Hellings
David,

SWORD can link against many different versions of the ICU library. It
will detect the version that is installed on the system and leverage
its internal libraries. I know it supports back at least as far as ICU
4.0 which was Unicode 5.1. It also compiles against ICU 4.8 - which
supports Unicode 6.0 - as well.  Whether it supports anything before
ICU 4 I am not certain, as I have not tried with earlier versions
anytime recently.

Whatever is present on a system will be utilized. I thought
normalizing was done at data retrieval time, which would mean whatever
is present on the user's system will be used. If it's done at import
time then it will be whatever version of Unicode is on Chris Little's
system. I would imagine that it is at least later than 4.0 as that
version is dated to January 2009.

--Greg

On Wed, Oct 12, 2011 at 10:29 AM, David Haslam <[hidden email]> wrote:

> According to http://crosswire.org/wiki/DevTools:ICU - Sword makes use of ICU
> for casing (used in search), normalization, and script transliteration.
>
> *Which version of Unicode do we employ for Normalization to NFC ?*

_______________________________________________
sword-devel mailing list: [hidden email]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page

Re: DevTools:ICU & Normalization?

David Haslam
Thanks Greg.

Do you happen to know which version of ICU was included when the Windows editions of the SWORD utilities were compiled?

The latest available for download are still as in sword-utilities-1.6.2.zip dated 2010-10-22.

The output starts with "You are running osis2mod: $Rev: 2562 $".

[Yeah - I know that most volunteers are Linux users, so no need to remind or cajole me].

David

Re: DevTools:ICU & Normalization?

Greg Hellings
File listing
...
icudt42.dll
icuin42.dll
icuuc42.dll
...

Looks like it's version 4.2 of ICU which is at least Unicode 5.1.
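
The same check can be scripted. A rough sketch, assuming the Windows DLLs follow the icu<lib><major><minor>.dll naming seen in the listing above (later ICU releases number their files differently); icu_version and ICU_TO_UNICODE are hypothetical names, and the table only covers the releases mentioned in this thread:

```python
import re

# Assumption: names look like icuuc42.dll, i.e. library tag + two version digits.
ICU_DLL = re.compile(r"^icu(?P<lib>[a-z]+?)(?P<major>\d)(?P<minor>\d)\.dll$",
                     re.IGNORECASE)

# Unicode version shipped with each ICU release (4.0 and 4.8 as stated in this
# thread; 4.2 also shipped Unicode 5.1).
ICU_TO_UNICODE = {"4.0": "5.1", "4.2": "5.1", "4.8": "6.0"}


def icu_version(filename):
    """Return 'major.minor' for an ICU DLL name, or None if it is not one."""
    m = ICU_DLL.match(filename)
    if m is None:
        return None
    return m.group("major") + "." + m.group("minor")


listing = ["icudt42.dll", "icuin42.dll", "icuuc42.dll", "osis2mod.exe"]
versions = {icu_version(name) for name in listing} - {None}
for version in sorted(versions):
    print("ICU", version, "-> Unicode", ICU_TO_UNICODE.get(version, "unknown"))
```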

--Greg

On Thu, Oct 13, 2011 at 11:22 AM, David Haslam <[hidden email]> wrote:

> Thanks Greg.
>
> Do you happen to know which version of ICU was included when the Windows
> editions of the SWORD utilities were compiled?

Re: DevTools:ICU & Normalization?

David Haslam
FYI.  As a result of my posts in their forum arising from this topic, DataMystic have just released v8.9.8 of TextPipe.

The release notes include:

* Updated internal PCRE (Pattern Matching ) engine to v8.13 and support for Unicode 6.0.0.
* Updated Unicode internal libraries to support Unicode 4.1 for Normalization etc.

I have confirmed that TextPipe now Normalizes Burmese script to NFC with identical results to BabelPad.
As an avid user of TextPipe Standard edition, for me this is a nice step forward.

Our BurJudson module was made with the source text normalized to an earlier version of Unicode.

Unless one specifies otherwise (by means of the -N switch), osis2mod performs normalization to NFC.

I would therefore recommend that precompiled SWORD utilities (especially those for Windows) should be built such that they adhere to the latest Unicode standard for Normalization.

Likewise, front-end developers may have something to gain by pursuing this topic further. ICU also has implications for module search: a search string ought to be normalized in the same way as the module was, otherwise it may fail to match.
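
To illustrate the search concern, here is a small sketch in Python using the standard unicodedata module rather than ICU or the SWORD API; nfc and matches are hypothetical helper names. It assumes the module text was stored in the older order (ASAT before DOT BELOW) while the search string has been normalized under current rules:

```python
import unicodedata


def nfc(text):
    """Normalize to NFC using the Unicode data bundled with this Python."""
    return unicodedata.normalize("NFC", text)


def matches(query, text):
    """Compare only after passing both sides through the same normalizer."""
    return nfc(query) in nfc(text)


# Assumed module content: KA + ASAT + DOT BELOW, as left by an older normalizer
stored = "\u1000\u103A\u1037"
# Search string after normalization under current rules: KA + DOT BELOW + ASAT
query = "\u1000\u1037\u103A"

print(query in stored)         # raw comparison of differently ordered text
print(matches(query, stored))  # comparison after normalizing both sides
```

Normalizing both sides at search time hides the mismatch, but it costs time on every lookup; rebuilding the module with an up-to-date normalizer avoids that.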

David



Re: DevTools:ICU & Normalization?

Greg Hellings
On Fri, Oct 28, 2011 at 10:28 AM, David Haslam <[hidden email]> wrote:

> Likewise, front-end developers may have something to gain by pursuing this
> topic further, seeing as ICU has implications during module search, in
> regard to normalization of a search string, such that it ought to match how
> the module was normalized.

Front-end developers on Linux are largely limited by the distro they
reside on. SWORD supports, at the very least, ICU 4.0. Most modern
distros tend to include either ICU 4.6 or 4.8 - the latter being the
most up to date release available.

BibleTime does not use ICU at all, using Qt instead, so it is a rather
moot point with us on any operating system.  I do not know if Qt can
be compiled with an ICU backend or not, but it might be worth looking
into.

Xiphos on Windows, I believe, is distributed with ICU 4.0. This is
because that was the latest version of ICU that Matthew was able to
compile for Windows in the build environment Xiphos leverages.  I
believe Karl is also planning to use that same version. I have
successfully cross-compiled ICU 4.8 under Linux, but there are other
things hindering my ability to build Xiphos for Windows - most notably
there is very poor 64-bit Windows cross-compile support in CLucene. I
have a patch to fix that for CLucene 2, but I am waiting on Troy
to commit the CLucene 2 compatibility patch he and I developed for the
SWORD library, since I am unable to commit to that portion of the
SWORD repository. (Hint, hint, Troy. Waiting on you, still ;)

If he does that and I succeed at building with ICU 4.8 and CLucene 2,
then I will release a copy of the utilities with those libraries.  If
Chris updates his environment first, he will probably have more
success building natively with ICU 4.8 in VisualStudio.

--Greg
