usfm2osis.py

usfm2osis.py

Chris Little-2
usfm2osis.py is posted now, at
http://www.crosswire.org/svn/sword-tools/trunk/modules/python/

It was developed on/for CPython 2.7.3, but 2.6+ should work. PyPy works
fine too, but takes more than twice as long to run. And Jython is not
supported at all.

The utility is not perfect & the code itself is a little messy at the
moment, but it's much better than its Perl equivalent when it comes to
generating valid OSIS. Every USFM tag in the 2.35 reference is processed
in some way, but processing of only a fraction of the tags has been
tested. (That's my next task.)

The command line syntax from the Perl equivalent can be used. Or use -h
for the usage statement. In general, using the '-v -r' switches will be
most common, I expect.

This utility is a bit slower than the Perl script was. Converting the
WEB from USFM to OSIS takes about 7.5s on my system with 4 vCPUs, where
the Perl script took about 4s as a single thread. But the Python version
has the benefit of generating valid markup. (The script will fork as
many processes as you have vCPUs, up to the number of books you are
converting.)
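A minimal sketch of the "one process per vCPU, capped at the number of books" rule, for readers who want to see it spelled out. This is not the code from usfm2osis.py; `worker_count`, `convert_book` and `convert_all` are hypothetical names, and `convert_book` is only a stand-in for the real per-book conversion.

```python
import multiprocessing


def worker_count(num_books, num_cpus=None):
    """Return how many worker processes to start for num_books books."""
    if num_cpus is None:
        num_cpus = multiprocessing.cpu_count()
    # Never more workers than books, and always at least one.
    return max(1, min(num_cpus, num_books))


def convert_book(path):
    # Hypothetical placeholder: the real work would parse one USFM file
    # and return its OSIS markup.
    return path.upper()


def convert_all(paths):
    """Convert every book in paths, keeping the results in input order."""
    pool = multiprocessing.Pool(processes=worker_count(len(paths)))
    try:
        return pool.map(convert_book, paths)
    finally:
        pool.close()
        pool.join()
```

With 4 vCPUs and 66 books this starts 4 workers; with 4 vCPUs and 3 books it starts 3.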

Bug reports are welcome if you try it, but this is still largely
untested stuff, so expect bugs.


The other script in the above directory can be used to identify all of
the USFM tags used in a set of files and will specify which of them are
unknown to the USFM 2.35 reference.
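As a rough illustration of what such a tag survey involves (not the actual usfmtags.py), the sketch below collects every backslash marker in a text and splits the result into known and unknown. `KNOWN_TAGS` is a tiny hypothetical subset, not the USFM 2.35 list, and the marker pattern is an assumption that ignores nested `\+` markers.

```python
import re

# Hypothetical subset of known markers; the real reference is far longer.
KNOWN_TAGS = {"id", "c", "v", "p", "q1", "f", "f*", "nb", "io1"}

# Assumed marker shape: backslash, lowercase letters, optional digits,
# optional closing asterisk (for example \v, \io1, \f*).
TAG_RE = re.compile(r"\\([a-z]+[0-9]*\*?)")


def classify_tags(text):
    """Return (known, unknown) as sorted lists of the markers in text."""
    found = set(TAG_RE.findall(text))
    return sorted(found & KNOWN_TAGS), sorted(found - KNOWN_TAGS)


sample = "\\id GEN \\c 1 \\p \\v 1 text \\zz odd \\f + note\\f*"
known, unknown = classify_tags(sample)
```

Here `unknown` ends up holding only `zz`, the one marker missing from the assumed set.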

--Chris

_______________________________________________
sword-devel mailing list: [hidden email]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page

Re: usfm2osis.py

David Haslam
Chris,

Do you foresee any issues if I try to run it with Python 3.2.3 x64 on Windows 7?

From the readme.txt file

"Python 3.x is a new version of the language, which is incompatible with the 2.x
line of releases.  The language is mostly the same, but many details, especially
how built-in objects like dictionaries and strings work, have changed
considerably, and a lot of deprecated features have finally been removed."

David

Re: usfm2osis.py

David Haslam
I see after downloading your script that this is already answered.

# Target Python 2.7+ (but not 3)

David

Re: usfm2osis.py

refdoc@gmx.net
In reply to this post by Chris Little-2
On 04/08/12 13:15, Chris Little wrote:
> usfm2osis.py is posted now, at

> Bug reports are welcome if you try it, but this is still largely
> untested stuff, so expect bugs.

Is the file meant to contain some very odd characters?

Peter


Re: usfm2osis.py

Greg Hellings
In reply to this post by Chris Little-2
I'm not at a place where I can check it out right now, but does it
cover the functionality that previously was required in xreffix.pl?
Since the Perl bindings seem to have gone belly-up on 64-bit machines,
it would be great if all of this could be combined in a single step
(even if it's an optional step enabled by a command-line flag or
such).

--Greg

On Sat, Aug 4, 2012 at 7:15 AM, Chris Little <[hidden email]> wrote:

> usfm2osis.py is posted now, at
> http://www.crosswire.org/svn/sword-tools/trunk/modules/python/
>
> It was developed on/for CPython 2.7.3, but 2.6+ should work. PyPy works fine
> too, but takes more than twice as long to run. And Jython is not supported
> at all.
>
> The utility is not perfect & the code itself is a little messy at the
> moment, but it's much better than its Perl equivalent when it comes to
> generating valid OSIS. Every USFM tag in the 2.35 reference is processed in
> some way, but processing of only a fraction of the tags has been tested.
> (That's my next task.)
>
> The command line syntax from the Perl equivalent can be used. Or use -h for
> the usage statement. In general, using the '-v -r' switches will be most
> common, I expect.
>
> This utility is a bit slower than the Perl script was. Converting the WEB
> from USFM to OSIS takes about 7.5s on my system with 4 vCPUs, where the Perl
> script took about 4s as a single thread. But the Python version has the
> benefit of generating valid markup. (The script will fork as many processes
> as you have vCPUs, up to the number of books you are converting.)
>
> Bug reports are welcome if you try it, but this is still largely untested
> stuff, so expect bugs.
>
>
> The other script in the above directory can be used to identify all of the
> USFM tags used in a set of files and will specify which of them are unknown
> to the USFM 2.35 reference.
>
> --Chris

Re: usfm2osis.py

David Haslam
In reply to this post by refdoc@gmx.net
Wow!

What Peter means is that after all the ASCII stuff (up to the tilde), these are also counted:

0E0030 󠀰 14 TAG DIGIT ZERO
0E0031 󠀱 11 TAG DIGIT ONE
0E0032 󠀲 10 TAG DIGIT TWO
0E0033 󠀳 7 TAG DIGIT THREE
0E0034 󠀴 6 TAG DIGIT FOUR
0E0035 󠀵 5 TAG DIGIT FIVE
0E0042 󠁂 18 TAG LATIN CAPITAL LETTER B
0E0043 󠁃 11 TAG LATIN CAPITAL LETTER C
0E0044 󠁄 16 TAG LATIN CAPITAL LETTER D
0E0046 󠁆 28 TAG LATIN CAPITAL LETTER F
0E0056 󠁖 7 TAG LATIN CAPITAL LETTER V
0E0070 󠁰 21 TAG LATIN SMALL LETTER P


David

Re: usfm2osis.py

Chris Little-2
In reply to this post by David Haslam
On 08/04/2012 07:04 AM, David Haslam wrote:
> I see after downloading your script that this is already answered.
>
> # Target Python 2.7+ (but not 3)
>
> David

Right. Python 3 is significantly different. I haven't bothered to learn
it and don't plan to make usfm2osis.py a Python 3 application at any
point in the near future, though there are features here and there to
make it slightly easier to transition to Python 3 *eventually*. Python 2
is still much more common and continues to be supported with new
releases (of the interpreter).
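One way a script can make the "2.x only" target explicit is a version check at startup. The sketch below is an assumption about how that could look, not something taken from usfm2osis.py; `is_supported` and `check_interpreter` are hypothetical names, and the 2.6 lower bound follows the "2.6+ should work" remark earlier in the thread.

```python
import sys


def is_supported(version_info):
    """True for Python 2.6 and 2.7, False for older 2.x and for 3.x."""
    major, minor = version_info[0], version_info[1]
    return major == 2 and minor >= 6


def check_interpreter():
    """Exit with a readable message instead of failing later on."""
    if not is_supported(sys.version_info):
        sys.exit("This script targets Python 2.6/2.7, not %d.%d"
                 % sys.version_info[:2])
```

Calling `check_interpreter()` first thing would turn an obscure failure under Python 3 into a one-line message.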

--Chris




Re: usfm2osis.py

Chris Little-2
In reply to this post by Greg Hellings
On 08/04/2012 10:19 AM, Greg Hellings wrote:
> I'm not at a place where I can check it out right now, but does it
> cover the functionality that previously was required in xreffix.pl?
> Since the Perl bindings seem to have gone belly-up on 64-bit machines,
> it would be great if all of this could be combined in a single step
> (even if it's an optional step enabled by a command-line flag or
> such).
>
> --Greg

Not yet. I want to incorporate that functionality, but I consider it a
post-1.0 feature--which is not necessarily to say that I won't get to it
in a week or so. (It's on the roadmap listed in the script.)

I want to ensure that the script works fine without Sword bindings, but
in the event you have the Python Sword bindings installed, I'll put
<reference>s through the reference parser.

--Chris



Re: usfm2osis.py

Chris Little-2
In reply to this post by David Haslam
On 08/04/2012 10:22 AM, David Haslam wrote:

> Wow!
>
> What Peter means is that after all the ASCII stuff (up to the tilde), these
> are also counted:
>
> 0E0030 󠀰 14 TAG DIGIT ZERO
> 0E0031 󠀱 11 TAG DIGIT ONE
> 0E0032 󠀲 10 TAG DIGIT TWO
> 0E0033 󠀳 7 TAG DIGIT THREE
> 0E0034 󠀴 6 TAG DIGIT FOUR
> 0E0035 󠀵 5 TAG DIGIT FIVE
> 0E0042 󠁂 18 TAG LATIN CAPITAL LETTER B
> 0E0043 󠁃 11 TAG LATIN CAPITAL LETTER C
> 0E0044 󠁄 16 TAG LATIN CAPITAL LETTER D
> 0E0046 󠁆 28 TAG LATIN CAPITAL LETTER F
> 0E0056 󠁖 7 TAG LATIN CAPITAL LETTER V
> 0E0070 󠁰 21 TAG LATIN SMALL LETTER P
>
>
> David

Yes, these are intended and fall under the following line of the guidelines:

Use & abuse Unicode tags (http://unicode.org/charts/PDF/UE0000.pdf) to
simplify Regex processing

They are inserted at various division boundaries to simplify regexes. So
the B-tag marks book boundaries, C is for chapter, D is for div, F is
for footnote, and V is for verse. The p tag represents paragraphs (it
ought to be capitalized like the others). The digit tags represent
section levels, IIRC.

Unfortunately, no one includes these in fonts, much less keyboards, so
they're a pain to work with, but they simplify regexes so drastically
that they're worth it. And I consider the probability that anyone would
use them in USFM so slim that I'm willing to risk the possibility of
false positives in my regex matching.
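To make the idea concrete, here is a small sketch of the sentinel technique, written for a current Python 3 interpreter rather than the 2.7 the script targets. It is not the script's own code: the sample text and the three-pass layout are assumptions, and only the choice of TAG LATIN CAPITAL LETTER C for chapters comes from the description above.

```python
import re

# TAG LATIN CAPITAL LETTER C, the code point described as the chapter mark.
CHAPTER = "\U000E0043"

usfm = "\\c 1 \\v 1 In the beginning \\c 2 \\v 1 Thus the heavens"

# Pass 1: drop a sentinel in front of every chapter marker.
marked = re.sub(r"\\c\s+", lambda m: CHAPTER + "\\c ", usfm)

# Pass 2: a chapter is "everything up to the next sentinel", which a
# negated character class expresses without lookahead or backtracking.
chapter_re = re.compile(CHAPTER + r"\\c (\d+)([^" + CHAPTER + r"]*)")
chapters = [(num, body.strip()) for num, body in chapter_re.findall(marked)]

# Pass 3: strip the sentinels so that none of them reach the output.
clean = marked.replace(CHAPTER, "")
```

The payoff is in pass 2: because the sentinel is a single character that cannot occur in real text, `[^...]*` is enough to delimit a chapter body.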

--Chris



Re: usfm2osis.py

Robert Hunt
In reply to this post by Chris Little-2
On 05/08/12 00:15, Chris Little wrote:
> Bug reports are welcome if you try it, but this is still largely
> untested stuff, so expect bugs.
>
> The other script in the above directory can be used to identify all of
> the USFM tags used in a set of files and will specify which of them
> are unknown to the USFM 2.35 reference.

I'm not sure how to submit bug reports, but in testing this on our
in-progress translation I get:

From: usfmtags.py

    Known USFM Tags: \b, \bk, \bk*, \c, \f, \f*, \fq, \fr, \ft, \h, \id,
    \ide, \io1, \io2, \ior, \ior*, \iot, \ip, \is, \it, \it*, \li, \m,
    \mr, \ms, \mt, \mt1, \mt2, \nb, \p, \q, \q1, \q2, \q3, \r, \s, \s2,
    \s3, \tc1, \tcr2, \tr, \v, \x, \x*, \xo, \xt
    Unrecognized USFM Tags:

which is correct, but from usfm2osis.py I get:

    Encoding unknown, processing as UTF-8.
    Encoding unknown, processing as UTF-8.
    Unhandled USFM tags: \n, \o1, \o2, \or, \or*, \ot, \p, \v (8 total)
    Consider using the -r option for relaxed markup processing.

which are all false errors. The n is actually nb in the USFM, and the
others are all from introduction tags, i.e., io1, io2, ior, etc.

Hope this helps,
Robert.




Re: usfm2osis.py

Chris Little-2
On 08/04/2012 10:10 PM, Robert Hunt wrote:

> On 05/08/12 00:15, Chris Little wrote:
>> Bug reports are welcome if you try it, but this is still largely
>> untested stuff, so expect bugs.
>>
>>
>> The other script in the above directory can be used to identify all of
>> the USFM tags used in a set of files and will specify which of them
>> are unknown to the USFM 2.35 reference.
> I'm not sure how to submit bug reports, but in testing this on our
> in-progress translation I get:
>
> From: usfmtags.py
>
>     Known USFM Tags: \b, \bk, \bk*, \c, \f, \f*, \fq, \fr, \ft, \h, \id,
>     \ide, \io1, \io2, \ior, \ior*, \iot, \ip, \is, \it, \it*, \li, \m,
>     \mr, \ms, \mt, \mt1, \mt2, \nb, \p, \q, \q1, \q2, \q3, \r, \s, \s2,
>     \s3, \tc1, \tcr2, \tr, \v, \x, \x*, \xo, \xt
>     Unrecognized USFM Tags:
>
> which is correct, but from usfm2osis.py I get:
>
>     Encoding unknown, processing as UTF-8.
>     Encoding unknown, processing as UTF-8.
>     Unhandled USFM tags: \n, \o1, \o2, \or, \or*, \ot, \p, \v (8 total)
>     Consider using the -r option for relaxed markup processing.
>
> which are all false errors. The n is actually nb in the USFM, and the
> others are all from introduction tags, i.e., io1 io2, ior, etc.

Thanks Robert, it does help tremendously. You're welcome to file reports
in the MODTOOLS project of our bug tracker, as well. (Here's the report
for this bug: http://www.crosswire.org/bugs/browse/MODTOOLS-32)

I realized earlier today that I've badly bungled handling of all the \i-
introduction elements, so that bit needs to be redone completely.

I couldn't guess why it's missing \nb, since that one is treated like
every other paragraph type and I have actually tested that one. Maybe the
problem will become apparent when I finish the test suite.

--Chris




Re: usfm2osis.py

David Haslam
In reply to this post by Chris Little-2
Chris,

Thanks for the explanation. Nice to "learn something new each day."
It was new to me, and probably also for Peter.

However, such tag characters were deprecated in Unicode 5.1 (2008).

See http://en.wikipedia.org/wiki/Unicode_control_characters#Language_tags

David

Re: usfm2osis.py

David Haslam
In reply to this post by Chris Little-2
Although I haven't done this yet, I understand that it's feasible to install more than one version of Python on the same computer.

So (assuming I do get that far), I should be able to install Python 2.7.

Hmmm!  The Software History page in the help for Python 3.2.x jumps straight from 2.6.4 to 3.0 on the next row of the table.

Nonetheless, there is an earlier page entitled, What’s New in Python 2.7 (April 11, 2012),
so I guess the other omission was just an oversight by the documentation authors.
Who looks at history pages, anyway?

David

Re: usfm2osis.py

Chris Little-2
In reply to this post by David Haslam
On 8/5/2012 12:29 AM, David Haslam wrote:

> Chris,
>
> Thanks for the explanation. Nice to "learn something new each day."
> It was new to me, and probably also for Peter.
>
> However, such tag characters have become deprecated in Unicode 5.1 (2008).
>
> See http://en.wikipedia.org/wiki/Unicode_control_characters#Language_tags
>
> David

Yes, absolutely they're deprecated. They're also intended for language
tagging specifically, which is completely different from my use.

The fact that they're deprecated (and were always, frankly, an obscure
corner of Unicode) makes it even more unlikely that we'll somehow
receive data that uses these characters. I would consider it less likely
that we'll see language tags than any given PUA character, and as long
as we don't include the tags in the output, we're in the clear about the
deprecation.

--Chris




Re: usfm2osis.py

Chris Little-2
In reply to this post by David Haslam
On 8/5/2012 12:38 AM, David Haslam wrote:
> Although I haven't done this yet, I understand that it's feasible to install
> more than one version of Python in the same computer.
>
> So (assuming I do get that far), I should be able to install Python 2.7.

I'm not certain of this, especially on Windows, but I believe 'python'
usually refers to Python 2 and the Python 3 interpreter is 'python3' on
Linux. I'm pretty sure it's easy, if not trivial, to install them on the
same machine.

> Hmmm!  The *Software History* page in the help for Python 3.2.x jumps
> straight from 2.6.4 to 3.0 on the next row of the table.
>
> Nonetheless, there is an earlier page entitled, *What’s New in Python 2.7*
> (April 11, 2012),
> so I guess the other omission was just an oversight by the documentation
> authors.
> /Who looks at history pages, anyway?/

Python 2.7 is much more recent than 3.0, but I'm sure the jump from
2.6.4 to 3.0 was accurate at the time of 3.0's release.

--Chris



Re: usfm2osis.py

David Haslam
In reply to this post by Chris Little-2
FWIW, I just came across this Python Regular Expression Testing Tool: http://www.pythonregex.com/

Does Python support the full 21-bit Unicode range?

cf. Many other regular expression engines only support the Basic Multilingual Plane.

David

Re: usfm2osis.py

David Haslam
In reply to this post by Chris Little-2
See also the Comparison of regular expression engines on Wikipedia.

If the table is not out of date, it would appear that Perl can do some regexp things that Python can't.

e.g. Recursion, etc.

David

Re: usfm2osis.py

Chris Little-2
In reply to this post by David Haslam


On Aug 5, 2012, at 11:37 AM, David Haslam <[hidden email]> wrote:

> FWIW, I just came across this  http://www.pythonregex.com/ Python Regular
> Expression Testing Tool
>
> Does Python support the full 21-bit Unicode range?
>
> cf. Many other regular expression engines only support the Basic
> Multilingual Plane.
>

Yes, Python regex supports non-BMP characters. The language tags are Plane 14, I believe. An engine that supports only the BMP can't be said to support Unicode and is probably just processing bytes.
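A quick way to check this claim on a given interpreter is to match a Plane 14 range directly. The sketch below assumes an interpreter whose strings hold full code points (any current Python 3, or a UCS-4 build of Python 2 with `u""` literals); the sample text is made up.

```python
import re

# Two tag characters (Plane 14) embedded in ordinary ASCII text.
text = "abc\U000E0042def\U000E0043"

# One character class covering the tag characters U+E0020..U+E007F.
tag_chars = re.compile("[\U000E0020-\U000E007F]")

matches = tag_chars.findall(text)
code_points = ["%06X" % ord(ch) for ch in matches]
```

`code_points` uses the same six-digit hexadecimal layout as the character counts quoted earlier in the thread.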

--Chris



Re: usfm2osis.py

Greg Hellings
On Sun, Aug 5, 2012 at 7:19 PM, Chris Little <[hidden email]> wrote:

>
>
> On Aug 5, 2012, at 11:37 AM, David Haslam <[hidden email]> wrote:
>
>> FWIW, I just came across this  http://www.pythonregex.com/ Python Regular
>> Expression Testing Tool
>>
>> Does Python support the full 21-bit Unicode range?
>>
>> cf. Many other regular expression engines only support the Basic
>> Multilingual Plane.
>>
>
> Yes, Python regex supports non-BMP characters. The language tags are Plane 14, I believe. An engine that supports only the BMP can't be said to support Unicode and is probably just processing bytes.
>

As further explanation, Python differentiates between "string" objects,
which are 8-bit encoded representations of text in any selected
encoding, and "unicode" objects, which are strings of Unicode
characters. The exact internal representation probably differs between
CPython and Jython. CPython used to use UCS-2 but can now be built with
either UCS-2 or UCS-4, since Unicode has grown beyond the BMP.

To read more details see
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
under the heading "Internal Representation".
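The difference between the two kinds of object is easiest to see by comparing lengths. This is only an illustrative sketch with a made-up sample string; in Python 2 the two types are `unicode` and `str`, while in Python 3 the same split is `str` and `bytes`.

```python
# A text string of three characters: e-acute, t, e-acute.
text = u"\u00e9t\u00e9"

# Encoding produces a byte string whose length depends on the encoding.
utf8_bytes = text.encode("utf-8")      # 2 + 1 + 2 bytes
latin1_bytes = text.encode("latin-1")  # 1 + 1 + 1 bytes

# Decoding with the matching encoding recovers the original text.
round_trip = utf8_bytes.decode("utf-8")
```

Regexes that should operate on characters rather than bytes need to run on the decoded text.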

--Greg

> --Chris

Re: usfm2osis.py

Chris Little-2
On 8/5/2012 5:28 PM, Greg Hellings wrote:

> On Sun, Aug 5, 2012 at 7:19 PM, Chris Little <[hidden email]> wrote:
>>
>>
>> On Aug 5, 2012, at 11:37 AM, David Haslam <[hidden email]> wrote:
>>
>>> FWIW, I just came across this  http://www.pythonregex.com/ Python Regular
>>> Expression Testing Tool
>>>
>>> Does Python support the full 21-bit Unicode range?
>>>
>>> cf. Many other regular expression engines only support the Basic
>>> Multilingual Plane.
>>>
>>
>> Yes, Python regex supports non-BMP characters. The language tags are Plane 14, I believe. An engine that supports only the BMP can't be said to support Unicode and is probably just processing bytes.
>>
>
> As further explanation, Python differentiates between the "string"
> object, which is 8-bit encoding representation of objects in any
> selected encoding and "unicode" objects which are strings of Unicode
> characters. The exact internal representation probably differs between
> CPython and Jython. CPython used to use UCS-2 but now can use either
> UCS-2 or UCS-4 since the extension of the BMP.
>
> To read more details see
> http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
> under the heading "Internal Representation".

Oh. Well, that's annoying.

To see whether your Python interpreter is compiled with UCS-2 or UCS-4,
you can run this from the interpreter:

import sys
sys.maxunicode

If it returns 65535, it's using UCS-2. If 1114111, then UCS-4.
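The same check can be wrapped in a small helper, sketched here with the hypothetical name `unicode_build`. Note that from Python 3.3 onward `sys.maxunicode` is always 1114111, so the distinction only matters for Python 2 and for Python 3.0 through 3.2.

```python
import sys


def unicode_build(maxunicode=None):
    """Name the build type implied by sys.maxunicode (or a given value)."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 0xFFFF (65535) is the largest code point a UCS-2 build can hold.
    return "UCS-4" if maxunicode > 0xFFFF else "UCS-2"
```

Passing a value explicitly is only there to make the function easy to test without two interpreters.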

Linux packagers apparently go the UCS-4 route, so I didn't notice any
issue with using the Language Tags. But trying the above on Windows
shows that the cygwin build and the builds from python.org (2.7 & 3.2)
all use UCS-2. So my script won't work correctly on Windows.

Not to worry, though. I'll just replace the Language Tags with
Noncharacters in the range U+FDD0-U+FDEF. They're UCS-2-safe since
they're BMP codepoints and they're specifically designated as "intended
for process-internal uses, but are not permitted for interchange." So in
the unlikely event that they appear in input, it's the fault of the
USFM-encoder if anything goes awry.
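A sketch of how noncharacter markers might be managed, written for Python 3. The assignment of particular code points to book, chapter and verse in `MARKERS` is purely hypothetical, as are the helper names; only the U+FDD0-U+FDEF range comes from the message above.

```python
# The 32 noncharacters U+FDD0..U+FDEF, all inside the BMP.
NONCHARACTERS = frozenset(chr(cp) for cp in range(0xFDD0, 0xFDF0))

# Hypothetical marker assignments, for illustration only.
MARKERS = {
    "book": "\uFDD0",
    "chapter": "\uFDD1",
    "verse": "\uFDD2",
}


def reserved_in(text):
    """List any noncharacters already present in the input text."""
    return sorted(set(text) & NONCHARACTERS)


def strip_markers(text):
    """Remove every noncharacter so that none reach the output."""
    return "".join(ch for ch in text if ch not in NONCHARACTERS)
```

Running `reserved_in` on the raw input before any markers are inserted gives an early warning in the unlikely case that a source file already contains one of these code points.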

We'll have to watch for input outside of the BMP on UCS-2 Python,
though, as that could cause problems.
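One possible way to watch for such input is to scan the decoded text before processing. The sketch below uses a hypothetical helper name and relies on the fact that a UCS-2 build represents each non-BMP character as a surrogate pair, so it looks for surrogate code units as well as for code points above U+FFFF.

```python
def has_non_bmp(text):
    """Return True if text contains anything outside the BMP."""
    for ch in text:
        cp = ord(ch)
        # Wide builds see the code point itself; narrow (UCS-2) builds
        # see two surrogate halves in the range U+D800..U+DFFF.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            return True
    return False
```

A caller could then warn or refuse to continue when `has_non_bmp` is true and `sys.maxunicode` is 65535.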

--Chris

