Detecting Problem Characters


Mike Hart
I've got a couple of modules in the making, and in both I'm working on
quote marks that either aren't displaying at all or are displaying as
block "mystery" characters.  I'm spending time trying to separate
apostrophes from single quotes in both modules, in the hope that I can
preserve or achieve the ability to use OSIS <q> tags....

HOWEVER

In both modules, at some point I've lost control of a few characters,
and now MS Excel and OpenOffice Calc can't see all the end-of-line
characters. That is, when I try to open the VPL file, it almost but not
quite works.  Some verses are grouped together in either spreadsheet,
while jEdit sees them as properly separated.

Recently or not so recently I saw a comment in some post describing a
way or a program that summarizes all 'non-ascii' or 'out of this
encoding' characters that appear in a file.  I've spent time searching
for this post but cannot locate it, or any information about this step,
on the module creation wiki.

Can someone enlighten me (again) as to the best method to find offending
characters and deal with them?

Thanks in advance,

Mike
___________________________________________________________________

PS.  Modules in progress are based on these documents:

1. Holy New Covenant (public domain on publication in 2004.)
http://www.thomhackett.com/the-holy-new-covenant.htm

(The "palm doc" file actually opens as an MS Word 97 or 2003 file.)  It
is my intention to get this into Sword to evaluate its readability and
usability.  From my cursory review it is a fairly faithful treatment of
scripture. The Galilee Translation Team mentioned appears to be
affiliated with The Church of Christ in some way.

2. The Riverside New Testament (published 1923 and copyright renewed
(1948?) according to Google, but even if still copyrighted should be
distributable within the next decade... If I have my facts straight).

http://sourceforge.net/projects/zefania-sharp/files/Zefania%20XML%20Modules%20%28old%29/Bibles%20ENG/The%20Riverside%20New%20Testament%20%281923%29/sf_Riverside_NT2.zip/download

Came to me as a 'zefania' xml file.  Note that this file is now (after I
started working on this last year) already available in OSIS format at:

http://sourceforge.net/projects/zefania-sharp/files/Osis%20XML%20Modules%20%28raw%29/

so this is really more of an exercise in 'what am I doing wrong' for me.



_______________________________________________
sword-devel mailing list: [hidden email]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page

Re: Detecting Problem Characters

Greg Hellings
Michael,

On Fri, Sep 23, 2011 at 12:20 PM, Michael Hart <[hidden email]> wrote:

> Can someone enlighten me (again) as to the best method to find offending
> characters and deal with them?

I wrote the following script, which will work well if your text is in
plain text format. Its output will be skewed if your text is in
something like OSIS or imp format, but it will still run.
http://dl.thehellings.com/count.py
It further assumes that your file is encoded in UTF-8, though you can
change that readily enough. The program will terminate abnormally if
the input file contains bytes that are not valid UTF-8; otherwise it
will print out a list of all the characters it encountered, their
frequency, and their Unicode name.
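
For readers who cannot fetch that script, here is a minimal sketch of
the same idea in Python. It is not Greg's count.py: the function names
(is_suspect, summarize_characters, main) and the choice to report only
characters outside printable ASCII, tab, and newline are assumptions
made for illustration.

```python
import sys
import unicodedata
from collections import Counter


def is_suspect(ch):
    """True for anything other than printable ASCII, tab, or newline."""
    return not (" " <= ch <= "~" or ch in "\t\n")


def summarize_characters(text):
    """Return (code point, count, Unicode name) for each suspect character,
    most frequent first."""
    counts = Counter(ch for ch in text if is_suspect(ch))
    rows = []
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        # Control characters such as \r have no Unicode name, so use a default.
        name = unicodedata.name(ch, "<control or unnamed>")
        rows.append((f"U+{ord(ch):04X}", n, name))
    return rows


def main(path, encoding="utf-8"):
    """Print the summary for one file; report where decoding fails, if it does."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError as err:
        print(f"{path}: not valid {encoding} at byte offset {err.start}",
              file=sys.stderr)
        return 1
    for code, n, name in summarize_characters(text):
        print(f"{code}\t{n}\t{name}")
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Run it as "python summarize.py yourfile.txt" (the script name is
arbitrary). Curly quotes, stray carriage returns, and other control
characters should each appear on a line of their own with a count.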

> Came to me as a 'zefania' xml file.  Note that this file is now (after I
> started working on this last year) already available in OSIS format at:
>
> http://sourceforge.net/projects/zefania-sharp/files/Osis%20XML%20Modules%20%28raw%29/
>
> so this is really more of an exercise in 'what am I doing wrong' for me.

For reasons not entirely mine to go into, nor germane to your
questions, CrossWire policy is generally to ignore zefania files.
Among those reasons, as you point out, is that many of their files have
been found to violate copyright laws.

--Greg


Re: Detecting Problem Characters

Troy A. Griffitts
In reply to this post by Mike Hart
Michael,

It sounds like you have mixed EOL types within your file.  This
script removes all \r's from your file so that every line ending is a
plain newline (\n).  It might help. Make a backup of your file first! :)

http://crosswire.org/svn/community/trunk/utils/osx/te2bb.app/Contents/Resources/te2bb.sh
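
Before changing anything, it may help to confirm the diagnosis by
counting each style of line ending in the file. The following is a
minimal Python sketch, not the script linked above; the function name
count_line_endings is hypothetical.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows), bare CR (old Mac), and bare LF (Unix) endings."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the carriage returns and linefeeds that stand alone.
    bare_cr = data.count(b"\r") - crlf
    bare_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": bare_cr, "LF": bare_lf}
```

For example, count_line_endings(open("foo.txt", "rb").read()) on a file
with consistent endings should show a nonzero count for only one of the
three keys; more than one nonzero count means the styles are mixed.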


Re: Detecting Problem Characters

Greg Hellings
On Fri, Sep 23, 2011 at 12:50 PM, Troy A. Griffitts
<[hidden email]> wrote:
> Michael,
>
> It sounds like you have eol types intermixed within your file.  This script
> removes all \r's from your file to normalize linefeeds to newlines.  It
> might help. Make a backup of your file first! :)
>
> http://crosswire.org/svn/community/trunk/utils/osx/te2bb.app/Contents/Resources/te2bb.sh

Another option is the unix2dos and dos2unix commands, if you want to
convert in both directions between Linux and Windows hosts.


--Greg


Re: Detecting Problem Characters

David Haslam
In reply to this post by Mike Hart
Hi Mike,

Sorry I was busy earlier when you Skyped me.

Useful Unicode text editors besides jEdit include Notepad++ and BabelPad.
See http://crosswire.org/wiki/DevTools:Text_Editors

You may be recalling my mention of BabelPad, which has a character frequency tool.
Greg's script does something very similar.

Notepad++ has a powerful search feature that includes a Count button, which I've found so useful, I've lost count of the number of times I use it.

David

Re: Detecting Problem Characters

Karl Kleinpaste-2
In reply to this post by Troy A. Griffitts
"Troy A. Griffitts" <[hidden email]> writes:
>  This script removes all \r's from your file

You sure do like to work awfully hard.

sed -i -e 's/\r$//' foo.txt
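
The sed one-liner strips a carriage return only where it sits directly
before a linefeed, and not every sed accepts -i without an argument or
understands the \r escape. As a portable alternative, here is a small
Python sketch that also converts bare \r endings. The function names
and the ".bak" backup suffix are assumptions for illustration.

```python
import shutil
from pathlib import Path


def normalize_newlines(data: bytes) -> bytes:
    """Turn CRLF and bare CR line endings into LF."""
    # Order matters: collapse CRLF first so its CR is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def normalize_file(path: str) -> bool:
    """Rewrite the file with LF endings; return True if anything changed."""
    target = Path(path)
    original = target.read_bytes()
    fixed = normalize_newlines(original)
    if fixed == original:
        return False
    # Keep a copy of the untouched file next to it before overwriting.
    shutil.copy2(target, target.with_name(target.name + ".bak"))
    target.write_bytes(fixed)
    return True
```

Calling normalize_file("foo.txt") leaves foo.txt.bak beside the
converted file, so the change can be undone.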


Re: Detecting Problem Characters

David Haslam
In reply to this post by Greg Hellings
And Notepad++ can also convert EOLs from one kind to another. (DOS, Linux, Mac).
It can also toggle whether such codes are displayed or not.

So there are lots of tools available.

Enjoy!

Re: Detecting Problem Characters

Troy A. Griffitts
In reply to this post by Karl Kleinpaste-2
Yeah, I tried sed first, but this was originally for the mac and I
couldn't get it to work with new lines on the mac.


Re: Detecting Problem Characters

Greg Hellings
Old-style Mac newlines are a bare \r, so that is probably what mixed you up.

--Greg


Re: Detecting Problem Characters

Mike Hart
In reply to this post by Greg Hellings
On 9/23/2011 1:30 PM, Greg Hellings wrote:
>
> For reasons not entirely mine to go into, nor germane to your
> questions, CrossWire policy is generally to ignore zefania files.
> Among such, as you point out, is that many of their files have been
> found to violate copyright laws.
>
> --Greg
Hopefully, I did not state ANY file at the Sourceforge site I linked to
IS copyrighted, only that the one I'm working on is possibly still under
copyright in the USA, but probably isn't (see my TMI explanation
below.)  This specific document is right on the line at every step of
the way, but I believe it fell through the gap in 1998 and did become
public domain, but I don't have a way to prove that at this time. As far
as I know no one has claimed the Riverside NT is still copyrighted, but
my method for works 1923-1964 was to positively answer beyond any doubt
the status of the work before publishing them.  The Riverside NT is
'almost, but not quite' ready.

I do want to know if someone thinks or knows there are clear violations
at the Sourceforge site I linked to.

http://sourceforge.net/projects/zefania-sharp/files/

______________________

As to the Sourceforge Zefania site itself:

The Sourceforge project I linked to which produces zefania works appears
to deal only with public domain works (the Riverside NT aside, and I've
seen other lists that name Riverside NT 'public domain', so I suspect
there is evidence I don't have yet.) IF any site that has inadvertently
produced illegal copies of copyrighted works were banned from
consideration, there wouldn't be anything left.  Amazon is guilty,
Internet Archive is guilty, CCEL is guilty, Google is guilty, etc.  (I'm
not sure about Gutenberg, but I wouldn't be surprised if even they have
run afoul of copyright law and had to retract works.)   I prefer to
evaluate each work on its own, and whatever pipeline it comes by is a
publishing concern.

The group working on the Sourceforge Zefania texts is not the group that
produced Opensong, which did or does have clear ties to ignoring the
law.  It would be the same as saying Crosswire and the Sword
Repositories should now be shunned because Lyricue uses Crosswire texts,
but that project also provides clearly copyrighted texts from some other
source.  The Zefania project started as a Linux Bible reader for Sharp
Zaurus, clearly intending to work in a similar way to Crosswire on only
public domain texts.  Similar to the palm bible apps from the 90s,  the
Zaurus was resource and memory limited, so they had to create their own
scheme to make the Bible fit.    Whatever other quasi-legal projects
piggyback onto it, don't blame the original.   I see nothing wrong
with sourcing from the Zefania site itself.
______________

For what it's worth (about the Riverside NT possibly being copyright):

The Riverside new testament is also available on the Internet Archive at

http://www.archive.org/details/riversidenewtest027415mbp

However, the text output from this digitization is in much much poorer
condition than that at the Zefania site.

It is my belief that this work is already public domain, but there is a
small window of time that I haven't excluded.

The copyright law changed in 1963 to make copyrights extend 75 years
from publication or 50 years after the life of the author, which
remained in effect until September 1998 when Congress extended it to
life +70 years of the author.  This means there is a very good chance
that the Riverside NT is already public domain if 1) it was published
before September 1923, and 2) William G Ballentine died before
September 1948.  I believe it is customary (even in 1923) to start
printing the next year's books somewhere around September.  That is,
you can get books published in 2012 right now.  If Ballentine died
after September 1948, then his death determines the PD date, but in all
likelihood it will be within the next decade.  Add to this that the
copyright renewal in (1948?) was recorded to the estate of Ballentine
by his wife. The question mark is to remind me of the fact that it was
recorded posthumously after 1948, but there is or was some law that
allowed this (widow's law).  This implies the copyright was continuous,
so even though recorded late, the renewal was effective 1948, but again
I'm not a lawyer on this point.

So the latest this work MIGHT be under copyright is around 2018.

Also consider that without evidence to the contrary, publications are
assumed to date from Jan 1 of the year they are published (the earliest
possible date).  Therefore it is reasonable to assume this work is
already out of copyright and can legally be distributed, with the
caveat that as soon as someone produces evidence of a fall publication
OR a late death certificate, a cessation would be necessary.

Also, note that I am stripping any copyrightable condition that may
exist on the Zefania work to get back to the original work, then
building it up. (Specifically, I am not working from the OSIS document
available there to create a module of this work.)  That is, I'm going
back to bare text, even removing all the return characters and white
space (runs of more than one space, tab characters, etc.), and
rebuilding the structure based on verse numbers.  Because of this, I
highly doubt that there could be any legal action over the sourcing of
this document being contaminated, and even in this day of lawyers in
the US, I'm fairly confident that my process would withstand scrutiny
if there were legal action.  I think I've covered all my bases well
enough.  Even today, additions to a public domain work are
copyrightable, but the original work remains in the public domain.

Also note that under 1923 copyright law, fair-use rights in general
remained with the citizen, not the holder. The original 1923 copyright
viewable at the Internet Archive link above has no restrictions listed,
so the only restrictions are those listed under the constitution and
1909 law itself (sale of the work for profit.)  Storing, modifying, and
(arguably, but not my intent until it is clearly legal to do so)
distribution for non-profit reasons are not restricted.  Therefore I see
nothing unethical, illegal, or immoral with my current work in modifying
the document for my own use, and preparations for a release when the
work IS public domain without a doubt.



Re: Detecting Problem Characters

Mike Hart
In reply to this post by Greg Hellings
Here's a link to the file, prior to import into OO Calc 3.3, which
produces the problem:

http://www.archive.org/details/HolyNewCovenant

It's under the 'all files http' link, the file ending in .csv.  It
takes too much time to remove the server name from the final link.

(*) I import this file delimited on TABS only, with no 'text' delimiter
character, as a UTF-8 file, which is what I loaded and saved it as in
jEdit. (In later OO Calc versions, to get no text delimiter, you have to
delete it and click on some other field.)

The trouble for me starts on row 61.

An earlier version of this file (overwritten now) loaded properly into
OO Calc. Based on flags that Calc raised, I manually updated a mirror
jEdit file for all the false verses (now appearing as _##_ in their
proper verses instead of as out-of-sequence verses on their own rows).
The import into OOo was not saved. I then further edited in jEdit for
some or all of the quoting issues (block characters appearing in jEdit
where quotes should be).  Both sets of jEdit edits (fixing the false
versification and dealing with quotification) were saved.

I'll try to run some of the character summary scripts when I get to a
working linux box.  might be a day or two. (python isn't on my current
desktop, nor is Bash, and I didn't see any other methods.)

I suspect this has something to do with the spreadsheet import module
interpreting some character as an 'absolute text quote' and assuming
many lines are one because of it.  BUT I can't see any logic to what's
happening. The way I'm importing, no character should be doing this and
the EOL should be respected. As far as I can tell, it's not happening on
any character I can see in jEdit, but it is happening on some of the
verses I've searched/replaced with jEdit, which suggests jEdit is
hiding something from view on replace that OOo is seeing.
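
One way to test the 'text quote' suspicion outside the spreadsheet is
to parse the same tab-delimited text twice, once with a double quote
acting as a text delimiter and once with quoting disabled, and compare
the row counts. The sketch below uses Python's csv module as a stand-in
for the importer; that it behaves like Calc or Excel is an assumption,
and the function names are hypothetical.

```python
import csv
import io


def rows_with_text_delimiter(text, quotechar='"'):
    """Parse tab-delimited text, treating quotechar as a text delimiter."""
    source = io.StringIO(text, newline="")
    return list(csv.reader(source, delimiter="\t", quotechar=quotechar))


def rows_without_text_delimiter(text):
    """Parse tab-delimited text with quote handling switched off."""
    source = io.StringIO(text, newline="")
    return list(csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE))


def lines_with_unbalanced_quotes(text, quotechar='"'):
    """1-based numbers of lines holding an odd number of quote characters."""
    return [number
            for number, line in enumerate(text.splitlines(), 1)
            if line.count(quotechar) % 2]
```

If the first parse yields fewer rows than the second, a quote character
is swallowing line breaks, and lines_with_unbalanced_quotes points at
lines where an opening quote has no partner on the same line.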

________________________________________________

Re: EOL's as the source of the problem.

Since all EOL's are coming from JEDIT, I can assume they're all the same
structure? (Whatever jre 6 rev 26 under windows  produces?)

One of my steps in conversion is to remove ALL EOL characters from the
file and then insert EOLs with jEdit before any tab character (placed
by jEdit on chapter and book names) or any exactly two-digit number
with spaces preceding and following it. (In the case of the HNC, the
space following has already been further modified to a tab for nice
import to a spreadsheet.)  For a full Bible this leaves a few verses in
Psalms and Isaiah that I have to deal with individually (unless the
Bible properly spells out any numbers appearing in the text, in which
case it becomes easier to also insert EOLs before three-digit numbers).

After removing all return characters, my document is one row long,
somewhere around 5 megabytes wide (or 1.2 megabytes for NT only.)


The replace structure is like this
1. remove newlines
Search: \n
Replace:

2. add VPL new lines
Search: ( [0-9][0-9] )
Replace: \n$1

Search: \t
Replace: \n\t

With regexp enabled. In windows (both vista and XP, but 80% XP because
the Sword windows utilities won't run on Vista.)
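
For anyone wanting to reproduce these steps outside jEdit, the three
search/replace passes can be sketched in Python as follows. The
function name rebuild_vpl is hypothetical, and dropping \r along with
\n is an added assumption for files read with CRLF endings intact.

```python
import re


def rebuild_vpl(text: str) -> str:
    """Apply the three search/replace steps in the order given above."""
    # Step 1: remove every line break, leaving one long row.
    flat = text.replace("\r", "").replace("\n", "")
    # Step 2a: start a new line before any two-digit number that has a
    # space on both sides (jEdit's $1 corresponds to \1 here).
    flat = re.sub(r"( [0-9][0-9] )", r"\n\1", flat)
    # Step 2b: start a new line before every tab.
    return flat.replace("\t", "\n\t")
```

Note that step 1 joins lines with nothing in between, exactly as the
empty replacement does, so words separated only by a line break in the
input end up run together.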

With the Holy New Covenant work, I replaced the original EOLs with the
text "<>", in order to preserve paragraphing.  The document still has a
bunch of diamonds in it awaiting resolution at some point. But the EOLs
are still inserted by me 100% with jEdit search/replace.


__________________
(*) - This file started as the 'palmdoc' Word document at the
thomhackett.com site I referred to earlier. I've textified it and VPL'd
it (note that the text has paraphrased, grouped verses in it, so I will
later need to IMPortify it or OSISify it).  The proper coding for me
starts with getting the text for each 'verse' into a single row and
building the verse declarations in a spreadsheet.


Other notes: In addition, the text file that came out of the Word "save
as UTF-8" had what appeared to be bulleted text on 4-5 verses, which I
reverted to straight text, no bullets, no return characters.  In all
except one verse this appeared to be completely bogus, but I haven't
followed up with the original document to see if bullets were there or
not. They didn't convert properly even if they were present originally
(they kept on going well after any bulleted list would have stopped).


Re: Detecting Problem Characters

refdoc@gmx.net
In reply to this post by Mike Hart
Dear Michael,

We do not copy out of other Bible programmes, but create modules from
source. So, wherever Zefania picked up their text, that is where we
should go too.

With regard to the specific text you describe: if there is no absolute
clarity about its PD status, the usual way we would deal with it is to
contact the last copyright owner and ask them for permission. If this
permission is superfluous, then so be it; at the very least it creates
a good relationship, and likely it leads to long-term collaboration.
While this might take longer and be more difficult than simply assuming
PD status (and waiting until we are challenged), time and again we have
profited from this: by getting better quality texts, by gaining
reputation among genuine copyright owners, etc.

Peter


On 23/09/11 21:01, Michael Hart wrote:

> On 9/23/2011 1:30 PM, Greg Hellings wrote:
>>
>> For reasons not entirely mine to go into, nor germane to your
>> questions, CrossWire policy is generally to ignore zefania files.
>> Among such, as you point out, is that many of their files have been
>> found to violate copyright laws.
>>
>> --Greg
> Hopefully, I did not state ANY file at the Sourceforge site I linked
> to IS copyrighted, only that the one I'm working on is possibly still
> under copyright in the USA, but probably isn't (see my TMI explanation
> below.)  This specific document is right on the line at every step of
> the way, but I believe it fell through the gap in 1998 and did become
> public domain, but I don't have a way to prove that at this time. As
> far as I know noone has claimed the Riverside NT is still copyrighted,
> but my method for works 1923-1964 was to positively answer beyond any
> doubt the status of the work before publishing them.  The Riverside NT
> is 'almost, but not quite' ready.
>
> I do want to know if someone thinks or knows there are clear
> violations at the Sourceforge site I linked to.
>
> http://sourceforge.net/projects/zefania-sharp/files/
>
> ______________________
>
> As to the Sourceforge Zefania site itself:
>
> The Sourceforge project I linked to which produces zefania works
> appears to deal only with public domain works (the Riverside NT aside,
> and I've seen other lists that name Riverside NT 'public domain', so I
> suspect there is evidence I don't have yet.) IF any site that has
> inadvertently produced illegal copies of copyrighted works were banned
> from consideration, there wouldn't be anything left.  Amazon is
> guilty, Internet Archive is guilty, CCEL is guilty, Google is guilty,
> etc.  (I'm not sure about Gutenberg, but I wouldn't be surprised if
> even they have gone afoul of copyright law and had to retract
> works.)   I prefer to evaluate each work on its own, and whatever
> pipeline it comes by is a publishing concern.
>
> The group working on the Sourceforge Zefania texts is not the group
> that produced Opensong, which did or does have clear ties to ignoring
> the law.  It would be the same as saying Crosswire and the Sword
> Repositories should now be shunned because Lyricue uses Crosswire
> texts, but that project also provides clearly copyrighted texts from
> some other source.  The Zefania project started as a Linux Bible
> reader for Sharp Zaurus, clearly intending to work in a similar way to
> Crosswire on only public domain texts.  Similar to the palm bible apps
> from the 90s,  the Zaurus was resource and memory limited, so they had
> to create their own scheme to make the Bible fit.    Whatever other
> quasi-legal projects that piggyback onto it, don't blame the
> original.   I see nothing wrong with sourcing from the Zefania site
> itself.
> ______________
>
> For what it's worth (about the Riverside NT possibly being copyright):
>
> The Riverside new testament is also available on the Internet Archive at
>
> http://www.archive.org/details/riversidenewtest027415mbp
>
> However, the text output from this digitization is in much much poorer
> condition than that at the Zefania site.
>
> It is my belief that this work is already public domain, but there is
> a small window of time that I haven't excluded.
>
> The copyright law changed in 1963 to make copyrights extend 75 years
> from publication or 50 years after the life of the author, which
> remained in effect until September 1998 when congress extended to Life
> +70 years of the author.  This means There is a very likely chance
> that the Riverside NT is already public domain if 1) It was published
> before September 1923, and 2) William G Ballentine died before
> September 1948.  I believe it is customary (even in 1923) to start
> printing the next year somewhere around September.   That is you can
> get books published in 2012 right now.  If Ballentine died after
> September 1948,  then his death determines the PD date, but in all
> likelihood it will be within the next decade.  Add to this that the
> Copyright renewal in (1948?) was recorded to the estate of Ballentine
> by his wife. The question mark is to remind me of the fact that it was
> recorded posthumously after 1948, but there is or was some law that
> allowed this (widow's law).  This implies the copyright was
> continuous, so even tho recorded late, the renewal was effective 1948,
> but again I'm not a lawyer on this point.
>
> So the latest this work MIGHT be under copyright is around 2018.
>
> Also consider that without evidence to the contrary, Publications are
> assumed Jan 1 of the year they are published (earliest possible
> date).  Therefore is is reasonable to assume this work is already in
> copyright and legally can be distributed with the caveat that as soon
> as someone produces evidence of a fall publication OR a late death
> certificate, a cessation would be necessary.
>
> Also, note that I am stripping any copyrightable condition that may
> exist on the Zefania work to get back to the original work, then
> building it up again (specifically, I am not working from the OSIS
> document available there to create a module on this work).  That is,
> I'm going back to bare text (even removing all the return characters
> and extra white space: runs of more than one space, tab characters,
> etc.) and rebuilding the structure based on verse numbers.  Because of
> this, I highly doubt that there could be any legal action over the
> sourcing of this document being contaminated, and even in this day of
> lawyers in the US, I'm fairly confident that my process would
> withstand scrutiny if there were legal action.  I think I've covered
> all my bases well enough.  Even today, additions to a public domain
> work are copyrightable, but the original work remains in the public
> domain.
>
> Also note that under 1923 copyright law, fair-use rights in general
> remained with the citizen, not the holder. The original 1923 copyright
> viewable at the Internet Archive link above has no restrictions
> listed, so the only restrictions are those listed under the
> constitution and 1909 law itself (sale of the work for profit.)
> Storing, modifying, and (arguably, but not my intent until it is
> clearly legal to do so) distribution for non-profit reasons are not
> restricted.  Therefore I see nothing unethical, illegal, or immoral
> with my current work in modifying the document for my own use, and
> preparations for a release when the work IS public domain without a
> doubt.
>
>
> _______________________________________________
> sword-devel mailing list: [hidden email]
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page




Re: Detecting Problem Characters

David Haslam
In reply to this post by Mike Hart
Mike,

I just downloaded the csv file, and had no trouble opening it with Notepad++.

Just because a file has .csv as the file extension doesn't mean that one has to use OO or Excel to edit it.
If I were to apply any format conversions to this, I'd probably make use of a bespoke TextPipe filter.
Others would use grep or some other favorite tool for streamed edits.

The file has 8183 lines, and all the EOLs are Windows/DOS (i.e. CRLF).

I can't see anything amiss around line 61, which reads:
09	"Don't think this to yourselves: 'Abraham is our father!' I tell you that God could make children for Abraham from these rocks here.

NB. Using OO CALC might stumble at characters it sees as cell delimiters. Likewise for Excel.

btw. The file does contain 2460 occurrences of the string "<>", which might cause some issues.

The file also contains one private use area character in line 1191, just after the word Boanerges.
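For anyone wanting to run checks like the ones above (line count, EOL style, occurrences of a string, non-ASCII and private use area characters) without a particular editor, here is a minimal Python sketch. It is an illustration only, not the tool used for the figures above; the function name `summarize_text`, its `needle` parameter, and the sample text are made up for the example, and it assumes the file is meant to be UTF-8.

```python
import re
import unicodedata
from collections import Counter


def summarize_text(data: bytes, needle: str = "<>", encoding: str = "utf-8") -> dict:
    """Summarize line endings and non-ASCII characters in raw file bytes."""
    # Undecodable bytes become U+FFFD, which then shows up in the report.
    text = data.decode(encoding, errors="replace")

    # Count CRLF first so its CR and LF are not also counted as bare CR/LF.
    crlf = text.count("\r\n")
    eols = {
        "CRLF": crlf,
        "LF": text.count("\n") - crlf,
        "CR": text.count("\r") - crlf,
    }

    # Split only on CRLF, CR and LF so line numbers match a typical editor.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing EOL does not start another line

    non_ascii = Counter()
    private_use = []
    for line_no, line in enumerate(lines, start=1):
        for ch in line:
            if ord(ch) > 127:
                code = f"U+{ord(ch):04X}"
                non_ascii[code] += 1
                # General category "Co" marks private use characters.
                if unicodedata.category(ch) == "Co":
                    private_use.append((line_no, code))

    return {
        "lines": len(lines),
        "eols": eols,
        "needle_count": text.count(needle),
        "non_ascii": dict(non_ascii),
        "private_use": private_use,
    }


if __name__ == "__main__":
    # Made-up sample: one private use character, one curly apostrophe.
    sample = 'Boanerges\ue000 text\r\n"Don\u2019t" <> <>\r\nplain\n'.encode("utf-8")
    report = summarize_text(sample)
    for key, value in report.items():
        print(key, value)
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in. Reading as bytes keeps Python from normalizing the line endings before they are counted.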

David

Re: Detecting Problem Characters

Mike Hart
In reply to this post by refdoc@gmx.net
On 9/24/2011 7:59 AM, Peter von Kaehne wrote:
> Dear Michael,
>
> We do not copy out of other Bible programmes, but create modules from
> source. So, where ever Zefania picked up their text - that is where we
> should go too.
The Zefania xml project STARTED as a means to get the bible onto the
Zaurus with as little overhead as possible (less compression, fewer
extra characters.)  There are still Bible programs using it, but the
Zefania XML sourceforge site isn't related to any specific program.

The sourceforge site I referenced is a Bible Archive and not a
repository for any specific program.  If you start investigating, you'll
find what acts as a news/home page for the Zefania Archive at

http://zefania.blogspot.com/

Which declares itself a Bible Archive and not a 'program database'
site.  They encourage the use of their sources for other programs.  They
are actively seeking permissions and working with publishers.  It seems
like a good fit to me, and I don't see why Crosswire and Zefania haven't
conjoined like go-bible did.  They probably do need guidance with
respect to some of their documents.

For documents that are recent and electronic in original form (such as
the Holy New Covenant), I do completely agree that starting with the
original source is the best method.

For documents prepared prior to ~1960?, there was never an 'original'
that existed in electronic form, but instead came from someone's
typewriter which was then edited, then re-edited, then manually marked
up by a typesetter on a press. Starting from the source in these cases
is always going to involve a scanner, and SHOULD involve a scanner with
a typewritten manuscript (which I'm not aware has ever happened) because
that was the author's original intent.  In a perfect world that is what
I would do.  (One of my current, dormant projects is 'The Word Made
Fresh', which is starting from a scanned book.  I have scanned it and am
not happy with the results.  I'm waiting on better technology to retry.
This is a 1980s work, but the granddaughter I have communicated
with stated that all that remain are the printed books.  Whatever
electronic versions existed are lost to time.)

Since I'm dealing with a pre-digital age document, I don't see that
sourcing from an Archive is wrong.  In this case, the laws in Germany
are probably different than the US,  so I'm taking extra care to remove
any and all encoding that may have been previously copyrighted prior to
being placed in the archive, and being extra careful with releasing the
document because it falls in the perilous realm of 1923-1964 publication
date.
> Wrt the specific text you describe - if there is no absolute clarity re
> its PD status, the usual way we would deal with it is contacting the
> last copyright owner and ask them for permission. If this permission is
> superfluous, then so be it, it creates at the very least a good
> relationship and likely it leads to long term collaboration. While this
> might take longer and is more difficult than simply to assume PD status
> (and wait until we are challenged), time and again we have profited from
> this - by getting better quality texts, by gaining reputation among
> genuine copyright owners etc.
This book was first published in 1923 by Macmillan, but the copyright is
to the author and not the publisher.  The renewal was to the estate of
the author and not a publisher.  Any rights statement would have to come
from the rights holder.  Recently published works of scripture are more
frequently assigned an owner like a bible society, and have a contact
name.  Tracking down the owner of the rights when the last information
is from 1948 is very difficult.  For 1923-1964 works with personal
owners, I have tried many times contacting the most recent publisher, and
generally if I get any response at all, it is that they don't own the
rights.  For this work, I haven't because I don't see a publisher that
would likely still exist in the same form.



Re: Detecting Problem Characters

David Haslam
William Gay Ballantine died January 10, 1937, Springfield, Massachusetts.

See http://www.hymntime.com/tch/bio/b/a/l/ballantine_wg.htm

He was the fourth president of Oberlin College (1891-6).

See http://www.oberlin.edu/archive/holdings/finding/RG2/SG4/biography.html

The Riverside NT was first Published: Boston, Mass. : Houghton Mifflin Co., 1923.

NB. A revised edition was published in 1934. Information from WorldCat.

It's therefore important to know whether the electronic text you unearthed was for the first edition or the revised edition.

His second son, Henry Winthrop Ballantine studied Law and lived until 1951.

His third son, Edward Ballantine was a composer and lived until 1971.
See http://en.wikipedia.org/wiki/Edward_Ballantine

David

Re: Detecting Problem Characters

David Haslam
In reply to this post by Mike Hart
PS. There are images of both editions of The Riverside NT in http://bibles.wikidot.com/ballantine

David

Re: Detecting Problem Characters

David Haslam
In reply to this post by Mike Hart
PS2. A modern reprint of the Riverside New Testament was published by Greenbie Press in November 2008. See

http://berkelouw.com.au/catalogue/books/9781443727280/the-riverside-new-testament-a-translation-from-the-original-greek-into-the-english-of-today

A facsimile reprint also appeared in India in Jan 2010.  See http://www.flipkart.com/books/1153140101

I even found another reprint here
http://www.infibeam.com/Books/info/william-g-ballantine/riverside-new-testament-translation-original-greek-into-english/9781149532164.html?utm_term=William+Ballantine_1_10
with the ostentatious claim, "This is an EXACT reproduction of a book published before 1923. This IS NOT an OCR'd book with strange characters, introduced typographical errors, and jumbled words. ...."
It would seem that they were less diligent in their research than they should be!

David

Re: Detecting Problem Characters

refdoc@gmx.net
In reply to this post by Mike Hart
Dear Michael,

The Zefania project is far too well known to us and we will not touch
their texts with a barge pole. Not even a very long one. Not even one
held by someone else.

Please read the mailing list archives prior to asking/discussing further.

Peter


Re: Detecting Problem Characters

refdoc@gmx.net
In reply to this post by Mike Hart
On Sun, 2011-09-25 at 11:51 -0400, Michael Hart wrote:
> Since I'm dealing with a pre-digital age document, I don't see that
> sourcing from an Archive is wrong.

No, it is not, you are right there - to a degree. But even in the world
of electronic re-publishing of old documents there is "original" and
"derived". I am not aware of anyone who set out to create a Zefania text
as the first and best representation of an existing book text. If you
look hard enough/ask enough there is usually somewhere someone who
published the text as plain text with paragraphing, as HTML, as MS word
document, as whatever. Zefania publishes derived material. This is not
bad as such - we do the same. But to copy from one derivation to another
results in Chinese-whispers-like mistakes during bad transformation and
nearly always results in loss of information.  For that reason we
routinely tell people not to copy out of our archives for their own
purposes, but to create their own material from the sources we used.

So, even before we started discussing the reputation of Zefania as a
reliable source and its founder as a person worthy of trust (for which I
recommend reading of the archive), the answer would already be - do not
use their texts, but find whoever digitised their text and use what they
give you.

> For this work, I haven't because I don't see a publisher that
> would likely still exist in the same form.

Sad as that is, if you cannot ascertain who owns the copyright, and
the copyright likely still exists (or at least has not reliably
expired), then you cannot re-publish the text. We certainly - and that
is unrelated to the Zefania discussion - would not publish a text where
we do not know who owns it.

You are free to do what you like - including setting up your own
repository, but we will not choose to publish a text with a dodgy legal
status. An honest mistake is one thing, but to risk our reputation
deliberately for a text of YASEBT status ("yet another spurious English
Bible translation") makes no sense to me. Sorry.

Yours

Peter



Re: Detecting Problem Characters

David Haslam
Hi Peter,

Though I might be well inclined to categorize the "Holy New Covenant" as a YASEBT, historically the word "spurious" is not a good epithet to apply to W G Ballantine's Riverside NT.

Ballantine was a highly regarded evangelical NT scholar, and the Riverside NT did break new ground in the history of Bible translations. It was possibly the earliest translation to use paragraphing, and to do away with verse numbers. It set a profile that was followed by several more well-known English translations of the NT that still have some visibility even now.  Its translation source was Nestle's Greek.

In its day, The Riverside translation was well-received in North America. It's only fallen out of use because of the plethora of more recent translations of the whole Bible.

PS. Taliaferro does not group the Riverside among translations with an agenda or translation bias.
cf. Bible Version Encyclopedia.

David