Lucene search index and Coptic ?


Lucene search index and Coptic ?

David Haslam
If you search the module SahidicBible using either PocketSword or using Xiphos with the Lucene method selected, the results list is enormous and erroneous.

Example: Search for the word "ⲉⲩϩⲩⲡⲟⲙⲟⲛⲏ"

This word actually occurs in only two verses: Romans 5:3 and James 1:3

The Lucene method lists 12460 results in Xiphos, and slightly fewer with PocketSword.

What's the explanation?

If our Greek and Hebrew modules gave such strange results, we'd be getting loads of feedback.
Coptic Bible students may be less inclined to notice or report such weirdness.

Best regards,

David

PS. I just updated the .conf file and submitted it to the Modules Team.




Re: Lucene search index and Coptic ?

David Haslam
The results total of 12460 compares with 14212 module verses that contain any text. A search that finds the 10-letter search key in 87.67% of those verses is clearly a serious matter, one so egregious that it almost defies rational explanation.

Here's a possible clue.

Taking the unique letters from the example search word, and inserting a space between each, we get this:

ϩ ⲉ ⲏ ⲙ ⲛ ⲟ ⲡ ⲩ
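
The list above can be reproduced mechanically. A minimal Python sketch, assuming the search word is exactly the ten code points of the example and that "unique letters" means distinct code points sorted by code point value (the variable names are illustrative only):

```python
# The example search word, spelled out as code points to avoid copy errors.
WORD = "\u2c89\u2ca9\u03e9\u2ca9\u2ca1\u2c9f\u2c99\u2c9f\u2c9b\u2c8f"

# Distinct letters, ordered by code point, joined with single spaces.
unique_letters = sorted(set(WORD))
search_key = " ".join(unique_letters)

print(len(WORD), len(unique_letters))
print(search_key)
```

The one letter from the lower range (U+03E9) sorts first, which is why it leads the list even though it is the third letter of the word.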

Using this as the search key, and selecting multi-word search type in Xiphos, I got 9049 results using the Advanced Search dialog.

Admittedly that's only 72.6% of the original number of results, or 63.67% of the non-empty verses.

Even so, the results verse list starts in almost the same way as before.

Genesis 3:10,11,14,15,16,19,20,21,...

However, with such high proportions of the non-empty verse count, this is not so surprising.
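
For reference, the proportions quoted in this post follow directly from the three counts reported above (12460 Lucene results, 9049 multi-word results, 14212 non-empty verses). A short Python check, with names chosen only for illustration:

```python
lucene_hits = 12460      # Lucene results for the full word
multiword_hits = 9049    # multi-word results for the eight separate letters
nonempty_verses = 14212  # verses in the module that contain any text


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


print(pct(lucene_hits, nonempty_verses))     # share of non-empty verses hit by Lucene
print(pct(multiword_hits, lucene_hits))      # multi-word hits relative to Lucene hits
print(pct(multiword_hits, nonempty_verses))  # share of non-empty verses hit by multi-word
```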

This comparison suggests the following plausible explanation for the weird result with Lucene.

Is the software used by the Lucene search treating each Coptic letter as a word?
i.e. just as it would if each Unicode character were an Egyptian hieroglyph or a Han/Hangul ideograph.

Maybe this conjecture needs teasing out in further detail, if perhaps only some of the Coptic Letters are misclassified.
After all, the Coptic letters in the module are from two separate Unicode blocks.
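
To make the "two separate Unicode blocks" point concrete, here is a small Python sketch that labels each letter of the example word with its block. The block ranges are the standard ones (Greek and Coptic U+0370..U+03FF, Coptic U+2C80..U+2CFF); the helper name coptic_block is made up for this illustration.

```python
import unicodedata

# The example search word, spelled out as code points.
WORD = "\u2c89\u2ca9\u03e9\u2ca9\u2ca1\u2c9f\u2c99\u2c9f\u2c9b\u2c8f"


def coptic_block(ch):
    """Name the Unicode block of a character (only the two relevant blocks)."""
    cp = ord(ch)
    if 0x0370 <= cp <= 0x03FF:
        return "Greek and Coptic"
    if 0x2C80 <= cp <= 0x2CFF:
        return "Coptic"
    return "other"


for ch in WORD:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} [{coptic_block(ch)}]")
```

Nine of the ten letters fall in the higher Coptic block; only the third letter comes from the Greek and Coptic block.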

But if this is really the root cause, then it's clearly a critical bug in the Lucene software.

Can anyone think of a better explanation?

Best regards,

David

Re: Lucene search index and Coptic ?

David Haslam
If you examine the result preview pane in the Xiphos Advanced Search dialog, the problem becomes apparent.

Most Coptic Unicode characters are not displayed correctly.



The remainder seem to have been converted to U+FFFD REPLACEMENT CHARACTER.

i.e. All these Coptic letters are basically not handled aright by this part of the software:

U+2C81 ⲁ COPTIC SMALL LETTER ALFA
U+2C83 ⲃ COPTIC SMALL LETTER VIDA
U+2C85 ⲅ COPTIC SMALL LETTER GAMMA
U+2C87 ⲇ COPTIC SMALL LETTER DALDA
U+2C89 ⲉ COPTIC SMALL LETTER EIE
U+2C8B ⲋ COPTIC SMALL LETTER SOU
U+2C8D ⲍ COPTIC SMALL LETTER ZATA
U+2C8F ⲏ COPTIC SMALL LETTER HATE
U+2C91 ⲑ COPTIC SMALL LETTER THETHE
U+2C93 ⲓ COPTIC SMALL LETTER IAUDA
U+2C95 ⲕ COPTIC SMALL LETTER KAPA
U+2C97 ⲗ COPTIC SMALL LETTER LAULA
U+2C99 ⲙ COPTIC SMALL LETTER MI
U+2C9B ⲛ COPTIC SMALL LETTER NI
U+2C9D ⲝ COPTIC SMALL LETTER KSI
U+2C9F ⲟ COPTIC SMALL LETTER O
U+2CA1 ⲡ COPTIC SMALL LETTER PI
U+2CA3 ⲣ COPTIC SMALL LETTER RO
U+2CA5 ⲥ COPTIC SMALL LETTER SIMA
U+2CA7 ⲧ COPTIC SMALL LETTER TAU
U+2CA9 ⲩ COPTIC SMALL LETTER UA
U+2CAB ⲫ COPTIC SMALL LETTER FI
U+2CAD ⲭ COPTIC SMALL LETTER KHI
U+2CAF ⲯ COPTIC SMALL LETTER PSI
U+2CB1 ⲱ COPTIC SMALL LETTER OOU
U+2CC1 ⳁ COPTIC SMALL LETTER SAMPI
U+2CE8 ⳨ COPTIC SYMBOL TAU RO

Only the few Coptic letters in the block U+03E2 to U+03EF are displayed aright.

It's no wonder that a search has so many spurious results if most of the search space has been squashed into Unicode replacement characters.
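
A toy Python illustration of that last point. It only models the hypothesis stated here (every letter in U+2C80..U+2CFF collapsing to U+FFFD); it is not a reproduction of what CLucene or Xiphos actually does, and the second string is made up, not a real Coptic word.

```python
REPLACEMENT = "\ufffd"


def squash(text):
    """Replace every code point in the Coptic block U+2C80..U+2CFF with U+FFFD."""
    return "".join(
        REPLACEMENT if 0x2C80 <= ord(ch) <= 0x2CFF else ch for ch in text
    )


# The example search word, spelled out as code points.
WORD = "\u2c89\u2ca9\u03e9\u2ca9\u2ca1\u2c9f\u2c99\u2c9f\u2c9b\u2c8f"
# A made-up string of the same length with only the first letter changed.
OTHER = "\u2c81" + WORD[1:]

print(WORD != OTHER)                  # the two strings differ
print(squash(WORD) == squash(OTHER))  # but they are identical once squashed
```

Under that assumption, any two strings with the same length and the same lower-range letters in the same positions become indistinguishable.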

I'm a Windows user, as most of you know already.
Does the same thing happen in Xiphos under Linux?

Is this an issue common to all SWORD based front-ends?
The fact that we see similar results in PocketSword strongly suggests it is.

Best regards,

David

Re: Lucene search index and Coptic ?

Greg Hellings
Unicode replacement characters typically indicate a font issue, and would not normally be represented as such within the internals of a program. Have you tried using one of the command line utilities or examples directly?

--Greg

On Wed, Apr 26, 2017 at 2:48 PM, David Haslam <[hidden email]> wrote:
[...]



--
View this message in context: http://sword-dev.350566.n4.nabble.com/Lucene-search-index-and-Coptic-tp4657103p4657106.html
Sent from the SWORD Dev mailing list archive at Nabble.com.

_______________________________________________
sword-devel mailing list: [hidden email]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page



Re: Lucene search index and Coptic ?

DM Smith-5
In reply to this post by David Haslam
Consider using Luke to analyze the constructed Lucene index. See: https://code.google.com/archive/p/luke/
I think you’ll need one that matches Lucene 1.9.1. Maybe 1.4.x.

DM


On Apr 26, 2017, at 3:48 PM, David Haslam <[hidden email]> wrote:

[...]

Re: Lucene search index and Coptic ?

Troy A. Griffitts

So, as a side note to this thread,

The Sahidic Bible is maintained at coptot.manuscriptroom.com:

http://coptot.manuscriptroom.com/transcribing?docID=1620025&userName=PUBLISHED

and we regularly export from there and import into swordweb, which is used for their browser plugin (first link on Christian Askeland's wonderful resource list for Coptic):

https://sites.google.com/site/askelandchristian/copticlinks

We don't index the text.  They typically search with regex (and yes, they know about the {byte_count} anomaly with our regex search).
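
A hedged aside on the {byte_count} remark: assuming it means that a counted quantifier in the regex search counts UTF-8 bytes rather than characters (an interpretation, not something confirmed in this thread), the difference can be pictured with Python's own re module, which matches either text or bytes:

```python
import re

# The example search word from this thread, spelled out as code points.
WORD = "\u2c89\u2ca9\u03e9\u2ca9\u2ca1\u2c9f\u2c99\u2c9f\u2c9b\u2c8f"
as_bytes = WORD.encode("utf-8")

# Character-oriented matching: ten "any character" positions cover the word.
text_match = re.fullmatch(r".{10}", WORD)

# Byte-oriented matching: a count of 10 no longer covers the word,
# because the count has to equal the UTF-8 byte length instead.
bytes_match_10 = re.fullmatch(rb".{10}", as_bytes)
bytes_match_len = re.fullmatch(rb".{%d}" % len(as_bytes), as_bytes)

print(bool(text_match), bool(bytes_match_10), bool(bytes_match_len))
```

This says nothing about how the SWORD regex search is implemented; it only shows why a count written for characters would be wrong for a byte-oriented matcher.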

-Troy



On 04/26/2017 03:21 PM, DM Smith wrote:
[...]

Re: Lucene search index and Coptic ?

David Haslam
Even if Troy's good friends don't use the Lucene index for their work on Coptic manuscripts, that's no reason not to pursue this issue in more detail.

The Coptic block 2C80..2CFF was added with Unicode 4.1 which was released in March 2005.

Are we concluding that the Lucene indexing software built into SWORD is so old that it doesn't support Unicode 4.1 ?

That's what it's beginning to look like to me.

Does it even support the Tagalog block 1700..170C, 170E..1714 (added with Unicode 3.2 in March 2002) ?
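
One way to picture what "older than Unicode 4.1" would mean in practice: Python's standard unicodedata module ships both a current Unicode database and a frozen Unicode 3.2.0 one (unicodedata.ucd_3_2_0). The sketch below compares how each classifies one letter from each Coptic range plus the first Tagalog letter. It is only an analogy and says nothing about which character tables CLucene or Lucene 1.9.1 actually use.

```python
import unicodedata

samples = {
    "U+03E9 (lower-range Coptic letter)": "\u03e9",
    "U+2C89 (Coptic block, Unicode 4.1)": "\u2c89",
    "U+1700 (Tagalog, Unicode 3.2)": "\u1700",
}

for label, ch in samples.items():
    current = unicodedata.category(ch)
    old = unicodedata.ucd_3_2_0.category(ch)
    print(f"{label}: current={current}, Unicode 3.2.0={old}")
```

Where the older database reports Cn (unassigned) for a code point, software built on tables of that vintage has no basis for treating it as a letter, which would fit the symptoms described in this thread.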

Best regards,

David

Re: Lucene search index and Coptic ?

David Haslam
In reply to this post by DM Smith-5
Hi DM,

I wouldn't know where to begin with Luke; there's no documentation for it in the code archive you cited.

Is it even something a Windows user like me could do anything with?

Best regards,

David

Re: Lucene search index and Coptic ?

David Haslam
In reply to this post by Greg Hellings
No - I've not tried any command line utilities related to Lucene search.

Were you thinking of diatheke?

If it is a font issue, it's not as if I hadn't already installed the recommended Antinoou font for the SahidicBible module and selected this font in Xiphos.

http://www.evertype.com/fonts/coptic/

On the other hand, the font used in Xiphos for general UI dialogs doesn't support the Coptic block 2C80..2CFF.

And I haven't found any user accessible means to change that in Xiphos to one with wider coverage, such as Code2000 or Unifont.


NB. My activity in this context was prompted by

(1) the recent beta version 1.4.8 (19) of PocketSword becoming available to test
(2) the old post by Troy about wanting to use PocketSword to view the SahidicBible module.

The latter was posted not long after PocketSword first hit the problems triggered by iOS updates that made it impossible to install the search index. Troy asked about search.

Now that PocketSword is "on the mend" thanks to Nic and his helpers, it seemed a sensible thing to look at old posts to sword-devel relating to this PocketSword issue.

NB. PocketSword uses Lucene search for every search.
There are no user options to use a different search method, unlike in other front-ends such as Xiphos.

The switch to compare with Xiphos was because the search results in PocketSword for this module were so unexpected.

Best regards,

David




Re: Lucene search index and Coptic ?

Greg Hellings
In reply to this post by David Haslam


On Thu, Apr 27, 2017 at 3:59 AM, David Haslam <[hidden email]> wrote:
Even if Troy's good friends don't use the Lucene index for their work on
Coptic manuscripts, that's no reason not to pursue this issue in more
detail.

The *Coptic* block *2C80..2CFF* was added with *Unicode 4.1* which was
released in March 2005.

The Lucene compatibility we aim for, according to DM earlier in this thread, is Lucene 1.9.1. That version was released in March of 2006, so it's definitely feasible that it might not support Unicode 4.1. That would depend on what Lucene's policy is for updating existing release branches with new Unicode versions and support.

CLucene is notoriously far behind even what Lucene offers (CLucene is, essentially, abandonware at this point), supporting only up to something like Lucene 2.3 at the latest, and it probably lags further behind in things like individual language and script support.

For an additional data point you might try using BibleDesktop's Lucene support. That, at least, uses the upstream Lucene instead of CLucene and stands a chance of having a newer set of script support.

--Greg


Re: Lucene search index and Coptic ?

Troy A. Griffitts
Has anyone checked the encoding entry in the conf file?

On April 27, 2017 6:57:48 AM MST, Greg Hellings <[hidden email]> wrote:


[...]

Re: Lucene search index and Coptic ?

David Haslam
Yes - of course!

Encoding=UTF-8

It's not primarily a font issue, even though that might be a further annoyance in Xiphos.

Having a font without coverage for the Coptic block in question cannot by any stretch of logic account for a search that finds 623,000% of the only 2 true matches.

A font can only operate when the results are displayed, not during the search "under the hood".

As regards entering the search term in Xiphos, the font in the GUI clearly doesn't support the Coptic block.
If however, you copy the word back from the search key field and paste it into BabelPad, it's quite certain that the word was unchanged.

The Coptic Bible text is displayed fine by Xiphos and PocketSword when viewed normally.
This thread only relates to the search feature of these front-ends
(and any others based on SWORD that use the same Lucene).

btw. Would it be true to assume that ticking "fuzzy search" in PocketSword in effect says "goodbye" to Lucene?

Best regards,

David

PS. My updated .conf file is in the attached Zip (if this works herewith).
Already submitted to the modules team with a request to update the module to version 1.1.1
Included here FIO. It makes no difference to the matter being pursued.
Mostly correcting elementary mistakes that were there from the first release.

Re: Lucene search index and Coptic ?

David Haslam
FIO.  Zip contains the .conf file that I updated yesterday.

sahidicbible.zip

btw.  I just missed out the "upload file" step before.

David

Re: Lucene search index and Coptic ?

David Haslam
In reply to this post by Greg Hellings
Greg wrote, "Have you tried using one of the command line utilities or examples directly?"

Well, yes, but now I have hit a brick wall.

Assuming that mkfastmod.exe exactly mimics Xiphos in how it constructs the Lucene index, that's not the problem.

The problem is how, in Windows, to get the non-ANSI search key into diatheke.

Well one might think that this is simply a matter of creating a suitable CMD file containing the following line:

xiphos\diatheke -b SahidicBible -s lucene -k ⲉⲩϩⲩⲡⲟⲙⲟⲛⲏ >test.log

Here xiphos was set up long ago on my PC as a symbolic link to the Xiphos directory, using the mklink command.

This is what running such a CMD file gave:

Verses containing "ндвнд«ц«нд«ндьндЃндондЃндЭнд┼"-- none (SahidicBible)

The quoted search key is 29 characters of obscure text.
That's 3 characters for every higher block Coptic letter
and 2 characters for the third letter ϩ which is in the lower block.
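
The 29 is what you would expect if each UTF-8 byte of the search key were being shown as a separate character. A quick Python check of the byte counts, assuming the key is the ten-letter example word (which legacy code page then produced the glyphs shown above is not determined here):

```python
# The example search word, spelled out as code points.
WORD = "\u2c89\u2ca9\u03e9\u2ca9\u2ca1\u2c9f\u2c99\u2c9f\u2c9b\u2c8f"

# UTF-8 length of each letter, then of the whole word.
per_letter = [len(ch.encode("utf-8")) for ch in WORD]
total_bytes = len(WORD.encode("utf-8"))

print(per_letter)
print(total_bytes)
```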

It looks as if the fact that Windows (as you know) handles everything as UTF-16 LE
inevitably causes diatheke to convert the search key into something unrecognisable!

The same thing happens without the "-s lucene"!

And it makes no difference whether the CMD file's text is UTF-8 or UTF-16 encoded.

Nice try, Greg. But it's not added much to the identification of the root cause.

Can something like this be tried on a Linux machine for comparison?

Best regards,

David



Re: Lucene search index and Coptic ?

David Haslam
In reply to this post by Troy A. Griffitts
Thanks Troy,

You wrote, "They typically search with regex".

Please can you explain exactly how I could do this in a Windows CMD file (or command line) in order to find (e.g.) the two verses containing the word ⲉⲩϩⲩⲡⲟⲙⲟⲛⲏ

What exactly does diatheke -s regex expect for Unicode character codes?

cf.  Using PCRE escape codes doesn't work!

\x{2C89}\x{2CA9}\x{03E9}\x{2CA9}\x{2CA1}\x{2C9F}\x{2C99}\x{2C9F}\x{2C9B}\x{2C8F}
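
For anyone wanting to repeat the experiment with a different word, the escape string above can be generated rather than typed by hand. A minimal Python sketch (the function name is illustrative); it only builds the PCRE-style string and does not imply that diatheke accepts this notation, which is exactly what failed here:

```python
def pcre_escapes(text):
    """Render each code point as a PCRE-style \\x{HHHH} escape."""
    return "".join(f"\\x{{{ord(ch):04X}}}" for ch in text)


# The example search word, spelled out as code points.
WORD = "\u2c89\u2ca9\u03e9\u2ca9\u2ca1\u2c9f\u2c99\u2c9f\u2c9b\u2c8f"

print(pcre_escapes(WORD))
```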

Best regards,

David

Re: Lucene search index and Coptic ?

Greg Hellings
$ diatheke -b SahidicBible -s lucene -k ⲉⲩϩⲩⲡⲟⲙⲟⲛⲏ
<snip>
12460 matches total (SahidicBible)
$ diatheke -b SahidicBible -s regex -k ⲉⲩϩⲩⲡⲟⲙⲟⲛⲏ
Verses containing "ⲉⲩϩⲩⲡⲟⲙⲟⲛⲏ"-- Romans 5:3 ; James 1:3 -- 2 matches total (SahidicBible)

--Greg

On Fri, Apr 28, 2017 at 9:55 AM, David Haslam <[hidden email]> wrote:
[...]

Re: Lucene search index and Coptic ?

David Haslam
Thanks, Greg.

I guess this shows the limitations of using diatheke in Windows when you have a non-ANSI module.

"What's impossible with Windows is possible in Linux."

The test also confirms that there's a serious issue with Lucene for Coptic texts.
The number of matches (12460) is the same as I observed with Xiphos for this module.

btw. Did you examine the "<snip>" ?

David

Re: Lucene search index and Coptic ?

Greg Hellings
I did not analyze the <snip>. It was multiple screens of text.

Have you tried this in BD? BD uses Lucene directly instead of CLucene. That might have better support for Coptic.

--Greg

On Fri, Apr 28, 2017 at 1:19 PM, David Haslam <[hidden email]> wrote:
[...]

Re: Lucene search index and Coptic ?

David Haslam
As it happens, I've rarely used BD on my Win7 x64 PC, largely because there's no quick fix for tweaking the way I launch BD whenever there's been an update to Oracle Java. I'm still waiting for DM to fix the way BD installs for 64-bit hardware.

Best regards,

David

Re: Lucene search index and Coptic ?

David Haslam
In reply to this post by David Haslam
Just to confirm.

The crazy results from the Lucene search type affect ALL five Coptic language Bible modules.

[CopNT] - The Coptic New Testament
[CopSahHorner] - Sahidic Coptic New Testament, ed. by G. W. Horner
[CopSahidicMSS] - The Sahidica Manuscripts
[CopSahidica] - Sahidica - A New Edition of the New Testament in Sahidic Coptic
[SahidicBible] - Sahidic Bible - Askeland / Schulz

Aside: Our naming convention seems to have been ignored for the Askeland / Schulz module.

Best regards,

David