Apache Solr search in Hebrew (and probably Arabic) documents in Drupal - PDF problem & solution

23.03.2011
submitted by: Shai Weinstein

We recently had a client request to search inside users' uploaded documents for some online tenders.
Drupal's apachesolr and apachesolr_attachments modules, together with Apache Solr, do the job, but we have an exotic language and... exotic challenges.
When text is extracted from an uploaded Hebrew PDF, the words are indexed backwards, because the extraction is not aware of right-to-left text.
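
To make the symptom concrete, here is a minimal sketch of why a backwards-indexed word cannot be found. It is an illustration only: the `visual_order` helper and the sample phrase are hypothetical, and the real extractor may reorder text differently (for example, per line rather than per word).

```python
# Hypothetical illustration: a right-to-left word stored with its
# characters reversed no longer matches the query a user types.

def visual_order(logical_text: str) -> str:
    """Simulate an extractor that emits each word's characters reversed."""
    return " ".join(word[::-1] for word in logical_text.split())

# Hebrew for "tender", in logical (typed) order.
query = "\u05de\u05db\u05e8\u05d6"
# Hebrew for "public tender", in logical order.
document = "\u05de\u05db\u05e8\u05d6 \u05e4\u05d5\u05de\u05d1\u05d9"

# What ends up in the index when the extractor ignores text direction.
indexed_tokens = visual_order(document).split()

found = query in indexed_tokens                # the user's query misses
found_reversed = query[::-1] in indexed_tokens  # only the reversed form matches
```

With these inputs the typed query is absent from the index, while its reversed form is present, which matches the behavior we saw with Hebrew PDFs.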

The default behavior of apachesolr_attachments is to use Tika (either through Solr or as a standalone application) to extract text from uploaded documents.
The text extraction was correct for Hebrew .doc and .docx files, but not for PDFs.

A quick fix was to use the PDFBox command-line application to extract text from uploaded PDFs.
This corrected the right-to-left awareness issue.
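
For readers who want to try the PDFBox route outside Drupal, the call can be sketched roughly as follows. This is not the code from the patch. It assumes the `ExtractText` tool as shipped in the PDFBox 1.x/2.x "app" jar (newer releases use a different command syntax), and the jar path and function names are placeholders chosen for illustration.

```python
import subprocess
from pathlib import Path

# Assumed location of the PDFBox "app" jar; adjust to the local install.
PDFBOX_JAR = "/opt/pdfbox/pdfbox-app.jar"


def build_extract_command(pdf_path, txt_path, jar=PDFBOX_JAR):
    """Build the argument list for PDFBox's ExtractText tool (1.x/2.x syntax)."""
    return [
        "java", "-jar", str(jar),
        "ExtractText",
        "-encoding", "UTF-8",  # write the output as UTF-8 so Hebrew survives
        str(pdf_path),
        str(txt_path),
    ]


def extract_pdf_text(pdf_path, jar=PDFBOX_JAR):
    """Run PDFBox on one PDF and return the extracted text."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if java or PDFBox fails.
    subprocess.run(build_extract_command(pdf_path, txt_path, jar), check=True)
    return txt_path.read_text(encoding="utf-8")
```

Calling `extract_pdf_text("tender.pdf")` would write `tender.txt` next to the PDF and return its contents, which could then be sent to Solr in place of the Tika output.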

apachesolr_attachments module - http://drupal.org/node/840056
Patch to use the PDFBox CLI: apachesolr_attachments_pdfbox.patch

Attachment: apachesolr_attachments_pdfbox.patch (3.61 KB)

Comments

I knew Drupal had this functionality, but this post is certainly helpful.
I will definitely try this too.

Did you find a way to configure Apache Solr to use a reasonable Hebrew stemmer, i.e. one that understands that a search for "drupal" needs to return "bedrupal" as a result too?

"bedrupal" is not really a variation of "drupal", but if you mean that "בדרופל" should return "דרופל" as well, then yes, this works. However, I'm not sure "Drupal" is a word known to the engine, so it is not certain that every variation on the name would work.
