We recently had a client request to search inside users' uploaded documents for some online tenders.
Drupal's apachesolr and apachesolr_attachments modules, together with Apache Solr, do the work, but we have an exotic language and... exotic challenges.
When text is extracted from uploaded Hebrew PDFs, the words are indexed backwards, because the extraction is not aware of right-to-left (RTL) text.
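
To make the symptom concrete, here is a small Python sketch of why backwards words break search. The `reverse_words` helper is hypothetical and only simulates what we saw in the index; it is not how Tika works internally, and the sample phrase is made up.

```python
def reverse_words(text: str) -> str:
    """Simulate an extractor that emits the letters of each word backwards.

    Hypothetical helper: it only imitates the symptom seen in the index.
    """
    return " ".join(word[::-1] for word in text.split())


query = "מכרז"  # the Hebrew word for "tender"
logical_text = "מכרז פומבי"  # text as a reader would type it
indexed_text = reverse_words(logical_text)  # what ends up in the index

# The query matches the correctly ordered text...
found_in_logical = query in logical_text
# ...but not the backwards version, so the document is never found.
found_in_indexed = query in indexed_text
```

The document is in the index, but no query a user would actually type can match it.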
The default behavior of apachesolr_attachments is to use Tika (either through Solr or as a standalone application) to extract text from uploaded documents.
The text extraction was correct for Hebrew .doc and .docx files, but not for PDFs.
A quick fix was to use the PDFBox command-line application to extract text from uploaded PDFs.
This corrected the right-to-left awareness issue.
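
For reference, here is a rough Python sketch of calling the PDFBox command-line application on a single PDF. It assumes the `ExtractText` command and `-encoding` option of the PDFBox 1.x/2.x app jar (PDFBox 3.x uses a different command syntax), and the jar and file names are placeholders. It is not the code apachesolr_attachments runs, just an illustration of the idea.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, txt_path):
    """Build the PDFBox ExtractText command line (1.x/2.x syntax, assumed)."""
    return [
        "java", "-jar", str(jar_path),
        "ExtractText",
        "-encoding", "UTF-8",  # keep the Hebrew characters intact
        str(pdf_path),
        str(txt_path),
    ]


def extract_pdf_text(jar_path, pdf_path, txt_path):
    """Run PDFBox on one PDF and return the extracted text.

    Requires Java and a PDFBox app jar; raises CalledProcessError
    if the extraction fails.
    """
    subprocess.run(build_extract_command(jar_path, pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8")
```

A call might look like `extract_pdf_text("pdfbox-app.jar", "tender.pdf", "tender.txt")`, where all three names are placeholders; the returned text can then be sent to Solr for indexing.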