{"id":184,"date":"2010-01-07T21:52:41","date_gmt":"2010-01-07T21:52:41","guid":{"rendered":"http:\/\/www.muratyaman.co.uk\/wp\/?p=184"},"modified":"2020-04-01T12:37:46","modified_gmt":"2020-04-01T11:37:46","slug":"creating-a-cv-indexing-facility-using-oracle-text","status":"publish","type":"post","link":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/2010\/01\/creating-a-cv-indexing-facility-using-oracle-text\/","title":{"rendered":"Creating a CV indexing facility using Oracle Text"},"content":{"rendered":"<p>Recently, I have struggled to get Oracle Text working on Linux. It was supposed to be a simple addon based on a very simple idea:<\/p>\n<p>+ already installed Apache 2.2, PHP 5.3, Firebird 2.0; which are working fine<\/p>\n<p>+ install Oracle 10g Express Edition<br \/>\n+ create a simple table to store candidate ID and document itself<br \/>\n   &#8211; binary contents of Word document (.doc) or PDF document (.pdf)<\/p>\n<p>I had to download the RPM file for Oracle 10g XE (Western Europe charset) and install it using yum. Then, run its configuration command (\/etc\/init.d\/oracle-xe configure) to finish the setup.<\/p>\n<p>After connecting with sqlplus, I created the table:<\/p>\n<pre lang=\"sql\">\r\nCREATE TABLE mytable (\r\n  id NUMBER PRIMARY KEY,\r\n  cvfile BLOB DEFAULT EMPTY_BLOB()\r\n);\r\n\r\nCREATE INDEX mytable_cvfile_idx ON mytable (cv_file)\r\n  INDEXTYPE IS CTXSYS.CONTEXT\r\n  PARAMETERS ('SYNC (ON COMMIT)')\r\n);\r\n<\/pre>\n<p><strong>&#8220;How can I populate BLOB column of the table?&#8221;<\/strong><\/p>\n<p>1. You can save files in the table as usual using <a href=\"http:\/\/www.oracle.com\/technology\/pub\/articles\/oracle_php_cookbook\/fuecks_lobs.html\">PHP OCI functions<\/a> (<a href=\"http:\/\/www.php.net\/manual\/en\/ref.oci8.php\">PHP documentation<\/a>).<\/p>\n<p>2. Or you can create\/execute a custom procedure that refers to your files inside a specific directory and let it do the rest using <a href=\"http:\/\/download.oracle.com\/docs\/cd\/B19306_01\/appdev.102\/b14258\/d_lob.htm\">DBMS_LOB<\/a> package.<\/p>\n<p>The issue I experienced was that indexing failed with errors. When I looked at view (select * from CTX_USER_INDEX_ERRORS) I saw errors such as:<\/p>\n<p>&#8220;DRG-11207: user filter command exited with status 1&#8221;<br \/>\n&#8220;DRG-11222: Third-party filter does not support this known document format&#8221;<\/p>\n<p>However, the same structure worked perfectly on a Windows server (with <a href=\"http:\/\/ifilter.org\/\">ifilters<\/a> installed).<\/p>\n<p>Then, I tried another method. First, I added a CLOB column into my table, and created similar index on the new column:<\/p>\n<pre lang=\"sql\">\r\nALTER TABLE mytable\r\nADD cvtxt CLOB DEFAULT EMPTY_CLOB();\r\n\r\nUPDATE mytable SET cvtxt = EMPTY_CLOB();-- if there are records\r\n\r\nCREATE INDEX mytable_cvtxt_idx ON mytable (cv_file)\r\n  INDEXTYPE IS CTXSYS.CONTEXT\r\n  PARAMETERS ('SYNC (ON COMMIT)')\r\n);\r\n<\/pre>\n<p>I had to populate the new column with text. I thought while I was saving the files in the database, I might as well extract text from Word documents and PDF documents and save the text in the CLOB column.<\/p>\n<p>Here comes the new challenge!<\/p>\n<p>Thanks to the open source world, we can find free solutions:<\/p>\n<p>+ <a href=\"http:\/\/www.winfield.demon.nl\/\">Antiword<\/a> MS-Word reader (&#8220;.. converts the documents from Word 2, 6, 7, 97, 2000, 2002, and 2003 to text, Postscript, and XML\/DocBook&#8221;)<\/p>\n<p>Installation is easy:<\/p>\n<pre lang=\"bash\">\r\nyum install antiword\r\n<\/pre>\n<p>+ <a href=\"http:\/\/www.foolabs.com\/xpdf\/\">XPDF<\/a> PDF viewer (&#8220;.. includes a PDF text extractor <strong>called pdftotext<\/strong>, PDF-to-PostScript converter, and various other utilities&#8221;)<\/p>\n<p>Download the version for your system (mine was Extra Packages for Linux 5 aka <a href=\"http:\/\/www.rpmfind.net\/linux\/rpm2html\/search.php?query=xpdf&#038;submit=Search+...&#038;system=&#038;arch=x86_64\">EPEL5 x86_64<\/a>)<\/p>\n<pre lang=\"bash\">\r\ncd \/usr\/local\/src\/\r\nwget ftp:\/\/195.220.108.108\/linux\/epel\/5\/x86_64\/xpdf-3.02-15.el5.x86_64.rpm\r\nyum install .\/xpdf-3.02-15.el5.x86_64.rpm\r\n<\/pre>\n<p>In PHP, you can exec the following commands and extract text from your Word and PDF files:<\/p>\n<pre lang=\"php\">\r\n<?php\r\n\/\/extract text from Word document\r\n$myfile_path_and_name = \"\/tmp\/abc.doc\";\r\n$cmd = \"antiword $myfile_path_and_name\";\r\n$output_lines = array(); $return_value = 0;\r\n$last_line = exec($cmd, $output_lines, $return_value);\r\n$doc_text = implode(\"\\n\", $output_lines);\r\n\r\n\/\/extract text from PDF file\r\n$myfile_path_and_name = \"\/tmp\/def.pdf\";\r\n$cmd = \"pdftotext -q $myfile_path_and_name -\"; \/\/ \"-q\" no messages, \"-\" no output file but stdout\r\n$output_lines = array(); $return_value = 0;\r\n$last_line = exec($cmd, $output_lines, $return_value);\r\n$pdf_text = implode(\"\\n\", $output_lines);\r\n?>\r\n<\/pre>\n<p>I could use text taken from these utilities in my PHP\/OCI function like saving BLOBs, but this time it is a CLOB column.<\/p>\n<p>That&#8217;s all! Happy searching CVs!<\/p>\n<pre lang=\"sql\">\r\nSELECT id, SCORE(1) as myrank\r\nFROM mytable\r\nWHERE CONTAINS (cvtxt, 'php sql', 1) > 0;\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Recently, I have struggled to get Oracle Text working on Linux.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[126,38,26,170,33],"class_list":["post-184","post","type-post","status-publish","format-standard","hentry","category-technology","tag-linux","tag-oracle","tag-php","tag-search","tag-sql"],"_links":{"self":[{"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/posts\/184","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=184"}],"version-history":[{"count":12,"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/posts\/184\/revisions"}],"predecessor-version":[{"id":950,"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/posts\/184\/revisions\/950"}],"wp:attachment":[{"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=184"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=184"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.muratyaman.co.uk\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=184"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}