Creating a CV indexing facility using Oracle Text

Recently, I have struggled to get Oracle Text working on Linux. It was supposed to be a simple addon based on a very simple idea:

+ already installed Apache 2.2, PHP 5.3, Firebird 2.0; which are working fine

+ install Oracle 10g Express Edition
+ create a simple table to store candidate ID and document itself
– binary contents of Word document (.doc) or PDF document (.pdf)

I had to download the RPM file for Oracle 10g XE (Western Europe charset) and install it using yum. Then, run its configuration command (/etc/init.d/oracle-xe configure) to finish the setup.

After connecting with sqlplus, I created the table:

CREATE TABLE mytable (
  id NUMBER PRIMARY KEY,
  cvfile BLOB DEFAULT EMPTY_BLOB()
);

CREATE INDEX mytable_cvfile_idx ON mytable (cv_file)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SYNC (ON COMMIT)')
);

“How can I populate BLOB column of the table?”

1. You can save files in the table as usual using PHP OCI functions (PHP documentation).

2. Or you can create/execute a custom procedure that refers to your files inside a specific directory and let it do the rest using DBMS_LOB package.

The issue I experienced was that indexing failed with errors. When I looked at view (select * from CTX_USER_INDEX_ERRORS) I saw errors such as:

“DRG-11207: user filter command exited with status 1”
“DRG-11222: Third-party filter does not support this known document format”

However, the same structure worked perfectly on a Windows server (with ifilters installed).

Then, I tried another method. First, I added a CLOB column into my table, and created similar index on the new column:

ALTER TABLE mytable
ADD cvtxt CLOB DEFAULT EMPTY_CLOB();

UPDATE mytable SET cvtxt = EMPTY_CLOB();-- if there are records

CREATE INDEX mytable_cvtxt_idx ON mytable (cv_file)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SYNC (ON COMMIT)')
);

I had to populate the new column with text. I thought while I was saving the files in the database, I might as well extract text from Word documents and PDF documents and save the text in the CLOB column.

Here comes the new challenge!

Thanks to the open source world, we can find free solutions:

+ Antiword MS-Word reader (“.. converts the documents from Word 2, 6, 7, 97, 2000, 2002, and 2003 to text, Postscript, and XML/DocBook”)

Installation is easy:

yum install antiword

+ XPDF PDF viewer (“.. includes a PDF text extractor called pdftotext, PDF-to-PostScript converter, and various other utilities”)

Download the version for your system (mine was Extra Packages for Linux 5 aka EPEL5 x86_64)

cd /usr/local/src/
wget ftp://195.220.108.108/linux/epel/5/x86_64/xpdf-3.02-15.el5.x86_64.rpm
yum install ./xpdf-3.02-15.el5.x86_64.rpm

In PHP, you can exec the following commands and extract text from your Word and PDF files:

I could use text taken from these utilities in my PHP/OCI function like saving BLOBs, but this time it is a CLOB column.

That’s all! Happy searching CVs!

SELECT id, SCORE(1) as myrank
FROM mytable
WHERE CONTAINS (cvtxt, 'php sql', 1) > 0;

Tags: Linux, Oracle, PHP, Search, SQL

Categories:

Technology

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Creating a CV indexing facility using Oracle Text

Leave a Reply

Categories

Archives