Recently, I have struggled to get Oracle Text working on Linux. It was supposed to be a simple addon based on a very simple idea:
+ already installed Apache 2.2, PHP 5.3, Firebird 2.0; which are working fine
+ install Oracle 10g Express Edition
+ create a simple table to store candidate ID and document itself
– binary contents of Word document (.doc) or PDF document (.pdf)
I had to download the RPM file for Oracle 10g XE (Western Europe charset) and install it using yum. Then, run its configuration command (/etc/init.d/oracle-xe configure) to finish the setup.
After connecting with sqlplus, I created the table:
CREATE TABLE mytable ( id NUMBER PRIMARY KEY, cvfile BLOB DEFAULT EMPTY_BLOB() ); CREATE INDEX mytable_cvfile_idx ON mytable (cv_file) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS ('SYNC (ON COMMIT)') );
“How can I populate BLOB column of the table?”
1. You can save files in the table as usual using PHP OCI functions (PHP documentation).
2. Or you can create/execute a custom procedure that refers to your files inside a specific directory and let it do the rest using DBMS_LOB package.
The issue I experienced was that indexing failed with errors. When I looked at view (select * from CTX_USER_INDEX_ERRORS) I saw errors such as:
“DRG-11207: user filter command exited with status 1″
“DRG-11222: Third-party filter does not support this known document format”
However, the same structure worked perfectly on a Windows server (with ifilters installed).
Then, I tried another method. First, I added a CLOB column into my table, and created similar index on the new column:
ALTER TABLE mytable ADD cvtxt CLOB DEFAULT EMPTY_CLOB(); UPDATE mytable SET cvtxt = EMPTY_CLOB();-- if there are records CREATE INDEX mytable_cvtxt_idx ON mytable (cv_file) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS ('SYNC (ON COMMIT)') );
I had to populate the new column with text. I thought while I was saving the files in the database, I might as well extract text from Word documents and PDF documents and save the text in the CLOB column.
Here comes the new challenge!
Thanks to the open source world, we can find free solutions:
+ Antiword MS-Word reader (“.. converts the documents from Word 2, 6, 7, 97, 2000, 2002, and 2003 to text, Postscript, and XML/DocBook”)
Installation is easy:
yum install antiword+ XPDF PDF viewer (“.. includes a PDF text extractor called pdftotext, PDF-to-PostScript converter, and various other utilities”)
Download the version for your system (mine was Extra Packages for Linux 5 aka EPEL5 x86_64)
cd /usr/local/src/ wget ftp://195.220.108.108/linux/epel/5/x86_64/xpdf-3.02-15.el5.x86_64.rpm yum install ./xpdf-3.02-15.el5.x86_64.rpm
In PHP, you can exec the following commands and extract text from your Word and PDF files:
<?php //extract text from Word document $myfile_path_and_name = "/tmp/abc.doc"; $cmd = "antiword $myfile_path_and_name"; $output_lines = array(); $return_value = 0; $last_line = exec($cmd, $output_lines, $return_value); $doc_text = implode("\n", $output_lines); //extract text from PDF file $myfile_path_and_name = "/tmp/def.pdf"; $cmd = "pdftotext -q $myfile_path_and_name -"; // "-q" no messages, "-" no output file but stdout $output_lines = array(); $return_value = 0; $last_line = exec($cmd, $output_lines, $return_value); $pdf_text = implode("\n", $output_lines); ?>
I could use text taken from these utilities in my PHP/OCI function like saving BLOBs, but this time it is a CLOB column.
That’s all! Happy searching CVs!
SELECT id, SCORE(1) AS myrank FROM mytable WHERE CONTAINS (cvtxt, 'php sql', 1) > 0;
Go to your Oracle binaries folder and export environment variables, if they are not set already. Then, run SQL*Plus command line utility with option /nolog .
[root@localhost /]# cd /usr/lib/oracle/xe/app/oracle/product/10.2.0/server/bin/ [root@localhost bin]#. ./oracle_env.sh [root@localhost bin]# sqlplus /nolog SQL> connect myuser/mypassword@localhost/xe Connected. SQL> select count(*) as c from mytablespace.mytable 2 / C ---------- 123
I hope this helps some beginners.
Before continuing the rest of the installation process we started (/wp/index.php/2009/07/setting-up-a-web-application-on-centos-5-3/), I must mention my appreciation for the “yummy” tool Linux developers provided: yum (.. an automatic updater and package installer/remover for rpm systems). I love using webmin (.. a web-based interface for system administration for Unix) as well. The other day I only clicked on a button to upgrade almost all the packages in our CentOS box; this facility is using yum in the background. It took almost 2 hours: just be patient
The main problem I have experienced was the create the link between PHP and Oracle 10g Express Database server. Thanks to Google I was able to find people on the net that had the similar issues.
I downloaded 2 (free) packages from Oracle’s website (into folder /usr/src/):
oracle-instantclient-basic-10.2.0.3-1.x86_64.rpm
oracle-instantclient-devel-10.2.0.3-1.x86_64.rpm
Then, I ran:
cd /usr/src/ rpm -ivh oracle-instantclient-basic-10.2.0.3-1.x86_64.rpm rpm -ivh oracle-instantclient-devel-10.2.0.3-1.x86_64.rpm
Development packages and a compiler were needed, as far as I understood from the error messages.
yum install gcc yum install php-devel yum install php-pecl-* --skip-broken yum install php-oci8 pecl install oci8
When it asks for the ORACLE_HOME or Oracle Instant Client, type:
instantclient,/usr/lib/oracle/10.2.0.3/client64/lib
[root@localhost src]# pecl install oci8 downloading oci8-1.3.5.tgz ... Starting to download oci8-1.3.5.tgz (137,987 bytes) ............................done: 137,987 bytes 10 source files, building running: phpize Configuring for: PHP Api Version: 20090626 Zend Module Api No: 20090626 Zend Extension Api No: 220090626 Please provide the path to the ORACLE_HOME directory. Use 'instantclient,/path/to/instant/client/lib' if you're compiling with Oracle Instant Client [autodetect] : instantclient,/usr/lib/oracle/10.2.0.3/client64/lib
By the way, I did not need all the pecl libraries installed. I simply went into /etc/php.d/ and commented out the lines for extensions (to disable them):
[root@localhost php.d]# ls apc.ini gmagick.ini lzf.ini oci8.ini pdo_sqlite.ini sphinx.ini z-mailparse.ini curl.ini imap.ini mbstring.ini pdf.ini phar.ini sqlite.ini fileinfo.ini interbase.ini memcached.ini pdo_firebird.ini php-pecl-imagick.ini ssh2.ini gd.ini json.ini memcache.ini pdo.ini pspell.ini xdebug.ini geoip.ini ldap.ini ncurses.ini pdo_oci.ini radius.ini zip.ini [root@localhost php.d]# nano xdebug.ini ; Enable xdebug extension module ; zend_extension=/usr/lib64/php/modules/xdebug.so
So, far the packages installed help us run our bespoke web application on Apache (httpd 2.2.3), PHP 5.3.1, Firebird 2.1.2, Oracle 10g XE.
Check the following useful extras:
yum install aspell yum install php-pspell yum install php-imap
Reading emails from an IMAP server, parsing structured content, preparing emails using by TinyMCE with spellchecker and sending email using PHPMailer class are all fun!
Happy coding!
After years of torturing their SQL Server users, Microsoft provided a “better” way of paging records in version 2005, and it is a weird one! The following example is from their website, an article about ROW_NUMBER (Transact-SQL).
USE AdventureWorks; GO WITH OrderedOrders AS ( SELECT SalesOrderID, OrderDate, ROW_NUMBER() OVER (ORDER BY OrderDate) AS 'RowNumber' FROM Sales.SalesOrderHeader ) SELECT * FROM OrderedOrders WHERE RowNumber BETWEEN 50 AND 60;
It is simply unacceptable, non-standard, impractical! By the way, you can not use column numbers (e.g. ORDER BY 1, 2 DESC); you must specify column names.
As usual, Microsoft programmers do not “think” and do not expect to work on millions of records, they simply say: “Buy more RAM, faster CPU, a better machine!” etc. Upgrade, upgrade, upgrade! They still do not understand that they will not be able to replace international standards organisation or something like that.
Unfortunately, we can not force our users not to ask for all the records; if they want to see them, we need to provide the facility, and show the records page by page.
Possibly, MySQL developers thought about the speed of the database engine while they were developing it, and LIMIT clause was a part of it since middle ages!
SELECT SalesOrderID, OrderDate FROM SalesOrderHeader ORDER BY OrderDate LIMIT 10 OFFSET 49
Similarly, Firebird SQL has a very simple clause to get a page of records like this, using FIRST m SKIP n syntax:
SELECT FIRST 10 SKIP 49 SalesOrderID, OrderDate FROM SalesOrderHeader ORDER BY OrderDate
PostgreSQL works same as MySQL:
SELECT SalesOrderID, OrderDate FROM SalesOrderHeader ORDER BY OrderDate LIMIT 10 OFFSET 49
Although Oracle SQL has a function (similar to MS SQL Server 2005) called ROW_NUMBER(), they also offer a pseudocolumn called ROWNUM to achieve the same:
SELECT SalesOrderID, OrderDate FROM SalesOrderHeader WHERE ROWNUM BETWEEN 50 AND 60 ORDER BY OrderDate
Note that this method requires modification of the criteria (WHERE clause) of the original query. I do not want to comment on who is mimicing who, but it is known that Microsoft do it more often, or simply buy the rival company owning a popular software application.
You can easily get lost inside the terrible documentation of IBM for DB2 database: it supports the approach using ROW_NUMBER() function.
My favourite database management systems are: Firebird, MySQL and Oracle. Sometimes you are not given any options and have to use what is available on the application server, like MS SQL Server 2000/2005 on a Windows server, MySQL 5 usually on a Linux machine.