Recently, I have struggled to get Oracle Text working on Linux. It was supposed to be a simple addon based on a very simple idea:

+ already installed Apache 2.2, PHP 5.3, Firebird 2.0; which are working fine

+ install Oracle 10g Express Edition
+ create a simple table to store candidate ID and document itself
– binary contents of Word document (.doc) or PDF document (.pdf)

I had to download the RPM file for Oracle 10g XE (Western Europe charset) and install it using yum. Then, run its configuration command (/etc/init.d/oracle-xe configure) to finish the setup.

After connecting with sqlplus, I created the table:

CREATE TABLE mytable (
  id NUMBER PRIMARY KEY,
  cvfile BLOB DEFAULT EMPTY_BLOB()
);
 
CREATE INDEX mytable_cvfile_idx ON mytable (cv_file)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SYNC (ON COMMIT)')
);

“How can I populate BLOB column of the table?”

1. You can save files in the table as usual using PHP OCI functions (PHP documentation).

2. Or you can create/execute a custom procedure that refers to your files inside a specific directory and let it do the rest using DBMS_LOB package.

The issue I experienced was that indexing failed with errors. When I looked at view (select * from CTX_USER_INDEX_ERRORS) I saw errors such as:

“DRG-11207: user filter command exited with status 1″
“DRG-11222: Third-party filter does not support this known document format”

However, the same structure worked perfectly on a Windows server (with ifilters installed).

Then, I tried another method. First, I added a CLOB column into my table, and created similar index on the new column:

ALTER TABLE mytable
ADD cvtxt CLOB DEFAULT EMPTY_CLOB();
 
UPDATE mytable SET cvtxt = EMPTY_CLOB();-- if there are records
 
CREATE INDEX mytable_cvtxt_idx ON mytable (cv_file)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SYNC (ON COMMIT)')
);

I had to populate the new column with text. I thought while I was saving the files in the database, I might as well extract text from Word documents and PDF documents and save the text in the CLOB column.

Here comes the new challenge!

Thanks to the open source world, we can find free solutions:

+ Antiword MS-Word reader (“.. converts the documents from Word 2, 6, 7, 97, 2000, 2002, and 2003 to text, Postscript, and XML/DocBook”)

Installation is easy:

yum install antiword

+ XPDF PDF viewer (“.. includes a PDF text extractor called pdftotext, PDF-to-PostScript converter, and various other utilities”)

Download the version for your system (mine was Extra Packages for Linux 5 aka EPEL5 x86_64)

cd /usr/local/src/
wget ftp://195.220.108.108/linux/epel/5/x86_64/xpdf-3.02-15.el5.x86_64.rpm
yum install ./xpdf-3.02-15.el5.x86_64.rpm

In PHP, you can exec the following commands and extract text from your Word and PDF files:

<?php
//extract text from Word document
$myfile_path_and_name = "/tmp/abc.doc";
$cmd = "antiword $myfile_path_and_name";
$output_lines = array(); $return_value = 0;
$last_line = exec($cmd, $output_lines, $return_value);
$doc_text = implode("\n", $output_lines);
 
//extract text from PDF file
$myfile_path_and_name = "/tmp/def.pdf";
$cmd = "pdftotext -q $myfile_path_and_name -"; // "-q" no messages, "-" no output file but stdout
$output_lines = array(); $return_value = 0;
$last_line = exec($cmd, $output_lines, $return_value);
$pdf_text = implode("\n", $output_lines);
?>

I could use text taken from these utilities in my PHP/OCI function like saving BLOBs, but this time it is a CLOB column.

That’s all! Happy searching CVs!

SELECT id, SCORE(1) AS myrank
FROM mytable
WHERE CONTAINS (cvtxt, 'php sql', 1) > 0;
Tagged with:
 

Before continuing the rest of the installation process we started (/wp/index.php/2009/07/setting-up-a-web-application-on-centos-5-3/), I must mention my appreciation for the “yummy” tool Linux developers provided: yum (.. an automatic updater and package installer/remover for rpm systems). I love using webmin (.. a web-based interface for system administration for Unix) as well. The other day I only clicked on a button to upgrade almost all the packages in our CentOS box; this facility is using yum in the background. It took almost 2 hours: just be patient :)

The main problem I have experienced was the create the link between PHP and Oracle 10g Express Database server. Thanks to Google I was able to find people on the net that had the similar issues.

I downloaded 2 (free) packages from Oracle’s website (into folder /usr/src/):

oracle-instantclient-basic-10.2.0.3-1.x86_64.rpm
oracle-instantclient-devel-10.2.0.3-1.x86_64.rpm

Then, I ran:

cd /usr/src/
rpm -ivh oracle-instantclient-basic-10.2.0.3-1.x86_64.rpm
rpm -ivh oracle-instantclient-devel-10.2.0.3-1.x86_64.rpm

Development packages and a compiler were needed, as far as I understood from the error messages.

yum install gcc
yum install php-devel
yum install php-pecl-* --skip-broken
yum install php-oci8
pecl install oci8

When it asks for the ORACLE_HOME or Oracle Instant Client, type:

instantclient,/usr/lib/oracle/10.2.0.3/client64/lib

[root@localhost src]# pecl install oci8
downloading oci8-1.3.5.tgz ...
Starting to download oci8-1.3.5.tgz (137,987 bytes)
............................done: 137,987 bytes
10 source files, building
running: phpize
Configuring for:
PHP Api Version:         20090626
Zend Module Api No:      20090626
Zend Extension Api No:   220090626
Please provide the path to the ORACLE_HOME directory. Use 'instantclient,/path/to/instant/client/lib' if you're compiling with Oracle Instant Client [autodetect] : instantclient,/usr/lib/oracle/10.2.0.3/client64/lib

By the way, I did not need all the pecl libraries installed. I simply went into /etc/php.d/ and commented out the lines for extensions (to disable them):

[root@localhost php.d]# ls
apc.ini       gmagick.ini    lzf.ini        oci8.ini          pdo_sqlite.ini        sphinx.ini  z-mailparse.ini
curl.ini      imap.ini       mbstring.ini   pdf.ini           phar.ini              sqlite.ini
fileinfo.ini  interbase.ini  memcached.ini  pdo_firebird.ini  php-pecl-imagick.ini  ssh2.ini
gd.ini        json.ini       memcache.ini   pdo.ini           pspell.ini            xdebug.ini
geoip.ini     ldap.ini       ncurses.ini    pdo_oci.ini       radius.ini            zip.ini
 
[root@localhost php.d]# nano xdebug.ini
; Enable xdebug extension module
; zend_extension=/usr/lib64/php/modules/xdebug.so

So, far the packages installed help us run our bespoke web application on Apache (httpd 2.2.3), PHP 5.3.1, Firebird 2.1.2, Oracle 10g XE.

Check the following useful extras:

yum install aspell
yum install php-pspell
yum install php-imap

Reading emails from an IMAP server, parsing structured content, preparing emails using by TinyMCE with spellchecker and sending email using PHPMailer class are all fun!

Happy coding!

Tagged with:
 

Tape backup on CentOS

On October 8, 2009, in Databases, Linux, by murat

In this occasion, we have a PHP/Firebird application. We had to backup the database and the code (which includes PHP, HTML, CSS, Javascript files as well as user documents uploaded into a folder).

We have created a folder /mybackup/ and created a bash script backup.sh to do the following, using nano command line editor. We will always have 2 compressed gz files, remove the old ones. Backup the database, compress it. Backup the code folder, compress it. Rewind the tape. Then, copy into the tape. We are expecting the client to eject the tape and insert another one everyday, Monday to Friday.

#!/bin/bash
rm -f *.gz
/opt/firebird/bin/gbak -t -user sysdba -password "mypassword" localhost:mydb /mybackup/mydb.fbk
gzip mydb.fbk
tar -cpf mycode.tar /var/www/html
gzip mycode.tar
mt -f /dev/st0 rewind
tar -cpf /dev/st0 *.gz

(mydb is an alias pointing to a FDB database file defined in Firebird aliases.conf)

Using Webmin, my favourite online system administration tool for Linux, I’ve created a cron job to run /mybackup/backup.sh at required intervals.

Make sure that the user account that runs the commands has the execute and write permissions where needed.

Install mt, if your system has not got it already.

CentOS package manager

CentOS package manager

Tagged with:
 

Ingredients for our web application are Apache HTTP Server 2.2, PHP 5.2, Firebird 2.1and Oracle 10g Express Edition.

Like in any IT project, not everything goes according to plan and there are so many pitfalls; some of which are:

  • Repositories
  • Linux Firewall
  • SE Linux Permissions

Using the package manager (Yum) to add, remove and upgrade applications is quite straight forward. You can install and uninstall applications fairly easily by just ticking and unticking a few boxes and APPLYYY!. Yum will take care of the applications and libraries that installed application requires (dependency tree).

Installing CentOS 5.3 with standard web server tools should be all right.

Although the default packages allows you to install earlier versions of Apache HTTPd 2.2.3 and PHP 5.1.6, when the latest are Apache HTTPd 2.2.12 and PHP 5.3, you think it should be all right, but no, unfortunately!

First, we need to install Firebird: luckily you can download the RPM file and double click on it!

On my system, it is installed on /opt/firebird/ by default. So, run gsec as root to manage user accounts as usual:

/opt/firebird/bin/gsec

Unfortunately, there are not many online tools to administer Firebird database; have a look at ibWebAdmin.

And for Oracle, just download the RPM file, and intallation is easy. This one is the 10g R2 Express Edition (Universal), free to use like MS SQL Server Express.

Run the following command to configure and set a password:

/etc/init.d/oracle-xe configure

Then, you can manage the Oracle system within the browser:

http://localhost:8080/apex

First problem is the firewall: if you want to manage Oracle from another PC on your network, allow connections on the port 8080. If you need to access Firebird only locally, you do not need to open port 3050.

The real nitty gritty stuff come when you need particular libraries, modules and functions in PHP: to connect to Oracle and Firebird, for instance.

Normally, you would run Yum command to add these functionalities:

yum install php-firebird

But, it will not find them in the default repositories. This did the trick for me:

yum install php-interbase --enablerepo epel-testing

To be continued..

(as promised /wp/index.php/2009/12/setting-up-a-web-application-on-centos-continued/)

Tagged with: