• Requirements
  • Stage 1: Installing the Linux Operating System
  • Stage 2: Installing MySQL
  • Stage 3: Installing Apache Web Server
  • Stage 4: Installing the ASPSeek Search Engine
  • Stage 5: Testing and Configuring the ASPSeek Search Engine
  • Stage 6: Indexing with the ASPSeek Search Engine
  • Stage 7: Going Further



    P-16 Portal Project Documentation [Draft]

    How to install and operate the ASPSeek Search Engine


    These instructions are intended to be a simple step-by-step installation and operation guide for the ASPSeek Search Engine. The ASPSeek Search Engine is a collection of programs that will allow a server administrator to index the content of specific web servers and allow browsers to search through the indexed content via a CGI application. The diagram below summarizes how it works:

    In a nutshell, the ASPSeek Search Engine consists of 4 main components:



    1. The Indexer program (index), which is a web-spidering robot application. It gets its instructions from the aspseek.conf and db.conf files. The aspseek.conf file can be configured to read its server list from sites.conf.

    2. The MySQL database, which interacts with the Indexer and the Search Daemon. The information storage system uses a combination of SQL server tables and binary files.

    3. The Search Daemon (searchd), which is used to search the data storage system when a request is received from a web user.

    4. The s.cgi program, which runs as a CGI application on the Apache web server. Any conventional web-based form can call this program, such as the form that might be included in the search.html document.

    The remainder of this document will detail the installation of ASPSeek.


    Requirements:


    1. Hardware:

      1. Low End – the ASPSeek packages have been installed successfully on a discarded Intel-based workstation: Pentium-S 133 MHz with 64 MB of RAM and a 1.5 GB IDE HD. A machine of this size would probably suffice if you want to index a small number of servers.

      2. High End – the ASPSeek packages have been installed successfully on a Dell PowerEdge 1500SC with twin 1.4 GHz P-4 processors and dual 18 GB SCSI HDs in a RAID 5 configuration.




    2. Operating System:

      1. ASP-Linux – the ASP-Linux operating system is a variant of RedHat Linux. This operating system features a very efficient installer and it is the OS that ASPSeek was originally designed on.

      2. RedHat Linux 7.2 – Since ASP-Linux is based on RedHat Linux, this OS is compatible with the ASPSeek Search Engine.




    3. Source code:

      1. The latest version of the ASPSeek Search Engine is aspseek-1.2.10.tar.gz

      2. The source code is available at http://www.aspseek.org/pkg/src/1.2.10/aspseek-1.2.10.tar.gz

      3. Installation from source requires compilers and compiler libraries




    4. Packages (RPM):

      1. On RedHat Linux and compatibles, installations can be done using the RedHat Package Manager system (RPM).

      2. The RPMs are available at http://www.aspseek.org/pkg/rpm/

      3. Current package builds are based on aspseek-1.2.9-1 rather than aspseek-1.2.10.

      4. Depending on the system that you are using, you may need to find and install the Apache web server and the MySQL server packages. Both packages will probably be available on the Linux install CD. You should also consider installing PHP with MySQL support.




    5. Networking:

      1. Linux recognizes most modern Network Interface Cards (NIC). It is advisable to check the hardware compatibility lists for your Linux distribution prior to installing the OS. Most Linux flavors recognize 3Com, Intel, and other brand-name cards.

      2. When the Indexer is running, it can consume quite a bit of your network bandwidth. Depending on the number of servers that are indexed and the depth to which the spider crawls, a single indexing run can take hours. A T-1 (1.544 Mbps) connection should provide sufficient bandwidth for indexing as well as end-user search connections.


    Stage 1: Installing the Linux Operating System

    If you have never installed Linux, don’t panic. Most of the latest distributions include very efficient installer programs. On most modern, high-end machines, the installer program will work as effectively as Windows® or Macintosh® installers. Because Linux is distributed under the GNU General Public License, it is free to download and burn to CD. These brief instructions are for the ASP-Linux distribution.


    Getting Ready:

    You will need a high-speed Internet connection and a workstation or laptop that has a CD-RW drive. Most computers with CD-RW drives have CD-burning software installed.


    Step 1:

    Download the ISO image from http://www.asp-linux.com. The most current version is asplin-rc3d1.iso. It weighs in at approximately 680 MB, so make sure you have plenty of disk space and lots of time for the download. Save the image to an easily accessible location on your hard drive.
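To get a feel for "lots of time", the download can be estimated from the image size and the line rate. The sketch below is only a rough calculation: it assumes the 680 MB figure above, the T-1 rate mentioned under Requirements, and no protocol overhead. The function name download_seconds is made up for this example.

```shell
#!/bin/sh
# Estimate how long a download takes at a given line rate.
# size_mb:   size in megabytes (10^6 bytes)
# rate_kbps: line rate in kilobits per second
download_seconds() {
    size_mb=$1
    rate_kbps=$2
    # One megabyte is 8000 kilobits; integer division rounds down.
    echo $(( size_mb * 8000 / rate_kbps ))
}

# 680 MB image over a T-1 line (1544 kbps), ignoring protocol overhead.
secs=$(download_seconds 680 1544)
echo "about $(( secs / 60 )) minutes"
```

On a shared or slower line the real figure will be higher, so treat the result as a lower bound.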


    Step 2:

    Create the CD using DirectCD, EasyCD Creator, Nero, or your favorite CD-creation program on your machine.


    Step 3:

    Once the CD is created, you can use it to install ASP-Linux on the machine that is destined to become the search engine server. Most of the newer computers will boot from the CD-ROM. If yours does not, you may be able to modify the BIOS settings so that it will. If the “server-to-be” will not boot from CD-ROM, then you can create a boot installer diskette that will let you install the OS.


    Creating a Boot Installer Diskette
    Getting Ready:

    This presumes that you have successfully created the ASP-Linux Installation CD.


    Step 1:

    While you’re still using Windows®, insert the ASP-Linux Installation CD into the CD-ROM drive. Place a BLANK diskette into the floppy diskette drive.


    Step 2:

    Go to Start > Run and type in “command” (Win98) or “cmd” (Win2K/XP). Once the command window appears, type “d:\dosutils\rawrite.exe” (without quotes). This will launch a disk imaging utility that will prompt you for “Source:” and “Destination:”



    Source: d:\floppy\boot.img [Enter]

    Destination: a: [Enter]
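If a Linux or other Unix-like machine is at hand instead of Windows, the same image can usually be written with dd rather than rawrite. The following is a minimal sketch, not part of the original procedure: the function name write_boot_image, the CD mount point /mnt/cdrom and the floppy device /dev/fd0 are assumptions that may differ on your system.

```shell
#!/bin/sh
# Write a boot image to a target device or file, then verify the copy.
write_boot_image() {
    src=$1
    dest=$2
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    dd if="$src" of="$dest" bs=512 2>/dev/null || return 1
    # Compare only as many bytes as the image holds, because a device
    # can be larger than the image written to it.
    size=$(( $(wc -c < "$src") ))
    head -c "$size" "$dest" | cmp -s - "$src"
}

# Example (as root; both paths are assumptions):
#   write_boot_image /mnt/cdrom/floppy/boot.img /dev/fd0
```

The read-back comparison is there because floppy media fail often, and a bad diskette is easier to catch here than halfway through an installation.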
    Step 3:

    Once the Boot Installer diskette is created, place the CD in the CD-ROM tray and place the diskette in the diskette drive of the “server-to-be”. Then boot or reboot the machine. If your machine will boot from the CD-ROM drive, you do not need to use the Boot Installer diskette.


    WARNING: INSTALLING LINUX ON A COMPUTER WILL ERASE THE HARD DRIVE(S) AND COMPLETELY REMOVE EXISTING OPERATING SYSTEMS, PROGRAMS AND DATA.
    Step 4:

    Follow the on-screen dialogue to install ASP-Linux on your system. Ethernet connectivity will allow the installer to proceed more rapidly as the installer probes the network for information. Installation can take several minutes.


    For more information on installing ASP-Linux, see:
    http://www.asp-linux.com/en/docs/install/
    For more information on using ASP-Linux, see:
    http://www.asp-linux.com/en/docs/guide/

    Stage 2: Installing MySQL

    The ASPSeek Search Engine relies on a relational database management system (RDBMS), such as MySQL. MySQL is a small, compact database server ideal for small to medium applications. ASPSeek requires MySQL version 3.23.xx or later. To test for the presence of MySQL, issue the following command from the root command line prompt:


    # whereis mysql
    If it is already installed, the system should return:
    mysql: /usr/bin/mysql /usr/lib/mysql /usr/include/mysql /usr/share/mysql /usr/man/man1/mysql.1.gz
    If the system returns the following:
    mysql:
    then MySQL is not installed on the server and you will have to install it. Installation of MySQL from RPM is the preferred method on RedHat or ASP-Linux. The following packages should be installed prior to installing ASPSeek:
    MySQL-3.23.51-1.i386.rpm

    MySQL-client-3.23.51-1.i386.rpm

    MySQL-devel-3.23.51-1.i386.rpm

    MySQL-shared-3.23.51-1.i386.rpm

    The packages can be found on the ASP-Linux CD, the RedHat Linux CD (Disk 1), or from the http://www.mysql.com/ web site. For more information on copying files from the CD-ROM to the server filesystem, consult the User Guide.
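The 3.23.xx minimum mentioned above can be checked mechanically once you know which version a package provides. The sketch below compares dotted version strings; the helper name version_ge is invented for this example, and the three versions in the loop are sample values rather than output from a real system.

```shell
#!/bin/sh
# Succeed when dotted version $1 is greater than or equal to $2.
# Only the first three numeric fields are compared; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Sample versions, checked against the 3.23 minimum.
for v in 3.22.32 3.23.51 4.0.1; do
    if version_ge "$v" 3.23; then
        echo "$v: ok"
    else
        echo "$v: too old"
    fi
done

# On an RPM-based system the installed version can be queried with:
#   rpm -q --queryformat '%{VERSION}\n' MySQL
```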


    You must be root to install packages on your Linux server. Once you have acquired the MySQL packages on your server, issuing the following commands will install them:
    rpm -ivh MySQL-3.23.51-1.i386.rpm

    rpm -ivh MySQL-client-3.23.51-1.i386.rpm

    rpm -ivh MySQL-devel-3.23.51-1.i386.rpm

    rpm -ivh MySQL-shared-3.23.51-1.i386.rpm

    Stage 3: Installing Apache Web Server

    Chances are that the Apache Web Server is already installed on your server. If it is not, you will need to grab the RPM from the Linux CD or from the Apache web site. To test for the presence of the Apache Web Server, issue the following command from the command prompt:


    # whereis httpd
    If Apache is already installed, the command should return something similar to the following:
    httpd: /usr/sbin/httpd /etc/httpd /usr/share/man/man8/httpd.8.gz
    If the system cannot find the Apache Web Server, you will need to install it. As with all package installations, you must be root in order for RPMs to install properly.
    rpm -ivh apache-1.3.20-16.i386.rpm
    You should also check to see that zlib is installed on the server.
    # whereis zlib

    zlib: /usr/include/zlib.h /usr/share/man/man3/zlib.3.gz
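The two whereis checks in this stage can be folded into one small script that reports anything missing. This is a sketch under assumptions: the paths are copied from the sample output above and may be different on other distributions, and check_paths is a name made up for this example.

```shell
#!/bin/sh
# Report whether each path given as an argument exists.
# Returns non-zero if at least one path is missing.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ -e "$p" ]; then
            echo "found    $p"
        else
            echo "MISSING  $p"
            missing=1
        fi
    done
    return "$missing"
}

# Paths taken from the sample whereis output; adjust them for your system.
check_paths /usr/sbin/httpd /usr/include/zlib.h \
    || echo "install the missing packages before continuing"
```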

    Stage 4: Installing the ASPSeek Search Engine

    Once again, installation from RPM is the simplest approach to installing ASPSeek. There are 6 packages that are required.



    aspseek-1.2.9-1.asp.i386.rpm

    aspseek-client-1.2.9-1.asp.i386.rpm

    aspseek-mysql-lib-1.2.9-1.asp.i386.rpm

    aspseek-cgi-1.2.9-1.asp.i386.rpm

    mod_aspseek-1.2.9-1.asp.i386.rpm

    aspseek-1.2.9-1.asp.src.rpm
    Install each package using the rpm -ivh command. The contents of these packages are listed in the Appendix of this document.
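Typing the same command for every package is error-prone, so a short loop may help. The sketch below only prints the commands unless RUN=1 is set. It follows the order of the list above and leaves out the .src.rpm, on the assumption that the source package is wanted only for rebuilding. The names install_aspseek and RUN are invented for this example.

```shell
#!/bin/sh
# Binary packages from the list above, in the order given there.
PACKAGES="aspseek aspseek-client aspseek-mysql-lib aspseek-cgi mod_aspseek"
RELEASE="1.2.9-1.asp.i386"

# Print one rpm command per package. Set RUN=1 to execute instead of print.
install_aspseek() {
    for p in $PACKAGES; do
        file="$p-$RELEASE.rpm"
        if [ "${RUN:-0}" = 1 ]; then
            rpm -ivh "$file" || return 1
        else
            echo "rpm -ivh $file"
        fi
    done
}

install_aspseek
```

If rpm reports unmet dependencies, reordering the list or passing all the files to a single rpm -ivh command is the usual remedy.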
    After installing all the packages, your machine will benefit from a reboot. This can be accomplished from the command line with either of the following commands:
    # shutdown -r now

    # reboot
    Once the server reboots, you can start testing out the system.

    Stage 5: Testing and Configuring the ASPSeek Search Engine

    All of the components of the search engine should now be operational. There are a few tests that need to be performed to ensure that everything installed correctly.


    The first thing to check is to see that the s.cgi binary found its way into the correct directory. On RedHat and ASP-Linux, it should have been installed in /var/www/cgi-bin. You can check this by issuing the following from the command line:
    locate s.cgi
    Your system should return the following:
    /usr/share/man/man1/s.cgi.1.gz

    /var/www/cgi-bin/s.cgi
    To see if the Apache Web Server is able to serve this CGI application, open a browser on a workstation and type in the following address:
    http://{yourservername-or-IP}/cgi-bin/s.cgi
    If successful, you will see the generic ASPSeek Search page. You can give it a try, but prior to indexing some web sites, the database will be empty.
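The browser test can also be run from a command line, which is handy on a server without a desktop. The following is a hedged sketch: it assumes that curl is installed and that s.cgi takes the search terms in a parameter named "q". The host name search.example.org is a placeholder, and both function names are made up for this example.

```shell
#!/bin/sh
# Build the URL for a test query. The "q" parameter name is an assumption
# about how s.cgi receives the search terms.
search_url() {
    host=$1
    query=$2
    echo "http://$host/cgi-bin/s.cgi?q=$query"
}

# Return success when the server answers the URL with HTTP status 200.
check_search_page() {
    status=$(curl -s -o /dev/null -w '%{http_code}' "$1") || return 1
    [ "$status" = 200 ]
}

# Example:
#   check_search_page "$(search_url search.example.org test)" && echo "s.cgi answers"
search_url search.example.org test
```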
    Next you will want to test the Indexer.
    The index application is a component of ASPSeek that performs Web-crawling, document downloading, parsing and storing. It can also be used to manipulate the ASPSeek database. During the indexing process, index walks across the sites and stores the pages it finds in special data structures called delta files, and in a MySQL database.
    In order to run index, you must become an unprivileged user. The ASPSeek installation process created a user called “aspseek”. You must become this user to perform certain tasks, such as performing an index run. To become the “aspseek” user, issue the following from the command prompt:
    # su aspseek
    Once you become the aspseek user, you will have a different shell. From this shell prompt, issue the following command:
    bash-2.05$ index -S
    This instructs the index to read its configuration files and to show the database statistics (the sample below is from a database that has already indexed sites):
    Loading configuration from /etc/aspseek/db.conf

    Loading configuration from /etc/aspseek/ucharset.conf

    Loading configuration from /etc/aspseek/stopwords.conf

    Loading configuration from /etc/aspseek/sites.conf

    Loading configuration from /etc/aspseek/aspseek.conf
    ASPseek database statistics
    Status    Expired      Total
    -----------------------------
         0      22675      22675  Not indexed yet
       200         85     130351  OK
       300          0        193  Multiple Choices
       301          0       1414  Moved Permanently
       302          0       3392  Moved Temporarily
       401          0         50  Unauthorized
       403          0        565  Forbidden
       404          9      14913  Not found
       500          0          8  Internal Server Error
    -----------------------------
     Total      22769     173561
    If your system returns a screen similar to the one above, then the index program is working properly. You will have to return to the root prompt by issuing the “exit” command to quit the aspseek user shell.
    bash-2.05$ exit
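The statistics screen is plain text, so it can be processed with standard tools, for example to confirm that the per-status rows add up to the Total line. The sketch below feeds the sample figures from above into awk. The function name sum_index_stats is invented for this example, and real output may be laid out slightly differently.

```shell
#!/bin/sh
# Sum the Expired and Total columns over the per-status rows of `index -S`
# output read from standard input. Rows are recognized by a numeric first
# field, so the header, the separators and the Total line are skipped.
sum_index_stats() {
    awk '$1 ~ /^[0-9]+$/ { expired += $2; total += $3 }
         END { print expired + 0, total + 0 }'
}

# Sample rows copied from the statistics shown above.
sum_index_stats <<'EOF'
Status Expired Total
-----------------------------
0 22675 22675 Not indexed yet
200 85 130351 OK
300 0 193 Multiple Choices
301 0 1414 Moved Permanently
302 0 3392 Moved Temporarily
401 0 50 Unauthorized
403 0 565 Forbidden
404 9 14913 Not found
500 0 8 Internal Server Error
-----------------------------
Total 22769 173561
EOF
```

For the sample this prints 22769 173561, which matches the Total line.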
    The next step is to add the sites that you wish to index.
    Prior to making the first indexing run, you will need to add web server URLs to the aspseek.conf file. All of the configuration files are located in /etc/aspseek/. To edit this file (and other files), change directories by issuing the following command:
    # cd /etc/aspseek
    List the contents of the directory by issuing the “ls -l” command:
    -rw-r--r-- 1 root root 22651 Jul 3 14:26 aspseek.conf

    -rw-r--r-- 1 root root 59 Feb 18 16:42 db.conf

    drwxr-xr-x 2 root root 4096 Feb 18 16:42 langmap

    -rw-r--r-- 1 root root 15254 Jul 15 11:51 s.htm

    -rw-r--r-- 1 root root 6351 Sep 24 2001 searchd.conf

    -rw-r--r-- 1 root root 566 Jul 2 14:33 sites.conf

    drwxr-xr-x 3 root root 4096 Feb 18 16:41 sql

    drwxr-xr-x 2 root root 4096 Feb 18 16:42 stopwords

    -rw-r--r-- 1 root root 1105 Sep 24 2001 stopwords.conf

    drwxr-xr-x 2 root root 4096 Feb 18 16:42 tables

    -rw-r--r-- 1 root root 2274 Sep 24 2001 ucharset.conf
    Edit the aspseek.conf file using the text editor, pico:

    # pico -w aspseek.conf
    Use the down-arrow key (or the ctrl-v key combination) to find the following section:
    #######################################################################

    #Server

    # It is the main command of aspseek.conf file

    # It's used to add start URL of server

    # You may use "Server" command as many times as number of different

    # servers required to be indexed.

    # You should specify FQDN in URL, so put http://myserver.mydomain.com/

    # instead of just http://myserver/

    # You can index ftp servers when using proxy:

    #Server ftp://localhost/

    #Server http://localhost/
    Add the following lines immediately below this section, taking care not to use the comment symbol (#) at the beginning of the line:
    Server http://www.server-to-index/

    Server http://www.anotherserver-to-index/


    You can add as many servers as you wish to index, one per line. When you are finished adding servers in this fashion, close the pico editor by using the ctrl-x key combination. It will prompt you to “Save modified buffer…” to which you will type in a “y”; then it will ask you if you’d like to save the file under the name “aspseek.conf”, to which you will agree by pressing the [Enter] key.
    If you have just a few sites to index, then inserting the “Server http://” directives into the aspseek.conf file will work just fine. If you will continue to add servers over time, then it may be better to create an external file that contains only the server directives and call that file from the aspseek.conf file with an “include” statement. Here’s how that works:
    Edit the aspseek.conf file with the pico editor, and locate the #Server <URL> section. Instead of adding line after line of servers to index, simply add an include statement as shown in the example below:
    #Server ftp://localhost/

    #Server http://localhost/


    Include sites.conf
    Save the file and exit from pico. Next, create and edit a sites.conf file:
    # pico -w sites.conf
    Add the Server directives one per line as in the following example:
    Server http://www.server-to-index/

    Server http://www.anotherserver-to-index/

    Server http://www.yetanotherserver.com/
    Save the file and exit from pico. When the Indexer program reads in its configuration information, the aspseek.conf file will pull in the values in the sites.conf file as if they were part of the aspseek.conf file.
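When the server list grows, it may be easier to keep a plain list of host names and generate the Server directives from it. The sketch below shows one way to do that. The input file hosts.txt and the function name make_sites_conf are made up for this example, and the output should be reviewed before it replaces /etc/aspseek/sites.conf.

```shell
#!/bin/sh
# Read host names from standard input, one per line, and print a Server
# directive for each. Blank lines and lines starting with "#" are skipped.
make_sites_conf() {
    while IFS= read -r host; do
        case $host in
            ''|'#'*) continue ;;
        esac
        echo "Server http://$host/"
    done
}

# Hypothetical usage with a host list kept in hosts.txt:
#   make_sites_conf < hosts.txt > sites.conf.new
printf '%s\n' www.server-to-index www.anotherserver-to-index | make_sites_conf
```

Fully qualified host names should be used in the list, as the comments in aspseek.conf advise.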
    The web pages that are returned by the Apache web server are created from a template that resides in the /etc/aspseek directory. By editing the “s.htm” file, you can change graphics, fonts, arrangements, and other cosmetic features that are rendered in a browser. Study this file carefully prior to making changes. It may be advantageous to make a backup copy of it before you start hacking on it:
    # cp s.htm s.htm.bkup
    If something goes awry, you can always recover the original file by switching the filenames in the command above.
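A slightly more careful variant of the backup step refuses to overwrite an existing backup, so that repeating the command after editing cannot replace the pristine copy. This is a sketch: backup_once and restore_backup are names invented for this example.

```shell
#!/bin/sh
# Copy FILE to FILE.bkup unless a backup already exists, so that a second
# run cannot overwrite the pristine copy with an edited one.
backup_once() {
    [ -e "$1.bkup" ] || cp -p "$1" "$1.bkup"
}

# Put the backup back in place of the working file.
restore_backup() {
    cp -p "$1.bkup" "$1"
}

# Example (as root):
#   backup_once /etc/aspseek/s.htm
#   ... edit s.htm ...
#   restore_backup /etc/aspseek/s.htm    # only if the edit went wrong
```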

    Stage 6: Indexing with the ASPSeek Search Engine

    Finally, the ASPSeek Search Engine is ready to do an indexing run. Once again, you will need to take on the identity of the aspseek user.


    # su aspseek
    After you become the aspseek user, you can run the index program:
    bash-2.05$ index
    The screen should show that the configuration files are loading and then show that the URLs are being indexed. When the indexing run is completed, you will be returned to the command prompt. You should exit the aspseek user shell and return to being the root user. At the end of its run, index will create a set of Delta files and organize the database so that it is easily searchable by the Search Daemon (searchd).
    There are dozens of command line switches that can be used with the index command. For example, if you have 20 servers that are being indexed, you may want to start more than one threaded process to do so. Issuing the following command will create 10 parallel processes that will spider and index the 20 servers:
    bash-2.05$ index -N 10
    You may find that automating this command relieves you of the chore of logging into the server, becoming the aspseek user, and remembering which command to use. You could create a cron job to have the indexer run on a repeating, regular basis. Perhaps you would like to re-index the servers at 2:00 a.m. every Sunday morning.
    To make this happen, you can create a shell script called “index-run.sh” and then use the crontab utility to add this to the list of cron jobs executed by the system.
    To set this up, begin by creating the index-run.sh script. Issue the following command:
    # pico -w index-run.sh
    Type in the following:
    #!/bin/sh
    /usr/bin/index -N 5
    Save the file and exit pico. Next, change the ownership and permissions on this file so that the aspseek user can execute it:

    # chown aspseek:aspseek index-run.sh

    # chmod 700 index-run.sh
    Next, create an aspseek.cron file that will be used to schedule the indexing process. From the command line, type:
    # pico -w aspseek.cron
    Type in the following information; pay close attention to spacing. This will run the index-run.sh script every Sunday morning at 2:00 a.m. (server time).

    0 2 * * 0 /etc/aspseek/index-run.sh

    Save the file and exit. Finally, set up a cron job for the aspseek user that executes the shell script that you created:


    # crontab -u aspseek /etc/aspseek/aspseek.cron
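The five schedule fields in a crontab entry are minute, hour, day of month, month and day of week. For a job that should fire once at 2:00 a.m. on Sunday, the minute field has to be 0, because an asterisk there would start the job every minute between 2:00 and 2:59. The sketch below assembles such a line from its parts; weekly_cron_line is a helper name made up for this example.

```shell
#!/bin/sh
# Print a crontab entry for a weekly job.
# Arguments: minute (0-59), hour (0-23), day of week (0 = Sunday), command.
weekly_cron_line() {
    echo "$1 $2 * * $3 $4"
}

weekly_cron_line 0 2 0 /etc/aspseek/index-run.sh

# After the crontab file has been installed, root can list the entry with:
#   crontab -u aspseek -l
```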

    Stage 7: Going Further

    There are literally dozens of different mechanisms that can be switched on and off in this search engine. Your best bet is to read the documentation thoroughly, with heavy emphasis on how to modify the configuration files. Be aware that indexing numerous web sites can fill a small disk drive quickly. As a system administrator, you will have to monitor the disk and might occasionally have to prune the database.


    To monitor the used and available disk space:
    # df -h
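The same check can be scripted so that a warning appears before the disk fills up. The following is a minimal sketch: the 90 percent threshold and the choice of the root filesystem are assumptions, and you would point it at whichever filesystem holds the ASPSeek data and the MySQL tables. Both function names are invented for this example.

```shell
#!/bin/sh
# Read `df -P` output on standard input and print the use percentage
# (without the % sign) of the first filesystem listed.
df_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report how full the filesystem holding a directory is, with a warning
# when usage reaches the given threshold.
check_disk() {
    used=$(df -P "$1" | df_use_percent)
    if [ "$used" -ge "$2" ]; then
        echo "WARNING: $1 is ${used}% full"
    else
        echo "$1 is ${used}% full"
    fi
}

check_disk / 90
```

Run from cron, a script like this can mail its output to the administrator.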
    To flush the database prior to starting anew:
    # su - aspseek

    bash-2.04$ index -C -w
    Simple web forms can be produced that point to the ASPSeek CGI program. A simple example is shown below:


    Simple Search with ASPSeek



    Search for:
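Only the title and the field label of the example form are visible above, so the markup below is a reconstruction under assumptions rather than the original listing. It assumes that s.cgi accepts a GET request with the search terms in a field named "q", and that /var/www/html is the web server's document root. The function name print_search_form is made up for this example. The shell wrapper simply writes the HTML out.

```shell
#!/bin/sh
# Print a minimal search form. The field name "q" and the GET method are
# assumptions about what s.cgi expects; the title and the label come from
# the example above.
print_search_form() {
    cat <<'EOF'
<html>
<head><title>Simple Search with ASPSeek</title></head>
<body>
<form method="get" action="/cgi-bin/s.cgi">
Search for: <input type="text" name="q" size="30">
<input type="submit" value="Search">
</form>
</body>
</html>
EOF
}

# Hypothetical destination for the page:
#   print_search_form > /var/www/html/search.html
print_search_form
```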








