Tuesday, January 2, 2007

Crawling

The Nutch website already has a tutorial on crawling available here, so I will just add some screenshots and further details where needed to make it easier.

For simplicity, this tutorial will focus only on the intranet part (rather than whole-web crawling).


Intranet Configuration
(Source: Nutch Website)

To configure things for intranet crawling you must:

1) Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:

http://lucene.apache.org/nutch/


Note: To start crawling from more than one url, you can add more files containing the urls to be crawled.



For example, in the urls folder, I have three flat files containing the url of SOC's homepage, NUS's homepage and Nutch's homepage respectively.
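To make this concrete, here is a sketch of how such a urls folder might be set up from the command line (the SOC and NUS homepage addresses are my assumptions; substitute your own seed urls):

mkdir urls
echo 'http://www.comp.nus.edu.sg/' > urls/soc
echo 'http://www.nus.edu.sg/' > urls/nus
echo 'http://lucene.apache.org/nutch/' > urls/nutch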

2) Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any url in the domain apache.org.

For example, I have limited the crawl to these two domains, as above.
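The corresponding lines in conf/crawl-urlfilter.txt would then look something like this (nus.edu.sg and apache.org are placeholders based on my seed urls; substitute the domains you actually want):

+^http://([a-z0-9]*\.)*nus.edu.sg/
+^http://([a-z0-9]*\.)*apache.org/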

3) Edit the file conf/nutch-site.xml, insert at minimum the following properties into it, and edit in proper values for the properties:

For example, the nutch-site.xml file will include something like the example below. You will need to input some information between the <value> and </value> tags.


The template for the xml file is located in the Nutch Tutorial.

Sidenote: I am using an image because I can't seem to get Blogspot to publish the text without removing the tags. Does anyone know how?
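For those who cannot see the image, here is a rough text rendering of the template (a sketch: these property names are the minimum required by the Nutch 0.9 tutorial; the descriptions are abbreviated, and the values are yours to fill in):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value></value> <!-- your crawler's name -->
    <description>HTTP 'User-Agent' request header.</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value></value> <!-- a short description of your crawler -->
    <description>Further description of our bot.</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value></value> <!-- a URL with information about your crawler -->
    <description>A URL to advertise in the User-Agent header.</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value></value> <!-- a contact email address -->
    <description>An email address to advertise in the HTTP headers.</description>
  </property>
</configuration>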


Intranet Crawling

(Source: Nutch Website)

Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

  • -dir dir names the directory to put the crawl in.
  • -threads threads determines the number of threads that will fetch in parallel.
  • -depth depth indicates the link depth from the root page that should be crawled.
  • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
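Following that advice, a shallow test run might look like this (the directory name and limits here are just illustrative values):

bin/nutch crawl urls -dir crawl.test -depth 2 -topN 10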

Note: During the searching later, the webapp will look for the folder crawl relative to where Tomcat is started, unless the searcher directory is set (this will be covered later). So for simplicity, you can just place the output of the crawl in a folder named crawl.



Monday, January 1, 2007

Searching

Searching is also pretty straightforward (in hindsight, after spending a few days trying to get it to work), once you realize a few things.


Searching
(from Nutch Website)

To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command ant war.)

Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:

rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war

Ok, this is the first time that I have used Tomcat or any similar tool, which explains the fumbling. Anyway, I finally realized that the above two lines are only necessary if you want the Nutch search to be the default (ROOT) application. Otherwise, the only thing you need to do is to copy the .war file to Tomcat's webapps folder.


This step simply copies the .war file to the webapps folder without renaming it.
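For example (a sketch, assuming the release file is named nutch-0.9.war, which matches the expanded folder seen later):

cp nutch-0.9.war ~/local/tomcat/webapps/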


The webapp finds its indexes in ./crawl, relative to where you start Tomcat, so use a command like:

~/local/tomcat/bin/catalina.sh start


If you did not place your crawl contents in the crawl folder, you will need to define the search directory.

1) First, just start Tomcat.



The .war file that you just copied to tomcat_dir/webapps will be automatically expanded, as evident from the image below.



If you have named your .war file, say, abc.war, then it will expand into an abc folder.
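So after startup, a listing of the webapps folder would look roughly like this (the nutch-0.9 names follow from my .war file; yours may differ):

ls ~/local/tomcat/webapps/
ROOT  nutch-0.9  nutch-0.9.war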

2) Amend the nutch-site.xml file in your tomcat_dir/webapps/expanded_dir/WEB-INF/classes folder. So, in my case, the file is located here:
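Going by the expanded folder above, that should be something like the following path (an assumption based on my setup):

~/local/tomcat/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml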




This is the content of my nutch-site.xml file. (Note: this nutch-site.xml file is the one located in the tomcat_dir, not the one in the nutch_dir folder.)



So you just need to put in the path of your crawl directory, where the indexes and segments are placed after the crawl. In this case, my folder is called crawl.test3.
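In case the image does not display, here is a sketch of what that file contains (searcher.dir is the Nutch property that points the webapp at the crawl data; the path below is a placeholder for wherever your crawl.test3 actually lives):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/path/to/crawl.test3</value> <!-- placeholder: use your actual crawl folder -->
    <description>Path to the folder holding the crawl output (indexes and segments).</description>
  </property>
</configuration>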


Then visit http://localhost:8080/ and have fun!

Note: If you are using another port for Tomcat, please use the corresponding port number. For example, if port 8888 is used, the address will be http://localhost:8888.

This will bring you to the ROOT application. If you have not changed anything in the original root folder, then you will be at the Tomcat start page, which looks like this:


However, if you have changed the root folder like this:

rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war

then the Nutch search page will be the root application.

For my example, my Nutch search page is at http://localhost:8080/nutch-0.9/ as shown in the image below.




Now, you can verify that your search works by inputting some search queries. If there are no hits when there should be, maybe the search directory is not set correctly, or the problem may lie with the crawling part.
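One quick sanity check is to list the crawl folder itself; a completed Nutch 0.9 crawl should contain subfolders along these lines (treat the listing as illustrative):

ls crawl.test3
crawldb  index  indexes  linkdb  segments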

Have fun!