Installing Nutch 1.0 on OS X

Today I started work on a little project that required a crawler, and Nutch seemed to do most of what I needed. The Nutch team conveniently released Nutch 1.0 late in March 2009, so I had a brand new release to test out. Installing Nutch 1.0 on a Mac is not as straightforward as I thought. I ran into a lot of unexpected issues, so here is my cookbook description of how to successfully install Nutch 1.0 on your Mac.

  1. Download the latest source code from the Apache SVN repository. I tried running Nutch from the tarball without success, and I also tried to compile the source from the tarball, but a post on the Nutch forum clearly states that this will not work.
  2. Set your JAVA_HOME and NUTCH_JAVA_HOME variables. Again, this is not straightforward: both need to point to your real installation of Java 1.6 (earlier versions of Java will fail). I set both variables to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home, because I could not get the /Library/Java/Home symbolic link to work properly.
  3. Compile the source code using Ant (I built it in Eclipse).
  4. Set up your Nutch configuration by following the tutorial by Peter P. Wang.
  5. Run your first crawl with: ./bin/nutch crawl urls -dir crawl -depth 3 -topN 50
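
As a rough sketch of step 2, this is one way to set both variables in the shell you will build and crawl from. The path is the one given in step 2. The JAVA_16_HOME helper variable is only a name I made up for illustration, and the check at the end is just a sanity test, not something Nutch requires.

```shell
# Path from step 2; adjust it if your Java 1.6 lives somewhere else.
# JAVA_16_HOME is a hypothetical helper name, not something Nutch reads.
JAVA_16_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"

# Point both variables at the real installation, not at the symbolic link.
export JAVA_HOME="$JAVA_16_HOME"
export NUTCH_JAVA_HOME="$JAVA_16_HOME"

# Sanity check: show which JVM the variables actually resolve to.
if [ -x "$JAVA_HOME/bin/java" ]; then
    "$JAVA_HOME/bin/java" -version
else
    echo "No java executable under $JAVA_HOME" >&2
fi
```

The exports only last for the current terminal session, so the Ant build in step 3 and the crawl in step 5 should be run from that same shell.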

Most of the issues I encountered were related to the Java version and the fact that the Java Preferences application under /Applications/Utilities/Java does not really change the JAVA_HOME directory /Library/Java/Home properly. So make sure you have set both JAVA_HOME and NUTCH_JAVA_HOME, and that OS X does not fool you when it pretends to be symbolically linking to the 1.6 installation.
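
If you want to see for yourself where the symbolic link points, something like the following should do it. This is only a sketch: describe_link is a helper name I made up, and it relies on nothing more than the standard readlink command.

```shell
# describe_link is a hypothetical helper: it prints the target of a
# symbolic link, or says so if the path is not a symbolic link at all.
describe_link() {
    if [ -L "$1" ]; then
        echo "$1 -> $(readlink "$1")"
    else
        echo "$1 is not a symbolic link"
    fi
}

# Check where the link mentioned above really points on this machine.
describe_link /Library/Java/Home
```

If the target it prints is not the 1.6 directory, that would explain why relying on the link fails even after changing the setting in Java Preferences.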

Good luck.

April 7th, 2009

