Tech. memo: Nutch and Solr

Trying out the Nutch crawler.

Sakura VPS
CentOS 6.2
OpenJDK 1.6.0_24
Apache Nutch 1.6
Apache Solr 4.0

Getting the Nutch binary package

apache-nutch-1.6-bin.tar.gz
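
A minimal sketch of fetching the package from the command line. The URL is an assumption (the usual Apache archive layout), so check it before relying on it; fetch_nutch is a hypothetical helper name, not part of Nutch.

```shell
NUTCH_VERSION=1.6
NUTCH_TARBALL="apache-nutch-${NUTCH_VERSION}-bin.tar.gz"
# Assumed archive location; verify it before use.
NUTCH_URL="https://archive.apache.org/dist/nutch/${NUTCH_VERSION}/${NUTCH_TARBALL}"

# fetch_nutch [DIR]: download the tarball into DIR (default /opt)
# unless a file with that name is already there.
fetch_nutch() {
  dir=${1:-/opt}
  [ -f "$dir/$NUTCH_TARBALL" ] || wget -P "$dir" "$NUTCH_URL"
}
```

Calling fetch_nutch with no argument would place the tarball in /opt, matching the extraction step that follows.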

Extract it (under /opt here)

# cd /opt
# ls
apache-nutch-1.6-bin.tar.gz
# tar xvf apache-nutch-1.6-bin.tar.gz

Running the nutch command

Run the following. If needed, add it to .bashrc or similar as well.

export PATH=$PATH:/opt/apache-nutch-1.6/bin
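
If the line goes into .bashrc, it helps to avoid appending it twice. A minimal sketch; add_line_once is a hypothetical helper name, and the commented call at the end shows the intended use with the path from this memo.

```shell
# add_line_once FILE LINE: append LINE to FILE unless an identical line exists.
add_line_once() {
  touch "$1" || return 1
  # -x matches whole lines only, -F treats the pattern as a fixed string.
  grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Intended use (single quotes keep $PATH unexpanded in the file):
# add_line_once ~/.bashrc 'export PATH=$PATH:/opt/apache-nutch-1.6/bin'
```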

Then try running the command.

# cd apache-nutch-1.6
# nutch
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Preparing to crawl

Set the value of the "http.agent.name" property in nutch-default.xml, in the conf directory, to a suitable string.

# vi conf/nutch-default.xml

nutch-default.xml

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
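
As an alternative to editing nutch-default.xml, the same property can be set in conf/nutch-site.xml, whose values override the defaults. A minimal sketch; write_agent_site is a hypothetical helper name, and it overwrites any existing nutch-site.xml in the given directory.

```shell
# write_agent_site CONF_DIR AGENT: write a minimal nutch-site.xml
# that sets http.agent.name. Overwrites an existing nutch-site.xml.
write_agent_site() {
  cat > "$1/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}

# Intended use, from the Nutch installation directory:
# write_agent_site conf "My Nutch Spider"
```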

Create a text file, in any directory, listing the sites the crawl should start from.
In the example below, it is created in a "urls" directory under the installation directory.
Example:

# mkdir urls
# vi urls/seed.txt

seed.txt

http://nutch.apache.org/
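
The same seed file can be created without opening an editor. A minimal sketch, run from the installation directory, using the single URL from this memo (one URL per line):

```shell
# Create the seed directory and write the start URL, one URL per line.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt
```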

Crawling

Run it with the following command.

# nutch crawl urls -dir crawl -depth 1 -topN 2
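
The help output marks the crawl command as deprecated. Roughly the same work can be done with the individual subcommands listed in that output. The following is a sketch of one round (comparable to -depth 1), not the exact internals of crawl; crawl_round is a hypothetical helper name.

```shell
# crawl_round CRAWL_DIR SEED_DIR TOPN: one inject/generate/fetch/parse/update round.
crawl_round() {
  crawl_dir=$1 seed_dir=$2 topn=$3
  nutch inject "$crawl_dir/crawldb" "$seed_dir" || return 1
  nutch generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$topn" || return 1
  # Segments are named by timestamp, so the last one listed is the newest.
  segment=$(ls -d "$crawl_dir"/segments/2* | tail -1)
  nutch fetch "$segment" || return 1
  nutch parse "$segment" || return 1
  nutch updatedb "$crawl_dir/crawldb" "$segment" || return 1
  nutch invertlinks "$crawl_dir/linkdb" -dir "$crawl_dir/segments"
}

# Only run when nutch is actually on PATH.
if command -v nutch >/dev/null 2>&1; then
  crawl_round crawl urls 2
else
  echo "nutch not found on PATH; nothing run" >&2
fi
```

For a deeper crawl, the generate, fetch, parse and updatedb steps would be repeated once per level before invertlinks.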

Getting the Solr package

apache-solr-4.0.0.tgz

Extract it (under /opt here)

# cd /opt
# ls
apache-solr-4.0.0.tgz
# tar xvf apache-solr-4.0.0.tgz

Configuring and starting Solr

Nutch ships with a schema file for Solr, so copy it over.

# cp /opt/apache-nutch-1.6/conf/schema.xml /opt/apache-solr-4.0.0/example/solr/collection1/conf/schema.xml

However, starting Solr with the following commands produces various errors and it does not come up...

# cd /opt/apache-solr-4.0.0/example/
# java -jar start.jar

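Edit the configuration file "schema.xml" a little. The changes below are most likely needed because the solrconfig.xml shipped with Solr 4.0 refers to the text and _version_ fields, and because EnglishPorterFilterFactory is not available in Solr 4.0.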

# vi solr/collection1/conf/schema.xml

Add (inside the fields element)

<field name="text" type="text" stored="false" indexed="true"/>

<field name="_version_" type="long" indexed="true" stored="true" />

Comment out

<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
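
The comment-out step can also be scripted. A minimal sketch, assuming GNU sed (for -i) and that each filter element sits on a single line as in the snippet above; comment_out_porter_filter is a hypothetical helper name.

```shell
# comment_out_porter_filter FILE: wrap every EnglishPorterFilterFactory
# filter element in an XML comment. Lines that already contain a comment
# opener are skipped, so running it twice is harmless.
comment_out_porter_filter() {
  sed -i '/<!--/!s|<filter class="solr\.EnglishPorterFilterFactory"[^>]*/>|<!-- & -->|' "$1"
}

# Intended use, from /opt/apache-solr-4.0.0/example:
# comment_out_porter_filter solr/collection1/conf/schema.xml
```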

Solr should now start.
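
Since start.jar runs in the foreground, startup can be checked from a second terminal. A minimal sketch, assuming curl is installed and that the core exposes the /admin/ping handler (the example configuration shipped with Solr 4.0 should define it); solr_is_up is a hypothetical helper name.

```shell
# solr_is_up CORE_URL: succeed if the core's ping handler answers
# with a successful HTTP status within 5 seconds.
solr_is_up() {
  curl -sf -o /dev/null --max-time 5 "$1/admin/ping"
}

# Intended use:
# solr_is_up http://localhost:8983/solr/collection1 && echo "Solr is up"
```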

Crawling again, this time with indexing

# nutch crawl urls/ -solr http://localhost:8983/solr/ -depth 2 -topN 50

Searching from the Solr web UI

Open http://localhost:8983/solr/ in a browser. Select "Query" under "collection1" on the left, then click "Execute Query"!
Hmm... the XML output is not displayed nicely.

It seems to be a browser issue. In Chrome, installing an extension fixed it.
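
Another way around the rendering problem is to query Solr from the command line and ask for indented JSON with the wt and indent parameters. A minimal sketch, assuming curl and the default collection1 core; solr_query_url is a hypothetical helper name.

```shell
# solr_query_url QUERY [CORE_URL]: build a select URL that asks for indented JSON.
solr_query_url() {
  base=${2:-http://localhost:8983/solr/collection1}
  printf '%s/select?q=%s&wt=json&indent=true\n' "$base" "$1"
}

# Intended use (match all documents):
# curl -s "$(solr_query_url '*:*')"
```

The query string is passed through as-is, so anything beyond simple queries like *:* would need URL encoding first.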

Tuesday, January 8, 2013

That's all.
