Tech. memo

Web analytics with Piwik on AWS

2014-08-03T13:12:00.001+09:00

I am going to be using Piwik, a web access log analytics tool, at work, so I tried setting it up on AWS.

AWS EC2 (Red Hat Enterprise Linux 6.5)

I assume you already have an AWS account.

Piwik 2.4.1
MacBook Air, OS X 10.9.4 (client machine)

Steps for creating the machine on AWS EC2. (Skip this part if you already have one.)

Connect to the AWS Management Console.
https://console.aws.amazon.com/console/home
Select "EC2".
Select "Launch Instance".
For "Red Hat Enterprise Linux 6.5 (PV)", choose "64-bit" and click "Select".
For now, choose the type "t1.micro" and select "Next: Configure Instance Details".
On the "Step 3: Configure Instance Details" screen, tick the "Protect against accidental termination" checkbox under "Enable termination protection", then select "Next: Add Storage".
Set Size to "30" and Volume Type to "General Purpose (SSD)", then select "Next: Tag Instance".
Tag the instance however you like, for example Key = "Name", Value = "piwik_test". Select "Next: Configure Security Group".
On the "Step 6: Configure Security Group" screen, set the Source of the SSH rule to "My IP", check the IP address that is filled in, and select "Review and Launch".
Select "Launch". Choose "Create a new key pair", enter a Key pair name such as "piwik_test", click "Download key pair", and then select "Launch Instance".
Connect to the machine via ssh and set a password for root.
```
$ sudo su -
# passwd
```

Install Piwik. (Mostly done as root.)

First, download Piwik from the official site.

# wget http://builds.piwik.org/piwik.zip
--2014-08-02 08:27:36--  http://builds.piwik.org/piwik.zip
Resolving builds.piwik.org... 176.31.58.94
Connecting to builds.piwik.org|176.31.58.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10143342 (9.7M) [application/zip]
Saving to: “piwik.zip”

100%[======================================>] 10,143,342  2.67M/s   in 4.6s

2014-08-02 08:27:42 (2.10 MB/s) - “piwik.zip” saved [10143342/10143342]

Install Apache (web server)
```
# yum -y install httpd
```

Install the PHP-related packages

# yum -y install php php-mysql php-pdo php-mbstring php-xml php-gd

Install MySQL
# yum -y install mysql mysql-server mysql-libs

Start MySQL and run the initial setup

# service mysqld start
# mysql_secure_installation
NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MySQL
      SERVERS IN PRODUCTION USE!  PLEASE READ EACH STEP CAREFULLY!


In order to log into MySQL to secure it, we'll need the current
password for the root user.  If you've just installed MySQL, and
you haven't set the root password yet, the password will be blank,
so you should just press enter here.

Enter current password for root (enter for none):
OK, successfully used password, moving on...

Setting the root password ensures that nobody can log into the MySQL
root user without the proper authorisation.

Set root password? [Y/n] Y
New password:
Re-enter new password:
Password updated successfully!
Reloading privilege tables..
 ... Success!


By default, a MySQL installation has an anonymous user, allowing anyone
to log into MySQL without having to have a user account created for
them.  This is intended only for testing, and to make the installation
go a bit smoother.  You should remove them before moving into a
production environment.

Remove anonymous users? [Y/n] Y
 ... Success!

Normally, root should only be allowed to connect from 'localhost'.  This
ensures that someone cannot guess at the root password from the network.

Disallow root login remotely? [Y/n] Y
 ... Success!

By default, MySQL comes with a database named 'test' that anyone can
access.  This is also intended only for testing, and should be removed
before moving into a production environment.

Remove test database and access to it? [Y/n] Y
 - Dropping test database...
 ... Success!
 - Removing privileges on test database...
 ... Success!

Reloading the privilege tables will ensure that all changes made so far
will take effect immediately.

Reload privilege tables now? [Y/n] Y
 ... Success!

Cleaning up...



All done!  If you've completed all of the above steps, your MySQL
installation should now be secure.

Thanks for using MySQL!
# mysql -uroot -p
<omitted>
mysql> create user piwik identified by '****';    <== replace '****' with a password of your choice
Query OK, 0 rows affected (0.00 sec)

mysql> create database piwik;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| piwik              |
+--------------------+
3 rows in set (0.00 sec)

mysql> grant all on piwik.* to piwik;
Query OK, 0 rows affected (0.00 sec)

# mysql piwik -upiwik -p
Enter password:

Install the Piwik files

# mkdir /var/www/html/piwik
# unzip piwik.zip
# cp -rp piwik/* /var/www/html/piwik
# rm -f How\ to\ install\ Piwik.html
# chown -R apache:apache /var/www/html/piwik

Start Apache
```
# service httpd start
```
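Open "http://<Public DNS or IP address of the instance>/piwik/" in a browser. The Piwik welcome screen appears; select "Next".
On the system check screen, confirm that every item is checked, then select "Next".
Fill in the database setup form (the login and database name are the "piwik" user and "piwik" database created above). For the password, enter the one you set when creating the piwik user in MySQL.
The "Database server" field defaults to "127.0.0.1", but the connection failed with that value; changing it to "localhost" made it work. I have not tracked down the cause. (MySQL clients reach "localhost" through the local Unix socket and "127.0.0.1" over TCP, so the problem is presumably somewhere on the TCP path.)
Select "Next" and a message reports that the tables were created successfully. Select "Next" again.
On the super user screen, create a user with any name and password you like. Selecting "Next" takes you to the website setup screen, where I registered this blog.
Select "Next" and the JavaScript tag to embed in the target website is displayed. Copy it and paste it just before "</body>" in the target site. Select "Next", and finally select "Continue to Piwik".
Open "http://<Public DNS or IP address of the instance>/piwik/" in a browser again and log in with the super user account you just created.
The dashboard screen is displayed.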
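
On the "127.0.0.1" versus "localhost" puzzle in the database setup step: one way to narrow it down is to check whether MySQL is reachable over TCP at all. The following is only a small diagnostic sketch in Python, not something I ran at the time; the port 3306 is MySQL's default and is an assumption about this server.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, and so on.
        return False


if __name__ == "__main__":
    # 3306 is MySQL's default TCP port; change it if mysqld listens elsewhere.
    print("127.0.0.1:3306 reachable over TCP:", can_connect("127.0.0.1", 3306))
```

If this prints False on the server itself, mysqld is not accepting TCP connections there. If it prints True, the block is more likely specific to the web server process, for example a security policy applied to httpd.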

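It looks like you can see all kinds of things! It really does resemble Google Analytics. Piwik looks interesting.

AWS is also handy, since you can create a virtual machine in no time!

The end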

Topic analysis with R (LDA: Latent Dirichlet Allocation)

2013-03-13T01:59:00.001+09:00

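Let's try topic analysis, one of the text mining techniques, using the R language.

Put simply, topic analysis estimates the themes of a given document by means of topics, where each topic is expressed as a set of word occurrence probabilities. (At least, that is how I understand it.)
Here this is done with a topic model called LDA (Latent Dirichlet Allocation). An R package for LDA, lda, is available on CRAN.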
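
To make "topics as word probabilities" concrete, here is a toy calculation in Python. All topic names and numbers are invented for illustration and have nothing to do with the data used later. The sketch computes the probability of each word in a document as the mixture p(w | d) = sum over k of theta[k] * phi[k][w].

```python
# Toy numbers, invented purely for illustration.
# phi[k][w]: probability of word w under topic k (each row sums to 1).
phi = {
    "sports": {"ball": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"ball": 0.1, "team": 0.2, "vote": 0.7},
}
# theta[k]: share of each topic in one document (sums to 1).
theta = {"sports": 0.75, "politics": 0.25}


def word_probabilities(theta, phi):
    """Return p(w | document) = sum over topics k of theta[k] * phi[k][w]."""
    vocab = sorted({w for dist in phi.values() for w in dist})
    return {w: sum(theta[k] * phi[k].get(w, 0.0) for k in theta) for w in vocab}


probs = word_probabilities(theta, phi)
for word, p in probs.items():
    print(f"{word}: {p:.3f}")
```

LDA works in the opposite direction: given only the words of many documents, it estimates plausible values for theta and phi.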

Mac Book Air
Mac OS 10.8.2
R 2.15.1

Installation

I assume R is already installed.

> install.packages("lda")
--- Please select a CRAN mirror for use in this session ---

Loading Tcl/Tk interface ... done
CRAN mirror 

 1: 0-Cloud                       2: Argentina (La Plata)       
 3: Argentina (Mendoza)           4: Australia (Canberra)       
 5: Australia (Melbourne)         6: Austria                    
 7: Belgium                       8: Brazil (PR)                
 9: Brazil (RJ)                  10: Brazil (SP 1)              
11: Brazil (SP 2)                12: Canada (BC)                
13: Canada (NS)                  14: Canada (ON)                
15: Canada (QC 1)                16: Canada (QC 2)              
17: Chile                        18: China (Beijing 1)          
19: China (Beijing 2)            20: China (Guangzhou)          
21: China (Hefei)                22: China (Xiamen)             
23: Colombia (Bogota)            24: Colombia (Cali)            
25: Denmark                      26: Ecuador                    
27: France (Lyon 1)              28: France (Lyon 2)            
29: France (Montpellier)         30: France (Paris 1)           
31: France (Paris 2)             32: Germany (Berlin)           
33: Germany (Bonn)               34: Germany (Falkenstein)      
35: Germany (Goettingen)         36: Greece                     
37: Hungary                      38: India                      
39: Indonesia                    40: Iran                       
41: Ireland                      42: Italy (Milano)             
43: Italy (Padua)                44: Italy (Palermo)            
45: Japan (Hyogo)                46: Japan (Tsukuba)            
47: Japan (Tokyo)                48: Korea (Seoul 1)            
49: Korea (Seoul 2)              50: Latvia                     
51: Mexico (Mexico City)         52: Mexico (Texcoco)           
53: Netherlands (Amsterdam)      54: Netherlands (Utrecht)      
55: New Zealand                  56: Norway                     
57: Philippines                  58: Poland                     
59: Russia                       60: Singapore                  
61: Slovakia                     62: South Africa (Cape Town)   
63: South Africa (Johannesburg)  64: Spain (Madrid)             
65: Sweden                       66: Switzerland                
67: Taiwan (Taichung)            68: Taiwan (Taipei)            
69: Thailand                     70: Turkey                     
71: UK (Bristol)                 72: UK (London)                
73: UK (St Andrews)              74: USA (CA 1)                 
75: USA (CA 2)                   76: USA (IA)                   
77: USA (IN)                     78: USA (KS)                   
79: USA (MD)                     80: USA (MI)                   
81: USA (MO)                     82: USA (OH)                   
83: USA (OR)                     84: USA (PA 1)                 
85: USA (PA 2)                   86: USA (TN)                   
87: USA (TX 1)                   88: USA (WA 1)                 
89: USA (WA 2)                   90: Venezuela                  
91: Vietnam                      

Selection: 
Enter an item from the menu, or 0 to exit
Selection: 46
trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/lda_1.3.2.tgz'
Content type 'application/x-gzip' length 460292 bytes (449 Kb)
opened URL
==================================================
downloaded 449 Kb


The downloaded binary packages are in
  /var/folders/1c/8xf1yn7x5zl37mfvqjrl31ph0000gp/T//RtmpIxno8E/downloaded_packages

Next, install the related packages as well.

> install.packages("reshape2")
also installing the dependencies ‘plyr’, ‘stringr’

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/plyr_1.8.tgz'
Content type 'application/x-gzip' length 726655 bytes (709 Kb)
opened URL
==================================================
downloaded 709 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/stringr_0.6.2.tgz'
Content type 'application/x-gzip' length 70411 bytes (68 Kb)
opened URL
==================================================
downloaded 68 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/reshape2_1.2.2.tgz'
Content type 'application/x-gzip' length 56717 bytes (55 Kb)
opened URL
==================================================
downloaded 55 Kb


The downloaded binary packages are in
  /var/folders/1c/8xf1yn7x5zl37mfvqjrl31ph0000gp/T//RtmpIxno8E/downloaded_packages
> install.packages("Matrix")
trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/Matrix_1.0-11.tgz'
Content type 'application/x-gzip' length 4066929 bytes (3.9 Mb)
opened URL
==================================================
downloaded 3.9 Mb


The downloaded binary packages are in
  /var/folders/1c/8xf1yn7x5zl37mfvqjrl31ph0000gp/T//RtmpIxno8E/downloaded_packages
> install.packages("ggplot2")
also installing the dependencies ‘colorspace’, ‘RColorBrewer’, ‘dichromat’, ‘munsell’, ‘labeling’, ‘digest’, ‘gtable’, ‘scales’, ‘proto’

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/colorspace_1.2-1.tgz'
Content type 'application/x-gzip' length 404391 bytes (394 Kb)
opened URL
==================================================
downloaded 394 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/RColorBrewer_1.0-5.tgz'
Content type 'application/x-gzip' length 22390 bytes (21 Kb)
opened URL
==================================================
downloaded 21 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/dichromat_2.0-0.tgz'
Content type 'application/x-gzip' length 144378 bytes (140 Kb)
opened URL
==================================================
downloaded 140 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/munsell_0.4.tgz'
Content type 'application/x-gzip' length 125546 bytes (122 Kb)
opened URL
==================================================
downloaded 122 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/labeling_0.1.tgz'
Content type 'application/x-gzip' length 35236 bytes (34 Kb)
opened URL
==================================================
downloaded 34 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/digest_0.6.3.tgz'
Content type 'application/x-gzip' length 161113 bytes (157 Kb)
opened URL
==================================================
downloaded 157 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/gtable_0.1.2.tgz'
Content type 'application/x-gzip' length 60865 bytes (59 Kb)
opened URL
==================================================
downloaded 59 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/scales_0.2.3.tgz'
Content type 'application/x-gzip' length 169299 bytes (165 Kb)
opened URL
==================================================
downloaded 165 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/proto_0.3-10.tgz'
Content type 'application/x-gzip' length 454829 bytes (444 Kb)
opened URL
==================================================
downloaded 444 Kb

trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/ggplot2_0.9.3.1.tgz'
Content type 'application/x-gzip' length 2659920 bytes (2.5 Mb)
opened URL
==================================================
downloaded 2.5 Mb


The downloaded binary packages are in
  /var/folders/1c/8xf1yn7x5zl37mfvqjrl31ph0000gp/T//RtmpIxno8E/downloaded_packages
> install.packages("penalized")
trying URL 'http://cran.md.tsukuba.ac.jp/bin/macosx/leopard/contrib/2.15/penalized_0.9-42.tgz'
Content type 'application/x-gzip' length 522276 bytes (510 Kb)
opened URL
==================================================
downloaded 510 Kb


The downloaded binary packages are in
  /var/folders/1c/8xf1yn7x5zl37mfvqjrl31ph0000gp/T//RtmpIxno8E/downloaded_packages

Running the demo

> library("lda")
> demo(lda)


 demo(lda)
 ---- ~~~

Type <Return> to start : 

> require("ggplot2")
Loading required package: ggplot2

> require("reshape2")
Loading required package: reshape2

> data(cora.documents)

> data(cora.vocab)

> theme_set(theme_bw())  

> set.seed(8675309)

> K <- 10 ## Num clusters

> result <- lda.collapsed.gibbs.sampler(cora.documents,
+                                       K,  ## Num clusters
+                                       cora.vocab,
+                                       25,  ## Num iterations
+                                       0.1,
+                                       0.1,
+                                       compute.log.likelihood=TRUE)

> ## Get the top words in the cluster
> top.words <- top.topic.words(result$topics, 5, by.score=TRUE)

> ## Number of documents to display
> N <- 10

> topic.proportions <- t(result$document_sums) / colSums(result$document_sums)

> topic.proportions <-
+   topic.proportions[sample(1:dim(topic.proportions)[1], N),]

> topic.proportions[is.na(topic.proportions)] <-  1 / K

> colnames(topic.proportions) <- apply(top.words, 2, paste, collapse=" ")

> topic.proportions.df <- melt(cbind(data.frame(topic.proportions),
+                                    document=factor(1:N)),
+                              variable.name="topic",
+                              id.vars = "document")

> qplot(topic, value, fill=document, ylab="proportion",
+       data=topic.proportions.df, geom="bar") +
+   opts(axis.text.x = theme_text(angle=90, hjust=1)) +  
+   coord_flip() +
+   facet_wrap(~ document, ncol=5)
'opts' is deprecated. Use 'theme' instead. (Deprecated; last used in version 0.9.1)
theme_text is deprecated. Use 'element_text' instead. (Deprecated; last used in version 0.9.1)
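Hit <Return> to see next plot: 

Pressing the Return key displays the plot.

Each rectangular panel shows the analysis result for one of the 10 documents.
The items on the vertical axis are the topics, and the words listed for each item are the representative words of that topic. You work out what a topic is about from this group of words.
The horizontal axis is the proportion of each topic within each document (at most 1). In LDA, the theme of a document is expressed as a mixture of several topics.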
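
As I read the demo code, the proportions on the horizontal axis come from normalizing the per-document topic counts (document_sums in the lda result) so that each document sums to 1, with 1/K filled in where the division is undefined. A Python sketch of that calculation follows; the counts are hypothetical and the function name is mine.

```python
def topic_proportions(document_sums):
    """Normalize topic counts per document.

    document_sums[k][d] is the number of words in document d assigned to
    topic k. Returns proportions[d][k], where each row sums to 1. A document
    with no words gets the uniform value 1/K for every topic.
    """
    K = len(document_sums)
    D = len(document_sums[0])
    proportions = []
    for d in range(D):
        total = sum(document_sums[k][d] for k in range(K))
        if total == 0:
            proportions.append([1.0 / K] * K)
        else:
            proportions.append([document_sums[k][d] / total for k in range(K)])
    return proportions


# Hypothetical counts: 3 topics (rows) x 2 documents (columns).
counts = [[8, 0],
          [2, 0],
          [0, 0]]
print(topic_proportions(counts))
```

The first document comes out as 80% topic 1 and 20% topic 2; the second, empty document falls back to one third for each topic.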

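Loading external data

As external data I used documents from NLTK, preprocessed as follows:

Convert to lowercase
Remove every character outside the ASCII range from "SP (\x20)" to "~ (\x7e)"
Remove the trailing period from each token (the function is named removeEndComma, but its pattern matches ".")
Strip suffixes such as -ing, -s, and -ed (using nltk's LancasterStemmer())
Remove meaningless tokens (tokens that begin with anything other than a lowercase letter)
Remove stop words (and, a, the, and so on)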

> doclines <- scan("/Users/shu222/.../docR.txt", what="character", sep="\n")
Read 6 items
> vocablist <- scan("/Users/shu222/.../vocab.txt", what="character", sep="\n")
Read 11826 items
> library(lda)
> corpus <- lexicalize(doclines, lower=TRUE, sep=" ", vocab=vocablist)
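
For readers unfamiliar with the corpus format: as far as I understand the lda package, lexicalize turns each document into a two-row integer matrix, with zero-based indices into the vocabulary in the first row and a count of 1 in the second. The Python sketch below only mimics that idea and is not the package's implementation; dropping words that are missing from the vocabulary is my assumption.

```python
def lexicalize(doclines, vocab, sep=" ", lower=True):
    """Convert text lines into [word_ids, counts] pairs against a fixed vocab.

    word_ids are zero-based positions in vocab; every occurrence gets its own
    entry with count 1. Words not found in vocab are skipped (an assumption).
    """
    index = {word: i for i, word in enumerate(vocab)}
    corpus = []
    for line in doclines:
        if lower:
            line = line.lower()
        tokens = [t for t in line.split(sep) if t]
        ids = [index[t] for t in tokens if t in index]
        corpus.append([ids, [1] * len(ids)])
    return corpus


# Hypothetical mini corpus and vocabulary.
vocab = ["cork", "pirate", "ship", "wine"]
docs = ["pirate ship pirate", "wine Cork unknownword"]
print(lexicalize(docs, vocab))
```

The point of passing an explicit vocabulary, as in the R call above, is that the word indices stay consistent with vocab.txt across all documents.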


Run LDA with collapsed Gibbs sampling.


Number of topics: 10
Number of sampling iterations: 25
Hyperparameter α: 0.1
Hyperparameter β: 0.1

> result <- lda.collapsed.gibbs.sampler(corpus, 10, vocablist, 25, 0.1, 0.1, compute.log.likelihood=TRUE)
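
To show what a collapsed Gibbs sampler for LDA does with these parameters, here is a minimal pure-Python sketch. It is a textbook-style illustration under my own naming, not the lda package's code. The parameter eta plays the role of β above, documents are plain lists of word ids, and the tiny corpus is invented.

```python
import random


def lda_collapsed_gibbs(docs, K, V, iterations, alpha, eta, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in range(V).
    Returns (topics, document_sums):
      topics[k][w]        = number of times word w is assigned to topic k
      document_sums[k][d] = number of words in document d assigned to topic k
    """
    rng = random.Random(seed)
    topics = [[0] * V for _ in range(K)]
    topic_totals = [0] * K
    document_sums = [[0] * len(docs) for _ in range(K)]

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            topics[k][w] += 1
            topic_totals[k] += 1
            document_sums[k][d] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token's current assignment out of the counts.
                k_old = assignments[d][i]
                topics[k_old][w] -= 1
                topic_totals[k_old] -= 1
                document_sums[k_old][d] -= 1

                # p(z = k | all other assignments), up to a constant:
                # (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta)
                weights = [
                    (document_sums[k][d] + alpha)
                    * (topics[k][w] + eta)
                    / (topic_totals[k] + V * eta)
                    for k in range(K)
                ]
                k_new = rng.choices(range(K), weights=weights)[0]

                # Put the token back with its newly sampled topic.
                assignments[d][i] = k_new
                topics[k_new][w] += 1
                topic_totals[k_new] += 1
                document_sums[k_new][d] += 1

    return topics, document_sums


# Tiny invented corpus: word ids 0-2 and 3-5 never appear together.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
topics, document_sums = lda_collapsed_gibbs(
    docs, K=2, V=6, iterations=25, alpha=0.1, eta=0.1
)
print(document_sums)
```

Small values of alpha and eta such as 0.1 favor documents that concentrate on few topics and topics that concentrate on few words.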


Visualization.

> require("ggplot2")
Loading required package: ggplot2
Warning message:
package ‘ggplot2’ was built under R version 2.15.2
> require("reshape2")
Loading required package: reshape2
> theme_set(theme_bw())
> top.words <- top.topic.words(result$topics, 5, by.score=TRUE)
> N <- 6
> topic.proportions <- t(result$document_sums) / colSums(result$document_sums)
> topic.proportions <- topic.proportions[sample(1:dim(topic.proportions)[1], N),]
> topic.proportions[is.na(topic.proportions)] <- 1 / 10
> colnames(topic.proportions) <- apply(top.words, 2, paste, collapse=" ")
> topic.proportions.df <- melt(cbind(data.frame(topic.proportions), document=factor(1:N)), variable.name="topic", id.vars = "document")
> qplot(topic, value, fill=document, ylab="proportion", data=topic.proportions.df, geom="bar") +
+ opts(axis.text.x = theme_text(angle=90, hjust=1)) +
+ coord_flip() + facet_wrap(~ document, ncol=3)

This produces a graph of the topic proportions.



For comparison, I also drew the graph for a run with 500 iterations.



Document 1 appears to be about "Pirates of the Caribbean".
Program for creating the data files
For reference, here is the Python code I used this time. Apologies for the sparse comments.

#!/usr/bin/env python
# -*- encoding: utf_8 -*-

import re
import nltk

class MyDocs:
  def __init__(self, rawDocs, stopwordFilePath, stemmer):
    self.docs = []
    for rawDoc in rawDocs:
      tokens = nltk.word_tokenize(rawDoc)

      tokens = self.tolowerTokens(tokens)
      tokens = self.removeNotASCII(tokens)
      tokens = self.removeEndComma(tokens)
      tokens = self.stemmingTokens(tokens, stemmer)
      tokens = self.removeMeaninglessTokens(tokens)
      tokens = self.removeStopword(tokens, stopwordFilePath)

      self.docs.append(tokens)

  def tolowerTokens(self, tokens):
    newTokens = [token.lower() for token in tokens]
    return newTokens

  def removeNotASCII(self, tokens):
    newTokens = []
    for token in tokens:
      rep = re.sub(r'[^\x20-\x7e]', '', token)
      newTokens.append(rep)
    return newTokens

  def stemmingTokens(self, tokens, stemmer):
    if stemmer is None:
      stemmer = nltk.LancasterStemmer()
    newTokens = [stemmer.stem(token) for token in tokens]
    return newTokens

  def removeMeaninglessTokens(self, tokens):
    # Keep only tokens that start with a lowercase letter. This also drops
    # tokens left empty by the earlier cleanup steps (for example a lone ".").
    return [token for token in tokens if re.match(r'[a-z]', token)]

  def removeEndComma(self, tokens):
    newTokens = []
    for token in tokens:
      rep = re.sub(r'\.$', '', token)
      newTokens.append(rep)
    return newTokens

  def removeStopword(self, tokens, stopwordFilePath):
    # Read the whitespace-separated stop word list; a set gives fast lookups.
    with open(stopwordFilePath) as f:
      stopwords = set(f.read().split())
    # Keep only the tokens that are not stop words.
    return [token for token in tokens if token not in stopwords]

  def getVocablary(self):
    allTokens = []
    for doc in self.docs:
      allTokens += doc
    return sorted(set(allTokens))

  def getDocForRLDA(self):
    docForR = []
    for doc in self.docs:
      docStr = ' '.join(doc)
      docStr += '\n'
      docForR.append(docStr)
    return docForR


def main():
  docs = []
  for fileid in nltk.corpus.webtext.fileids():
    docs.append(nltk.corpus.webtext.raw(fileid))

  stopwordFilePath = './stopword.list'
  stemmer = nltk.LancasterStemmer()

  mydocs = MyDocs(docs, stopwordFilePath, stemmer)

  vocabs = mydocs.getVocablary()
  f = open('./vocab.txt', 'w')
  for vocab in vocabs:
    f.write(vocab + '\n')
  f.close()

  docForR = mydocs.getDocForRLDA()
  f = open('./docR.txt', 'w')
  for doc in docForR:
    f.write(doc)
  f.close()

if __name__ == '__main__':
  main()

The end

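[Studying SSL 2]: Client certificates and authentication

2013-02-24T18:37:00.000+09:00

Nothing sticks when I study this from books, so I am learning it hands-on.
This time I introduce client authentication. I assume the setup from the previous post is already in place.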

Mac Book Air
Mac OS X 10.8.2
OpenSSL 0.9.8
Apache 2.2.22
Google Chrome 24.0.1312

Creating the client's private key and certificate

As before, work in ~/pki.

$ cd ~/pki
$ mkdir ~/pki/client

First, create the client's private key.

$ openssl genrsa -des3 -out ~/pki/client/client.key 1024
Generating RSA private key, 1024 bit long modulus
..++++++
.....++++++
e is 65537 (0x10001)
Enter pass phrase for /Users/shu222/pki/client/client.key:
Verifying - Enter pass phrase for /Users/shu222/pki/client/client.key:

Next, create the client's certificate signing request (CSR) for the certificate authority.

$ openssl req -new -days 365 -key ~/pki/client/client.key -out ~/pki/client/csr.pem
Enter pass phrase for /Users/shu222/pki/client/client.key:
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [JP]:
State or Province Name (full name) [Kobe]:
Locality Name (eg, city) [Chuo-ku]:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (eg, YOUR name) []:noah
Email Address []:

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:

Sign it with the CA. The client certificate is now ready.

$ openssl ca -in ~/pki/client/csr.pem -keyfile ~/pki/demoCA/private/cakey.pem -cert ~/pki/demoCA/cacert.pem -out ~/pki/client/cert.pem
Using configuration from /System/Library/OpenSSL/openssl.cnf
Enter pass phrase for /Users/testuser/pki/demoCA/private/cakey.pem:
Check that the request matches the signature
Signature ok
Certificate Details:
        Serial Number:
            bd:48:b8:5b:4c:1c:57:12
        Validity
            Not Before: Feb 23 13:55:05 2013 GMT
            Not After : Feb 21 13:55:05 2023 GMT
        Subject:
            countryName               = JP
            stateOrProvinceName       = Kobe
            organizationName          = Internet Widgits Pty Ltd
            commonName                = noah
        X509v3 extensions:
            X509v3 Basic Constraints: 
                CA:FALSE
            Netscape Cert Type: 
                SSL Server
            Netscape Comment: 
                OpenSSL Generated Certificate
            X509v3 Subject Key Identifier: 
                0A:F6:CB:01:4E:FD:0C:4B:DB:3E:E1:6B:76:9A:C6:63:B8:49:64:57
            X509v3 Authority Key Identifier: 
                keyid:09:18:4F:0E:9D:62:76:25:9D:1D:7F:34:9E:CC:5F:47:C0:DA:41:6B

Certificate is to be certified until Feb 21 13:55:05 2023 GMT (3650 days)
Sign the certificate? [y/n]:y


1 out of 1 certificate requests certified, commit? [y/n]y
Write out database with 1 new entries
Data Base Updated

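Apache SSL configuration

Stop Apache first.

$ sudo apachectl stop

Add the following settings to httpd-ssl.conf.

Path of the directory that holds the CA certificates

SSLCACertificatePath "/Users/testuser/pki/demoCA"

Path of the CA certificate file

SSLCACertificateFile "/Users/testuser/pki/demoCA/cacert.pem"

Client authentication mode. It seems one of {none | optional | require | optional_no_ca} can be chosen.

SSLVerifyClient require

Depth to which the chain of certificate authorities is followed when verifying the certificate chain. The CA here is self-made, so 1.

SSLVerifyDepth  1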
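
For comparison, the same idea of "require a client certificate signed by my CA" can be expressed with Python's standard ssl module. This is only a rough analogue that I am sketching to show what the directives mean. It is not how Apache works internally, the function and mapping names are mine, and optional_no_ca has no direct counterpart here.

```python
import ssl

# Rough mapping from SSLVerifyClient values to Python's verify modes.
VERIFY_MODES = {
    "none": ssl.CERT_NONE,
    "optional": ssl.CERT_OPTIONAL,
    "require": ssl.CERT_REQUIRED,
}


def make_server_context(verify_client="require", ca_file=None,
                        cert_file=None, key_file=None):
    """Build a server-side TLS context with the given client-auth mode."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Counterpart of "SSLVerifyClient require".
    ctx.verify_mode = VERIFY_MODES[verify_client]
    if ca_file:
        # Counterpart of SSLCACertificateFile: the CA that signed the clients.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = make_server_context("require")
print(ctx.verify_mode)
```

With the "require" mode the TLS handshake fails unless the client presents a certificate that chains to the configured CA.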

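Installing the client certificate in the browser

First, convert the client certificate to PKCS#12 format.

$ openssl pkcs12 -export -in ~/pki/client/cert.pem -inkey ~/pki/client/client.key -certfile ~/pki/demoCA/cacert.pem -out ~/pki/client/cert.p12

In Chrome, go to "Chrome" in the menu bar → "Preferences..." → scroll to the bottom and click "Show advanced settings..."
→ "Manage certificates" in the HTTP/SSL area → Keychain Access starts → "File" in the menu bar → "Import Items..."
→ select the client certificate file (cert.p12 in the example above) and click "Open".
The browser can now send the client certificate to the server.
Incidentally, before this is set up the browser shows a warning screen.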
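
The client side can also be tried without a browser. Below is a hedged sketch with Python's standard ssl module; the function name is mine. Python takes the PEM certificate and key directly, so the cert.p12 file is not needed for this route. Because client.key was created with -des3, its passphrase has to be supplied through the password argument or typed at the prompt.

```python
import ssl


def make_client_context(ca_file=None, cert_file=None, key_file=None,
                        password=None):
    """Build a client-side TLS context.

    The context verifies the server certificate (against ca_file if given,
    otherwise the system defaults) and, when a PEM certificate and key are
    given, presents them to the server as the client certificate.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file,
                            password=password)
    return ctx


ctx = make_client_context()
print(ctx.verify_mode, ctx.check_hostname)
```

The resulting context can be passed to http.client.HTTPSConnection(host, context=ctx), with ca_file, cert_file, and key_file pointing at cacert.pem, cert.pem, and client.key from the steps above.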

The end

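[Studying SSL 1]: Building a CA and a server certificate

2013-02-23T17:43:00.001+09:00

Nothing sticks when I study this from books, so I am learning it hands-on.
I try what is usually called self-certification: building my own CA and issuing a self-made certificate.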

Mac Book Air
Mac OS X 10.8.2
OpenSSL 0.9.8
Apache 2.2.22

Creating openssl.cnf

I create a directory named pki under my home directory and work there.

$ mkdir ~/pki
$ cd ~/pki
$ sudo cp /System/Library/OpenSSL/openssl.cnf /System/Library/OpenSSL/openssl.cnf.org
$ sudo vi /System/Library/OpenSSL/openssl.cnf

The file I used this time is as follows.

openssl.cnf

#
# OpenSSL example configuration file.
# This is mostly being used for generation of certificate requests.
#

HOME   = .
RANDFILE  = $ENV::HOME/.rnd

oid_section  = new_oids

[ new_oids ]

####################################################################
[ ca ]
default_ca = CA_default  # The default ca section

####################################################################
[ CA_default ]

dir  = ./demoCA
certs  = $dir/certs
crl_dir  = $dir/crl
database = $dir/index.txt
new_certs_dir = $dir/newcerts

certificate = $dir/cacert.pem 
serial  = $dir/serial 
crlnumber = $dir/crlnumber
crl  = $dir/crl.pem
private_key = $dir/private/cakey.pem
RANDFILE = $dir/private/.rand

x509_extensions = usr_cert

name_opt  = ca_default
cert_opt  = ca_default

default_days = 3650
default_crl_days= 30
default_md = sha1
preserve = no

policy  = policy_match

[ policy_match ]
countryName  = match
stateOrProvinceName = match
organizationName = match
organizationalUnitName = optional
commonName  = supplied
emailAddress  = optional

[ policy_anything ]
countryName  = optional
stateOrProvinceName = optional
localityName  = optional
organizationName = optional
organizationalUnitName = optional
commonName  = supplied
emailAddress  = optional

####################################################################
[ req ]
default_bits  = 2048
default_keyfile  = privkey.pem
distinguished_name = req_distinguished_name
attributes  = req_attributes
x509_extensions  = v3_ca 

string_mask = nombstr

[ req_distinguished_name ]
countryName   = Country Name (2 letter code)
countryName_default  = JP
countryName_min   = 2
countryName_max   = 2

stateOrProvinceName  = State or Province Name (full name)
stateOrProvinceName_default = Kobe

localityName   = Locality Name (eg, city)
localityName_default  = Chuo-ku

0.organizationName  = Organization Name (eg, company)
0.organizationName_default = Internet Widgits Pty Ltd

organizationalUnitName  = Organizational Unit Name (eg, section)

commonName   = Common Name (eg, YOUR name)
commonName_max   = 64

emailAddress   = Email Address
emailAddress_max  = 64

[ req_attributes ]
challengePassword  = A challenge password
challengePassword_min  = 4
challengePassword_max  = 20

unstructuredName  = An optional company name

[ usr_cert ]
basicConstraints=CA:FALSE

nsComment   = "OpenSSL Generated Certificate"

subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid,issuer

[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment

[ v3_ca ]
subjectKeyIdentifier=hash

authorityKeyIdentifier=keyid:always,issuer:always

basicConstraints = CA:true

[ crl_ext ]
authorityKeyIdentifier=keyid:always,issuer:always

[ proxy_cert_ext ]
basicConstraints=CA:FALSE

nsComment   = "OpenSSL Generated Certificate"

subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid,issuer:always

proxyCertInfo=critical,language:id-ppl-anyLanguage,pathlen:3,policy:foo
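
The [ policy_match ] section in the file above is what decides whether the CA will sign a request: "match" fields must be present and equal to the CA certificate's value, "supplied" fields must be present, and "optional" fields may be omitted. A small Python sketch of that rule follows. The function name and the dictionary representation of a DN are mine, purely for illustration, and are not how OpenSSL implements it.

```python
# The policy from the [ policy_match ] section of the configuration.
POLICY_MATCH = {
    "countryName": "match",
    "stateOrProvinceName": "match",
    "organizationName": "match",
    "organizationalUnitName": "optional",
    "commonName": "supplied",
    "emailAddress": "optional",
}


def policy_violations(ca_dn, request_dn, policy=POLICY_MATCH):
    """Return a list of reasons why request_dn fails the CA policy."""
    problems = []
    for field, rule in policy.items():
        value = request_dn.get(field)
        if rule == "match" and (not value or value != ca_dn.get(field)):
            problems.append(f"{field} must match the CA's value")
        elif rule == "supplied" and not value:
            problems.append(f"{field} must be supplied")
    return problems


ca_dn = {
    "countryName": "JP",
    "stateOrProvinceName": "Kobe",
    "organizationName": "Internet Widgits Pty Ltd",
    "commonName": "noah",
}
good_request = dict(ca_dn)
bad_request = dict(ca_dn, stateOrProvinceName="Tokyo")

print(policy_violations(ca_dn, good_request))
print(policy_violations(ca_dn, bad_request))
```

This is why a signing request under this policy has to repeat the CA's country, state, and organization.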

Building the CA

$ /System/Library/OpenSSL/misc/CA.sh -newca
CA certificate filename (or enter to create)

Making CA certificate ...
Generating a 2048 bit RSA private key
...................................................................+++
....................................................................................................................................+++
writing new private key to './demoCA/private/./cakey.pem'
Enter PEM pass phrase: <== enter a passphrase of your choice
Verifying - Enter PEM pass phrase: <== enter it again
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [JP]:
State or Province Name (full name) [Kobe]:
Locality Name (eg, city) [Chuo-ku]:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (eg, YOUR name) []:noah <== enter the host name (FQDN)
Email Address []:

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
Using configuration from /System/Library/OpenSSL/openssl.cnf
Enter pass phrase for ./demoCA/private/./cakey.pem:  <== enter the passphrase you set above
Check that the request matches the signature
Signature ok
Certificate Details:
        Serial Number:
            a5:f1:29:54:f4:12:9e:c4
        Validity
            Not Before: Feb 20 16:02:51 2013 GMT
            Not After : Feb 20 16:02:51 2016 GMT
        Subject:
            countryName               = JP
            stateOrProvinceName       = Kobe
            organizationName          = Internet Widgits Pty Ltd
            commonName                = noah
        X509v3 extensions:
            X509v3 Subject Key Identifier: 
                40:8C:0E:CF:F6:70:8F:09:83:E0:4A:19:31:63:3A:66:3D:9C:23:0E
            X509v3 Authority Key Identifier: 
                keyid:40:8C:0E:CF:F6:70:8F:09:83:E0:4A:19:31:63:3A:66:3D:9C:23:0E
                DirName:/C=JP/ST=Kobe/O=Internet Widgits Pty Ltd/CN=noah
                serial:A5:F1:29:54:F4:12:9E:C4

            X509v3 Basic Constraints: 
                CA:TRUE
Certificate is to be certified until Feb 20 16:02:51 2016 GMT (1095 days)

Write out database with 1 new entries
Data Base Updated

The CA certificate (cacert.pem) and related files have now been created in demoCA under the working directory.

$ ls ~/pki/demoCA/
cacert.pem certs  index.txt index.txt.old private
careq.pem crl  index.txt.attr newcerts serial

The contents of the CA certificate can be inspected with the following command.

$ openssl x509 -in ~/pki/demoCA/cacert.pem -text
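
The validity period that openssl prints can be double-checked with a few lines of Python. This sketch parses the "Not Before" and "Not After" strings from the CA signing output above and computes the number of days between them; the function name and format constant are mine.

```python
from datetime import datetime

# Date format used in openssl's "Not Before" / "Not After" lines.
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def validity_days(not_before, not_after):
    """Return the number of whole days between two openssl date strings."""
    start = datetime.strptime(not_before, OPENSSL_DATE_FORMAT)
    end = datetime.strptime(not_after, OPENSSL_DATE_FORMAT)
    return (end - start).days


# Dates taken from the CA certificate details shown above.
print(validity_days("Feb 20 16:02:51 2013 GMT", "Feb 20 16:02:51 2016 GMT"))
```

The result, 1095, agrees with the "(1095 days)" that openssl reported when the CA certificate was created.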

Creating the server's private key

Create a directory for the server and work there. You are prompted to enter a passphrase for the key.

$ mkdir ~/pki/server
$ openssl genrsa -des3 -out ~/pki/server/server.key 1024
$ ls ~/pki/server
server.key

This is the server's private key.

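Creating the server certificate

To make the server's certificate trustworthy, it has to be signed by a certificate authority.
This lets the server show clients that it is genuine.
Here the certificate authority is also my own, so to anyone else the resulting server certificate carries no trust at all. (The so-called "ore-ore" certificate, that is, a self-made one.)

First, create the certificate signing request. Give exactly the same answers as when building the CA, because the policy_match section of openssl.cnf requires countryName, stateOrProvinceName, and organizationName to match the CA's.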

$ openssl req -new -days 365 -key ~/pki/server/server.key -out ~/pki/server/csr.pem
Enter pass phrase for /Users/testuser/pki/server/server.key:
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [JP]:
State or Province Name (full name) [Kobe]:
Locality Name (eg, city) [Chuo-ku]:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (eg, YOUR name) []:noah
Email Address []:

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:

$ ls ~/pki/server/
csr.pem  server.key

Have the CA sign the request using this file.

$ openssl ca -in ~/pki/server/csr.pem -keyfile ~/pki/demoCA/private/cakey.pem -cert ~/pki/demoCA/cacert.pem -out ~/pki/server/cert.pem
Using configuration from /System/Library/OpenSSL/openssl.cnf
Enter pass phrase for /Users/testuser/pki/demoCA/private/cakey.pem:
Check that the request matches the signature
Signature ok
Certificate Details:
        Serial Number:
            bd:48:b8:5b:4c:1c:57:11
        Validity
            Not Before: Feb 23 02:48:56 2013 GMT
            Not After : Feb 21 02:48:56 2023 GMT
        Subject:
            countryName               = JP
            stateOrProvinceName       = Kobe
            organizationName          = Internet Widgits Pty Ltd
            commonName                = noah
        X509v3 extensions:
            X509v3 Basic Constraints: 
                CA:FALSE
            Netscape Cert Type: 
                SSL Server
            Netscape Comment: 
                OpenSSL Generated Certificate
            X509v3 Subject Key Identifier: 
                0B:BB:C9:A4:6F:8D:93:B8:D3:E1:EA:62:C2:30:FD:46:6B:6A:1F:15
            X509v3 Authority Key Identifier: 
                keyid:09:18:4F:0E:9D:62:76:25:9D:1D:7F:34:9E:CC:5F:47:C0:DA:41:6B

Certificate is to be certified until Feb 21 02:48:56 2023 GMT (3650 days)
Sign the certificate? [y/n]:y


1 out of 1 certificate requests certified, commit? [y/n]y
Write out database with 1 new entries
Data Base Updated

I hit an error on the first attempt and worked around it as follows.

$ openssl ca -in ~/pki/server/csr.pem -keyfile ~/pki/demoCA/private/cakey.pem -cert ~/pki/demoCA/cacert.pem -out ~/pki/server/cert.pem
<omitted>
Certificate is to be certified until Feb 21 02:46:04 2023 GMT (3650 days)
Sign the certificate? [y/n]:y
failed to update database
TXT_DB error number 2
$ mv ~/pki/demoCA/index.txt ~/pki/demoCA/index.txt.old
$ touch ~/pki/demoCA/index.txt

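I did not work out why this error occurs. A likely cause is that index.txt already held the CA's own certificate with the same subject (the CSR was answered exactly like the CA, including CN=noah), which OpenSSL rejects when unique_subject is enabled. Starting from an empty index.txt removes the clash.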
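To check whether duplicate subjects are behind "TXT_DB error number 2", the CA database can be scanned for repeated subject DNs. The following is a minimal sketch, assuming the usual tab-separated index.txt layout (status, expiry, revocation date, serial, file name, subject). The sample rows are made up for illustration from the serial numbers and subject shown above; they are not copied from a real index.txt.

```python
from collections import Counter

def duplicate_subjects(index_text):
    """Return subject DNs that appear more than once among valid ('V') entries."""
    subjects = []
    for line in index_text.splitlines():
        fields = line.split("\t")
        # Assumed layout: status, expiry, revocation date, serial, file name, subject DN
        if len(fields) == 6 and fields[0] == "V":
            subjects.append(fields[5])
    return sorted(s for s, n in Counter(subjects).items() if n > 1)

# Hypothetical sample: the CA's own entry and the server entry share one subject.
sample = (
    "V\t160220160251Z\t\tA5F12954F4129EC4\tunknown\t"
    "/C=JP/ST=Kobe/O=Internet Widgits Pty Ltd/CN=noah\n"
    "V\t230221024856Z\t\tBD48B85B4C1C5711\tunknown\t"
    "/C=JP/ST=Kobe/O=Internet Widgits Pty Ltd/CN=noah\n"
)
print(duplicate_subjects(sample))
```

If the list is non-empty, giving the server a different Common Name than the CA is one way to avoid the clash without resetting index.txt.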

Apache SSL configuration

Only the bare minimum is configured here; adapt it to your own environment and requirements.
If Apache is running, stop it first.

$ sudo apachectl stop

For Apache, create a copy of the current private key without a pass phrase.

$ openssl rsa -in ~/pki/server/server.key -out ~/pki/server/servernopw.key
Enter pass phrase for /Users/testuser/pki/server/server.key:
writing RSA key

Edit two Apache configuration files (httpd.conf and httpd-ssl.conf).

/etc/apache2/httpd.conf: uncomment the following line.

Include /private/etc/apache2/extra/httpd-ssl.conf

/etc/apache2/extra/httpd-ssl.conf: specify the server certificate and private key with the following two lines.

SSLCertificateFile "/Users/testuser/pki/server/cert.pem"
SSLCertificateKeyFile "/Users/testuser/pki/server/servernopw.key"

The full httpd-ssl.conf is shown below.

Listen 443

AddType application/x-x509-ca-cert .crt
AddType application/x-pkcs7-crl    .crl

SSLPassPhraseDialog  builtin
SSLSessionCache        "shmcb:/private/var/run/ssl_scache(512000)"
SSLSessionCacheTimeout  300
SSLMutex  "file:/private/var/run/ssl_mutex"

<VirtualHost _default_:443>
DocumentRoot "/Library/WebServer/Documents"
ServerName www.example.com:443
ServerAdmin you@example.com
ErrorLog "/private/var/log/apache2/error_log"
TransferLog "/private/var/log/apache2/access_log"

SSLEngine on
SSLProtocol all -SSLv2
SSLCipherSuite HIGH:MEDIUM:!aNULL:!MD5
SSLCertificateFile "/Users/testuser/pki/server/cert.pem"
SSLCertificateKeyFile "/Users/testuser/pki/server/servernopw.key"

<FilesMatch "\.(cgi|shtml|phtml|php)$">
    SSLOptions +StdEnvVars
</FilesMatch>
<Directory "/Library/WebServer/CGI-Executables">
    SSLOptions +StdEnvVars
</Directory>
BrowserMatch "MSIE [2-5]" \
         nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0
CustomLog "/private/var/log/apache2/ssl_request_log" \
          "%t %h %{SSL_PROTOCOL}x %{SSL_CIPHER}x \"%r\" %b"
</VirtualHost>

Connection check

Connect with a browser.
Chrome shows a certificate warning. After clicking "Proceed anyway" (「このまま続行」), the site is displayed.

End

Memo: Mac shortcut for taking a screenshot of a specific window

2013-02-13T01:24:00.002+09:00

I keep forgetting this one.

Command + Shift + 4, then Space

The mouse cursor turns into a camera; click the window you want to capture.

End

Parsing HTML with Python: extracting links and anchor text

2013-02-07T01:10:00.002+09:00

Use HTMLParser to analyze HTML tags.
Extract the A tags from a given site and build pairs of link URL and anchor text.

Sakura VPS
CentOS 6.2
Python 2.6.6

Source code

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import re

from urllib import urlopen
from HTMLParser import HTMLParser

class out_link_parser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.links = {}
    self.linkurl = ''

  # Handle only <a> tags and store the href attribute value in linkurl
  def handle_starttag(self, tag, attrs):
    if tag == 'a':
      attrs = dict(attrs)
      if 'href' in attrs:
        self.linkurl = attrs['href']

  # This handler is optional
  def handle_endtag(self, tag):
    pass

  # Only when linkurl is set (that is, right after an <a> tag),
  # add an entry to the dictionary: URL as key, anchor text as value
  def handle_data(self, data):
    if self.linkurl:
      self.links[self.linkurl] = data
      self.linkurl = ''


def main():
  target = 'http://www.python.jp/'
  url = urlopen(target)
  html = url.read()
  parser = out_link_parser()

  # The page contains Japanese text, so decode it to Unicode
  parser.feed(html.decode('utf-8'))
  parser.close()

  for k, v in parser.links.items():
    k_str = k.encode('utf-8')
    # Strip leading/trailing spaces, newlines, etc. from the anchor text
    v_str = re.sub('^[ \n\r\t]+|[ \n\r\t]+$', '', v).encode('utf_8')
 
    # For root-relative paths and fragments, prepend the root URL
    if re.match('^/', k_str):
      print "%s: %s" % (re.sub('^/', target, k_str), v_str)
    elif re.match('^#', k_str):
      print "%s: %s" % (re.sub('^#', target + '#', k_str), v_str)
    else:
      print "%s: %s" % (k_str, v_str)
      

if __name__ == '__main__':
  main()

Output

http://www.python.jp/pyjug/: PyJUGについて
http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2: ソースコード
http://confoo.ca/: CanFoo
http://www.python.jp/psf/donations/: 
http://www.python.org/download/releases/3.3.0/: Python 3.3.0
http://www.python.jp/#content-body: 
http://www.python.org: 公式ウェブサイト
http://www.python.org/psf/donations/: Pythonに募金を！
http://www.python.jp/Zope/: 以前のwww.python.jpサイト
http://pypi.python.org/pypi: Pythonパッケージインデックス
http://www.python.jp/about/: Pythonとは
http://www.python.org/download/releases/2.7.3/: Python 2.7.3
http://code.google.com/p/python-doc-ja/: ドキュメント翻訳プロジェクト
http://www.python.jp/: 
http://docs.python.jp/3.3/: ドキュメント (nightly)
http://python.org/download/releases/3.3.0/: リリース
http://www.python.jp/news/: ニュース
http://www.python.org/ftp/python/3.3.0/Python-3.3.0.tar.bz2: ソースコード
http://www.python.jp/pyjug: Legal Statements
http://docs.python.jp/2.7/: 日本語ドキュメント
http://www.python.jp/download/: ダウンロード
http://www.google.com/calendar/ical/kj670le78ju5alcbt1khect5ks%40group.calendar.google.com/public/basic.ics: Python Calendar Japan
http://www.python.jp/doc/: ドキュメント
http://www.timparkin.co.uk/: design by Tim Parkin
http://www.python.jp/mailman/listinfo/python-ml-jp: Python日本語メーリングリスト
http://www.python.jp/channews.rdf: RSS
https://github.com/tokyo-scipy/archive/tree/master/005: Tokyo.SciPy #5
http://python.org/community/awards/psf-distinguished-awards/: 特別功労賞
http://www.numfocus.org/johnhunter: John Hunter
http://www.python.org/psf/license/: オープンソースライセンス
http://wiki.python.org/moin/Languages: 各国語のPython情報
http://www.python.org/ftp/python/3.3.0/python-3.3.0.msi: Windows インストーラ
http://www.python.jp/#left-hand-navigation: 
http://www.python.org/ftp/python/2.7.3/python-2.7.3.msi: Windows インストーラ
http://www.python.jp/about: Pythonについてもっと詳しく
https://github.com/tokyo-scipy/archive: Tokyo.SciPy

The extraction worked reasonably well.
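Several entries above have empty anchor text, probably because the script keeps only the first text node after each <a> tag. For comparison, here is a minimal Python 3 sketch of the same idea. It is a rewrite under assumptions, not the script that produced the output above: it uses `html.parser` and `urljoin` from the standard library, and the names `LinkParser` and `extract_links` are made up for this example. It collects all text up to the closing </a> and resolves relative links against a base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect {absolute URL: anchor text} for every <a href=...> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = {}
        self._href = None   # URL of the <a> element currently open, if any
        self._text = []     # text chunks seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the collected anchor text
            self.links[self._href] = " ".join("".join(self._text).split())
            self._href = None

def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the link is closed at </a>, an image-only link ends up with an empty string instead of picking up the next unrelated text node.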

End

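Memo: showing surrounding lines with grep

2013-01-31T23:54:00.000+09:00

Useful when analyzing logs where the string that identifies the event and its timestamp are printed on separate, consecutive lines.

Show the preceding line as well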

# grep -B 1 "hogehoge" ./higehige.txt

Show the following line as well

# grep -A 1  "hogehoge" ./higehige.txt
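When the same filtering is needed inside a script, the behaviour of -B and -A can be approximated in a few lines of Python. This is a rough sketch, not a reimplementation of grep: it does not print the "--" group separators, and the function name `grep_context` is made up for this example.

```python
import re
from collections import deque

def grep_context(lines, pattern, before=0, after=0):
    """Yield matching lines plus `before`/`after` lines of context, in order."""
    regex = re.compile(pattern)
    history = deque(maxlen=before)  # recent lines not yet emitted
    remaining_after = 0
    for line in lines:
        if regex.search(line):
            yield from history      # flush pending "before" context
            history.clear()
            yield line
            remaining_after = after
        elif remaining_after > 0:
            yield line
            remaining_after -= 1
        else:
            history.append(line)

log = [
    "2013-01-31 23:00:01",
    "event hogehoge start",
    "detail A",
    "2013-01-31 23:05:09",
    "event other",
]
for line in grep_context(log, "hogehoge", before=1):
    print(line)
```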

End

MySQL from Python

2013-01-29T23:17:00.001+09:00

MySQL

Sakura VPS
CentOS 6.2
Python 2.6.6
MySQL 5.1.61
MySQL for Python 1.2.4

Installing MySQL for Python

# wget http://sourceforge.net/projects/mysql-python/files/latest/download
# tar xvf MySQL-python-1.2.4b4.tar.gz
# cd MySQL-python-1.2.4b4
# python setup.py build
# python setup.py install

Sample program.

myMySQLforPython.py

#!/usr/bin/env python
# coding: utf-8

import MySQLdb

con = MySQLdb.connect(db='nutch', host='localhost', user='nutch', passwd='password')
cur = con.cursor()
q = 'SELECT id, title FROM webpage LIMIT 10'
cur.execute(q)
rows = cur.fetchall()
for row in rows:
  print "%s ( %s )" % (row[1], row[0])
cur.close()
con.close()

Output.
The test table comes from the crawl-data database created in an earlier post.

# python myMySQLforPython.py 
Welcome to Apache Nutch ( org.apache.nutch:http/ )
About Apache Nutch ( org.apache.nutch:http/about.html )
None ( org.apache.nutch:http/about.pdf )
All Classes (apache-nutch 1.6 API) ( org.apache.nutch:http/apidocs-1.6/allclasses-frame.html )
apache-nutch 1.6 API ( org.apache.nutch:http/apidocs-1.6/index.html )
None ( org.apache.nutch:http/apidocs-1.6/org/apache/nutch/analysis/lang/HTMLLanguageParser.html )
None ( org.apache.nutch:http/apidocs-1.6/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html )
None ( org.apache.nutch:http/apidocs-1.6/org/apache/nutch/analysis/lang/package-frame.html )
None ( org.apache.nutch:http/apidocs-1.6/org/apache/nutch/collection/CollectionManager.html )
None ( org.apache.nutch:http/apidocs-1.6/org/apache/nutch/collection/package-frame.html )
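The LIMIT value in the sample is hard-coded, and rows without a title print as "None". A common variation is to pass values as query parameters and handle the missing titles explicitly. The sketch below swaps in the standard-library `sqlite3` module as a stand-in so that it runs without a MySQL server; both modules follow the Python DB-API, but note that MySQLdb uses `%s` placeholders where sqlite3 uses `?`. The table contents are a made-up subset of the rows above.

```python
import sqlite3
from contextlib import closing

def fetch_titles(con, limit):
    """Return up to `limit` (id, title) rows using a parameterized query."""
    with closing(con.cursor()) as cur:
        # With MySQLdb the placeholder would be %s instead of ?
        cur.execute("SELECT id, title FROM webpage ORDER BY id LIMIT ?", (limit,))
        return cur.fetchall()

# Stand-in database: an in-memory SQLite table shaped like the webpage table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE webpage (id TEXT PRIMARY KEY, title TEXT)")
con.executemany("INSERT INTO webpage VALUES (?, ?)", [
    ("org.apache.nutch:http/", "Welcome to Apache Nutch"),
    ("org.apache.nutch:http/about.pdf", None),
])
for page_id, title in fetch_titles(con, 10):
    print("%s ( %s )" % (title if title is not None else "(no title)", page_id))
con.close()
```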

End

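Nutch and MySQL

2013-01-27T13:56:00.001+09:00

Having already connected Nutch to HBase and Cassandra, I tried MySQL as well.

Sakura VPS
CentOS 6.2
OpenJDK 1.6.0_24
Apache Nutch 2.1
MySQL 5.1.61

Installing MySQL

See other sites for this step. The rest assumes MySQL is already installed.

Creating the MySQL user

The user is named "nutch" here. Replace the password '****' in the GRANT statement with one of your choice.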

# mysql -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 138
Server version: 5.1.61 Source distribution

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 
mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> 
mysql> GRANT ALL PRIVILEGES ON nutch.* TO nutch@localhost IDENTIFIED BY '****';
Query OK, 0 rows affected (0.00 sec)

mysql> 
mysql> quit
Bye

Getting the Nutch source package

Note: at the time of writing, the official download link offers only the source.

apache-nutch-2.1-src.tar.gz

Extract (again under /opt)

# cd /opt
# ls
apache-nutch-2.1-src.tar.gz
# tar xvf apache-nutch-2.1-src.tar.gz
# cd apache-nutch-2.1

Configuration (Gora)

Add the following to each file.

conf/nutch-site.xml (inside the <configuration> tag)

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
  </property>

conf/gora.properties

gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=<the user you created (nutch here)>
gora.sqlstore.jdbc.password=<the password you set>

ivy/ivy.xml

<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Compile

Build with ant.

# ant
<omitted>

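Configuration (Nutch)

Set the value of the "http.agent.name" property in nutch-default.xml in the conf directory to a string of your choice. (Overrides normally belong in conf/nutch-site.xml; the default file is edited directly here.)
conf/nutch-default.xml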

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

Create a text file, in any directory, that lists the sites where crawling starts.
In the example below it goes into a "urls" directory created under the installation directory.
Example:

# mkdir urls

urls/seed.txt

http://nutch.apache.org/

Specify the sites to crawl with a regular expression.
Edit "conf/regex-urlfilter.txt" as follows: comment out the last line and append a new rule.
conf/regex-urlfilter.txt

#+.
+^http://([a-z0-9]*\.)*nutch.apache.org
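To get a feel for which URLs this rule lets through, the filter logic can be imitated in Python. This is a simplified sketch, not Nutch's implementation. It assumes what the comments in regex-urlfilter.txt describe: the first matching rule decides, "+" includes, "-" excludes, and a URL that matches no rule is dropped. It also assumes the pattern is searched for anywhere in the URL.

```python
import re

def load_rules(text):
    """Parse filter lines into (include, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False

RULES = load_rules(r"""
#+.
+^http://([a-z0-9]*\.)*nutch.apache.org
""")

for url in ("http://nutch.apache.org/about.html",
            "http://lucene.apache.org/",
            "https://nutch.apache.org/"):
    print(url, accepts(RULES, url))
```

The dots in "nutch.apache.org" are unescaped, so they match any character. Writing them as `nutch\.apache\.org` would make the rule stricter.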

Inject for crawling

# ./runtime/local/bin/nutch inject urls/
InjectorJob: starting
InjectorJob: urlDir: urls
2013/01/27 12:56:49 org.apache.gora.sql.store.SqlStore createSchema
情報: creating schema: webpage
InjectorJob: org.apache.gora.util.GoraException: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'text' (max = 21845); use BLOB or TEXT instead
 at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
 at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
 at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
 at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)
 at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:228)
 at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:248)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:258)
Caused by: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'text' (max = 21845); use BLOB or TEXT instead
 at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:226)
 at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:172)
 at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
 at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
 ... 7 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'text' (max = 21845); use BLOB or TEXT instead
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
 at com.mysql.jdbc.Util.getInstance(Util.java:386)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
 at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
 at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
 at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
 at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
 at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
 at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2345)
 at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2330)
 at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:224)
 ... 10 more

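This failed with an error: the "text" column (which stores the page body) is declared longer than a VARCHAR column allows, and MySQL says to use BLOB or TEXT instead.
After some trial and error I could not find a fix myself, so I searched the community tickets
https://issues.apache.org/jira/browse/NUTCH-1497
https://issues.apache.org/jira/browse/NUTCH-1473
and ended up here:
http://nlp.solutions.asia/?p=180
Incidentally, this is apparently scheduled to be fixed in 2.2.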
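The "max = 21845" in the error message follows from simple arithmetic: a MySQL row is limited to 65,535 bytes, and the utf8 character set reserves up to 3 bytes per character, so a VARCHAR cannot be declared longer than 65535 / 3 characters. A small sketch of that calculation (the function name is made up for this example):

```python
MAX_ROW_BYTES = 65535      # MySQL row-size limit, shared by all columns in a row
UTF8_BYTES_PER_CHAR = 3    # MySQL's "utf8" reserves up to 3 bytes per character

def max_varchar_chars(bytes_per_char, max_bytes=MAX_ROW_BYTES):
    """Upper bound on the declared length of a VARCHAR column."""
    return max_bytes // bytes_per_char

print(max_varchar_chars(UTF8_BYTES_PER_CHAR))
```

TEXT and BLOB columns are stored outside this limit, which is why the error message suggests them.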

Reconfiguration

Creating the database and schema

mysql> drop database nutch;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;
Query OK, 1 row affected (0.00 sec)

mysql> use nutch;
Database changed

mysql> CREATE TABLE `webpage` (
    -> `id` varchar(255) NOT NULL,
    -> `headers` blob,
    -> `text` mediumtext DEFAULT NULL,
    -> `status` int(11) DEFAULT NULL,
    -> `markers` blob,
    -> `parseStatus` blob,
    -> `modifiedTime` bigint(20) DEFAULT NULL,
    -> `score` float DEFAULT NULL,
    -> `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
    -> `baseUrl` varchar(767) DEFAULT NULL,
    -> `content` longblob,
    -> `title` varchar(2048) DEFAULT NULL,
    -> `reprUrl` varchar(767) DEFAULT NULL,
    -> `fetchInterval` int(11) DEFAULT NULL,
    -> `prevFetchTime` bigint(20) DEFAULT NULL,
    -> `inlinks` mediumblob,
    -> `prevSignature` blob,
    -> `outlinks` mediumblob,
    -> `fetchTime` bigint(20) DEFAULT NULL,
    -> `retriesSinceFetch` int(11) DEFAULT NULL,
    -> `protocolStatus` blob,
    -> `signature` blob,
    -> `metadata` blob,
    -> PRIMARY KEY (`id`)
    -> ) ENGINE=InnoDB
    -> ROW_FORMAT=COMPRESSED
    -> DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> show tables;
+-----------------+
| Tables_in_nutch |
+-----------------+
| webpage         |
+-----------------+
1 row in set (0.00 sec)

Change the length of the id column in gora-sql-mapping.xml from 512 to 767.
gora-sql-mapping.xml

Crawling

Run just one round for now.

# ./runtime/local/bin/nutch crawl urls/ -depth 1 -topN 1

Check in MySQL. (The columns may be misaligned and hard to read.)

mysql> select id, score, fetchTime  from webpage;
+----------------------------------------------+-----------+---------------+
| id                                           | score     | fetchTime     |
+----------------------------------------------+-----------+---------------+
| org.apache.nutch:http/                       |   1.05882 | 1361854115396 |
| org.apache.nutch:http/about.html             | 0.0588235 | 1359262135084 |
| org.apache.nutch:http/apidocs-1.6/index.html | 0.0588235 | 1359262135086 |
| org.apache.nutch:http/apidocs-2.1/index.html | 0.0588235 | 1359262135086 |
| org.apache.nutch:http/bot.html               | 0.0588235 | 1359262135087 |
| org.apache.nutch:http/credits.html           | 0.0588235 | 1359262135088 |
| org.apache.nutch:http/faq.html               | 0.0588235 | 1359262135089 |
| org.apache.nutch:http/index.html             | 0.0588235 | 1359262135090 |
| org.apache.nutch:http/index.pdf              | 0.0588235 | 1359262135092 |
| org.apache.nutch:http/issue_tracking.html    | 0.0588235 | 1359262135093 |
| org.apache.nutch:http/mailing_lists.html     | 0.0588235 | 1359262135094 |
| org.apache.nutch:http/nightly.html           | 0.0588235 | 1359262135095 |
| org.apache.nutch:http/old_downloads.html     | 0.0588235 | 1359262135095 |
| org.apache.nutch:http/sonar.html             | 0.0588235 | 1359262135096 |
| org.apache.nutch:http/tutorial.html          | 0.0588235 | 1359262135097 |
| org.apache.nutch:http/version_control.html   | 0.0588235 | 1359262135099 |
| org.apache.nutch:http/wiki.html              | 0.0588235 | 1359262135100 |
+----------------------------------------------+-----------+---------------+
17 rows in set (0.00 sec)
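The id values are the page URLs with the host name reversed, so that pages from the same domain sort next to each other. The mapping can be sketched in Python as below. It is inferred from the keys visible in the output, not taken from Nutch's source, and the port handling in particular is an assumption. Both function names are made up for this example.

```python
from urllib.parse import urlsplit

def reverse_url(url):
    """Turn http://nutch.apache.org/x into org.apache.nutch:http/x."""
    parts = urlsplit(url)
    key = ".".join(reversed(parts.hostname.split("."))) + ":" + parts.scheme
    if parts.port is not None:          # assumed form: host:scheme:port
        key += ":" + str(parts.port)
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path

def unreverse_key(key):
    """Inverse of reverse_url for keys of the form shown above."""
    host_part, rest = key.split(":", 1)
    slash = rest.find("/")
    scheme_port, path = (rest, "") if slash == -1 else (rest[:slash], rest[slash:])
    scheme, _, port = scheme_port.partition(":")
    host = ".".join(reversed(host_part.split(".")))
    return scheme + "://" + host + (":" + port if port else "") + path

print(reverse_url("http://nutch.apache.org/about.html"))
print(unreverse_key("org.apache.nutch:http/apidocs-1.6/index.html"))
```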

End

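Crawling with Nutch (step by step)

2013-01-20T00:13:00.000+09:00

When I crawled with Nutch before, the crawl command ran each stage of the crawl automatically, one after another.
This time I run the stages by hand, one at a time.
The one unusual point is that the data is stored in Cassandra.

Sakura VPS
CentOS 6.2
OpenJDK 1.6.0_24
Apache Nutch 2.1
Apache Cassandra 1.2

Inject

Register the crawl starting points in the database, which here means Cassandra.
Three Apache community sites are listed in a file as the starting points.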

# pwd
/opt/apache-nutch-2.1

# cat urls/seed.txt
http://nutch.apache.org/
http://lucene.apache.org/
http://cassandra.apache.org/

# nutch inject urls/
InjectorJob: starting
InjectorJob: urlDir: urls
<omitted>
InjectorJob: finished

Check the result.

# nutch readdb -dump ./out_dir

# cat out_dir/part-r-00000
http://cassandra.apache.org/ key: org.apache.cassandra:http/
baseUrl: null
status: 0 (null)
fetchInterval: 2592000
fetchTime: 1358584769397
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
markers: {dist=0, _injmrk_=y}
metadata _csh_ :  ?�

http://lucene.apache.org/ key: org.apache.lucene:http/
baseUrl: null
status: 0 (null)
fetchInterval: 2592000
fetchTime: 1358584769397
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
markers: {dist=0, _injmrk_=y}
metadata _csh_ :  ?�

http://nutch.apache.org/ key: org.apache.nutch:http/
baseUrl: null
status: 0 (null)
fetchInterval: 2592000
fetchTime: 1358584769397
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
markers: {dist=0, _injmrk_=y}
metadata _csh_ :  ?�

All three sites are indeed registered.
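The numeric fields in the dump are easier to read once converted. fetchTime looks like a Unix timestamp in milliseconds, and fetchInterval is presumably in seconds. A small sketch under those assumptions:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_millis(ms):
    """Convert a millisecond Unix timestamp to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

fetch_time = from_millis(1358584769397)          # fetchTime from the dump above
fetch_interval = timedelta(seconds=2592000)      # fetchInterval from the dump above

print(fetch_time.isoformat())
print(fetch_interval.days, "days")
```

Read this way, the three seeds were injected on 2013-01-19 (UTC) and are due for re-fetching every 30 days.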

Generate

Select the crawl targets.

# nutch generate
<omitted>

Then check again. readdb -dump fails if the output directory already exists; the same directory is reused here, so delete it first.

# rm -rf out_dir

# nutch readdb -dump ./out_dir
<omitted>

# cat out_dir/part-r-00000
http://cassandra.apache.org/ key: org.apache.cassandra:http/
baseUrl: null
status: 0 (null)
fetchInterval: 2592000
fetchTime: 1358584769397
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
markers: {dist=0, _injmrk_=y, _gnmrk_=1358585856-208596666}
metadata _csh_ :  ?�

http://lucene.apache.org/ key: org.apache.lucene:http/
baseUrl: null
status: 0 (null)
fetchInterval: 2592000
fetchTime: 1358584769397
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
markers: {dist=0, _injmrk_=y, _gnmrk_=1358585856-208596666}
metadata _csh_ :  ?�

http://nutch.apache.org/ key: org.apache.nutch:http/
baseUrl: null
status: 0 (null)
fetchInterval: 2592000
fetchTime: 1358584769397
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
markers: {dist=0, _injmrk_=y, _gnmrk_=1358585856-208596666}
metadata _csh_ :  ?�

The markers field has gained an entry, but nothing else changed much.
Passing -topN to generate narrows down the target sites.

# nutch generate -topN 1
<omitted>

# nutch readdb -dump out_dir

# cat out_dir/part-r-00000
http://cassandra.apache.org/ key: org.apache.cassandra:http/
<omitted>
markers: {dist=0, _injmrk_=y, _gnmrk_=1358586890-1155183335}
metadata _csh_ :  ?�

http://lucene.apache.org/ key: org.apache.lucene:http/
<omitted>
markers: {dist=0, _injmrk_=y}
metadata _csh_ :  ?�

http://nutch.apache.org/ key: org.apache.nutch:http/
<omitted>
markers: {dist=0, _injmrk_=y}
metadata _csh_ :  ?�

The markers entry was added only for the first site.

Fetch

Fetch the data of the target sites.
The argument to fetch is the batch ID printed at the end of the generate run.

# nutch fetch 1358585856-208596666
<omitted>

A look into Cassandra shows that the content has indeed been stored.
It is in the column family "f".

# cassandra-cli
Connected to: "Cassandra Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.0

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use webpage;

[default@webpage] list f;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e6e757463683a687474702f
=> (column=bas, value=http://nutch.apache.org/, timestamp=1358588578114000)
=> (column=cnt, value=<omitted>, timestamp=1358588578191000)
=> (column=fi, value=2592000, timestamp=1358588024031000)
=> (column=pts, value=1358588021620, timestamp=1358588578188001)
=> (column=s, value=1.0, timestamp=1358588024032000)
=> (column=st, value=2, timestamp=1358588578187000)
=> (column=ts, value=1358588574261, timestamp=1358588578188000)
=> (column=typ, value=application/xhtml+xml, timestamp=1358588578192000)

3 Rows Returned.
Elapsed time: 27 msec(s).
[default@webpage]
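cassandra-cli prints row keys as hexadecimal bytes. Decoding them shows they are the same reversed-URL keys that readdb prints. A minimal sketch (the function name is made up for this example):

```python
def decode_row_key(hex_key):
    """Decode a hex row key as printed by cassandra-cli into text."""
    return bytes.fromhex(hex_key).decode("utf-8")

# RowKey taken from the listing above
print(decode_row_key("6f72672e6170616368652e6e757463683a687474702f"))
```

For the key above this yields org.apache.nutch:http/, which matches the bas column (http://nutch.apache.org/) of the same row.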

Parse

# nutch parse 1358588033-175013002
<omitted>

The parsed data is stored in Cassandra's column family "p".

[default@webpage] list sc;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e6e757463683a687474702f
=> (super_column=h,
     (column=Accept-Ranges, value=bytes, timestamp=1358588578175000)
     (column=Connection, value=close, timestamp=1358588578168000)
     (column=Content-Encoding, value=gzip, timestamp=1358588578162000)
     (column=Content-Length, value=8631, timestamp=1358588578171000)
     (column=Content-Type, value=text/html; charset=utf-8, timestamp=1358588578169000)
     (column=Date, value=Sat, 19 Jan 2013 09:42:53 GMT, timestamp=1358588578173000)
     (column=ETag, value="84c4-4d0994769b476-gzip", timestamp=1358588578166000)
     (column=Last-Modified, value=Tue, 11 Dec 2012 20:10:53 GMT, timestamp=1358588578172000)
     (column=Server, value=Apache/2.4.3 (Unix) OpenSSL/1.0.0g, timestamp=1358588578176000)
     (column=Vary, value=Accept-Encoding, timestamp=1358588578178000))
=> (super_column=mk,
     (column=__prsmrk__, value=1358588033-175013002, timestamp=1358590210616000)
     (column=_ftcmrk_, value=1358588033-175013002, timestamp=1358590210613000)
     (column=_gnmrk_, value=1358588033-175013002, timestamp=1358590210614000)
     (column=_injmrk_, value=y, timestamp=1358590210611000)
     (column=dist, value=0, timestamp=1358590210610000))
=> (super_column=mtdt,
     (column=_csh_, value=, timestamp=1358590006461000))
=> (super_column=ol,
     (column=http://lucene.apache.org/java/, value=Lucene, timestamp=1358590210557000)
     (column=http://lucene.apache.org/solr/, value=Solr, timestamp=1358590210581000)
     (column=http://nutch.apache.org/, value=Nutch, timestamp=1358590210606000)
<omitted>
developers and community members hang out in the #cassandra channel on irc.freenode.net . If you are new to IRC, you can use a web-based client . Dead Trees Cassandra High Performance Cookbook , by Ed Capriolo. Covers Cassandra 0.8. Also on Amazon . Copyright © 2009 The Apache Software Foundation . Licensed under the Apache License, Version 2.0. Apache and the Apache feather logo are trademarks of The Apache Software Foundation. Privacy Policy ., timestamp=1358590210680000)
=> (column=sig, value=�#)�2s��D��j�, timestamp=1358590210678000)
=> (column=t, value=The Apache Cassandra Project, timestamp=1358590210679000)

3 Rows Returned.
Elapsed time: 13 msec(s).

UpdateDB

Parse also extracts links; this step writes them back to the database as candidates for the next crawl.

# nutch updatedb

Looking into Cassandra, the row count has grown to 45.

[default@webpage] list f;
<omitted>
=> (column=ts, value=1358590521209, timestamp=1358590521282000)

45 Rows Returned.
Elapsed time: 80 msec(s).

Check with readdb as well.

# cat ./out_dir/part-r-00000 
http://cassandra.apache.org/ key: org.apache.cassandra:http/
baseUrl: http://cassandra.apache.org/
status: 2 (status_fetched)
fetchInterval: 2592000
fetchTime: 1363772574261
prevFetchTime: 1358588021620
retries: 0
modifiedTime: 0
protocolStatus: SUCCESS, args=[]
parseStatus: success/ok (1/0), args=[]
title: The Apache Cassandra Project
score: 1.0
signature: 14efbfbd2329efbfbd3273efbfbdefbfbd44efbfbdefbfbd6aefbfbd000000
markers: {dist=0, _injmrk_=y, _updmrk_=1358588033-175013002, _gnmrk_=1358588033-175013002, _ftcmrk_=1358588033-175013002, __prsmrk__=1358588033-175013002}
metadata _csh_ :  

http://cassandra.apache.org/download/ key: org.apache.cassandra:http/download/
baseUrl: null
status: 1 (status_unfetched)
fetchInterval: 2592000
fetchTime: 1358590521209
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: UNKNOWN_CODE_0, args=[]
parseStatus: notparsed/ok (0/0), args=[]
title: null
score: 0.0
markers: {dist=1}
metadata _csh_ :  

http://cassandra.apache.org/privacy.html key: org.apache.cassandra:http/privacy.html
baseUrl: null
status: 1 (status_unfetched)
fetchInterval: 2592000
fetchTime: 1358590521210
prevFetchTime: 0
retries: 0
modifiedTime: 0
protocolStatus: UNKNOWN_CODE_0, args=[]
parseStatus: notparsed/ok (0/0), args=[]
title: null
score: 0.0
markers: {dist=1}
metadata _csh_ :  

http://lucene.apache.org/ key: org.apache.lucene:http/
baseUrl: http://lucene.apache.org/
status: 2 (status_fetched)
fetchInterval: 2592000
fetchTime: 1363772574561
prevFetchTime: 1358588021620
retries: 0
modifiedTime: 0
protocolStatus: SUCCESS, args=[]
parseStatus: success/ok (1/0), args=[]
title: Apache Lucene - Welcome to Apache Lucene
score: 1.0
signature: efbfbdefbfbdefbfbdefbfbd67efbfbd74efbfbd5cefbfbdefbfbd21413befbfbdefbfbd0000000000000000000000000000000000000000000000000000000000000000000000
markers: {dist=0, _injmrk_=y, _updmrk_=1358588033-175013002, _gnmrk_=1358588033-175013002, _ftcmrk_=1358588033-175013002, __prsmrk__=1358588033-175013002}
metadata _csh_ :

Recap

Run one more round as follows.

# nutch generate -topN 10
<omitted>
GeneratorJob: generated batch id: 1358607691-1967365662
# nutch fetch 1358607691-1967365662
# nutch parse 1358607691-1967365662
# nutch updatedb

After this round, column family "p" held 13 rows and column family "f" held 125.
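Copying the batch ID from generate into fetch and parse by hand gets tedious, so one round can be scripted. The sketch below assumes that `nutch` is on the PATH and that generate prints the "generated batch id:" line in the form shown above. The function names are made up for this example, and only the ID parsing has been checked here, without a running Nutch.

```python
import re
import subprocess

BATCH_ID_RE = re.compile(r"generated batch id:\s*(\S+)")

def parse_batch_id(generate_output):
    """Extract the batch ID from the output of `nutch generate`, or None."""
    match = BATCH_ID_RE.search(generate_output)
    return match.group(1) if match else None

def crawl_round(top_n, nutch="nutch"):
    """Run generate, fetch, parse and updatedb once; return the batch ID used."""
    result = subprocess.run([nutch, "generate", "-topN", str(top_n)],
                            check=True, capture_output=True, text=True)
    batch_id = parse_batch_id(result.stdout + result.stderr)
    if batch_id is None:
        raise RuntimeError("no batch id found in generate output")
    for step in (["fetch", batch_id], ["parse", batch_id], ["updatedb"]):
        subprocess.run([nutch] + step, check=True)
    return batch_id

print(parse_batch_id("GeneratorJob: generated batch id: 1358607691-1967365662"))
```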

End

Memo: error when running a Cassandra client (SLF4J: Class path contains multiple SLF4J bindings.)

2013-01-19T11:53:00.000+09:00

While implementing a Cassandra client with Thrift, I got this error at run time.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hbase/lib/slf4j-log4j12-1.5.8.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apache-nutch-1.6/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apache-cassandra-1.2.0/lib/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.meta_data.FieldValueMetaData.<init>(BZ)V
 at org.apache.cassandra.thrift.ColumnParent.(ColumnParent.java:128)
 at NutchCassandraData.(NutchCassandraData.java:66)
 at NutchCassandraData.main(NutchCassandraData.java:108)

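As the log says, several slf4j-log4j12 jars are on the class path, although that part is only a warning. The NoSuchMethodError that actually stops the program most likely comes from a different version of the Thrift library picked up from the same extra lib directories.
In this case the HBase and Hadoop lib directories were on the class path; removing them solved it.
My Java knowledge is lacking. This looks like a problem that could be dealt with at compile time.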
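A quick way to spot this kind of clash is to list the jars on the class path and look for the same artifact in more than one version. The sketch below is a rough heuristic: it assumes jar files are named `name-version.jar`, and the function name is made up for this example. The input paths are the ones from the SLF4J warning above.

```python
import re
from collections import defaultdict

JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.]*)\.jar$")

def conflicting_jars(paths):
    """Map artifact name -> sorted versions, for artifacts present in 2+ versions."""
    versions = defaultdict(set)
    for path in paths:
        match = JAR_RE.match(path.rsplit("/", 1)[-1])
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(v) for name, v in versions.items() if len(v) > 1}

classpath = [
    "/usr/lib/hbase/lib/slf4j-log4j12-1.5.8.jar",
    "/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar",
    "/opt/apache-nutch-1.6/lib/slf4j-log4j12-1.6.1.jar",
    "/opt/apache-cassandra-1.2.0/lib/slf4j-log4j12-1.7.2.jar",
]
print(conflicting_jars(classpath))
```

Running the same check over every jar in those lib directories would show whether a Thrift jar is duplicated as well.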

End

Indexing from Nutch and Cassandra with Solr

2013-01-19T11:01:00.000+09:00

Make the crawl data stored with Nutch + Cassandra (see the earlier post) indexable and searchable with Solr.

Sakura VPS
CentOS 6.2
OpenJDK 1.6.0_24
Apache Nutch 2.1
Apache Cassandra 1.2
Apache Solr 4.0

Installing Solr

Download the following package from the official download site and install it, here under /opt/.

apache-solr-4.0.0.tgz

Extract the archive.

# cd /opt
# ls
apache-solr-4.0.0.tgz
# tar xvf apache-solr-4.0.0.tgz

The conf directory under the Nutch installation directory contains a schema file for Solr 4; copy it to the appropriate place in the Solr installation directory.

# cp /opt/apache-nutch-2.1/conf/schema-solr4.xml /opt/apache-solr-4.0.0/example/solr/collection1/conf/schema.xml

However, as with 1.6, Solr fails to start with an error.
Add the following line to "/opt/apache-solr-4.0.0/example/solr/collection1/conf/schema.xml"
(next to the existing <field> elements).

<field name="_version_" type="long" indexed="true" stored="true" />

It should now start. Start it as follows and open "http://<hostname>:8983/solr/" in a browser to confirm.

# cd /opt/apache-solr-4.0.0/example/
# java -jar start.jar

Sending the data to Solr

# /opt/apache-nutch-2.1/runtime/local/bin/nutch solrindex http://localhost:8983/solr -all
<omitted>
2013/01/19 10:54:31 org.apache.hadoop.mapred.Counters log
情報:     SPLIT_RAW_BYTES=1096
2013/01/19 10:54:31 org.apache.hadoop.mapred.Counters log
情報:     Map output records=33
SolrIndexerJob: done.

Searching with Solr

Open the Solr web UI in a browser, select "collection1" on the left, and click "Execute Query".

The results showed up just fine.
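The same query can be issued without the web UI by calling the select handler over HTTP. The following is a minimal sketch, assuming the default example setup used here (core "collection1" under http://localhost:8983/solr) and the JSON response writer. The function names are made up for this example, and `search` needs a running Solr to return anything.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def select_url(base, query="*:*", rows=10, core="collection1"):
    """Build the URL of a Solr select request that returns JSON."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "%s/%s/select?%s" % (base.rstrip("/"), core, params)

def search(base, query="*:*", rows=10):
    """Run the query and return the list of matching documents."""
    with urlopen(select_url(base, query, rows)) as resp:
        return json.load(resp)["response"]["docs"]

print(select_url("http://localhost:8983/solr"))
```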

End

Nutch and Cassandra

2013-01-19T09:23:00.001+09:00

Store Nutch data in Cassandra.
Until now I stored it in HBase, but HBase has many components (processes) and I find it a bit unwieldy, so I decided to try Cassandra.

Sakura VPS
CentOS 6.2
OpenJDK 1.6.0_24
Ant 1.7.1
Apache Nutch 2.1
Apache Cassandra 1.2

Getting the Cassandra binary package

apache-cassandra-1.2.0-bin.tar.gz

Extract (here under /opt)

# cd /opt
# tar xvf apache-cassandra-1.2.0-bin.tar.gz
# cd apache-cassandra-1.2.0

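Running Cassandra

There is only one node, so the only configuration change is the cluster name.
(For multiple nodes, see other sites.)

cassandra.yaml

cluster_name: 'Cassandra Cluster'

Start it.

# ./bin/cassandra

Check the connection with the client.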

# ./bin/cassandra-cli
Connected to: "Cassandra Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.0

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] 
[default@unknown] show cluster name;
Cassandra Cluster

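The client connects, so Cassandra appears to have started normally.

Getting the Nutch source package

Note: at the time of writing, the official download link offers only the source.

apache-nutch-2.1-src.tar.gz

Extract (again under /opt)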

# cd /opt
# ls
apache-nutch-2.1-src.tar.gz
# tar xvf apache-nutch-2.1-src.tar.gz
# cd apache-nutch-2.1

Configuration (Gora)

Add the following to each file.

conf/nutch-site.xml (inside the <configuration> tag)

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
  </property>

conf/gora.properties

gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160

ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2" conf="*->default" />

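Compile

Build with ant.

# ant
<omitted>

Configuration (Nutch)

Set the value of the "http.agent.name" property in nutch-default.xml in the conf directory to a string of your choice. (Overrides normally belong in conf/nutch-site.xml; the default file is edited directly here.)
conf/nutch-default.xml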

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

# mkdir urls

urls/seed.txt

http://nutch.apache.org/

conf/regex-urlfilter.txt

#+.
+^http://([a-z0-9]*\.)*nutch.apache.org

クローリング

以下のコマンドで実行。

# nutch crawl urls -dir crawl -depth 3 -topN 20

Viewing the results

Check with cassandra-cli.

# cassandra-cli
Connected to: "Cassandra Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.0

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use webpage;
Authenticated to keyspace: webpage

List all rows in column family "f".

[default@webpage] list f limit 1000;
<omitted>
=> (column=ts, value=1358554167986, timestamp=1358554170426001)
-------------------
RowKey: 6f72672e6170616368652e6e757463683a687474702f617069646f63732d322e312f6f72672f6170616368652f6e757463682f70726f746f636f6c2f687474702f6170692f426c6f636b6564457863657074696f6e2e68746d6c
=> (column=fi, value=2592000, timestamp=1358554170550001)
=> (column=s, value=1.3605226E-4, timestamp=1358554170551000)
=> (column=st, value=1, timestamp=1358554170549002)
=> (column=ts, value=1358554167993, timestamp=1358554170550000)

739 Rows Returned.
Elapsed time: 1349 msec(s).
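
The RowKey values are printed as hex, which makes them hard to read. The following is a minimal Python sketch for decoding them. It assumes the key is UTF-8 text laid out as reversed host, ":", protocol, an optional ":port", then the path, which is what the decoded key above looks like, and that the "ts" column holds epoch milliseconds. Both are inferences from the output rather than confirmed facts, and the function names are hypothetical. The hex string in the example is only the leading part of the RowKey shown above.

```python
from datetime import datetime, timezone


def decode_row_key(hex_key: str) -> str:
    """Decode a hex row key as printed by cassandra-cli into text."""
    return bytes.fromhex(hex_key).decode("utf-8")


def unreverse_url(key: str) -> str:
    """Turn 'org.apache.nutch:http/path' back into 'http://nutch.apache.org/path'.

    Assumes the layout: reversed-host ':' protocol [':' port] '/' path.
    """
    rev_host, _, rest = key.partition(":")
    proto_port, slash, path = rest.partition("/")
    proto, _, port = proto_port.partition(":")
    host = ".".join(reversed(rev_host.split(".")))
    netloc = f"{host}:{port}" if port else host
    return f"{proto}://{netloc}{slash}{path}"


# Leading bytes of the RowKey shown above.
key = decode_row_key("6f72672e6170616368652e6e757463683a687474702f")
print(key)
print(unreverse_url(key))

# The "ts" column looks like epoch milliseconds (assumption).
print(datetime.fromtimestamp(1358554167993 / 1000, tz=timezone.utc).isoformat())
```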

List all rows in column family "p".

[default@webpage] list p limit 1000;
<omitted>
   Parse Plugins org.apache.nutch.parse.headings     Indexing Filter Plugins org.apache.nutch.indexer.anchor An indexing plugin for inbound anchor text. org.apache.nutch.indexer.basic A basic indexing plugin. org.apache.nutch.indexer.feed   org.apache.nutch.indexer.metadata   org.apache.nutch.indexer.staticfield A simple plugin called at indexing that adds fields with static data. org.apache.nutch.indexer.subcollection   org.apache.nutch.indexer.tld Top Level Domain Indexing plugin. org.apache.nutch.indexer.urlmeta URL Meta Tag Indexing Plugin   Misc. Plugins org.apache.nutch.analysis.lang Text document language identifier. org.apache.nutch.collection Subcollection is a subset of an index. org.creativecommons.nutch Sample plugins that parse and index Creative Commons medadata.   Apache Nutch is an open source web-search software project. Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.   Overview   Package   Class   Use   Tree   Deprecated   Index   Help    PREV   NEXT FRAMES     NO FRAMES     All Classes Copyright © 2012 The Apache Software Foundation, timestamp=1358554162452000)
=> (column=sig, value=<binary signature bytes, not printable>, timestamp=1358554162451000)
=> (column=t, value=Overview (apache-nutch 1.6 API), timestamp=1358554162451001)

21 Rows Returned.
Elapsed time: 18 msec(s).

The data appears to have been stored correctly.

That's all.

Nutch and Solr

2013-01-08T00:59:00.000+09:00

Let's try running the Nutch crawler.

Sakura VPS
CentOS 6.2
OpenJDK 1.6.0_24
Apache Nutch 1.6
Apache Solr 4.0

Getting the Nutch binary package

apache-nutch-1.6-bin.tar.gz

Extract (here, under /opt)

# cd /opt
# ls
apache-nutch-1.6-bin.tar.gz
# tar xvf apache-nutch-1.6-bin.tar.gz

Running the nutch command

Run the following. Add it to .bashrc or a similar file if needed.

export PATH=$PATH:/opt/apache-nutch-1.6/bin

Then try running the command.

# cd apache-nutch-1.6
# nutch
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Preparing to crawl

Set the value of the "http.agent.name" property in nutch-default.xml, in the conf directory, to a string of your choice.

# vi conf/nutch-default.xml

nutch-default.xml

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

# mkdir urls
# vi urls/seed.txt

seed.txt

http://nutch.apache.org/

Crawling

Run the following command.

# nutch crawl urls -dir crawl -depth 1 -topN 2

Getting the Solr package

apache-solr-4.0.0.tgz

Extract (here, under /opt)

# cd /opt
# ls
apache-solr-4.0.0.tgz
# tar xvf apache-solr-4.0.0.tgz

Configuring and starting Solr

Nutch ships with a schema file for Solr, so copy it into place.

# cp /opt/apache-nutch-1.6/conf/schema.xml /opt/apache-solr-4.0.0/example/solr/collection1/conf/schema.xml

However, starting Solr with the following commands produces various errors and it fails to start...

# cd /opt/apache-solr-4.0.0/example/
# java -jar start.jar

Edit the schema file "schema.xml" slightly.

# vi solr/collection1/conf/schema.xml

Add

<field name="text" type="text" stored="false" indexed="true"/>

<field name="_version_" type="long" indexed="true" stored="true" />

Comment out

<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->

Solr should now start.

Crawling and indexing again

# nutch crawl urls/ -solr http://localhost:8983/solr/ -depth 2 -topN 50

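Searching in the Solr web UI

Open http://localhost:8983/solr/ in a browser. Select "Query" under "collection1" on the left and click "Execute Query".
Hmm... the XML is not rendered nicely.

This seems to be a browser issue. In Chrome, installing an add-on fixed it.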
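
Another way around the XML rendering problem is to ask Solr for JSON and read it from a script. The following is a minimal Python sketch, assuming the default `select` handler under the "collection1" core and the `wt=json` response writer. The function names are hypothetical, and `search` only works against a running Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base: str, core: str, query: str = "*:*", rows: int = 10) -> str:
    """Build a select URL that asks Solr for JSON instead of XML."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base.rstrip('/')}/{core}/select?{params}"


def search(base: str, core: str, query: str = "*:*") -> dict:
    """Fetch and parse the JSON response (requires a running Solr)."""
    with urlopen(build_select_url(base, core, query), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URL; call search() once Solr is up.
    print(build_select_url("http://localhost:8983/solr/", "collection1"))
```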

That's all.

Memo: when Java output is garbled in the Mac terminal

2012-12-19T18:11:00.003+09:00

Running the Java compiler (# javac) in the Mac terminal produces garbled Japanese output.
Apparently this is because the output is encoded in Shift-JIS.

Mac OS X 10.8.2

Run this, or add it to an appropriate startup script.

export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8

That's all.

Truncating an HBase table with region splits

2012-09-03T20:23:00.000+09:00

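A note on HBase table operations: I want to TRUNCATE a table while specifying its region splits.
If split keys are given, the table is recreated with those splits. If not, the regions (boundaries) the table had before deletion are reproduced.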
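
The split-key handling can be illustrated without an HBase cluster. The following is a minimal Python sketch of the two cases, assuming region end keys arrive in region order with an empty end key for the last region, and that the split-key file holds one key per line. The function names and sample keys are hypothetical; this mirrors the logic only and does not call HBase.

```python
def split_keys_from_end_keys(end_keys):
    """Derive split keys from region end keys, in region order.

    The last region's end key is empty (end of table), so it is dropped:
    N regions yield N-1 split keys.
    """
    return list(end_keys[:-1])


def read_split_keys(lines):
    """Read one split key per line as bytes, skipping empty lines."""
    return [line.rstrip("\n").encode("utf-8") for line in lines if line.rstrip("\n")]


# Hypothetical table with three regions: [..b) [b..m) [m..]
end_keys = [b"b", b"m", b""]
print(split_keys_from_end_keys(end_keys))

# Hypothetical split-key file contents, including a blank line.
print(read_split_keys(["b\n", "\n", "m\n"]))
```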

Source code

# java truncateWithRegions
Usage : java truncateWithRegions <tablename> [<filename>]

truncateWithRegions.java

import java.io.*;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class truncateWithRegions {

  public static void main(String[] args) throws Exception {

    if (args.length < 1 || args.length > 2) {
      printUsage();
      System.exit(1);
    }

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin hAdmin = new HBaseAdmin(conf);

    HTableDescriptor desc;
    byte[][] splits = null;
    if (args.length == 2) splits = readSplitKeys(args[1]);
    if (hAdmin.tableExists(args[0])) {
      desc = hAdmin.getTableDescriptor(Bytes.toBytes(args[0]));
      if (args.length == 1) {
        splits = getSplitKeys(args[0]);
      }
      hAdmin.disableTable(args[0]);
      hAdmin.deleteTable(args[0]);
    } else {
      desc = new HTableDescriptor(args[0]);
    }
    createTable(hAdmin, desc, splits);
  }

  private static byte[][] getSplitKeys(String tableName) throws Exception {
    HTable hTable = new HTable(tableName);
    byte[][] splits = new byte[hTable.getEndKeys().length - 1][];
    for ( int i = 0; i < hTable.getEndKeys().length - 1; i++) {
      splits[i] = hTable.getEndKeys()[i];
    }
    return splits;
  }

  private static byte[][] readSplitKeys(String filename) throws Exception {

    BufferedReader br = new BufferedReader(new FileReader(filename));
    List<byte[]> splitList = new ArrayList<byte[]>();
    String line = "";
    while ((line = br.readLine()) != null) {
      if (line.length() != 0) {
        byte[] split = Bytes.toBytes(line);
        splitList.add(split);
      }
    }
    br.close();
    return splitList.toArray(new byte[0][]);
  }

  private static void createTable(HBaseAdmin hAdmin, HTableDescriptor desc, byte[][] splits) throws Exception {
    if (splits == null) {
      hAdmin.createTable(desc);
    } else {
      hAdmin.createTable(desc, splits);
    }
  }

  private static void printUsage() {
    System.err.println("Usage : java truncateWithRegions <tablename> [<filename>]");
  }
}

That's all.

Memo: switching CentOS to a Japanese environment

2012-06-16T13:29:00.002+09:00

Sakura VPS
CentOS 6.2

Adding packages for the Japanese environment

# yum -y groupinstall "Japanese Support"

It was already installed.

Editing the configuration file

# vi /etc/sysconfig/i18n

<Before>

LANG="C"

<After>

LANG="ja_JP.UTF-8"

Applying the setting

# source /etc/sysconfig/i18n
# echo $LANG
ja_JP.UTF-8
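
The file is a list of shell-style assignments, so it can also be checked from a script. The following is a minimal Python sketch, assuming simple `KEY="value"` lines with optional `#` comments. The helper `parse_sysconfig` is hypothetical and does not handle shell features such as variable expansion.

```python
import shlex


def parse_sysconfig(text: str) -> dict:
    """Parse simple KEY="value" lines, ignoring blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parts = shlex.split(value)  # removes the surrounding quotes
        settings[key.strip()] = parts[0] if parts else ""
    return settings


print(parse_sysconfig('LANG="ja_JP.UTF-8"\n'))
```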

That's all.

HBase, Thrift, and Python

2012-06-08T00:02:00.002+09:00

Accessing HBase from Python via Thrift on CentOS.

The Hadoop distribution is CDH3u3.

Installing Thrift

http://thrift.apache.org/docs/install/centos/

Installing hbase-thrift

Use yum.

# yum install hadoop-hbase-thrift

The CDH package does not include the Thrift definition file, so download the source from the following.

http://www.apache.org/dyn/closer.cgi/hbase/

Copy Hbase.thrift (the Thrift definition file) to the working directory

# cp {HBase source directory}/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift {working directory}

While at it, move to the working directory.

# cd {working directory}

Generating the Python module with thrift

# thrift --gen py Hbase.thrift

Writing the client code

# cd gen-py
# vi hbaseClient.py

hbaseClient.py

import sys
sys.path.append('./gen-py')

from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

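# Connect to the HBase Thrift server (port 9090 on localhost).
transport=TBufferedTransport(TSocket('localhost', 9090))
transport.open()
protocol=TBinaryProtocol.TBinaryProtocol(transport)
client=Hbase.Client(protocol)

# List the table names, then close the connection.
print(client.getTableNames())
transport.close()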


Run

# service hadoop-hbase-thrift start
# python hbaseClient.py
['usertable']