Wednesday, July 2, 2014

Exploring nutch 1.8 + solr 4.9.0, Part 1: Basic Installation

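I had been planning to learn nutch, solr, and lucene for a long time, and now I finally have the time. I originally meant to install nutch 2.2.x, but the following two articles changed my mind.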

In short, nutch 2.2.x is still unstable when storing data to a database, and its overall performance lags far behind. Since what I mainly want to learn is solr and lucene, I chose nutch 1.x.

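Before installing nutch, readers may want to read the article "Copyright Issues with Nutch and Other Web Crawlers" (Nutch 等 Web Crawler 的著作權問題). Breaking the law can carry both civil and criminal liability, so be careful.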

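A while ago I was practicing Hadoop YARN + Spark + Shark, so hadoop is already installed on four centos 6.3 VMs: master, slave1, slave2, and slave3. This post does not cover combining nutch with Hadoop, however; I will add tests for that when I have time.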

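Start with master. Before anything else, install a jdk; I installed jdk1.7.0_55.
1. Installing Ant 1.9.5
Working in root's HOME, I first installed Ant 1.9.4, because the Ant version that centos 6.3 yum provides is too old for solr to accept. After installing Ant 1.9.4, however, the build produced the message shown below. According to https://www.mail-archive.com/blfs-book@lists.linuxfromscratch.org/msg00345.html the problem appears to have been fixed since, so I switched to fetching the latest version from svn.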
/sources/apache-ant/apache-ant-1.9.4/src/tests/junit/org/apache/tools/ant/taskdefs/ExecuteWatchdogTest.java:143: error: cannot access Matcher
                         throw new AssumptionViolatedException("process interrupted in thread", e);

yum install subversion
svn co http://svn.apache.org/repos/asf/ant/core/trunk/ ant-core
cd ant-core
sh build.sh -Ddist.dir=./install dist

Append the following to /etc/profile (bash is assumed here).

vi /etc/profile

append :

export ANT_HOME=/root/ant-core/install
export JAVA_HOME=/usr/java/jdk1.7.0_55
export PATH=${PATH}:${ANT_HOME}/bin

Make /etc/profile take effect:

source /etc/profile
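
Since /etc/profile is edited by hand here, and again later for LD_LIBRARY_PATH, repeating the steps can leave duplicate lines behind. The following is a minimal sketch of an idempotent append. The helper name `append_once` is hypothetical and is not part of ant, centos, or any tool used in this post.

```shell
# append_once FILE LINE
# Appends LINE to FILE only if FILE does not already contain that exact line.
# Hypothetical helper for illustration only.
append_once() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example use (as root), mirroring the three lines appended above:
#   append_once /etc/profile 'export ANT_HOME=/root/ant-core/install'
#   append_once /etc/profile 'export JAVA_HOME=/usr/java/jdk1.7.0_55'
#   append_once /etc/profile 'export PATH=${PATH}:${ANT_HOME}/bin'
```

The single quotes in the example keep ${PATH} and ${ANT_HOME} unexpanded, so they are written to the file literally and resolved when /etc/profile is sourced.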
2. Installing nutch 1.8
cd ~
wget http://ftp.twaren.net/Unix/Web/apache/nutch/1.8/apache-nutch-1.8-src.tar.gz
tar zxvf apache-nutch-1.8-src.tar.gz
cd apache-nutch-1.8
ant
3. nutch config
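Set the nutch agent name and, optionally, remove the size limit on fetched content (you can also leave the limits as they are in nutch-default.xml). In the properties below, file.content.limit is set to -1 (no truncation), while http.content.limit is set to 65536.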
cd runtime/local/conf
vi nutch-site.xml

Add the following between <configuration> and </configuration>:

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
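
To double-check what ended up in nutch-site.xml without opening an editor, a small lookup can help. This is only a sketch: `get_property` is a hypothetical helper, and it assumes each <value> line directly follows its <name> line, as in the listing above. It is not a real XML parser.

```shell
# get_property FILE NAME
# Prints the <value> that directly follows <name>NAME</name> in FILE.
# Hypothetical helper; relies on the name and value sitting on adjacent lines.
get_property() {
    # -F: fixed string, -A1: also print the line after each match
    grep -F -A1 -- "<name>$2</name>" "$1" |
        sed -n 's:.*<value>\(.*\)</value>.*:\1:p' |
        head -n 1
}

# Example use from runtime/local/conf:
#   get_property nutch-site.xml http.agent.name
#   get_property nutch-site.xml http.content.limit
```

An empty result means the property is not set in this file, in which case nutch would be expected to fall back to nutch-default.xml.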
4. Installing solr 4.9.0
cd ~
wget http://ftp.twaren.net/Unix/Web/apache/lucene/solr/4.9.0/solr-4.9.0-src.tgz
tar zxvf solr-4.9.0-src.tgz
cd solr-4.9.0
ant ivy-bootstrap
ant compile
cd solr
ant example
5. Integrating nutch and solr
cd example/solr
mv collection1/conf/schema.xml collection1/conf/schema.xml.bak
cp /root/apache-nutch-1.8/runtime/local/conf/schema-solr4.xml collection1/conf/schema.xml

Add the following to collection1/conf/schema.xml:

vi collection1/conf/schema.xml

Add right after <fields>:

<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>
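
The same edit can be scripted so that it is safe to repeat. The sketch below uses a hypothetical helper, `add_version_field`, which inserts the _version_ field after the first line containing <fields> and does nothing if the field is already declared. It assumes <fields> sits on a line of its own.

```shell
# add_version_field SCHEMA_FILE
# Inserts the _version_ field after the first <fields> line, unless present.
# Hypothetical helper for illustration only.
add_version_field() {
    schema=$1
    field='<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>'
    # Nothing to do if a _version_ field is already declared.
    if grep -q 'name="_version_"' "$schema"; then
        return 0
    fi
    work=$(mktemp) || return 1
    # Print every line; after the first line containing <fields>, print the field.
    awk -v field="$field" '
        { print }
        /<fields>/ && !inserted { print "    " field; inserted = 1 }
    ' "$schema" > "$work" && cat "$work" > "$schema"
    rc=$?
    rm -f "$work"
    return $rc
}

# Example use from example/solr:
#   add_version_field collection1/conf/schema.xml
```

Writing the result back with cat instead of mv keeps the original file's owner and permissions.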
6. Installing Tomcat 8.0.9
cd ~
wget http://ftp.mirror.tw/pub/apache/tomcat/tomcat-8/v8.0.9/src/apache-tomcat-8.0.9-src.tar.gz
tar zxvf apache-tomcat-8.0.9-src.tar.gz
cd apache-tomcat-8.0.9-src
ant -buildfile ./build.xml
cp -r output/build /usr/local/apache-tomcat-8.0.9
7. Deploying solr.war to Tomcat
yum install unzip
unzip /root/solr-4.9.0/solr/example/webapps/solr.war -d /usr/local/apache-tomcat-8.0.9/webapps/solr
vi /usr/local/apache-tomcat-8.0.9/webapps/solr/WEB-INF/web.xml

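Change the commented-out block (first listing) into the active entry (second listing), with solr/home pointing to the solr example directory: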

  <!--
    <env-entry>
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/put/your/solr/home/here</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
   -->

    <env-entry> 
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/root/solr-4.9.0/solr/example/solr</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
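
A wrong solr/home value is easy to miss until Tomcat starts, so it can be worth checking the directory first. This sketch uses a hypothetical helper, `check_solr_home`, which only looks for the two files this post relies on: solr.xml and collection1/conf/schema.xml.

```shell
# check_solr_home DIR
# Reports whether DIR contains the files this walkthrough relies on.
# Hypothetical helper for illustration only.
check_solr_home() {
    solr_home=$1
    for path in "$solr_home/solr.xml" "$solr_home/collection1/conf/schema.xml"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            return 1
        fi
    done
    echo "ok: $solr_home"
}

# Example use with the path configured in web.xml:
#   check_solr_home /root/solr-4.9.0/solr/example/solr
```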
8. Starting Tomcat and follow-up fixes
cd /usr/local/apache-tomcat-8.0.9
bin/startup.sh
vi logs/catalina.out


26-Jun-2014 19:31:15.405 INFO [main] org.apache.catalina.core.AprLifecycleListener.init The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

The message says the APR based Apache Tomcat Native library cannot be found, so install it.

cd /usr/local/apache-tomcat-8.0.9/bin
tar zxvf tomcat-native.tar.gz
cd tomcat-native-1.1.30-src/jni/native
yum install apr-devel openssl-devel

./configure --with-apr=/usr/bin/apr-1-config \
            --with-java-home=/usr/java/jdk1.7.0_55/ \
            --with-ssl=yes
make && make install

Delete the extracted directory, which is no longer needed.

rm -rf /usr/local/apache-tomcat-8.0.9/bin/tomcat-native-1.1.30-src

Add LD_LIBRARY_PATH:

vi /etc/profile

append :

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/apr/lib
export LD_LIBRARY_PATH

source /etc/profile

Restart Tomcat and check the log to see whether the installation succeeded.

cd /usr/local/apache-tomcat-8.0.9
bin/shutdown.sh
bin/startup.sh
vi logs/catalina.out

26-Jun-2014 21:34:36.695 INFO [main] org.apache.catalina.core.AprLifecycleListener.init Loaded APR based Apache Tomcat Native library 1.1.30 using APR version 1.3.9.

Judging from the message, the APR based Apache Tomcat Native library has been installed successfully.
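
Because catalina.out is appended to on every start, the most recent APR message is the one that matters. The sketch below reports the state from the last matching line. `apr_status` is a hypothetical helper, and the two patterns are taken from the log lines quoted in this section.

```shell
# apr_status LOGFILE
# Prints "loaded", "missing", or "unknown" based on the last APR message.
# Hypothetical helper; patterns come from the log lines quoted above.
apr_status() {
    last=$(grep -E 'Loaded APR based Apache Tomcat Native library|APR based Apache Tomcat Native library .* was not found' "$1" 2>/dev/null | tail -n 1)
    case "$last" in
        *"Loaded APR based"*) echo loaded ;;
        *"was not found"*)    echo missing ;;
        *)                    echo unknown ;;
    esac
}

# Example use:
#   apr_status /usr/local/apache-tomcat-8.0.9/logs/catalina.out
```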

vi logs/localhost.2014-06-26.log


26-Jun-2014 19:31:18.276 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.filterStart Exception starting filter SolrRequestFilter
 java.lang.NoClassDefFoundError: Failed to initialize Apache Solr: Could not find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For other containers, the corresponding directory should be used. For more information, see: http://wiki.apache.org/solr/SolrLogging
        at org.apache.solr.servlet.CheckLoggingConfiguration.check(CheckLoggingConfiguration.java:28)

Judging from the message, solr logging is not set up properly.

cp /root/solr-4.9.0/solr/example/lib/ext/* /usr/local/apache-tomcat-8.0.9/lib
cp /root/solr-4.9.0/solr/example/resources/log4j.properties /usr/local/apache-tomcat-8.0.9/lib

bin/shutdown.sh
bin/startup.sh
vi logs/catalina.out

0    [localhost-startStop-1] INFO  org.apache.solr.servlet.SolrDispatchFilter  – SolrDispatchFilter.init()
10   [localhost-startStop-1] INFO  org.apache.solr.core.SolrResourceLoader  – Using JNDI solr.home: /root/solr-4.8.1/solr/example/solr
11   [localhost-startStop-1] INFO  org.apache.solr.core.SolrResourceLoader  – new SolrResourceLoader for directory: '/root/solr-4.8.1/solr/example/solr/'
180  [localhost-startStop-1] INFO  org.apache.solr.core.ConfigSolr  – Loading container configuration from /root/solr-4.8.1/solr/example/solr/solr.xml
359  [localhost-startStop-1] INFO  org.apache.solr.core.CoresLocator  – Config-defined core root directory: /root/solr-4.8.1/solr/example/solr

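solr logging now works normally. A new logs/solr.log file has also appeared.

9. Testing whether solr is installed successfully

Open the following URL in a browser and check that the solr admin interface is displayed and works properly.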

http://localhost:8080/solr/

Through the admin interface you can rename a core or add a core.
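
On a VM without a browser, the same check can be done from the command line. This is a rough sketch: `classify_status` and `check_solr` are hypothetical helpers, and the only thing tested is the HTTP status code that curl reports for the admin URL, not whether the admin interface itself works.

```shell
# classify_status CODE
# Maps an HTTP status code (as printed by curl) to a short verdict.
classify_status() {
    case "$1" in
        2??|3??) echo up ;;
        000)     echo unreachable ;;   # curl prints 000 when no response arrived
        *)       echo error ;;
    esac
}

# check_solr URL
# Fetches URL quietly and classifies the resulting status code.
check_solr() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
    classify_status "$code"
}

# Example use:
#   check_solr http://localhost:8080/solr/
```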

10. Testing nutch, without hadoop for now
First limit the number of urls fetched in each round; otherwise the disk will run out of space.
cd /root/apache-nutch-1.8/runtime/local
vi bin/crawl
# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 50000`

Change it to:

# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 50`
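
To see what the change does to the numbers, the same arithmetic can be reproduced outside bin/crawl. The sketch below uses shell arithmetic expansion instead of expr. `size_fetchlist` is a hypothetical function, and the slave count of 1 in the examples is an assumption for a local, non-hadoop run rather than something taken from the script.

```shell
# size_fetchlist NUM_SLAVES PER_SLAVE
# Reproduces the sizeFetchlist calculation: NUM_SLAVES * PER_SLAVE.
# Hypothetical function for illustration only.
size_fetchlist() {
    num_slaves=$1
    per_slave=$2
    echo $(( num_slaves * per_slave ))
}

# Assuming one slave (an assumption for a local run):
#   size_fetchlist 1 50000   # original setting
#   size_fetchlist 1 50      # reduced setting
```

If sizeFetchlist caps the fetch list of each round, the reduced setting with one slave would bound a 2-round crawl at roughly 100 urls.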

Start crawling. Progress can be traced through /root/apache-nutch-1.8/runtime/local/logs/hadoop.log, the standard output of the command, and /usr/local/apache-tomcat-8.0.9/logs/solr.log.

mkdir urls
vi urls/seed.txt

Add one line:

http://wiki.apache.org/nutch/
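
The crawl command takes the seed directory (urls) as its first argument, so the seed file belongs inside that directory. The sketch below creates both in one step. `make_seed_dir` is a hypothetical helper, and the file name seed.txt simply follows the name used in this post.

```shell
# make_seed_dir DIR URL [URL...]
# Creates DIR and writes one URL per line to DIR/seed.txt.
# Hypothetical helper for illustration only.
make_seed_dir() {
    dir=$1
    shift
    # Refuse to write an empty seed list.
    [ "$#" -ge 1 ] || return 1
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example use from /root/apache-nutch-1.8/runtime/local:
#   make_seed_dir urls http://wiki.apache.org/nutch/
```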

Run the command to start fetching data and building the solr index.

#crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

bin/crawl urls crawl http://localhost:8080/solr/ 2

Open the following URL to query the results; simply clicking the Execute Query button at the bottom returns all the fetched data.

http://localhost:8080/solr/#/collection1/query
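
The admin query page sends a request to the core's select handler, and the same request can be made with curl. The sketch below only builds the URL. `solr_select_url` is a hypothetical helper that does not URL-encode the query, so it suits simple queries such as the match-all query *:* only.

```shell
# solr_select_url BASE CORE QUERY [ROWS]
# Builds a select URL that asks solr for a JSON response.
# Hypothetical helper; QUERY is inserted as-is, without URL encoding.
solr_select_url() {
    base=${1%/}          # drop one trailing slash, if any
    core=$2
    query=$3
    rows=${4:-10}
    printf '%s/%s/select?q=%s&wt=json&rows=%s\n' "$base" "$core" "$query" "$rows"
}

# Example use: rows=0 returns only the header and the hit count (numFound).
#   curl -s "$(solr_select_url http://localhost:8080/solr collection1 '*:*' 0)"
```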

If you want a nicer user interface in ordinary web-page form, try the following 2 URLs.

http://localhost:8080/solr/select/?q=apache&wt=xslt&tr=example.xsl
http://localhost:8080/solr/collection1/browse

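There are many enhanced front ends for solr on the web, but for now the focus is on understanding the internals of nutch and solr; I will try out the interfaces some time later. In the meantime, see the introduction at the following URL.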

http://searchhub.org/2010/01/14/solr-search-user-interface-examples/