Setting Up a Pseudo-Distributed Hadoop Environment on a MacBook Air

Introduction

I want to play with Apache Mahout, so I decided to build a Hadoop environment on my MacBook Air.
There seems to be plenty of information out there, but version and environment differences keep it scattered, so I'm leaving these notes.

References

The following pages were helpful:
http://lizan.asia/blog/2012/11/13/mountain-lion-setup-hadoop/
http://shayanmasood.com/blog/how-to-setup-hadoop-on-mac-os-x-10-9-mavericks/
http://www.ayutaya.com/ops/os-x/hadoop-pdist
http://metasearch.sourceforge.jp/wiki/index.php?Hadoop%A5%BB%A5%C3%A5%C8%A5%A2%A5%C3%A5%D7

Environment

  • OS X on a MacBook Air
  • Homebrew
  • Hadoop 1.2.1
  • Java 1.6.0_65

Steps

Installing Hadoop

brew install hadoop
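
If the formula installed cleanly, the client is already on your PATH; a quick check (this memo assumes the Homebrew formula still ships 1.2.1):

hadoop version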

Configuring Hadoop

The configuration files apparently all live under /usr/local/Cellar/hadoop/1.2.1/libexec/conf.
The settings below follow the pages listed in the references.

  • hadoop-env.sh
    • This is where we point Hadoop at Java 6; the HADOOP_OPTS line is the workaround for the SCDynamicStore error mentioned below. (A way to double-check the Java path comes after this list.)
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
export JAVA_HOME=`/usr/libexec/java_home -v 1.6`
  • core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
  • hdfs-site.xml
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/Users/${user.name}/hdfs/name-node</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/Users/${user.name}/hdfs/data-node</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
  • mapred-site.xml
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
    <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
    </property>
</configuration>
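
Two quick checks before moving on. First, what the backticked expression in hadoop-env.sh resolves to (the path shown is typical for Apple's JDK 6; yours may differ):

/usr/libexec/java_home -v 1.6
# e.g. /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home

Second, ${user.name} in the XML expands to your login user, so the HDFS directories land under your home directory. Hadoop creates them itself when you format the NameNode and start the DataNode, but pre-creating them makes the layout explicit:

mkdir -p ~/hdfs/name-node ~/hdfs/data-node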

Configuring ssh

In a pseudo-distributed setup Hadoop sshes into localhost, so that needs to be configured.
Turn on "System Preferences" → "Sharing" → "Remote Login", then:

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost

If that logs you in, you're all set.
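
If the first connection asks about the host key, answer yes once; otherwise start-all.sh will stall on the same prompt later. A non-interactive check that should print ok without a password prompt:

ssh localhost echo ok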

hostname

sudo hostname localhost

I had to work with the hostname set like this...
I'd like to configure this properly at some point.
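
A persistent alternative to the temporary hostname command would be to set the hostname in the system configuration via scutil (the "proper" fix I haven't gotten around to):

sudo scutil --set HostName localhost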

Starting Hadoop

Time to actually start Hadoop.

Initialization

hadoop namenode -format
14/01/19 20:25:51 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_65
************************************************************/
Re-format filesystem in /Users/junji/hdfs ? (Y or N) Y
14/01/19 20:25:54 INFO util.GSet: Computing capacity for map BlocksMap
14/01/19 20:25:54 INFO util.GSet: VM type       = 64-bit
14/01/19 20:25:54 INFO util.GSet: 2.0% max memory = 1039859712
14/01/19 20:25:54 INFO util.GSet: capacity      = 2^21 = 2097152 entries
14/01/19 20:25:54 INFO util.GSet: recommended=2097152, actual=2097152
14/01/19 20:25:54 INFO namenode.FSNamesystem: fsOwner=junji
14/01/19 20:25:55 INFO namenode.FSNamesystem: supergroup=supergroup
14/01/19 20:25:55 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/01/19 20:25:55 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/01/19 20:25:55 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/01/19 20:25:55 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/01/19 20:25:55 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/01/19 20:25:55 INFO common.Storage: Image file /Users/junji/hdfs/current/fsimage of size 111 bytes saved in 0 seconds.
14/01/19 20:25:55 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/Users/junji/hdfs/current/edits
14/01/19 20:25:55 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/Users/junji/hdfs/current/edits
14/01/19 20:25:55 INFO common.Storage: Storage directory /Users/junji/hdfs has been successfully formatted.
14/01/19 20:25:55 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
************************************************************/

If hadoop-env.sh isn't set up correctly, this step fails with an "Unable to load realm info from SCDynamicStore" error.

Startup

start-all.sh
starting namenode, logging to /usr/local/Cellar/hadoop/1.2.1/libexec/bin/../logs/hadoop-junji-namenode-localhost.out
localhost: starting datanode, logging to /usr/local/Cellar/hadoop/1.2.1/libexec/bin/../logs/hadoop-junji-datanode-localhost.out
localhost: starting secondarynamenode, logging to /usr/local/Cellar/hadoop/1.2.1/libexec/bin/../logs/hadoop-junji-secondarynamenode-localhost.out
starting jobtracker, logging to /usr/local/Cellar/hadoop/1.2.1/libexec/bin/../logs/hadoop-junji-jobtracker-localhost.out
localhost: starting tasktracker, logging to /usr/local/Cellar/hadoop/1.2.1/libexec/bin/../logs/hadoop-junji-tasktracker-localhost.out

Verification
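
Whether everything actually started can be checked with jps (PIDs will differ); all five daemons should be listed:

jps
# expect NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker (plus Jps itself)

The web UIs should answer too: the NameNode at http://localhost:50070/ and the JobTracker at http://localhost:50030/.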

Sample run

hadoop jar /usr/local/Cellar/hadoop/1.2.1/libexec/hadoop-examples-1.2.1.jar pi 2 100
Number of Maps  = 2
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/01/19 21:22:11 INFO mapred.FileInputFormat: Total input paths to process : 2
14/01/19 21:22:11 INFO mapred.JobClient: Running job: job_201401192119_0001
14/01/19 21:22:12 INFO mapred.JobClient:  map 0% reduce 0%
14/01/19 21:22:18 INFO mapred.JobClient:  map 100% reduce 0%
14/01/19 21:22:25 INFO mapred.JobClient:  map 100% reduce 33%
14/01/19 21:22:26 INFO mapred.JobClient:  map 100% reduce 100%
14/01/19 21:22:27 INFO mapred.JobClient: Job complete: job_201401192119_0001
14/01/19 21:22:27 INFO mapred.JobClient: Counters: 27
14/01/19 21:22:27 INFO mapred.JobClient:   Job Counters
14/01/19 21:22:27 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/19 21:22:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=8154
14/01/19 21:22:27 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/19 21:22:27 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/19 21:22:27 INFO mapred.JobClient:     Launched map tasks=2
14/01/19 21:22:27 INFO mapred.JobClient:     Data-local map tasks=2
14/01/19 21:22:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8750
14/01/19 21:22:27 INFO mapred.JobClient:   File Input Format Counters
14/01/19 21:22:27 INFO mapred.JobClient:     Bytes Read=236
14/01/19 21:22:27 INFO mapred.JobClient:   File Output Format Counters
14/01/19 21:22:27 INFO mapred.JobClient:     Bytes Written=97
14/01/19 21:22:27 INFO mapred.JobClient:   FileSystemCounters
14/01/19 21:22:27 INFO mapred.JobClient:     FILE_BYTES_READ=50
14/01/19 21:22:27 INFO mapred.JobClient:     HDFS_BYTES_READ=480
14/01/19 21:22:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=165610
14/01/19 21:22:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
14/01/19 21:22:27 INFO mapred.JobClient:   Map-Reduce Framework
14/01/19 21:22:27 INFO mapred.JobClient:     Map output materialized bytes=56
14/01/19 21:22:27 INFO mapred.JobClient:     Map input records=2
14/01/19 21:22:27 INFO mapred.JobClient:     Reduce shuffle bytes=56
14/01/19 21:22:27 INFO mapred.JobClient:     Spilled Records=8
14/01/19 21:22:27 INFO mapred.JobClient:     Map output bytes=36
14/01/19 21:22:27 INFO mapred.JobClient:     Total committed heap usage (bytes)=454238208
14/01/19 21:22:27 INFO mapred.JobClient:     Map input bytes=48
14/01/19 21:22:27 INFO mapred.JobClient:     Combine input records=0
14/01/19 21:22:27 INFO mapred.JobClient:     SPLIT_RAW_BYTES=244
14/01/19 21:22:27 INFO mapred.JobClient:     Reduce input records=4
14/01/19 21:22:27 INFO mapred.JobClient:     Reduce input groups=4
14/01/19 21:22:27 INFO mapred.JobClient:     Combine output records=0
14/01/19 21:22:27 INFO mapred.JobClient:     Reduce output records=0
14/01/19 21:22:27 INFO mapred.JobClient:     Map output records=4
Job Finished in 16.776 seconds
Estimated value of Pi is 3.12000000000000000000
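
Raising the sample count tightens the estimate; the same jar, just with bigger arguments:

hadoop jar /usr/local/Cellar/hadoop/1.2.1/libexec/hadoop-examples-1.2.1.jar pi 10 1000

When you're done, the matching script shuts the five daemons back down:

stop-all.sh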

Wrap-up

Next time I want to get Mahout running on top of this.