hadoop - In Python MRJob, how to set up the option for the temporary output directory
I am using mrjob to run a simple word count as a standard Hadoop job:
python word_count.py -r hadoop hdfs:///path-to-my-data
This prints an error indicating that it cannot create the temporary directory for temporary output:
stderr: mkdir: Incomplete HDFS URI, no host: hdfs:///user/path-to-tmp-dir ... ... subprocess.CalledProcessError: Command '['/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop', 'fs', '-mkdir', 'hdfs:///user/
I assume it cannot create the directory that mrjob wants by default. Is it possible to pass an option to mrjob through the command line? The only option I have found so far is base_tmp_dir. Its description mentions "path to put local temp dirs inside." The "local" part is not what I am looking for, since the temporary output directory is supposed to be in HDFS. Nevertheless, I gave it a try:
python word_count.py --base-tmp-dir=./tmp/ data.txt
or
python word_count.py -r hadoop --base-tmp-dir=hdfs:///some-path hdfs:///path-to-data
but it failed, with mrjob complaining that there is no such option:
word_count.py: error: no such option: --base-tmp-dir
The word_count.py is the standard one. I may be missing some essential knowledge about mrjob, or I may have to go with Hadoop streaming.
mrjob calls the hadoop binary when interacting with HDFS. The hadoop command needs to know where the namenode is located on the network, so that URIs like hdfs:///some-path don't require the full host (something like hdfs://your-namenode:9000/some-path). The command figures out the namenode by reading its configuration XML files.
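To make the lookup concrete, here is a minimal sketch of how the default filesystem could be read out of a core-site.xml file. It is an illustration only, not how the hadoop command itself is implemented. It assumes the setting lives under the fs.defaultFS property (Hadoop 2.x) or the older fs.default.name; the function name default_fs_from_xml and the sample file content, including the host and port, are made up for the example.

```python
import xml.etree.ElementTree as ET

# Property names that can hold the default filesystem URI:
# fs.defaultFS in Hadoop 2.x, fs.default.name in older releases.
DEFAULT_FS_KEYS = ("fs.defaultFS", "fs.default.name")


def default_fs_from_xml(xml_text):
    """Return the default filesystem URI found in core-site.xml text, or None."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name and value:
            props[name.strip()] = value.strip()
    # Prefer the newer property name, then fall back to the older one.
    for key in DEFAULT_FS_KEYS:
        if key in props:
            return props[key]
    return None


# Hypothetical core-site.xml content; the host and port are placeholders.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://your-namenode:9000</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(default_fs_from_xml(SAMPLE_CORE_SITE))
```

If this returns None for your real core-site.xml, or the file cannot be found at all, that is consistent with the "no host" error: a URI like hdfs:///some-path has nothing to be resolved against.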
There are lots of conflicting reports on the internet about which environment variable to set, but in my environment, running the latest version of mrjob and Apache Hadoop 2.4.1, I had to set the HADOOP_PREFIX environment variable. You can set it with this command:
export HADOOP_PREFIX=/path/to/your/hadoop
Once it is set, you'll know it is set correctly if you type:
ls $HADOOP_PREFIX/etc/hadoop
and it shows the configuration XML files.
Now run your command again. It should work.
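The same check can be done from Python before launching the job. This is a rough sketch that assumes the Hadoop 2.x layout described above (configuration under etc/hadoop inside the directory that HADOOP_PREFIX points to); the helper name hadoop_conf_xml_files is made up for illustration and is not part of mrjob.

```python
import os


def hadoop_conf_xml_files(environ=None):
    """List the *.xml files in $HADOOP_PREFIX/etc/hadoop, or [] if unavailable."""
    environ = os.environ if environ is None else environ
    prefix = environ.get("HADOOP_PREFIX")
    if not prefix:
        # Variable is unset or empty, so there is nothing to inspect.
        return []
    conf_dir = os.path.join(prefix, "etc", "hadoop")
    if not os.path.isdir(conf_dir):
        return []
    return sorted(f for f in os.listdir(conf_dir) if f.endswith(".xml"))


if __name__ == "__main__":
    files = hadoop_conf_xml_files()
    if files:
        print("Found configuration files:", ", ".join(files))
    else:
        print("HADOOP_PREFIX is unset or has no XML files under etc/hadoop")
```

An empty result means the hadoop command launched by mrjob is unlikely to find the namenode settings, so hdfs:/// URIs without a host will probably keep failing.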