The Hortonworks Sandbox provides a convenient playground for Hadoop beginners to test their big data applications.
- Windows and Linux: Install VirtualBox
- Mac: Install VMware Fusion
- Install the Hortonworks Sandbox
- Follow the tutorial to set up the environment and password
- The Python shipped with the Hortonworks Sandbox is Python 2.6, which is quite old.
- Install Anaconda3 to upgrade to Python 3.6, using the default location ```/root/anaconda3```
The mapper code here is a modified version of the mapper in this blog.
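The blog's exact code isn't reproduced here, but a streaming word-count mapper along these lines is a reasonable sketch (the file name `word_count_mapper.py` comes from the steps below; the token-splitting details are an assumption):

```python
#!/usr/bin/env python
# Sketch of a Hadoop streaming word-count mapper (assumed details; save as word_count_mapper.py).
import sys

def map_line(line):
    """Emit a (word, 1) pair for each whitespace-separated token in the line."""
    return [(word, 1) for word in line.strip().split()]

if __name__ == "__main__":
    # Hadoop streaming feeds input splits to the mapper on stdin,
    # and expects tab-separated key/value pairs on stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```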
The reducer code here is a modified version of the reducer in this blog.
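Likewise, a matching streaming reducer can be sketched as follows (assumed details; the file name `word_count_reducer.py` comes from the steps below). It relies on the shuffle/sort phase delivering keys in sorted order, so identical words arrive consecutively:

```python
#!/usr/bin/env python
# Sketch of a Hadoop streaming word-count reducer (assumed details; save as word_count_reducer.py).
import sys

def reduce_sorted(pairs):
    """Sum counts for runs of identical keys. Input must be sorted by key."""
    totals = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                totals.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        totals.append((current_word, current_count))
    return totals

if __name__ == "__main__":
    # Each stdin line is "word\tcount", as emitted by the mapper.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, total in reduce_sorted((w, int(c)) for w, c in pairs):
        print("%s\t%d" % (word, total))
```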
- Download the Shakespeare text file (```t8.shakespeare.txt```)
- Copy the txt into the virtual machine. On Mac and Linux you can use scp: ```scp -P 2222 t8.shakespeare.txt root@localhost:/root/```
- On Windows, you can use [WinSCP](https://winscp.net/eng/download.php) to put the txt into the virtual machine
- Make the mapper and reducer executable: ```chmod +x word_count_mapper.py word_count_reducer.py```
- Test your mapper and reducer locally: ```cat t8.shakespeare.txt | ./word_count_mapper.py | sort | ./word_count_reducer.py```. This step is important because Hadoop does not surface the exact error output from Python, so Python errors are hard to debug inside Hadoop.
- Upload the txt into HDFS under ```/demo/data``` using the [Ambari file view](http://localhost:8080/#/main/views/FILES/1.0.0/AUTO_FILES_INSTANCE)
- Test the mapper and reducer using Hadoop streaming: ```hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -file /root/word_count_mapper.py -mapper word_count_mapper.py -file /root/word_count_reducer.py -reducer word_count_reducer.py -input /demo/data/t8.shakespeare.txt -output /demo/output```
- Clean up the output folder after the experiment: ```hdfs dfs -rm -r /demo/output```. This step is important because Hadoop will not overwrite an existing output folder.