MapReduce: Run Word Count with Python and Hadoop

Install Hortonworks Sandbox

The Hortonworks Sandbox provides a nice playground for Hadoop beginners to test their big data applications. It ships as a ready-made virtual machine image, so there is nothing to configure by hand.
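
Once the sandbox VM is up, you can open a shell inside it over SSH. This sketch assumes the sandbox's default port mapping, which forwards the guest's SSH port to 2222 on the host (the same port the scp command later in this post uses):

```bash
ssh -p 2222 root@localhost
```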

Inside the sandbox, install Anaconda 3 so the mapper and reducer below can run on Python 3 (their shebang lines point at /root/anaconda3):

```bash
bash Anaconda3-XXX-Linux-x86_64.sh
```
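
If the installer finishes, a quick sanity check that the interpreter the scripts expect is actually there (this assumes the default install prefix of /root/anaconda3 when running as root):

```bash
/root/anaconda3/bin/python --version
```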

Mapper

The code here is a modified version of the mapper in this blog.

```python
#!/root/anaconda3/bin/python
# Filename: word_count_mapper.py

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for word_count_reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('{}\t{}'.format(word, 1))
```
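
As a quick sanity check (assuming the script is executable and in the current directory), the mapper just tokenizes its input and emits a count of 1 per word:

```bash
echo "to be or not to be" | ./word_count_mapper.py
# to    1
# be    1
# or    1
# not   1
# to    1
# be    1
```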

Reducer

The code here is a modified version of the reducer in this blog.

```python
#!/root/anaconda3/bin/python
# Filename: word_count_reducer.py

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from word_count_mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('{}\t{}'.format(current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('{}\t{}'.format(current_word, current_count))
```
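
For a quick end-to-end check, a sort between the two scripts stands in for Hadoop's shuffle/sort phase, which is exactly what the IF-switch above relies on. A minimal sketch, assuming both scripts are executable and in the current directory:

```bash
echo "to be or not to be" | ./word_count_mapper.py | sort | ./word_count_reducer.py
# be    2
# not   1
# or    1
# to    2
```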

Test Mapper and Reducer

  • Download the complete works of Shakespeare as t8.shakespeare.txt.
  • Copy the txt into the virtual machine. On Mac and Linux you can use scp:

    ```bash
    scp -P 2222 t8.shakespeare.txt root@localhost:/root/
    ```

  • On Windows, you can use [WinSCP](https://winscp.net/eng/download.php) to put the txt into the virtual machine.
  • Make the mapper and reducer executable:

    ```bash
    chmod +x word_count_mapper.py word_count_reducer.py
    ```

  • Test your mapper and reducer locally. This step is important because Hadoop does not surface the exact error output from Python, so Python bugs are hard to debug once the job runs inside Hadoop:

    ```bash
    cat t8.shakespeare.txt | ./word_count_mapper.py | sort | ./word_count_reducer.py
    ```

  • Upload the txt into HDFS under /demo/data, using the [Ambari file view](http://localhost:8080/#/main/views/FILES/1.0.0/AUTO_FILES_INSTANCE).
  • Test the mapper and reducer using Hadoop:

    ```bash
    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
        -file /root/word_count_mapper.py -mapper word_count_mapper.py \
        -file /root/word_count_reducer.py -reducer word_count_reducer.py \
        -input /demo/data/t8.shakespeare.txt \
        -output /demo/output
    ```

  • Clean up the output folder after the experiment. This step is important because Hadoop will not overwrite an existing output folder:

    ```bash
    hdfs dfs -rm -r /demo/output
    ```
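
Before removing /demo/output, you can read the result directly from HDFS. A minimal sketch, assuming the job ran with a single reducer so the output lands in one part-00000 file:

```bash
hdfs dfs -cat /demo/output/part-00000 | head
```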

Reference

Michael G. Noll, "Writing an Hadoop MapReduce Program in Python", http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/