Tom decided to use a random forest model, a well-known model that is easy to use, to predict poker hands. He wanted to introduce it, but first he would like to introduce the decision tree model.
The Hortonworks Sandbox provides a nice playground for Hadoop beginners to test their big data applications.
The code here is a modified version of the mapper in this blog
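The blog's exact code is not reproduced here; as a reference, a minimal word-count streaming mapper in the same spirit can be sketched as below (a generic version, with the filename taken from the ```chmod``` step later). It reads raw text from stdin and emits one tab-separated ```word 1``` pair per word:

```python
#!/usr/bin/env python
"""word_count_mapper.py - Hadoop streaming mapper for word count."""
import sys

def map_line(line):
    # Lowercase, split on whitespace, and pair each word with a count of 1.
    return [(word, 1) for word in line.lower().split()]

if __name__ == "__main__":
    # Hadoop streaming feeds each input split to the mapper via stdin.
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```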
The code here is a modified version of the reducer in this blog
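Likewise, a matching streaming reducer can be sketched as below (again a generic version, not the blog's exact code). Hadoop streaming sorts the mapper output by key before the reduce phase, so equal words arrive on adjacent lines and can be summed in a single pass:

```python
#!/usr/bin/env python
"""word_count_reducer.py - Hadoop streaming reducer for sorted word\t1 pairs."""
import sys

def reduce_pairs(lines):
    # Input lines arrive sorted by key, so equal words are adjacent.
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield current_word, current_count

if __name__ == "__main__":
    for word, total in reduce_pairs(sys.stdin):
        print("%s\t%d" % (word, total))
```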
Scp the txt file into the virtual machine; you can use ```scp``` on Mac and Linux
- On Windows, you can use [WinSCP](https://winscp.net/eng/download.php) to put the txt file into the virtual machine
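For example, on Mac or Linux the copy looks roughly like this. It assumes the sandbox's usual setup of forwarding SSH on local port 2222 and logging in as root; adjust the host, port, and paths to match your environment:

```shell
# Copy the txt file from your machine into the sandbox VM over SSH.
# Port 2222 and the root login are assumptions based on the sandbox's
# default port forwarding; change them if your setup differs.
scp -P 2222 t8.shakespeare.txt root@localhost:/root/
```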
- Make the mapper and reducer executable: ```chmod +x word_count_mapper.py word_count_reducer.py```
Test your mapper and reducer locally
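You can emulate the map, shuffle, and reduce phases with a plain shell pipe, using ```sort``` to stand in for Hadoop's shuffle. Any small piece of text works as input; the filenames follow the ```chmod``` step above:

```shell
# mapper -> sort (stands in for Hadoop's shuffle/sort) -> reducer
echo "to be or not to be" \
  | ./word_count_mapper.py \
  | sort -k1,1 \
  | ./word_count_reducer.py
```

If the counts look right here, the same scripts should behave the same way inside Hadoop streaming.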
- Upload the txt file into HDFS under ```/demo/data```, using the [Ambari file view](http://localhost:8080/#/main/views/FILES/1.0.0/AUTO_FILES_INSTANCE)
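Alternatively, the same upload can be done from an SSH session inside the VM with the HDFS shell (paths as above):

```shell
# Create the target directory in HDFS (-p creates parents as needed),
# then upload the file into it.
hdfs dfs -mkdir -p /demo/data
hdfs dfs -put t8.shakespeare.txt /demo/data/
```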
- Test the mapper and reducer using Hadoop
```
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -file /root/mapper.py -mapper mapper.py \
  -file /root/reducer.py -reducer reducer.py \
  -input /demo/data/t8.shakespeare.txt \
  -output /demo/output
```
Clean up the output folder after the experiment. This step is important because Hadoop will not overwrite an existing output folder:
```
hdfs dfs -rm -r /demo/output
```
Our hero Tom, who used to be a data scientist, went on an adventure: predicting poker hands. He wanted to share his experience here, and he hopes it will help you learn how to explore the big data world.
In this beginner’s summary, Tom will first talk about some basic concepts of big data, including training sets and test sets. Tom will then talk about the random forest model and how to evaluate models in general. In the end, Tom will show why features are important for generating accurate results.
I’m a data scientist who is eager to dig truth out of data.
I’m a data scientist, but now I’m focusing on data visualization. I love the beauty of big data as well as good design. I love the start-up atmosphere and am eager to work as part of a team. I love entrepreneurship and leadership.