Hadoop (4): A Small Project Combining Python Code with Hadoop

Technology, 2024-12-17

The contents of mapper.py and reducer.py are adapted from the following blog post: https://blog.csdn.net/marywang56/article/details/80395519

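Hadoop itself runs on Java, but hadoop-streaming, a small Java program, lets us plug Python code into a Hadoop job: the scripts read their input from stdin and write their results to stdout.

(1) Start YARN and use the jps command to confirm that it is running.

(2) Look at mapper.py and reducer.py.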

mapper.py:

    import sys

    # Input comes from STDIN (standard input).
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            # Emit "word<TAB>1" for every word on the line.
            print('%s\t%s' % (word, 1))

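reducer.py:

    import sys

    current_word = None
    current_count = 0

    # Input arrives sorted by word, so equal words sit on consecutive lines.
    for line in sys.stdin:
        line = line.strip()
        try:
            # Parse the "word<TAB>count" pair written by mapper.py.
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:
            # Skip lines that are not a word followed by a numeric count.
            continue

        if current_word == word:
            current_count += count
        else:
            if current_word:
                # The word changed, so emit the finished total.
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # Emit the total for the last word, if there was any input.
    if current_word:
        print('%s\t%s' % (current_word, current_count))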
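The logic of the two scripts can also be checked in a single process before any Hadoop command is involved. The following is a minimal sketch, not the author's code: the function names and the sample input are made up for illustration, an in-memory `sorted` call stands in for the shuffle, and `itertools.groupby` replaces the reducer's manual `current_word` bookkeeping.

```python
from itertools import groupby


def map_words(lines):
    """Mimic mapper.py: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Mimic reducer.py: sum the counts of consecutive pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort (standing in for the shuffle), and reduce in one process."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    # Made-up sample input, not the hadoop.txt from the post.
    sample = ["hello hadoop", "hello python hadoop", "hello"]
    for word, total in word_count(sample):
        print("%s\t%s" % (word, total))
```

Grouping only works because the pairs are sorted first, which is the same reason reducer.py can get away with comparing each word to the previous one.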

(3) Test the scripts locally.

<1> First, look at hadoop.txt.

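<2> Running the mapper on it shows that each line is split into individual words.

<3> The sort command then puts the words in order, which corresponds to the shuffle phase in Hadoop.

<4> With the reducer added, the pipeline simulates the final output: identical words are merged into a single line.

(4) Run it on Hadoop. This requires a launch script, as shown in the figure: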
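The launch script itself appeared only as a screenshot, so its exact contents are not reproduced here. As a rough sketch of what a Hadoop Streaming word-count launch typically looks like, the helper below assembles the command line. The function name, the jar location, and the HDFS paths are all placeholders of mine. The options used (`-files`, `-input`, `-output`, `-mapper`, `-reducer`) are standard Hadoop Streaming options, but the original script may have used different ones.

```python
import shlex


def build_streaming_command(streaming_jar, input_path, output_path):
    """Return the argument list for a Hadoop Streaming word-count job."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic option, placed before the streaming options:
        # ship both scripts to the nodes that run the tasks.
        "-files", "mapper.py,reducer.py",
        "-input", input_path,
        "-output", output_path,  # this directory must not exist yet in HDFS
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
    ]


if __name__ == "__main__":
    # All three values are placeholders, not paths taken from the post.
    command = build_streaming_command(
        "/path/to/hadoop-streaming.jar", "/input/hadoop.txt", "/output/wordcount"
    )
    print(" ".join(shlex.quote(arg) for arg in command))
    # To actually submit the job: subprocess.run(command, check=True)
```

Building the command as a list keeps the quoting of `python mapper.py` unambiguous; printing it first makes it easy to compare against a hand-written shell script.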

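(5) Run the script and wait for the job to finish.

(6) View the output.

(7) View the result graphically. As the figure shows, this example ran successfully.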
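Since reducer.py writes one `word<TAB>count` line per word, the job output is easy to post-process. The sketch below is an illustration rather than part of the original project: it assumes the output lines have already been read into Python (for example from text saved with `hdfs dfs -cat` on the job's part files), and the function names are hypothetical.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a dict, skipping malformed lines."""
    counts = {}
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep or not count.strip().isdigit():
            continue  # no tab, or the count is not a plain number
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, with ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    # Made-up lines in the format reducer.py emits.
    sample = ["hadoop\t2\n", "hello\t3\n", "python\t1\n"]
    for word, total in top_words(parse_counts(sample)):
        print("%s\t%s" % (word, total))
```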
