Running a Python MapReduce Program on Hadoop: Sorting

wenying_44323744

Published 2025-04-16 10:15:30

License: CC 4.0 BY-SA

Category: Big Data. Tags: hadoop, big data, distributed

Article link: https://blog.csdn.net/weixin_44323744/article/details/147268006

Implementing a Program to Sort Input Files

1. Objective

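Sort the numbers contained in three input files (file1.txt, file2.txt and file3.txt) and output them in ascending order, each prefixed with its position in the sorted sequence.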

2. Writing the mapper and reducer

mapper.py

The file contents are as follows:

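#!/usr/bin/env python3
# encoding=utf-8
# mapper.py: read lines from stdin, keep the ones that parse as integers,
# and print them in ascending order, one per line.

import sys

lines = []

for line in sys.stdin:
    line = line.strip()
    try:
        line = int(line)
    except ValueError:
        # Skip blank or non-numeric lines.
        continue
    lines.append(line)

# Numeric sort of everything this mapper instance has read.
lines = sorted(lines)

for line in lines:
    print(line)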

reducer.py

The reducer.py file is as follows:

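#!/usr/bin/env python3
# encoding=utf-8
# reducer.py: number the incoming lines in the order they arrive.

import sys

i = 1  # 1-based position of the current line
for line in sys.stdin:
    line = line.strip()
    print("%d %s" % (i, line))
    i = i + 1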
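To check the logic of both scripts without a shell, the same steps can be written as plain functions. This is only a sketch: the names map_lines and reduce_lines are hypothetical, and the sample input is made up for illustration.

```python
def map_lines(lines):
    """Mirror mapper.py: keep lines that parse as integers, sorted ascending."""
    numbers = []
    for line in lines:
        try:
            numbers.append(int(line.strip()))
        except ValueError:
            continue  # skip blank or non-numeric lines
    return [str(n) for n in sorted(numbers)]


def reduce_lines(lines):
    """Mirror reducer.py: prefix each line with its 1-based position."""
    return ["%d %s" % (i, line.strip()) for i, line in enumerate(lines, start=1)]


if __name__ == "__main__":
    sample = ["33\n", "5\n", "\n", "abc\n", "12\n"]  # made-up input
    for row in reduce_lines(map_lines(sample)):
        print(row)
```

Chaining the two functions corresponds to piping mapper.py into reducer.py on a single machine.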

Grant permissions on mapper.py and reducer.py. Because the commands use sudo, you will be asked for your password once.

sudo chmod 777 mapper.py
sudo chmod 777 reducer.py

3. Creating the three input files

This post stores file1.txt, file2.txt and file3.txt in the /usr/local/hadoop directory.

file1.txt

file2.txt

file3.txt
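Since mapper.py parses each line with int(), the input files need one integer per line. A small sketch for generating such files follows; the numbers in SAMPLE_DATA and the helper name write_sample_files are assumptions made for illustration, not the values used in the original experiment.

```python
from pathlib import Path

# Made-up sample numbers, one list per input file (illustrative only).
SAMPLE_DATA = {
    "file1.txt": [33, 37, 12, 40],
    "file2.txt": [4, 16, 39, 5],
    "file3.txt": [1, 45, 25],
}


def write_sample_files(target_dir="."):
    """Write one integer per line into each sample file and return the paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, numbers in SAMPLE_DATA.items():
        path = target / name
        text = "\n".join(str(n) for n in numbers) + "\n"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    for p in write_sample_files():
        print("wrote", p)
```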

Upload file1.txt, file2.txt and file3.txt to HDFS. The input directory must already exist on HDFS, for example created with ./bin/hdfs dfs -mkdir -p input.

./bin/hdfs dfs -put file1.txt input

./bin/hdfs dfs -put file2.txt input

./bin/hdfs dfs -put file3.txt input

4. Running the program

Test mapper.py and reducer.py locally:

cat file1.txt file2.txt file3.txt | python3 mapper.py | python3 reducer.py

Run the job on the files in HDFS; the final result is written to the output directory in HDFS.

./bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-3.3.5.jar -mapper mapper.py -reducer reducer.py -input input/* -output output
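One difference from the local test is worth keeping in mind: with three input files the job will typically run more than one map task, so each mapper.py instance only sorts the numbers it reads itself. The sketch below, using made-up per-mapper outputs and the hypothetical helpers concatenate and merge_sorted, shows that individually sorted parts are not globally sorted until they are merged, which is the job of the framework's sort between the map and reduce stages.

```python
import heapq

# Made-up outputs of three map tasks, each already sorted on its own.
MAP_OUTPUTS = [
    [12, 33, 37, 40],
    [4, 5, 16, 39],
    [1, 25, 45],
]


def concatenate(parts):
    """Join the parts one after another without any further sorting."""
    return [n for part in parts for n in part]


def merge_sorted(parts):
    """Merge already-sorted parts into one globally sorted list."""
    return list(heapq.merge(*parts))


if __name__ == "__main__":
    print(concatenate(MAP_OUTPUTS))   # sorted within each part only
    print(merge_sorted(MAP_OUTPUTS))  # sorted across all parts
```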

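However, the result of the run on HDFS was wrong, a flop 😭: the numbers came out ordered by their first digit and then by their second digit, so 5 ended up after 12, unlike in the local test. That is the order you get when values are compared as text rather than as integers. Hadoop Streaming sorts the map output keys as text by default before handing them to the reducer, a step the local pipe does not perform.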
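The ordering problem can be reproduced in plain Python, along with one possible workaround: emit zero-padded keys so that text order and numeric order agree. This is a sketch, not the fix used in the original post. It assumes non-negative integers of at most WIDTH digits; the names WIDTH, to_key and from_key are hypothetical. In a real job the mapper would print to_key(n) and the reducer would convert back with from_key. The Hadoop Streaming documentation also describes a KeyFieldBasedComparator with a numeric option, which may be the cleaner route.

```python
WIDTH = 10  # assumed maximum number of digits; adjust to the data


def to_key(number):
    """Zero-pad a non-negative integer so text order equals numeric order."""
    if number < 0:
        raise ValueError("this padding scheme only handles non-negative integers")
    return str(number).zfill(WIDTH)


def from_key(key):
    """Turn a zero-padded key back into an integer."""
    return int(key)


if __name__ == "__main__":
    values = [5, 12, 33, 4]  # made-up numbers

    # Text comparison: '12' sorts before '5' because '1' < '5'.
    print(sorted(str(v) for v in values))   # ['12', '33', '4', '5']

    # Zero-padded keys sort as text in the same order as the integers.
    padded = sorted(to_key(v) for v in values)
    print([from_key(k) for k in padded])    # [4, 5, 12, 33]
```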

Reference: Experiment 5: Introductory MapReduce Programming Practice (Python Implementation), CSDN blog, https://blog.csdn.net/weixin_46584887/article/details/121317376