
GitHub - Prince6635 - Movie-Ratings-By-Mapreduce-And-Hadoop - Big Data (Movie Ratings) Based On Hadoop and MapReduce

This GitHub repository contains examples of using MapReduce and Hadoop to analyze big data like movie ratings. It includes Python scripts that use MapReduce to find patterns in movie ratings like the most popular movies or how many movies each user rated. It also explains how Hadoop and its components like HDFS and YARN enable distributed processing of large datasets.

3/29/24, 4:36 PM GitHub - prince6635/movie-ratings-by-mapreduce-and-hadoop: Big data (movie ratings) based on Hadoop and MapReduce

prince6635 / movie-ratings-by-mapreduce-and-hadoop Public

Big data (movie ratings) based on Hadoop and MapReduce


Latest commit: prince6635, Run MR job on AWS EMR, 8 years ago

File                         Last commit message              Updated
assets                       Run MR job on AWS EMR            8 years ago
.gitignore                   Initial commit                   8 years ago
README.md                    Run MR job on AWS EMR            8 years ago
friends_by_age.py            MapReduce example - average n…   8 years ago
min_temperatures_by_loc…     MapReduce example - min temp…    8 years ago
most_popular_movie.py        MapReduce example - movie rati…  8 years ago
most_popular_movie_with…     MapReduce example - movie rati…  8 years ago
most_popular_superhero.py    MapReduce example - Most pop…    8 years ago
movie_recommendation_…       Run MR job on AWS EMR            8 years ago
process_marvel_data.py       MapReduce example - find super…  8 years ago
superhero_relatons_by_BF…    MapReduce example - find super…  8 years ago
total_amount_spent_by_c…     MapReduce example - total amo…   8 years ago
total_amount_spent_by_c…     MapReduce example - total amo…   8 years ago
word_frequency.py            MapReduce example - word freq…   8 years ago
word_frequency_better.py     MapReduce example - word freq…   8 years ago
word_frequency_sorted_b…     MapReduce example - word freq…   8 years ago
word_frequency_with_co…      MapReduce example - movie rati…  8 years ago


Big data (movie ratings) based on Hadoop and MapReduce

MapReduce

Example:

How many movies has each user watched? => key: user_id, value: movie_id. Duplicate keys
are fine at this stage, since the reducer will combine them later.
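
The example above can be sketched in plain Python, without Hadoop or mrjob, to make the map, group and reduce steps visible. This is only an illustration, not the repository's code: the sample rows are made up, whitespace-separated fields are an assumption, and the names mapper, shuffle, reducer and movies_per_user are hypothetical.

```python
from collections import defaultdict

# Made-up rows in the assumed layout: user_id movie_id rating timestamp
SAMPLE_LINES = [
    "196 242 3 881250949",
    "186 302 3 891717742",
    "196 377 1 878887116",
    "244 51 2 880606923",
    "186 346 1 886397596",
    "196 51 5 881251000",
]


def mapper(line):
    """Emit one (user_id, movie_id) pair per rating row; duplicate keys are fine."""
    user_id, movie_id, _rating, _timestamp = line.split()
    yield user_id, movie_id


def shuffle(pairs):
    """Group values by key, standing in for what the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(user_id, movie_ids):
    """Collapse all movie_ids seen for one user into a single count."""
    yield user_id, len(movie_ids)


def movies_per_user(lines):
    """Run map, shuffle and reduce in-process and return {user_id: count}."""
    pairs = (pair for line in lines for pair in mapper(line))
    results = {}
    for user_id, movie_ids in shuffle(pairs).items():
        for key, count in reducer(user_id, movie_ids):
            results[key] = count
    return results


if __name__ == "__main__":
    print(movies_per_user(SAMPLE_LINES))
```

The mapper never has to deduplicate user_ids; grouping by key is what hands each reducer call the full list of values for one user.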


Map:


Reduce:

All:

Code snippet: # of movies for each rating?


Fields: user_id movie_id rating timestamp
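
A minimal sketch of counting movies per rating over rows with the fields listed above. It is written in plain Python rather than as a Hadoop job, so it is not the repository's snippet; the sample rows are made up, whitespace-separated fields are assumed, and ratings_histogram is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows in the layout: user_id movie_id rating timestamp
SAMPLE_LINES = [
    "196 242 3 881250949",
    "186 302 3 891717742",
    "22 377 1 878887116",
    "244 51 2 880606923",
    "166 346 1 886397596",
    "298 474 5 884182806",
]


def mapper(line):
    """Emit (rating, 1) for every rating row."""
    _user_id, _movie_id, rating, _timestamp = line.split()
    yield rating, 1


def reducer(rating, counts):
    """Sum the 1s emitted for a single rating value."""
    yield rating, sum(counts)


def ratings_histogram(lines):
    """Sort mapper output by key, group it, and reduce each group."""
    pairs = sorted(pair for line in lines for pair in mapper(line))
    histogram = {}
    for rating, group in groupby(pairs, key=itemgetter(0)):
        for key, total in reducer(rating, (count for _, count in group)):
            histogram[key] = total
    return histogram


if __name__ == "__main__":
    print(ratings_histogram(SAMPLE_LINES))
```

Sorting before grouping mirrors the sort-and-group phase that sits between mappers and reducers; only the rating field is used, the other three are ignored.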

Combiner: once a mapper has produced its key-value pairs, do some of the reduction work on the
mapper side, such as aggregating the data before sending it to the reducer, to save network bandwidth.

ex: ./word_frequency_with_combiner.py
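
To show what a combiner saves, here is a hedged sketch of a word count over two input splits, run once with and once without local pre-aggregation. It is plain Python, not the contents of word_frequency_with_combiner.py; the input text and the names combiner, reducer and run are assumptions made for illustration.

```python
from collections import Counter

# Made-up input, split as if two mapper tasks each received one chunk.
SPLITS = [
    ["the cat and the hat", "the end"],
    ["a cat is a cat"],
]


def mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate one mapper task's output before it is sent to the reducer."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reducer(all_pairs):
    """Sum the counts received from every mapper task."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


def run(splits, use_combiner):
    """Return (word totals, number of pairs sent from mappers to the reducer)."""
    sent = []
    for split in splits:
        pairs = [pair for line in split for pair in mapper(line)]
        if use_combiner:
            pairs = combiner(pairs)
        sent.extend(pairs)
    return reducer(sent), len(sent)


if __name__ == "__main__":
    print(run(SPLITS, use_combiner=False))
    print(run(SPLITS, use_combiner=True))
```

Both runs produce the same totals, but the combined run sends fewer pairs. This only works because addition can be applied in stages; a reduction that cannot be applied partially would not be safe to use as a combiner.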

Attach config/data file with each MapReduce job across distributed nodes:
./most_popular_movie_with_name_lookup.py
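
The idea of shipping a lookup file alongside the job can be sketched as follows: each node loads the file into a dictionary once, then uses it to turn a movie_id into a readable name. This is an assumption-heavy illustration, not the repository's script: the pipe-delimited "movie_id|title|..." layout, the made-up titles, and the names load_movie_names and most_popular_movie are all hypothetical, and the mechanism that actually distributes the file is not shown.

```python
from collections import Counter

# Made-up lookup rows in an assumed pipe-delimited layout: movie_id|title|release_date
ITEM_LINES = [
    "1|Example Movie A (1990)|01-Jan-1990",
    "2|Example Movie B (1995)|01-Jan-1995",
]

# Made-up ratings rows: user_id movie_id rating timestamp
RATING_LINES = [
    "1 2 5 100",
    "2 2 4 101",
    "3 1 3 102",
    "4 2 4 103",
]


def load_movie_names(item_lines):
    """Build {movie_id: title} from the side file; done once per task, not per record."""
    names = {}
    for line in item_lines:
        fields = line.rstrip("\n").split("|")
        names[fields[0]] = fields[1]
    return names


def most_popular_movie(rating_lines, movie_names):
    """Count ratings per movie_id and return (title, number of ratings) for the top one."""
    counts = Counter(line.split()[1] for line in rating_lines)
    movie_id, num_ratings = counts.most_common(1)[0]
    # Fall back to the raw id if the lookup file has no entry for it.
    return movie_names.get(movie_id, movie_id), num_ratings


if __name__ == "__main__":
    names = load_movie_names(ITEM_LINES)
    print(most_popular_movie(RATING_LINES, names))
```

Keeping the lookup out of the key-value stream means the ratings data never has to be joined against the titles during the shuffle; the small file is simply available wherever the code runs.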


How MapReduce scales / distributed computing:
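
One ingredient of that scaling is that every pair with the same key must end up at the same reducer, however many mappers produced it. A rough sketch of such routing follows; the crc32 hash is a stand-in chosen here for determinism and is not claimed to be what Hadoop uses, and partition and route are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Distribute (key, value) pairs into per-reducer buckets."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    # Made-up (user_id, movie_id) pairs, as if emitted by several mappers.
    pairs = [("196", "242"), ("186", "302"), ("196", "377"), ("244", "51")]
    for reducer_index, bucket in sorted(route(pairs, num_reducers=3).items()):
        print(reducer_index, bucket)
```

Because routing depends only on the key, mappers can run on different machines without coordinating with each other, and adding reducers just changes the modulus.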

Hadoop (Run MapReduce job in a distributed way)


HDFS (Hadoop Distributed File System): used by Hadoop to distribute the data it accesses
across the cluster. YARN manages how Hadoop jobs are distributed across the cluster.


Apache YARN (Hadoop uses it to figure out which mapper/reducer to run where, how to connect
them all together, how to keep track of what's running, etc.)


AWS Elastic MapReduce

Tools
Python tool for big data: Enthought Canopy
mrjob package for MapReduce: Editor -> !pip install mrjob

Sample data: http://grouplens.org/ (datasets -> MovieLens 100K Dataset, ml-100k.zip)


Languages

Python 94.0% Perl 3.2% Shell 2.8%
