GitHub - Prince6635 - Movie-Ratings-By-Mapreduce-And-Hadoop - Big Data (Movie Ratings) Based On Hadoop and MapReduce
https://github.com/prince6635/movie-ratings-by-mapreduce-and-hadoop 1/11
3/29/24, 4:36 PM GitHub - prince6635/movie-ratings-by-mapreduce-and-hadoop: Big data (movie ratings) based on Hadoop and MapReduce
MapReduce
Example:
How many movies has each user watched? => the mapper emits key: user_id and value: movie_id.
Duplicate keys are fine at this stage, since the reducer groups and handles them later.
Map:
Reduce:
All:
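The map, shuffle, and reduce steps for the movies-per-user example can be sketched in plain Python, with no Hadoop or mrjob needed to run it. This is only an illustration: the tab-separated record layout (user_id, movie_id, rating, timestamp), the sample records, and the helper names `shuffle` and `run` are assumptions, not taken from the repository.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit one (user_id, movie_id) pair per rating record."""
    # Assumed layout: user_id <TAB> movie_id <TAB> rating <TAB> timestamp
    user_id, movie_id, _rating, _timestamp = line.split("\t")
    yield user_id, movie_id

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(user_id, movie_ids):
    """Reduce: count the movie_ids collected for one user."""
    yield user_id, len(list(movie_ids))

def run(lines):
    """Run map -> shuffle -> reduce in a single process."""
    pairs = [pair for line in lines for pair in mapper(line)]
    results = {}
    for key, values in sorted(shuffle(pairs).items()):
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

# Made-up sample records, for illustration only.
SAMPLE = [
    "u1\tm10\t4\t0",
    "u2\tm10\t5\t0",
    "u1\tm20\t3\t0",
]

print(run(SAMPLE))  # {'u1': 2, 'u2': 1}
```

The mapper emits the key `u1` twice; the shuffle step is what gathers both values under one key before the reducer sees them.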
Combiner: once a mapper has produced its key-value pairs, a combiner does part of the reduction
work on the mapper node, such as aggregating values per key, before the data is sent to the
reducers. This saves network bandwidth.
ex: ./word_frequency_with_combiner.py
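To make the bandwidth saving concrete, here is a small plain-Python sketch of a word count with and without a combiner. It is not the contents of word_frequency_with_combiner.py; the two input splits and the function names are hypothetical, and each split stands in for the output of one mapper node.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def combine(pairs):
    """Combiner: sum counts per word within a single mapper's output."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def reduce_counts(pairs):
    """Reducer: sum counts per word across all mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def run(splits, use_combiner):
    """Return (word totals, number of pairs sent to the reducer)."""
    shuffled = []
    for split in splits:
        pairs = [pair for line in split for pair in mapper(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_counts(shuffled), len(shuffled)

# Two hypothetical input splits, one per mapper.
SPLITS = [["the cat the hat"], ["the dog"]]

print(run(SPLITS, use_combiner=False))
print(run(SPLITS, use_combiner=True))
```

Both runs give the same totals, but with the combiner the first mapper sends one pair for "the" instead of two, so fewer pairs cross the network. This works because addition is associative and commutative; a combiner is only safe for reductions with that property.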
Attach a config/data file to a MapReduce job so that it is available on every distributed node:
./most_popular_movie_with_name_lookup.py
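The idea behind a name lookup can be sketched in plain Python: each node loads the small attached file into a dictionary, then uses it to turn the winning movie_id into a title. This is not the repository's script. The pipe-delimited `movie_id|title` layout, the file name `movie_names.txt`, the sample data, and the definition of "most popular" as "most ratings" are all assumptions made for illustration. With mrjob, the file would typically be shipped to the nodes by declaring it with `add_file_arg` in the job's `configure_args`.

```python
import os
import tempfile
from collections import Counter

def load_movie_names(path):
    """Read the attached lookup file into a {movie_id: title} dict."""
    names = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) >= 2:  # skip blank or malformed lines
                names[fields[0]] = fields[1]
    return names

def most_popular_movie(rating_lines, names):
    """Return (title, rating count) for the most-rated movie."""
    # Assumed layout: user_id <TAB> movie_id <TAB> rating
    counts = Counter(line.split("\t")[1] for line in rating_lines)
    movie_id, count = counts.most_common(1)[0]
    # Fall back to the raw id if the lookup file has no entry.
    return names.get(movie_id, movie_id), count

# Write a tiny hypothetical lookup file, then load it as a node would.
with tempfile.TemporaryDirectory() as tmp:
    lookup_path = os.path.join(tmp, "movie_names.txt")
    with open(lookup_path, "w", encoding="utf-8") as f:
        f.write("m1|First Movie\nm2|Second Movie\n")
    NAMES = load_movie_names(lookup_path)

# Made-up ratings, for illustration only.
RATINGS = ["u1\tm1\t5", "u2\tm2\t4", "u3\tm2\t3"]

print(most_popular_movie(RATINGS, NAMES))  # ('Second Movie', 2)
```

Loading the file once per task, rather than once per record, keeps the lookup cheap; this only works well when the attached file is small enough to fit in memory on each node.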
HDFS (Hadoop Distributed File System): used by Hadoop to distribute the data that Hadoop jobs
access across the cluster. YARN manages how Hadoop jobs are distributed across the cluster.
Apache YARN: Hadoop uses it to decide which mapper/reducer runs where, how to connect them all
together, and to keep track of what is running.
Tools
Python tool for big data: Enthought Canopy
mrjob package for MapReduce: Editor -> !pip install mrjob