Big Data Processing using an AWS Dataset: Analysis
of the Co-occurrence Problem with MapReduce
Vishva Abeyrathne
School of Science (Student)
RMIT University
Melbourne, Australia
s3735195@student.rmit.edu.au
Abstract— This paper discusses the problem of scaling
algorithms to big data, which much research has sought to
address. Cluster computing has been identified as a strong
solution for big data processing; nevertheless, it has drawbacks
of its own, and MapReduce was introduced as a programming
model to tackle them. A co-occurrence matrix is used to identify
the words that co-occur with a given word and their frequencies.
The Pairs and Stripes approaches are used to comparatively
analyze the performance of the program while varying the size
of the dataset and the number of nodes assigned to the cluster.
Further optimizations are suggested to improve the experiment
on this dataset.
Keywords—MapReduce, Co-occurrence Matrix, Pairs,
Stripes, Combiner, Mapper, Common Crawl
I. INTRODUCTION
Data-driven approaches have contributed immensely to the
field of natural language processing over the last few years.
Much ongoing research aims to optimize processing tasks
using comparatively larger datasets. Web-scale language
models stand out as one of the major scenarios in big data
processing, and applying those models to larger datasets has
become problematic. The major reason is the limited capability
of a single machine to handle such big data on its own. It has
been shown that distributing the computation across multiple
nodes works well. This paper focuses on implementing
scalable language processing algorithms using MapReduce
and cost-effective cluster computing with Hadoop [1].
The rest of the paper is organized as follows. The next
section describes MapReduce and its importance. Section 3
introduces the co-occurrence problem, whereas Section 4
covers the implementation. Section 5 describes the dataset
used, followed by the results in Section 6. Finally, Section 7
discusses the experiment.
II. MAPREDUCE
Distributed computing, in which computation is spread
across multiple processors, is regarded as the most reliable
and efficient way to process large datasets. Despite being a
good solution, parallel algorithms still raised issues, such as
the cost of large shared-memory machines. Much research
went into finding an alternative programming model for
parallel computation. Consequently, MapReduce was
introduced in 2004, capable of applying computation and
performing the necessary processing over massive volumes
of data coming from multiple sources.
Key-value pairs are the major data structure in
MapReduce, with the mapper and the reducer being the main
operations behind all processing tasks. The mapper is applied
to every input key-value pair and generates intermediate
key-value pairs as a result. The reducer emits output key-value
pairs, after values with the same key have been gathered
together prior to the calculation. Programmers only need to
implement the relevant mapper and reducer, while the runtime
executes the job across the cluster nodes using a distributed
file system. As further optimisations, combiners and
partitioners can be implemented as part of a MapReduce
program. Combiners aggregate values that share the same key
on the respective cluster node before they are passed on to the
reducer; the resulting key-value pairs are written to local disk.
Partitioners assign intermediate key-value pairs to the
available reducers so that values with the same key are
reduced together, regardless of which mapper produced
them [2].
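The mapper/combiner/partitioner/reducer pipeline described above can be sketched outside of Hadoop as a small simulation. This is an illustrative word-count example, not the paper's Java code; all names are assumptions:

```python
from collections import defaultdict

def mapper(doc_id, text):
    # Emit an intermediate (word, 1) pair for every token in the document.
    for word in text.split():
        yield (word, 1)

def combiner(pairs):
    # Local aggregation of one mapper's output before the shuffle,
    # reducing the number of intermediate pairs written to local disk.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def partition(key, num_reducers):
    # Hash partitioner: every value for a given key goes to one reducer,
    # regardless of which mapper produced it.
    return hash(key) % num_reducers

def reducer(key, values):
    # Values with the same key are gathered together before reduction.
    return (key, sum(values))

# Simulated job over two mapper inputs.
docs = [("d1", "big data big cluster"), ("d2", "big data")]
shuffled = defaultdict(list)
for doc_id, text in docs:
    for key, value in combiner(mapper(doc_id, text)):
        shuffled[key].append(value)
counts = dict(reducer(k, vs) for k, vs in shuffled.items())
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real Hadoop job, the framework performs the shuffle and partitioning itself; the programmer supplies only the mapper, reducer, and optional combiner.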
III. CO-OCCURRENCE PROBLEM
This case study is primarily concerned with measuring the
performance of the two approaches to the co-occurrence
problem. The co-occurrence problem consists of forming an
N × N term co-occurrence matrix over all the words within a
given context. A co-occurring word, or neighbour of a word,
is defined using a sliding window of a specific size, or a
sentence. This problem has been used to calculate semantic
distances, which are useful for many tasks in language
processing. Pairs and Stripes are the two major approaches to
building a co-occurrence matrix. In the pairs approach, the
key is the pair of co-occurring words and the value is the
count for that pair. The stripes approach differs in that it uses
an associative array to hold intermediate results: the key is a
specific word, and the value is an associative array mapping
each of its co-occurring words to the number of
occurrences [2].
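For concreteness, the sliding-window definition of a neighbour can be sketched as follows (a minimal illustration; the tokenization and window size are assumptions, not the paper's code):

```python
def neighbours(tokens, window=2):
    # For each token, yield the co-occurring tokens that fall inside
    # a sliding window of `window` positions on either side.
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield word, tokens[j]

tokens = "the quick brown fox".split()
print(list(neighbours(tokens, window=1)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```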
IV. IMPLEMENTATION
This section covers the implementation of the pairs and
stripes approaches on Common Crawl data. In the pairs
approach, the mapper takes the input words and generates
intermediate key-value pairs, with each pair of co-occurring
words as the key and 1 as the value. The reducer sums all the
values related to a unique key, i.e. a word pair, producing the
aggregate count of every co-occurrence in the given dataset.
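The pairs logic can be sketched in plain Python as below. This is an illustrative simulation under assumed tokenization and window size; the paper's actual implementation is in Java on Hadoop:

```python
from collections import defaultdict

def pairs_mapper(tokens, window=2):
    # Key: the (word, neighbour) pair; value: 1 per observed co-occurrence.
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield (word, tokens[j]), 1

def pairs_reduce(intermediate):
    # Sum all values attached to each unique (word, neighbour) key.
    counts = defaultdict(int)
    for key, value in intermediate:
        counts[key] += value
    return dict(counts)

matrix = pairs_reduce(pairs_mapper("a b a".split(), window=1))
print(matrix)  # {('a', 'b'): 2, ('b', 'a'): 2}
```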
Compared to the pairs approach, the stripes approach
emits fewer intermediate results, although each is larger due
to the associative arrays. All the words co-occurring with a
given word are collected into an associative array, and the
mapper outputs the word as the key and its associative array
as the value. Finally, the reducer aggregates the intermediate
key-value pairs by performing an element-wise sum of all the
associative arrays attached to each unique key, i.e. each word.
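The stripes logic can be sketched in the same style, with `Counter` standing in for the associative array (again an illustrative simulation, not the paper's Java code):

```python
from collections import Counter, defaultdict

def stripes_mapper(tokens, window=2):
    # Key: the word; value: a stripe (associative array) mapping each
    # co-occurring word inside the window to its count.
    for i, word in enumerate(tokens):
        stripe = Counter()
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                stripe[tokens[j]] += 1
        yield word, stripe

def stripes_reduce(intermediate):
    # Element-wise sum of all stripes that share the same key.
    merged = defaultdict(Counter)
    for word, stripe in intermediate:
        merged[word].update(stripe)
    return {w: dict(s) for w, s in merged.items()}

matrix = stripes_reduce(stripes_mapper("a b a".split(), window=1))
print(matrix)  # {'a': {'b': 2}, 'b': {'a': 2}}
```

One stripe replaces many single-count pairs, which is why the stripes approach produces fewer, larger intermediate records.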
Java was used as the main programming language for the
MapReduce implementation; only a modest number of lines
were needed to construct the code. The framework performs
all the necessary partitioning before data reaches the reducer
and guarantees that values with the same key are aggregated
together. These features allow programmers to focus on the
implementation, while the runtime manages all the other
cluster-related concerns.
V. DATASET
Data was collected from the Common Crawl corpus to
perform the experiment. Different subsets were selected to
observe the performance of the data processing tasks as the
size of the dataset increases. Datasets in WET format, which
contains plain text, were used to compare the performance of
the pairs and stripes approaches with respect to the number of
nodes in the cluster. The experimental datasets contain
150 MB, 100 MB, and 75 MB of data. The 150 MB dataset
was used for the experiment on pairs and stripes with respect
to the number of nodes in the cluster, while all three datasets
were used to analyze the performance of the two approaches
as the data size varies.
VI. RESULTS
As discussed in Section 4, the performance of both the
pairs and stripes approaches was tested on the same dataset
using different numbers of nodes in the cluster. A window
size of 2 was used for the co-occurrence matrix throughout
the experiment. The performance of both approaches was
also assessed as the data size increased while the number of
nodes in the cluster was held constant.
TABLE I. COMPARISON OF APPROACHES WITH CLUSTER NODES

Cluster Nodes   Computation Time (Pairs)   Computation Time (Stripes)
2 Nodes         41m 9s                     16m 38s
4 Nodes         39m 43s                    16m 29s
6 Nodes         39m 40s                    16m 27s
8 Nodes         39m 31s                    16m 14s
10 Nodes        39m 05s                    16m 11s
According to the results, the stripes approach has clearly
outperformed the pairs approach in this case study by a
considerable margin, being far more efficient in elapsed time
than the pairs approach.
TABLE II. COMPARISON OF APPROACHES WITH DATA SIZE

Dataset Size   Computation Time (Pairs)   Computation Time (Stripes)
75 MB          19m 56s                    9m 56s
100 MB         27m 16s                    12m 43s
150 MB         41m 9s                     16m 38s
As shown in Table II, computation time tends to increase,
and efficiency to fall, as the size of the data grows. Even so,
the stripes approach performs much better as the dataset
grows, while the running time of the pairs approach increases
roughly linearly with the data size.
VII. DISCUSSION
Compared to prior work in this particular domain, this
research would need further optimization to reach a minimal
solution. Most other studies in this area, recent and ongoing,
have had the benefit of an in-mapper combiner applied before
the intermediate results are generated. A combiner can reduce
the number of intermediate key-value pairs by keeping a local
count of all the words processed by each mapper separately.
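An in-mapper combiner for the pairs approach could be sketched as follows: the mapper accumulates counts in a local associative array across its input and emits each unique key only once. This is an illustrative Python simulation under assumed tokenization and window size, not an implementation from the paper:

```python
from collections import defaultdict

def pairs_mapper_imc(documents, window=2):
    # In-mapper combining: accumulate counts in a local associative
    # array across all documents handled by this mapper, then emit
    # each unique (word, neighbour) key once with its local count.
    local = defaultdict(int)
    for tokens in documents:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    local[(word, tokens[j])] += 1
    return dict(local)  # emitted once, when the mapper finishes

out = pairs_mapper_imc(["a b a".split(), "a b".split()], window=1)
print(out)  # {('a', 'b'): 3, ('b', 'a'): 3}
```

Unlike a separate combiner task, in-mapper combining avoids materializing the unaggregated pairs at all, at the cost of holding the local counts in the mapper's memory.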
The implementation of a partitioner has been discussed in
several studies related to this case study. It makes the
reducer's job more efficient, since the partitioner decides
exactly which reducer a particular key-value pair should be
sent to. As an improvement, some pre-processing could be
applied to the Common Crawl dataset, mainly to remove
unnecessary markup before moving on to data processing
with either approach. The results suggest that the stripes
approach is the more effective of the two in both scenarios.
CONCLUSION
This paper presents an analysis of processing Common
Crawl data for the co-occurrence problem. The Pairs and
Stripes approaches were compared while increasing the size
of the dataset as well as adding more nodes to the cluster.
Further optimizations, namely a partitioner and a combiner,
could deliver far more efficient running times. Additional
pre-processing, removing unnecessary words and tokens
before the main analysis, could be used to achieve more
accurate results.
REFERENCES
[1] Lin, J. (2008, October). Scalable language processing algorithms for
the masses: A case study in computing word co-occurrence matrices
with MapReduce. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP).
[2] Lin, J., & Dyer, C. (2010). Data-intensive text processing with
MapReduce. Synthesis Lectures on Human Language
Technologies, 3(1).