MDS131 - Research Methods in Data Science - Unit 2 - Part 1
• Davy Cielen and Arno Meysman, Introducing Data Science. Simon and Schuster, 2016
• C. R. Kothari, Research Methodology: Methods and Techniques, 3rd ed. New Delhi: New Age International Publishers, reprint 2014
• Big data is a blanket term for any collection of data sets so large or complex
that they become difficult to process using traditional data management
techniques such as relational database management systems (RDBMS).
• Big data is important because it can be used to gain insights into a wide variety
of areas, including business, healthcare, and government. It can also be used to
improve decision making, predict trends, and identify new opportunities.
• Data science involves using methods to analyze massive amounts of data and
extract the knowledge it contains.
• Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines.
Current Landscape of Big Data – Characteristics / Framework
Big data is commonly characterized in terms of V's: the original 3 V's (volume, velocity, variety), later extended to 4 V's (adding veracity) and then to 7 V's (adding value, variability, and visibility), as described below.
• Volume: This refers to the sheer scale of data generated and collected. Big data involves massive
amounts of information that exceed the processing capacity of conventional databases and tools. The
volume of data is measured in terms of petabytes, exabytes, and beyond.
o Scenario: Social Media Data Analysis for Marketing. Description: A marketing company is analyzing
social media data to understand consumer sentiment towards a new product launch. They collect and
process millions of tweets, comments, and posts in real-time to gauge public reactions and identify
potential areas for improvement.
• Velocity: Velocity pertains to the speed at which data is generated, processed, and delivered. With the
advent of real-time data streams from sources like social media, sensors, and financial markets,
organizations need to analyze and act upon data in near-real-time to capitalize on opportunities and
respond to challenges.
o Scenario: Stock Market Real-time Analysis. Description: An investment firm is monitoring stock market
data in real time to make informed trading decisions. They process and analyze market data streams,
such as price fluctuations and trading volumes, to identify trends and execute buy/sell orders swiftly.
• Variety: Variety refers to the diverse types of data that big data encompasses. This includes structured
data (such as relational databases), unstructured data (such as text and images), and semi-structured
data (such as XML files). Managing and extracting insights from this varied data requires specialized
techniques and tools.
o Scenario: Healthcare Data Integration. Description: A hospital is integrating various types of patient
data, including structured electronic health records (EHRs), unstructured doctor's notes, and medical
images. They use advanced analytics to correlate different data types to provide personalized treatment
plans.
• Veracity: Veracity refers to the accuracy and reliability of data. In the big data context, data can come
from numerous sources, each with varying levels of accuracy and trustworthiness. Ensuring data quality
and addressing issues like inconsistencies and errors become critical to making reliable decisions.
o Scenario: Fraud Detection in Financial Transactions. Description: A credit card company is analyzing a
massive volume of transaction data to detect fraudulent activities. They use machine learning algorithms
to identify patterns that indicate potentially fraudulent transactions while reducing false positives.
• Value: The value of big data lies in its potential to provide meaningful insights and drive informed decisions.
Extracting value from big data involves analyzing and interpreting the data to uncover patterns, trends,
correlations, and insights that can lead to improved business strategies, innovations, and efficiencies.
o Scenario: Retail Customer Behavior Analysis. Description: An e-commerce company is analyzing customer
behavior data to improve sales and marketing strategies. They analyze browsing history, purchase patterns,
and demographics to personalize recommendations, promotions, and advertisements, ultimately driving
higher conversion rates.
• Variability: Variability refers to the inconsistency of data flows, which can be erratic and unpredictable. Data
can arrive in irregular intervals, and its structure can change over time. Handling variability requires flexible
data processing techniques and tools that can adapt to changing data patterns.
o Scenario: Weather Forecasting and Emergency Response. Description: A meteorological agency is collecting
and processing weather data from various sources, including satellites, sensors, and weather stations. They
handle the variability in data frequency and format to provide accurate and timely weather forecasts for
disaster preparedness.
• Visibility (Visualization): Visibility refers to the ability to access and understand data from
various perspectives. Effective visualization and data presentation techniques are crucial
to making complex data comprehensible and actionable for a wide range of stakeholders.
o Scenario: Supply Chain Analytics for Manufacturing. Description: A manufacturing
company is using big data analytics to gain visibility into its supply chain. They track data
from suppliers, production facilities, transportation, and distribution centers to optimize
inventory levels, reduce lead times, and enhance overall operational efficiency.
These seven V's collectively emphasize the challenges and opportunities presented by big data.
Organizations that can successfully address these characteristics can harness the power of big
data to gain insights, improve decision-making, drive innovation, and enhance their competitive
advantage.
Natural Language Data
• Definition: Natural language data refers to text or speech data
generated by humans in their everyday communication.
Analyzing natural language involves techniques like sentiment
analysis, language translation, and text summarization.
• Examples: Social media comments, customer reviews, email
correspondence.
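Since the techniques named above, such as sentiment analysis, are algorithmic, a minimal sketch can make the idea concrete. The Python sketch below is illustrative only: the word lists and example comments are made-up assumptions, not a real lexicon, and a production system would use a trained model or an NLP library.

```python
# Minimal sketch of keyword-based sentiment scoring on natural language data.
# The word lists and comments are illustrative assumptions, not a real lexicon.
POSITIVE = {"great", "love", "excellent", "good", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "disappointed"}

def sentiment_score(text: str) -> int:
    """Return (# positive words - # negative words) in the text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

comments = [
    "I love this product, it is excellent",
    "Terrible experience, very disappointed",
]
for c in comments:
    print(c, "->", sentiment_score(c))
```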
Graph-Based Data
• Definition: Graph-based data represents relationships
between entities using nodes (vertices) and edges. It's
especially useful for modeling complex interactions and
networks.
• Examples: Social networks (nodes as users, edges as
connections), supply chain networks, knowledge graphs.
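To make the nodes-and-edges idea concrete, here is a minimal sketch of a social network stored as a Python adjacency list; the user names and connections are invented for illustration.

```python
# Minimal sketch of graph-based data: a social network as an adjacency list.
# Nodes are users; an edge means two users are connected.
social_network = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice", "dave"},
    "dave": {"carol"},
}

def friends_of_friends(graph, user):
    """Users two hops away from `user`, excluding direct friends and self."""
    direct = graph.get(user, set())
    two_hop = set().union(*(graph.get(f, set()) for f in direct)) if direct else set()
    return two_hop - direct - {user}

print(friends_of_friends(social_network, "alice"))  # {'dave'}
```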
Streaming Data
• Definition: Streaming data refers to real-time data that is generated, processed, and
analyzed as it is produced. It's crucial for applications that require immediate insights and
actions.
• Examples: Stock market tick data, social media live feeds, IoT sensor data.
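A minimal sketch of stream processing: maintaining a running mean over readings as they arrive, without storing the whole stream. The sensor values below are simulated; a real deployment would consume a live feed such as an IoT message queue.

```python
import random

# Minimal sketch of streaming-data processing: keep a running average of
# sensor readings without storing the stream. Readings are simulated.
def sensor_stream(n):
    for _ in range(n):
        yield 20.0 + random.uniform(-2.0, 2.0)  # simulated temperature

count, mean = 0, 0.0
for reading in sensor_stream(1000):
    count += 1
    mean += (reading - mean) / count  # incremental (online) mean update
print(f"processed {count} readings, running mean = {mean:.2f}")
```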
• C. R. Kothari, Research Methodology: Methods and Techniques, 3rd ed. New Delhi: New Age International Publishers, reprint 2014
Sampling Techniques
• Deliberate sampling: Deliberate sampling is also known as purposive or non-probability
sampling. This method involves the deliberate selection of particular units of the universe
to constitute a sample that represents the universe. When population elements are selected
for inclusion in the sample based on ease of access, it is called convenience sampling. In
judgement sampling, on the other hand, the researcher's judgement is used to select items
they consider representative of the population.
Scenario: Market Research for a New Product Launch
In this scenario, deliberate sampling allows the company to focus its market research efforts on a
specific demographic that is crucial for the success of its new product. While deliberate sampling
doesn't provide the statistical representativeness of probability sampling, it offers valuable
qualitative insights that can guide decision-making in product development and marketing
strategies.
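As a minimal sketch with made-up respondent data, the difference between convenience and judgement selection can be expressed in a few lines of Python; the fields and the 18-35 target demographic are illustrative assumptions, not from the text.

```python
# Minimal sketch contrasting convenience and judgement sampling.
# Respondent records are simulated for illustration.
respondents = [
    {"id": 1, "age": 24, "at_mall": True},
    {"id": 2, "age": 67, "at_mall": False},
    {"id": 3, "age": 31, "at_mall": True},
    {"id": 4, "age": 45, "at_mall": True},
]

# Convenience sampling: take whoever is easiest to reach (e.g., shoppers on site).
convenience_sample = [r for r in respondents if r["at_mall"]][:3]

# Judgement sampling: the researcher deliberately picks units judged
# representative, here an assumed 18-35 target demographic.
judgement_sample = [r for r in respondents if 18 <= r["age"] <= 35]

print(convenience_sample)
print(judgement_sample)
```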
• Simple random sampling: This type of sampling is also known as chance sampling or
probability sampling, where every item in the population has an equal chance of
inclusion in the sample and, in the case of a finite universe, each possible sample
has the same probability of being selected.
Imagine a scenario where a research firm wants to conduct a political opinion poll to gauge the
sentiments of the residents of a city regarding upcoming local elections. The firm aims to use
simple random sampling to ensure that each eligible resident has an equal chance of being
included in the survey.
Simple random sampling in this scenario ensures that every registered voter has an equal
chance of being included in the survey, minimizing potential bias and providing a representative
sample of the population's political opinions.
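A minimal Python sketch of the voter-poll scenario, using simulated voter IDs; `random.sample` draws without replacement, so every registered voter has an equal chance of selection.

```python
import random

# Minimal sketch of simple random sampling: every registered voter has an
# equal chance of selection. Voter IDs are simulated for illustration.
registered_voters = [f"voter_{i}" for i in range(100_000)]

sample = random.sample(registered_voters, k=1_000)  # without replacement
print(len(sample), sample[:5])
```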
• Systematic sampling: In some instances the most practical way of sampling is to select
every 15th name on a list, every 10th house on one side of a street, and so on. Sampling of
this type is known as systematic sampling. An element of randomness is usually introduced
into this kind of sampling by using random numbers to pick the unit with which to start.
This procedure is useful when a sampling frame is available in the form of a list.
Consider a scenario where a retail store wants to collect customer feedback to understand their
satisfaction levels and improve their services. The store aims to use systematic sampling to
efficiently gather feedback from customers while maintaining randomness in the selection process.
Systematic sampling in this scenario allows the retail store to collect customer feedback in an
organized and efficient manner while maintaining a level of randomness. It ensures that feedback is
gathered from a variety of customers, providing insights that can lead to improvements in the
store's operations and customer satisfaction.
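A minimal sketch of the store's systematic selection, with a simulated customer list; the sampling interval and the random start are the only parameters.

```python
import random

# Minimal sketch of systematic sampling: pick a random start, then take
# every k-th unit from the sampling frame (here, a simulated customer list).
customers = [f"customer_{i}" for i in range(5_000)]

sample_size = 100
k = len(customers) // sample_size   # sampling interval (every k-th customer)
start = random.randrange(k)         # random start introduces randomness
sample = customers[start::k]
print(k, start, len(sample))        # interval, start, 100 sampled customers
```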
• Stratified sampling: If the population from which a sample is to be drawn does not constitute a
homogeneous group, the stratified sampling technique is applied to obtain a
representative sample. In this technique, the population is divided into a number of non-
overlapping subpopulations, or strata, and sample items are selected from each stratum. If the
items are selected from each stratum by simple random sampling, the entire procedure, first
stratification and then simple random sampling, is known as stratified random sampling.
Imagine a scenario where a district school authority wants to assess the academic performance of
its students across different grade levels and subjects. The district authority aims to use stratified
sampling to ensure representation from each grade level and subject area while maintaining a
manageable sample size.
Stratified sampling in this scenario allows the district authority to obtain a representative sample of
students from each grade level and subject area. This ensures that the assessment results
accurately reflect the academic performance of students across the entire school district, enabling
effective educational planning and targeted interventions.
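A minimal sketch of stratified random sampling with simulated student records: group by grade level (the stratum), then draw a simple random sample within each stratum. The grade levels and sample sizes are illustrative assumptions.

```python
import random
from collections import defaultdict

# Minimal sketch of stratified random sampling: stratify students by grade,
# then draw a simple random sample within each stratum. Data is simulated.
students = [{"id": i, "grade": random.choice([6, 7, 8])} for i in range(3_000)]

strata = defaultdict(list)
for s in students:
    strata[s["grade"]].append(s)   # stratification step

per_stratum = 50
sample = []
for grade, members in strata.items():
    sample.extend(random.sample(members, per_stratum))  # SRS within stratum

print(len(sample))  # 50 students from each of the three grade levels
```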
• Multi-stage sampling: This is a further development of the idea of cluster sampling. This
technique is meant for big inquiries extending over a considerably large geographical area,
such as an entire country. Under multi-stage sampling the first stage may be to select large
primary sampling units such as states, then districts, then towns, and finally certain families
within towns.
Scenario: Environmental Impact Assessment in a Region
Multi-stage sampling in this scenario allows the environmental agency to efficiently gather data
from various ecosystems within the region while maintaining a representative sample. By
considering different geographical zones and sub-areas, they can assess the potential impact of
the construction project on the environment more comprehensively.
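A minimal sketch of the states → districts → households idea, using a simulated nested sampling frame; the counts at each stage are illustrative assumptions.

```python
import random

# Minimal sketch of multi-stage sampling: states -> districts -> households.
# The nested frame and stage sizes are simulated for illustration.
frame = {
    f"state_{s}": {
        f"district_{s}_{d}": [f"household_{s}_{d}_{h}" for h in range(200)]
        for d in range(10)
    }
    for s in range(5)
}

states = random.sample(list(frame), 2)                   # stage 1: states
sample = []
for st in states:
    districts = random.sample(list(frame[st]), 3)        # stage 2: districts
    for di in districts:
        sample.extend(random.sample(frame[st][di], 20))  # stage 3: households
print(len(sample))  # 2 states * 3 districts * 20 households = 120
```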
• Sequential sampling: This is a somewhat complex sample design in which the ultimate size
of the sample is not fixed in advance but is determined according to mathematical decision
rules on the basis of information yielded as the survey progresses. This design is usually
adopted under acceptance sampling plans in the context of statistical quality control.
Consider a scenario where a manufacturing plant produces electronic components and wants to
ensure the quality of its products. The plant implements sequential sampling to monitor the
production process and make real-time decisions about the quality of the components.
In this scenario, sequential sampling is utilized to make ongoing decisions about the quality of
manufactured components. It enables the manufacturing plant to quickly identify and rectify
quality issues, leading to higher product quality and reduced waste.
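A minimal, deliberately simplified sketch of a sequential acceptance rule: components are inspected one at a time and the sample size is not fixed in advance. The thresholds and simulated defect rate are illustrative assumptions, not a real statistical quality control plan (which would use, e.g., a sequential probability ratio test).

```python
import random

# Minimal, simplified sketch of sequential (acceptance) sampling: inspect
# components one at a time, stopping as soon as a decision rule is met.
def inspect():
    return random.random() < 0.03  # True = defective (3% simulated rate)

ACCEPT_AFTER = 100   # accept the lot after 100 inspections with few defects
REJECT_AT = 4        # reject the lot as soon as 4 defects are observed

defects = 0
for n in range(1, ACCEPT_AFTER + 1):
    defects += inspect()
    if defects >= REJECT_AT:
        print(f"reject lot after {n} inspections ({defects} defects)")
        break
else:
    print(f"accept lot after {ACCEPT_AFTER} inspections ({defects} defects)")
```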
• Quota sampling: In stratified sampling, the cost of taking random samples from individual
strata is often so high that interviewers are simply given quotas to be filled from the
different strata, with the actual selection of items for the sample left to the interviewer's
judgement. This is called quota sampling.
Imagine a scenario where a beverage company wants to conduct a consumer preference survey
to understand which flavors of a new drink are most popular among different age groups and
genders. The company decides to use quota sampling to ensure a balanced representation of
participants from various demographic categories.
In this scenario, quota sampling helps the beverage company gather insights about consumer
preferences across different demographic categories without conducting a fully random sample.
While quota sampling doesn't guarantee statistical representativeness like probability sampling, it
allows for a certain level of control over the composition of the sample to ensure diversity and
balance.
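A minimal sketch of quota filling with simulated respondent arrivals; within-cell selection is left to the interviewer, modeled here simply as arrival order. The demographic cells and quota sizes are illustrative assumptions.

```python
import random

# Minimal sketch of quota sampling: fill fixed quotas per demographic cell
# (age group x gender); respondents beyond a full quota are turned away.
quotas = {("18-30", "F"): 25, ("18-30", "M"): 25,
          ("31-50", "F"): 25, ("31-50", "M"): 25}
filled = {cell: [] for cell in quotas}

def next_respondent(i):
    """Simulated arriving respondent with random demographics."""
    return {"id": i,
            "age_group": random.choice(["18-30", "31-50"]),
            "gender": random.choice(["F", "M"])}

i = 0
while not all(len(filled[c]) == q for c, q in quotas.items()):
    i += 1
    person = next_respondent(i)
    cell = (person["age_group"], person["gender"])
    if len(filled[cell]) < quotas[cell]:
        filled[cell].append(person)

print({cell: len(v) for cell, v in filled.items()})
```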
• Cluster sampling and area sampling: Cluster sampling involves grouping the population
and then selecting the groups, or clusters, rather than individual elements for inclusion in
the sample. Suppose a departmental store wishes to sample its credit card holders. It has
issued cards to 15,000 customers, and the sample size is to be kept at, say, 450. For cluster
sampling, the list of 15,000 card holders could be formed into 100 clusters of 150 card
holders each, and three clusters might then be selected for the sample at random.
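The card-holder example translates directly into a minimal Python sketch; the holder names are simulated, but the numbers (15,000 holders, 100 clusters of 150, 3 clusters selected) follow the example above.

```python
import random

# Minimal sketch of the cluster-sampling example: 15,000 card holders formed
# into 100 clusters of 150 each, then 3 whole clusters selected at random.
card_holders = [f"holder_{i}" for i in range(15_000)]

cluster_size = 150
clusters = [card_holders[i:i + cluster_size]
            for i in range(0, len(card_holders), cluster_size)]  # 100 clusters

chosen = random.sample(clusters, 3)            # select whole clusters
sample = [h for cluster in chosen for h in cluster]
print(len(clusters), len(sample))              # 100 clusters, 450 in sample
```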
Imagine a scenario where a government health department wants to assess the quality of
healthcare facilities in a large region with many hospitals and clinics. The department decides to
use cluster sampling to evaluate a representative subset of healthcare facilities.
In this example, cluster sampling allows the health department to evaluate the quality of
healthcare facilities in the region without having to assess each facility individually. By selecting
representative clusters and conducting assessments within them, the department can make
informed decisions to improve healthcare services.
• Area sampling is quite close to cluster sampling and is often used when the total
geographical area of interest is a big one. Under area sampling we first divide
the total area into a number of smaller non-overlapping areas, generally called geographical
clusters; a number of these smaller areas are then randomly selected, and all units in those
areas are included in the sample. Area sampling is especially helpful when we do not
have a list of the population concerned.
Imagine a scenario where an environmental agency wants to assess urban air quality across a
large city. The agency divides the city into neighborhoods, randomly selects a subset of them,
and measures air pollution at every monitoring site within the selected neighborhoods.
In this example, area sampling allows the environmental agency to assess urban air quality
efficiently across a diverse city landscape. By selecting representative neighborhoods and
measuring air pollution within those areas, it can make informed decisions to address
environmental concerns and enhance the overall quality of life for city residents.
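A minimal sketch of the air-quality scenario, with a simulated city map; the neighborhood and site counts are illustrative assumptions. Note that, unlike cluster sampling of a customer list, no list of individual units is needed in advance, only the division of the area.

```python
import random

# Minimal sketch of area sampling: divide the city into non-overlapping
# neighborhoods, randomly select some, and include every monitoring site
# in the selected neighborhoods. The city map below is simulated.
city = {
    f"neighborhood_{n}": [f"site_{n}_{s}" for s in range(8)]
    for n in range(40)
}

selected_areas = random.sample(list(city), 5)   # random geographic clusters
sample = [site for area in selected_areas for site in city[area]]  # all units
print(selected_areas, len(sample))              # 5 areas, 40 sites total
```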