ids model 2
5 marks
ans)
**2a. Three Benefits of Applying Data Science in Modern Industries**
1. **Enhanced Decision-Making**: Data-driven insights help organizations make faster, more accurate, evidence-based decisions.
---
**Defining Goals**:
Setting clear goals is essential for any data science project as it provides
direction and benchmarks for measuring success. The goals should align
with business objectives, be specific, measurable, achievable, relevant,
and time-bound (SMART).
1. **Identify the Problem**: Define what the project seeks to achieve. For
example, a retail business may aim to improve sales forecasting accuracy.
- **Project Objectives**:
- **Scope of Work**:
- **Stakeholders**:
- **Expected Outcomes**:
- **Timeline**:
This charter provides a structured foundation for the project, ensuring that
all stakeholders are aligned on goals, scope, and deliverables.
2. **Data Collection**:
6. **Model Building**:
7. **Model Evaluation**:
---
Exploratory Data Analysis (EDA) is a critical step that sets the foundation
for effective model building. Here’s how EDA contributes to the model-
building process:
- EDA provides insights into whether data is suitable for specific model
types (e.g., linear relationships for linear regression, clusters for clustering
algorithms). Understanding data structure through EDA allows data
scientists to choose algorithms that are likely to perform well.
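As a quick illustration of these checks, here is a minimal EDA sketch, assuming a hypothetical `sales.csv` dataset with at least two numeric columns (plotting also requires matplotlib):

```python
# Minimal EDA sketch; "sales.csv" and its columns are assumptions for illustration.
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.info())                 # column types and non-null counts
print(df.describe())             # distributions and possible outliers
print(df.isna().mean())          # share of missing values per column

num = df.select_dtypes("number")
print(num.corr())                # strong linear correlations suggest linear models

# A quick scatter of two candidate features can reveal clusters vs. a linear
# trend, informing whether clustering or regression is a better fit.
num.plot.scatter(x=num.columns[0], y=num.columns[1])
```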
ans)
### 4a) The Role of Machine Learning in Data Science
Machine learning supplies the algorithms that let data scientists build predictive models, uncover patterns in data, and automate decision-making at scale.
### 4b) Comparison of Different Types of Machine Learning
Types of ML:
In unsupervised learning, we know the input data, but both the output and
the mapping function are unknown.
In such scenarios, machine learning algorithms find a function that
measures similarity among the input instances and groups them based on
that similarity index; these groupings are the output of unsupervised
learning.
• Discovers patterns and trends in the data to derive the output.
Applications:
Recommendation Systems
1. **Supervised Learning**
2. **Unsupervised Learning**
3. **Reinforcement Learning**
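As a minimal sketch (not from the original answer) contrasting the first two types with scikit-learn; reinforcement learning is omitted here because it needs an interactive environment:

```python
# Supervised vs. unsupervised learning on a tiny toy dataset (illustrative only).
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]

# Supervised: both inputs X and labels y are known; the model learns the mapping.
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2, 1], [8, 9]]))     # predicted labels for new inputs

# Unsupervised: only X is known; the algorithm groups inputs by similarity,
# and the cluster assignments are the output.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```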
5. a) Outline the steps you would take for feature engineering to improve
model performance.
b) Provide programming tips for efficiently processing large datasets in
Python.
ans)
### 5. a) Steps for Feature Engineering to Improve Model
Performance
6. **Transform Features**:
8. **Feature Selection**:
9. **Dimensionality Reduction**:
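A minimal scikit-learn sketch of the transform, select, and reduce steps above; the synthetic data and parameter choices are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("transform", StandardScaler()),              # transform features (scaling)
    ("select", SelectKBest(f_classif, k=10)),     # feature selection
    ("reduce", PCA(n_components=5)),              # dimensionality reduction
    ("model", LogisticRegression(max_iter=1000)),
])

# Synthetic data standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe.fit(X, y)
print(pipe.score(X, y))
```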
### 5. b) Programming Tips for Efficiently Processing Large Datasets in Python
- Filter out unnecessary data and select relevant subsets early in the
pipeline to minimize the data volume handled in memory.
- For very large datasets that cannot fit into memory, consider out-of-core or
distributed processing libraries such as Dask or `PySpark`, which process data
in partitions rather than loading everything into memory at once.
- For pipelines that must handle streaming data in real time, tools like
Kafka can feed streaming data into Python for processing.
- When reading and writing data, use efficient file formats and I/O
operations (e.g., columnar formats such as `feather` or `parquet` for fast loading and saving).
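A minimal sketch of these tips; the file names and columns are illustrative assumptions (`to_feather` requires `pyarrow`, and the last part requires `dask`):

```python
import pandas as pd

# Read a large CSV in chunks, selecting columns and filtering rows early.
parts = []
for chunk in pd.read_csv("big.csv", usecols=["user_id", "amount"], chunksize=100_000):
    parts.append(chunk[chunk["amount"] > 0])
df = pd.concat(parts, ignore_index=True)

# Efficient columnar formats for fast reads and writes.
df.to_feather("big.feather")
df = pd.read_feather("big.feather")

# Out-of-core / distributed processing for data that won't fit in memory.
import dask.dataframe as dd
ddf = dd.read_csv("big-*.csv")
print(ddf.groupby("user_id")["amount"].sum().compute())
```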
10. a) Describe how Cross filter can be used for data exploration and
analysis in a data visualization context.
b) Discuss various applications of data visualization.
Ans)
### 10. a) Crossfilter in Data Exploration and Analysis
Cross filtering is a technique used in data analysis to explore the
relationships between different variables in a dataset. In cross filtering,
the user selects one or more values for a variable, and the other
variables in the dataset are filtered based on those selected values.
For example, imagine you have a dataset that includes information about
customer purchases, including the customer's age, gender, location, and
purchase amount. Using cross filtering, you could select a specific age
range, and the dataset would be filtered to only show purchases made by
customers within that age range. You could then further refine the results
by selecting a specific location, or by filtering by gender.
Cross filtering can help identify patterns and trends in data, and can be
useful in business, marketing, and scientific research applications. It is
often used in data visualization tools to enable interactive exploration of
data.
Crossfiltering Steps (example with a small student dataset):
1. Apply a grade filter (Grade = 10): the dashboard updates to show
statistics or graphs for the matching students only.
2. Add a subject filter (Subject = Science): now only Bob and Eva remain
in the filtered data, as they are the 10th graders taking Science.
3. Add a gender filter (Gender = Female): only Eva meets all criteria
(10th grader, Science, Female).
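A minimal pandas sketch mimicking these steps (not actual Crossfilter/dc.js code; the student records are assumptions for illustration):

```python
import pandas as pd

students = pd.DataFrame([
    {"name": "Bob", "grade": 10, "subject": "Science", "gender": "Male"},
    {"name": "Eva", "grade": 10, "subject": "Science", "gender": "Female"},
    {"name": "Tom", "grade": 9,  "subject": "Math",    "gender": "Male"},
])

# Step 1: grade filter.
step1 = students[students.grade == 10]

# Step 2: add the subject filter -> Bob and Eva remain.
step2 = step1[step1.subject == "Science"]
print(step2.name.tolist())          # ['Bob', 'Eva']

# Step 3: add the gender filter -> only Eva meets all criteria.
step3 = step2[step2.gender == "Female"]
print(step3.name.tolist())          # ['Eva']
```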
1. **Multi-Dimensional Filtering**:
- Each filter applied updates all visualizations that share the same
dataset, providing instant feedback on how filters impact the data.
- This enables smooth interaction, letting users drill down into data with
minimal delay and view results instantly.
- When users adjust filters in one chart, the entire dashboard updates to
reflect the current filter settings.
### 10. b) Applications of Data Visualization
- Tools like Tableau, Power BI, and Google Data Studio help businesses
monitor performance, analyze trends, and make informed decisions.
- Data visualization helps track patient data, monitor health trends, and
visualize medical research findings. In public health, visualizations are
used to monitor disease outbreaks, vaccination rates, and other health
metrics.
11. a) Outline the steps for creating an interactive dashboard using dc.js
and describe its key features.
b) Describe the importance of integrating various data visualization
tools in developing effective data applications.
Ans)
### 11. a) Steps for Creating an Interactive Dashboard Using
dc.js and Key Features
1. Define your audience and goals: Ask who you are building this
dashboard for and what they need to understand. Once you know
that, you can answer their questions more easily with selected
visualizations and data.
2. Choose your data: Most businesses have an abundance of data
from different sources. Choose only what's relevant to your audience
and their goals.
- Use dc.js functions to define and configure each chart. Specify options
like data source (dimension and group), scales, colors, labels, and axis
formatting.
### 11. b) Importance of Integrating Various Data Visualization Tools
- Some tools, such as Plotly or Google Data Studio, are also cloud-based,
allowing applications to scale and support high volumes of concurrent
users or data points.
- By combining tools, data applications can draw from both local and
cloud databases, streamlining data ingestion, processing, and
visualization in one cohesive application.
Hadoop Ecosystem
MapReduce:
In the MapReduce approach, the processing is done at the slave nodes, and the
final result is sent to the master node.
The input dataset is first split into chunks of data. In this example,
the input has three lines of text - "bus car train," "ship ship train,"
and "bus ship car." The dataset is split into three chunks, one per line,
and the chunks are processed in parallel.
In the map phase, each word in a chunk is emitted as a key with a value
of 1. For the first chunk, this produces the pairs (bus, 1), (car, 1), and
(train, 1).
These key-value pairs are then shuffled and sorted together based on
their keys.
At the reduce phase, the values for each key are aggregated, and the final
output is obtained: bus = 2, car = 2, ship = 3, train = 2.
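A minimal pure-Python sketch of this word-count flow (in real MapReduce the map and reduce tasks would run in parallel on different nodes):

```python
from collections import defaultdict

chunks = ["bus car train", "ship ship train", "bus ship car"]

# Map phase: emit (word, 1) pairs from each chunk.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle and sort: group the values by key.
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # {'bus': 2, 'car': 2, 'ship': 3, 'train': 2}
```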
YARN (Yet Another Resource Negotiator) has the following components:
• Client
• Resource Manager
• Node Manager
• Application Master
Resource Manager:
Suppose a client machine wants to run a query or some code for data
analysis. The request first goes to the Resource Manager, which allocates
the resources required to run the job across the cluster.
Node Manager:
Each node in the cluster has its own Node Manager, which manages the
node and monitors its resource usage.
Whenever a job request comes in, the app master requests the
container from the node manager. The Node Managers check if they
have available resources to fulfil the request. If they do, they
allocate containers and notify the Resource Manager.
- **HDFS** is a distributed file system that breaks large files into blocks
and distributes them across nodes in the cluster. Each block is replicated
on multiple nodes to ensure fault tolerance and high availability.
- **Map Phase**: This phase breaks down the input data into smaller
subsets and processes them independently on different nodes, generating
intermediate key-value pairs.
2. **Data Processing with Hadoop**: The hospital used Hadoop's HDFS for
distributed storage of large datasets and MapReduce for parallel
processing, enabling the handling of vast amounts of medical data in a
scalable way.
1) Atomicity
The term atomicity defines that the data remains atomic. It means if any
operation is performed on the data, either it should be performed or executed
completely or should not be executed at all.
It further means that the operation should not break in between or execute
partially. In the case of executing operations on the transaction, the operation
should be completely executed and not partially.
Example: Suppose Remo has account A with $30 in it and wishes to send
$10 to Sheero's account B, which holds $100.
Two operations must take place: $10 is debited from Remo's account A,
and the same $10 is credited to Sheero's account B.
Now suppose the debit operation executes successfully, but the credit
operation fails.
Remo's account A is left with $20, while Sheero's account B still holds
$100 as it did before.
Because only the debit was applied and the credit never happened, the
transaction is not atomic. Only when both the debit and the credit
operations complete successfully is the transaction atomic.
When atomicity is lost in this way, it becomes a huge issue for banking
systems, which is why atomicity is a main focus in such systems.
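A minimal `sqlite3` sketch of this example (the simulated failure is an assumption for illustration): either both operations commit, or the rollback restores the original balances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 30), ("B", 100)])
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 10 WHERE name = 'A'")
        # Simulate the credit step failing before it can run.
        raise RuntimeError("credit to B failed")
except RuntimeError:
    pass

# The debit was rolled back along with the failed credit: A is still 30, B is 100.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```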
2) Consistency
Consistency means that the correctness of the data must always be preserved.
In DBMS, the integrity of the data must be maintained: if a change is
made to the database, the data must remain valid afterwards.
In the case of transactions, this means the database must be consistent
before and after the transaction, and the values read must always be
correct.
Example:
Consider three accounts, A, B, and C, where A makes transactions, one by
one, to both B and C. Each transfer involves two operations: a debit from
A and a credit to the receiving account.
Account A first transfers $50 to account B. Before the transaction, B
reads A's balance as $300. After the successful transaction T, the
available amount in B becomes $150.
Next, A transfers $20 to account C. At that point, the balance of A read
by C is $250, which is correct because the $50 debit to B has already
been applied. The debit and credit operations from account A to C then
complete successfully.
The transactions complete successfully and the values read are correct,
so the data is consistent.
If C had instead read A's balance as $300, the data would be
inconsistent, because the earlier debit of $50 would not be reflected.
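A minimal sketch of the same idea as an invariant check (C's opening balance is an assumption): consistency here means the total money across accounts is preserved and each read reflects the debits already completed.

```python
accounts = {"A": 300, "B": 100, "C": 50}   # C's opening balance is assumed
total_before = sum(accounts.values())

def transfer(src, dst, amount):
    accounts[src] -= amount
    accounts[dst] += amount

transfer("A", "B", 50)     # afterwards C reads A as 250, and B holds 150
transfer("A", "C", 20)

assert sum(accounts.values()) == total_before, "consistency violated"
print(accounts)            # {'A': 230, 'B': 150, 'C': 70}
```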
3) Isolation
Isolation means that concurrent transactions execute as if they ran one
after another; the intermediate state of one transaction is not visible to
other transactions.
4) Durability
Durability means that once a transaction is committed, its changes are
permanent and survive system failures such as crashes or power loss.
Therefore, the ACID property of DBMS plays a vital role in maintaining the
consistency and availability of data in the database.
The ACID principles are vital for maintaining data accuracy, consistency,
and reliability, especially in systems that require a high degree of trust
and precision, such as banking, finance, and inventory management. They
help prevent data corruption and support smooth concurrent access to
data, enhancing the overall stability and dependability of relational
databases.
8. a) Explain the purpose and basic syntax of the Cypher query language
used in graph databases.
b) Discuss various applications of graph databases.
Ans)
### 8a) Purpose and Basic Syntax of Cypher Query Language in
Graph Databases
Cypher is a declarative query language for graph databases such as Neo4j. Its
purpose is to let users describe graph patterns visually, using an ASCII-art
style syntax like `(node)-[:RELATIONSHIP]->(node)`, to read and modify nodes
and relationships. Its basic clauses include:
- `MATCH` retrieves data by pattern, e.g., all pairs of people (`p` and `f`) who are friends.
- `CREATE` adds data, e.g., a node labeled `Person` with the properties `name` and
`age`.
- `WHERE` filters results, e.g., person nodes whose `age` property is greater than 25.
- `RETURN` specifies the output, e.g., only the `name` and `age` properties of each person
node.
- `MERGE` checks whether a pattern exists (such as a person named Alice) and creates it only if it does not.
Cypher provides a clear and visual way to work with graph structures,
making it powerful and accessible for managing complex, interconnected
data.
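A minimal sketch of these statements run through the official `neo4j` Python driver; the connection details and the `FRIEND_OF` relationship type are assumptions for illustration.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # CREATE a Person node with name and age properties.
    session.run("CREATE (:Person {name: 'Alice', age: 30})")
    # MERGE: create the node only if it does not already exist.
    session.run("MERGE (:Person {name: 'Alice'})")
    # MATCH with a WHERE filter, RETURNing selected properties.
    result = session.run(
        "MATCH (p:Person) WHERE p.age > 25 RETURN p.name AS name, p.age AS age"
    )
    for record in result:
        print(record["name"], record["age"])
    # MATCH a relationship pattern: all pairs of people who are friends.
    session.run("MATCH (p:Person)-[:FRIEND_OF]->(f:Person) RETURN p, f")

driver.close()
```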
### 8b) Applications of Graph Databases
Social Networking
Social networking platforms like Facebook, Twitter, and
LinkedIn use graph databases to store and query the
relationships between users, their connections, and their
interactions. This allows them to easily retrieve
information such as a user’s friends, followers, and likes,
as well as to recommend new connections based on shared
interests or connections.
Recommendation Engines
Many companies use graph databases to build
recommendation engines that can suggest personalized
products or services to their customers. For example,
online retailers like Amazon and Netflix use graph
databases to recommend products based on a customer’s
purchase history and browsing behavior. Music and video
streaming services use graph-based recommendations in a similar way,
suggesting songs or shows based on a user's listening and viewing history.
Fraud Detection
Graph databases are also used in fraud detection,
particularly in the financial industry. They can be used to
detect suspicious patterns of behavior, such as a sudden
increase in transactions from a particular account or a
series of transactions that are all linked to the same
individual. Graph databases can also be used to identify
networks of individuals or companies that are connected to
fraudulent activity.
Knowledge Graphs
Knowledge graphs are a type of graph database that store
information in the form of a graph, with nodes
representing entities and edges representing relationships
between them. They are used in a variety of industries,
including healthcare, finance, and government, to store
and query large amounts of data. For example, a
healthcare company might use a knowledge graph to store
information about patients, their medical history, and their
treatments, and then use graph queries to identify patterns
and trends in patient data.
Tokenization
Tokenization is the process of splitting text into smaller units called tokens. These tokens
can be words or sentences depending on the task.
Tokenization is one of the first steps in natural language processing (NLP) and is essential for
tasks like text analysis, machine learning, and information retrieval.
Types of Tokenization:
1. Word Tokenization: breaks down a sentence or text into individual words.
2. Sentence Tokenization: breaks down a paragraph or text into individual sentences.
Stop words: common words (such as "the", "is", and "and") that carry little meaning and are often removed from the tokens before further analysis.
Stemming:
Stemming is a process in natural language processing (NLP) where words are reduced to
their root or base form (also called the stem), typically by stripping off suffixes like "-ing," "-
ed," "-ly," etc. The stem may not always be a valid word, but the goal is to reduce the word to
a form that represents its meaning.
Various stemming algorithms: Porter stemmer, Lancaster stemmer, and Snowball stemmer.
Example of Stemming:
Word: "running"
Stem: "run"
Lemmatization:
Lemmatization is a process in natural language processing (NLP) that reduces words to their
base or dictionary form (called the lemma). Unlike stemming, which often just chops off
word endings, lemmatization takes into account the context and the part of speech of a word
to convert it into its proper base form.
Stemming: Reduces a word to its root form, which may not always be a real word.
For example, "studies" becomes "studi."
Lemmatization: Reduces a word to its valid dictionary form, called the lemma. For
example, "studies" becomes "study," and "better" becomes "good."
Named Entity Recognition (NER) is a subtask of natural language processing (NLP) that
identifies and classifies named entities in text into predefined categories such as names of
people, organizations, locations, dates, quantities, and more. NER helps in understanding the
context and extracting valuable information from unstructured text.
1. **Tokenization**: NLTK can split text into smaller units, such as words or
sentences. Tokenization is a foundational step in text processing, as it
converts unstructured text into a structured form for further analysis.
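A minimal NLTK sketch of tokenization and NER as described above; the sample sentence is an assumption, and the `punkt`, `averaged_perceptron_tagger`, `maxent_ne_chunker`, and `words` data packages must be downloaded first:

```python
import nltk

text = "Alice joined Google in London in 2021. She works on data science."

print(nltk.sent_tokenize(text))      # sentence tokenization
tokens = nltk.word_tokenize(text)    # word tokenization
print(tokens)

# Named Entity Recognition: POS-tag the tokens, then chunk named entities.
tree = nltk.ne_chunk(nltk.pos_tag(tokens))
for subtree in tree:
    if hasattr(subtree, "label"):    # named-entity subtrees carry a label
        entity = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), entity)   # e.g., PERSON Alice, GPE London
```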
Features of Neo4J
High Performance and Scalability
Neo4j is designed to handle massive amounts of data and
complex queries quickly and efficiently. Its native graph storage
and processing engine ensure high performance and scalability,
even with billions of nodes and relationships.
Cypher Query Language
Neo4j uses Cypher, a powerful and expressive query language
tailored for graph databases. Cypher makes it easy to create,
read, update, and delete nodes and relationships using intuitive,
pattern-based syntax.
2 marks
ans)
1. a) Definition of Data Science and Real-World Applications
Data science is an interdisciplinary field that combines statistics, programming, and domain knowledge to extract insights and knowledge from structured and unstructured data.
**Real-World Applications**: sales forecasting in retail, fraud detection in finance, recommendation systems in e-commerce and streaming, and patient-data analysis in healthcare.