
Code No: A46H2                                                             R20

BVRAJU INSTITUTE OF TECHNOLOGY, NARSAPUR


(UGC - AUTONOMOUS)
III B.Tech II Semester Regular Examinations, June 2023
DATA ANALYTICS USING R
(Information Technology)
PART – A (5x2 = 10 Marks)

1.a) Name two advantages of using NoSQL databases over traditional relational
databases in the context of data analytics.
Ans: Advantages include the ability to:
Handle large volumes of data at high speed with a scale-out architecture.
Store unstructured, semi-structured, or structured data.
Enable easy updates to schemas and fields.
Be developer-friendly.
Take full advantage of the cloud to deliver zero downtime.
b) What are the key verticals or industries that can be categorized under
engineering, financial, and others?
Ans:
Finance: Banks
Engineering: E-Sports, 3D Printing
c) Define the concept of time management and explain its importance in meeting
work requirements effectively.
Ans: Time management is the coordination of tasks and activities to maximize the
effectiveness of an individual's efforts. Essentially, the purpose of time management is to
enable people to get more and better work done in less time.
d) Explain the importance of data architecture in managing data for analysis.

Ans: A good data architecture can standardize how data is stored and potentially reduce
duplication, enabling better quality and more holistic analyses. It also improves data
quality: well-designed data architectures can solve some of the challenges of poorly
managed data lakes, also known as “data swamps”.
e) Why is it important to run descriptive statistics on available data in big data
analytics?
Ans: Descriptive statistics describe, show, and summarize the basic features of a dataset,
presenting the data sample and its measurements through measures such as counts, means,
medians, and spread. Running them first helps analysts understand the data before any
deeper analysis.
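As an illustration, a minimal R sketch of descriptive statistics on a simulated numeric
column (the data frame and column names are hypothetical):

# Hypothetical data: 100 simulated sales values
sales <- data.frame(amount = rnorm(100, mean = 500, sd = 75))

summary(sales$amount)                          # min, quartiles, median, mean, max
sd(sales$amount)                               # standard deviation
var(sales$amount)                              # variance
quantile(sales$amount, probs = c(0.1, 0.9))    # selected percentiles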

PART – B (5x10 = 50 Marks)


2) Discuss the benefits of using NoSQL databases in data analytics along with the
comparison between SQL and NoSQL.
Ans: NoSQL databases are more scalable and provide superior performance compared with
relational (SQL) databases for many analytics workloads.
The NoSQL data model addresses several issues that the relational model is not designed
to address:
Large volumes of structured, semi-structured, and unstructured data
Quick iteration, and frequent code pushes
Object-oriented programming that is easy to use and flexible
Efficient, scale-out architecture instead of expensive, monolithic architecture

OR
3) Walk through the process of connecting R to a NoSQL database, highlighting any
challenges one may encounter and explaining how to overcome them.

Ans: Process of connecting R to a NoSQL database:


Step 1: Create a database in MySQL with the following command:
create database databasename;
Step 2: To connect the database with R we can use RStudio. To download RStudio, visit
rstudio.com.
Step 3: Use the following command to install the MySQL client library in RStudio:
install.packages("RMySQL")
Step 4: Open a connection with dbConnect() and run queries; a sketch is given below. Note
that RMySQL targets a relational database, so for a true NoSQL store such as MongoDB a
package like mongolite plays the same role (also sketched below).
Challenges of NoSQL:
Data modeling and schema design (often cited as one of the biggest challenges with NoSQL
databases).
Query complexity.
Scalability.
Management and administration.
Vendor lock-in.
Data security.
Analytics and BI.
Limited ACID support.
***
4) Discuss the role of data analytics in the production lines of manufacturing
industries. Provide examples of how data analytics can optimize production
processes and improve efficiency.

Ans: Data analytics gives manufacturers insight by identifying patterns, measuring impact,
and predicting outcomes. The ability to analyze equipment failures, production bottlenecks,
supply chain deficiencies, etc., enables better decision-making.
Several industries are undergoing digital transformation to streamline processes, but the
manufacturing sector has been relatively slow to adopt it. Applying data analytics in
manufacturing can now ensure improved decision-making and enhanced performance.
Coupled with the rapid development of artificial intelligence, advanced analytics, robotics,
and emerging IoT-powered sensors and devices, Industry 4.0 offers manufacturers the ability
to gather, store, process, and utilize data in daily operations. Furthermore, business
intelligence and business analytics help to draw insights about potential improvements.
Examples of data analytics used to optimize production (a brief R sketch follows this list)
include:
Monitoring production line performance and output.
Tracking inventory levels and movements.
Analyzing customer preferences and trends.
Performing predictive maintenance on equipment.
Analyzing material costs and identifying potential savings.
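As referenced above, a brief, hypothetical R sketch of monitoring production line output;
the line names, readings, and the 110 units/hour target are invented for illustration:

# Hypothetical hourly output (units) for three production lines over one shift
output <- data.frame(
  line  = rep(c("Line1", "Line2", "Line3"), each = 4),
  units = c(120, 118, 125, 119,
             95,  90,  99,  93,
            130, 128, 131, 129))

# Average output per line, compared with an assumed target of 110 units/hour
avg_by_line  <- tapply(output$units, output$line, mean)
below_target <- avg_by_line[avg_by_line < 110]
print(avg_by_line)
print(below_target)   # lines whose output suggests a bottleneck worth investigating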
OR
5) Explore how data analytics can be applied in smart utilities, such as electricity or
water management. Describe a specific scenario where data analytics can help
optimize resource allocation and improve sustainability.

Ans: S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as
SMART) is a monitoring system included in computer hard disk drives (HDDs) and solid-state
drives (SSDs) that detects and reports on various indicators of drive reliability, with the
intent of enabling the anticipation of hardware failures. When S.M.A.R.T. data indicates a
possible forthcoming drive failure, software running on the host system may notify the user
so stored data can be copied to another storage device, preventing data loss, and the
failing drive can be replaced. (That acronym is specific to storage hardware; the “smart”
in smart utilities, by contrast, refers to the use of connected sensors and advanced data
analytics, as described below.)
Smart Utility Systems is the leading provider of Software-as-a-Service (SaaS) solutions for
Customer Engagement, Mobile Workforce, and Big Data Analytics to the Energy and Utility
sector. It helps utilities improve their operational efficiency and maximize revenue
realization through mobile and cloud technologies.
Smart utilities refer to the use of technology and advanced data analytics to improve the
efficiency, reliability, and sustainability of utility services such as electricity, gas,
and water. A smart utility system has the capability to integrate, control, and monitor the
status of smart meters and sensors.
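A specific scenario: hourly smart water-meter readings can be screened for abnormally high
consumption (for example, a night-time spike that suggests a leak), so the utility can
dispatch repairs sooner and reallocate supply. A minimal R sketch; the readings are
simulated and the 3-standard-deviation threshold is an assumption:

set.seed(1)
# Hypothetical hourly water consumption (litres) for one connection over a week
usage <- c(rnorm(167, mean = 40, sd = 8), 160)   # the last reading simulates a leak

z          <- (usage - mean(usage)) / sd(usage)  # standardize each reading
leak_hours <- which(abs(z) > 3)                  # flag readings beyond 3 standard deviations
print(leak_hours)
print(usage[leak_hours])

The same pattern applies to electricity smart meters, where flagged intervals can feed
demand forecasting and load balancing.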
***
6) Explain the concept of quality and standards adherence in the context of work.
Describe the benefits of maintaining high-quality standards and provide examples
of how it contributes to overall work effectiveness.
Ans: Quality and standards adherence means carrying out work in line with defined quality
criteria.
Efficiency – performing activities in the right way, with minimal wasted effort.
Effectiveness – performing the right activities, i.e. those that actually meet the requirement.

Additionally, some of the benefits of quality standards include the following:


Continuous improvement of quality outcomes.
Efficient adherence to regulatory requirements and compliance.
Reduced process variation and product defects.
Improved worker productivity and safety.
Enhanced customer satisfaction.
OR
7) Discuss the importance of teamwork in achieving work objectives. Provide an
example of a scenario where effective teamwork played a crucial role in meeting
requirements and achieving success.
Ans: Research shows that collaborative problem-solving leads to better outcomes.
People are more likely to take calculated risks that lead to innovation if they have the
support of a team behind them.
Working in a team encourages personal growth, increases job satisfaction, and reduces
stress.
Example: Our team always completed its projects ahead of schedule, with very positive
reviews from clients. Our ability to communicate effectively was what made us such a good
team. People expressed concerns clearly and openly, so we resolved issues as soon as they
arose.
8) a) Discuss the fundamentals of MapReduce
Ans: MapReduce consists of two distinct tasks: Map and Reduce.
As the name MapReduce suggests, the reducer phase takes place after the mapper phase has
been completed.
So, the first is the map job, where a block of data is read and processed to produce
key-value pairs as intermediate outputs.
The output of a Mapper or map job (key-value pairs) is input to the Reducer.
The reducer receives the key-value pairs from multiple map jobs.
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.
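Hadoop implements MapReduce in Java, but the two phases can be illustrated conceptually in
plain R with a word-count example (this is only a sketch of the idea, not Hadoop code; the
input lines are made up):

lines <- c("big data needs big ideas", "data drives ideas")

# Map phase: each input line is turned into (word, 1) key-value pairs
map_out <- lapply(lines, function(line) {
  words <- unlist(strsplit(line, " "))
  data.frame(key = words, value = 1)
})

# Shuffle and Reduce phase: group the pairs by key and sum the values per key
pairs   <- do.call(rbind, map_out)
reduced <- tapply(pairs$value, pairs$key, sum)
print(reduced)   # word counts, e.g. big = 2, data = 2, ideas = 2

In a real cluster the map outputs would be partitioned and shuffled across nodes before the
reduce step; here tapply() stands in for that grouping.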
b) Explain the architecture of Hadoop with a neat sketch.
Ans: Hadoop is a framework written in Java that utilizes a large cluster of commodity
hardware to maintain and store big data. Hadoop works on the MapReduce programming model
that was introduced by Google. Today many big-brand companies use Hadoop in their
organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, and eBay. The Hadoop
architecture mainly consists of four components:
MapReduce
HDFS (Hadoop Distributed File System)
YARN (Yet Another Resource Negotiator)
Common Utilities or Hadoop Common

1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and on the
   basis of that HDFS was developed. Files are broken into blocks and stored on nodes
   across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and managing the
   cluster.
3. MapReduce: This is a framework which helps Java programs to perform parallel
   computation on data using key-value pairs. The Map task takes input data and converts
   it into a data set which can be computed as key-value pairs. The output of the Map
   task is consumed by the Reduce task, and the output of the reducer gives the desired
   result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the
   other Hadoop modules.
OR
9) Explain the concept of exporting data to the cloud, such as AWS or Rackspace.
Discuss the benefits and challenges of utilizing cloud services for data storage
and analysis.

Ans: Export all the data onto the cloud, for example Amazon Web Services S3.
We usually export data to the cloud for purposes such as safety, multiple access, and
real-time simultaneous analysis.
There are various vendors that provide cloud storage services; here we discuss Amazon S3.
An Amazon S3 export transfers individual objects from Amazon S3 buckets to your device,
creating one file for each object. You can export from more than one bucket, and you can
specify which files to export using manifest file options.
Export Job Process
You create an export manifest file that specifies how to load data onto your device,
including an encryption PIN code or password and details such as the name of the
bucket that contains the data to export. If you are going to mail multiple storage
devices to AWS, you must create a manifest file for each storage device.
You initiate an export job by sending a CreateJob request that includes the manifest
file. You must submit a separate job request for each device. Your job expires after
30 days. If you do not send a device, there is no charge.
You can send a CreateJob request using the AWS Import/Export Tool, the AWS
Command Line Interface (CLI), the AWS SDK for Java, or the AWS REST API.
The easiest method is the AWS Import/Export Tool. For details, see
Sending a CreateJob Request Using the AWS Import/Export Web Service Tool
Sending a CreateJob Request Using the AWS SDK for Java
Sending a CreateJob Request Using the REST API
AWS Import/Export sends a response that includes a job ID, a signature value, and
information on how to print your pre-paid shipping label. The response also saves a
SIGNATURE file to your computer.
You will need this information in subsequent steps.
You copy the SIGNATURE file to the root directory of your storage device. You can
use the file AWS sent or copy the signature value from the response into a new text
file named SIGNATURE. The file name must be SIGNATURE and it must be in the
device's root directory.
Each device you send must include the unique SIGNATURE file for that device and
that JOBID. AWS Import/Export validates the SIGNATURE file on your storage
device before starting the data load. If the SIGNATURE file is missing or invalid (if, for
instance, it is associated with a different job request), AWS Import/Export will not
perform the data load and will return your storage device.
Generate, print, and attach the pre-paid shipping label to the exterior of your package.
See Shipping Your Storage Device for information on how to get your pre-paid
shipping label.
You ship the device and cables to AWS through UPS. Make sure to include your job ID on
the shipping label and on the device you are shipping; otherwise, your job might be
delayed. Your job expires after 30 days. If AWS receives your package after your job
expires, the device will be returned, and you will only be charged for the shipping fees,
if any.
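For routine exports that do not require shipping a physical device, data can also be
written to S3 directly from R. A minimal sketch using the aws.s3 package; the bucket name,
object key, and credentials are placeholders:

install.packages("aws.s3")   # one-time installation of the S3 client package
library(aws.s3)

# Credentials and region are placeholders for an existing AWS account
Sys.setenv("AWS_ACCESS_KEY_ID"     = "YOUR_KEY_ID",
           "AWS_SECRET_ACCESS_KEY" = "YOUR_SECRET_KEY",
           "AWS_DEFAULT_REGION"    = "us-east-1")

# Write an analysis result locally, then upload it to a hypothetical bucket
write.csv(mtcars, "analysis_data.csv", row.names = FALSE)
put_object(file   = "analysis_data.csv",
           object = "exports/analysis_data.csv",
           bucket = "my-analytics-bucket")

# Later, the same object can be pulled back for further analysis
obj <- get_object("exports/analysis_data.csv", bucket = "my-analytics-bucket")

Benefits of this route include durability, shared access, and pay-as-you-go storage; the
main challenges are credential management, network transfer time for large datasets, and
ongoing storage costs.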
***
10) Explain about hypothesis testing and how to determine multiple analytical
models?

Ans: Hypothesis Testing: Hypothesis testing is the act of testing a hypothesis or a
supposition in relation to a statistical parameter. Analysts implement hypothesis testing in
order to test if a hypothesis is plausible or not. In data science and statistics, hypothesis
testing is an important step as it involves the verification of an assumption that could help
develop a statistical parameter. For instance, a researcher establishes a hypothesis assuming
that the average of all odd numbers is an even number. In order to find the plausibility of
this hypothesis, the researcher will have to test the hypothesis using hypothesis testing
methods. Unlike a hypothesis that is ‘supposed’ to stand true on the basis of little or no
evidence, hypothesis testing is required to have plausible evidence in order to establish that
a statistical hypothesis is true.
Types of Hypothesis:
1) Alternate Hypothesis
2) Null Hypothesis
How to perform Hypothesis Testing:
 Significance Level - The significance level (commonly 0.05) is the threshold below which a
result is considered statistically significant; it represents the probability of rejecting
the null hypothesis when it is actually true.
 Testing Method - The testing method involves a type of sampling-distribution and a
test statistic that leads to hypothesis testing. There are a number of testing methods that
can assist in the analysis of data samples.
 Test statistic - Test statistic is a numerical summary of a data set that can be used to
perform hypothesis testing.
 P-value - The p-value is the probability, assuming the null hypothesis is true, of
obtaining a test statistic at least as extreme as the one observed; a small p-value is
evidence against the null hypothesis (see the R sketch below).
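A minimal R illustration of these steps, using a two-sample t-test on simulated data with an
assumed 0.05 significance level:

set.seed(42)
group_a <- rnorm(30, mean = 50, sd = 5)   # e.g. scores under method A (simulated)
group_b <- rnorm(30, mean = 53, sd = 5)   # e.g. scores under method B (simulated)

# Null hypothesis: the two population means are equal
result <- t.test(group_a, group_b)
result$statistic   # the test statistic
result$p.value     # the p-value

if (result$p.value < 0.05) {
  print("Reject the null hypothesis at the 5% significance level")
} else {
  print("Fail to reject the null hypothesis at the 5% significance level")
}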
OR

11. a) Discuss the concept of training a model on a 2/3 sample data using
statistical and machine learning algorithms
Ans: When training a machine learning model, it is common to split the available data into
two sets: a training set and a test set. The training set is used to fit the model, while
the test set is used to evaluate its performance on new, unseen data. However, sometimes
the amount of available data is limited, and it may not be advisable to hold out a large
fraction of it purely for testing. In such cases a simple 2/3 training, 1/3 test split can
still be used, though a technique known as cross-validation is often a more reliable way to
estimate performance. The typical workflow is outlined below, followed by a short R sketch.
1. Splitting the data: To train the model on 2/3 of the data, you would split the
data randomly into two sets: a training set containing 2/3 of the data and a test set
containing the remaining 1/3. The training set is used to train the model, while the
test set is used to evaluate its performance. However, keep in mind that using only
1/3 of the data for testing may result in a high variance in the performance metric
and may not accurately represent the true performance of the model.
2. Choosing the algorithm: The choice of algorithm depends on the type of
problem you are trying to solve. For example, if you are trying to predict a
continuous variable, you might choose a linear regression or a decision tree
algorithm. If you are trying to predict a categorical variable, you might choose a
logistic regression or a support vector machine algorithm. It is important to select
an algorithm that is appropriate for your problem and to have a good understanding
of how it works.
3. Preprocessing the data: Before training the model, you may need to preprocess
the data by handling missing values, scaling the features, or encoding categorical
variables. This step can help to improve the performance of the model and ensure
that it can handle the data appropriately.
4. Training the model: Once you have chosen an algorithm and preprocessed the
data, you can train the model using the training set. This involves fitting the model
to the data and adjusting its parameters to minimize the error between the predicted
output and the true output. The goal is to find the best parameters that generalize
well to new, unseen data.
5. Evaluating the performance: After training the model, you can evaluate its
performance using the test set. This involves making predictions on the test set and
comparing them to the true values. Common performance metrics include
accuracy, precision, recall, F1 score, and mean squared error. It is important to
report the performance metric on both the training set and the test set to ensure that
the model is not overfitting to the training data.
6. Fine-tuning the model: If the performance of the model is not satisfactory, you
may need to fine-tune the algorithm or adjust the hyperparameters. This involves
repeating the training and testing process with different settings until you find the
best configuration. Cross-validation can be used to optimize the hyperparameters and
prevent overfitting.
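A short R sketch of this workflow, using a 2/3 train, 1/3 test split and a linear
regression on the built-in mtcars data; the choice of model and of mean squared error as
the metric is only for illustration:

set.seed(123)
data(mtcars)

# 1. Split: 2/3 of the rows for training, the remaining 1/3 for testing
n         <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(2 * n / 3))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# 2-4. Choose and train a model: here, linear regression of mpg on weight and horsepower
model <- lm(mpg ~ wt + hp, data = train)

# 5. Evaluate on the held-out third using mean squared error
pred <- predict(model, newdata = test)
mse  <- mean((test$mpg - pred)^2)
print(mse)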

b) Briefly explain any 2 methods for outlier elimination

Ans: The most popular outlier detection methods are the Z-score, the IQR (Interquartile
Range), the Mahalanobis distance, DBSCAN (Density-Based Spatial Clustering of Applications
with Noise), the Local Outlier Factor (LOF), and the One-Class SVM (Support Vector
Machine). A short R sketch of the first two follows below.
Z-score: The Z-score is also called a standard score. This value helps to understand how
far a data point is from the mean. After setting a threshold value, one can use the
z-scores of data points to define the outliers.
IQR (Interquartile Range): The interquartile range approach to finding outliers is the most
commonly used and most trusted approach in the research field. Points lying below
Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers.
Percentile method: This technique works by setting a particular threshold value, which is
decided based on the problem statement. When the outliers are removed by capping them at
the threshold, that particular method is known as Winsorization. Here, we always maintain
symmetry on both sides, meaning if we remove 1% from the right, 1% is also removed from
the left.
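A brief R sketch of the Z-score and IQR methods on a simulated vector; the
3-standard-deviation and 1.5*IQR thresholds are the commonly used defaults:

set.seed(7)
x <- c(rnorm(100, mean = 50, sd = 5), 95, 10)   # two injected outliers

# Z-score method: drop points more than 3 standard deviations from the mean
z         <- (x - mean(x)) / sd(x)
x_z_clean <- x[abs(z) <= 3]

# IQR method: drop points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q           <- quantile(x, probs = c(0.25, 0.75))
iqr         <- q[2] - q[1]
x_iqr_clean <- x[x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr]

length(x) - length(x_z_clean)     # points removed by the z-score rule
length(x) - length(x_iqr_clean)   # points removed by the IQR rule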
***
