Bvraju Institute of Technology, Narsapur: Code No: A46H2
1.a) Name two advantages of using NoSQL databases over traditional relational
databases in the context of data analytics.
Ans: Handle large volumes of data at high speed with a scale-out architecture.
Store unstructured, semi-structured, or structured data.
Enable easy updates to schemas and fields.
Be developer-friendly.
Take full advantage of the cloud to deliver zero downtime.
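As an illustrative sketch (not part of the original answer), the schema-flexibility point can be seen with a document store such as MongoDB: records with different fields can live in the same collection. The connection URI, database, and collection names below are assumptions, and a local MongoDB server is assumed to be running.

# Hedged sketch: heterogeneous documents in one MongoDB collection (schema flexibility).
# The URI, database, and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # assumed local instance
events = client["analytics_db"]["events"]            # hypothetical collection

# Structured, semi-structured, and sparse records coexist without a fixed schema.
events.insert_many([
    {"user": "u1", "action": "click", "ts": "2024-01-01T10:00:00"},
    {"user": "u2", "action": "purchase", "amount": 49.99, "items": ["book", "pen"]},
    {"sensor": "s7", "reading": {"temp": 21.5, "humidity": 0.4}},
])
print(events.count_documents({}))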
b) What are the key verticals or industries that can be categorized under
engineering, financial, and others?
Ans: Financial: banks. Engineering: E-Sports, 3D printing.
c) Define the concept of time management and explain its importance in meeting
work requirements effectively.
Ans: Time management is the coordination of tasks and activities to maximize the
effectiveness of an individual's efforts. Essentially, the purpose of time management is to
enable people to get more and better work done in less time.
d) Explain the importance of data architecture in managing data for analysis.
Ans: A good data architecture standardizes how data is stored and can reduce
duplication, enabling better-quality and more holistic analyses. Improving data quality:
well-designed data architectures solve some of the challenges of poorly managed data
lakes, also known as "data swamps".
e) Why is it important to run descriptive statistics on available data in big data
analytics?
Ans: Descriptive statistics describe, show, and summarize the basic features of a dataset,
presenting the data sample and its measurements in a concise summary. Running them
first helps analysts understand the data before deeper analysis.
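A minimal sketch (illustrative, not part of the original answer) of running descriptive statistics in Python with pandas; the column names and values are made up.

# Hedged sketch: descriptive statistics on a small, made-up dataset with pandas.
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 31, 45, 29, 52, 38],
    "income": [32000, 45000, 61000, 39000, 72000, 50000],
})

print(df.describe())         # count, mean, std, min, quartiles, max per column
print(df["income"].median())
print(df.corr())             # pairwise correlations between numeric columns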
OR
3) Walk through the process of connecting R to a NoSQL database, highlighting
any challenges one may encounter and explaining how to overcome them.
Ans:
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS
was developed on that basis. Files are broken into blocks and stored on nodes across
the distributed architecture.
2. YARN: Yet Another Resource Negotiator; it is used for job scheduling and for
managing the cluster.
3. MapReduce: This is a framework that helps Java programs do parallel computation
on data using key-value pairs. The Map task takes input data and converts it into a
dataset that can be computed on as key-value pairs. The output of the Map task is
consumed by the Reduce task, and the output of the Reducer gives the desired result
(see the sketch after this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by
other Hadoop modules.
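The key-value flow described in item 3 can be illustrated with a small word-count sketch. This is plain Python rather than actual Hadoop code, and the input lines are made up; it only mirrors the map, shuffle, and reduce phases conceptually.

# Hedged sketch: the MapReduce idea (map to key-value pairs, then reduce by key),
# shown in plain Python instead of Hadoop Java code.
from collections import defaultdict

lines = ["big data on hadoop", "hadoop stores big data"]

# Map phase: emit (word, 1) pairs from each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: group values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the counts for each word.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'big': 2, 'data': 2, 'on': 1, 'hadoop': 2, 'stores': 1}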
OR
9) Explain the concept of exporting data to the cloud, such as AWS or Rackspace.
Discuss the benefits and challenges of utilizing cloud services for data storage
and analysis.
Ans: Data can be exported to the cloud, for example to Amazon Web Services S3.
We usually export data to the cloud for reasons such as safety, multi-user access, and
real-time simultaneous analysis.
Various vendors provide cloud storage services; here we discuss Amazon S3.
An Amazon S3 export transfers individual objects from Amazon S3 buckets to your
device, creating one file for each object. You can export from more than one bucket,
and you can specify which files to export using manifest file options.
Export Job Process
You create an export manifest file that specifies how to load data onto your device,
including an encryption PIN code or password and details such as the name of the
bucket that contains the data to export. If you are going to mail AWS multiple storage
devices, you must create a manifest file for each storage device.
You initiate an export job by sending a CreateJob request that includes the manifest
file. You must submit a separate job request for each device. Your job expires after
30 days. If you do not send a device, there is no charge.
You can send a CreateJob request using the AWS Import/Export Tool, the AWS
Command Line Interface (CLI), the AWS SDK for Java, or the AWS REST API.
The easiest method is the AWS Import/Export Tool. For details, see
Sending a CreateJob Request Using the AWS Import/Export Web Service Tool
Sending a CreateJob Request Using the AWS SDK for Java
Sending a CreateJob Request Using the REST API
AWS Import/Export sends a response that includes a job ID, a signature value, and
information on how to print your pre-paid shipping label. The response also saves a
SIGNATURE file to your computer.
You will need this information in subsequent steps.
You copy the SIGNATURE file to the root directory of your storage device. You can
use the file AWS sent or copy the signature value from the response into a new text
file named SIGNATURE. The file name must be SIGNATURE and it must be in the
device's root directory.
Each device you send must include the unique SIGNATURE file for that device and
that JOBID. AWS Import/Export validates the SIGNATURE file on your storage
device before starting the data load. If the SIGNATURE file is missing or invalid (for
instance, if it is associated with a different job request), AWS Import/Export will not
perform the data load and will return your storage device.
Generate, print, and attach the pre-paid shipping label to the exterior of your package.
See Shipping Your Storage Device for information on how to get your pre-paid
shipping label.
You ship the device and cables to AWS through UPS. Make sure to include your job
ID on the shipping label and on the device you are shipping. Otherwise, your job
might be delayed. Your job expires after 30 days. If AWS receives your package after
your job expires, your device will be returned. You will only be charged for the
shipping fees, if any.
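The device-based Import/Export workflow above suits very large datasets; for smaller exports, data is typically moved to S3 directly over the network with an AWS SDK. A minimal illustrative sketch with boto3 (Python), assuming credentials are already configured and using placeholder bucket, key, and file names:

# Hedged sketch: exporting a local analysis file to Amazon S3 with boto3.
# Assumes AWS credentials are configured (e.g. via environment variables);
# the bucket, key, and file paths are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="results/sales_summary.csv",   # local file to export
    Bucket="my-analytics-exports",           # hypothetical bucket
    Key="2024/sales_summary.csv",            # object key inside the bucket
)

# Later, the same object can be pulled back down for further analysis.
s3.download_file("my-analytics-exports", "2024/sales_summary.csv",
                 "downloads/sales_summary.csv")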
***
10) Explain hypothesis testing and how to determine multiple analytical models.
11. a) Discuss the concept of training a model on a 2/3 data sample using
statistical and machine learning algorithms.
Ans: When training a machine learning model, it is common to split the available
data into two sets: a training set and a test set. The training set is used to fit the
model, while the test set is used to evaluate its performance on new, unseen data.
However, when the amount of available data is limited, it may not be advisable to hold
out a large fraction of it for testing. In such cases a single train/test split still works,
but it can be more efficient to use a technique known as cross-validation.
1. Splitting the data: To train the model on 2/3 of the data, you split the data
randomly into two sets: a training set containing 2/3 of the data and a test set
containing the remaining 1/3 (a sketch of this split appears after this list). The training
set is used to train the model, while the test set is used to evaluate its performance.
However, keep in mind that using only 1/3 of the data for testing may result in high
variance in the performance metric and may not accurately represent the true
performance of the model.
2. Choosing the algorithm: The choice of algorithm depends on the type of
problem you are trying to solve. For example, if you are trying to predict a
continuous variable, you might choose a linear regression or a decision tree
algorithm. If you are trying to predict a categorical variable, you might choose a
logistic regression or a support vector machine algorithm. It is important to select
an algorithm that is appropriate for your problem and to have a good understanding
of how it works.
3. Preprocessing the data: Before training the model, you may need to preprocess
the data by handling missing values, scaling the features, or encoding categorical
variables. This step can help to improve the performance of the model and ensure
that it can handle the data appropriately.
4. Training the model: Once you have chosen an algorithm and preprocessed the
data, you can train the model using the training set. This involves fitting the model
to the data and adjusting its parameters to minimize the error between the predicted
output and the true output. The goal is to find the best parameters that generalize
well to new, unseen data.
5. Evaluating the performance: After training the model, you can evaluate its
performance using the test set. This involves making predictions on the test set and
comparing them to the true values. Common performance metrics include
accuracy, precision, recall, F1 score, and mean squared error. It is important to
report the performance metric on both the training set and the test set to ensure that
the model is not overfitting to the training data.
6. Fine-tuning the model: If the performance of the model is not satisfactory, you
may need to fine-tune the algorithm or adjust the hyperparameters. This involves
repeating the training and testing process with different settings until you find the
best configuration. Cross-validation can be used to optimize the hyperparameters
and prevent overfitting.
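The following is a minimal sketch of the 2/3 train, 1/3 test workflow described in steps 1-6, using scikit-learn on a synthetic dataset. The choice of logistic regression is just one of the options mentioned in step 2, and all names and values are illustrative.

# Hedged sketch: 2/3 train / 1/3 test split, model training, evaluation,
# and cross-validation with scikit-learn on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Step 1: random split, 2/3 for training and 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

# Steps 2 and 4: choose an algorithm and fit it on the training set
# (step 3, preprocessing, is skipped because the synthetic data is already numeric).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))

# Step 6: cross-validation on the training set to check stability
# before any hyperparameter tuning.
print("5-fold CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())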
Ans: The most popular outlier detection methods are the Z-score, IQR (Interquartile
Range), Mahalanobis distance, DBSCAN (Density-Based Spatial Clustering of
Applications with Noise), Local Outlier Factor (LOF), and One-Class SVM
(Support Vector Machine).
Z-score: The Z-score is also called a standard score. It tells how far a data point lies
from the mean, in units of standard deviation. After setting a threshold value, the
Z-scores of the data points can be used to flag outliers.
IQR (Interquartile Range): The interquartile range approach to finding outliers is the
most commonly used and most trusted approach in research.
Percentile method: This technique works by setting a particular percentile threshold,
which is decided based on the problem statement.
When the outliers are capped rather than removed, the method is known as
Winsorization.
Here, we always maintain symmetry on both sides: if 1% is trimmed from the right tail,
1% is also trimmed from the left (a short sketch of these rules follows).
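A short illustrative sketch of the Z-score and IQR rules, plus Winsorization by capping, on a small made-up sample. The Z-score threshold of 2.5 and the 10th/90th percentile caps are choices made for this tiny example, not fixed rules.

# Hedged sketch: flagging outliers with the Z-score and IQR rules
# on a small, made-up numeric sample.
import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])  # 95 is an obvious outlier

# Z-score rule: |z| above a chosen threshold marks an outlier
# (commonly 2.5 or 3; 2.5 is used here because the sample is tiny).
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 2.5])

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers:", data[(data < lower) | (data > upper)])

# Winsorization: cap extremes symmetrically (here at the 10th and 90th percentiles)
# instead of removing them.
capped = np.clip(data, np.percentile(data, 10), np.percentile(data, 90))
print("Winsorized sample:", capped)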
***