Module 1 ITE Elective 1 New - Curriculum
Statistics:
Statistics is the most critical unit in data science. It is the science of collecting and
analyzing numerical data in large quantities to get useful insights.
Machine Learning:
Machine Learning explores the building and study of algorithms which learn to make predictions
about unforeseen/future data.
Deep Learning:
Deep Learning is a newer area of machine learning research in which the algorithm selects the
analysis model to follow.
Data analytics, on the other hand, focuses on processing and performing statistical analysis on
existing datasets. Analysts concentrate on creating methods to capture, process, and organize data to
uncover actionable insights for current problems, and establishing the best way to present this data.
More simply, the field of data analytics is directed toward solving problems for questions we know
we don’t know the answers to. More importantly, it’s based on producing results that can lead to
immediate improvements.
Data analytics also encompasses a few different branches of broader statistics and analysis which
help combine diverse sources of data and locate connections while simplifying the results.
2. Data Preparation:
Data can have lots of inconsistencies, such as missing values, blank columns, and incorrect data
formats, which need to be cleaned. You need to process, explore, and condition data before modeling.
The cleaner your data, the better your predictions.
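The cleaning steps named above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the records, the field names (age, city, notes), and the choice to fill missing ages with the mean are all assumptions made for the example, not a prescribed method.

```python
# Hypothetical raw records; the field names are made up for illustration.
raw_rows = [
    {"age": "34", "city": "Manila", "notes": ""},
    {"age": "", "city": "Cebu", "notes": ""},
    {"age": "30", "city": " davao ", "notes": ""},
    {"age": "n/a", "city": "Manila", "notes": ""},
]


def clean_rows(rows):
    """Drop blank columns, fix formats, and fill missing ages with the mean."""
    # 1. Blank columns: keep only fields that hold a value in at least one row.
    used = [k for k in rows[0] if any(r[k].strip() for r in rows)]

    cleaned = []
    for r in rows:
        row = {k: r[k].strip() for k in used}
        # 2. Incorrect format: ages should be integers; anything else is missing.
        row["age"] = int(row["age"]) if row["age"].isdigit() else None
        # 3. Inconsistent text: normalize the capitalization of city names.
        row["city"] = row["city"].title()
        cleaned.append(row)

    # 4. Missing values: fill with the mean of the known ages (one possible choice).
    known = [r["age"] for r in cleaned if r["age"] is not None]
    mean_age = round(sum(known) / len(known))
    for r in cleaned:
        if r["age"] is None:
            r["age"] = mean_age
    return cleaned


cleaned_rows = clean_rows(raw_rows)
for row in cleaned_rows:
    print(row)
```

In practice, whether to fill, drop, or flag a missing value depends on the dataset and the model that will use it.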
3. Model Planning:
In this stage, you need to determine the method and technique to draw the relation between
input variables. Planning for a model is performed by using different statistical formulas and
4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset
into training and testing sets. Techniques like association, classification, and clustering are applied
to the training dataset. The model, once prepared, is tested against the "testing" dataset.
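As a rough illustration of this step, the sketch below splits a small dataset into training and testing portions, "trains" a very simple classifier (nearest neighbor), and checks it against the testing portion. The data, the 80/20 split, the fixed seed, and the function names are assumptions chosen for the example.

```python
import random

# Hypothetical labeled data: (feature_1, feature_2, label).
data = [
    (1.0, 1.2, "low"), (0.8, 1.0, "low"), (1.3, 0.9, "low"),
    (1.1, 1.4, "low"), (0.9, 0.7, "low"),
    (8.0, 8.2, "high"), (7.7, 8.1, "high"), (8.3, 7.9, "high"),
    (8.1, 8.4, "high"), (7.9, 7.6, "high"),
]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict(train_rows, x1, x2):
    """Classify a point with the label of its nearest training example."""
    nearest = min(train_rows, key=lambda r: (r[0] - x1) ** 2 + (r[1] - x2) ** 2)
    return nearest[2]


train, test = train_test_split(data)

# Evaluate only on rows the model has not seen during training.
correct = sum(predict(train, x1, x2) == label for x1, x2, label in test)
accuracy = correct / len(test)
print(f"train={len(train)} test={len(test)} accuracy={accuracy:.2f}")
```

Keeping the testing rows out of training is what makes the measured accuracy a fair estimate of how the model behaves on new data.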
5. Operationalize:
In this stage, you deliver the final baselined model with reports, code, and technical documents.
The model is deployed into a real-time production environment after thorough testing.
6. Communicate Results
In this stage, the key findings are communicated to all stakeholders. This helps you to decide if
the results of the project are a success or a failure based on the inputs from the model.
2. Data Engineer:
Role: A data engineer works with large amounts of data. He or she develops,
constructs, tests, and maintains architectures such as large-scale processing systems and
databases.
Languages: SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++, and Perl
3. Data Analyst:
Role: A data analyst is responsible for mining vast amounts of data. He or she will look for
relationships, patterns, trends in data. Later he or she will deliver compelling reporting
and visualization for analyzing the data to take the most viable business decisions.
Languages: R, Python, HTML, JS, C, C++, SQL
4. Statistician:
Role: The statistician collects, analyzes, and understands qualitative and quantitative data by
using statistical theories and methods.
Languages: SQL, R, Matlab, Tableau, Python, Perl, Spark, and Hive
5. Data Administrator:
Role: The data administrator ensures that the database is accessible to all relevant users. He
or she also makes sure that it is performing correctly and is being kept safe from hacking.
Languages: Ruby on Rails, SQL, Java, C#, and Python
Techniques
Which mathematical and statistical techniques do you need to learn for data science? A number of
techniques are used in data science for data collection, modification, storage, analysis,
insights, and representation. Data analysts and scientists mostly work with the following
statistical analysis techniques:
Probability and Statistics
Distribution
Regression analysis
Descriptive statistics
Inferential statistics
Non-Parametric statistics
Hypothesis testing
Linear Regression
Logistic Regression
Neural Networks
K-Means clustering
Decision Trees
Although the list doesn’t end here, if you have studied statistics and mathematics, you will have an
idea of how the theories and techniques of sampling and correlation work. This is particularly useful
when you work as a data scientist and need to draw conclusions, research patterns, extract targeted
insights, etc. (Sivarajah, Kamal, Irani, and Weerakkody, 2017)
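Two of the listed techniques, descriptive statistics and linear regression, can be sketched with Python's standard library. The sample values (hours studied and exam scores) are made up for illustration, and fit_line is a hypothetical helper that applies the ordinary least squares formulas for a single input variable.

```python
import statistics

# Hypothetical sample: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [3, 5, 7, 9, 11]

# Descriptive statistics summarize a single variable.
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
stdev_score = statistics.stdev(scores)  # sample standard deviation


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # prediction for an unseen x value
print(mean_score, median_score, round(stdev_score, 3))
print(slope, intercept, predicted)
```

Descriptive statistics describe the data you already have, while the fitted line is used to predict values for inputs that were not in the sample.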
Tools
Let us start exploring the tools which are used to work on data in different processes. As mentioned
earlier, data does go through a lot of processes in which it is collected, stored, worked upon, and
analyzed.
For your easy understanding, the tools defined here are categorized according to their processes.
The first process is data collection. Although data can be collected through various methods, which
include online surveys, interviews, forms, etc., the information gathered has to be transformed into a
readable form for the data analyst to work on. The following tools can be used for data collection.
Semantria
Semantria is a cloud-based tool that extracts data and information by analyzing
text and the sentiment in it. It is a high-end NLP (natural language processing) based
tool that can detect the sentiment toward specific elements based on the language used
(sounds like magic? No, it is science!).
Trackur
It is yet another tool that collects data, especially on social media platforms, by tracking
the feedback on brands and products. It also performs sentiment analysis. It is a
monitoring tool and can be of great value to marketing companies.
Today, many other apps use similar text/semantic analysis and content management, e.g., Open
Text and Opinion Crawl.
Apache Hadoop
It is a software framework that handles the storage and computation of huge data
volumes. It provides a layered structure that distributes data storage across clusters of
computers for easy processing of big data.
Apache Cassandra
This tool is a free, open-source platform. It uses CQL (Cassandra Query Language), an
SQL-like language, to communicate with the database. It can provide swift availability
of data stored on various servers.
MongoDB
It is a database that is document-oriented and also free to use. It is available on multiple
platforms like Windows, Solaris, and Linux. It is very easy to learn and is reliable.
Similar data storage platforms are CouchDB, Apache Ignite, and Oracle NoSQL Database.
Octoparse
It is a web scraping tool available in both free and paid versions. It gives data as output
in structured spreadsheets, which are readable and easy to use for further operations
on it. It can extract phone numbers, IP addresses, and email IDs along with different
data from the websites.
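To give a feel for the kind of extraction such a tool performs, here is a minimal sketch that pulls email addresses and IP addresses out of a block of text using regular expressions. It is not how Octoparse works internally; the page text, the simplified patterns, and the extract_contacts helper are all assumptions made for the example.

```python
import re

# Simplified patterns for illustration; real-world formats have more edge cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical page text standing in for downloaded website content.
page_text = """
Contact sales@example.com or support@example.org for help.
Our status server is at 192.168.0.10 and the backup is 10.0.0.7.
"""


def extract_contacts(text):
    """Return (kind, value) rows, a structure that maps easily to a spreadsheet."""
    rows = [("email", match) for match in EMAIL_RE.findall(text)]
    rows += [("ip", match) for match in IPV4_RE.findall(text)]
    return rows


contact_rows = extract_contacts(page_text)
for kind, value in contact_rows:
    print(kind, value)
```

Each row has the same two fields, which is what makes the output "structured" and ready for further operations.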
Content Grabber
It is also a web scraping tool but comes with advanced features such as debugging and error
handling. It can extract data from almost every website and provide structured data as
output in user-preferred formats.
Data Cleaner
Data Cleaner works with the Hadoop database and is a very powerful data indexing tool.
It improves the quality of data by detecting duplicates and merging them into one
record. It can also find missing patterns and specific data groups.
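The idea of merging duplicates into one record can be sketched as follows. This is an illustrative example only and does not reflect Data Cleaner's actual implementation; the contact records, the choice of email as the matching key, and the deduplicate function are assumptions.

```python
def deduplicate(records, key_fields):
    """Merge records that share the same normalized key into one record."""
    merged = {}
    for rec in records:
        # Normalize the key so that case and stray spaces do not hide duplicates.
        key = tuple(rec[f].strip().lower() for f in key_fields)
        if key not in merged:
            merged[key] = dict(rec)
        else:
            # Fill fields that are empty in the kept record from the duplicate.
            for field, value in rec.items():
                if not merged[key][field] and value:
                    merged[key][field] = value
    return list(merged.values())


# Hypothetical contact list with one duplicated person.
contacts = [
    {"email": "ana@example.com", "name": "Ana Cruz", "phone": ""},
    {"email": "ANA@example.com ", "name": "Ana Cruz", "phone": "555-0101"},
    {"email": "ben@example.com", "name": "Ben Reyes", "phone": "555-0102"},
]

unique_contacts = deduplicate(contacts, ["email"])
for contact in unique_contacts:
    print(contact)
```

The first occurrence of each record is kept, and any fields it is missing are filled in from later duplicates.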
OpenRefine
This refining tool deals with messy data. It cleans the data before transforming it into
another form. It provides data access with speed and ease.
R
The R programming language is widely used to develop software for statistical
computing and graphics. It supports various platforms like Windows, macOS, and
Linux. It is widely used by data analysts, statisticians, and researchers.
Apache Spark
Apache Spark is a powerful analytics engine that provides real-time analysis and
processes data in mini-batches, micro-batches, and streams. It is
productive because it provides highly interactive workflows.
Python
Python is a powerful, high-level programming language that has been
around for quite a while. It was originally used for application development, but it has
since gained new tools and libraries aimed especially at data science. It can write
output files in CSV format, which can then be used as spreadsheets.
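Writing CSV output can be sketched with Python's built-in csv module. The result rows and column names below are hypothetical, and an in-memory buffer is used instead of a real file so that the example is self-contained.

```python
import csv
import io

# Hypothetical analysis results to export.
results = [
    {"product": "A", "units_sold": 120},
    {"product": "B", "units_sold": 95},
]

# io.StringIO stands in for a file here; with a real file you would use
# open("results.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["product", "units_sold"])
writer.writeheader()       # first row: the column names
writer.writerows(results)  # one row per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way can be opened directly in a spreadsheet program.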
Similar data analysis tools include Apache Storm, SAS, Flink, and Hive.
Python
Python, as mentioned above, is a powerful and general-purpose programming language
that also provides data visualization. It is packed with vast graphical libraries to support
the graphical representation of a wide variety of data.
Tableau
Having a very large consumer market, Tableau is referred to as the grandmaster of all
visualization software by Forbes. It is commercial software that can be integrated
with databases, is easy to use, and furnishes interactive data visualization in the form
of bars, charts, and maps.
Similar popular data visualization apps and tools are Datawrapper, Qlik, and Gephi, which
also support CSV files as data input.