
ITE01 – IT ELECTIVE 1 (Foundations of Data Science) Module 1

Key concepts in big data, data science and data analytics.


What is Big Data?
Big data has become a major component of the tech world today thanks to the actionable insights and results businesses can glean from it. However, creating such large datasets also requires understanding them and having the proper tools on hand to parse through them and uncover the right information.
To better comprehend big data, the fields of data science and analytics have gone from being largely confined to academia to becoming integral elements of Business Intelligence (BI) and big data analytics tools.

What is Data Science?


Data Science is the area of study that involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes. It helps you discover hidden patterns in raw data. The term Data Science emerged with the evolution of mathematical statistics, data analysis, and big data (see Figure 1).
Furthermore, Data Science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data. It enables you to translate a business problem into a research project and then translate the results back into a practical solution (see Figure 2).

Figure 1: Data Science Components

Statistics:
Statistics is the most critical component of data science. It is the science of collecting and analyzing numerical data in large quantities to obtain useful insights.

Prepared by: MR. ARNALDY D. FORTIN, MBA, MCS Page 1 of 10


ITE01 – IT ELECTIVE 1 (Foundations of Data Science) Module 1
Visualization:
Visualization techniques help you present huge amounts of data as easy-to-understand, digestible visuals.

Machine Learning:
Machine Learning explores the building and study of algorithms that learn from data to make predictions about unseen or future data.

Deep Learning:
Deep learning is a newer area of machine learning research in which the algorithm learns the analysis model to follow directly from the data, using multi-layered neural networks.

Figure 2: Evolution of Data Sciences



Figure 3: Data Science Process

Data Science Process


1. Discovery:
The discovery step involves acquiring data from all identified internal and external sources that can help you answer the business question.
The data can be:
 Logs from webservers
 Data gathered from social media
 Census datasets
 Data streamed from online sources using APIs (Application Programming Interface)
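Data streamed from APIs usually arrives as JSON. As a minimal sketch (the payload and field names here are hypothetical), Python's standard json module can turn such a response into workable records:

```python
import json

# Hypothetical JSON payload, as might be returned by a web API
payload = '[{"user": "ana", "likes": 12}, {"user": "ben", "likes": 7}]'

records = json.loads(payload)              # parse JSON text into Python objects
total_likes = sum(r["likes"] for r in records)
print(total_likes)                         # 19
```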

2. Data Preparation:
Data can have many inconsistencies, such as missing values, blank columns, and incorrect data formats, all of which need to be cleaned. You need to process, explore, and condition the data before modeling. The cleaner your data, the better your predictions.
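A minimal sketch of such conditioning, on a hypothetical record set containing a missing value, inconsistent formatting, and a duplicate:

```python
# Made-up raw records with the usual problems: missing value, messy
# formatting, and a duplicate entry
raw = [
    {"name": "Ana",  "age": "34"},
    {"name": "Ben",  "age": ""},      # missing value
    {"name": "ana ", "age": "34"},    # duplicate with inconsistent format
]

cleaned, seen = [], set()
for rec in raw:
    name = rec["name"].strip().title()  # normalize the formatting
    if not rec["age"]:                  # drop rows with missing values
        continue
    if name in seen:                    # drop duplicates
        continue
    seen.add(name)
    cleaned.append({"name": name, "age": int(rec["age"])})

print(cleaned)   # [{'name': 'Ana', 'age': 34}]
```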

3. Model Planning:
In this stage, you determine the methods and techniques for drawing relationships between the input variables. Model planning is performed using various statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.

4. Model Building:
In this step, the actual model building starts. The data scientist splits the dataset into training and testing sets. Techniques such as association, classification, and clustering are applied to the training dataset. Once prepared, the model is tested against the testing dataset.
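The split-train-test idea can be sketched with a toy nearest-centroid classifier (the dataset is invented; real projects would typically use a library such as scikit-learn):

```python
# Toy 1-D dataset: small values are class 0, large values are class 1
train = [(1, 0), (2, 0), (3, 0), (10, 1), (11, 1), (12, 1)]
test  = [(4, 0), (13, 1)]          # held-out "testing" dataset

# "Model building": one centroid (mean) per class from the training set
c0 = sum(x for x, y in train if y == 0) / 3
c1 = sum(x for x, y in train if y == 1) / 3

# Evaluation: classify each held-out point by its nearest centroid
predict = lambda x: 0 if abs(x - c0) <= abs(x - c1) else 1
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)   # 1.0
```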

5. Operationalize:
In this stage, you deliver the final baselined model along with reports, code, and technical documents. After thorough testing, the model is deployed into a real-time production environment.

6. Communicate Results:
In this stage, the key findings are communicated to all stakeholders. This helps you decide whether the project is a success or a failure based on the model's outputs.

What is Data Analytics?


Data analytics focuses on processing and performing statistical analysis on existing datasets. Analysts concentrate on creating methods to capture, process, and organize data to uncover actionable insights for current problems, and on establishing the best way to present this data. Put simply, the field of data analytics is directed toward solving problems for questions we know we don't know the answers to. More importantly, it is based on producing results that can lead to immediate improvements.
Data analytics also encompasses several branches of broader statistics and analysis that help combine diverse sources of data and locate connections while simplifying the results.

How Data Scientists Are Different From Data Analysts:


1. A data scientist uses their skills to predict the future from past patterns, while the work of a data analyst is to find meaningful information in the provided data.
2. A data scientist analyzes the data and raises questions, while a data analyst finds answers to the various issues raised by businesspeople. In short, a data scientist is more about the "what if", while a data analyst is involved in day-to-day analysis.
3. The work of a data scientist is not only to address business problems but also to provide accurate predictions about the business. Data analysts, however, only address business issues; the rest lies in the hands of the administration.
4. To extract information from data, a data scientist typically uses machine learning, while a data analyst relies on tools such as R and SAS (R programming and Statistical Analysis Software).
5. Data scientists combine different sources and establish links between them; they primarily use diverse sources and explore and examine them. Data analysts, however, usually investigate and examine data from a single reference.
6. Data scientists are evaluated on the accuracy of their predictions, whereas data analysts' work is to answer the questions provided to them by management.

7. Data scientists formulate questions whose answers will prove beneficial to the business. Data analysts, on the other hand, only solve a given set of questions and hand the results to the authorities.

Data Science Jobs & Roles


The most prominent data science job titles are:
 Data Scientist
 Data Engineer
 Data Analyst
 Statistician
 Data Architect
 Data Admin
 Business Analyst
 Data/Analytics Manager

Let's learn what each role entails in detail:


1. Data Scientist:
Role: A Data Scientist is a professional who manages enormous amounts of data to produce compelling business insights using various tools, techniques, methodologies, algorithms, etc.
Languages: R, SAS, Python, SQL, Hive, MATLAB, Pig, Spark

2. Data Engineer:
Role: A data engineer works with large amounts of data, developing, constructing, testing, and maintaining architectures such as large-scale processing systems and databases.
Languages: SQL, Hive, R, SAS, MATLAB, Python, Java, Ruby, C++, and Perl

3. Data Analyst:
Role: A data analyst is responsible for mining vast amounts of data, looking for relationships, patterns, and trends. He or she then delivers compelling reporting and visualizations for analyzing the data to support the most viable business decisions.
Languages: R, Python, HTML, JS, C, C++, SQL

4. Statistician:
Role: A statistician collects, analyzes, and interprets qualitative and quantitative data using statistical theories and methods.
Languages: SQL, R, MATLAB, Tableau, Python, Perl, Spark, and Hive

5. Data Administrator:
Role: A data administrator ensures that the database is accessible to all relevant users, performs correctly, and is kept safe from hacking.
Languages: Ruby on Rails, SQL, Java, C#, and Python

6. Business Analyst:
Role: This professional works to improve business processes, acting as an intermediary between the business executive team and the IT department.
Languages: SQL, Tableau, Power BI, and Python

Tools and Techniques of Data Science


Big data is a term used in data science that refers to the huge amount of data collected for research and analysis. It goes through various processes: it is first collected, stored, filtered, classified, validated, analyzed, and then processed for final visualization (Ngiam and Khor, 2019).
The tools and techniques of data science are two different things. Techniques are sets of procedures followed to perform a task, whereas a tool is a piece of equipment used to apply a technique to perform that task.
Data scientists apply operational methods, called techniques, to the data through various software programs, known as tools. This combination is used to acquire data, refine it for its intended purpose, manipulate and label it, and then examine the results for the best possible outcomes.
These methods, used by data scientists and engineers, cover all operations from collecting data to storing and manipulating it, performing statistical analysis on it, visualizing it with bars and charts, and preparing predictive models for insights.
These processes are carried out with the help of several tools and techniques drawn from the underlying disciplines of mathematics, statistics, and computer science.
The lifecycle of a data science project is composed of various stages. Data passes through each stage and is transformed into the information required by the respective field. Here we will look at the most efficient, quick, and productive tools and techniques used by data scientists to accomplish their tasks at each stage.

 Techniques
Which mathematical and statistical techniques do you need to learn for data science? A number of them are used for data collection, modification, storage, analysis, insight, and representation. Data analysts and scientists mostly work with the following statistical analysis techniques:
 Probability and Statistics
 Distribution
 Regression analysis
 Descriptive statistics
 Inferential statistics
 Non-Parametric statistics
 Hypothesis testing
 Linear Regression
 Logistic Regression
 Neural Networks
 K-Means clustering
 Decision Trees


The list doesn't end here, but if you have studied statistics and mathematics, you will have an idea of how the theories and techniques of sampling and correlation work, particularly when you work as a data scientist and need to draw conclusions, research patterns, and target insights (Sivarajah, Kamal, Irani, and Weerakkody, 2017).
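As one concrete instance from the list above, simple linear regression can be computed directly from the least-squares formulas (the data points here are synthetic, generated from y = 2x + 1):

```python
# Ordinary least squares for y = slope*x + intercept
xs = [1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]       # synthetic points on the line y = 2x + 1

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept follows from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
      / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(slope, intercept)   # 2.0 1.0
```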

 Tools
Let us start exploring the tools used to work on data in its different processes. As mentioned earlier, data goes through many processes in which it is collected, stored, worked on, and analyzed.
For your easy understanding, the tools defined here are categorized according to their processes.
The first process is data collection. Although data can be collected through various methods, including online surveys, interviews, and forms, the information gathered has to be transformed into a readable form before a data analyst can work on it. The following tools can be used for data collection.

1. Data Collection Tools


Text Analysis is about parsing texts in order to extract machine-readable facts from them.
The purpose of Text Analysis is to create structured data out of free text content. The
process can be thought of as slicing and dicing heaps of unstructured, heterogeneous
documents into easy-to-manage and interpret data pieces.
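A tiny illustration of this slicing and dicing, using only the standard library to turn free text into structured word counts (the sample sentence is invented):

```python
import re
from collections import Counter

text = "Data science turns raw data into insight; raw data alone is not insight."

# Tokenize: lowercase words only, punctuation stripped
tokens = re.findall(r"[a-z']+", text.lower())
freq = Counter(tokens)            # structured data out of free text

print(freq.most_common(2))        # [('data', 3), ('raw', 2)]
```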

 Semantria
Semantria is a cloud-based tool that extracts data and information by analyzing the text and the sentiment in it. It is a high-end tool based on NLP (natural language processing) that can detect sentiment toward specific elements from the language used (sounds like magic? No, it is science!).

 Trackur
It is yet another tool that collects data, especially on social media platforms, by tracking feedback on brands and products. It also performs sentiment analysis. It is a monitoring tool that can be of great value to marketing companies.

Today, many other apps use similar text/semantic analysis and content management, e.g., OpenText and Opinion Crawl.

2. Data Storage Tools


These tools are used to store huge amounts of data, typically spread across shared computers, and to interact with it. They provide a platform for uniting servers so that the data can be accessed easily.

 Apache Hadoop
It is a software framework for dealing with huge data volumes and their computation. It provides a layered structure that distributes the storage of data among clusters of computers so that big data can be processed easily.
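The core idea of distributing storage across a cluster can be sketched by hash-partitioning record keys over nodes (the node count and keys are hypothetical; Hadoop's actual placement logic is far more involved):

```python
import hashlib

NODES = 3   # hypothetical cluster of three storage nodes

def node_for(key: str) -> int:
    """Deterministically map a record key to one node in 0..NODES-1."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NODES

records = ["user:1", "user:2", "user:3", "user:4", "user:5", "user:6"]
placement = {r: node_for(r) for r in records}
# Every record lands on exactly one node; the same key always maps
# to the same node, which is what makes lookups possible later.
```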


 Apache Cassandra
This tool is a free, open-source platform. It uses CQL (Cassandra Query Language), an SQL-like language, to communicate with the database. It provides swift availability of data stored across various servers.

 MongoDB
It is a document-oriented database that is free to use. It is available on multiple platforms, including Windows, Solaris, and Linux. It is easy to learn and reliable.

Similar data storage platforms are CouchDB, Apache Ignite, and Oracle NoSQL Database.

3. Data Extraction Tools


Data extraction tools are also known as web scraping tools. They automatically extract information and data from websites. The following tools can be used for data extraction.

 OctoParse
It is a web scraping tool available in both free and paid versions. It outputs data as structured spreadsheets, which are readable and easy to use for further operations. It can extract phone numbers, IP addresses, and email IDs, along with other data, from websites.

 Content Grabber
It is also a web scraping tool but comes with advanced capabilities such as debugging and error handling. It can extract data from almost any website and provide structured output in user-preferred formats.

Similar tools are Mozenda, Pentaho, and import.io.
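A minimal scraping sketch with only the standard library: parse an HTML fragment (hard-coded here; a real scraper would fetch it over HTTP) and pull out the email addresses, much as the tools above do:

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this over HTTP
html = ('<p>Contact <a href="mailto:sales@example.com">sales</a> '
        'or support@example.com for details.</p>')

class TextAndAttrs(HTMLParser):
    """Collect both visible text and attribute values (mailto: links etc.)."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)
    def handle_starttag(self, tag, attrs):
        self.chunks.extend(v for _, v in attrs if v)

parser = TextAndAttrs()
parser.feed(html)

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", " ".join(parser.chunks))
print(emails)   # ['sales@example.com', 'support@example.com']
```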

4. Data Cleaning / Refining Tools


Integrated with databases, data cleaning tools save time by searching, sorting, and filtering the data to be used by data analysts. The refined data becomes relevant and easy to use (Blei and Smyth, 2017).

 DataCleaner
DataCleaner works with the Hadoop ecosystem and is a very powerful data indexing tool. It improves the quality of data by removing duplicates and merging them into one record. It can also find missing patterns and specific data groups.

 OpenRefine
This refining tool deals with messy data, cleaning it before transforming it into another form. It provides fast and easy access to the data.

Similar data cleaning tools are MapReduce, RapidMiner, and Talend.
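The duplicate-merging behavior described above (consolidating duplicates into one record) can be sketched on made-up rows: rows sharing an id are merged, filling each field from the first row that has it:

```python
# Made-up rows with a duplicate id and missing fields
rows = [
    {"id": 1, "email": "ana@example.com", "phone": None},
    {"id": 1, "email": None,              "phone": "555-0100"},
    {"id": 2, "email": "ben@example.com", "phone": None},
]

merged = {}
for row in rows:
    rec = merged.setdefault(row["id"], {})
    for field, value in row.items():
        if value is not None and rec.get(field) is None:
            rec[field] = value   # fill from the first row that has the field

print(merged[1])   # {'id': 1, 'email': 'ana@example.com', 'phone': '555-0100'}
```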

5. Data Analysis Tools


Data analysis tools not only analyze data but also perform certain operations on it. These tools inspect the data and apply data modeling to draw out useful, conclusive information that helps in decision making for a given problem or query.

 R
R is a widely used programming language for statistical computing and graphics. It supports platforms such as Windows, macOS, and Linux, and is widely used by data analysts, statisticians, and researchers.

 Apache Spark
Apache Spark is a powerful analytical engine that provides real-time analysis and data processing, supporting both micro-batches and streaming. It is productive because it provides highly interactive workflows.

 Python
Python is a powerful, high-level programming language that has been around for quite a while. Originally used for application development, it has since gained a rich ecosystem of tools for data science. It can produce output files in CSV format for use as spreadsheets.

Similar data analysis tools are Apache Storm, SAS, Flink, and Hive.
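A dependency-free sketch of this analyze-then-export flow, using hypothetical sales figures: compute descriptive statistics and write them out in the CSV form mentioned above:

```python
import csv
import io
import statistics

# Hypothetical daily sales figures to analyze
sales = [120, 135, 128, 150, 142, 138]

summary = {
    "mean":   statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev":  round(statistics.stdev(sales), 2),
}

# Export the summary as CSV, a spreadsheet-friendly format
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(summary.keys())
writer.writerow(summary.values())
print(buf.getvalue())
```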

6. Data Visualization Tools


Data visualization tools are used to present data in a graphical representation for clear insight.
Many visualization tools are a combination of previous functions we discussed and can also
support data extraction and analysis along with visualization.

 Python
Python, as mentioned above, is a powerful, general-purpose programming language that also provides data visualization. It offers a wealth of graphical libraries to support the graphical representation of a wide variety of data.
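Libraries such as matplotlib produce the real graphics; as a dependency-free sketch of the underlying idea, values can simply be mapped to bar lengths (the counts here are invented):

```python
# Map each value to a horizontal bar; real charts use graphical libraries
counts = {"R": 4, "Python": 9, "SQL": 6}

chart_lines = []
for label, value in counts.items():
    chart_lines.append(f"{label:>6} | {'#' * value}")
print("\n".join(chart_lines))
#      R | ####
# Python | #########
#    SQL | ######
```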

 Tableau
Having a very large consumer market, Tableau has been referred to by Forbes as the grandmaster of visualization software. It is commercial software that can be integrated with databases, is easy to use, and furnishes interactive data visualizations in the form of bars, charts, and maps.

 Orange
Orange also happens to be an open-source data visualization tool supporting data
extraction, data analysis, and machine learning. It does not require programming but
rather has an interactive and user-friendly graphical user interface that displays the data
in the form of bar charts, networks, heat maps, scatter plots, and trees.

 Google Fusion Tables
It was a web service powered by Google (discontinued in 2019) that could easily be used by non-programmers for collecting data. You could upload your data as CSV files and save them. It looked much like an Excel spreadsheet and allowed editing, with real-time changes reflected in the visualizations. It displayed data in the form of pie charts, bars, timelines, line plots, and scatter plots, and allowed you to link data tables to your websites. You could also create a map based on your data, which could be further modified with coloring and shared.

Similar popular data visualization apps and tools are Datawrapper, Qlik, and Gephi, all of which accept CSV files as data input.
