Chapter 2 - Data Science
Outline
Data science
Data vs. information
Data types and representation
Data value chain
Basic concepts of big data
2.1. An Overview of Data Science
Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured, semi-
structured and unstructured data.
Data science is much more than simply analyzing data.
Data is represented with the help of characters such as letters (A-Z, a-z), digits
(0-9), or special characters (+, -, /, *, =, etc.).
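As a small illustration of the character classes named above, the following Python
sketch groups the characters of a string into letters, digits, and special
characters. The function name classify_characters and the sample string are
illustrative assumptions, not part of the original notes.

```python
def classify_characters(text):
    """Group the characters of `text` into letters, digits, and special characters."""
    groups = {"letters": [], "digits": [], "special": []}
    for ch in text:
        if ch.isalpha():
            groups["letters"].append(ch)
        elif ch.isdigit():
            groups["digits"].append(ch)
        elif not ch.isspace():
            # Anything that is not a letter, digit, or whitespace counts as special.
            groups["special"].append(ch)
    return groups


sample = "x = 7 + 35"
result = classify_characters(sample)
print(result)
```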
Information
Information is the processed data on which decisions and actions are based. It is
data that has been processed into a form that is meaningful to the recipient and is
of real or perceived value in the current or prospective action or decision of the
recipient. Furthermore, information is interpreted data, created from organized,
structured, and processed data in a particular context.
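The step from data to information can be sketched in a few lines of Python. The
temperature readings below are made-up values, and the summary format is only one
possible choice: the raw list is the data, and the summary sentence is information
in the sense that a recipient could act on it.

```python
from statistics import mean

# Raw data: daily temperature readings in degrees Celsius (made-up values).
readings = [21.5, 23.0, 22.5, 30.0, 29.0]

# Processing: summarize the readings in a form a decision-maker can use.
average = mean(readings)
hottest = max(readings)
information = f"Average {average:.1f} C, peak {hottest:.1f} C over {len(readings)} days"
print(information)
```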
Data Processing Cycle
A data type is simply an attribute of data that tells the compiler or interpreter how
the programmer intends to use the data.
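A brief Python sketch of this idea: the type attached to a value decides how the
interpreter treats an operation on it. The values are arbitrary examples chosen for
illustration.

```python
# Four values that look similar but carry different types.
values = [42, 3.14, "42", True]
for v in values:
    print(repr(v), type(v).__name__)

# The same "+" behaves differently depending on the operand types.
int_sum = 42 + 1            # integer addition
str_join = "42" + "1"       # string concatenation
as_number = int("42") + 1   # explicit conversion from text to integer
```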
structures:
2. Semi-structured:- Data that does not conform to the formal structure of data models
associated with relational databases or other forms of data tables.
It is also known as a self-describing structure.
3. Unstructured:- Data that either does not have a predefined data model or is not
organized in a predefined manner.
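The difference between the two categories above can be sketched in Python. JSON is
used here as one common example of a self-describing, semi-structured format, and a
free-text note stands in for unstructured data. Both samples are made up for
illustration.

```python
import json

# Semi-structured: each record carries its own field names (self-describing),
# and records need not share the same fields.
raw = (
    '[{"name": "Abebe", "email": "abebe@example.com"},'
    ' {"name": "Sara", "phone": "555-0100"}]'
)
records = json.loads(raw)
fields_per_record = [sorted(r.keys()) for r in records]

# Unstructured: free text with no predefined model. Without further
# interpretation, only generic operations such as word counting apply.
note = "Met Sara on Monday to discuss the survey results."
word_count = len(note.split())

print(fields_per_record, word_count)
```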
Metadata – Data about Data
Metadata is data about data. It is one of the most important elements for Big Data
analysis and big data solutions.
Example:- In a set of photographs, metadata could describe when and where the
photos were taken.
For this reason, metadata is frequently used by Big Data solutions for initial
analysis.
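The photograph example can be sketched as follows. The records, field names
(taken_on, location), and the helper photos_taken_in are hypothetical; the point is
only that a first pass over a collection can use the metadata without opening the
images themselves.

```python
from datetime import date

# Hypothetical photo records: the image bytes would be the data,
# while "taken_on" and "location" are metadata describing each photo.
photos = [
    {"file": "img_001.jpg", "taken_on": date(2023, 1, 14), "location": "Addis Ababa"},
    {"file": "img_002.jpg", "taken_on": date(2023, 6, 2), "location": "Bahir Dar"},
    {"file": "img_003.jpg", "taken_on": date(2023, 6, 20), "location": "Addis Ababa"},
]


def photos_taken_in(records, location):
    """Select files by metadata alone, without opening any image."""
    return [r["file"] for r in records if r["location"] == location]


addis_photos = photos_taken_in(photos, "Addis Ababa")
print(addis_photos)
```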
Data Value Chain
Data Acquisition
Data acquisition is the process of gathering, filtering, and cleaning data before it
is put in a data warehouse or any other storage solution on which data analysis can
be carried out. It is one of the major big data challenges in terms of
infrastructure requirements.
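A minimal sketch of gathering, filtering, and cleaning, assuming incoming rows
arrive as dictionaries with "name" and "age" fields. The function acquire, the field
names, and the sample rows are all assumptions made for illustration; a real
pipeline would read from files, sensors, or network sources.

```python
def acquire(raw_rows):
    """Gather, filter, and clean rows before they are stored."""
    cleaned = []
    for row in raw_rows:  # gather: iterate over the incoming rows
        name = str(row.get("name", "")).strip()
        age_text = str(row.get("age", "")).strip()
        # filter: drop rows with a missing name or a non-numeric age
        if not name or not age_text.isdigit():
            continue
        # clean: normalize capitalization and convert the age to an integer
        cleaned.append({"name": name.title(), "age": int(age_text)})
    return cleaned


raw_rows = [
    {"name": "  alice ", "age": "30"},
    {"name": "", "age": "25"},
    {"name": "bob", "age": "unknown"},
    {"name": "carol", "age": 41},
]
staged = acquire(raw_rows)
print(staged)
```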
Data Analysis
Data analysis involves exploring, transforming, and modeling data.
It is concerned with making the raw data acquired amenable to use in
decision-making.
Related areas include data mining, business intelligence, and machine learning.
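The three activities named above (exploring, transforming, modeling) can be sketched
on a tiny made-up data set. The numbers and variable names are assumptions, and the
ordinary least-squares line is just one simple modeling choice among many.

```python
from statistics import mean

# Made-up data: advertising spend and resulting sales, both in thousands.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

# Explore: basic summary statistics.
summary = {"mean_spend": mean(spend), "mean_sales": mean(sales)}

# Transform: center both variables on their means.
x = [s - summary["mean_spend"] for s in spend]
y = [s - summary["mean_sales"] for s in sales]

# Model: ordinary least-squares line, sales = slope * spend + intercept.
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
intercept = summary["mean_sales"] - slope * summary["mean_spend"]
print(summary, slope, intercept)
```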
Data Curation
It is the active management of data over its life cycle.
Data curators are also known as scientific curators or data annotators.
Data curation activities include preservation.
Data Storage
It is the persistence and management of data in a scalable way.
Relational Database Management Systems (RDBMS) have been the main solution for
data storage.
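As a sketch of relational storage, the following example uses SQLite through
Python's standard sqlite3 module. SQLite is chosen only because it needs no server;
the table name, columns, and rows are made up for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, score REAL)"
)
conn.executemany(
    "INSERT INTO students (name, score) VALUES (?, ?)",
    [("Abebe", 82.5), ("Sara", 91.0), ("Kebede", 67.0)],
)
conn.commit()

# Query the stored rows with SQL, highest score first.
top = conn.execute(
    "SELECT name FROM students WHERE score >= ? ORDER BY score DESC", (80,)
).fetchall()
print(top)
conn.close()
```

The fixed table schema is what makes this structured storage: every row has the same
columns, and queries rely on that.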
Data Usage
Data usage covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the business
activity.
any other parameter that can be measured against existing performance criteria
Basic Concepts of Big Data
Big data is the term for a collection of large and complex data sets.
Big data clustering software combines the resources of many smaller machines.
Hadoop has an ecosystem that has evolved from its four core components: data
management, access, processing, and storage.
Big Data Life Cycle
1. Ingesting data into the system:- The data is transferred to Hadoop from various
sources.
2. Processing the data in storage:- The data is stored in the distributed file system,
HDFS, and the NoSQL distributed database, HBase. Spark and MapReduce perform
data processing.
4. Visualizing the results:- which is performed by tools such as Hue and Cloudera
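The processing step in the list above names MapReduce. The following is a minimal
single-machine sketch of the map, shuffle, and reduce pattern in plain Python, using
word counting as the task. It does not use the Hadoop API; the function names
map_phase, shuffle, and reduce_phase and the sample lines are assumptions made for
illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data is big", "data is data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a cluster, the map and reduce functions would run on many machines in parallel,
with the framework handling the shuffle between them; here everything runs in one
process to keep the pattern visible.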