Chapter 2-2
Data
Introduction
● Data science is now one of the most influential fields across industries.
● Companies and enterprises are investing heavily in recruiting data science talent.
● Example: The data involved in buying a box of cereal from the store or supermarket
Data Science vs Data Scientist
• Data science is defined as the extraction of actionable knowledge directly from
data through:
Process of discovery,
Hypothesis formulation, and
Analytical testing of hypotheses.
• The processed and filtered data is fed to various analytics programs and
machine learning algorithms, together with statistical methods, to generate
results that are then used in predictive analysis and other fields.
Data Science
• The scientific method requires data in order to iterate towards a more convincing
hypothesis.
• A data scientist possesses:
• A strong quantitative background in statistics and linear algebra
• Programming knowledge with a focus on data warehousing, mining, and
modeling to build and analyze algorithms
Algorithms
• An algorithm is a set of instructions designed to perform a specific task.
• Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9),
or special characters (+, -, /, *, <, >, =, etc.).
• Information
• The processed data on which decisions and actions are based
• Information is interpreted data; created from organized, structured, and processed
data in a particular context
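The relationship between an algorithm, raw data, and information can be sketched in a few lines of Python. This is a minimal illustration, not part of the original material: the temperature readings and the `summarize` function are hypothetical names chosen for the example.

```python
from statistics import mean

# Raw data: unprocessed facts written as characters (hypothetical readings in °C).
raw_readings = ["21.5", "23.0", "19.5", "22.0"]


def summarize(readings):
    """A small algorithm: a fixed set of instructions for one specific task."""
    values = [float(r) for r in readings]  # step 1: turn characters into numbers
    return {                               # step 2: compute summary figures
        "count": len(values),
        "average": mean(values),
        "highest": max(values),
    }


# Information: processed data, organized in a context a decision can rest on.
information = summarize(raw_readings)
print(information)
```

The list of strings is data; the dictionary returned by `summarize` is information, because it has been processed and given meaning.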
Data Processing Cycle
• Data processing is the conversion of raw data to meaningful information
through a process.
• Storage is the last stage in the data processing cycle, where data,
instructions, and information are held for future use.
• This stage is important because it allows quick access to and retrieval of the
processed information, so that it can be passed on to the next stage
directly when needed.
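As a rough sketch of how processing and storage fit together, the Python example below converts raw data into information, stores the result, and retrieves it later without reprocessing. The function names, the input values, and the file name `result.json` are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


def process(raw):
    """Processing: convert raw data into meaningful information."""
    return {"total": sum(raw), "items": len(raw)}


def store(info, path):
    """Storage: hold the processed information for future use."""
    path.write_text(json.dumps(info))


def retrieve(path):
    """Later access: read the stored information back without reprocessing."""
    return json.loads(path.read_text())


raw_data = [120, 80, 45]  # hypothetical input values

with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "result.json"  # hypothetical file name
    store(process(raw_data), store_path)
    restored = retrieve(store_path)

print(restored)
```

A temporary directory is used here only so that the sketch cleans up after itself; a real system would write to a database or durable file store.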
Data types
• A data type is a way to tell the compiler what kind of data is to be
stored and how much memory will be allocated for it.
• It prevents the compiler from storing anything outside that type's value
range.
• For the ones and zeros stored in memory to convey any meaning, they
need to be contextualized.
• Data analytics is done with the aid of specialized systems and software.
• From a data analytics point of view, it is important to understand that there are
three common types of data, or data structures:
• Structured,
• Semi-structured, and
• Unstructured data types
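Returning to the programming sense of a data type (a memory size, a value range, and a context for raw bits), the idea can be sketched with Python's standard `struct` module. Python is not a compiled language in the sense the text describes, so this only imitates fixed-size types; the helper `fits_in_uint8` is a hypothetical name.

```python
import struct

raw = b"ABCD"  # four bytes of memory: 0x41 0x42 0x43 0x44

# The same ones and zeros mean different things depending on the type applied.
as_int = struct.unpack("<I", raw)[0]    # read as a 32-bit unsigned integer
as_float = struct.unpack("<f", raw)[0]  # read as a 32-bit floating-point number
as_text = raw.decode("ascii")           # read as four ASCII characters

# A type also fixes how much memory a value occupies.
size_of_uint8 = struct.calcsize("<B")   # unsigned 8-bit integer
size_of_uint32 = struct.calcsize("<I")  # unsigned 32-bit integer


def fits_in_uint8(value):
    """Return True if value is inside the 0..255 range of an unsigned byte."""
    try:
        struct.pack("<B", value)
        return True
    except struct.error:
        return False


print(as_int, as_float, as_text)
print(size_of_uint8, size_of_uint32)
print(fits_in_uint8(255), fits_in_uint8(256))
```

The integer, float, and text readings all come from the same four bytes; only the declared type gives them meaning.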
Structured Data
• Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze.
• Structured data covers all data that can be stored in an SQL database, in
tables with rows and columns.
• It has relational keys and can be easily mapped into pre-designed
fields.
• Structured data is highly organized information that loads neatly into a
relational database.
Unstructured Data
• Unstructured data may have its own internal structure, but does not fit
neatly into a spreadsheet or database.
• The fundamental challenge of unstructured data sources is that they are difficult
for business users and data analysts alike to understand.
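To make the contrast concrete, the sketch below holds the same facts twice: once as structured rows in an in-memory SQLite table, and once as unstructured free text. The table name, column names, and prices are hypothetical, loosely based on the cereal purchase mentioned in the introduction.

```python
import re
import sqlite3

# Structured: rows and columns in a relational table with a key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, item TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO purchases (item, price) VALUES (?, ?)",
    [("cereal", 4.50), ("milk", 1.20)],
)
structured_total = conn.execute("SELECT SUM(price) FROM purchases").fetchone()[0]
conn.close()

# Unstructured: the same facts as free text, with no predefined fields.
note = "Bought a box of cereal for 4.50 and some milk for 1.20 at the supermarket."
prices = [float(p) for p in re.findall(r"\d+\.\d+", note)]
unstructured_total = sum(prices)

print(structured_total, unstructured_total)
```

The SQL query works because the schema says where the price lives. The text version needs a pattern match that guesses which numbers are prices, and that guess would fail as soon as the note also mentioned a quantity or a date in the same format.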
Semi-structured Data
• Semi-structured data is information that doesn’t reside in a relational
database but that does have some organizational properties that
make it easier to analyze.
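JSON is a common example of a semi-structured format. In the sketch below, each record carries its own field names (the organizational property), yet the two records do not share a fixed schema. The names and values are invented for illustration.

```python
import json

# Two hypothetical records: self-describing keys, but no fixed set of columns.
records_json = """
[
  {"name": "Abebe", "email": "abebe@example.com"},
  {"name": "Sara", "phones": ["555-0100", "555-0101"], "city": "Addis Ababa"}
]
"""

records = json.loads(records_json)

# The keys make each record easy to navigate even though the records differ.
for record in records:
    print(record["name"], sorted(record.keys()))

# Collect every field that appears in at least one record.
fields = sorted({key for record in records for key in record})
print(fields)
```

Loading these records into a relational table would first require deciding how to handle the missing `email` and the list-valued `phones` field, which is why such data does not reside in a relational database as-is.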
“Big Data”
• Refers to data sets that are too large or complex to be dealt with by traditional
data-processing application software.
• Is data whose scale, diversity, and complexity require new architectures,
techniques, algorithms, and analytics to manage it and extract value.
Big data
• Big data is associated with the concept of the 3 Vs: volume, velocity, and
variety.
• When a cluster of computers works together to perform tasks and
gives the impression of being a single entity, it is called “cluster
computing”.
Clustered Computing
• Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
• Resource Pooling:
• Combining the available storage space to hold data is a clear benefit,
but CPU and memory pooling are also extremely important. Processing
large datasets requires large amounts of all three of these resources.
• Object pooling is a technique that stores a group of objects (called a
pool) in memory so that they can be reused.
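Object pooling can be sketched as follows. This is a minimal, single-threaded illustration under assumed names (`Buffer`, `BufferPool`, `acquire`, `release`), not a reference to any particular big data framework.

```python
class Buffer:
    """A hypothetical object that is costly to create, so it is worth reusing."""

    def __init__(self, size):
        self.data = bytearray(size)


class BufferPool:
    """Keeps a fixed group of Buffer objects in memory and hands them out."""

    def __init__(self, count, size):
        self._free = [Buffer(size) for _ in range(count)]

    def acquire(self):
        if not self._free:
            raise RuntimeError("pool exhausted")
        return self._free.pop()

    def release(self, buf):
        buf.data[:] = bytes(len(buf.data))  # clear contents before reuse
        self._free.append(buf)


pool = BufferPool(count=2, size=1024)
first = pool.acquire()
first.data[0] = 7
pool.release(first)
second = pool.acquire()

# The pool returned the same object instead of allocating a new one.
print(first is second, second.data[0])
```

The design choice is to pay the allocation cost once, when the pool is built, and then recycle the objects; the trade-off is that memory stays reserved even when the pool is idle.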