UNIT 1
CS3352 Unit I
Data
In computing, data is information that has been translated into a form that is efficient for movement or
processing
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.
Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require
different tools and techniques. The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Let’s explore all these interesting data types.
Structured data
• Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it’s often easy to store structured data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases (a minimal sketch follows below).
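For instance, the following short Python sketch uses the built-in sqlite3 module to show structured data living in a table with fixed fields and being queried with SQL; the customers table and its columns are hypothetical, chosen only for illustration.

import sqlite3

# Structured data fits naturally into a table with fixed fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Alice", "FR"), ("Bob", "NL")],
)

# SQL is used to query the structured data.
for row in conn.execute("SELECT name FROM customers WHERE country = 'NL'"):
    print(row)          # ('Bob',)
conn.close()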
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email.
Natural language
• Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
Machine-generated data
• Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its high volume and speed. Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
Streaming data
• The data flows into the system when an event happens instead of being loaded into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock market.
The data science process
The data science process consists of six steps.
1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw
form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It is now
that you attempt to gain the insights or make the predictions stated in your project charter. Now is the
time to bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if needed.
One goal of a project is to change a process and/or make better decisions. You may still need to convince
the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic
and tactical level. Certain projects require you to perform the business process over and over again, so
automating the project will save time.
Retrieving data
• The next step in data science is to retrieve the required data. Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won’t be involved in this step.
• Many companies will have already collected and stored the data for you, and what they don’t have can often be bought from third parties.
• More and more organizations are making even high-quality data freely available for public and commercial use.
• Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need.
• Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories such as databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals.
• Data warehouses and data marts are home to preprocessed data, whereas data lakes contain data in its natural or raw format.
• Finding data even within your own company can sometimes be a challenge. As companies grow, their data becomes scattered around many places. Data may also be dispersed as people change positions and leave the company.
• Getting access to data is another difficult task. Organizations understand the value and sensitivity of data and often have policies in place so everyone has access to what they need and nothing more.
• These policies translate into physical and digital barriers called Chinese walls. These “walls” are mandatory and well-regulated for customer data in most countries.
External Data
• If data isn’t available inside your organization, look outside your organization. Some companies provide data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and Facebook.
• More and more governments and organizations share their data for free with the world.
• There are many open data providers that can get you started.
Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
• The first type is the interpretation error, such as when you take the value in your data for granted, like saying that a person’s age is greater than 300 years.
• The second type of error points to inconsistencies between data sources or against your company’s standardized values.
An example of this class of errors is putting “Female” in one table and “F” in another when they represent
the same thing: that the person is female.
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors;
diagnostic plots can be especially insightful. For example, in the figure we use a measure to identify data points
that seem out of place. We do a regression to get acquainted with the data and detect the influence of individual
observations on the regression line.
• Data collected by machines or computers isn’t free from errors. Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
• Detecting data errors when the variables you study don’t have many classes can be done by tabulating the data with counts.
• When you have a variable that can take only two values, “Good” and “Bad”, you can create a frequency table and see if those are truly the only two values present. In the table, the values “Godo” and “Bade” point out that something went wrong in at least 16 cases.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

if x == "Godo":      # common typo for "Good"
    x = "Good"
elif x == "Bade":    # common typo for "Bad"
    x = "Bad"
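The same tabulate-and-fix idea can also be sketched with pandas, assuming the data sits in a tabular data set; the quality column and its values here are hypothetical.

import pandas as pd

# Hypothetical column with two valid classes plus typos we want to catch.
df = pd.DataFrame({"quality": ["Good", "Bad", "Godo", "Good", "Bade", "Bad"]})

# Tabulating the data with counts reveals classes that shouldn't exist.
print(df["quality"].value_counts())

# Fix the detected typos with a simple replacement rule.
df["quality"] = df["quality"].replace({"Godo": "Good", "Bade": "Bad"})
print(df["quality"].value_counts())

The first value_counts() call reveals the unexpected classes; the second confirms that only “Good” and “Bad” remain.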
Redundant Whitespace
• Whitespace tends to be hard to detect but causes errors just like other redundant characters would.
• Redundant whitespace causes mismatches between strings, such as “FR ” versus “FR”, which can lead to dropping the observations that couldn’t be matched.
• If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most programming languages. They all provide string functions that will remove leading and trailing whitespace. For instance, in Python you can use the strip() function to remove leading and trailing spaces, as in the short sketch below.
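A minimal illustration in plain Python:

# " FR " and "FR" look the same to the eye but don't match as strings.
country = " FR "
print(country == "FR")          # False
print(country.strip() == "FR")  # True: leading and trailing whitespace removed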
Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way to
find outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the upper side
when a normal distribution is expected.
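As a rough sketch of the minimum/maximum check, assuming pandas and made-up measurements, with a simple distance-from-the-mean rule added on top (the 2-standard-deviation cutoff is an arbitrary illustration, not a fixed rule):

import pandas as pd

# Hypothetical measurements; 510 looks out of place when values around 50 are expected.
values = pd.Series([48, 52, 47, 51, 49, 510, 50, 53])

# The easiest check: the minimum and maximum values.
print(values.min(), values.max())

# Flag points far from the rest of the observations.
z = (values - values.mean()) / values.std()
print(values[z.abs() > 2])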
Integrating data
Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
Joining Tables
• Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation.
• Let’s say that the first table contains information about the purchases of a customer and the other table contains information about the region where your customer lives.
• Joining the tables allows you to combine the information so that you can use it for your model, as shown in the figure.
To join tables, you use variables that represent the same object in both tables, such as a date, a country name,
or a Social Security number. These common fields are known as keys. When these keys also uniquely define
the records in the table they are called primary keys.
The number of resulting rows in the output table depends on the exact join type that you use.
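A minimal sketch with pandas; the tables and column names are hypothetical. Changing how="inner" to "left" or "outer" changes which rows survive the join, which is why the join type determines the number of resulting rows.

import pandas as pd

# Hypothetical tables: purchases per customer and the region each customer lives in.
purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120, 80, 200]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# customer_id is the key; an inner join keeps only customers present in both tables.
enriched = purchases.merge(regions, on="customer_id", how="inner")
print(enriched)          # customer 3 is dropped because it has no region record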
Appending Tables
• Appending or stacking tables is effectively adding observations from one table to another table.
• One table contains the observations from the month January and the second table contains observations from the month February. The result of appending these tables is a larger one with the observations from January as well as February.
Figure. Appending data from tables is a common operation but requires an equal structure in the tables being appended.
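A minimal pandas sketch of appending, using hypothetical January and February tables with the same structure:

import pandas as pd

# Two tables with identical columns, one per month.
january = pd.DataFrame({"customer_id": [1, 2], "amount": [120, 80]})
february = pd.DataFrame({"customer_id": [1, 3], "amount": [95, 200]})

# Appending (stacking) puts February's observations below January's.
combined = pd.concat([january, february], ignore_index=True)
print(combined)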
Transforming data
Certain models require their data to be in a certain shape, so you may need to transform your data into a form that is suitable for data modeling.
Relationships between an input variable and an output variable aren’t always linear. Take, for instance, a relationship of the form y = a·e^(bx). Taking the logarithm of y gives log(y) = log(a) + b·x, which turns the estimation into a simple linear problem. Transforming the variables in this way greatly simplifies the estimation. Other times you might want to combine two variables into a new variable.
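A short sketch of this log transformation, assuming NumPy and synthetic data generated from y = a·e^(bx):

import numpy as np

# Synthetic data from y = a * exp(b * x) with a = 2.0 and b = 0.3.
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x)

# After taking log(y), the relationship log(y) = log(a) + b*x is linear,
# so an ordinary linear fit recovers the parameters.
b_hat, log_a_hat = np.polyfit(x, np.log(y), 1)
print(np.exp(log_a_hat), b_hat)   # approximately 2.0 and 0.3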
The figure shows how reducing the number of variables makes it easier to understand the key values. It also shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% + component2 = 22.8%). These variables, called “component1” and “component2,” are both combinations of the original variables. They’re the principal components of the underlying data structure.
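A hedged sketch of principal components with scikit-learn on made-up data; the explained-variance ratios printed at the end play the same role as the 27.8% and 22.8% quoted above.

import numpy as np
from sklearn.decomposition import PCA

# Made-up data set: five variables that are partly driven by two underlying factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 2))
X = np.hstack([factors, factors @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(100, 3))])

# Keep the first two principal components and inspect how much variation they explain.
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)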
• Dummy variables can take only two values: true (1) or false (0). They’re used to indicate the absence or presence of a categorical effect that may explain the observation.
• In this case you’ll make separate columns for the classes stored in one variable and indicate it with 1 if the class is present and 0 otherwise.
• An example is turning one column named Weekdays into the columns Monday through Sunday. You use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
• Turning variables into dummies is a technique that’s used in modeling and is popular with, but not exclusive to, economists.
Figure. Turning variables into dummies is a data transformation that breaks a variable that has multiple classes into multiple variables, each having only two possible values: 0 or 1.
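A minimal sketch with pandas; the weekday column is hypothetical.

import pandas as pd

# Hypothetical column with the day of the week for each observation.
df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

# get_dummies creates one column per class, with 1 where the class is present and 0 elsewhere.
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)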
Exploratory data analysis
During exploratory data analysis you take a deep dive into the data (see the figure below). Information becomes much easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before,
forcing you to take a step back and fix them.
• The visualization techniques you use in this phase range from simple line graphs or histograms, as shown in the figure below, to more complex diagrams such as Sankey and network graphs.
• Sometimes it’s useful to compose a composite graph from simple graphs to get even more insight into the data. Other times the graphs can be animated or made interactive to make it easier and, let’s admit it, way more fun.
The techniques we described in this phase are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can also be a part of exploratory
analysis. Even building simple models can be a part of this step.
Data modeling
Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and on the type of technique you want to use. Either way, most models consist of the following main steps:
• Selection of a modeling technique and variables to enter into the model
• Execution of the model
• Diagnosis and model comparison
Model execution
• Once you’ve chosen a model you’ll need to implement it in code.
• Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular modeling techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. As you can see in the following code, it’s fairly easy to use linear regression with StatsModels or Scikit-learn.
• Doing this yourself would require much more effort even for the simple techniques. The following listing shows the execution of a linear prediction model.
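A minimal sketch of such a linear prediction model with StatsModels, using hypothetical price and order-size data:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: order size grows with price, plus some noise.
rng = np.random.default_rng(1)
price = rng.uniform(1, 10, size=100)
size = 3 * price + rng.normal(scale=2, size=100)

# Ordinary least squares with an intercept term.
X = sm.add_constant(price)
results = sm.OLS(size, X).fit()
print(results.params)     # intercept and slope, close to 0 and 3
print(results.summary())  # full regression output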
Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and take the mean of these squared errors over all predictions, i.e. MSE = (1/n) * Σ (y_i − ŷ_i)^2, where y_i is the true value and ŷ_i the prediction.
The figure above compares the performance of two models for predicting the order size from the price. The first model is size = 3 * price and the second model is size = 10.
• To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without showing the other 20% of the data to the model.
• Once the model is trained, we predict the values for the remaining 20% of observations and compare the predictions with the known true values using an error measure.
• Then we choose the model with the lowest error. In this example we choose model 1 because it has the lowest total error. A minimal sketch of this train/test comparison follows below.
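The 80/20 split and error comparison can be sketched as follows, assuming scikit-learn; the data and both models are the hypothetical ones from the example above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical data set of 1,000 observations where size is roughly 3 * price.
rng = np.random.default_rng(2)
price = rng.uniform(1, 10, size=1000)
size = 3 * price + rng.normal(scale=2, size=1000)

# Hold out 20% of the observations; the models never see them during estimation.
price_train, price_test, size_train, size_test = train_test_split(
    price, size, test_size=0.2, random_state=0)

# Model 1: size = 3 * price; model 2: size = 10 (a constant guess).
error_model1 = mean_squared_error(size_test, 3 * price_test)
error_model2 = mean_squared_error(size_test, np.full_like(size_test, 10.0))
print(error_model1, error_model2)   # model 1 has the lower error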
Many models make strong assumptions, such as independence of the inputs, and you have to verify that these
assumptions are indeed met. This is called model diagnostics.
Presentation and automation
• Sometimes people get so excited about your work that you’ll need to repeat it over and over again because they value the predictions of your models or the insights that you produced.
• This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s sufficient that you implement only the model scoring; other times you might build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the data science process is where your soft skills will be most useful, and yes, they’re extremely important.
Data mining
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied
to specific scenarios, such as:
• Forecasting: Estimating sales, predicting server loads or server downtime
• Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
• Recommendations: Determining which products are likely to be sold together, generating recommendations
• Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
• Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities
Building a mining model is part of a larger process that includes everything from asking questions about the
data and creating a model to answer those questions, to deploying the model into a working environment. This
process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step.
Defining the Problem
The first step in the data mining process is to clearly define the problem, and consider ways that data can be utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by
which the model will be evaluated, and defining specific objectives for the data mining project. These tasks
translate into questions such as the following:
• What are you looking for? What types of relationships are you trying to find?
• Does the problem you are trying to solve reflect the policies or processes of the business?
• Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?
• Which outcome or attribute do you want to try to predict?
• What kind of data do you have and what kind of information is in each column? If there are multiple tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to make the data usable?
• How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the business?
Preparing Data
• The second step in the data mining process is to consolidate and clean the data that was identified in the Defining the Problem step.
• Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries.
• Data cleaning is not just about removing bad data or interpolating missing values, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and standard
deviations, and looking at the distribution of the data. For example, you might determine by reviewing the
maximum, minimum, and mean values that the data is not representative of your customers or business
processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis
for your expectations. Standard deviations and other distribution values can provide useful information about
the stability and accuracy of the results.
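A small pandas sketch of these summary statistics; the order amounts are made up for illustration.

import pandas as pd

# Hypothetical order amounts; one value is clearly out of line with the rest.
orders = pd.Series([12.5, 14.0, 13.2, 250.0, 12.9, 13.5])

# describe() reports count, mean, standard deviation, min, max, and quartiles,
# which is often enough to judge whether the data looks plausible.
print(orders.describe())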
Building Models
In this step you define a mining structure, which specifies the columns of source data to use, and then build one or more mining models on top of it. The mining structure is linked to the source of data, but does not actually contain any data until you process it. When you process the mining structure, SQL Server Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure.
Data warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed
by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data
consolidations.
Although a data warehouse and a traditional database share some similarities, they are not the same concept. The main difference is that in a database, data is collected for multiple transactional purposes. In a data warehouse, however, data is collected on an extensive scale to perform analytics. Databases provide real-time data, while warehouses store data to be accessed for big analytical queries.
Data warehousing integrates data and information collected from various sources into one comprehensive database. For example, a data warehouse might combine customer information from an organization’s point-of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential information about employees, salary information, etc. Businesses use these components of a data warehouse to analyze customers.
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns in
vast volumes of data and devising innovative strategies for increased sales and profits.
Data Mart
A data mart is a subset of a data warehouse built to serve a particular department, region, or business unit. Every department of a business has a central repository or data mart to store data. The data from the data mart is stored in the operational data store (ODS) periodically. The ODS then sends the data to the enterprise data warehouse (EDW), where it is stored and used.
Summary
In this chapter you learned the data science process consists of six steps:
• Setting the research goal—Defining the what, the why, and the how of your project in a project charter.
• Retrieving data—Finding and getting access to the data needed in your project. This data is either found within the company or retrieved from a third party.
• Data preparation—Checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.
• Data exploration—Diving deeper into your data using descriptive statistics and visual techniques.
• Data modeling—Using machine learning and statistical techniques to achieve your project goal.
• Presentation and automation—Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.