
UNIT I

Basic Introduction to Data Mining

The amount of data kept in computer files and databases is growing at a phenomenal rate. At the same time, the users of these data are expecting more sophisticated information from them. A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers' past purchases as well as predictions for future purchases. Simple structured query language (SQL) queries are not adequate to support these increased demands for information. Data mining steps in to meet these needs. Data mining is often defined as finding hidden information in a database. Alternatively, it has been called exploratory data analysis, data driven discovery, and deductive learning. Traditional database queries (Figure 1.1) access a database using a well-defined query stated in a language such as SQL. The output of that query consists of the data from the database that satisfies the query. The output is usually a subset of the database, but it may also be an extracted view or may contain aggregations. Data mining access of a database differs from this traditional access in several ways:

• Query: The query might not be well formed or precisely stated. The
data miner might not even be exactly sure of what he wants to see.
• Data: The data accessed is usually a different version from that of the
original operational database. The data have been cleansed and modified to
better support the mining process.
• Output: The output of the data mining query probably is not a subset of
the database. Instead it is the output of some analysis of the contents of
the database.

The current state of the art of data mining is similar to that of database query processing in the late 1960s and early 1970s. Over the next decade there undoubtedly will be great strides in extending the state of the art. Although data mining is currently in its infancy, over the last decade we have seen a proliferation of mining algorithms, applications, and algorithmic approaches. Example 1.1 illustrates one such application.

EXAMPLE 1.1
Credit card companies must determine whether to authorize credit card
purchases. Suppose that based on past historical information about
purchases, each purchase is placed into one of four classes: (1) authorize,
(2) ask for further identification before authorization, (3) do not
authorize, and (4) do not authorize but contact police. The data mining
functions here are twofold. First the historical data must be examined to
determine how the data fit into the four classes. Then the problem is to
apply this model to each new purchase. Although the second part indeed
may be stated as a simple database query, the first part cannot be.

Data mining involves many different algorithms to accomplish different tasks. All of these algorithms attempt to fit a model to the data. The algorithms examine the data and determine a model that is closest to the characteristics of the data being examined.
Data mining algorithms can be characterized as consisting of three parts:
• Model: The purpose of the algorithm is to fit a model to the data.
• Preference: Some criteria must be used to prefer one model over another.
• Search: All algorithms require some technique to search the data.
In Example 1.1 the data are modeled as divided into four classes. The
search requires examining past data about credit card purchases and their
outcome to determine what criteria should be used to define the class
structure. The preference will be given to criteria that seem to fit the
data best. For example, we probably would want to authorize
a credit card purchase for a small amount of money with a credit card
belonging to a long-standing customer. Conversely, we would not want to
authorize the use of a credit card to purchase anything if the card has
been reported as stolen. The search process requires that the criteria
needed to fit the data to the classes be properly defined.
As seen in Figure 1.2, the model that is created can be either
predictive or descriptive in nature. In this figure, we show under each
model type some of the most common data mining tasks that use that type
of model.

A predictive model makes a prediction about values of data using known results found from different data. Predictive modeling may be based on the use of other historical data. For example, a credit card use might be refused not because of the user's own credit history, but because the current purchase is similar to earlier purchases that were subsequently found to be made with stolen cards. Example 1.1 uses predictive modeling to predict credit risk. Predictive model data mining tasks include classification, regression, time series analysis, and prediction. Prediction may also be used to indicate a specific type of data mining function. A descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are usually viewed as descriptive in nature.

BASIC DATA MINING TASKS


In the following paragraphs we briefly explore some of the data
mining functions. We follow the basic outline of tasks shown in Figure 1.2.
This list is not intended to be exhaustive, but rather illustrative. Of
course, these individual tasks may be combined to obtain more
sophisticated data mining applications.

Classification

Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data. Two examples of classification applications are determining whether to make a bank loan and identifying credit risks. Classification algorithms require that the classes be defined based on data attribute values. They often describe these classes by looking at the characteristics of data already known to belong to the classes. Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes. Example 1.1 illustrates a general classification problem. Example 1.2 shows a simple example of pattern recognition.

EXAMPLE 1.2
An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each
passenger is scanned and its basic pattern (distance between eyes, size
and shape of mouth, shape of head, etc.) is identified. This pattern is
compared to entries in a database to see if it matches any patterns that
are associated with known offenders.
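The matching step in Example 1.2 can be viewed as nearest-neighbor classification: each face is reduced to a small feature vector and compared, by Euclidean distance, against stored patterns of known offenders. The following is only a minimal sketch; the feature names, threshold, and data are illustrative assumptions, not part of the example.

    import math

    # Hypothetical stored patterns: (eye_distance_cm, mouth_width_cm, head_shape_index)
    known_offenders = {
        "offender_A": (6.1, 5.0, 0.82),
        "offender_B": (5.4, 4.2, 0.91),
    }

    def euclidean(p, q):
        """Distance between two feature vectors."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def classify_passenger(features, threshold=0.5):
        """Return the closest known pattern if it is within the threshold, else 'no match'."""
        best_name, best_dist = min(
            ((name, euclidean(features, pattern)) for name, pattern in known_offenders.items()),
            key=lambda item: item[1],
        )
        return best_name if best_dist <= threshold else "no match"

    print(classify_passenger((6.0, 4.9, 0.80)))   # likely matches offender_A
    print(classify_passenger((4.0, 3.0, 0.50)))   # no match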
Regression
Regression is used to map a data item to a real-valued prediction variable. In actuality, regression involves learning the function that performs this mapping. Regression assumes that the target data fit into some known type of function (e.g., linear, logistic, etc.) and then determines the best function of this type that models the given data. Some type of error analysis is used to determine which function is "best." Standard linear regression, as illustrated in Example 1.3, is a simple example of regression.

EXAMPLE 1.3
A college professor wishes to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be based on their current value and several past values. She uses a simple linear regression formula to predict this value by fitting past behavior to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.
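A minimal sketch of the simple linear regression used in Example 1.3: past savings values are fitted to a line by least squares and the line is extrapolated. The savings figures below are made up purely for illustration.

    # Least-squares fit of savings (y) against year (x), then extrapolation.
    years   = [0, 1, 2, 3, 4]                # past observation points
    savings = [50.0, 57.0, 66.0, 71.0, 80.0] # hypothetical savings in thousands

    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(savings) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, savings)) / \
            sum((x - mean_x) ** 2 for x in years)
    intercept = mean_y - slope * mean_x

    def predict(year):
        return intercept + slope * year

    print(predict(10))  # predicted savings ten years after the first observation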

Time Series Analysis

With time series analysis, the value of an attribute is examined as it varies over time. The values usually are obtained at evenly spaced time points (daily, weekly, hourly, etc.). A time series plot (Figure 1.3) is used to visualize the time series. In this figure you can easily see that the plots for Y and Z have similar behavior, while X appears to have less volatility. There are three basic functions performed in time series analysis. In one case, distance measures are used to determine the similarity between different time series. In the second case, the structure of the line is examined to determine (and perhaps classify) its behavior. A third application is to use the historical time series plot to predict future values. A time series example is given in Example 1.4.
EXAMPLE 1.4
Mr. Smith is trying to determine whether to purchase stock from Companies X, Y, or Z. For a period of one month he charts the daily stock price for each company. Figure 1.3 shows the time series plot that Mr. Smith has generated. Using this and similar information available from his stockbroker, Mr. Smith decides to purchase stock X because it is less volatile while overall showing a slightly larger relative amount of growth than either of the other stocks.
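Two of the three time series functions mentioned above, measuring similarity between series and forecasting a future value, can be sketched as follows. The daily prices are invented for illustration and are not the data behind Figure 1.3.

    import math

    # Hypothetical daily closing prices for three stocks over one week.
    X = [10.0, 10.1, 10.2, 10.2, 10.4]
    Y = [20.0, 21.5, 19.0, 22.0, 20.5]
    Z = [30.0, 31.4, 29.1, 32.0, 30.6]

    def distance(a, b):
        """Euclidean distance between two equal-length series (a simple similarity measure)."""
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    def naive_forecast(series):
        """Forecast the next value by extending the average daily change."""
        changes = [b - a for a, b in zip(series, series[1:])]
        return series[-1] + sum(changes) / len(changes)

    print(distance(Y, Z) < distance(X, Z))  # Y behaves more like Z than X does
    print(naive_forecast(X))                # predicted next value of X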

Prediction
Many real-world data mining applications can be seen as predicting future
data states based on past and current data. Prediction can be viewed as a
type of classification. (Note: This is a data mining task that is different
from the prediction model, although the prediction task is a type of
prediction model.) The difference is that prediction is predicting a future
state rather than a current state. Here we are referring to a type of
application rather than to a type of data mining modeling approach, as
discussed earlier. Prediction applications include flooding, speech
recognition, machine learning, and pattern recognition.
Although future values may be predicted using time series analysis or
regression techniques, other approaches may be used as well. Example 1.5
illustrates the process.

EXAMPLE 1.5
Predicting flooding is a difficult problem. One approach uses monitors
placed at various points in the river. These monitors collect data relevant
to flood prediction: water level, rain amount, time, humidity, and so on.
Then the water level at a potential flooding point in the river can be
predicted based on the data collected by the sensors upriver from this
point. The prediction must be made with respect to the time the data
were collected.
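One plausible way to realize the prediction in Example 1.5 is to regress the downstream water level on an earlier upstream reading, honoring the travel time of the water. This is only a sketch under stated assumptions; the sensor readings and the lag are invented.

    # Predict the water level at a downstream point from the upstream sensor reading
    # taken `lag` hours earlier (assumed travel time of the water).
    upstream   = [1.0, 1.2, 1.5, 2.0, 2.6, 3.1]  # hypothetical hourly upstream levels (m)
    downstream = [0.9, 1.0, 1.1, 1.4, 1.8, 2.3]  # hourly downstream levels (m)
    lag = 2                                      # assumed hours for water to travel downstream

    pairs = list(zip(upstream[:-lag], downstream[lag:]))  # (earlier upstream, later downstream)
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / \
            sum((x - mean_x) ** 2 for x, _ in pairs)
    intercept = mean_y - slope * mean_x

    latest_upstream = upstream[-1]
    print(intercept + slope * latest_upstream)  # downstream level expected `lag` hours from now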

Clustering
Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint. The clustering is usually accomplished by determining the similarity among the data on predefined attributes. The most similar data are grouped into clusters. Example 1.6 provides a simple clustering example. Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters.

EXAMPLE 1.6
A certain national department store chain creates special catalogs
targeted to various demographic groups based on attributes such as
income, location, and physical characteristics of potential customers (age,
height, weight, etc.). To determine the target mailings of the various
catalogs and to assist in the creation of new, more specific catalogs, the
company performs a clustering of potential customers based on the
determined attribute values. The results of the clustering exercise are
then used by management to create special catalogs and distribute them
to the correct target population based on the cluster for that catalog.
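The catalog example can be sketched with a tiny k-means style clustering over two of the attributes named above (income and age). The customer data, the value of k, and the number of refinement passes are assumptions for illustration only.

    import random

    # Hypothetical customers described by (income in thousands, age).
    customers = [(30, 25), (32, 27), (85, 45), (90, 50), (60, 35), (58, 33)]
    k = 3

    random.seed(1)
    centers = random.sample(customers, k)          # start from k random customers

    def nearest(point, centers):
        """Index of the closest center by squared distance."""
        return min(range(k), key=lambda i: (point[0] - centers[i][0]) ** 2 +
                                           (point[1] - centers[i][1]) ** 2)

    for _ in range(10):                            # a few refinement passes
        clusters = [[] for _ in range(k)]
        for c in customers:
            clusters[nearest(c, centers)].append(c)
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]

    print(clusters)  # groups a domain expert could map to catalog types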

A special type of clustering is called segmentation. With segmentation a database is partitioned into disjoint groupings of similar tuples called segments. Segmentation is often viewed as being identical to clustering. In other circles segmentation is viewed as a specific type of clustering applied to a database itself. In this text we use the two terms, clustering and segmentation, interchangeably.

Summarization
Summarization maps data into subsets with associated simple
descriptions. Summarization is also called characterization or
generalization. It extracts or derives representative information about
the database. This may be accomplished by actually retrieving portions
of the data. Alternatively, summary type information (such as the mean of
some numeric attribute) can be derived from the data. The summarization
succinctly characterizes the contents of the database. Example 1.7
illustrates this process.
EXAMPLE 1.7
One of the many criteria used to compare universities by the U.S. News &
World Report is the average SAT or ACT score [GM99]. This is a
summarization used to estimate the type and intellectual level of the
student body.
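A minimal sketch of summarization in the spirit of Example 1.7: derive the mean of a numeric attribute for each group in a small dataset. The universities and scores are made up for illustration.

    # Hypothetical (university, SAT score) records; the mean per university summarizes the data.
    records = [("Univ_A", 1210), ("Univ_A", 1350), ("Univ_B", 1050), ("Univ_B", 1110), ("Univ_B", 990)]

    totals = {}
    counts = {}
    for university, score in records:
        totals[university] = totals.get(university, 0) + score
        counts[university] = counts.get(university, 0) + 1

    summary = {u: totals[u] / counts[u] for u in totals}
    print(summary)  # e.g. {'Univ_A': 1280.0, 'Univ_B': 1050.0}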

Association Rules
Link analysis, alternatively referred to as affinity analysis or association,
refers to the data mining task of uncovering relationships among data. The
best example of this type of application is to determine association rules.
An association rule is a model that identifies specific types of data
associations. These associations are often used in the retail sales
community to identify items that are frequently purchased together.
Associations are also used in many other applications such as
predicting the failure of telecommunication switches.

EXAMPLE 1.8
A grocery store retailer is trying to decide whether to put bread on sale.
To help determine the impact of this decision, the retailer generates
association rules that show what other products are frequently purchased with bread. He finds that 60% of the time that bread is sold so are pretzels and that 70% of the time jelly is also sold. Based on these facts,
he tries to capitalize on the association between bread, pretzels, and jelly
by placing some pretzels and jelly at the end of the aisle where the bread
is placed. In addition, he decides not to place either of these items on sale
at the same time.
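The percentages quoted in Example 1.8 are the confidences of association rules. A minimal sketch of computing the support of an itemset and the confidence of bread → pretzels and bread → jelly follows; the transactions are made up, so the numbers only roughly echo the example.

    # Hypothetical market-basket transactions.
    transactions = [
        {"bread", "pretzels", "jelly"},
        {"bread", "jelly"},
        {"bread", "pretzels", "jelly", "milk"},
        {"bread", "pretzels"},
        {"bread", "jelly"},
        {"milk", "pretzels"},
    ]

    def support(itemset):
        """Fraction of all transactions containing every item in `itemset`."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        """Fraction of transactions containing the antecedent that also contain the consequent."""
        return support(antecedent | consequent) / support(antecedent)

    print(confidence({"bread"}, {"pretzels"}))  # 0.6: 60% of bread sales include pretzels
    print(confidence({"bread"}, {"jelly"}))     # 0.8 with this invented data (the example quotes 70%)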

Users of association rules must be cautioned that these are not causal relationships. They do not represent any relationship inherent in the actual data (as is true with functional dependencies) or in the real world.
There probably is no relationship between bread and pretzels that causes
them to be purchased together. And there is no guarantee that this
association will apply in the future. However, association rules can be used
to assist retail store management in effective advertising, marketing, and
inventory control.
Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential
patterns in data. These patterns are based on a time sequence of actions.
These patterns are similar to associations in that data (or events) are
found to be related, but the relationship is based
on time. Unlike a market basket analysis, which requires the items to be
purchased at the same time, in sequence discovery the items are
purchased over time in some order.

EXAMPLE 1.9
The Webmaster at the XYZ Corp. periodically analyzes the Web log data
to determine how users of the XYZ's Web pages access them. He is
interested in determining what sequences of pages are frequently
accessed. He determines that 70 percent of the users of page A follow one of the following patterns of behavior: (A, B, C), (A, D, B, C), or (A, E, B, C). He then decides to add a link directly from page A to page C.
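The Web-log analysis of Example 1.9 can be sketched by counting how often each ordered page sequence is followed among sessions that start at page A. The sessions below are invented for illustration.

    from collections import Counter

    # Hypothetical page-access sessions (ordered by time).
    sessions = [
        ["A", "B", "C"],
        ["A", "D", "B", "C"],
        ["A", "E", "B", "C"],
        ["A", "B", "C"],
        ["A", "F"],
    ]

    pattern_counts = Counter(tuple(s) for s in sessions if s and s[0] == "A")
    total = sum(pattern_counts.values())

    for pattern, count in pattern_counts.most_common():
        print(pattern, f"{count / total:.0%}")
    # Patterns ending in ... B, C dominate, supporting a direct link from A to C.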

Data Mining vs. Knowledge Discovery in Databases

The terms knowledge discovery in databases (KDD) and data mining are
often used interchangeably. In fact, there have been many other names
given to this process of discovering useful (hidden) patterns in data:
knowledge extraction, information discovery, exploratory data analysis,
information harvesting, and unsupervised pattern recognition.
Over the last few years KDD has been used to refer to a process
consisting of many steps, while data mining is only one of these steps.
DEFINITION 1.1. Knowledge discovery in databases (KDD) is the
process of finding useful information and patterns in data.

DEFINITION 1.2. Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.

The KDD process is often said to be nontrivial; however, we take the larger view that KDD is an all-encompassing concept. A traditional SQL database query can be viewed as the data mining part of a KDD process.
Indeed, this may be viewed as somewhat simple and trivial. However, this
was not the case 30 years ago. If we were to advance 30 years into the
future, we might find that processes thought of today as nontrivial and
complex will be viewed as equally simple. The definition of KDD includes
the keyword useful. Although some definitions have included the term
"potentially useful," we believe that if the information found in the
process is not useful, then it really is not information. Of course, the idea
of being useful is relative and depends on the individuals involved. KDD is a
process that involves many different steps. The input to this process is
the data, and the output is the useful information desired by the users.
However, the objective may be unclear or inexact. The process itself is
interactive and may require much elapsed time. To ensure the usefulness
and accuracy of the results of the process, interaction throughout the
process with both domain experts and technical experts might be needed.
Figure 1.4 indicates the overall KDD process.

The KDD process consists of the following five steps:

• Selection: The data needed for the data mining process may be
obtained from many different and heterogeneous data sources. This first
step obtains the data from various databases, files, and non-electronic sources.
• Preprocessing: The data to be used by the process may have incorrect
or missing data. There may be anomalous data from multiple sources
involving different data types and metrics. There may be many different
activities performed at this time. Erroneous data may be corrected or
removed, whereas missing data must be supplied or predicted (often using
data mining tools).
• Transformation: Data from different sources must be converted into a
common format for processing. Some data may be encoded or transformed
into more usable formats. Data reduction may be used to reduce the
number of possible data values being considered.
• Data mining: Based on the data mining task being performed, this step
applies algorithms to the transformed data to generate the desired
results.
• Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results depends on it. Various visualization and GUI strategies are used at this last step.

Transformation techniques are used to make the data easier to mine and more useful, and to provide more meaningful results. The actual distribution of the data may be modified. Some attribute values may be combined to provide new values, thus reducing the complexity of the data. For example, current date and birth date could be replaced by age. One attribute could be substituted for another; an example would be replacing a sequence of actual attribute values with the differences between consecutive values. Real-valued attributes may be more easily handled by partitioning the values into ranges and using these discrete range values. Some data values may actually be removed; outliers, extreme values that occur infrequently, may be discarded. The data may also be transformed by applying a function to the values. A common transformation function is to use the log of the value rather than the value itself. These techniques make the mining task easier by reducing the dimensionality (number of attributes) or by reducing the variability of the data values. The removal of outliers can actually improve the quality of the results. As with all steps in the KDD process, however, care must be used in performing transformation. If used incorrectly, the transformation could actually change the data such that the results of the data mining step are inaccurate.
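A minimal sketch of the transformation techniques just described: replacing a birth date by age, discretizing a real-valued attribute into ranges, and applying a log transform. The record layout and range boundaries are assumptions made only for illustration.

    import math
    from datetime import date

    # Hypothetical raw record: (birth_date, annual_income, purchase_amount)
    record = (date(1980, 5, 17), 54000.0, 1250.0)

    birth_date, income, amount = record

    age = (date.today() - birth_date).days // 365          # combine attributes: birth date -> age
    income_range = "low" if income < 30000 else \
                   "medium" if income < 80000 else "high"  # discretize a real-valued attribute
    log_amount = math.log(amount)                          # log transform to reduce variability

    transformed = (age, income_range, log_amount)
    print(transformed)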
Visualization refers to the visual presentation of data. The old expression "a picture is worth a thousand words" certainly is true when examining the structure of data. For example, a line graph that shows the distribution of a data variable is easier to understand and perhaps more informative than the formula for the corresponding distribution. The use of visualization techniques allows users to summarize, extract, and grasp more complex results than more mathematical or text-type descriptions of the results. Visualization techniques include:

• Graphical: Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.
• Geometric: Geometric techniques include the box plot and scatter diagram.
• Icon-based: Using figures, colors, or other icons can improve the
presentation of the results.
• Pixel-based: With these techniques each data value is shown as a
uniquely colored pixel.
• Hierarchical: These techniques hierarchically divide the display area
(screen) into regions based on data values.
• Hybrid: The preceding approaches can be combined into one display.

Any of these approaches may be two-dimensional or three-dimensional. Visualization tools can be used to summarize data as a data
mining technique itself. In addition, visualization can be used to show the
complex results of data mining tasks. The data mining process itself is
complex. As we will see in later chapters, there are many different data
mining applications and algorithms. These algorithms must be carefully
applied to be effective. Discovered patterns must be correctly
interpreted and properly evaluated to ensure that the resulting
information is meaningful and accurate.

The Development of Data Mining


The current evolution of data mining functions and products is the result of years of influence from many disciplines, including databases, information retrieval, statistics, algorithms, and machine learning, as indicated in the following figure. Another computer science area that has had a major impact on the KDD process is multimedia and graphics. A major goal of KDD is to be able to describe the results of the KDD process in a meaningful manner. Because many different results are often produced, this is a nontrivial problem. Visualization techniques often involve sophisticated multimedia and graphics presentations. In addition, data mining techniques can be applied to multimedia applications.
Unlike previous research in these disparate areas, a major trend in the database community is to combine results from these seemingly different disciplines into one unifying data or algorithmic approach. For example:
• One open issue is how to define a data mining query and whether a query language (like SQL) can be developed to capture the many different types of data mining queries.
• Describing a large database can be viewed as using approximation to help
uncover hidden information about the data.
• When dealing with large databases, the impact of size and efficiency of developing an abstract model can be thought of as a type of search problem.
It is interesting to think about the various data mining problems and how each may be viewed from several different perspectives based on the viewpoint and background of the researchers/developers.

There are many important implementation issues associated with data mining:

1. Human interaction: Since data mining problems are often not precisely
stated, interfaces may be needed with both domain and technical experts.
Technical experts are used to formulate the queries and assist in
interpreting the results. Users are needed to identify training data and
desired results.
2. Overfitting: When a model is generated that is associated with a given database state, it is desirable that the model also fit future database states. Overfitting occurs when the model does not fit future states. This may be caused by assumptions that are made about the data or may simply be caused by the small size of the training database.
3. Outliers: There are often many data entries that do not fit nicely into the derived model. This becomes even more of an issue with very large databases. If a model is developed that includes these outliers, then the model may not behave well for data that are not outliers.
4. Interpretation of results: Currently, data mining output may require
experts to correctly interpret the results, which might otherwise be
meaningless to the average database user.
5. Visualization of results: To easily view and understand the output of
data mining algorithms, visualization of the results is helpful.
6. Large datasets: The massive datasets associated with data mining
create problems when applying algorithms designed for small datasets.
Many modeling applications grow exponentially with the dataset size and thus are too inefficient for larger datasets. Sampling and parallelization are effective tools to attack this scalability problem.
7. High dimensionality: A conventional database schema may be composed
of many different attributes. The problem here is that not all attributes
may be needed to solve a given data mining problem. In fact, the use of
some attributes may interfere with the correct completion of a data
mining task. The use of other attributes may simply increase the overall
complexity and decrease the efficiency of an algorithm. This problem is
sometimes referred to as the dimensionality curse, meaning that there are
many attributes (dimensions) involved and it is difficult to determine
which ones should be used. One solution to this high dimensionality
problem is to reduce the number of attributes, which is known as
dimensionality reduction. However, determining which attributes are not needed is not always easy to do.
8. Multimedia data: Most previous data mining algorithms are targeted to
traditional data types (numeric, character, text, etc.). The use of
multimedia data such as is found in GIS databases complicates or
invalidates many proposed algorithms.
9. Missing data: During the preprocessing phase of KDD, missing data
may be replaced with estimates. This and other approaches to handling
missing data can lead to invalid results in the data mining step.
10. Irrelevant data: Some attributes in the database might not be of
interest to the data mining task being developed.
11. Noisy data: Some attribute values might be invalid or incorrect.
These values are often corrected before running data mining applications.
12. Changing data: Databases cannot be assumed to be static. However,
most data mining algorithms do assume a static database. This requires
that the algorithm be completely rerun anytime the database changes.
13. Integration: The KDD process is not currently integrated into normal
data processing activities. KDD requests may be treated as special,
unusual, or one-time needs. This makes them inefficient, ineffective, and
not general enough to be used on an ongoing basis. Integration of data
mining functions into traditional DBMS systems is certainly a desirable
goal.
14. Application: Determining the intended use for the information
obtained from the data mining function is a challenge. Indeed, how
business executives can effectively use the output is sometimes
considered the more difficult part, not the running of the algorithms
themselves. Because the data are of a type that has not previously
been known, business practices may have to be modified to determine how
to effectively use the information uncovered.
These issues should be addressed by data mining algorithms and products.

INFORMATION RETRIEVAL

Information retrieval (IR) (and more recently digital libraries and Internet searching) involves retrieving desired information from textual data. The historical development of IR was based on effective use of libraries, so a typical IR request would be to find all library documents related to a particular subject, for example "data mining." This is, in fact, a classification task because the set of documents in the library is divided into classes based on the keywords involved. In IR systems, documents are
represented by document surrogates consisting of data, such as
identifiers, title, authors, dates, abstracts, extracts, review, and
keywords. As can be seen, the data consist of both formatted and
unformatted (text) data. The retrieval of documents is based on
calculation of a similarity measure showing how close each document is to
the desired results (i.e., the stated query). Similarity measures are also
used in classification and clustering problems.
An IR system consists of a set of documents D = {D1, ..., Dn}. The input is a query, q, often stated as a list of keywords. The similarity between the query and each document is then calculated: sim(q, Di). This similarity measure is a set membership function describing the likelihood that the document is of interest (relevant) to the user, based on the user's interest as stated by the query. The effectiveness of the system in processing the query is often measured by looking at precision and recall:

Precision = |Relevant and Retrieved| / |Retrieved|
Recall = |Relevant and Retrieved| / |Relevant|

Precision is used to answer the question: "Are all documents retrieved ones that I am interested in?" Recall answers: "Have all relevant
documents been retrieved?" Here a document is relevant if it should have
been retrieved by the query. Figure 2.5 illustrates the four possible query
results available with IR queries. Of these four quadrants, two represent
desirable outcomes: relevant and retrieved or not relevant and not
retrieved. The other two quadrants represent error situations. Documents
that are relevant and not retrieved should have been retrieved but were
not. Documents that are not relevant and retrieved should not have been
retrieved but were. Figure 2.6 illustrates the basic structure of a
conventional information retrieval query. Many similarity measures have
been proposed for use in information retrieval. As stated earlier, sim(q, Di) is used to determine the results of a query q applied to a set of documents D = {D1, ..., Dn}. Similarity measures may also be used to cluster or classify documents by calculating sim(Di, Dj) for all documents in the database. Thus, similarity can be used for document-document, query-query, and query-document measurements.
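A minimal sketch of these IR measures: a simple keyword-overlap similarity sim(q, Di) used to decide which documents are retrieved, followed by precision and recall computed from the retrieved and relevant sets. The documents, keywords, and relevance judgments are assumptions for illustration.

    # Document surrogates reduced to keyword sets (hypothetical).
    documents = {
        "D1": {"data", "mining", "classification"},
        "D2": {"library", "catalog"},
        "D3": {"data", "warehousing"},
    }

    def sim(query, doc_keywords):
        """Simple overlap similarity between query keywords and document keywords."""
        return len(query & doc_keywords) / len(query | doc_keywords)

    query = {"data", "mining"}
    retrieved = {name for name, kw in documents.items() if sim(query, kw) > 0}
    relevant = {"D1"}                        # assumed human relevance judgments

    precision = len(relevant & retrieved) / len(retrieved)
    recall = len(relevant & retrieved) / len(relevant)
    print(retrieved, precision, recall)      # {'D1', 'D3'}, 0.5, 1.0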

DECISION SUPPORT SYSTEMS (DSS)


Decision support systems (DSS) are comprehensive computer systems and related tools that assist managers in making decisions and solving problems. The goal is to improve the decision-making process by providing the specific information needed by management. These systems differ from traditional database management systems in that more ad hoc queries and customized information may be provided. Recently, the terms executive information system (EIS) and executive support system (ESS) have also been used. These systems aim at developing the business structure and computer techniques needed to better provide information required by management to make effective business decisions. Data mining can be thought of as a suite of tools that assist in the overall DSS process; that is, a DSS may use data mining tools.
In many ways the term DSS is much broader than the term data mining. While a DSS usually contains data mining tools, this need not be so. Likewise, a data mining tool need not be contained in a DSS. A decision support system could be enterprise-wide, thus giving upper-level managers the data needed to make intelligent business decisions that impact the entire company. A DSS typically operates using data warehouse data. Alternatively, a DSS could be built around a single user and a PC. The bottom line is that the DSS gives managers the tools needed to make intelligent decisions.

Essentials of Algorithms & Data Structures

In mathematics and computer science, an algorithm is a finite sequence of well-defined instructions, typically used to solve a class of
specific problems or to perform a computation. Algorithms are used as
specifications for performing calculations, data processing, automated
reasoning, automated decision-making and other tasks. In contrast,
a heuristic is an approach to problem solving that may not be fully
specified or may not guarantee correct or optimal results, especially in
problem domains where there is no well-defined correct or optimal result.
As an effective method, an algorithm can be expressed within a finite
amount of space and time and in a well-defined formal language for
calculating a function. Starting from an initial state and initial input
(perhaps empty), the instructions describe a computation that,
when executed, proceeds through a finite number of well-defined
successive states, eventually producing "output" and terminating at a final
ending state. The transition from one state to the next is not
necessarily deterministic; some algorithms, known as randomized
algorithms, incorporate random input.

Informal Definition
An informal definition could be "a set of rules that precisely defines
a sequence of operations" which would include all computer programs
(including programs that do not perform numeric calculations), and (for
example) any prescribed bureaucratic procedure or cook-book recipe.
In general, a program is only an algorithm if it stops eventually even
though infinite loops may sometimes prove desirable. Algorithms are
essential to the way computers process data. Many computer programs
contain algorithms that detail the specific instructions a computer should
perform—in a specific order—to carry out a specified task, such as
printing students' report cards.
In computer systems, an algorithm is basically an instance
of logic written in software by software developers, to be effective for
the intended "target" computer(s) to produce output from given (perhaps
null) input. An optimal algorithm, even running on old hardware, would produce faster results than a non-optimal (higher time complexity) algorithm for the same purpose running on more efficient hardware; that is why algorithms, like computer hardware, are considered technology.
Algorithm example
One of the simplest algorithms is to find the largest number in a list of
numbers of random order. Finding the solution requires looking at every
number in the list. From this follows a simple algorithm, which can be
stated in a high-level description in English as
High-level description:

1. If there are no numbers in the set then there is no highest number.


2. Assume the first number in the set is the largest number in the set.
3. For each remaining number in the set: if this number is larger than
the current largest number, consider this number to be the largest
number in the set.
4. When there are no numbers left in the set to iterate over, consider
the current largest number to be the largest number of the set.
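The four-step high-level description above translates almost directly into code; here is a minimal sketch in Python, with the steps marked in comments.

    def find_largest(numbers):
        """Return the largest number in the list, or None if the list is empty."""
        if not numbers:                 # step 1: no numbers, no highest number
            return None
        largest = numbers[0]            # step 2: assume the first number is the largest
        for n in numbers[1:]:           # step 3: scan the remaining numbers
            if n > largest:
                largest = n
        return largest                  # step 4: nothing left to examine

    print(find_largest([7, 3, 19, 4, 11]))  # 19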

Data Structure
In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data; i.e., it is an algebraic structure about data.
Data structures serve as the basis for abstract data types (ADT).
The ADT defines the logical form of the data type. The data structure
implements the physical form of the data type.
Different types of data structures are suited to different kinds of
applications, and some are highly specialized to specific tasks. For
example, relational databases commonly use B-tree indexes for data
retrieval, while compiler implementations usually use hash tables to look up
identifiers. Data structures provide a means to manage large amounts of
data efficiently for uses such as large databases and internet indexing
services. Usually, efficient data structures are key to designing
efficient algorithms. Some formal design methods and programming
languages emphasize data structures, rather than algorithms, as the key
organizing factor in software design. Data structures can be used to
organize the storage and retrieval of information stored in both main
memory and secondary memory.

Implementation

Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by a pointer—a
bit string, representing a memory address, that can be itself stored in
memory and manipulated by the program. Thus, the array and record data
structures are based on computing the addresses of data items
with arithmetic operations, while the linked data structures are based on
storing addresses of data items within the structure itself.
The implementation of a data structure usually requires writing a set
of procedures that create and manipulate instances of that structure. The
efficiency of a data structure cannot be analyzed separately from those
operations. There are numerous types of data structures, generally built
upon simpler primitive data types. Well known examples are

• A byte is the smallest amount of data that a computer CPU can copy from memory to a register or back in a single CPU instruction; consequently, a byte stream is an efficient way to move large amounts of data through a computer (hence stream processing).
• An array is a number of elements in a specific order, typically all of the same type. Elements are accessed using an integer index to specify which element is required. Typical implementations allocate contiguous memory words for the elements of arrays. Arrays may be fixed-length or resizable.
• A linked list (also just list) is a linear collection of data elements of any
type, called nodes, where each node has itself a value, and points to the
next node in the linked list. The principal advantage of a linked list over
an array is that values can always be efficiently inserted and removed
without relocating the rest of the list.
• A record (also called tuple or struct) is an aggregate data structure. A
record is a value that contains other values, typically in fixed number
and sequence and typically indexed by names. The elements of records
are usually called fields or members.
• A union is a data structure that specifies which of a number of
permitted primitive types may be stored in its instances,
e.g. float or long integer. Contrast with a record, which could be
defined to contain a float and an integer; whereas in a union, there is
only one value at a time. Enough space is allocated to contain the widest
member data-type.
• A tagged union (also called variant, variant record, discriminated union,
or disjoint union) contains an additional field indicating its current type,
for enhanced type safety.
• An object is a data structure that contains data fields, like a record
does, as well as various methods which operate on the data contents. An
object is an in-memory instance of a class from a taxonomy. In the
context of object-oriented programming, records are known as plain old data structures to distinguish them from objects.
In addition, hashes, graphs and binary trees are other commonly used data
structures.
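As a small illustration of two of the structures listed above, here is a record (written as a Python dataclass) and a singly linked list built from nodes that each hold a value and a reference to the next node. The field names and sample data are arbitrary.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Customer:                 # a record: fixed, named fields
        name: str
        balance: float

    @dataclass
    class Node:                     # a linked-list node: a value plus a pointer to the next node
        value: Customer
        next: Optional["Node"] = None

    # Build a two-element linked list and traverse it.
    head = Node(Customer("Alice", 120.0), Node(Customer("Bob", 75.5)))
    current = head
    while current is not None:
        print(current.value.name, current.value.balance)
        current = current.next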

Language support

Most assembly languages and some low-level languages, such as BCPL (Basic Combined Programming Language), lack built-in support for
data structures. On the other hand, many high-level programming
languages and some higher-level assembly languages, such as MASM, have
special syntax or other built-in support for certain data structures, such
as records and arrays. For example, the C (a direct descendant of BCPL)
and Pascal languages support structs and records, respectively, in addition
to vectors (one-dimensional arrays) and multi-dimensional arrays. Most
programming languages feature some sort of library mechanism that allows
data structure implementations to be reused by different programs.
Modern languages usually come with standard libraries that implement the
most common data structures. Examples are the C++ Standard Template
Library, the Java Collections Framework, and the Microsoft .NET
Framework. Modern languages also generally support modular programming,
the separation between the interface of a library module and its
implementation. Some provide opaque data types that allow clients to hide
implementation details. Object-oriented programming languages, such
as C++, Java, and Smalltalk, typically use classes for this purpose. Many
known data structures have concurrent versions which allow multiple
computing threads to access a single concrete instance of a data structure
simultaneously.
Software Engineering trends & techniques

Software engineering is the systematic application of engineering approaches to the development of software. A software
engineer is a person who applies the principles of software engineering to
design, develop, maintain, test, and evaluate computer software. The
term programmer is sometimes used as a synonym, but may also lack
connotations of engineering education or skills.
Engineering techniques are used to inform the software development
process which involves the definition, implementation, assessment,
measurement, management, change, and improvement of the software life
cycle process itself. It heavily uses software configuration
management which is about systematically controlling changes to the
configuration, and maintaining the integrity and traceability of the
configuration and code throughout the system life cycle. Modern
processes use software versioning. In software engineering, a software development process is the process of dividing software development work into smaller, parallel or sequential steps or sub-processes to improve design and product management. It is also known as a software
development life cycle (SDLC). The methodology may include the pre-
definition of specific deliverables and artifacts that are created and
completed by a project team to develop or maintain an application.
Most modern development processes can be vaguely described as agile.
Other methodologies include waterfall, prototyping, iterative and
incremental development, spiral development, rapid application
development, and extreme programming.
A life-cycle "model" is sometimes considered a more general term
for a category of methodologies and a software development "process" a
more specific term to refer to a specific process chosen by a specific
organization.[citation needed] For example, there are many specific
software development processes that fit the spiral life-cycle model. The
field is often considered a subset of the systems development life cycle.
The basic principles of software engineering are applied in the development of algorithms, applications, and tools in data mining.
