Practical Data Mining 1st Edition Monte F. Hancock Jr 2024 scribd download


Practical Data Mining 1st Edition Monte F. Hancock Jr
Author(s): Monte F. Hancock Jr
ISBN(s): 9781439868379, 1439868379
Edition: 1
File Details: PDF, 3.96 MB
Year: 2011
Language: English
Information Technology/ Database

Achieves a unique and delicate balance between depth, breadth, and clarity.
—Stefan Joe-Yen, Cognitive Research Engineer, Northrop Grumman Corporation
& Adjunct Professor, Department of Computer Science, Webster University

Used as a primer for the recent graduate or as a refresher for the grizzled veteran,
Practical Data Mining is a must-have book for anyone in the field of data
mining and analytics.
—Chad Sessions, Program Manager, Advanced Analytics Group (AAG)

Used by corporations, industry, and government to inform and fuel everything from
focused advertising to homeland security, data mining can be a very useful tool
across a wide range of applications. Unfortunately, most books on the subject are
designed for the computer scientist and statistical illuminati and leave the reader
largely adrift in technical waters.

Revealing the lessons known to the seasoned expert, yet rarely written down for
the uninitiated, Practical Data Mining explains the ins-and-outs of the detection,
characterization, and exploitation of actionable patterns in data. This working field
manual outlines the what, when, why, and how of data mining and offers an
easy-to-follow, six-step spiral process.

Helping you avoid common mistakes, the book describes specific genres of data
mining practice. Most chapters contain one or more case studies with detailed
project descriptions, methods used, challenges encountered, and results obtained.
The book includes working checklists for each phase of the data mining process.
Your passport to successful technical and planning discussions with management,
senior scientists, and customers, these checklists lay out the right questions to ask
and the right points to make from an insider’s point of view.

Visit the book’s webpage for access to additional resources, including checklists,
figures, PowerPoint® slides, and a small set of simple prototype data mining tools.

http://www.celestech.com/PracticalDataMining

ISBN: 978-1-4398-6836-2
www.crcpress.com
www.auerbach-publications.com


Practical Data Mining
Monte F. Hancock, Jr.
Chief Scientist, Celestech, Inc.


CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2012 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Version Date: 20111031

International Standard Book Number-13: 978-1-4398-6837-9 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been
granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Dedication

This book is dedicated to my beloved wife, Sandy, and to my dear little sister, Dr.
Angela Lobreto. You make life a joy.
Also, to my professional mentors George Milligan, Dr. Craig Price, and Tell Gates,
three of the finest men I have ever known, or ever hope to know: May God bless you
richly, gentlemen; He has blessed me richly through you.

Contents

Dedication

Preface

About the Author

Acknowledgments

Chapter 1 What Is Data Mining and What Can It Do?
Purpose
Goals
1.1 Introduction
1.2 A Brief Philosophical Discussion
1.3 The Most Important Attribute of the Successful Data Miner: Integrity
1.4 What Does Data Mining Do?
1.5 What Do We Mean By Data?
1.5.1 Nominal Data vs. Numeric Data
1.5.2 Discrete Data vs. Continuous Data
1.5.3 Coding and Quantization as Inverse Processes
1.5.4 A Crucial Distinction: Data and Information Are Not the Same Thing
1.5.5 The Parity Problem
1.5.6 Five Riddles about Information
1.5.7 Seven Riddles about Meaning
1.6 Data Complexity
1.7 Computational Complexity
1.7.1 Some NP-Hard Problems
1.7.2 Some Worst-Case Computational Complexities
1.8 Summary

Chapter 2 The Data Mining Process
Purpose
Goals
2.1 Introduction
2.2 Discovery and Exploitation
2.3 Eleven Key Principles of Information Driven Data Mining
2.4 Key Principles Expanded
2.5 Type of Models: Descriptive, Predictive, Forensic
2.5.1 Domain Ontologies as Models
2.5.2 Descriptive Models
2.5.3 Predictive Models
2.5.4 Forensic Models
2.6 Data Mining Methodologies
2.6.1 Conventional System Development: Waterfall Process
2.6.2 Data Mining as Rapid Prototyping
2.7 A Generic Data Mining Process
2.8 RAD Skill Set Designators
2.9 Summary

Chapter 3 Problem Definition (Step 1)
Purpose
Goals
3.1 Introduction
3.2 Problem Definition Task 1: Characterize Your Problem
3.3 Problem Definition Checklist
3.3.1 Identify Previous Work
3.3.2 Data Demographics
3.3.3 User Interface
3.3.4 Covering Blind Spots
3.3.5 Evaluating Domain Expertise
3.3.6 Tools
3.3.7 Methodology
3.3.8 Needs
3.4 Candidate Solution Checklist
3.4.1 What Type of Data Mining Must the System Perform?
3.4.2 Multifaceted Problems Demand Multifaceted Solutions
3.4.3 The Nature of the Data
3.5 Problem Definition Task 2: Characterizing Your Solution
3.5.1 Candidate Solution Checklist
3.6 Problem Definition Case Study
3.6.1 Predictive Attrition Model: Summary Description
3.6.2 Glossary
3.6.3 The ATM Concept
3.6.4 Operational Functions
3.6.5 Predictive Modeling and ATM
3.6.6 Cognitive Systems and Predictive Modeling
3.6.7 The ATM Hybrid Cognitive Engine
3.6.8 Testing and Validation of Cognitive Systems
3.6.9 Spiral Development Methodology
3.7 Summary

Chapter 4 Data Evaluation (Step 2)
Purpose
Goals
4.1 Introduction
4.2 Data Accessibility Checklist
4.3 How Much Data Do You Need?
4.4 Data Staging
4.5 Methods Used for Data Evaluation
4.6 Data Evaluation Case Study: Estimating the Information Content Features
4.7 Some Simple Data Evaluation Methods
4.8 Data Quality Checklist
4.9 Summary

Chapter 5 Feature Extraction and Enhancement (Step 3)
Purpose
Goals
5.1 Introduction: A Quick Tutorial on Feature Space
5.1.1 Data Preparation Guidelines
5.1.2 General Techniques for Feature Selection and Enhancement
5.2 Characterizing and Resolving Data Problems
5.2.1 Outlier Case Study
5.2.2 Winnowing Case Study: Principal Component Analysis for Feature Extraction
5.3 Principal Component Analysis
5.3.1 Feature Winnowing and Dimension Reduction Checklist
5.3.2 Checklist for Characterizing and Resolving Data Problems
5.4 Synthesis of Features
5.4.1 Feature Synthesis Case Study
5.4.2 Synthesis of Features Checklist
5.5 Degapping
5.5.1 Degapping Case Study
5.5.2 Feature Selection Checklist
5.6 Summary

Chapter 6 Prototyping Plan and Model Development (Step 4)
Purpose
Goals
6.1 Introduction
6.2 Step 4A: Prototyping Plan
6.2.1 Prototype Planning as Part of a Data Mining Project
6.3 Prototyping Plan Case Study
6.4 Step 4B: Prototyping/Model Development
6.5 Model Development Case Study
6.6 Summary

Chapter 7 Model Evaluation (Step 5)
Purpose
Goals
7.1 Introduction
7.2 Evaluation Goals and Methods
7.2.1 Performance Evaluation Components
7.2.2 Stability Evaluation Components
7.3 What Does Accuracy Mean?
7.3.1 Confusion Matrix Example
7.3.2 Other Metrics Derived from the Confusion Matrix
7.3.3 Model Evaluation Case Study: Addressing Queuing Problems by Simulation
7.3.4 Model Evaluation Checklist
7.4 Summary

Chapter 8 Implementation (Step 6)
Purpose
Goals
8.1 Introduction
8.1.1 Implementation Checklist
8.2 Quantifying the Benefits of Data Mining
8.2.1 ROI Case Study
8.2.2 ROI Checklist
8.3 Tutorial on Ensemble Methods
8.3.1 Many Predictive Modeling Paradigms Are Available
8.3.2 Adaptive Training
8.4 Getting It Wrong: Mistakes Every Data Miner Has Made
8.5 Summary

Chapter 9 Supervised Learning
Genre Section 1—Detecting and Characterizing
Known Patterns
Purpose
Goals
9.1 Introduction
9.2 Representative Example of Supervised Learning: Building a Classifier
9.2.1 Problem Description
9.2.2 Data Description: Background Research/Planning
9.2.3 Descriptive Modeling of Data: Preprocessing and Data Conditioning
9.2.4 Data Exploitation: Feature Extraction and Enhancement
9.2.5 Model Selection and Development
9.2.6 Model Training
9.2.7 Model Evaluation
9.3 Specific Challenges, Problems, and Pitfalls of Supervised Learning
9.3.1 High-Dimensional Feature Vectors (PCA, Winnowing)
9.3.2 Not Enough Data
9.3.3 Too Much Data
9.3.4 Unbalanced Data
9.3.5 Overtraining
9.3.6 Noncommensurable Data: Outliers
9.3.7 Missing Features
9.3.8 Missing Ground Truth
9.4 Recommended Data Mining Architectures for Supervised Learning
9.5 Descriptive Analysis
9.5.1 Technical Component: Problem Definition
9.5.2 Technical Component: Data Selection and Preparation
9.5.3 Technical Component: Data Representation
9.6 Predictive Modeling
9.6.1 Technical Component: Paradigm Selection
9.6.2 Technical Component: Model Construction and Validation
9.6.3 Technical Component: Model Evaluation (Functional and Performance Metrics)
9.6.4 Technical Component: Model Deployment
9.6.5 Technical Component: Model Maintenance
9.7 Summary

Chapter 10 Forensic Analysis
Genre Section 2—Detecting, Characterizing,
and Exploiting Hidden Patterns
Purpose
Goals
10.1 Introduction
10.2 Genre Overview
10.3 Recommended Data Mining Architectures for Unsupervised Learning
10.4 Examples and Case Studies for Unsupervised Learning
10.4.1 Case Study: Reducing Cost by Optimizing a System Configuration
10.4.2 Case Study: Stacking Multiple Pattern Processors for Broad Functionality
10.4.3 Multiparadigm Engine for Cognitive Intrusion Detection
10.5 Tutorial on Neural Networks
10.5.1 The Neural Analogy
10.5.2 Artificial Neurons: Their Form and Function
10.5.3 Using Neural Networks to Learn Complex Patterns
10.6 Making Syntactic Methods Smarter: The Search Engine Problem
10.6.1 A Submetric for Sensitivity
10.6.2 A Submetric for Specificity
10.6.3 Combining the Submetrics to Obtain a Single Score
10.6.4 Putting It All Together: Building a Simple Search Engine
10.6.5 The Objective Function for This Search Engine and How to Use It
10.7 Summary

Chapter 11 Genre Section 3—Knowledge: Its Acquisition, Representation, and Use
Purpose
Goals
11.1 Introduction to Knowledge Engineering
11.1.1 The Prototypical Example: Knowledge-Based Expert Systems (KBES)
11.1.2 Inference Engines Implement Inferencing Strategies
11.2 Computing with Knowledge
11.2.1 Graph Methods: Decision Trees, Forward/Backward Chaining, Belief Nets
11.2.2 Bayesian Belief Networks
11.2.3 Non-Graph Methods: Belief Accumulation
11.3 Inferring Knowledge from Data: Machine Learning
11.3.1 Learning Machines
11.3.2 Using Modeling Techniques to Infer Knowledge from History
11.3.3 Domain Knowledge the Learner Will Use
11.3.4 Inferring Domain Knowledge from Human Experts
11.3.5 Writing on a Blank Slate
11.3.6 Mathematizing Human Reasoning
11.3.7 Using Facts in Rules
11.3.8 Problems and Properties
11.4 Summary

References

Glossary

Index
Preface

How to Use This Book

Data mining is much more than just trying stuff and hoping something good happens!
Rather, data mining is the detection, characterization, and exploitation of actionable
patterns in data.
This book is a wide-ranging treatment of the practical aspects of data mining in
the real world. It presents in a systematic way the analytic principles acquired by the
author during his 30+ years as a practicing engineer, data miner, information scientist,
and Adjunct Professor of Computer Science.
This book is not intended to be read and then put on the shelf. Rather, it is a working
field manual, designed to serve as an on-the-job guidebook. It has been written specifically
for IT consultants, professional data analysts, and sophisticated data owners who
want to establish data mining projects but are not themselves data mining experts.
Most chapters contain one or more case studies. These are synopses of data mining
projects led by the author, and include project descriptions, the data mining methods
used, challenges encountered, and the results obtained. When possible, numerical
details are provided, grounding the presentation in specifics.
Also included are checklists that guide the reader through the practical considerations
associated with each phase of the data mining process. These are working checklists:
material the reader will want to carry into meetings with customers, planning
discussions with management, technical planning meetings with senior scientists,
etc. The checklists lay out the questions to ask and the points to make, and explain the
whats and whys: the lessons learned that are known to all seasoned experts but rarely
written down.
While the treatment here is systematic, it is not formal: the reader will not encounter
eclectic theorems, tables of equations, or detailed descriptions of algorithms. The
“bit-level” mechanics of data mining techniques are addressed pretty well in online
literature, and freeware is available for many of them. A brief list of vendors and supported
applications is provided below. The goal of this book is to help the non-expert
address practical questions like:

• What is data mining, and what problems does it address?
• How is a quantitative business case for a data mining project developed and
assessed?
• What process model should be used to plan and execute a data mining project?
• What skill sets are needed for different types/phases of data mining projects?
• What data mining techniques exist, and what do they do? How do I decide
which are needed/best for my problem?
• What are the common mistakes made during data mining projects, and how can
they be avoided?
• How are data mining projects tracked and evaluated?

How This Book Is Organized

The content of the book is divided into two parts: Chapters 1–8 and Chapters 9–11.
The first eight chapters constitute the bulk of the book, and serve to ground the
reader in the practice of data mining in the modern enterprise. These chapters focus
on the what, when, why, and how of data mining practice. Technical complexities are
introduced only when they are essential to the treatment. This part of the book should
be read by everyone; later chapters assume that the reader is familiar with the concepts
and terms presented in these chapters.
Chapter 1 (What Is Data Mining and What Can It Do?) is a data mining manifesto:
it describes the mindset that characterizes the successful data mining practitioner. It
delves into some philosophical issues underlying the practice (e.g., Why is it essential
that the data miner understand the difference between data and information?).
Chapter 2 (The Data Mining Process) provides a summary treatment of data mining
as a six-step spiral process.
Chapters 3–8 are devoted to each of the steps of the data mining process. Checklists,
case studies, tables, and figures abound.

• Step 1: Problem Definition
• Step 2: Data Evaluation
• Step 3: Feature Extraction and Enhancement
• Step 4: Prototype Planning and Modeling
• Step 5: Model Evaluation
• Step 6: Implementation

The last three chapters, 9–11, are devoted to specific categories of data mining
practice, referred to here as genres. The data mining genres addressed are Chapter
9: Detecting and Characterizing Known Patterns (Supervised Learning), Chapter 10:
Detecting, Characterizing, and Exploiting Hidden Patterns (Forensic Analysis), and
Chapter 11: Knowledge: Its Acquisition, Representation, and Use.

It is hoped the reader will benefit from this rendition of the author’s extensive
experience in data mining/modeling, pattern processing, and automated decision
support. He started this journey in 1979, and learned most of this material the hard
way. By repeating his successes and avoiding his mistakes, you make his struggle
worthwhile!

A Short History of Data Technology: Where Are We, and How Did We Get Here?
What follows is a brief account of the history of data technology along the classical
lines. We posit the existence of brief eras of five or ten years’ duration through which
the technology passed during its development. This background will help the reader
understand the forces that have driven the development of current data mining techniques.
The dates provided are approximate.

Era 1: Computing-Only Phase (1945–1955):

As originally conceived, computers were just that: machines for performing computation.
Volumes of data might be input, but the answer tended to consist of just a few
numbers. Early computers had nothing that we would call online storage.
Reliable, inexpensive mass storage devices did not exist. Data was not stored in the
computer at all: it was input, transformed, and output. Computing was done to obtain
answers, not to manage data.

Era 2: Offline Batch Storage (1955–1965):

Data was saved outside of the computer, on paper tape and cards, and read back in
when needed. The use of online mass storage was not widespread, because it was expensive,
slow, and unstable.

Era 3: Online Batch Storage (1965–1970):

With the invention of stable, cost-effective mass storage devices, everything changed.
Over time, the computer began to be viewed less as a machine for crunching numbers,
and more as a device for storing them. Initially, the operating system’s file management
system was used to hold data in flat files: un-indexed lists or tables of data. As the
need to search, sort, and process data grew, it became necessary to provide applications
for organizing data into various types of business-specific hierarchies. These early
databases organized data into tiered structures, allowing for rapid searching of records
in the hierarchy.
Data was stored on high-density media such as magnetic tape and magnetic drum.
Platter disc technology began to become more generally used, but was still slow and
had low capacity.

Era 4: Online Databases (1970–1985):

Reliable, cost-effective online mass storage became widely available. Data was organized
into domain-specific vertical structures, typically for a single part of an organization.
This allowed the development of stovepipe systems for focused applications. The use of
Online Transaction Processing (OLTP) systems became widespread, supporting inventory,
purchasing, sales, planning, etc. The focus of computing began to shift from raw
computation to data processing: the ingestion, transformation, storage, and retrieval
of bulk data.
However, there was an obvious shortcoming. The databases of functional organizations
within an enterprise were developed to suit the needs of particular business
units. They were not interoperable, making the preparation of an enterprise-wide data
view very difficult. The difficulty of horizontal integration caused many to question
whether the development of enterprise-wide databases was feasible.

Era 5: Enterprise Databases (1985–1995):

As the utility of automatic data storage became clear, organizations within businesses
began to construct their own hierarchical databases. Soon, the repositories of corporate
information on all aspects of a business grew to be large.
Increased processing power, widespread availability of reliable communication networks,
and development of database technology allowed the horizontal integration of
multiple vertical data stores into an enterprise-wide database. For the first time, a global
view of an entire organization’s data repository was accessible through a single portal.

Era 6: Data Warehouses and Data Marts (since 1995):

This brings us to the present. Mass storage and raw compute power have reached the
point today where virtually every data item generated by an enterprise can be saved.
And often, enterprise databases have become extremely large, architecturally complex,
and volatile. Ultra-sophisticated data modeling tools have become available at the precise
moment that competition for market share in many industries begins to peak. An
appropriate environment for application of these tools to a cleansed, stable, offline
repository was needed, and data warehouses were born. And, as data warehouses have
grown large, the need to create architecturally compatible functional subsets, or data
marts, has been recognized.
The immediate future is moving everything toward cloud computing. This will
include the elimination of many local storage disks as data is pushed to a vast array of
external servers accessible over the internet. Data mining in the cloud will continue
to grow in importance as network connectivity and data accessibility become virtually
infinite.

Data Mining Information Sources

Some feeling for the current interest in data mining can be gained by reviewing the
following list of data mining companies, groups, publications, and products.

Data Mining Publications

• Two Crows Corporation
Predictive and descriptive data mining models, courses and presentations.
http://www.twocrows.com
• “Information Management.” A newsletter web site on data mining papers, books
and product reviews.
http://www.information-management.com
• “Searching for the Right Data Modeling Tool” by Terry Moriarty
http://www.information-management.com/issues/19980601/383-1.html
• “Data Mining FAQs” by Jesus Mena
http://www.information-management.com/issues/19980101/792-1.html
• “Data Mining & Pattern Discovery,” Elder Research, Inc.
http://www.datamininglab.com/
• “An Evaluation of High-end Data Mining Tools for Fraud Detection” by Dean W.
Abbott, I.P. Matkovsky, and John F. Elder
http://www.datamininglab.com/TOOLCOMPARISON/tabid/58/Default.aspx
• KDnuggets.com is a web site listing companies with data mining related
products.
http://www.kdnuggets.com/companies/products.html

Data Mining Technology/Product Providers

• SPSS Web Site:
http://www.spss.com
• SPSS Products:
http://www.spss.com/products/products/categories/data_mining/

General Data Mining Tools

The data mining tools in the following list are used for general types of data:

• Data-Miner Software Kit: A comprehensive collection of programs for
efficiently mining big data. It uses the techniques presented in Predictive Data
Mining: A Practical Guide (Morgan Kaufmann).
http://www.data-miner.com
• RuleQuest.com: System is rule based with subsystems to assist in data cleansing
(GritBot) and constructing classifiers (See5) in the form of decision trees and
rulesets.
http://www.rulequest.com/products.html
• SAS
http://www.sas.com
• Weka 3 from the University of Waikato: A collection of machine learning algorithms
for solving real-world data mining problems.
http://www.cs.waikato.ac.nz/ml/weka/

Tools for the Development of Bayesian Belief Networks

• Netica: BBN software that is easy to use, and implements BBN learning from
data. It has a nice user interface.
http://www.norsys.com
• Hugin: Implements reasoning with continuous variables and has a nice user
interface.
http://www.hugin.dk
About the Author

Monte F. Hancock, Jr., BA, MS, is Chief Scientist for Celestech, Inc., which has
offices in Falls Church, Virginia, and Phoenix, Arizona. He was also a Technical
Fellow at Northrop Grumman and Chief Cognitive Research Scientist for CSI, Inc., and
was a software architect and engineer at Harris Corporation and HRB Singer, Inc.
He has over 30 years of industry experience in software engineering and data mining
technology development.
He is also Adjunct Full Professor of Computer Science for the Webster University
Space Coast Region, where he serves as Program Mentor for the Master of Science
Degree in Computer Science. Monte has served for 26 years on the adjunct faculty in
the Mathematics and Computer Science Department of the Hamilton Holt School of
Rollins College, Winter Park, Florida, and served 3 semesters as adjunct Instructor in
Computer Science at Pennsylvania State University.
Monte teaches secondary Mathematics, AP Physics, Chemistry, Logic, Western
Philosophy, and Church History at New Covenant School, and New Testament Greek
at Heritage Christian Academy, both in Melbourne, Florida. He was a mathematics
curriculum developer for the Department of Continuing Education of the University
of Florida in Gainesville, and serves on the Industry Advisory Panels in Computer
Science for both the Florida Institute of Technology, and Brevard Community
College in Melbourne, Florida. Monte has twice served on panels for the National
Science Foundation.
Monte has served on many program committees for international data mining conferences
and was a Session Chair for KDD. He has presented 15 conference papers, edited
several book chapters, and co-authored the book Data Mining Explained with Rhonda
Delmater, Digital Press, 2001.
Monte is cited in (among others):

• “Who’s Who in the World” (2009–2012)
• “Who’s Who in America” (2009–2012)
• “Who’s Who in Science and Engineering” (2006–2012)
• “Who’s Who in the Media and Communication” (1st ed.)
• “Who’s Who in the South and Southwest” (23rd–25th ed.)
• “Who’s Who Among America’s Teachers” (2006, 2007)
• “Who’s Who in Science and Theology” (2nd ed.)
Acknowledgments

It is always a pleasure to recognize those who have provided selfless support in the
completion of a significant work.
Special thanks are due to Rhonda Delmater, with whom I co-authored my first book,
Data Mining Explained (Digital Press, 2001), and who proposed the development of
this book. Were it not for exigent circumstances, this would have been a joint work.
Special thanks are also due to Theron Shreve (acquisition editor), Marje Pollack
(compositor), and Rob Wotherspoon (copy editor) of Derryfield Publishing Services,
LLC. What a pleasure to work with professionals who know the business and under-
stand people!
Special thanks are due to Dan Strohschein, who worked on technical references,
and Katherine Hancock, who verified the vendor list.
Finally, to those who have made significant contributions to my knowledge
through the years: John Day, Chad Sessions, Stefan Joe-Yen, Rusty Topping, Justin
Mortimer, Leslie Kain, Ben Hancock, Olivia Hancock, Marsha Foix, Vinnie, Avery,
Toby, Tristan, and Maggie.

Chapter 1
What Is Data Mining
and What Can It Do?

Purpose

The purpose of this chapter is to provide the reader with grounding in the fundamental
philosophical principles of data mining as a technical practice. The reader is then intro-
duced to the wide array of practical applications that rely on data mining technology.
The issue of computational complexity is addressed in brief.

Goals

After you have read this chapter, you will be able to define data mining from both
philosophical and operational perspectives, and enumerate the analytic functions data
mining performs. You will know the different types of data that arise in practice. You
will understand the basics of computational complexity theory. Most importantly, you
will understand the difference between data and information.

1.1 Introduction
Our study of data mining begins with two semi-formal definitions:

Definition 1. Data mining is the principled detection, characterization, and exploitation
of actionable patterns in data. Table 1.1 explains what is meant by each of these
components.


Table 1.1 Definitive Data Mining Attributes

Attribute            Connotations
Principled           Rational, empirical, objective, repeatable
Detection            Sensing and locating
Characterization     Consistent, efficient, tractable symbolic representation
                     that does not alter information content
Exploitation         Decision making that facilitates action
Actionable Pattern   Conveys information that supports decision making

Taking this view of what data mining is, we can formulate a functional definition
that tells us what individuals engaged in data mining do.

Definition 2. Data mining is the application of the scientific method to data to obtain
useful information. The heart of the scientific approach to problem-solving is rational
hypothesis testing guided by empirical experimentation.
What we call science today was referred to as natural philosophy in the 15th
century. The Aristotelian approach to understanding the world was to catalog and
organize more-or-less passive acts of observation into taxonomies. This method began
to fall out of favor in the physical sciences in the 15th century, and was dead by the 17th
century. However, because of the greater difficulty of observing the processes underly-
ing biology and behavior, the life sciences continued to rely on this approach until well
into the 19th century. This is why the life sciences of the 1800s are replete with taxono-
mies, detailed naming conventions, and perceived lines of descent, which are more a
matter of organizing observations than principled experimentation and model revision.
Applying the scientific method today, we expect to engage in a sequence of planned
steps:

1. Formulate hypotheses (often in the form of a question)
2. Devise experiments
3. Collect data
4. Interpret data to evaluate hypotheses
5. Revise hypotheses based upon experimental results

This sequence amounts to one cycle of an iterative approach to acquiring knowledge.
In light of our functional definition of data mining, this sequence can be thought
of as an over-arching data mining methodology that will be described in detail in
Chapter 3.

1.2 A Brief Philosophical Discussion


Somewhere in every data mining effort, you will encounter at least one computationally
intractable problem; it is unavoidable. This has technical and procedural implications,
but it also has philosophical implications. In particular, since there are by
definition no perfect techniques for intractable problems, different people will handle
them in different ways; no one can say definitively that one way is necessarily wrong
and another right. This makes data mining something of an art, and leaves room for
the operation of both practical experience and creative experimentation. It also implies
that the data mining philosophy to which you look when science falls short can mean
the difference between success and failure. Let’s talk a bit about developing such a data
mining philosophy.
As noted above, data mining can be thought of as the application of the scientific
method to data. We perform data collection (sampling), formulate hypotheses (e.g.,
visualization, cluster analysis, feature selection), conduct experiments (e.g., construct
and test classifiers), refine hypotheses (spiral methodology), and ultimately build theo-
ries (field applications). This is a process that can be reviewed and replicated. In the real
world, the resulting theory will either succeed or fail.
Many of the disciplines that apply to empirical scientific work also apply to the
practice of data mining: assumptions must be made explicit; the design of principled
experiments capable of falsifying our hypotheses is essential; the integrity of the evi-
dence, process, and results must be meticulously maintained and documented; out-
comes must be repeatable; and so on. Unless these disciplines are maintained, nothing
of certain value can result. Of particular importance is the ability to reproduce results.
In the data mining world, these disciplines involve careful configuration management
of the system environment, data, applications, and documentation. There are no effec-
tive substitutes for these.
One of the most difficult mental disciplines to maintain during data mining work
is reservation of judgment. In any field involving hypothesis and experimentation, pre-
liminary results can be both surprising and exhilarating. Finding the smoking gun in a
forensic study, for example, is hitting pay-dirt of the highest quality, and it is hard not
to get a little excited if you smell gunpowder.
However, this excitement cannot be allowed to short-circuit the analytic pro-
cess. More than once I have seen exuberant young analysts charging down the hall
to announce an amazing discovery after only a few hours’ work with a data set; but
I don’t recall any of those instant discoveries holding up under careful review. I can
think of three times when I have myself jumped the gun in this way. On one occa-
sion, eagerness to provide a rapid response led me to prematurely turn over results to
a major customer, who then provided them (without review) to their major customer.
Unfortunately, there was an unnoticed but significant flaw in the analysis that invali-
dated most of the reported results. That is a trail of culpability you don’t want leading
back to your office door.

1.3 The Most Important Attribute of the Successful Data Miner: Integrity

Integrity is variously understood, so we list the principal characteristics data miners
must have.

• Moral courage. Data miners have lots of opportunities to deliver unpleasant
news. Sometimes they have to inform an enterprise that the data it has collected
and stored at great expense does not contain the type or amount of information
expected.
Further, it is an unfortunate fact that the default assessment for data mining
efforts in most situations is “failure.” There can be tremendous pressure to pro-
duce a certain result, accuracy level, conclusion, etc., and if you don’t: Failure.
Pointing out that the data do not support the desired application, are of low
quality (precision/accuracy), or do not contain sufficient samples to cover the
problem space will sound like excuses, and will not always redeem you.
• Commitment to enterprise success. If you want the enterprise you are assist-
ing to be successful, you will be honest with them; will labor to communicate
information in terms they can understand; and will not put your personal success
ahead of the truth.
• Honesty in evaluation of data and information. Individuals who demonstrate
this characteristic are willing to let the data speak for themselves. They will resist
the temptation to read into the data that which wasn’t mined from the data.
• Meticulous planning, execution, and documentation. A successful data
miner will be meticulous in planning, carrying out, and documenting the min-
ing process. They will not jump to conclusions; will enforce the prerequisites
of a process before beginning; will check and recheck major results; and will
carefully validate all results before reporting them. Excellent data miners create
documentation of sufficient quality and detail that their results can be repro-
duced by others.

1.4 What Does Data Mining Do?


The particulars of practical data mining “best practice” will be addressed later in great
detail, but we jump-start the treatment with some bulleted lists summarizing the func-
tions that data mining provides.

Data mining uses a combination of empirical and theoretical principles to connect
structure to meaning by

• Selecting and conditioning relevant data
• Identifying, characterizing, and classifying latent patterns
• Presenting useful representations and interpretations to users

Data mining attempts to answer these questions

• What patterns are in the information?


• What are the characteristics of these patterns?
• Can meaning be ascribed to these patterns and/or their changes?
• Can these patterns be presented to users in a way that will facilitate their assess-
ment, understanding, and exploitation?
• Can a machine learn these patterns and their relevant interpretations?

Data mining helps the user interact productively with the data

• Planning helps the user achieve and maintain situational awareness of vast,
dynamic, ambiguous/incomplete, disparate, multi-source data.
• Knowledge leverages users’ domain knowledge by creating functionality based
upon an understanding of data creation, collection, and exploitation.
• Expressiveness produces outputs of adjustable complexity delivered in terms
meaningful to the user.
• Pedigree builds integrated metrics into every function, because every recommen-
dation has to have supporting evidence and an assessment of certainty.
• Change uses future-proof architectures and adaptive algorithms that anticipate
many users addressing many missions.

Data mining enables the user to get their head around the problem space
Decision Support is all about . . .

• Enabling users to group information in familiar ways


• Controlling HMI complexity by layering results (e.g., drill-down)
• Supporting users’ changing priorities (goals, capabilities)
• Allowing intuition to be triggered (“I’ve seen this before”)
• Preserving and automating perishable institutional knowledge
• Providing objective, repeatable metrics (e.g., confidence factors)
• Fusing and simplifying results (e.g., annotate multisource visuals)
• Automating alerts on important results (“It’s happening again”)
• Detecting emerging behaviors before they consummate (look)
• Delivering value (timely, relevant, and accurate results)

. . . helping users make the best choices.

Some general application areas for data mining technology

• Automating pattern detection to characterize complex, distributed signatures
that are worth human attention and recognize those that are not
• Associating events that go together but are difficult for humans to correlate
• Characterizing interesting processes not just facts or simple events
• Detecting actionable anomalies and explaining what makes them different and
interesting
• Describing contexts from multiple perspectives with numbers, text and graphics
• Accurate identification and classification—add value to raw data by tagging and
annotation (e.g., automatic target detection)
• Anomaly, normalcy, and fusion—characterize, quantify, and assess normalcy
of patterns and trends (e.g., network intrusion detection)
• Emerging patterns and evidence evaluation—capturing institutional knowledge
of how events arise and alerting users when they begin to emerge
• Behavior association—detection of actions that are distributed in time and space
but synchronized by a common objective: connecting the dots
• Signature detection and association—detection and characterization of multi-
variate signals, symbols, and emissions
• Concept tagging—ontological reasoning about abstract relationships to tag and
annotate media of all types (e.g., document geo-tagging)
• Software agents assisting analysts—small footprint, fire-and-forget apps that
facilitate search, collaboration, etc.
• Help the user focus via unobtrusive automation
o Off-load burdensome labor (perform intelligent searches, smart winnowing)
o Post smart triggers or tripwires to data stream (anomaly detection)
o Help with workflow and triage (sort my in-basket)
• Automate aspects of classification and detection
o Determine which sets of data hold the most information for a task
o Support construction of ad hoc on-the-fly classifiers
o Provide automated constructs for merging decision engines (multi-level fusion)
o Detect and characterize domain drift (the rules of the game are changing)
o Provide functionality to make best estimate of missing data
• Extract, characterize and employ knowledge
o Rule induction from data and signatures development from data
o Implement non-monotonic reasoning for decision support
o High-dimensional visualization
o Embed decision explanation capability in analytic applications
• Capture, automate and institutionalize best practices
o Make proven enterprise analytic processes available to all
o Capture rare, perishable human knowledge and distribute it everywhere
o Generate signature-ready prose reports
o Capture and characterize the analytic process to anticipate user needs

1.5 What Do We Mean By Data?


Data is the wrapper that carries information. It can look like just about anything:
images, movies, recorded sounds, light from stars, the text in this book, the swirls that
form your fingerprints, your hair color, age, income, height, weight, credit score, a list
of your likes and dislikes, the chemical formula for the gasoline in your car, the num-
ber of miles you drove last year, your cat’s body temperature as a function of time, the
order of the nucleotides in the third codon of your mitochondrial DNA, a street map
of Liberal, Kansas, the distribution of IQ scores in Braman, Oklahoma, the fat content
of smoked sausage, a spreadsheet of your household expenses, a coded message, a com-
puter virus, the pattern of fibers in your living room carpet, the pattern of purchases
at a grocery store, the pattern of capillaries in your retina, election results, etc. In fact:
A datum (singular) is any symbolic representation of any attribute of any given thing.
More than one datum constitutes data (plural).

1.5.1 Nominal Data vs. Numeric Data


Data come in two fundamental forms—nominal and numeric. Fabulously intricate
hierarchical structures and relational schemes can be fashioned from these two forms.
This is an important distinction, because nominal and numeric data encode infor-
mation in different ways. Therefore, they are interpreted in different ways, exhibit
patterns in different ways, and must be mined in different ways. In fact, there are many
data mining tools that only work with numeric data, and many that only work with
nominal data. There are only a few (but there are some) that work with both.
Data are said to be nominal when they are represented by a name. The names of
people, places, and things are all nominal designations. Virtually all text data is nomi-
nal. But data like Zip codes, phone numbers, addresses, social security numbers, etc.
are also nominal. This is because they are aliases for things: your postal zone, the den
that contains your phone, your house, and you. The point is that the information in these
data has nothing to do with the numeric values of their symbols; any other unique
string of numbers could have been used.
Data are said to be numeric when the information they contain is conveyed by the
numeric value of their symbol string. Bank balances, altitudes, temperatures, and ages
all hold their information in the value of the number string that represents them. A
different number string would not do.
Given that nominal data can be represented using numeric characters, how can you
tell the difference between nominal and numeric data? There is a simple test: If the
average of a set of data is meaningful, they are numeric.
Phone numbers are nominal, because averaging the phone numbers of a group of
people doesn’t produce a meaningful result. The same is true of zip codes, addresses,
and Social Security numbers. But averaging incomes, ages, and weights gives symbols
whose values carry information about the group; they are numeric data.
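
The averaging test can be tried out in a minimal Python sketch. The sample values and the relabeling table below are invented for illustration. The sketch only shows that an average of ZIP codes can be computed yet names nothing, while swapping nominal values for other unique labels leaves the grouping information intact.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample records (values invented for illustration).
ages = [34, 51, 27, 45]                            # numeric: the value carries the information
zip_codes = ["32901", "32901", "67901", "74632"]   # nominal: the digits are only a name

# Averaging numeric data summarizes the group.
print(mean(ages))  # 39.25

# Averaging ZIP codes is computable, but the result does not name a meaningful place.
print(mean(int(z) for z in zip_codes))

# Nominal values can be replaced by any other unique labels without loss:
# the grouping structure (who shares a ZIP code) is unchanged.
relabel = {"32901": "A", "67901": "B", "74632": "C"}
relabeled = [relabel[z] for z in zip_codes]
same_grouping = sorted(Counter(zip_codes).values()) == sorted(Counter(relabeled).values())
print(same_grouping)  # True
```

Applying the same relabeling trick to the ages would destroy their information, which is another way of seeing that they are numeric.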

1.5.2 Discrete Data vs. Continuous Data


Numeric data come in two forms—discrete and continuous. We can’t get too techni-
cal here, because formal mathematical definitions of these concepts are deep. For the
purposes of data mining, it is sufficient to say that a set of data is continuous when,
given two values in the set, you can always find another value in the set between them.
Intuitively, this implies there is a linear ordering, and there aren’t gaps or holes in the
range of possible values. In theory, it also implies that continuous data can assume
infinitely many different values.
A set of data is discrete if it is not continuous. The usual scenario is a finite set of
values or symbols. For example, the readings of a thermometer constitute continuous
data, because (in theory), any temperature within a reasonable range could actually
occur. Time is usually assumed to be continuous in this sense, as is distance; therefore
sizes, distances, and durations are all continuous data.
On the other hand, when the possible data values can be placed in a list, they are
discrete: hair color, gender, quantum states (depending upon whom you ask), headcount
for a business, the positive whole numbers (an infinite set), etc., are all discrete.
A very important difference between discrete and continuous data for data mining
applications is the matter of error. Continuous data can presumably have any amount
of error, from very small to very large, and all values in between. Discrete data are either
completely right or completely wrong.

Figure 1.1 Nominal to numeric coding of data.

1.5.3 Coding and Quantization as Inverse Processes


Data can be represented in different ways. Sometimes it is necessary to translate data
from one representational scheme to another. In applications this often means converting
numeric data to nominal data (quantization), and nominal data to numeric data (coding).
Quantization usually leads to loss of precision, so it is not a perfectly reversible
process. Coding usually leads to an increase in precision, and is usually reversible.
There are many ways these conversions can be done, and some application-depen-
dent decisions that must be made. Examples of these decisions might include choosing
the level of numeric precision for coding, or determining the number of restoration
values for quantization. The most intuitive explanation of these inverse processes is
pictorial. Notice that the numeric coding (Figure 1.1) is performed in stages. No
information is lost; the only purpose is to make the nominal feature attributes numeric.
However, quantization (Figure 1.2) usually reduces the precision of the data, and is
rarely reversible.

Figure 1.2 Numeric to nominal quantization.
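
The two conversions can also be sketched in a few lines of Python. This is an illustrative sketch only: the coding table, the bin boundaries, and the choice of bin midpoints as restoration values are all assumptions made for the example, not a scheme taken from the figures.

```python
# Coding: nominal -> numeric. A lookup table assigns each name a number,
# and inverting the table recovers the original names exactly.
colors = ["red", "green", "blue", "green"]
code_table = {"red": 0, "green": 1, "blue": 2}      # hypothetical coding scheme
decode_table = {v: k for k, v in code_table.items()}

coded = [code_table[c] for c in colors]
decoded = [decode_table[n] for n in coded]
print(decoded == colors)  # True: coding is reversible here

# Quantization: numeric -> nominal. Each value is replaced by the name of the
# bin it falls in; one restoration value per bin stands in for it afterwards.
bins = [("low", 0.0, 50.0), ("medium", 50.0, 80.0), ("high", 80.0, 120.0)]
restoration = {name: (lo + hi) / 2 for name, lo, hi in bins}  # bin midpoints

def quantize(x):
    """Return the name of the bin containing x."""
    for name, lo, hi in bins:
        if lo <= x < hi:
            return name
    raise ValueError(f"{x} is outside the quantizer's range")

temps = [47.3, 62.8, 71.1, 98.6]
labels = [quantize(t) for t in temps]
restored = [restoration[label] for label in labels]
print(labels)    # ['low', 'medium', 'medium', 'high']
print(restored)  # [25.0, 65.0, 65.0, 100.0]
```

The restored values differ from the originals, and 62.8 and 71.1 have become indistinguishable. Using more bins (more restoration values) reduces the loss but does not eliminate it.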

1.5.4 A Crucial Distinction: Data and Information Are Not the Same Thing

Data and information are entirely different things. Data is a formalism, a wrapper, by
which information is given observable form. Data and information stand in relation to
one another much as do the body and the mind. In similar fashion, it is only data that
are directly accessible to an observer. Inferring information from data requires an act
of interpretation which always involves a combination of contextual constraints and
rules of inference.
In computing systems, the problem “context” and “heuristics” are represented
using a structure called a domain ontology. As the term suggests, each problem space
has its own constraints, facts, assumptions, rules of thumb, and these are variously
represented and applied.
The standard mining analogy is helpful here. Data mining is similar in some ways
to mining for precious metals:

• Silver mining. Prospectors survey a region and select an area they think might
have ore, the rough product that is refined to obtain metal. They apply tools
to estimate the ore content of their samples and if it is high enough, the ore is
refined to obtain purified silver.
• Data mining. Data miners survey a problem space and select sources they think
might contain salient patterns, the rough product that is refined to obtain infor-
mation. They apply tools to assess the information content of their sample and if
it is high enough, the data are processed to infer latent information.

However, there is a very important way in which data mining is not like silver
mining. Chunks of silver ore actually contain particular silver atoms. When a chunk
of ore is moved, its silver goes with it. Extending this part of the silver mining analogy
to data mining will get us into trouble. The silver mining analogy fails because of the
fundamental difference between data and information.
The simplest scenario demonstrating this difference involves their different relation
to context. When I remove letters from a word, they retain their identity as letters, as do
the letters left behind. But the information conveyed by the letters removed and by the
letters left behind has very likely been altered, destroyed, or even negated.
Another example is found in the dependence on how the information is encoded. I
convey exactly the same message when I say “How are you?” that I convey when I say
“Wie geht’s?” yet the data are completely different. Computer scientists use the terms
syntax and semantics to distinguish between representation and meaning, respectively.
It is extremely dangerous for the data miner to fall into the habit of regarding partic-
ular pieces of information as being attached to particular pieces of data in the same way
that metal atoms are bound to ore. Consider a more sophisticated and more subtle example:
A Morse code operator sends a message consisting of alternating, evenly spaced dots
and dashes (Figure 1.3):

Figure 1.3 Non-informative pattern.

This is clearly a pattern but other than manifesting its own existence, this pattern
conveys no information. Information Theory tells us that such a pattern is devoid of
information by pointing out that after we’ve listened to this pattern for a while, we can
perfectly predict which symbol will arrive next. Such a pattern, by virtue of its complete
predictability is not informative: a message that tells me what I already know tells me
nothing. This important notion can be quantified by the Shannon entropy (see glossary).
However, if the transmitted tones are varied or modulated, the situation is quite
different (Figure 1.4):
Figure 1.4 Informative modulation pattern.

This example makes it quite clear that information does not reside within the dots
and dashes themselves; rather, it arises from an interpretation of their inter-relation-
ships. In Morse code, this is their order and duration relative to each other. Notice that
by removing the first dash from O = - - -, the last two dashes now mean M = - -, even
though the dashes have not changed. This context sensitivity is a wonderful thing, but
it causes data mining disaster if ignored.
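
The predictability argument for the two Morse patterns can be checked numerically. The sketch below is one possible illustration, not a definitive treatment: it assumes that "predictability" is measured by the entropy of the next symbol given the previous one, and the function name and sample strings are invented for the example. Counting raw symbol frequencies alone would not work here, because the alternating pattern contains dots and dashes in equal numbers.

```python
from collections import Counter
from math import log2

def conditional_entropy(symbols):
    """Estimate H(next symbol | previous symbol) in bits per symbol."""
    pairs = list(zip(symbols, symbols[1:]))
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    h = 0.0
    for (prev, _), n in pair_counts.items():
        p_pair = n / len(pairs)                 # probability of this adjacent pair
        p_next_given_prev = n / prev_counts[prev]
        h -= p_pair * log2(p_next_given_prev)
    return h

alternating = ".-" * 50                       # perfectly predictable pattern
modulated = "..-.--...-.-..--.-.---.."        # hypothetical modulated pattern

print(conditional_entropy(alternating))  # 0.0
print(conditional_entropy(modulated))    # greater than zero
```

A value of zero says that once the previous symbol is known, the next one is certain, so each new symbol tells the listener nothing.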
A final illustration called the Parity Problem convincingly establishes the distinct
nature of data and information in a data mining context.

1.5.5 The Parity Problem


Let’s do a thought experiment (Figure 1.5). I have two marbles in my hand, one white
and one black. I show them to you and ask this question: Is the number of black
marbles even, or is it odd?
Naturally you respond odd, since one is an odd number. If both of the marbles had
been black, the correct answer would have been even, since 2 is an even number; if I
had been holding two white marbles, again the correct answer would have been even,
since 0 is an even number.
This is called the Parity-2 Problem. If there are N marbles, some white (possibly
none) and some black (possibly none), the question of whether there is an odd
number of black marbles is called the Parity-N Problem, or just the Parity Problem.
This problem is important in computer science, information theory, coding theory,
and related areas.
Of course, when researchers talk about the parity problem, they don’t use marbles,
they use zeros and ones (binary digits = bits). For example, I can store a data file on
disc and then ask whether the file has an odd or even number of ones; the answer is the
parity of the file.
This idea can also be used to detect data transmission errors: if I want to send
you 100 bits of data, I could actually send you 101, with the extra bit set to a one or
zero such that the whole set has a particular parity that you and I have agreed upon
in advance. If you get a message from me and it doesn’t have the expected parity, you
know the message has an odd number of bit errors and must be resent.
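
The parity-bit scheme just described can be sketched as follows. The sketch uses a 7-bit message instead of 100 bits for brevity, assumes the two parties have agreed on even parity, and uses helper names invented for the example.

```python
def parity(bits):
    """Return 1 if the number of ones is odd, otherwise 0."""
    return sum(bits) % 2

def add_parity_bit(bits):
    """Append one bit so the whole message has even parity (the agreed convention)."""
    return bits + [parity(bits)]

def looks_intact(message):
    """A received message passes the check when its overall parity is even."""
    return parity(message) == 0

data = [1, 0, 1, 1, 0, 0, 1]
sent = add_parity_bit(data)
print(looks_intact(sent))        # True

one_error = sent.copy()
one_error[2] ^= 1                # flip a single bit in transit
print(looks_intact(one_error))   # False: the error is detected

two_errors = one_error.copy()
two_errors[5] ^= 1               # flip a second bit
print(looks_intact(two_errors))  # True: an even number of errors goes unnoticed
```

The last case shows the limit of the scheme: a single parity bit can only reveal an odd number of bit errors.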

1.5.6 Five Riddles about Information


Figure 1.5 The parity problem.

Suppose I have two lab assistants named Al and Bob, and two data bits. I show only
the first one to Al, and only the second one to Bob. If I ask Al what the parity of the
original pair of bits is, what will he say? And if I ask Bob what the parity of the original
pair of bits is, what will he say?
Neither one can say what the parity of the original pair is, because each one is lack-
ing a bit. If I handed Al a one, he could reason that if the bit I can’t see is also a one,
then the parity of the original pair is even. But if the bit I can’t see is a zero, then the
parity of the original pair is odd. Bob is in exactly the same boat.

Riddle one. Al is no more able to state the parity of the original bit pair than he was
before he was given his bit and the same is true for Bob. That is, each one has 50% of
the data, but neither one has received any information at all.
Suppose now that I have 100 lab assistants, and 100 randomly generated bits of
data. To assistant 1, I give all the bits except bit 1; to assistant 2, I give all the bits except
bit 2; and so on. Each assistant has received 99% of the data. Yet none of them is any
more able to state the parity of the original 100-bit data set than before they received
99 of the bits.

Riddle two. Even though each assistant has received 99% of the data, none of them
has received any information at all.

Riddle three. The information in the 100 data bits cannot be in the bits themselves.
For, which bit is it in? Not bit 1, since that bit was given to 99 assistants, and didn’t
provide them with any information. Not bit 2, for the same reason. In fact, it is clear
that the information cannot be in any of the bits themselves. So, where is it?
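
The situation of the 100 lab assistants can be simulated directly. The following is a small sketch under stated assumptions: the random seed is arbitrary and the helper names are invented. For each assistant it asks which parities remain possible given the 99 bits that assistant can see.

```python
import random

def parity(bits):
    """Return 1 if the number of ones is odd, otherwise 0."""
    return sum(bits) % 2

def possible_parities(seen):
    """Parities of the full set that are consistent with the bits seen."""
    return {parity(seen + [missing]) for missing in (0, 1)}

random.seed(0)                                   # arbitrary seed for repeatability
bits = [random.randint(0, 1) for _ in range(100)]

uninformed = 0
for i in range(100):
    seen = bits[:i] + bits[i + 1:]               # assistant i sees every bit except bit i
    if possible_parities(seen) == {0, 1}:        # both answers are still possible
        uninformed += 1

print(uninformed)  # 100
```

Every assistant is left with both parities still possible, exactly as before receiving any bits, even though each holds 99% of the data.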

Riddle four. Suppose my 100 bits have odd parity (say, 45 ones and 55 zeros). I arrange
them on a piece of paper, so they spell the word “odd.” Have I added information? If
so, where is it? (Figure 1.6)

Riddle five. Where is the information in a multiply encrypted message, since it com-
pletely disappears when one bit is removed?

Figure 1.6 Feature sets vs. sets of features.

1.5.7 Seven Riddles about Meaning


Thinking of information as a vehicle for expressing meaning, we now consider the idea
of “meaning” itself. The following questions might seem silly, but the issues they raise
are the very things that make intelligent computing and data mining particularly dif-
ficult. Specifically, when an automated decision support system must infer the “mean-
ing” of a collection of data values in order to correctly make a critical decision, “silly”
issues of exactly this sort come up . . . and they must be addressed. We begin this in
Chapter 2 by introducing the notion of a domain ontology, and continue it in Chapter
11 for intelligent systems (particularly those that perform multi-level fusion).
For our purposes, the most important question has to do with context: does mean-
ing reside in things themselves, or is it merely the interpretation of an observer? This
is an interesting question I have used (along with related questions in axiology) when I
teach my Western Philosophy class. Here are some questions that touch on the connec-
tion between meaning and context:

Riddle one. If meaning must be known/remembered in order to exist/persist, does
that imply that it is a form of information?

Riddle two. In the late 18th century, many examples of Egyptian hieroglyphics were
known, but no one could read them. Did they have meaning? Apparently not, since
there were no “rememberers.” In 1799, the French found the Rosetta Stone, and within
the next 20 or so years, this “lost” language was recovered, and with it, the “mean-
ing” of Egyptian hieroglyphics. So, was the meaning “in” the hieroglyphics, or was it
“brought to” the hieroglyphics by its translators?

Riddle three. If I write a computer program to generate random but intelligible stories
(which I have done, by the way), and it writes a story to a text file, does this story have
meaning before any person reads the file? Does it have meaning after a person reads the
file? If it was meaningless before but meaningful afterwards, where did the meaning
come from?

Riddle four. Two cops read a suicide note, but interpret it in completely different
ways. What does the note mean?

Riddle five. Suppose I take a large number of tiny pictures of Abraham Lincoln and
arrange them, such that they spell out the words “Born in 1809”; is additional mean-
ing present?

Riddle six. On his deathbed, Albert Einstein whispered his last words to the nurse
caring for him. Unfortunately, he spoke them in German, which she did not under-
stand. Did those words mean anything? Are they now meaningless?

Riddle seven. When I look at your family photo album, I don’t recognize anyone, or
understand any of the events depicted; they convey nothing to me but what they imme-
diately depict. You look at the album, and many memories of people, places, and events
are engendered; they convey much. So, where is the meaning? Is it in the pictures, or
is it in the viewer?

As we can see by considering the questions above, the meaning of a data set arises
during an act of interpretation by a cognitive agent. At least some of it resides outside
the data itself. This external content we normally regard as being in the domain ontol-
ogy; it is part of the document context, and not the document itself.

1.6 Data Complexity


When talking about data complexity, the real issue at hand is the accessibility of latent
information. Data are considered more complex when extracting information from
them is more difficult.

Complexity arises in many ways, precisely because there are many ways that latent
information can be obscured. For example, data can be complex because they are
unwieldy. This can mean many records and/or many fields within a record (dimen-
sions). Large data sets are difficult to manipulate, making their information content
more difficult and time consuming to tap.
Data can also be complex because their information content is spread in some
unknown way across multiple fields or records. Extracting information present in com-
plicated bindings is a combinatorial search problem. Data can also be complex because
the information they contain is not revealed by available tools. For example, visualiza-
tion is an excellent information discovery tool, but most visualization tools do not sup-
port high-dimensional rendering.
Data can be complex because the patterns that contain interesting information
occur rarely. Data can be complex because they just don’t contain very much informa-
tion at all. This is a particularly vexing problem because it is often difficult to deter-
mine whether the information is not visible, or just not present.
There is also the issue of whether latent information is actionable. If you are trying
to construct a classifier, you want to characterize patterns that discriminate between
classes. There might be plenty of information available, but little that helps with this
specific task.
Sometimes the format of the data is a problem. This is certainly the case when those
data that carry the needed information are collected/stored at a level of precision that
obscures it (e.g., representing continuous data in discrete form).
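The effect of storage precision can be illustrated with a minimal sketch. The readings below are invented purely for illustration, and the helper names `separable` and `discretize` are hypothetical; the point is only that two classes that are cleanly separated in continuous form can become indistinguishable once the values are stored as integers.

```python
import math

# Hypothetical readings for two classes; the values are invented for illustration.
class_a = [2.1, 2.2, 2.3, 2.4]
class_b = [2.6, 2.7, 2.8, 2.9]


def separable(a, b):
    """True if a single threshold separates every value in a from every value in b."""
    return max(a) < min(b) or max(b) < min(a)


def discretize(values):
    """Store each value at integer precision by dropping the fractional part."""
    return [math.floor(v) for v in values]


if __name__ == "__main__":
    print(separable(class_a, class_b))
    print(separable(discretize(class_a), discretize(class_b)))
```

At full precision any threshold between 2.4 and 2.6 discriminates the classes; after discretization every reading is stored as 2, so the discriminating information is no longer recoverable from the stored data.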
Finally, there is the issue of data quality. Data of lesser quality might contain infor-
mation, but at a low level of confidence. In this case, even information that is clearly
present might have to be discounted as unreliable.

1.7 Computational Complexity


Computer scientists have formulated a principled definition of computational complex-
ity. It treats the issue of how the amount of labor required to solve an instance of a
problem is related to the size of the instance (Figure 1.7).
For example, the amount of labor required to find the largest element in an arbi-
trary list of numbers is directly proportional to the length of the list. That is, finding
the largest element in a list of 2,000 numbers requires twice as many computer opera-
tions as finding the largest element in a list of 1,000 numbers. This linear proportional-
ity is represented by O(n), read “big O of n,” where n is the length of the list.
On the other hand, the worst-case amount of labor required to sort an arbitrary list
with a simple method such as selection sort is directly proportional to the square of
the length of the list. This is because such a sort rescans the unsorted part of the list
for every element to determine which is the next smallest or largest. Therefore, sorting
an arbitrary list of 2,000 numbers this way requires roughly four times as many computer
operations as sorting a list of 1,000 numbers. This quadratic proportionality is
represented by O(n^2), read “big O of n squared,” where n is the length of the list.
(More efficient sorting algorithms, such as merge sort, need only O(n*log(n)) operations
in the worst case.)
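The doubling argument above can be checked by counting comparisons directly. The sketch below is an illustration, not the author's code: it assumes a single-scan maximum and a selection sort (one simple sort that rescans the unsorted part of the list for each position), and the function names are hypothetical.

```python
def find_largest(values):
    """Return (largest value, number of comparisons) using a single scan."""
    largest = values[0]
    comparisons = 0
    for v in values[1:]:
        comparisons += 1
        if v > largest:
            largest = v
    return largest, comparisons


def selection_sort(values):
    """Return (sorted copy, number of comparisons) using selection sort."""
    items = list(values)
    comparisons = 0
    for i in range(len(items) - 1):
        smallest = i
        # Rescan the unsorted tail to find the next smallest element.
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[j] < items[smallest]:
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]
    return items, comparisons


def comparison_counts(n):
    """Count comparisons for both tasks on a reversed list of length n."""
    data = list(range(n, 0, -1))
    _, max_ops = find_largest(data)
    _, sort_ops = selection_sort(data)
    return max_ops, sort_ops


if __name__ == "__main__":
    for n in (1000, 2000):
        print(n, comparison_counts(n))
```

Going from 1,000 to 2,000 items roughly doubles the first count (n - 1 comparisons) and roughly quadruples the second (n(n - 1)/2 comparisons), which is the O(n) versus O(n^2) behavior described above.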

Figure 1.7 The hierarchy of computational complexity.

Figure 1.8 Complexity hierarchy for various text processing problems.



Much research has been conducted to determine the Big O complexity of various
algorithms. It is generally held that algorithms having polynomial complexity, O(n^p)
for some fixed p, are tractable, while more demanding Big O complexities are
intractable. The details can’t be addressed here, but we do note that many data mining
problems (optimal feature selection, optimal training of a classifier, etc.) have no
known solution of polynomial complexity. In practice, this means that data miners must
be content with solutions that are good enough. These are referred to as satisficing
solutions.
Problems that are very computationally complex in their general case may fall into
a class of problems referred to as NP-Hard. These problems, which have no known
efficient algorithmic solutions, are frequently encountered in data mining work. Often
problems in a domain are arranged in a hierarchy to help system architects make engi-
neering trades (Figure 1.8).

1.7.1 Some NP-Hard Problems

• The Knapsack Problem. Given cubes of various sizes and materials (and hence,
values), find the highest value combination that fits within a given box.
• The Traveling Salesman Problem. Given a map with N points marked, find the
shortest circuit (a route that ends where it starts) that visits each point exactly once.
• The Satisfiability Problem. Given a boolean expression, determine whether
there is an assignment of the variables that makes it true.
• The Classifier Problem. Given a neural network topology and a training set,
find the weights that give the best classification score.
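To make the contrast between an exact answer and a "good enough" one concrete, here is a small sketch for the Traveling Salesman Problem. It is an illustration under stated assumptions: points are (x, y) tuples with straight-line distances, the brute-force search tries every circuit, and the nearest-neighbor rule is one common greedy heuristic chosen here for simplicity, not a method taken from the text. All function names are hypothetical.

```python
from itertools import permutations
from math import dist


def tour_length(points, order):
    """Length of the closed circuit that visits points in the given order."""
    return sum(
        dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force_tour(points):
    """Exact answer: try every circuit starting at point 0 (factorial work)."""
    rest = range(1, len(points))
    best = min(
        ((0,) + p for p in permutations(rest)),
        key=lambda order: tour_length(points, order),
    )
    return list(best)


def nearest_neighbor_tour(points):
    """Heuristic answer: always move to the closest unvisited point."""
    order = [0]
    unvisited = set(range(1, len(points)))
    while unvisited:
        here = points[order[-1]]
        nxt = min(unvisited, key=lambda i: dist(here, points[i]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 1), (1, 0), (3, 2), (2, 4)]
    print(tour_length(pts, brute_force_tour(pts)))
    print(tour_length(pts, nearest_neighbor_tour(pts)))
```

The brute-force search is only usable for a handful of points, while the heuristic scales to large inputs at the cost of possibly returning a longer circuit.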

1.7.2 Some Worst-Case Computational Complexities

• Determining whether a number is positive or negative: O(1) = constant time
• Finding an item in a sorted list using binary search: O(log(n))
• Finding the largest number in an unsorted list: O(n)
• Performing a Fast Fourier Transform: O(n*log(n))
• Sorting a randomly ordered list with a simple method such as selection sort: O(n^2)
• Computing the determinant of an n-by-n matrix by Gaussian elimination: O(n^3)
• Brute-force solution of Traveling Salesman Problem: O(n!)
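As one worked instance from the list above, the logarithmic cost of binary search can be seen by counting how many elements are examined. This is a minimal sketch; the function name and the probe counter are added here for illustration.

```python
def binary_search_steps(sorted_values, target):
    """Return (index or None, number of probes) for a sorted list."""
    lo, hi = 0, len(sorted_values) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_values[mid] == target:
            return mid, probes
        if sorted_values[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return None, probes


if __name__ == "__main__":
    data = list(range(1_000_000))
    print(binary_search_steps(data, 999_999))
```

Because each probe halves the remaining range, a list of one million items never needs more than 20 probes, whereas a linear scan could need one million.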

1.8 Summary
The purpose of this chapter was to provide the reader with a grounding in the fun-
damental principles of data mining as a technical practice. Having read this chapter,
you are now able to define data mining from both a philosophical and operational
perspective, and enumerate the analytic functions data mining performs. You know
the different types of data that arise in practice. You have been introduced to the basics
of computational complexity theory, and the unavoidable presence of intractability.
Most importantly, you have considered the differences between data and
information.
Now that you have been introduced to some terminology and the fundamental
principles of data mining, you are ready to continue with a summary overview of data
mining as a principled process.

Coming up
The next chapter presents a spiral methodology for managing the data mining process.
The key principles underlying this process are summarized in preparation for the
detailed treatments that follow later.
Chapter 2
The Data Mining Process

Purpose
The purpose of this chapter is to provide the reader with a deeper understanding of
the fundamental principles of data mining. It presents an overview of data mining as
a process of discovery and exploitation that is conducted in spirals, each consisting of
multiple steps. A Rapid Application Development (RAD) data mining methodology is
presented that accommodates disruptive discovery and changing requirements.

Goals
After you have read this chapter, you will be able to explain the more complex princi-
ples of data mining as a discipline. You will be familiar with the major components
of the data mining process, and will know how these are implemented in a spiral
methodology. Most importantly, you will understand the relative strengths and weak-
nesses of conventional and RAD development methodologies as they relate to data
mining projects.

2.1 Introduction
Successful data mining requires the cultivation of an appropriate mindset. There are
many ways that data mining efforts can go astray; even seemingly small oversights can
cause significant delays or even project failure. Just as pilots must maintain situational
awareness for safe performance, data miners must remember where they are in their
analysis, and where they are going. All of this demands a principled approach imple-
mented as a disciplined process.

The alternative to using a disciplined process is often expensive failure. “Data
mining boys love their analytic toys”; directionless analysts can spend infinite time
unsystematically pounding on data sets using powerful data mining tools. Someone
who understands the data mining process must establish a plan: there needs to be
a “Moses.”
There also needs to be a “Promised Land.” Someone familiar with the needs of the
enterprise must establish general goals for the data mining activity. Because data min-
ing is a dynamic, iterative discovery process, establishing goals and formulating a good
plan can be difficult. Having a data mining expert review the problem, set up a reason-
able sequence of experiments, and establish time budgets for each step of analysis will
minimize profitless wandering through some high-dimensional wilderness.
There is still some disagreement among practitioners about the scope of the term
data mining: Does data mining include building classifiers and other kinds of models,
or only pattern discovery? How does conventional statistics fit in? and so on. There is
also disagreement about the proper context for data mining: Is a data warehouse neces-
sary? Is it essential to have an integrated set of tools? However, there is general agree-
ment among practitioners that data mining is a process that begins with data in some
form and ends with knowledge in some form.
As we have seen, data mining is a scientific activity requiring systematic thinking,
careful planning, and informed discipline. We now lay out the steps of a principled data
mining process at a high level, being careful not to get lost in the particulars of specific
techniques or tools.
In computer science, development methodologies that repeat a standardized
sequence of steps to incrementally produce successively more mature prototypes of a
solution are referred to as spiral methodologies. Each cycle through the sequence of steps
is one spiral.
An enterprise is any entity that is a data owner having an operational process. This
includes businesses, government entities, the World Wide Web, etc. The following is
an overview of a data mining project as a process of directed discovery and exploitation
that occurs within an enterprise.

2.2 Discovery and Exploitation


As a process, data mining has two components: discovery and exploitation.
Discovery is an analytic process, e.g., determining the few factors that most influ-
ence customer churn. Exploitation is a modeling process, e.g., building a classifier that
identifies the customers most likely to churn based upon their orders last quarter. We
can characterize these functionally by noting that during discovery, meaningful pat-
terns are detected in data, and characterized formally, resulting in descriptive models.
During exploitation, detected patterns are used to build useful models (e.g., classifiers).

• Discovery
o Detect actionable patterns in data
o Characterize actionable patterns in data
further to the right, and the wind has changed to N. The high
shores, behind which the whole country is bare, with the exception
of a few uschàrs, and seems to lie higher, approach again the river
on the left; and two villages shew themselves at some hundred
paces, on the gently-ascending downs; below them the old river-bed
appears on dry ground.
The Shilluks, armed with lances, and standing on the shore,
shout again their “Habàba!” but we sail now, and they do not offer
us anything, much as we should like to make use of their cows and
wood; and besides there are too many of them. Groups of tokuls
stand in a row. A quarter after twelve, continually E.S.E. Half-past
twelve, S.E. by E.; to the left, E. The wind has changed, and is
contrary; so we go E.S.E. The Shilluks also have sleeping-places,
open at the top, wherein warm ashes form their beds, with which
also they powder their hair, thereby making it look grey.
A quarter before one. From E. by N. A gohr on the right, and we
go, at one o’clock, E.S.E. Half-past one. The river takes a direction
before us to E., with some little inlets, so that we cannot see the
lower shore. The wind blows strongly against us from E. We have
but scanty fare, being without meat. I cannot deny kew to myself
now, for I really want it.
Half-past two. E. by S. A Haba on the right, before it a lake
connected with the river in front; the forest is upon a gentle
declivity, and covered with shrubs, thorns, and dwarf-trees, even to
the edge of the water. The shore also falls away gently to the river,
near which it only rises a little above the narrow green margin of
grass. We halt close to the right shore, owing to want of wood.
The shore ascends to about fifteen feet high, where the trees
begin, and is composed of nothing but mimosas, although the Nile
very certainly does not flow over it; for the river has full play far
away to the left.
If we call these lakes, marshes, and reed-morasses, a
longitudinal valley, enclosed as they are with the Nile between two
high shores, which, however, do not ascend to the due height, the
original shores perhaps lying still further by the irregular low line of
mountains, or rather hills, it is plain that the same is gradually filled
by alluvial deposits from the mountains of Bari, or from above, and
an accumulation of vegetables, or the momentary sprouting forth of
a corresponding kingdom of plants, must have soon followed the
more important vegetable matter. As the sluices of the so-called
valley pour into the great Nile, it must have falls on a level with the
Nile itself, and has, therefore, dug a bed, and made an even slope to
this side, after the stream had removed the first barriers or dikes of
the high shores, which are now secure from any inundation. A river-
bed, indeed, naturally becomes deeper when there is a proper fall
and a regular conduit. The lower Nile has elevated its bed, because
it has but few vents. Why could not the White River have a similar
retrograde connection of water, which is prevented from flowing off,
such as is the case, in the first place, near Khartùm? The Nile here
might have been previously in majestic fullness, and flowed rapidly
between the present old shores to Khartùm, until it created shallows
and islands, where reeds and water-plants of every species sprang
forth luxuriantly from the nearly stagnant water, and vehemently
opposed the natural course of the river, seized the alluvial deposits
from above in their polypi-arms, and rose to what we now see to be
meadows and marshes.
The Shilluks are tolerably acquainted with the good disposition of
the Turks: as soon as a vessel approaches a group of them, they get
up and go away; this even befell Selim Capitan, in spite of his
interpreter. When they see us coming, they drive the cows from the
water, even without letting them drink. We on our side are afraid,
and with justice, to land on the inhabited spots. I brought back two
guinea-fowls, the produce of my shooting excursion with my
servants; I had seen Suliman Kashef with one of a similar kind
above. They are not at all like those in Taka, and different only from
those of Europe by the darker colour of their plumage. We shall
remain here to-night; thunder and rain have been satisfied with
merely threatening us,—and are happily over. I disembark once
more, and see fifty to sixty giraffes in the level shore towards the
horizon, but it was too late to get at them. The thermometer was at
nine o’clock in the morning 21°, but did not get up afterwards to
more than 28°, fortunately for us,—not so much on account of
shooting as because the heat might have been insupportable, for we
were between these high shores à talus, with an average angle of
25° to 30°, and the wind was entirely still.
10th March. We remain to-day here for the sake of shooting,
conformably to Suliman Kashef’s determination. His halberdiers set
off to-night to follow the course of the giraffes, and to find out their
abode in the gallas,—unfortunately without success, for they did not
like perhaps to trust themselves so far in the territory of their deadly
enemies.
I remarked a number of burnt bones of hippopotami in the low
forest lying close to the river. I should be inclined to believe that the
natives burn the carrion intentionally, in order not to be exposed to
the disgusting effluvium. A species of black wasps build hanging-
nests here, which however seem from their transparency to contain
very little honey. I could not ascertain this more exactly, because I
was obliged to be cautious in breaking off a branch with such nests
on it. We remark low mountains beyond the softly ascending desert,
and perhaps the dry water-courses which issue here from the steppe
flow to them, and there may be the real abode of the deer. In my
shooting excursion I looked carefully among the thorn-bushes, and
found that the plants are mostly the same; I had fancied quite
otherwise. A blue convolvulus—not, however, belonging to the water
—displayed a lighter colour than usual, and had also round and
glutinous leaves: I took seeds of some pretty creepers and gathered
the fruits of the shrubs, for I was already acquainted with the
leaves. Every thing now was withered, and I am curious to know
what will become of the various seeds I have collected when they
are sown in Europe.
Most of the birds had retreated before the shooting of the other
sportsmen commenced, but I stumbled upon several turtle-doves,
and instinctively grasped my gun, letting my botanical bundle fall on
the ground. I shot some, and got under a tree, where I saw them
fluttering around. The thorns stuck to me and pricked me all over,
and there I sat bent, like an ostrich caught in a thorn bush,
compared with which the bull-rush of Moses was a child. I could not
force through it with my coat on and gun in my hand; so I got loose
from the sharp barbs of the thorns with torn clothes, leaving behind
the tarbusch, takie, and half my cowl, without even scratching my
ears, though they were bleeding enough already. I fetched back my
tarbusch by means of my gun, and then examined my malicious
enemy a little closer, notwithstanding he was an old acquaintance. I
found withered apples on it, and gathered some, for the sake of the
seed; when green they are exceedingly similar to oranges or
Egyptian lemons. I have not found it confirmed that they are deadly
poison to camels.
11th March.—“Bauda mafish, am’d el Allàh!” (the latter properly
Hamdl el Allàh,) was the cry on all sides to Allàh, because the gnats
had taken their departure, and I hope that those which are still in
my cabin will soon follow their companions. Departure at a quarter
before ten to S.E. by E., then a little E. by S. Summer or pastoral
villages on the left: we perceive also herds, but not a morsel of them
is destined for us. On the right an old river-bed or narrow lake,
mostly marshy, and connected below with the river. A quarter after
ten, E.S.E., on a pretty good course, with the exception of some
shallow inlets. We sail, with a south-west wind, four miles. On the
left again open reed-huts or sleeping-places, and herds to which the
people are collecting,—on account of the Turks. All the Haba here is
deposited soil, which lies almost always higher than the other
ground. This evidently fading forest once enjoyed better times, when
the blessing of rain was afforded it, but the benefit of which it lost
directly by its higher situation.
What fables are told of the incredible luxuriance of the tropical
kingdom of plants! At all events it could only be said of aquatic
plants which are forced by water, evaporation, and sunshine, as if by
steam or chemical preparations; but then only in the rainy season
and a few weeks beyond. I saw, indeed, trees shooting forth at this
time in Taka, which boiling and cauldron-shaped valley may perhaps
contain a tropical growth, or something like it; and plants springing
up from the morass with incredible celerity and luxuriance, as if by
magic. But trees that have true manly vigour, and strive to shoot out
with sound strong muscles, whose pith is still clearly to be seen in
the bark, with not a bough injured,—not a branch hanging down
withered,—these are sought for in vain in the Tropics, so far as I
have seen. We can form a tolerable idea of the momentary life and
vigour there by comparing in Europe, acacias, planes, and poplars,
on suitable soil; it is the most cheerful awakening after a long
repose: but part of the limbs always continues in a sleep-like death,
whether it be under the bark of the stem, or a bough that the sun
scorches, or a runner become dry, which disfigures the whole tree. A
forest requires care, either by the fortuitous kindness of Nature
herself; or, when that is not sufficient, by the directing hand of man.
The omnipotence of the terrestrial womb of fruits is past,—that
which gave previously the magic of lovely green to the coming
species, without any visible seeds of themselves. Half-past eleven
o’clock, S.E.—It has just rained a little;—what anxiety and fear of
rain these half-naked coloured people shew; what care they display
in preparing immediately a tent to sit under! I have very often
remarked this; rain must therefore make a sensible impression on
their hot skin. Twelve o’clock, E.S.E. We see at the distance on the
left towards the horizon, solitary dhellèbs as usual on elevated
ground; and also isolated little groups of Shilluks. Narrow tracks of
water right and left, which not long ago were flowing cheerfully. The
river has also gradually laid aside its terraces in preceding times,
until it has limited itself to its present bed; and those parts of the
shore, lying higher are only just moistened, even when it is at its
highest water-mark. It would be interesting to follow these old river-
beds in the ascending line at the side, and to arrive at the dams of
the primitive stream, or at the higher circumvallation which
surrounded the lake here at one time. A quarter after one o’clock.—
On the right a gohr cul-de-sac, low bushes to S.E., called by the very
same name as the Haba; on the left solitary trees and straw huts of
the herdsmen. At two—on the right, another gohr cul-de-sac,—to E.
We sail E.N.E, and wind, for the first time since the morning, to the
left: a track of water in the shape of a terrace, just there, from half a
foot to a foot higher than our level. A beautiful line of dome-palms
before us, but still thicker a little to the left. Half-past three, N.E.—
Heaps of simsim-sheaves on the water at the left, and a row of ten
villages near the dome-palms. A broad gohr or river comes from W.
This may be the river of the Jengähs; but it seems to approach in
the background too much to the Nile; perhaps therefore it is that
gohr which is said to have its old river-bed on the high shores, below
the villages of the Shilluks. A quarter before three, E. We see on the
left seven more large and small villages, by or near that row of
dome-palms, which on this side is very thin; then a dome-forest to
the left at a quarter of an hour’s distance.
An unlimited water-course before us in E. by N., but no huts to
be seen on the left. Therefore, the nation of the Nuèhrs might have
been dislodged by the Shilluks from that quarter; for the former
extend, or are said to extend, up to the Sobàt and its shores. This
side, at all events, had been inhabited, as I plainly saw this morning
at our landing-place. The Haba, however, continues at a slight
distance from the river; on the left also the dome-forest is now
reduced to a strip of a wood. The shores are surprisingly low on
both sides; and therefore not any tokul-village is to be seen near
them. A gohr is on the right, which is scarcely separated from the
river, and in connection with it, like the other narrow ones. Three
o’clock. On the left three more villages in the dome-forest tract; and
on the right and left parallel gohrs, subordinate Niles, which are
now stagnant, and the fish in which are a prey to men and beasts.
Four more villages to the left, near the dome-wood retreating from
the river; on the right the forest thickens.
Half-past three. Towards S. We have a tolerably high and
apparently planted island at our left, and halt at the right near a hill
—probably a deserted domicile. But look there! that is really the far-
famed Sobàt, the water of which is flowing against us, and which is
so much feared by the crew, who are tired of the voyage. I soon
disembarked on the shore, sauntered up the hill, and was surprised
to find that I could see so far in the distance, and fed my eye and
mind with a diorama which extended from W. to N.E. The Nile is
conspicuous in the W., and meanders to N.E., where it is lost to the
sight. An isolated dhellèb-palm on the right shore indicates this last
boundary. The horizon behind this glittering length of the Nile is
adorned with a transparent forest of dome-palms, interspersed with
slender dhellèb palms, with their small heads. The basin of a lake
spreads from W. to N.W., at my feet, and the river Sobàt winding
downwards from S.E., and flowing in the depth at my right, unites
with the Nile near the lake: both its shores are bare, and only a few
melancholy straw tokuls stand on the extreme point of the right
shore. All the remaining part of the district extends far and wide in a
dead waste, with a little withered grass; and the horizon alone from
S.S.W. to S.W., displays afar some palms and other trees, through
which the blue sky glistens.
The lake lying in the angle between the left shore of the Sobàt
and the right of the White Stream is connected with the former by a
narrow opening, evidently prevented from closing by the hand of
man. The mouth, as is the case elsewhere, is merely stopped up by
reeds, to keep the fish of the lake in confinement. Our blacks
shewed on this occasion what they do to catch fish when the water
of these lakes is shallow, and does not reach up to a man’s middle.
They disturb it with their feet, put fishing or conical baskets into it,
and harpoon the large fish, who come to the top to breathe.
The Sobàt, swelling at high water far higher and stronger, has
raised unquestionably a dam against this lake, the former river-bed
of the White Stream, and pressed the Nile more towards N.W. into
its present bed. Notwithstanding such an advantage being at hand,
the natives have cut through the dam for the purpose of catching
fish. The Sobàt has shortly before its mouth a hundred and thirty
mètres in breadth and three fathoms in depth, whilst when we were
here before it was four fathoms; and according to Selim Capitan, a
few days earlier last year, five fathoms. We can tell but very little
generally of the depth of the Nile, because its bed is very uneven,
and the stream causes eternal fluctuations.
The name of Sobàt could only have been given to this river by
the Funghs, for the Arabs have never possessed it, and usually call it
Bach’r el Makàda (river of Habesch.) The Dinkas name the White
Stream Kedi, and this Kiti, which mostly denotes water in the
dialects on the White Stream up to Bari, where it is called Kirboli: Kir
also means water among the tribes down the river. Its name is Tilfi
and Tak with the Nuèhrs and Shilluks.
When I view the steep and high slope of the shores of the Sobàt,
and the proportionate thin layer of earth on the immovable strata of
clay or original soil, which here is twenty to twenty-five feet higher
than on the shore or in the bed of the Nile, I return to my former
conviction, that the immeasurable particles of stone and plants
stream by means of the breach, and flowing away of the lakes of the
Ethiopian highlands, to the lake of the basin-shaped valley of the
White Stream which flows off with the Nile, as the deepest point;
and that all the lower country under the mountain chains of Fàzogl
and Habesch, from the Atbara to the land of Bari must be under
water, if it be not a lake connected with the depressed regions of the
White Stream. If the lakes, therefore, of that lofty plain were torn by
a powerful catastrophe, and deserted their chasms or valleys, as the
water-basins of Switzerland did formerly—(even now there are lakes
or flat valleys, signs of a deluge, in which the waters might have
dashed from the summit of Atlas to the top of the Alps)—there is no
question that the lower lakes or valleys must have filled and
overflowed. The first rushing-down of the mass of waves, incredibly
violent as it must have been, the falling of mountains accompanying
it, and their washing-away, overpowered everything below them, as
if gods had descended from Olympus, and no longer recognized
those limits that would have remained eternal obstacles by an
inferior shock. The first deposit was a layer of clay on the side of the
Sobat, whilst the White Stream suffered no such sediment when in
its primitive strength, and washed away everything that it could
seize, as is shewn by the far lower shores. The high shores of the
Sobàt and its environs fall away, especially towards the level parts of
the left side of the Nile, to which the accumulated slime could still
less arrive owing to the stream carrying it off, although several gohrs
and rivers from thence pour into it. These afford water certainly, but
no slime to increase the height of the shore, as we plainly see by the
Gazelle River, and also in the little Kiti of the Jengähs called Njin-
Njin. We must assume from the Dinka country and its greater
elevation, that the ground towards the Nile was heightened formerly
by its gohrs flowing from above, or perhaps constant rivers; whilst
Kordofàn, which lies over the left shore of the Nile, discharges no
rivers, and its oases have run down from the mountains themselves,
and formed islands in the sands which still remain, for the sunken
ground forms cisterns that nourish the succulent power of the
mountains by imbibing the moist element; or it may be, that springs
were bored by God’s own hand.
CHAPTER IX.
ROYAL CRANES. — SCRUPLES OF FEÏZULLA CAPITAN. — COMPOSITION OF THE
SHORES. — DESCRIPTION OF THE DHELLÈB—PALM AND ITS FRUIT. — FORM
OF EGYPTIAN PILLARS DERIVED FROM THIS TREE. — DIFFERENCE BETWEEN
EGYPTIAN AND GREEK ARCHITECTURE. — DESCRIPTION OF THE SUNT-TREE.
— DEATH OF AN ARABIAN SOLDIER. — VISIT OF A MEK OR CHIEF. —
DANGEROUS RENCONTRE WITH A LION ON SHORE. — PURSUIT OF THIS
BEAST BY THE AUTHOR AND SULIMAN KASHEF WITH HIS MEN. — FEAR OF
THE NATIVES AT THE TURKS. — PLUNDER OF THEIR TOKULS BY THE CREW. —
BREAD-CORN OF THE DINKAS. — ANTELOPE HUNT. — DIFFERENT SPECIES OF
THESE ANIMALS. — IMMENSE HERDS ON THE BANKS OF THE WHITE NILE. —
LIONS AGAIN. — BAD CONDITION OF THE VESSELS.

12th March.—We set out at half-past nine o’clock, and sail to S.E.
by E. Shrubs on the higher shore to the right. A quarter before ten,
from S.E. by E.; further to the left round a corner, to which a bend
corresponds on the opposite shore: this is often the case on the Nile.
To E.N.E., and immediately again with a short tract to N.E. The river
flows with all its force against the left shore, and therefore the latter
is higher, more perpendicular, and disrupt, than the right, which
soon, however, becomes similar. We go a short tract libàhn, and see
a few miserable small straw tokuls with thin doors, on the left, in the
little green underwood, which seems to be nourished by the
inundation, and is mostly young döbker.
The shores display again iron oxyde. A quarter before eleven:
from E. by N., to the right, E.S.E., where we sail. The shores on the
right and left are higher, according to the current, and the falling of
the river is accurately marked out on the shore by little gradations,
which are exceedingly regular, and one to two inches high. We crawl
on only slowly with the faint south wind, and make now one mile;
for the current being stagnant below towards the Nile, told me
directly that the floating companion of the mountain dissipates
quickly its water, differently from the slow, crawling Nile, which is
obliged to work through the plain of a lake-basin.
Eleven o’clock. The wind freshens, and we go S.E. and E.S.E. On
the left a solitary dhellèb-palm rises on the shore, with its beautiful
and really symmetrical head; its slender base without rings, and its
elegant foliage. From hence in the bend, further to the right, in S.,
where five dhellèb-palms break the uniformity of the high shore on
the left. A low ridge of a hill lies near them, on which a village must
have once stood. If I could but transplant the tallest dhellèb to
Louisa’s island, near Berlin, to make it the common property of all
the northern nations! It is hot, for the high shores keep the
refreshing breeze from the deep water, and only the sail enjoys a
cheerful gust of wind, with the assistance of which we go, at a
quarter before twelve, from S.W., where a regular forest before us
presents itself to the eye, to the left, in S.W. by S. We make two
miles; a quarter of a mile, perhaps, being derived from the current.
A quarter after twelve, from S.W., to the left, E. by N. We hardly
move from the place till it blows from N.E., and then we go better,
having four miles’ course. An old sailor runs on shore close by the
vessel, to find crocodiles’ eggs; tumbles into holes, falls in the grass,
and is using every exertion to find a convenient sand-path instead of
the clay. The crew call him to come off, but he wants to shew that
he is a nimble fellow—thus every one has his hobby-horse.
The river winds continually in a bend to the left: a wretched
stunted forest on the right, and miserable tokuls, without people,
here and there on this shore. One o’clock; from E. by N., where the
river winds again to the right, S.E. by S. We halt at a quarter before
two, at the right shore, yet not to let the men rest; that would be
against the Turkish custom, for they think there are no human
beings except themselves. At three o’clock we go with libàhn to S.E.,
and immediately to the left E. Half-past three, in a bend to the right,
S.S.E.; and four o’clock, on the left, in the bend, to E.S.E. Five
o’clock, from E.N.E., on the right to E., where we stop at the right
shore.
Last night I awoke several times, and the wild geese on the
neighbouring lake seemed to call to me in a friendly manner, and
scream “Here we are, for you have not had for a long time either
sheep, goats, or fowls.” I was on the wing therefore at day-break,
but saw only four royal cranes (grus royal, Arabic gornu, or chornu),
one of whom I shot, for they are very delicious when dressed in a
ragout. Feïzulla, although he has been seven years in England,
drinks drams and wine like a Turk, and scruples to dine with me,
because I had not cut the bird’s throat immediately after it was shot,
whilst it was yet alive, and made it debièg (koscher, as the Jews
say). These beautiful birds, with a tuft of golden hair and shining
feathers, appear in flocks on the White River: my Sale killed a brace
in a moment, and would have brought us more if he could have
followed them. The geese would only surrender at discretion to the
“longue carabine,” and I had only my short double-barrel.
I visited once more, on this occasion, the hill above-mentioned,
which I found quite adapted for the situation of a village. I had seen
already the remains of potters’ ware, and solitary flower-gardens, or
plots of ground trodden down, where once tokuls stood, but where
now neither grass nor shrubs could grow; and I came to the
conclusion that a considerable village must have stood there, which
could have belonged only to the Nuèhrs, and was probably
destroyed by the Shilluks. Thermometer, sunrise, 21°; half-past nine
o’clock, 28°; noon, 29°; no rise beyond that was perceptible
afterwards.
13th March.—Departure at seven o’clock, with libàhn to E.S.E. by
E.; then to the left, E.N.E., and we sail with a good north-east wind.
A quarter before eight: from E., in the bend to S.E.; on the left some
straw tokuls. The wind becomes strong, and we make six miles for
the present; the mountain stream seems to be here at its lowest
pitch, and has only a quarter of a mile rapidity. Eight o’clock; from
S.E. by S.; to the left, E. by S., where we are obliged to go libàhn. A
quarter after eight, to the left, but we halt before the corner of the
bend till noon, owing to the violent east wind. I made a little
excursion into the immeasurable plain, which was tree-less and
comfortless; and found two villages, better built than usual, to which
I was not able to approach, and likewise a long and dried-up marsh.
I could not, unfortunately, discover any guinea-fowls in the durra-
stubble.
At twelve o’clock, we proceed with libàhn to N.E., where our
Bach’r el Makàda winds again to the right. Half-past twelve. The
shores, with few exceptions, attain a height of fifteen to eighteen
feet: the upper surface of the soil consists of humus to two or three
feet deep (which may be deeper in the low ground, old gohrs, and
several tracts), and under it nothing is seen but clay or mud, having
a yellowish colour on the shore, from the iron oxyde, with which it is
strongly impregnated, and generally more so than on the White Nile,
where this is only the case in layers. A fertile country, but requiring
human hands, canals, and sakiën. We see from its shores, and in the
dried-up pools, which receive very little nourishment here from
vegetable matter, particularly on the upper land, that the Sobàt
brings down fruitful earth or slime.
From half-past twelve to two, in a bend to the left, S.E., where
we go again left in N.E. by N. On the same side there is a tolerably
well built little village on the shore. A quarter before three, still
further to the left, N. by E. Four o’clock, we wheel to the right in
E.N.E., where we get the view of a genuine low forest, and notice on
the left a village in the winding to S. by E. Half-past four, also
further; a hamlet on the right with straw tokuls, the first on this side.
We see here also reed-boats, as among the Nuèhrs and Shilluks on
the Nile. At five o’clock to S., where we at first halt at the right
shore, before the bend to the left. Two large villages lie from half to
three-quarters of an hour distant, and I see an immeasurable bare
plain cracked from drought,—a summer shallow lake without any
verdure. We go then to the left shore, the soil of which is less mixed
with sand than that of the right, and gives us some hope of shooting
and fishing. The huntsman Sale returned, however, disconsolate, for
he had seen nothing at all.
The left shore is still more precipitous and higher here than the
right one, because the stream forces itself into this bend. When we
disembark, we find that the land again rises to a gentle acclivity, and
we have the prospect of a large lake about three quarters of an hour
distant, which overflows perhaps deeper into the Sobàt. Many lakes
of this kind must be found in the country of the Dinkas, because
springs, as in the Taka country, are not sufficient for the watering of
the cattle of this merissa-loving, dancing and singing tribe; and
besides, the drawing of the water would cause too much trouble.
The Sobàt is stagnant here in the proper sense of the word, and
no log can determine anything else.
14th March. We navigate again on the right side, and go at half-
past seven o’clock with libàhn from S. by E., immediately S.E., where
the north-east wind remains contrary to us, notwithstanding the
narrow water-tract. Some small and still green reed-huts hang on
the shore, sheltered from the north wind: these are stations for
hunters of hippopotami and crocodiles, or for fishermen, who,
however, have gone away, and taken with them their working
implements, for they are frightened of us. The durra seems to thrive
famously on the half-sandy shore, and rises cheerfully above the
reeds; probably it is sown,—that is, a handful thrown here and there
on the vacant spots.
Eight o’clock—E. by N., and N.E. by E. The upper margin of the
right shore is planted throughout with durra, and some small fishing-
huts shew that men dwell there. Ten o’clock.—Hitherto always N.E.
with inconsiderable deviations, and then N. by E.; where we halt at
the corner of the right shore on account of the wind, for the river
goes still further to the left: level land above, some underwood, and
a village at a little distance. A quarter before one.—N. by W., and
about one, in a bend to the right. When the crew relieve one
another at the rope, they imitate to perfection the Uh-uh-i-ih of the
tribes on the upper part of the White Stream, and during the towing
itself they sing the song à-à-à-jòk-jòk, which would be difficult for a
white man to do. The force of the water is directed here against the
right shore, which is without any crust of vegetation, and seems to
ascend to the uppermost margin, as is proved by the gradations
being washed away, and the thin layer of humus, one foot to one
and a half high, decreases perpendicularly, whilst the lower part of
the soil displays unmixed clay. It certainly required a powerful
pressure of water to wash this primary deposit to such a depth; the
left shore, on the contrary, has a coating of slime and vegetation
down to the water.
Two o’clock.—E. by N.; twenty-one dhellèb-palms on the left,
with a pastoral hamlet of thirty new straw-tokuls. The crew are
beginning to shoot down the dhellèb-fruits, and I also disembark on
the shore, beyond which the ground, with the beautiful group of
trees, is still imperceptibly elevated. We are quite comfortable there,
but I gaze far and wide for a point to break the unbounded flat
waste that shews not a thorn or a bush; the river winds melancholy
between the naked shores. These palms stand in luxuriant growth,—
a proof that the soil is capable of other things, and may look for a
better future. The very pretty straw-huts present nothing worth
having to our rapacious eyes, and near them we remark the
sleeping-places, and a large, glimmering heap of dung, serving at
night for fire and a bed. The cow-dung is collected in little heaps in
the enclosure, surrounded with palings, where the beast is tied, and
is still quite fresh: notwithstanding this, it is very certain that we
expect in vain the return this evening of these beautifully spotted
cattle. Standing on an old trunk of a tree, I remarked a large village
on the right shore at a quarter of an hour up the river.
The dhellèb-tree has the same fibrous texture of bark, and of the
interior of the trunks, as the dates and dome-palms; but it is far
finer, thicker, and stronger. The outside of the bark shews rings from
below upwards, and the tree itself shoots forth slenderly from the
earth, and swells gradually towards the centre to a spheroid form,
when it decreases again to the top, and rises stately, separating the
head from the stem. The fruit is as large as a child’s head, and in
clusters, as in the palms before named, but on far stronger stalks,
from which it hangs down immediately close to the stem. It is
smooth outside, and of a golden colour, like its pulp; the latter is
fibrous, of a bitter-sweet taste, like chewing soft wood, and leaves
behind in the mouth an astringent taste, which may arise here from
the fruit not being fully ripe. There are from four to six kernels in
this gold apple of the size of a child’s hand, or of those of the dome-
palms: the stalk has a scaly covering, surrounding about a third part
of the fruit. The kernels, or the nuts, have themselves a solid pulp,
shining like dark glass, being exactly similar to that of the dome-
fruit: at first it is like milk, but on coming to maturity becomes of the
consistency of horn. The trunk of palms is surrounded with the same
kind of rings as the date-tree, the rind feeling smooth, like planed
wood; consequently it was impossible to climb these trees to gather
any fruit, owing principally to the swelling in the centre, and
therefore it was shot down. After several attempts, we drove large
nails in the stem, to hold the rope by, and then we ascended
gradually.
The bark falls off on the ground, as is the case with the other
palms, for the tree throws out foliage like grass from the interior: the
thick rootlets spread themselves in all directions through the ground,
like polypi, with a thousand veins of life.
There seems to me to be no doubt that the Egyptian pillars,
protruding in the middle, derived their origin from the dhellèb-palms,
which might have been transplanted in the Thebaïs; for it was
impossible that the Egyptians should not take notice of the unusual
shape of this tree—they who borrowed all their forms and
embellishments, even to those of their spoons and salve-boxes, from
the kingdom of nature.
Lifeless figures having no meaning are never represented by
them; flowers, foliage, leaves, sacred animals, or parts of them
properly introduced, are intermixed with hieroglyphics, like a
garland, without beginning or end. The Greeks quickly seized what
was beautiful in this, discarded what was heavy and confused, and
pleased themselves and succeeding ages by lighter and more
elegant forms. They placed the acanthus and horns, or volutes on
the capitals of their pillars, and the Germans planted a stone-forest
as the holiest of holy.
A large village of the Nuèhrs (judging from several potsherds)
stood on our hill: this nation dwells up the river from hence and in
the direction of the White Stream, where we had seen them last. I
had found also on the last landing-place fragments and the
foundations of a village, and heard from our blacks that the Shilluks,
several years ago, had a great war with the Nuèhrs, drove them
from these parts, and took possession of the lake abounding in fish,
which I have previously mentioned. We have not remarked any sunt
among the mimosas from the country of Bari up to the Sobàt, and
even on this river, but we see talle. The latter tree has a reddish
bark; the long white prickles grow by couples; the flowers are
whitish and without any particular scent; the bark, however, is used
for pastilles, and, when rubbed, sprinkled on the merissa. It affords
the best gum (gamme, semmag), which is white like that from sunt,
while that from the sejal (or sayal) is blackish. Thermometer
yesterday morning 22°, and did not rise beyond 27°, and this
morning 18°; noon 26° to 29°.
15th March. We leave our beautiful palms at half-past nine
o’clock, and go from E. by N., and notwithstanding the strong north-
east wind, slowly in the bend to the right. A quarter after ten, S.E.
by E., then a very short tract S.S.E.: some grass huts of fishermen,
and crocodile and hippopotami hunters at the lower declivity of the
shore on the left. Half-past ten, to the left S.E., and further to the
left, S.E. by E., where we halt at eleven o’clock, because an Arabian
soldier has just cried himself to death before our cabin! He wept at
having to die in a foreign land and not seeing his mother any more.
Nearly all these people lose their courage directly they are attacked
by any illness, the nature of which they cannot visibly perceive as
they can a wound, &c. He died with a piece of bread in his mouth,
because the Arabs believe, and with justice, that so long as you can
chew bread you will not die. It is shameful that we dare not take
even medicine from the fine black physician we have on board, and
much less can we expect assistance or salvation from him. Ten
minutes have flown; the deceased is carried to the upper part of the
shore, and yet the worthy disciple of Clot-Bey has never even looked
at him! We leave at half-past two the place where the soldier was
buried in dead silence, after having received five more cows, upon
whom the crew fell like wolves, and navigate to the left, E.S.E.; then
again slowly to the right. Three o’clock, to S.E. We sail about five
minutes, and stop again at the right shore, by the corner where it
turns to the left, and then again, “Jo hàmmet, Ja mohammed!” is
chaunted at the rope. In the winding below the left shore we saw a
water-hunting establishment of seven straw tokuls. A quarter before
three, from E.S.E. to E. by S. A quarter after four, E.S.E. Half-past
four, E. Some few trees on the right entirely or partly withered, and
soon afterwards a few green ones, of which those standing lower
shew that the water has poured into the shores, even to the margin.
Five o’clock, E. by N., then slowly right to E., where we halt at a
quarter of an hour later. The river makes a strong bend to the right,
and we hope to sail to-morrow.
This afternoon, when the cows were brought us, I procured a
ring, with much difficulty, for sug-sug, and though badly
manufactured, it is at least peculiar to the country. I saw several
such rings among them, but not one of them had a circular form,
and by this we may measure the standard of their skill. Those which
are better worked, are found among the Nuèhrs. The five cows
came from the Mek, who presented himself in person to Suliman
Kashef, with whom Selim Capitan also happened to be: he was
clothed in a ferda, which he had received from the Shilluks. He wore
a very thick copper ring on his hand, and was of opinion that dress
is the privilege of sheikhs. An old woman and a man preceded him;
the former attired like an ancient Queen of the Witches. We dressed
the mek in a red caftan, put a gay-coloured red handkerchief round
his head, and hung glass beads on him. Another cow was brought to
us, but they wanted an enormous quantity of sug-sug for it, (these
trinkets are generally held in little value here, because the Gelabis
frequent these regions,) and still more for goats and sheep.
Thermometer, sunrise, 18°; noon and subsequently, 28° to 30°.
16th March.—Man is not appalled in the midst of danger itself,—if
it were so, he would be lost; but the frail human heart throbs
afterwards. Yesterday evening I left the vessel, in company with
Thibaut, to get at a swarm of finches, which birds are said to give a
delicious flavour to a pillau, of which we wanted to be joint
partakers. We were soon obliged to separate, in order to salute the
birds on both sides of their settlement. In my excursion, however, on
the shore, I came all of a sudden within a few steps of a lion,
without having the least distant idea that this fearful enemy could be
in the neighbourhood of all our vessels, and I had only my double-
barrel, which was loaded merely with small shot; whilst my
huntsman Sale was pursuing a gazelle, at a long distance off.
Possibly our firing had awakened this supreme chief from his sleep,
for otherwise I must have seen him before, although my eye was
directed to a brace of birds at the left; because the underwood could
not have concealed an object of such size, as it only reached up to
the knee, and was merely interspersed here and there with a higher
bush. I was just taking aim slowly and almost irresolutely at the two
beautiful birds, who were looking at me with surprise and
confidence, contrary to the custom of the cunning finches, when the
lion stood before me on the right, as if he had sprung from the
earth. He was so close to me that he appeared to stand as high as
up to my breast, but yet I stood, my poor weak weapon in my hand,
holding it close to my side, with perfect presence of mind, so as to
keep my face free, and to wait for the attack; I was firm, and he
seemed also to be resolute.
At first we stared at each other mutually; he measured me from
top to toe, but disregarded the Turkish accoutrements and sun-burnt
countenance, for my red cap which he seemed not to despise. I, on
my side, recognized in him the dreaded king of beasts, although he
wore no mane, according to his usual custom, but I did not appeal
to his magnanimity. At last he turned his face from me, and went
away slowly with a dreadfully pliable movement of his hinder parts,
and his tail hanging down, but could not restrain himself from
turning round to look at me once more, while I was trusting to the
effect of one or two shots in the eyes or jaws, if it came to a contest
of life or death; and really I remained standing immovable, with too
much of the lion in me to tremble, and to bring certain destruction
on my head by untimely flight. However, away he went, looking
round several times, but not stopping, as if he feared pursuit, and I
turned my back to him equally slowly, without even calling out a
farewell; but I cast a searching look over my shoulders every now
and then, right and left, expecting that he might make a spring like
a cat, and I kept him in sight before me, when I was about to jump
down from the shore on to the sand where the vessels and crew
were. I confess openly that I now felt an evident throbbing of the
heart, and that my nose seemed to have turned white. Taken
unawares as I had been by the lion, the distance of five paces,
according to the measurement I made, was nearly too close for me:
on his side it was only necessary for him to have smelt me, which
probably I should not have allowed. I stood a moment on the margin
of the shore, in order that I might tranquilly summon Suliman Kashef
to the pursuit of the beast, without betraying any pallor of
countenance, and then I jumped down on the sand. When I swore
by the prophets to Suliman Kashef that my account was true, he was
ready immediately with his sharpshooters. At my advice we formed a
line of riflemen above, though I could not obtain a couple of bullets
for my gun; but the Turks soon crawled together again, except a tall
black slave of Suliman’s, who was at the right wing. When the latter
soon afterwards pointed and made signs that the lion was near at
hand, his master motioned with his hand and gun that he would
shoot him if he did not join us, for he held himself as lost, being left
quite alone. We set off at a slight trot, because the lion continued
his walk, until at last Suliman, as it began to get dark, ordered three
of his boldest warriors to go in advance. Three shots were fired, but
the men came back, and described the lion as a real monster. I was
actually glad that the magnanimous beast, according to all
probability, was not even wounded. They called me again an “Agù el
bennaht,” because I accompanied the expedition to see my lion a
second time, and they expressed themselves rejoiced that God had
preserved me, and wished me happiness, with pious phrases from
the Koràn.
To-day we sailed at half-past six o’clock from the place to S.E.
and S.E. by S.; at seven o’clock, E. by S., a village on the high shore
at the right.
We saw yesterday, from our landing-place, four villages, lying
together on the right and left shore, which the Dinkas have taken
into their possession. At half-past seven o’clock, after we had sailed
only slowly (two miles), owing to the wind being partly adverse, we
proceeded to E.S.E. and S.E. by E. The strong breeze caught the
sails, and we make seven miles clear of deduction: unfortunately,
the tract will not be long. A quarter before eight we stop before the
corner, where a winding to the left commences, in order to go
libàhn, because the vessels ahead do it. Some huntsmen’s huts, with
their inhabitants, stand on the right shore, and I procure, on this
occasion, a horn of the Tete species of antelope. We proceed,
sailing, to S.E. by E., and E.S.E., and halt a quarter after eight. Again
at S.E. by E., to go libàhn round the left. Unfortunately, the wind has
torn the sail, which I had feared for a long time would be the case;
for it was ripped up in several places, and the Tailor Capitan did not
trouble himself about it. “Allàh kerim!” A large village at some
distance above. At a quarter before one, we go libàhn to S.E. by E.;
then E.S.E. and E. by S. On the right shore a village with Dinka
tokuls and sleeping-places. It is not yet, however, decided whether
the Dinkas dwell there, although the style of architecture of the
tokuls, their grooved and arched roofs, without eaves, seem rather
to denote that they belong to this tribe than to that of the Nuèhrs.
The wind is very strong, and the crew are obliged to tow with all
their might; but the river winds now to the right, and we can,
perhaps, sail. A quarter before two. From E. by N., slowly in the
bend to the right: a village on the right shore, in the bend to the
left, exactly like that on the left side. Half-past two, E. by S. We
cannot see anything of the village here, owing to the high shore;
and the blacks, who stood shortly before in large numbers on the
shore, have fled because they saw the Turkish countenances of
Suliman Kashef’s halberdiers. The Turk is pleased at such fear, which
is associated with hatred and contempt on the part of the negroes. A
quarter before three; S. by W. The wind makes the men at the rope
run; but we are not able to sail, because the river winds immediately
to the left. We have a low sand-island at our right. Our men will let
nothing lie by the huntsmen’s huts: tortoise-shells (water-tortoises),
vessels,—such as gadda, burma, gara—everything is carried off; for
the blacks have imbibed the Turkish notion of “Abit,” and are now
askari (soldiers), who pretend to know nothing of their countrymen.
Three o’clock. To the left in S. and S.S.E.; then again to the right.
Half-past three. We sail a little S. by W. and S. by E.; a village on the
left. The Dinkas appear to mix everything called corn to make bread;
such as durra, lubiën of different species, gourd or melon stones,
&c., of which I have a specimen; and also lotus seeds, found here in
great quantities, and therefore denoting that there are several lakes
in the interior, and the small rice I have mentioned previously. A
large hippopotamus shewed himself on the flat left shore: he was
afraid of the vessels and the shouting of the crew, and trotted in a
semicircle, like an immense wild boar, in order to plunge into the
water with a greater roar. Four o’clock. To the left E.S.E. Five o’clock.
From E. further to the left.
The crawling along these cheerless shores, notwithstanding the
shouting, jokes, teasing, and stumbling on board the vessels from
side to side, and sometimes into the water, and the huzzaing when
that takes place—notwithstanding all the various kinds of occupation
and non-occupation which may amuse us for a short time—is
exceedingly wearisome; and it is well for me if I retain my senses to
sketch here and there an idea, which may be followed out or
rejected by those whose attainments are higher, and who have the
advantage of an enlightened circle, where opinions and views can be
expressed and discussed. Such a circle, however, cannot be found in
Bellet Sudan, or on board my vessel. We halt a little after six o’clock
in E.N.E., at the right shore. Thermometer, sunrise, 18°; noon, 27°
and 28°; sunset, 27°.
17th March.—We had a great antelope-hunt yesterday evening.
Amongst others, there was an Ariel with twenty-five rings on its
horns, and a Tete, and three female Tilli. The latter, also a species of
antelopes, are of lighter colour than the Ariels, and almost white,
whilst the Tete has a dark-brown coat with white breast and belly.
The female Tilli are distinguished by having long tails, but the males
are said to be bare behind. I was not able to leave the vessel
sufficiently early to see a herd of more than a thousand antelopes
that were going to the watering-place. My huntsman, also, who had
struck into another road, saw some hundred together; all the others
agreed that there were these thousand which I have mentioned. But
they soon dexterously divided to the right and the left on the
immeasurable level of this land, where there was merely low grass,
wild bamie and a quantity of basil, which latter was also met with on
all sides in the countries further up; and Suliman Kashef only shot
four, and my Sale not a single one. I myself could only see some
antelopes on the horizon, because it was already getting dusk, and I
stopped with Sabatier close to the vessels, in case some beast
should be scattered from the herd, but in vain. On this occasion,
also, I saw two lions at a distance.
At night the wind blew in coldly at the door and windows, and
even this morning the north-east wind was cool. At half-past six we
proceed E.N.E., and in a bend further to the right E. and E. by S.,
where we make a stronger evolution to the right. Eight o’clock.
Libàhn from S.E. by S. to S. We glide over shallows apparently
consisting of rubble-stone; the wind becomes strong and tosses the
waves. A quarter before nine, S.E. by S. to S., then still more to the
left, where we are soon thrown by the wind on the left shore, and
stop in E.S.E. Thibaut is with me, and they are calling for him; his
ship is full of water, and all the crew are summoned there: it is
fortunate that we are near land. Selim Capitan neglected to have the
vessels caulked at Khartùm, or to order at least gotrahm (instead of
tar) to be applied to the parts which we had stopped up with some
oakum.
At five minutes’ distance above, a large village deserted by
people; we are magnanimous enough on our side to keep the crew
from plundering it. It is slightly elevated: the same is also the case
with the shore, so that shallow lakes are formed right and left, at
present dry, and having vents to the water, which apparently are
kept open by human hands for the sustentation of the soil,—on
which, however, nothing is seen. A number of snail-shells are lying
together on the surface just as I have seen in other places, and it
seems that snails are eaten. We remain here on account of the
accident to Thibaut’s vessel, but the shores, à talus, do not allow us
to bring it on the dry land. Thermometer 17° and 24°.
CHAPTER X.
VARIOUS SPECIES OF GRASSES. — FORMATION OF THE SHORES. —
WATERFOWLS. — AN ANTELOPE OF THE TETE SPECIES, NOW AT BERLIN. —
STRATA OF THE SHORE. — THE SOBÀT RIVER. THE MAIN ROAD FOR THE
NATIVES FROM THE HIGHLANDS TO THE PLAINS. — OBSERVATIONS ON THE
COURSE OF THE NILE AND SOBÀT. — A THOUSAND ANTELOPES SEEN MOVING
TOGETHER! — WILD BUFFALOES, LIONS, AND HYÆNAS. — AFRICA, THE
CRADLE OF THE NEGRO RACE. — THE SHUDDER-EL-FAS: DESCRIPTION OF
THIS SHRUB. — ARNAUD’S CHARLATANRY. — OUR AUTHOR FEARED BY THE
FRENCHMEN. — ARNAUD AND SABATIER’S JOURNALS: THE MARVELLOUS
STORIES OF THE FORMER. — THIBAUT’S JEALOUSY. — VISIT OF A SHEIKH OF
THE SHILLUKS. — FEAR OF THE TURKS AT THESE PEOPLE. — SULIMAN
KASHEF PURSUED BY A LION.

18th March.—We sail at a quarter before seven o’clock with a cold
north-east wind S.E., and then S. by E. and S. The wind, however,
becomes too powerful; twice are we driven on oyster-beds—that is,
on those thorn muscles, as if over stones, and have reefed sails to
prevent the ships from going to pieces, their condition being so bad.
A quarter after seven. From S. by E. to the left. Visible sand-banks in
the curve seem to block up our road, but we managed to pass by
them on the right, with the assistance of the sails, close to the left
shore towards S.E., and away in the bend to E. by S. Half-past
seven. S.E. by E., and then a quarter before eight right round; six
huntsmen’s tokuls being near a sand-bank on the lower shore of the
projection of the left side of the river. We halt on the left by a
shallow island clothed with low verdure towards S.S.W., and intend
to stop here to-day and to-morrow to make observations, and the
most needful repairs to the vessels and sails.
Suliman Kashef shot yesterday evening, at a gazelle-hunt, a large
antelope, called by the Arabs Tete, in the foreleg, shattering it to
pieces; the animal fell twice, but made off at last on three legs.
Sabatier and I had chosen the left wing, and concealed ourselves
with one of my servants in the high grass: the former fired and
missed. The cracked earth displayed a magnificent soil: the grass,
standing thickly in tufts, reached up to our breast, and was a great
obstacle at the beginning of our rapid march. It was still green at the
bottom, and the present desiccation of the ground, on which we
remarked everywhere the traces of footsteps of wild beasts, and
their dung, might therefore have only taken place a short time before. This
grass, narrow and three-edged, with cylindrical spikes, formed the
principal produce of the soil. Less common was the grass similar to
our species with flat two-edged leaves; it had knotty stalks like the
three-edged, but a couple of spikes grew together on each blade (I
have remarked this previously), which unfortunately were not yet at
maturity, and therefore very small. The third species of grass
consisted of slender reeds, cropped and sprouting anew, or trod on
the ground. I perceived, also, some bamie growing wild, and birds’
nests of grass hanging on it.
I had lost sight of my comrade; and although at the
commencement of my excursion I had seen the vessels sailing up
the river at my side, it soon became dark. Suliman Kashef, however,
had the sagacity on his return to the vessels, to order the reeds to
be set on fire as signals, so that luckily I found my way back, though
sinking every now and then up to the knees into the deep foot-prints
of hippopotami close to the river,—a further proof that the shores,
being only slightly elevated, form shallow lakes here at the rainy
season, which are not dried up so soon.
I had taken a short walk previously on the left shore. The very
same appearances of water remaining behind were visible, and I
found mussels on the dry ground, amongst which were the Erethria
ovata. Long traces of little deposits of earth, which, on closer
examination, I discovered to be dams against the high water on the
shore itself; and the alluvial reeds in conjunction with the mussels,
make me conjecture that the Sobàt ascends over its shores here, as
in many other places. Behind these low deposits lay an unlimited
stubble-field on the other side of the village which lies on a gently
ascending hill, elevated perhaps by the remains of clay walls, and
stretching far beyond the horizon. The better kind of tokuls have
frequently a roof, but the eaves only project inconsiderably: the
smaller ones have a round form of roof, low sleeping places and
reed-hedges being between them. Dinkas are said to dwell there;
but not a person, not a living creature, is to be seen. Thermometer,
sunrise, 17°; noon, 28°; sunset, 26°.
19th March.—We all dine together in the open air, after an
antelope-hunt. The island on which we are, is, properly speaking, a
large broad sand-bank, about a quarter of an hour long: its
somewhat elevated back is covered with verdure, and is connected
with the shore on the right at low water-mark. Purslane (Arabic Rigli)
is found very commonly upon it; we see also numerous birds fishing
in the many tongue-shaped segments of the upper part, and, in fact,
sharing among themselves the narrow lake on the high right shore,
close to which is a village, from whence the people have likewise
fled. These feathered occupants seem to remove very seldom from
this happy place. The antelopes presented themselves in great
numbers; but Suliman Kashef’s body-guard, though generally good
shots, did not know the huntsman’s custom of dividing and forming
a chain, so as to catch the herd in the middle. I had no inclination,
either, to join in such a surrounding of the game; for these Turks fire
as if they were shooting in files, and their guns carry far, and are
always recommended to the care of the supreme Allàh.
20th March.—Departure at a quarter after nine o’clock, with a
favourable north-east wind, without sails, S.S.W. and S.W. by W.,
where, on the right, behind the high shore, a village lies in the bend
to the left, and below it a broad sand-bank, on which some long-
legged water-fowls are wandering about. We leave at the right side
another sand-bank exactly similar to the former, throwing its
shallows far beyond the middle of the river, and halt, S.S.E. at the
right shore at half-past ten o’clock. Suliman Kashef’s halberdiers
bring eight antelopes, one of which I procure, being the largest of
the Tete species. This specimen is now in the Zoological Museum at
Berlin as a nova species.
The shores have widened here, and fall off in an angle of 45° to
50°: though they appear on this account lower, yet it is plainly visible
by the steeper places, that they always become higher. It is only
below in the places where the river beats against, that the bluish
clay is seen: the remaining part of the shores has, apparently,
merely constituents of the same, as is the case in most places where
the high water has not washed away the crust of humus crumbling
from above and covering the base of the surface; for the original soil
discloses itself immediately under the covering of earth, as is seen in
precipices, and clefts in rocks caused by water. The river has also
thrown or deposited thick layers on the shores. We must not be
deceived here by observing various strata of earth mixed above and
below with sand; this is a later alluvial deposit.
A pure layer of clay is never to be seen, however, in these tracts
of strata, so far as I have remarked here and on the Nile. If it does
appear, it lies either as the foundation of the whole, below on the
banks of the water, as on the Nile; for all the ground there is alluvial
and earthy deposits, gained when the high water is drawn off; or it
rises, as in the Sobàt, with the talus of the shores to the surface,
which is covered with a crust of humus. The Sobàt dug a bed for
itself in firm clay-ground that resisted the water, and remained
tolerably constant in the trench opened by it, without having altered
its course, for no gohrs are seen on dry ground; but perhaps, in
some places, it has flowed over its bed, and formed channels. On
the contrary, the White Stream wallowed for a long time in the deep
slime of an emptied lake, before it threw up solid dams, on which
there are marshy forests, as on the old shores. This long valley-basin
lies also on a layer of clay.
The Sobàt may be considered as a further boundary of the
peninsula of Sennaar, and have given to the latter the name of
Gesira. Certainly it has been, like the Blue Nile, a main-road for the
tribes of the highlands of Ethiopia to the valleys of these countries;
and this must have been especially the case because it has no
accompanying marsh-lakes. Such nations could not have wound
down from the mountains of Bari and the highlands there, by reason
of the many marshes; for we are not to suppose that nomadic tribes
can provide themselves and families with a stock of provisions for a
long journey, or stow entire herds in their hewn-out trunks of trees
(canoes); and it is impossible that the cattle could have been driven
along the shore for their use.
The further I ascend the Sobàt, the plainer I perceive why the
right shore just behind Khartùm appears higher than the left, and
why I could not get rid of the idea that this oblique inclination of the
land was in opposition to the course and the mouth of the Nile, but
still might be explained. The deposit of particles of earth and sand
can only come from above, and will always try to level and equalise
the tracts of land which the Nile covers with showers of rain, brooks,
and rivulets. It is clear that the surface is elevated by that means,
and that, where these washed-away and liquid particles of earth
reach a stream like the White Nile, they are carried down by it,
without the other shore (the left side of the Nile here) deriving
naturally any advantages therefrom.
The high mountain chain of Fàzogl and Habesh mixed, as I
conjecture, its collective waters, owing to a breach in its partition-
walls, and their slime and morasses, and perhaps entire hills of
decayed and corrupted matter connected therewith, filled depths in
the lower valley on the side of the Nile up to the Delta—its most
famous memorial,—and levelled the mountains of the
neighbourhood, when Bertat, Dinka, and the country between the
Sobàt and Bari rivers might have shot up in indomitable strength like
artesian wells. Such catastrophes roll mountains and masses like a
brook does its little pebbles, and throw up the water released from
confinement in the cavities of heights which attract and collect it. A
flood of liquid earth rolled then far and wide from the mountains
without order and with numerous arms, but conformably to nature,
the heavy particles sank. The water itself washed away, smoothed
and levelled the ground. Therefore now we perceive those
immeasurable plains on the Sobàt, whereon beasts cannot hide
themselves, and which would be without shelter in the rainy season,
if there were not mountains and forests in the neighbourhood.
Though it be mathematically proved that the great Nile runs in a
channel as upon an ass’s back, yet we find just the contrary in the
White Nile; but the Sobàt even displays that phenomenon, although
not at this moment, for its shores are emptied, except in the
lowermost grade. They lie and stretch higher than the adjacent land,
being heaped up by the waves of the river; they are, however,
generally narrow dams, only appearing wide in the places where
there are shallow lakes behind in distant connection, or overgrown
gohrs, the grass border of which more easily withstands that deep
washing away than these immeasurable plains, which might be
called beautiful from their splendid soil, if Ceres waved her golden
ears, and Pomona offered shade and fruit. They shew, indeed, but
little declination to the Nile, for which the Sobàt itself affords the
best standard, being stagnant, and its shores only increasing in
height here and there. The shores become higher, as on the great
Nile itself; the less precipitous ones (although this is only local) are
deceptive, as I have remarked several feet difference on the disrupt
shore, and still more on the return voyage. I cannot divest myself of
the idea that a lake has stood here also, or it may be that the
surface of the earth from the region above to this, has been laid flat
by the inundation, similar to the level fields of Egypt.
There is an incredible number of deer on the shores of the Sobàt,
for I can add from my own conviction, so far as my eyes and ears do
not deceive me, that I saw herds of antelopes at least a thousand
strong—the Turks say from three to four thousand. About evening
they shew themselves in immense lines on the bare horizon of the
steppe, stand still, and approach—their tread sounds, in truth, like
the evolutions of distant cavalry; at last, as soon as it is dark, they
separate in the little bushes on the margin of the shore, to descend
to the water. Hitherto I have not been able to seize this opportunity,
because no one would remain with me on account of the lions and
other savage beasts prowling about here, and it did not seem to me
exactly safe, by reason of my close acquaintance with the lion and
his just revenge, to lie alone behind a bush, and shoot some of the
animals at a few paces off. My cook, however, has promised to
