Datamining With Big Data - Siva
Big Data concerns large-volume, complex, growing data sets with multiple,
autonomous sources. With the fast development of networking, data storage,
and data collection capacity, Big Data is now rapidly expanding in all
science and engineering domains, including the physical, biological and biomedical
sciences. This paper presents a HACE theorem that characterizes the features of
the Big Data revolution, and proposes a Big Data processing model from the
data mining perspective. This data-driven model involves demand-driven
aggregation of information sources, mining and analysis, user interest
modelling, and security and privacy considerations. We analyse the challenging
issues in the data-driven model and in the Big Data revolution.
1. INTRODUCTION
Introduction Data Mining
LITERATURE SURVEY
2. SYSTEM STUDY
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase and a business
proposal is put forth with a very general plan for the project and some cost
estimates. During system analysis, the feasibility study of the proposed system is
carried out. This is to ensure that the proposed system is not a burden to
the company.
The user must be trained to use the system efficiently. The user must not feel
threatened by the system; instead, the user must accept it as a necessity. The
level of acceptance by the users solely depends on the methods that are
employed to educate the user about the system and to make the user familiar
with it. The user's level of confidence must be raised so that he is also able
to make some constructive criticism, which is welcomed, as he is the final
user of the system.
3. SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
System           :
Hard Disk        : 40 GB
Floppy Drive     : 1.44 MB
Monitor          : 15" VGA Colour
Mouse            : Logitech
RAM              : 512 MB

SOFTWARE REQUIREMENTS:
Operating System : Windows XP/7
Coding Language  : JAVA/J2EE
IDE              : NetBeans 7.4
Database         : MySQL
4. Software Environment
The Java Programming Language
The Java programming language is a high-level language that can be
characterized by all of the following buzzwords:
Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
With most programming languages, you either compile or interpret a program
so that you can run it on your computer. The Java programming language is
unusual in that a program is both compiled and interpreted. With the compiler,
you first translate a program into an intermediate language called Java byte
codes, the platform-independent codes interpreted by the interpreter on the
Java platform. The interpreter parses and runs each Java byte code instruction
on the computer. Compilation happens just once; interpretation occurs each time
the program is executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the
Java Virtual Machine (Java VM). Every Java interpreter, whether it is a
development tool or a Web browser that can run applets, is an implementation
of the Java VM. Java byte codes help make "write once, run anywhere"
possible. You can compile your program into byte codes on any platform that
has a Java compiler. The byte codes can then be run on any implementation of
the Java VM. That means that as long as a computer has a Java VM, the same
program written in the Java programming language can run on Windows 2000,
a Solaris workstation, or an iMac.
Native code is code that, after compilation, runs on a specific hardware
platform. As a platform-independent environment, the Java platform can be a
bit slower than native code. However, smart compilers, well-tuned interpreters,
and just-in-time byte code compilers can bring performance close to that of
native code without threatening portability.
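As a minimal illustration of this compile-once, interpret-anywhere cycle (the class name is an arbitrary choice for this example), the following program is compiled with javac and then run by any Java VM:

// HelloBigData.java - a minimal example of the compile/interpret cycle.
// Compile once:  javac HelloBigData.java   (produces HelloBigData.class byte codes)
// Run anywhere:  java HelloBigData         (the Java VM interprets the byte codes)
public class HelloBigData {
    public static void main(String[] args) {
        System.out.println("Compiled once, interpreted on any Java VM.");
    }
}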
What Can Java Technology Do?
Security: Both low level and high level, including electronic signatures, public
and private key management, access control, and certificates.
Software components: Known as JavaBeans(TM), these can plug into existing
component architectures.
Object serialization: Allows lightweight persistence and communication via
Remote Method Invocation (RMI).
We can't promise you fame, fortune, or even a job if you learn the Java
programming language. Still, it is likely to make your programs better and
require less effort than other languages. We believe that Java technology will
help you do the following:
Get started quickly: Although the Java programming language is a powerful
object-oriented language, it is easy to learn, especially for programmers
already familiar with C or C++.
Each version of the ODBC administration tool maintains a separate list of
ODBC data sources. The source code of the application does not change whether
it talks to Oracle or SQL Server; we only mention these two as an example.
There are ODBC drivers available for several
dozen popular database systems. Even Excel spreadsheets and plain text files
can be turned into data sources. The operating system uses the Registry
information written by ODBC Administrator to determine which low-level
ODBC drivers are needed to talk to the data source (such as the interface to
Oracle or SQL Server). The loading of the ODBC drivers is transparent to the
ODBC application program. In a client/server environment, the ODBC API
even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably
thinking there must be some catch. The only disadvantage of ODBC is that it
is not as efficient as talking directly to the native database interface. ODBC
has had many detractors who charge that it is too slow. Microsoft has always
claimed that the critical factor in performance is the quality of the driver
software that is used. In our humble opinion, this is true. The availability of
good ODBC drivers has improved a great deal recently. And anyway, the
criticism about performance is somewhat analogous to those who said that
compilers would never match the speed of pure assembly language. Maybe not,
but the compiler (or ODBC) gives you the opportunity to write cleaner
programs, which means you finish sooner. Meanwhile, computers get faster
every year.
JDBC
In an effort to set an independent database standard API for Java, Sun
Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a
generic SQL database access mechanism that provides a consistent interface to a
variety of RDBMSs. This consistent interface is achieved through the use of
"plug-in" database connectivity modules, or drivers.
JDBC Goals
Few software packages are designed without goals in mind. JDBC is one
that, because of its many goals, drove the development of the API. These goals,
in conjunction with early reviewer feedback, have finalized the JDBC class
library into a solid framework for building database applications in Java.
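For illustration only, a typical use of the JDBC API might look like the sketch below. The connection URL, credentials, table, and column names are assumptions made for this example rather than details taken from the project, and a suitable JDBC driver (for example MySQL Connector/J) is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical MySQL connection URL, user and password.
        String url = "jdbc:mysql://localhost:3306/bigdata";
        try (Connection con = DriverManager.getConnection(url, "root", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT tweet_count FROM tweets WHERE location_id = ?")) {
            ps.setInt(1, 1);                       // bind the query parameter
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("tweet_count"));
                }
            }
        }
    }
}

The same code works against any RDBMS for which a JDBC driver is available; only the connection URL changes.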
The goals that were set for JDBC are important. They will give you some
insight as to why certain classes and functionalities behave the way they do. The
eight design goals for JDBC are as follows:
1. SQL Level API
2. SQL Conformance
3. JDBC must be implementable on top of common database interfaces
4. Provide a Java interface that is consistent with the rest of the Java system
5. Keep it simple
6. Use strong, static typing wherever possible
7. Keep the common cases simple
8. Use multiple methods to express multiple functionality
Java is also unusual in that each Java program is both compiled and
interpreted. With a compiler, you translate a Java program into an
intermediate language called Java byte codes, the platform-independent
code that is then interpreted and run on the computer.
Compilation happens just once; interpretation occurs each time the
program is executed. The figure illustrates how this works.
(Figure: the Java program is translated by the compiler into byte codes, which the interpreter then executes as the running program.)
You can think of Java byte codes as the machine code instructions
for the Java Virtual Machine (Java VM). Every Java interpreter,
whether it is a Java development tool or a Web browser that can run
Java applets, is an implementation of the Java VM. The Java VM can
also be implemented in hardware.
Java byte codes help make "write once, run anywhere" possible.
You can compile your Java program into byte codes on any platform
that has a Java compiler. The byte codes can then be run on any
implementation of the Java VM. For example, the same Java program
can run on Windows NT, Solaris, and Macintosh.
Networking
TCP/IP stack
The TCP/IP stack is shorter than the OSI one.
IP datagrams
The IP layer provides a connectionless and unreliable delivery
system. It considers each datagram independently of the others. Any
association between datagrams must be supplied by the higher layers.
The IP layer supplies a checksum that includes its own header. The
header includes the source and destination addresses. The IP layer
handles routing through an Internet. It is also responsible for breaking
up large datagrams into smaller ones for transmission and reassembling
them at the other end.
UDP
UDP is also connectionless and unreliable. What it adds to IP is a
checksum for the contents of the datagram and port numbers. These are
used to give a client/server model.
TCP
TCP supplies logic to give a reliable connection-oriented protocol
above IP. It provides a virtual circuit that two processes can use to
communicate.
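As a hedged sketch of such a virtual circuit on the Java platform (the port number 5000 is an arbitrary illustrative choice, not a project value), a trivial server that echoes one line per connection could look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoServerSketch {
    public static void main(String[] args) throws Exception {
        // Listen on an illustrative port and echo one line back per connection.
        try (ServerSocket server = new ServerSocket(5000)) {
            while (true) {
                try (Socket client = server.accept();            // one virtual circuit per client
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println(in.readLine());                  // reliable, in-order delivery
                }
            }
        }
    }
}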
Internet addresses
In order to use a service, you must be able to find it. The Internet
uses an address scheme for machines so that they can be located. The
address is a 32 bit integer which gives the IP address. This encodes a
network ID and more addressing. The network ID falls into various
classes according to the size of the network address.
Network address
Class A uses 8 bits for the network address with 24 bits left over for
other addressing. Class B uses 16 bit network addressing. Class C uses
24 bit network addressing and class D uses all 32.
Subnet address
Internally, the UNIX network is divided into sub networks. Building
11 is currently on one sub network and uses 10-bit addressing, allowing
1024 different hosts.
Host address
8 bits are finally used for host addresses within our subnet. This
places a limit of 256 machines that can be on the subnet.
Total address
Port addresses
A service exists on a host, and is identified by its port. This is a 16
bit number. To send a message to a server, you send it to the port for
that service of the host that it is running on. This is not location
transparency! Certain of these ports are "well known".
Sockets
A socket is a data structure maintained by the system to handle
network connections. A socket is created using the call socket. It returns
an integer that is like a file descriptor. In fact, under Windows, this
handle can be used with the ReadFile and WriteFile functions.
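In the Java environment used for this project, the same facility is exposed through the java.net.Socket class. A minimal hedged sketch is shown here (the host and port are purely illustrative); the C-level declaration of the underlying system call follows after it.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class SocketSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a hypothetical server on the well-known HTTP port 80.
        try (Socket socket = new Socket("example.com", 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.print("GET / HTTP/1.0\r\n\r\n"); // send a minimal request
            out.flush();
            System.out.println(in.readLine());    // read the status line of the reply
        }
    }
}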
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

JFreeChart feature requests include an appropriate dataset interface (plus a
default implementation) and a new (to JFreeChart) feature for interactive time
series charts: a separate control that shows a small version of ALL the time
series data, with a sliding "view" rectangle that allows you to select the
subset of the time series data to display in the main chart.
3. Dashboards
A web application serves various types of markup language (HTML, XML, and so on) and dynamic content. It is
typically comprised of web components such as JavaServer Pages (JSP),
servlets and JavaBeans to modify and temporarily store data, interact with
databases and web services, and render content in response to client requests.
Because many of the tasks involved in web application development can be
repetitive or require a surplus of boilerplate code, web frameworks can be
applied to alleviate the overhead associated with common activities. For
example, many frameworks, such as JavaServer Faces, provide libraries for
templating pages and session management, and often promote code reuse.
What is Java EE?
Java EE (Enterprise Edition) is a widely used platform containing a set of
coordinated technologies that significantly reduce the cost and complexity of
developing, deploying, and managing multi-tier, server-centric applications.
Java EE builds upon the Java SE platform and provides a set of APIs
(application programming interfaces) for developing and running portable,
robust, scalable, reliable and secure server-side applications.
Some of the fundamental components of Java EE include:
Enterprise JavaBeans (EJB): a managed, server-side component
architecture used to encapsulate the business logic of an application. EJB
technology enables rapid and simplified development of distributed,
transactional, secure and portable applications based on Java technology.
Java Persistence API (JPA): a framework that allows developers to
manage data using object-relational mapping (ORM) in applications built
on the Java Platform.
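As a hedged illustration of JPA-style object-relational mapping (the entity and its fields are hypothetical and merely echo the tweet data handled elsewhere in this project; they are not actual project classes):

import javax.persistence.Entity;
import javax.persistence.Id;

// A hedged example of an ORM-mapped class: each instance corresponds to a row
// in a (hypothetical) TWEET table managed by the persistence provider.
@Entity
public class Tweet {
    @Id
    private Long id;            // primary key column
    private String location;    // mapped to a LOCATION column by default
    private int retweetCount;   // mapped to a RETWEETCOUNT column by default

    public Long getId() { return id; }
    public String getLocation() { return location; }
    public int getRetweetCount() { return retweetCount; }
}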
JavaScript and Ajax Development
JavaScript is an object-oriented scripting language primarily used in client-side
interfaces for web applications. Ajax (Asynchronous JavaScript and XML) is a
Web 2.0 technique that allows changes to occur in a web page without the need
to perform a page refresh. JavaScript toolkits can be leveraged to implement
Ajax-enabled components and functionality in web pages.
Web Server and Client
A web server is software that processes the client request and sends the
response back to the client. For example, Apache is one of the most widely used
web servers. A web server runs on some physical machine and listens to client
requests on a specific port.
A web client is software that helps in communicating with the server. Some of
the most widely used web clients are Firefox, Google Chrome, Safari, and so on.
When we request something from the server (through a URL), the web client takes
care of creating the request, sending it to the server, and then parsing the
server response and presenting it to the user.
HTML and HTTP
The web server and the web client are two separate pieces of software, so they
need a common language for communication. HTML, which stands for HyperText
Markup Language, is the common language between server and client.
The web server and client also need a common communication protocol; HTTP
(HyperText Transfer Protocol) is the communication protocol between server
and client. HTTP runs on top of the TCP/IP communication protocol.
Some of the important parts of an HTTP request are:
HTTP Method: the action to be performed, usually GET, POST, PUT, etc.
URL: the page to access.
Form Parameters: similar to arguments in a Java method, for example the user
and password details from the login page.
Sample HTTP Request:
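A representative request to a login page, followed by the first lines of a typical response, might look like the following (the host, path, form values, and content length are illustrative only):

POST /FirstServletProject/login HTTP/1.1
Host: localhost:8080
Content-Type: application/x-www-form-urlencoded

user=admin&password=secret

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 322

<html> ... </html>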
MIME Type or Content Type: if you look at the sample HTTP response header
above, it contains the tag Content-Type. It is also called the MIME type, and
the server sends it to the client to let the client know the kind of data being
sent. It helps the client render the data for the user. Some of the most
commonly used MIME types are text/html, text/xml, application/xml, etc.
Understanding URL
URL is an acronym for Uniform Resource Locator and is used to locate the
server and the resource. Every resource on the web has its own unique address.
Let us look at the parts of a URL with an example.
http://localhost:8080/FirstServletProject/jsps/hello.jsp
http:// This is the first part of the URL and gives the communication protocol
to be used in server-client communication.
localhost The unique address of the server; most of the time it is the hostname
of the server, which maps to a unique IP address. Sometimes multiple hostnames
point to the same IP address, and the web server virtual host takes care of
sending the request to the particular server instance.
8080 This is the port on which the server is listening; it is optional, and if
we do not provide it in the URL then the request goes to the default port of
the protocol. Port numbers 0 to 1023 are reserved for well-known services, for
example 80 for HTTP, 443 for HTTPS, 21 for FTP, etc.
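These parts can also be pulled apart programmatically; a small sketch using the standard java.net.URL class on the example URL above:

import java.net.URL;

public class UrlSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/FirstServletProject/jsps/hello.jsp");
        System.out.println(url.getProtocol()); // http
        System.out.println(url.getHost());     // localhost
        System.out.println(url.getPort());     // 8080 (-1 if no port was given)
        System.out.println(url.getPath());     // /FirstServletProject/jsps/hello.jsp
    }
}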
Web Container
Tomcat is a web container. When a request is made from the client to the web
server, the web server passes the request to the web container, and it is the
web container's job to find the correct resource to handle the request (a
servlet or JSP) and then use the response from that resource to generate the
response and provide it to the web server. The web server then sends the
response back to the client.
When the web container gets a request for a servlet, it creates two objects,
HttpServletRequest and HttpServletResponse. It then finds the correct servlet
based on the URL and creates a thread for the request. It then invokes the
servlet's service() method, and based on the HTTP method, service() invokes
the doGet() or doPost() method. The servlet methods generate the dynamic page
and write it to the response. Once the servlet thread is complete, the
container converts the response into an HTTP response and sends it back to
the client.
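A minimal servlet following this flow might look like the hedged sketch below; the class name and URL pattern are illustrative, and the standard javax.servlet API (Servlet 3.0 or later, as shipped with containers such as Tomcat) is assumed:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// The container maps requests for /hello to this servlet and calls doGet()
// through service() once the HttpServletRequest/Response objects are created.
@WebServlet("/hello")
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html"); // the MIME type sent to the client
        response.getWriter().println("<html><body>Hello from the container</body></html>");
    }
}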
Some of the important work done by the web container is:
Communication Support: the container provides an easy way of communication
between the web server and the servlets and JSPs. Because of the container, we
do not need to build a server socket to listen for requests from the web
server, parse the request, and generate a response. All these important and
complex tasks are done by the container, and all we need to focus on is the
business logic of our application.
Lifecycle and Resource Management: the container takes care of managing the
life cycle of a servlet. It takes care of loading the servlets into memory,
initializing them, invoking their methods, and destroying them. The container
also provides utilities like JNDI for resource pooling and management.
Deployment Descriptor
The web.xml file is the deployment descriptor of the web application and
contains configuration such as servlet declarations and URL mappings, welcome
files, and session settings.
MySQL:
MySQL, the most popular Open Source SQL database management system, is
developed, distributed, and supported by Oracle Corporation.
The MySQL Web site (http://www.mysql.com/) provides the latest information
about MySQL software.
The MySQL software is Open Source: anybody can download it from the Internet
and use it without paying anything. If you wish, you may study the source
code and change it to suit your needs. The MySQL software uses the GPL
(GNU General Public License), http://www.fsf.org/licenses/, to define
what you may and may not do with the software in different situations. If
you feel uncomfortable with the GPL or need to embed MySQL code into
a commercial application, you can buy a commercially licensed version
from Oracle. See the MySQL Licensing Overview for more information
(http://www.mysql.com/company/legal/licensing/).
The MySQL Database Server is very fast, reliable, scalable, and easy to
use.
If that is what you are looking for, you should give it a try. MySQL Server
can run comfortably on a desktop or laptop, alongside your other
applications, web servers, and so on, requiring little or no attention. If
you dedicate an entire machine to MySQL, you can adjust the settings to
take advantage of all the memory, CPU power, and I/O capacity
available. MySQL can also scale up to clusters of machines, networked
together.
You can find a performance comparison of MySQL Server with other
database managers on our benchmark page.
(Data flow diagram: the Manager Master verifies the Twitter account and processes the location info, tweet count, and hash tags.)
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized
general-purpose modeling language in the field of object-oriented software
engineering. The standard is managed, and was created by, the Object
Management Group.
The goal is for UML to become a common language for creating models
of object-oriented computer software. In its current form, UML is comprised of
two major components: a meta-model and a notation. In the future, some form
of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing, and documenting the artifacts of a software system,
as well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have
proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented
software and the software development process. The UML uses mostly
graphical notations to express the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling Language so
that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
processes.
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a
system by showing the system's classes, their attributes, operations (or
methods), and the relationships among the classes. It explains which class
contains information.
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of
interaction diagram that shows how processes operate with one another and in
what order. It is a construct of a Message Sequence Chart. Sequence diagrams
are sometimes called event diagrams, event scenarios, and timing diagrams.
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise
activities and actions with support for choice, iteration and concurrency. In the
Unified Modeling Language, activity diagrams can be used to describe the
business and operational step-by-step workflows of components in a system. An
activity diagram shows the overall flow of control.
5. SCREEN LAYOUT
6. SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of
trying to discover every conceivable fault or weakness in a work product. It
provides a way to check the functionality of components, subassemblies,
assemblies, and/or a finished product. It is the process of exercising software
with the intent of ensuring that the software system meets its requirements
and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing
requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It
is the testing of individual software units of the application; it is done
after the completion of an individual unit and before integration. This is
structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at the component level and test a
specific business process, application, and/or system configuration. Unit tests
ensure that each unique path of a business process performs accurately to the
documented specifications and contains clearly defined inputs and expected
results.
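As an illustration only (the project does not prescribe a particular test harness), a unit test written with JUnit for a small hypothetical helper might look like this:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TweetCounterTest {
    // Hypothetical unit under test: counts hashtags in one tweet.
    static int countHashTags(String tweet) {
        int count = 0;
        for (String word : tweet.split("\\s+")) {
            if (word.startsWith("#")) {
                count++;
            }
        }
        return count;
    }

    @Test
    public void countsHashTagsInATweet() {
        // Each unique path gets a clearly defined input and expected result.
        assertEquals(2, countHashTags("mining #bigdata with #hadoop"));
        assertEquals(0, countHashTags("no tags here"));
    }
}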
Integration testing
Integration tests are designed to test integrated software components to
determine whether they actually run as one program. Testing is event driven and
is more concerned with the basic outcome of screens or fields. Integration
tests demonstrate that although the components were individually satisfactory,
as shown by successful unit testing, the combination of components is correct
and consistent. Integration testing is specifically aimed at exposing the
problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements,
key functions, or special test cases. In addition, systematic coverage
pertaining to identifying business process flows, data fields, predefined
processes, and successive processes must be considered for testing. Before
functional testing is complete, additional tests are identified and the
effective value of current tests is determined.
System Test
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results.
An example of system testing is the configuration oriented system integration
test. System testing is based on process descriptions and flows, emphasizing
pre-driven process links and integration points.
White Box Testing
White box testing is testing in which the software tester has knowledge of the
inner workings, structure, and language of the software, or at least its
purpose. It is used to test areas that cannot be reached from a black box
level.
Black Box Testing
Black box testing is testing the software without any knowledge of the
inner workings, structure, or language of the module being tested. Black box
tests, like most other kinds of tests, must be written from a definitive source
document, such as a specification or requirements document. It is testing in
which the software under test is treated as a black box: you cannot see into
it. The test provides inputs and responds to outputs without considering how
the software works.
6.1 Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test
phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.
Test strategy and approach
Field testing will be performed manually and functional tests will be
written in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
Test Date                  : 09/09/2010
System Date, if applicable : 09/03/2015
Tester                     : Janardhan
Test Case Number           :
Type of Test               : Unit / Interface / Functionality / Performance / Acceptance
Results                    : Fail
Test Case Description      :
  INTRODUCTION: Getting the Twitter account. Gathering the requirements of the project, designing and testing. By installing Eclipse.
  ENVIRONMENTAL NEEDS: PC with minimum 20 GB hard disk and 1 GB RAM. Windows XP/2000, Oracle, Eclipse.
  TEST: LocationID and Retweet Count.
Requirement(s) to be tested :
Roles and Responsibilities  :
Set Up Procedures           :
Hardware                    :
Software                    :
Test Items and Features     :
Procedural Steps            : If the user enters the Location ID, it is redirected to another appropriate page so that we can confirm the test is accepted.
Expected Results of Case    : If the page is redirected, we can confirm that the result of this test case succeeded.
Test Date                  : 09/09/2010
System Date, if applicable : 09/02/2015
Tester                     : Janardhan
Test Case Number           : 2
Type of Test               : Unit / Interface / Functionality / Performance / Acceptance
Results                    : Fail
Test Case Description      : INTRODUCTION
Requirement(s) to be tested :
Roles and Responsibilities  :
Set Up Procedures           :
Hardware                    :
Software                    :
Test Items and Features     :
Procedural Steps            :
Expected Results of Case    :
CONCLUSION
Driven by real-world applications and key industrial stakeholders, and
initialized by national funding agencies, managing and mining Big Data have
proven to be a challenging yet very compelling task. While the term Big Data
literally concerns data volumes, our HACE theorem suggests that the key
characteristics of Big Data are 1) huge with heterogeneous and diverse data
sources, 2) autonomous with distributed and decentralized control, and 3)
complex and evolving in data and knowledge associations. Such combined
characteristics suggest that Big Data requires a "big mind" to consolidate
data for maximum values [27].
To explore Big Data, we have analyzed several challenges at the data, model,
and system levels. To support Big Data mining, high-performance computing
platforms are required, which impose systematic designs to unleash the full
power of the Big Data. At the data level, the autonomous information sources
and the variety of the data collection environments often result in data with
complicated conditions, such as missing/uncertain values. In other situations,
privacy concerns, noise, and errors can be introduced into the data, to produce
altered data copies. Developing a safe and sound information sharing protocol is
a major challenge. At the model level, the key challenge is to generate global
models by combining locally discovered patterns to form a unifying view. This
requires carefully designed algorithms to analyze model correlations between
distributed sites, and fuse decisions from multiple sources to gain the best model
out of the Big Data. At the system level, the essential challenge is that a Big
Data mining framework needs to consider complex relationships between
samples, models, and data sources, along with their evolving changes with time
and other possible factors. A system needs to be carefully designed so that
unstructured data can be linked through their complex relationships to form
useful patterns, and the growth of data volumes and item relationships should
help form legitimate patterns to predict the trend and future.
We regard Big Data as an emerging trend, and the need for Big Data mining is
arising in all science and engineering domains. With Big Data technologies, we
will hopefully be able to provide the most relevant and most accurate social
sensing feedback to better understand our society in real time. We can further
stimulate the participation of the public audiences in the data production
circle for societal and economical events.
BIBLIOGRAPHY
[1] R. Ahmed and G. Karypis, Algorithms for Mining the Evolution of
Conserved Relational States in Dynamic Networks, Knowledge and
Information Systems, vol. 33, no. 3, pp. 603-630, Dec. 2012.
[2] M.H. Alam, J.W. Ha, and S.K. Lee, Novel Approaches to Crawling
Important Pages Early, Knowledge and Information Systems, vol. 33, no. 3, pp
707-734, Dec. 2012.
[3] S. Aral and D. Walker, Identifying Influential and Susceptible Members of
Social Networks, Science, vol. 337, pp. 337-341, 2012.
[4] A. Machanavajjhala and J.P. Reiter, Big Privacy: Protecting Confidentiality
in Big Data, ACM Crossroads, vol. 19, no. 1, pp. 20-23, 2012.
[5] S. Banerjee and N. Agarwal, Analyzing Collective Behavior from Blogs
Using Swarm Intelligence, Knowledge and Information Systems, vol. 33, no.
[14] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K.
Olukotun, Map-Reduce for Machine Learning on Multicore, Proc. 20th Ann.
Conf. Neural Information Processing Systems (NIPS 06), pp. 281-288, 2006.
[15] G. Cormode and D. Srivastava, Anonymized Data: Generation, Models,
Usage, Proc. ACM SIGMOD Intl Conf. Management Data, pp. 1015-1018,
2009.
[16] S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson,
Ricardo: Integrating R and Hadoop, Proc. ACM SIGMOD Intl Conf.
Management Data (SIGMOD 10), pp. 987-998. 2010.
[17] P. Dewdney, P. Hall, R. Schilizzi, and J. Lazio, The Square Kilometre
Array, Proc. IEEE, vol. 97, no. 8, pp. 1482-1496, Aug. 2009.
[18] P. Domingos and G. Hulten, Mining High-Speed Data Streams, Proc.
Sixth ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (KDD
00), pp. 71-80, 2000.
[19] G. Duncan, Privacy by Design, Science, vol. 317, pp. 1178-1179, 2007.
[20] B. Efron, Missing Data, Imputation, and the Bootstrap, J. Am. Statistical
Assoc., vol. 89, no. 426, pp. 463-475, 1994.
[21] A. Ghoting and E. Pednault, Hadoop-ML: An Infrastructure for the Rapid
Implementation of Parallel Reusable Analytics, Proc. Large-Scale Machine
Learning: Parallelism and Massive Data Sets Workshop (NIPS 09), 2009.
[22] D. Gillick, A. Faria, and J. DeNero, MapReduce: Distributed Computing
http://www.nytimes.com/2008/11/12/technology/internet/12flu.html.
2008.
[24] D. Howe et al., Big Data: The Future of Biocuration, Nature, vol. 455,
pp. 47-50, Sept. 2008.
[25] B. Huberman, Sociology of Science: Big Data Deserve a Bigger
Audience, Nature, vol. 482, p. 308, 2012.
[26] IBM, What Is Big Data: Bring Big Data to the Enterprise, http://www-01.ibm.com/software/data/bigdata/, IBM, 2012.
[27] A. Jacobs, The Pathologies of Big Data, Comm. ACM, vol. 52, no. 8, pp.
36-44, 2009.
[28] I. Kopanas, N. Avouris, and S. Daskalaki, The Role of Domain
Knowledge in a Large Scale Data Mining Project, Proc. Second Hellenic Conf.
AI: Methods and Applications of Artificial Intelligence, I.P. Vlahavas, C.D.
Spyropoulos, eds., pp. 288-299, 2002.
[29] A. Labrinidis and H. Jagadish, Challenges and Opportunities with Big
Data, Proc. VLDB Endowment, vol. 5, no. 12, 2032-2033, 2012.
[30] Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, J. Cryptology,
vol. 15, no. 3, pp. 177-206, 2002.
[31] W. Liu and T. Wang, Online Active Multi-Field Learning for Efficient
Email Spam Filtering, Knowledge and Information Systems, vol. 33, no. 1, pp.
117-136, Oct. 2012.
[32] J. Lorch, B. Parno, J. Mickens, M. Raykova, and J. Schiffman, Shoroud:
Ensuring Private Access to Large-Scale Data in the Data Center, Proc. 11th
USENIX Conf. File and Storage Technologies (FAST 13), 2013.
[33] D. Luo, C. Ding, and H. Huang, Parallelization with Multiplicative
Algorithms for Big Data Mining, Proc. IEEE 12th Intl Conf. Data Mining, pp.
489-498, 2012.
[34] J. Mervis, U.S. Science Policy: Agencies Rally to Tackle Big Data,
Science, vol. 336, no. 6077, p. 22, 2012.
[35] F. Michel, How Many Photos Are Uploaded to Flickr Every Day and
Month? http://www.flickr.com/photos/franckmichel/6855169886/, 2012.
[36] T. Mitchell, Mining our Reality, Science, vol. 326, pp. 1644-1645, 2009.
[37] Nature Editorial, Community Cleverness Required, Nature, vol. 455, no.
7209, p. 1, Sept. 2008.
[38] S. Papadimitriou and J. Sun, Disco: Distributed Co-Clustering with MapReduce: A Case Study Towards Petabyte-Scale End-to-End Mining, Proc.
IEEE Eighth Intl Conf. Data Mining (ICDM 08), pp. 512-521, 2008.
[39] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis,
Evaluating MapReduce for Multi-Core and Multiprocessor Systems, Proc.
IEEE 13th Intl Symp. High Performance Computer Architecture (HPCA 07),
pp. 13-24, 2007.
[40] A. Rajaraman and J. Ullman, Mining of Massive Data Sets. Cambridge
Univ. Press, 2011.
[41] C. Reed, D. Thompson, W. Majid, and K. Wagstaff, Real Time Machine
Learning to Find Fast Transient Radio Anomalies: A Semi-Supervised Approach
Combining Detection and RFI Excision, Proc. Intl Astronomical Union Symp.
Time Domain Astronomy, Sept. 2011.
[42] E. Schadt, The Changing Privacy Landscape in the Era of Big Data,
Molecular Systems, vol. 8, article 612, 2012.
[43] J. Shafer, R. Agrawal, and M. Mehta, SPRINT: A Scalable Parallel
Classifier for Data Mining, Proc. 22nd VLDB Conf., 1996.
[44] A. da Silva, R. Chiky, and G. Hebrail, A Clustering Approach for
Sampling Data Streams in Sensor Networks, Knowledge and Information
Systems, vol. 32, no. 1, pp. 1-23, July 2012.
[45] K. Su, H. Huang, X. Wu, and S. Zhang, A Logical Framework for
Identifying Quality Knowledge from Different Data Sources, Decision Support
Systems, vol. 42, no. 3, pp. 1673-1683, 2006.
[46] Twitter Blog, Dispatch from the Denver Debate, http://blog.twitter.com/2012/10/dispatch-from-denver-debate.html, Oct. 2012.
[47] D. Wegener, M. Mock, D. Adranale, and S. Wrobel, Toolkit-Based High-
Rough Set Theory, Knowledge-Based Systems, vol. 43, pp. 82-94, 2013.
[56] J. Zhao, J. Wu, X. Feng, H. Xiong, and K. Xu, Information Propagation in
Online Social Networks: A Tie-Strength Perspective, Knowledge and
Information Systems, vol. 32, no. 3, pp. 589-608, Sept. 2012.
[57] X. Zhu, P. Zhang, X. Lin, and Y. Shi, Active Learning From Stream Data
Using Optimal Weight Classifier Ensemble, IEEE Trans. Systems, Man, and
Cybernetics, Part B, vol. 40, no. 6, pp. 1607- 1621, Dec. 2010.