Changing Focus On Interoperability in Information Systems From System Syntax Structure To Semantics
Amit P. Sheth
1. Introduction
Interoperability has been a basic requirement for the modern information systems
environment for over two decades. How have key requirements for interoperability
changed over that time? How can we understand the full scope of interoperability
issues? What has shaped research on information system interoperability? What key
progress has been made? This chapter provides some of the answers to these
questions. In particular, it looks at different levels of information system
interoperability, while reviewing the changing focus of interoperability research
themes, past achievements and new challenges in the emerging global information
infrastructure (GII). It divides the research into three generations and discusses some
of the achievements of the past. Finally, as we move from managing data to managing
information, and in the future knowledge, the need for achieving semantic
interoperability is discussed and key components of solutions are introduced.
Data and information interoperability has gained increasing attention for several
reasons, including:
• excellent progress in interconnection afforded by the Internet, Web and
distributed computing infrastructures, leading to easy access to a large number
of independently created and managed information sources of broad variety;
• increasing specialization of work, but increasing need to reuse and analyze data,
leading to creation of information and knowledge, and their subsequent reuse
and sharing.
1.1. Distribution
The scope of interoperability during the first generation was primarily departmental
and almost always within a company. Usually, the multidatabase systems involved
just a few databases and computer nodes, either connected point-to-point or in a local
area network. With the significant impact of the Internet and advent of the Web, the
scope of interoperability during the second generation has been enterprise-wide as
well as inter-enterprise. It was not unusual to find tens of computers and data
repositories involved in a second-generation system. In the third generation, with
significant improvements in communication technology, global information
infrastructure, and distributed computing infrastructure, the dimension of distribution
of data has achieved a very broad scope—from a single system to global. As the
distributed nature of data and information is often hidden from the end users, the
system developers face several new challenges. A few of the noteworthy challenges
involve increasing use of large amounts of data and information sources—
particularly involving visual data, use of a wide variety of communication modes
with a variety of bandwidths, and a larger optimization space involving varying
capabilities of the component systems. Compared to the first generation systems, the
issue of optimization has received less attention in the second generation.
1.2. Autonomy
source are often willing to let others share the data only if they retain control. Thus it
is important to understand the aspects of autonomy and how they can be addressed
when a database system participates in a federation or shares its data with new users
or applications.
Let us look at a classification of autonomy issues in the context of federated
database systems (adapted from Sheth and Larson 1990), noting that these can be
adapted to other architectures by considering the various types of information
sources and information system components involved. A component participating in
a federation may exhibit several types of autonomy, including design,
communication, association and execution.
Design autonomy refers to the ability of a component to choose its own design
with respect to any matter, including
• the data or information being managed (i.e., the Universe of Discourse or
domain),
• the representation (data model, query language) and the naming of the data
elements (or the ontology used),
• the conceptualization or semantic interpretation of the data (or the context),
• constraints used to manage the data,
• the functionality of the system,
• association and sharing with other systems (see association autonomy below),
and
• the implementation (e.g., record and file structures, concurrency control
algorithms).
Communication autonomy refers to the ability of a component to decide whether
to communicate with other components. A component with communication
autonomy is able to decide when and how it responds to a request from another
component. Execution autonomy refers to the ability of a component to execute local
operations without interference from an external entity and to decide the order in
which to execute external operations. Thus, an external system cannot enforce an
order of execution of the commands on a component with execution autonomy.
Execution autonomy implies that a component can abort any operation that does not
meet its local constraints and that its local operations are logically unaffected by its
participation in a federation. Furthermore, the component does not need to inform an
external or federated system of the order in which external operations are executed
and the order of an external operation with respect to local operations. Operationally,
a component exercises its execution autonomy by treating external operations in the
same way as local operations.
Association autonomy implies that a component has the ability to decide whether
and how much to share its functionality (i.e., the operations it supports) and
resources (i.e., the data it manages) with others. This includes the ability to associate
or disassociate itself from the federation and the ability of a component to participate
in one or more federations. Several first-generation systems in the database area paid
significant attention to the autonomy issue because they also attempted to support
1.3. Heterogeneity
During the second half of the 1970s, we saw the ability to deal with hardware,
operating systems, and communications heterogeneity; although with evolution in
each of these, new issues have to be continuously addressed. During the 1980s, we
saw significant progress in managing heterogeneity and supporting interoperability or
integration in environments with structured databases and traditional database
management systems (DBMSs). There is a large body of work during the first
generation in dealing with heterogeneity associated with data models or schematic
issues, DBMSs including query languages, concurrency control, commit and
recovery, etc.
During the 1990s, the emergence of distributed computing, middleware
technology, and standards has allowed us to increase focus on the heterogeneity that
is intrinsic to data (or media). This has particularly supported syntactic and structural
interoperability, and allowed us to address issues at the information level. As the
future information system increasingly addresses the information and knowledge
level issues, it will increasingly require semantic interoperability. Semantic
interoperability requires that the information system understands the semantics of the
user’s information request and those of information sources, and uses mediation or
information brokering to satisfy the information request as well as it can.
The remainder of this chapter provides an overview of the three generations of
systems, with emphasis on the heterogeneity dimension of support for different
levels of interoperability. Table 1 provides an overview of the three generations in
terms of a variety of criteria.
2. First generation
Figure 2 shows how some of the distribution, autonomy, and heterogeneity issues
for integrating component databases can be handled (see Sheth and Larson 1990 for
more details). Briefly, the local schemas can represent data in the data model of
respective DBMSs. To be able to compare the data objects modeled in different data
models, one has either to perform direct and pairwise comparison—something like
comparing apples and oranges—or to convert the schemas to a common or canonical
model, preferably with an expressive power exceeding that of models for component
databases, and then compare objects. Defining export schemas allows handling of
one aspect of autonomy. Integration of export schemas into federated schemas
allows for integrated or uniform access to objects managed by multiple component
databases. Defining external schemas allows for handling additional types of
heterogeneity.
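The schema layering described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the reference architecture of Sheth and Larson (1990): the two component sources, their field names, and the helper functions (to_canonical_personnel, to_canonical_directory, export, federate) are all hypothetical, and the canonical model is simply a list of dictionaries.

```python
# Hypothetical local data held by two autonomous component databases.
LOCAL_PERSONNEL = [  # tuple rows: (emp_no, full_name, salary_usd)
    (1, "Ada Park", 5200),
    (2, "Lee Chen", 4100),
]
LOCAL_DIRECTORY = [  # nested records using different names and structure
    {"id": 2, "person": {"name": "Lee Chen"}, "dept": "R&D"},
    {"id": 3, "person": {"name": "Sam Ortiz"}, "dept": "Sales"},
]


def to_canonical_personnel(rows):
    """Translate the local tuple representation into the canonical model."""
    return [{"emp_id": r[0], "name": r[1], "salary": r[2]} for r in rows]


def to_canonical_directory(records):
    """Translate the local nested representation into the canonical model."""
    return [
        {"emp_id": r["id"], "name": r["person"]["name"], "dept": r["dept"]}
        for r in records
    ]


def export(objects, visible_fields):
    """Export schema: the component decides which attributes it shares."""
    return [{k: v for k, v in o.items() if k in visible_fields} for o in objects]


def federate(*exports):
    """Federated schema: merge exported objects on the shared key emp_id."""
    merged = {}
    for exported in exports:
        for obj in exported:
            merged.setdefault(obj["emp_id"], {}).update(obj)
    return [merged[key] for key in sorted(merged)]


# The personnel component withholds salaries, exercising its autonomy.
personnel_export = export(to_canonical_personnel(LOCAL_PERSONNEL), {"emp_id", "name"})
directory_export = export(
    to_canonical_directory(LOCAL_DIRECTORY), {"emp_id", "name", "dept"}
)
federated = federate(personnel_export, directory_export)
```

In this sketch the translation functions stand in for the conversion to a canonical model, export stands in for an export schema, and federate stands in for schema integration; a real system would also have to resolve the naming and structural conflicts that this toy example sidesteps by construction.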
DBMS-level heterogeneity covers only a small set of heterogeneity related to
structured databases. Figure 3 shows one classification of a variety of conflicts related
to achieving interoperability among or integration of multiple databases managed by
traditional DBMSs (Kim et al. 1993; Sheth and Kashyap 1993).
3. Second generation
During the second generation two very important trends brought extraordinary
opportunities for interoperability and exploitation of data: (a) proliferation of a
variety of data, ranging from structured databases and semi-structured data to digital
media, including visual media (Gupta and Jain 1997), and (b) spread of the Internet and
emergence of the Web. Applications such as digital libraries (Paepcke et al. 1998)
and electronic commerce provided the context of interoperability.
Some of the key trends and achievements of this generation are (Papazoglou and
Schlageter 1998; Sheth and Klas 1998):
• technology for dealing with heterogeneity of systems, data, and representational
levels;
• support for a broader variety of data—not just structured databases, but also text,
semi-structured, and unstructured (including image and video) data;
• use of a broad variety of metadata to support interoperability and integration;
and
• use of knowledge representation and reasoning, especially for handling
terminological differences.
3.1. Metadata
Metadata are usually defined as data about data. Often they are more than that,
involving information about data as they are stored or managed, and revealing partial
semantics such as intended use (i.e., application) of data. This information can be of
broad variety, meeting if not surpassing the variety in the data themselves. Metadata
can be regarded as an extension (albeit a significant one) of the concept of the
schema in structured databases. They may describe, or be a summary of, the
information content of the individual databases in an intensional manner. They
typically represent constraints between the individual media objects that are implicit
and not necessarily represented in the databases themselves. Some metadata may
also capture content-independent information like location and time of creation.
Examples of what we consider media types are structured data (data in relational or
object-oriented databases), textual data (of different formats, such as Word files,
source code, etc.), images (of possibly different modalities such as X-Ray, MRI
scan), audio (of possibly different modalities such as monaural, stereophonic), and
video. Sheth and Klas (1998) give an extensive discussion on types of metadata and
their applications in managing and exploiting various digital media.
The criterion we use to classify metadata (Kashyap et al. 1995) is the extent to
which they are successful in capturing the (data and information) content of the
information asset (also called artifact or document in different contexts) represented
in various media types. The level of abstraction at which the content of the assets is
captured is very important. We believe that to capture the semantic content (i.e., at a
level of abstraction closer to that of humans), it is important for the metadata to
model application domain-specific information. Metadata descriptions present two
advantages:
1. They enable the abstraction of representational details such as the format and
organization of data, and capture the information content of the underlying data
independent of representational details. This represents the first step in reduction
of information overload, as intensional metadata descriptions are in general an
order of magnitude smaller than the underlying data.
2. They enable representation of domain knowledge describing the information
domain to which the underlying data belong. This knowledge may then be used
to make inferences about the underlying data. This helps in reducing information
overload as the inferences may be used to determine the relevance of the
underlying data without accessing the data.
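The second advantage, judging relevance from metadata alone, can be sketched as follows. This is a minimal illustration, assuming a made-up catalog and a made-up "broader term" table as the domain knowledge; the names Metadata, BROADER, expand, and relevant are hypothetical and do not come from any of the cited systems.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical domain knowledge: each term maps to its broader term.
BROADER = {"mri": "medical image", "x-ray": "medical image", "medical image": "image"}


@dataclass(frozen=True)
class Metadata:
    location: str     # content-independent: where the asset is stored
    modified: date    # content-independent: when it was last changed
    terms: frozenset  # domain-specific descriptors of the content


def expand(term):
    """Return the term together with all of its broader terms."""
    found = {term}
    while term in BROADER:
        term = BROADER[term]
        found.add(term)
    return found


def relevant(catalog, wanted, since):
    """Select asset locations using metadata only; the assets are never opened."""
    hits = []
    for md in catalog:
        described = set().union(*(expand(t) for t in md.terms))
        if wanted in described and md.modified >= since:
            hits.append(md.location)
    return hits


CATALOG = [
    Metadata("archive/a17", date(1998, 3, 1), frozenset({"mri"})),
    Metadata("archive/b02", date(1996, 5, 9), frozenset({"x-ray"})),
    Metadata("archive/c44", date(1998, 6, 2), frozenset({"audio"})),
]
hits = relevant(CATALOG, "medical image", date(1997, 1, 1))
```

Here the request for "medical image" matches the asset described only as "mri" because of the inference over broader terms, while the modification date filters out the stale asset; both decisions are made without touching the underlying data.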
One of several classifications of metadata (Kashyap et al. 1995) is as follows
(also see Boll et al. 1998; Lagoze et al. 1996):
Content-independent metadata: This type of metadata captures information that
does not depend on the content of the asset with which it is associated. Examples of
this type of metadata are location, modification date of a document and type of
sensor used to record a photographic image. There is no information content
captured by these metadata but these might still be useful for retrieval of assets from
their actual physical locations and for checking whether the information is current or
not.
Content-dependent metadata: This type of metadata depends on the content of the
asset it is associated with. Examples of content-dependent metadata are size of a
and even expert systems. However, the mediator architectures (Wiederhold 1992)
were clearly the dominant ones, involving wrappers for encapsulating heterogeneous
information sources to provide more uniform interface to the rest of the world, and
mediators to provide a broad variety of value-added services (Wiederhold 1997).
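The division of labor between wrappers and mediators can be sketched roughly as below. This is an illustrative toy under stated assumptions, not Wiederhold's architecture itself: the two sources, their formats and field names, and the classes CsvWrapper, JsonWrapper, and Mediator are hypothetical, and the only value-added services shown are filtering and duplicate elimination.

```python
import csv
import io
import json

# Two hypothetical sources with different formats and field names.
CSV_SOURCE = "title,year\nFederated Databases,1990\nMediators,1992\n"
JSON_SOURCE = (
    '[{"name": "Mediators", "published": 1992},'
    ' {"name": "Digital Libraries", "published": 1998}]'
)


class CsvWrapper:
    """Wrapper: hides the CSV format behind a uniform records() interface."""

    def __init__(self, text):
        self._text = text

    def records(self):
        reader = csv.DictReader(io.StringIO(self._text))
        return [{"title": row["title"], "year": int(row["year"])} for row in reader]


class JsonWrapper:
    """Wrapper: hides the JSON format and its different field names."""

    def __init__(self, text):
        self._text = text

    def records(self):
        return [
            {"title": item["name"], "year": item["published"]}
            for item in json.loads(self._text)
        ]


class Mediator:
    """Mediator: adds value on top of the wrappers (filtering, deduplication)."""

    def __init__(self, wrappers):
        self._wrappers = wrappers

    def published_since(self, year):
        merged = {}
        for wrapper in self._wrappers:
            for rec in wrapper.records():
                if rec["year"] >= year:
                    merged[(rec["title"], rec["year"])] = rec
        return sorted(merged.values(), key=lambda r: (r["year"], r["title"]))


mediator = Mediator([CsvWrapper(CSV_SOURCE), JsonWrapper(JSON_SOURCE)])
recent = mediator.published_since(1992)
```

The caller sees one interface and one vocabulary (title, year); adding a further source in this sketch would only require another wrapper that exposes the same records() method.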
and geographical and environmental (e.g., FGDC and UDK; Günther and Voisard
1998).
Among a large number of systems representing this generation, three classes of
systems stand out: systems focusing on information integration or uniform access to
heterogeneous repositories, systems providing more dynamic architecture or query
processing mechanisms to process a user service or information request on demand,
and systems that address domain-specific or semantic-level issues. A brief review of
these three classes of systems follows.
4. Third generation
One thing that has not changed for the third generation is that we are once again
faced with more distribution, more autonomy, and more heterogeneity among the
accessible information, information sources, and users. With the progress in global
interconnectivity, we now need to deal with more heterogeneous information
consisting not only of a broader variety of digital data, but also operations and
computations (such as simulations) that can create new data and information. The
scale of the problem has changed from a few databases to millions of information
resources, and the new resources are added independently to the accessible set of
resources, as other resources change rapidly or disappear. Currently favored
strategies that depend on keyword-based access or involve only representational or
structural components of data usually provide poor-quality results,
and their lack of precision leads to increasing information overload. We fully expect
increasing standardization and interoperability at system, syntactic, and structural
levels to address many issues—for example, see Paepcke et al. (1998) for relevant
work in the domain of digital libraries. However, the key challenges to be faced are
at the semantic level, where people will increasingly expect information
systems to help them not at the data level but at the information and, increasingly,
knowledge levels.
Even a casual user of the Web is aware of the rapid increase in the amount and
diversity of information available online. However, what is creating an even bigger
challenge is the increased expectations of the user in terms of understanding of the
context of the user’s information need, increasing availability of semantically rich
visual and new media, and a corresponding need to support semantic-level
interoperability. The problem of information overload has turned the challenge of
“So far (schematically) yet so near (semantically)” (Sheth and Kashyap 1993) faced
by the previous generations into “So near (syntactically and structurally) yet so far
(semantically)”.
Although there are several uses and interpretations of semantics in information
systems, our view is that future information systems will need to support a more
general notion that involves relating the content and representation of information
resources to entities and concepts in the real world (Beech 1997; Meersman 1997;
Sheth 1997). That is, the limited forms of operational and axiomatic semantics of a
particular representational or language framework are not sufficient (see Paepcke et
al. 1998 for a relevant discussion on syntax and some types of semantics). Semantic
interoperability will then support high-level (hence easier to use), context-sensitive
information requests over heterogeneous information resources, hiding system,
syntax, and structural heterogeneity. In essence, we need an approach that reduces
the problem of knowing the contents and structure of many information resources to
the problem of knowing the contents of easily-understood, domain-specific
ontologies, which a user familiar with the domain is likely to know or understand
easily.
Foundational research leading to building the third generation of information
systems has been carried out in several umbrella projects and initiatives, including
Knowledge Sharing Effort (http://www-ksl.stanford.edu/knowledge-sharing),
Intelligent Integration of Information (http://mole.dc.isx.com/I3), and the Digital
Library Initiative (http://www.cise.nsf.gov/iis/dli_home.html). Systems belonging to
the third class of the second generation have also made contributions that the third
generation systems can build on. Increasing standardization at different levels of
information systems architecture for the corresponding types of interoperability also
plays an important role. Some of the examples are as follows.
• System: IIOP for interactions between distributed objects and components,
KQML for interaction between agents;
• Syntactic: XML for all forms of Web-accessible data;
• Structural: RDF for general purpose description of information sources, various
object models for web-based information exchange (Manola 1998), MPEG-4 for
structural or object-level description of video, MHEG-5 for multimedia and
hypermedia, KIF for knowledge representation, OKBC for distributed
knowledge bases;
• Semantic: MPEG-7 (still in progress) with likely support for limited forms of
semantics with identification of context, objectives, requirements, and
applications.
We now focus our attention on a discussion of possible enablers of semantic
interoperability. In particular, we identify three enablers and capabilities:
Terminology (and language) transparency: This will allow a user to choose an
ontology of his or her choice (e.g., one based on LCC for querying bibliographic data
or FGDC for geospatial data), while allowing the information source to subscribe to
a related but different ontology (e.g., an ontology based on DDC or UDK,
respectively; the latter recognizes some overlap between geospatial data sets and
environmental data sets, and their respective modeling).
Context-sensitive information processing: The information system will recognize
or understand the context of an information need and use it to limit information
overload, both by formulating more precise queries used for searching information
sources and by filtering and transforming the information before presenting it to the
user.
Semantic correlation: This will allow the representation of semantically-related
information regardless of distribution and heterogeneity (including various forms of
media) by the user or the third party, and their use for obtaining all forms of relevant
information anywhere.
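The first two enablers can be sketched together in a few lines of Python. This is a toy illustration under stated assumptions: the user and source vocabularies, the term mapping, the sample records, and the function query are all hypothetical, and a simple one-to-one term table stands in for the much richer inter-ontology relationships a real system would need.

```python
# Hypothetical mapping from the user's vocabulary to one source's vocabulary.
USER_TO_SOURCE = {"river": "watercourse", "road": "highway"}

# Made-up records, described in the source's own terminology.
SOURCE_RECORDS = [
    {"kind": "watercourse", "name": "Oconee", "region": "Georgia"},
    {"kind": "highway", "name": "I-85", "region": "Georgia"},
    {"kind": "watercourse", "name": "Hudson", "region": "New York"},
]


def query(user_term, context):
    """Answer a request posed in the user's terms, narrowed by the user's context."""
    # Terminology transparency: rewrite the request into the source's vocabulary.
    source_term = USER_TO_SOURCE[user_term]
    source_to_user = {src: usr for usr, src in USER_TO_SOURCE.items()}
    results = []
    for rec in SOURCE_RECORDS:
        if rec["kind"] != source_term:
            continue
        # Context-sensitive processing: drop records outside the stated context.
        if any(rec.get(key) != value for key, value in context.items()):
            continue
        # Present the answer back in the user's own vocabulary.
        results.append({**rec, "kind": source_to_user[rec["kind"]]})
    return results


answer = query("river", {"region": "Georgia"})
```

The user never sees the term "watercourse", and the context (here, a region of interest) removes an otherwise matching record before it reaches the user, which is the sense in which context limits information overload.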
Three key components of a possible solution are metadata (especially domain-
specific and content-based metadata), contexts, and ontologies (Kashyap and Sheth
1998). We briefly discuss their role in developing semantic interoperability
solutions. One key aspect of the third generation (operation or process
interoperation) will not be discussed for brevity.
4.2. Context
metadata, and ontologies are created, administered, and enhanced independently; and
(2) mediator architectures (Wiederhold 1992) which involve decoupling information
creators and providers from information users and better semantic-level services and
interoperability. However, we believe that the key to this generation of systems is
their support for semantic interoperability, through exploitation of various forms of
metadata, multiple ontologies, and contexts. Furthermore, we believe that before
very general architectures that can support various domains can be developed,
support for semantic interoperability demands that we focus on a specific domain
first, such as GIS (Goodchild et al. 1997), and then extend what we learn to general-
purpose and multi-domain environments. Figure 6 shows a schematic of a system for
supporting semantic interoperability as described above in a geographical domain. It
is too early to give representative examples of the third generation, but a few early
efforts are described by Wiederhold (1996) and Papazoglou and Schlageter (1998);
see also InfoQuilt (http://lsdis.cs.uga.edu/infoquilt).
Acknowledgements
Members of the InfoQuilt project, especially Vipul Kashyap, Kshitij Shah, Clemens
Bertram, and Krishnan Parsuraman, have contributed to some of the ideas presented
in this paper.
References
Kiyoki Y, Kitagawa T, Hayama T 1998 A metadatabase system for semantic image search by
a mathematical model of meaning. In Sheth A, Klas W (eds) Multimedia Data
Management: Using Metadata to Integrate and Apply Digital Media. McGraw Hill: 191-222
Lagoze C, Lynch C, Daniel R 1996 The Warwick Framework: A container architecture for
aggregating sets of metadata. Technical Report TR96-1593. Cornell University,
Department of Computer Science.
http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell/TR96-1593
Lee J, Madnick S, Siegel M 1996 Conceptualizing semantic interoperability: a perspective
from the knowledge level. International Journal of Cooperative Information Systems 5(4):
367–393
Levy A Y, Srivastava D, Kirk T 1995 Data model and query evaluation in global information
systems. Intelligent Information Systems 5(2): 121-143
Litwin W, Boudenant J, Esculier C, Ferrier A, Glorieux A, La Chimia, J, Kabbaj K,
Moulinoux C, Rolin P, Stangret C 1982 SIRIUS: systems for distributed data management.
In Schneider H-J (ed) Distributed Data Bases. North-Holland, Netherlands: 311–66
Manola F 1998 Towards a Web Object Model. Object Services and Consulting, Inc.
http://www.objs.com/OSA/wom.htm
Meersman R 1997 An essay on the role and evolution of data(base) semantics. In Meersman
R, Mark L (eds) Database Application Semantics. Chapman and Hall
Meersman R, Mark L (eds) 1997 Database Application Semantics. Chapman and Hall
Mena E, Kashyap V, Illarramendi A, Sheth A 1998 Domain specific ontologies for semantic
information brokering on the global information infrastructure. Proceedings, International
Conference on Formal Ontology in Information Systems (FOIS'98), Torino: 269-283
Mendelzon A, Mihaila G, Milo T 1997a Querying the World Wide Web. Journal of Digital
Libraries 1(1): 68–88
Mendelzon A, Mihaila G, Raschid L, Tomasic A 1997b Locating and accessing heterogeneous
data sources. Proceedings of CASCON'97
Mudumbai S 1997 ZEBRA: Customizable, Extensible Metadata-based Access to Federated
Image Repositories. M.S. Thesis, Department of Computer Science, University of Georgia
Online Computer Library Center, Inc 1997 Dublin Core Metadata Element Set: Reference
Description. Office of Research and Special Projects, Dublin, Ohio.
http://www.oclc.org:5046/research/dublin_core/
Ouksel A, Naiman C 1994 Coordinating context building in heterogeneous information
systems. Journal of Intelligent Information Systems 3 (2): 151-183
Paepcke A, Chang C, Garcia-Molina H, Winograd T 1998 Interoperability for digital libraries
worldwide. Communications of the ACM 41(4): 33-43
Papazoglou M, Schlageter G (eds) 1998 Cooperative Information Systems: Current Trends
and Directions. Academic Press
Ram S (ed) 1991 Heterogeneous distributed database systems. Special Issue. IEEE Computer
24(12)
Schek H-J, Sheth A, Czejdo B (eds) 1993 Proceedings of the RIDE-IMS'93: International
Workshop on Interoperability in Multidatabase Systems. IEEE Computer Society
Shah K, Sheth A 1998 Logical information modeling of Web-accessible heterogeneous digital
assets. Proceedings of the Forum on Research and Technology Advances in Digital
Libraries (ADL'98), Santa Barbara: 266-275
Shah K, Sheth A, Mudumbai S 1997 Black box approach to image feature manipulation used
by visual information retrieval engines. Second IEEE Metadata Conference
Sheth A 1987 Heterogeneous Distributed Databases: Issues in Integration. Tutorial Notes,
3rd International Conference on Data Engineering
Sheth A (ed) 1991 Semantic issues in multidatabase systems. Special Issue. SIGMOD Record