Stardog Data Fabric Whitepaper
Data Fabric
The next generation of data management
Build a data fabric to power collaborative, cross-functional projects and products.
Escape reactive workflows with a resilient digital foundation, no rip-and-replace required.
© 2020 Stardog
Table of Contents
Data Fabric: The next generation of data management
Semantic Graph
Virtualization
Inference
The mandate for enterprise IT to deliver business value has never been stronger. In this daunting task, agility is key. However, enterprises are hampered by data strategies that leave teams flat-footed when the market shifts or new questions arise.
Structured data management systems worked acceptably well when the enterprise data
landscape was itself predominantly structured. But the world is different now. The enterprise
data landscape is increasingly hybrid, varied, and changing. The emergence of IoT, rise in
unstructured data volume, increasing relevance of external data sources, and trend towards
hybrid multi-cloud environments are obstacles to satisfying each new data request.
The old data strategy centered on relational data systems is fundamentally broken. Enterprise data fabrics offer the new way forward. The data fabric weaves together data from internal silos and external sources and creates a network of information to power your business’ applications, AI, and analytics. Quite simply, data fabrics support the full breadth of today’s data landscape.
“Stardog enables you to browse through the data … much better job satisfaction and getting the knowledge in hand …”
1. Data fabrics can answer unanticipated questions and adapt to new requirements.
2. Data fabrics enable query across data silos and external sources, regardless of data structure.
3. Data fabrics connect data at the compute layer, not the storage layer. This connects data based on its meaning rather than its location.
Data fabrics support cross-functional data connections that are key to creating and defending competitive advantage.
Take supply chain, for example. Traditional supply chain data systems are a relay race, operating with linear handoffs and siloed, peer-to-peer links between systems. When COVID-19 hit, supply chains globally collapsed. Some strain or even partial collapse was inevitable, but it was made much worse by bad data strategy that treated the supply chain as a rigid system when, in reality, it’s a complex network of actors who had to be fully in sync to adjust as needed.
Figure. Supply chain data flows between Producer, Customs Declarant, Terminal Operator, Shipping Line, Trucking Company, Warehouse, Last Mile, and Consumer, including order confirmation, transport order, export declaration, booking, inspection results, bill of lading (B/L), shipment and container ETAs, and payment.
With a data fabric in place, supply chain actors can answer questions they were previously blind to. “Show me all the lots of raw materials and associated suppliers involved in the production of finished good lot 123.” Or, “How do COGS for product A compare between these two regions?” Or, “Which manufacturers supplied the raw ingredients?”
But how exactly do data fabrics succeed where other approaches have failed?
1. First, data fabrics change the status quo by delivering meaning, not just data, across the enterprise. This meaning is woven together from many sources: data and metadata, internal and external sources, and cloud and on-premise systems. Meaning is captured within the data model, with all context on each data asset preserved, so people and algorithms can make better decisions.
2. Second, a data fabric delivers answers via powerful querying capabilities. A data
fabric is not a static thing; rather, it’s a queryable data layer, allowing users to
answer questions from across data silos. In a data fabric, query happens at the
compute layer above the actual storage layer. It’s at this compute layer where the
data fabric connects otherwise disconnected silos and systems. Data flows from
source to app and back again, constantly enriching and improving upon the data
fabric.
3. Third, data fabrics deliver this connected, queryable data to all connected apps. They are the next step forward in the maturation of the data management landscape. Data lakes centralized the enterprise’s assets but failed to make the data usable. Data lakes fail precisely because they tried to connect data at the storage layer, not at the compute layer, based on data location rather than based on data meaning. Physical colocation of data does not create meaning. Data warehouses are in fact even less capable than data lakes since they only admit structured data to begin with, leaving the semistructured and unstructured data silos completely untouched.
These previous solutions failed in part due to hybrid, varied, and changing data, but also due to organizational pushback. Data fabrics, however, are built for collaboration. By leveraging and connecting these existing assets, data fabrics are driving a new breed of cross-functional projects and products.
Enterprise data is:
Diverse, and only becoming more diverse as unstructured data growth rates skyrocket.
Distributed across multiple systems in different places, particularly as hybrid and multi-cloud environments proliferate.
Whereas previous data management solutions have focused on eliminating silos through consolidation, a data fabric stops fighting data silos. Rather than working against data silos, a data fabric leverages these data silos as they are.
Instead of replacing legacy technologies, a data fabric works alongside existing investments and improves their utility. This is because a data fabric is not a single solution; it is an architecture design that operates at the compute layer and focuses on connecting data wherever it resides, thus actually improving upon existing data storage assets like data lakes, data catalogs, warehouses, and other data integration platforms like MDM.
We can start to see now how “data fabric” actually works as a description of what’s really going on: just like an ordinary fabric, which conforms to whatever it lays over, an enterprise data fabric lays over existing data assets, connects to them via individual threads, and weaves these sources together into a unified layer. By doing so, data fabrics actually increase the value of the assets they connect.
“… This core concept in the data fabric enables the other capabilities that allow for dynamic integration and data use case orchestration.”
Gartner, “How to Activate Metadata to Enable a Composable Data Fabric,” Mark Beyer
Knowledge graphs are able to represent everything that happens to enterprise data because they serve as a universal format for data, regardless of its source structure, location, or format. A knowledge graph replaces the current laborious process for integrating enterprise data, which typically involves extraction, translation, modeling, and mapping between various applications. The custom code required for modeling and mapping quickly becomes unwieldy as it scales across the business. A knowledge graph, by contrast, easily represents data of various structures and supports multiple schemas. Furthermore, it creates the semantic understanding of enterprise and third-party data that provides critical access to business insight. This serves as the core of the data fabric, enriching data with context and meaning.
Stardog’s Enterprise Knowledge Graph platform is uniquely able to deliver a data fabric architecture without requiring rip-and-replace or building yet another data silo.

One pharmaceutical customer, for example, draws R&D data from many internal and external sources and has limited control over the quality of this data. They needed a flexible solution that could relate their internal experimental results to external and publicly available studies. They also needed to be able to evaluate the many-to-many relationships within their R&D data, such as, “Find a set of compounds which are creating a similar effect,” or “Find compounds which have similar properties.”

By implementing Stardog atop their data lake, they created a company-wide data fabric that provides a consolidated, one-stop shop for 90% of their R&D data. Their data fabric brings data access directly to data scientists and accelerates drug target identification and drug repurposing efforts.
In this section we answer some common questions we hear about data fabrics:
How exactly is data “enriched” and how does this impact analytics outcomes?
In the next section, Connecting the Enterprise, we’ll cover practical requirements of implementation, including building a team, socializing your data fabric, and developing a data model.
SEMANTIC GRAPH
Semantic graph connects data across silos. However, this isn’t its only contribution. Semantic graphs create meaning by mapping entities, their metadata, and their context. Semantic graph, also called RDF graph, uniquely supports your ability to:
1. Connect all the data that matters
2. Apply multiple schemas to the same data
In order to create business value within the enterprise, you must be able to connect all the
data that matters. Some of this data will be stored in tables, but also in PDFs, webpages,
emails, and other semistructured and unstructured sources. Only semantic graph is able to
represent data that is natively stored in other structures and connect all relevant metadata and
context.
With Stardog, different data dialects and structures embedded in legacy systems can be represented in the standard language of RDF. This allows for queries across relational and non-relational sources alike.
“… it is a simple data model with a standard syntax that can represent information of any form. The true power of the RDF, however, is its … being misunderstood.”
Gartner, “How to Use Semantics to Drive the Business Value of Your Data”
The key to understanding how semantic graph integrates data is to know that it links or connects related data, rather than transforming it. Each data object is assigned a unique ID, to which all related information is linked. This approach allows data owners to maintain control of their source data.
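To make this concrete, here is a minimal sketch of that linking using the open-source rdflib library; the namespace, predicates, and record values are hypothetical, chosen only for illustration. Facts from two different source systems attach to one stable ID, so the data is linked rather than transformed:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace; any stable IRI scheme works.
EX = Namespace("http://example.com/fabric/")

g = Graph()

# One unique ID for the supplier entity; all facts link to it.
supplier = EX["supplier/123"]

# Facts from the ERP system (a relational source).
g.add((supplier, RDF.type, EX.Supplier))
g.add((supplier, RDFS.label, Literal("Acme Raw Materials")))

# A fact from a contracts repository (a document source)
# attaches to the SAME node, with no transformation step.
g.add((supplier, EX.boundBy, EX["contract/987"]))

# Both sources' facts are now queryable in one place.
for _, p, o in g.triples((supplier, None, None)):
    print(p, o)
```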
Figure. Data and metadata from varied sources are unified within Stardog, creating a connected network of meaning.
At this point, people typically start to worry about scale. But in fact, the largest information
integration projects on the planet already use this model. Look no further than your web
browser to see this in action. The Web contains a world of information, created by different
contributors, and accessible through a single browser. Google Search is also powered by a
Knowledge Graph, a network of 500 billion facts about five billion entities. Both Google and the Web are proof that this model of large-scale, complex, and decentralized information integration works.
In Practice
What is it about this network of meaning that leads to data agility? It has everything to do with the schema.

“… was a huge strategic benefit to the bank. We are now able to design …”
Executive Director, Top 5 US Bank
Semantic graph operates in stark contrast to relational data. Finding connections between
different relational databases requires time-intensive data modeling and query operations.
Each new question produces a new dataset with its own schema. That’s not sustainable for the
rate of new and unanticipated questions that the business wants to ask of its data. Today, data
and analytics leaders need to be able to quickly support iterative question and answer cycles
from the business and easily dig into new territory in their data.
Instead of rows and columns and tables and keys, semantic graph organizes information using nodes and edges to represent entities and the relationships between those entities. This graph data model is fundamentally simpler than the relational model, yet it’s also far more flexible.
The model actually exists at the compute layer, not at the storage layer, which means you can modify the schema at any time by adding new nodes and edges; you don’t have to struggle at a point in time to come up with a single shared data model covering all current and future enterprise data needs. It also means that the enterprise can have many different, even mutually incompatible, schemas that all apply discretely to the common pool of connected data. And that means you never have to force-fit emerging data sources and use cases to adhere to standardized rules from an already outdated perspective. The result? The same data can serve many different use cases.
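As a sketch of what multiple coexisting schemas look like in practice (again using rdflib with hypothetical vocabularies, not Stardog’s actual APIs), two teams lay their own terms over the same node, and a new relationship is added later with no migration:

```python
from rdflib import Graph, Namespace

# Two hypothetical team vocabularies over one shared entity.
SALES = Namespace("http://example.com/sales/")
RISK = Namespace("http://example.com/risk/")
EX = Namespace("http://example.com/fabric/")

g = Graph()
acme = EX["company/acme"]

# The sales schema sees Acme as an enterprise-segment account ...
g.add((acme, SALES.hasSegment, SALES.Enterprise))
# ... while the risk schema sees the same node as a tiered counterparty.
g.add((acme, RISK.exposureTier, RISK.Tier1))

# Later, a brand-new relationship appears, with no migration step.
g.add((acme, EX.suppliedBy, EX["supplier/123"]))

print(g.serialize(format="turtle"))
```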
We just made a point worth diving more deeply into. What happens if you have many schemas, as is typically the case in enterprises? Can there be one schema to rule them all? In an ideal world, different use cases, organizations, lines of business, and applications would all see things in precisely the same way. Since this is not an ideal world, however, more often than not they don’t, and each can bring its own schema to the shared data fabric.
One top US bank uses its data fabric to analyze risk events – to identify what control should have prevented the risk and how to manage these risks in the future.
Historically, relevant data was stored across 15 separate applications, forcing analysts to run ad-hoc reports in Excel. This was not only time-consuming but also made it impossible for analysts to know if they had captured all data related to a particular incident. When analysts made decisions with incomplete information, they left the bank exposed to future risk events.
The bank implemented a broad, reusable data fabric to identify relationships across the various applications involved in risk management, including incident reporting, control registries, and IT asset management systems. Now, analysts can traverse the linked information in Stardog to uncover dependencies within the data and identify root causes of particular incidents.
Furthermore, they can proactively ask “what if?” questions to predict the impact of theoretical risk
scenarios, creating a more proactive risk strategy that allows them to triage and mitigate potential
risks.
VIRTUALIZATION
Virtualization spares you the expense of replicating, moving, and storing data multiple times. Virtualization connects source data directly, cutting down on what would be an otherwise complex and cumbersome ETL system migrating data from dozens or even hundreds of systems and external vendors into a single repository. Copying data for each new analysis leads to human error and data drift. It leads to uncertainty about which data sources are trusted, current, or canonical. Data virtualization provides access to live source data, which means you’re guaranteed to always get the most up-to-date data every time you ask a question.
Traditional virtualization tools only handle data that can be neatly fitted into tables, rows, and columns, because they connect at the level of relational systems. While they can protect data lakes from accidental edits, they cannot integrate data that is of diverse structures, is externally sourced, or suffers from frequently changing schemas. A data fabric based on Stardog exists at the compute layer and has no such limits.
Stardog’s Virtual Graph capability is the most mature and powerful graph-based virtualization solution on the market. Virtual Graphs connect data across data silos, even without copying that data into Stardog. Further, they provide a direct access line for external data sources. Lastly, they offer a reliable scale-out mechanism: Stardog can also virtualize other Stardog instances.
Since not all data can be virtualized, whether due to regulation or internal policy, Stardog
offers both graph virtualization and graph storage in a completely seamless blend. Use both in
combination to support the needs of different data owners while still feeding your data fabric.
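As an illustrative sketch, one SPARQL query can join triples stored in Stardog with rows virtualized live from a source system. The database name, virtual graph name, IRIs, and credentials below are placeholders, though the GRAPH <virtual://...> pattern is how Stardog addresses a Virtual Graph in a query:

```python
import stardog  # pystardog

conn_details = {
    "endpoint": "http://localhost:5820",  # placeholder endpoint
    "username": "admin",
    "password": "admin",
}

# One query joins triples materialized in Stardog with rows virtualized
# from a relational source exposed as the (hypothetical) virtual graph "erp".
query = """
SELECT ?supplier ?orderId WHERE {
  ?supplier a <http://example.com/fabric/Supplier> .   # stored in Stardog
  GRAPH <virtual://erp> {                              # fetched live at query time
    ?order <http://example.com/erp/placedWith> ?supplier ;
           <http://example.com/erp/orderId> ?orderId .
  }
}
"""

with stardog.Connection("fabric", **conn_details) as conn:
    for row in conn.select(query)["results"]["bindings"]:
        print(row["supplier"]["value"], row["orderId"]["value"])
```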
INFERENCE
Stardog’s Inference Engine associates related information stored in disparate sources, and then uses this rich web of relationships to discover new relationships within your data. By expressing all the implied relationships and connections between your data sources, you gain a far more complete picture of your data than was ever recorded explicitly.
Inference creates new relationships by interpreting connected data against your business
logic in the data model. A knowledge graph’s data model is often called an ontology or
vocabulary and lays out common relationships between entities. This allows companies to
describe complex domains, such as medicine, in which multiple facts, modeling constructs,
and business rules interact with each other to imply new connections.
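Stardog’s Inference Engine evaluates this logic against the data fabric itself; purely to illustrate the principle, the sketch below uses the open-source rdflib and owlrl libraries with a hypothetical mini-ontology, where a single subclass axiom implies a fact that was never stated:

```python
import owlrl
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.com/med/")  # hypothetical vocabulary
g = Graph()

# Ontology: every Anticoagulant is a kind of Medication.
g.add((EX.Anticoagulant, RDFS.subClassOf, EX.Medication))

# Data: warfarin is stated only as an Anticoagulant.
g.add((EX.warfarin, RDF.type, EX.Anticoagulant))

# The reasoner materializes the implied facts.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

# The never-stated fact "warfarin is a Medication" is now in the graph.
print((EX.warfarin, RDF.type, EX.Medication) in g)  # True
```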
Stardog supports multiple inference schemas or data models applied to the underlying data fabric at the same time. The value of enterprise data grows exponentially with the number of connections, which is exactly what Stardog’s Inference Engine creates. By contrast, other data integration approaches, including data lakes and data warehouses, connect data based on storage, in which case only one schema can be applied to that data. Which is one reason enterprises have to continually create new data silos for every new challenge or problem!
Figure. Ontologies are data models that show how entities relate to one another.

Users can review how Stardog arrived at an answer. This explainability is key to providing trusted results and accountability within an organization, but also necessary for compliance in regulated industries.
CONNECTING THE ENTERPRISE
While a knowledge graph is the key ingredient of the data fabric, it is not the only thing you need to be successful. Stardog has led dozens of companies through connecting their enterprise data. In this section, learn how to best use Stardog alongside your existing data management investments and how to successfully get started with your data fabric deployment.
A successful data fabric requires leveraging and connecting existing source systems. Stardog’s Virtual Graphs connect to existing data catalogs, data lakes, databases, and other data management platforms, offering comprehensive support of the most important enterprise data sources.
For data fabric deployments, Stardog recommends leveraging work completed in data catalogs to accelerate data discovery and semantic enrichment within Stardog. Stardog Studio provides an integrated development environment to access and import data and create data models.
Using the data catalog as an input, Stardog builds a data map of your enterprise data assets. This data map accelerates data fabric creation through partially automated learning and mapping of your sources.

Many teams worry about the effort of building an enterprise-wide data model. Many think this is a prerequisite to the initiative, and the undertaking may strike them as daunting. In fact, you only need to define as many concepts as needed for your initial use case. Identify a critical business problem to spearhead the broader data fabric initiative. Approach your data fabric with an MVP mindset and do strictly the minimal work to accomplish the first significant deliverable.
A key premise of Stardog’s platform is that data modeling is reusable. When things change, simply write a new modular rule to amend the model and proceed with accessing your connected data. Due to this reusable data modeling principle, the business value derived from your data fabric compounds over time.
There are many public data models that Stardog can read, helping customers to accelerate their data model development. A public data model may account for about 80% of the modeling required for your project, with the remaining 20% customized based on your proprietary data or unique internal operations. Our team can help advise on publicly available data models that fit your industry or vendor landscape, and you can work with these models right in Studio.
With an MVP in hand, it’s important to socialize your data fabric. A data fabric would be useless if the business meaning is locked away from the business. This lack of access takes two forms:
1. literal lack of access to the data; data is trapped in source systems or within IT only
2. inability to access due to a skill gap, i.e., a lack of workers skilled in manipulating graph data
Stardog addresses both with direct connections to popular business intelligence platforms via our BI/SQL Server, which converts graph data back into SQL to make it available through all major SQL variants. You can use the BI/SQL Server to connect Stardog to any platform that runs on SQL. Or, you can use our supported Connectors to BI platforms including Tableau, Power BI, cumul.io, and more.
Stardog improves upon the capabilities of these BI platforms. For example, as visualizations are created in Power BI, Virtual Graph queries run behind the scenes, allowing users to analyze data from multiple data sources as if all the data were stored in one source. Similarly, if you had a dozen different point-of-sale data sources, you could write a Stardog rule defining the relevant columns as geographic coordinates so that Tableau can automatically display all of them on a single map.
REST API: Applications can access data from Stardog directly via a REST API endpoint.
BI/SQL Server: Business analysts gain direct access to the breadth of unified data directly in their BI platform of choice. Stardog’s BI/SQL Server allows any BI platform that runs on SQL to connect.
CLI: System administrators have the option to access Stardog directly via the command line.
Stardog Studio: System administrators and data modelers can use our IDE, Stardog Studio, to query, visualize data, explore data models, and evaluate data provenance.
Python extension: Data scientists can access the unified data via Stardog’s Python extension, pystardog. Data scientists can also use Stardog’s built-in machine learning to train models directly on the virtualized data.
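A minimal pystardog session might look like the following; the endpoint, database name, and credentials are placeholders:

```python
import stardog  # pip install pystardog

conn_details = {
    "endpoint": "http://localhost:5820",  # placeholder endpoint
    "username": "admin",
    "password": "admin",
}

# Open a connection to a database and run a SPARQL query.
with stardog.Connection("fabric", **conn_details) as conn:
    results = conn.select("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")
    for b in results["results"]["bindings"]:
        print(b["s"]["value"], b["p"]["value"], b["o"]["value"])
```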
Use Stardog’s data quality constraints to manage overall data quality and ensure conformance with defined rules. Constraints also support measuring the quality of the data against defined measures.
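Stardog enforces such constraints natively; as a stand-in illustration of the idea, the sketch below uses the open-source pyshacl library with a hypothetical SHACL shape requiring every supplier to carry exactly one name:

```python
from pyshacl import validate
from rdflib import Graph

# Hypothetical SHACL shape: every Supplier must have exactly one name.
shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.com/fabric/> .

ex:SupplierShape a sh:NodeShape ;
    sh:targetClass ex:Supplier ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
""", format="turtle")

# Data with a violation: this supplier has no ex:name.
data = Graph().parse(data="""
@prefix ex: <http://example.com/fabric/> .
ex:acme a ex:Supplier .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False: the constraint flags the missing name
print(report)
```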
As your data fabric grows, Stardog grows with you. Stardog also has the ability to query other Stardog instances, which is key for compliance with data movement regulations. For example, a multinational can query across operating entities without copying any data. Set each operating entity up with their own Stardog instance, and Stardog can easily execute a global query across all virtualized data.
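As a sketch of such a global query (hostnames, database names, and IRIs are placeholders), a query submitted to one instance can reach a second instance through the standard SPARQL SERVICE keyword:

```python
import stardog

conn_details = {
    "endpoint": "http://eu-stardog:5820",  # placeholder EU instance
    "username": "admin",
    "password": "admin",
}

# Federated query: match customers held locally against account records
# held by a second Stardog instance in another region, copying nothing.
query = """
SELECT ?customer ?usAccount WHERE {
  ?customer a <http://example.com/fabric/Customer> .
  SERVICE <http://us-stardog:5820/fabric/query> {   # placeholder US instance
    ?customer <http://example.com/fabric/account> ?usAccount .
  }
}
"""

with stardog.Connection("fabric", **conn_details) as conn:
    for row in conn.select(query)["results"]["bindings"]:
        print(row["customer"]["value"], row["usAccount"]["value"])
```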
Stardog makes it easy to get started with your data fabric. In addition to our platform detailed above, we have the team and expertise to take you from MVP to global deployment! Contact us to learn more about our customers who have successfully reduced time to insight by 50-90%.
Stardog, the leading Enterprise Knowledge Graph platform, turns data into knowledge to
power more effective digital transformations. Industry leaders including BNY Mellon, Bosch,
and NASA use Stardog to create a flexible data layer that can support countless
applications. Stardog has been recognized by Fast Company as one of the world’s Most
Innovative Companies, by Database Trends and Applications as one of the 100 companies
that matter most in data management, and by KMWorld as one of the 100 companies that
matter most in knowledge management. Stardog is a privately held, venture-backed
company headquartered in Arlington, VA.