Master Thesis
Tutor
Jean-Louis Jacquerie
Advisor
Ashwin Ittoo
Abstract
Increasingly, large organizations are faced with the challenge of making data
accessible and understandable to non-expert users¹. Despite the advances in natural
language processing and knowledge representation, turning data into natural language
responses that can be understood by a general audience remains a significant challenge.
Moreover, this issue is exacerbated by the exponential growth of information and
the fragmentation of data into isolated silos, which underscores the urgent need for
tools to provide more straightforward, single-point data access.
This thesis aims to address these challenges by introducing the use of Enterprise
Knowledge Graphs as a unified data structure for consolidating and representing
disparate data sources, coupled with SPARCoder, our ontology-aware² Text-to-
SPARQL fine-tuned Large Language Model based on StarCoder (Li et al. 2023),
capable of querying knowledge graphs to retrieve data using natural language. The
proposed natural language "search engine" architecture leverages the strengths of
Large Language Models in understanding and generating human-like text, combined
with the structured representation of information provided by knowledge graphs. In
essence, this approach bridges the gap between complex data and end-users, offering
a more accessible interface.
In this work, we undertake a comprehensive description of our proposed system,
contrasting its advantages and drawbacks with those of traditional methods of data
access and retrieval, as well as with other state-of-the-art large language models.
Consequently, we assert that the integration of large language models with
knowledge graph querying significantly improves data accessibility for non-expert
users. The proposed "search engine" prototype not only facilitates a more intuitive
and accessible way of interacting with data but also opens up new possibilities for
user interaction, leading to more informed and data-driven decision making.
¹ By non-expert users, we refer to individuals who may not have formal training in or deep familiarity with database query languages or advanced data analytics.
² Ontology-aware means that our system aims to understand and utilize the ontology’s semantic information and structures to generate more accurate and semantically relevant SPARQL queries.
Acknowledgements
I’d like to express my deepest gratitude to my tutors, both from the University and
from Safran Aero Boosters. Their guidance and expert support have been instrumental
in this research.
I also wish to thank Safran Aero Boosters for the invaluable internship
experience and for providing crucial computational resources, particularly
access to high-end GPUs, which significantly propelled my work forward.
Special thanks to the Plateau Digital team at Safran. Their welcoming nature
and camaraderie made my time there both enriching and memorable.
On a personal note, my profound thanks go to my parents, whose love, guidance,
and belief in me have been the foundation of all my endeavors. Additionally, I
extend heartfelt appreciation to Charline, my girlfriend, for her enduring support
and encouragement throughout this journey.
Contents
Contents i
1 Introduction 1
1.1 Data Challenges at Safran Aero Boosters . . . . . . . . . . . . . . . . . . . 2
1.1.1 Volume and Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Data Silos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Real-time Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Quality and Consistency . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.5 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.6 Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.7 Tailored Data and Information Retrieval . . . . . . . . . . . . . . . 4
1.2 Information and Data Retrieval: Towards Knowledge Retrieval . . . . . . . 5
1.2.1 Data Retrieval Systems . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Information Retrieval Systems . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Knowledge Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.4 Evolving Information Retrieval: Vector Databases . . . . . . . . . . 6
1.3 Internship and Research Questions . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Internship Proposition . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Choosing a Path: Structured vs. Unstructured Data . . . . . . . . 7
1.3.3 SAB’s Vision and Research Question . . . . . . . . . . . . . . . . . 8
2 Related Work 9
2.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Language Processing before the Deep Learning Age . . . . . . . . . 9
2.1.2 Language Processing in the Deep Learning Age . . . . . . . . . . . 10
2.1.3 Language Processing in the Foundation Models Age . . . . . . . . . 11
2.2 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Brief History of Knowledge Graphs . . . . . . . . . . . . . . . . . . 12
2.3 Unifying LLMs and KGs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Enriching LLMs with KGs . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Amplifying KGs using LLMs . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Mutual Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Large Language Models for Code Generation . . . . . . . . . . . . . . . . . 14
2.4.1 State-of-the-Art in Code Generative LLMs . . . . . . . . . . . . . . 14
2.4.2 From Text-to-SQL to Text-to-SPARQL . . . . . . . . . . . . . . . . 14
3 StarCoder 17
3.1 Key Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Word and Token Embeddings . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Positional Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.3 Context Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.4 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.4.1 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.4.2 Multi-Head vs Multi-Query Attention . . . . . . . . . . . 24
3.1.5 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.5.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.5.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.6 Decoder-only Transformers and LLMs . . . . . . . . . . . . . . . . 28
3.1.6.1 Training Process . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.7 Fast Decoders Transformers . . . . . . . . . . . . . . . . . . . . . . 30
3.2 StarCoder Foundation Model . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1.1 Token Embeddings . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1.2 Positional Embeddings . . . . . . . . . . . . . . . . . . . . 31
3.2.1.3 Fast Decoders Transformer Blocks . . . . . . . . . . . . . 31
3.2.1.4 Linear Head . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 StarCoderBase Data Preparation . . . . . . . . . . . . . . . . . . . 32
3.2.2.1 Source Selection . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2.2 Code Language Selection . . . . . . . . . . . . . . . . . . 32
3.2.2.3 Data Quality Assurance . . . . . . . . . . . . . . . . . . . 32
3.2.3 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3.1 Pre-Training . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3.2 Pre-Training for Python . . . . . . . . . . . . . . . . . . . 33
3.2.3.3 Clusters and Carbon Footprint . . . . . . . . . . . . . . . 33
3.2.4 StarChat: From StarCoder to Assistant . . . . . . . . . . . . . . . . 34
3.2.4.1 Training Details . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.4.2 From StarChat to SPARCoder . . . . . . . . . . . . . . . 34
4.3.1 Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Tools and Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.3 Future Directions and Unstructured Data . . . . . . . . . . . . . . 48
4.4 Data Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Quality Assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5 SPARCoder 51
5.1 Limitations of Solely Relying on Enterprise Knowledge Graphs . . . . . . . 51
5.2 Limitations of Solely Relying on LLMs . . . . . . . . . . . . . . . . . . . . 52
5.3 Selection of the Enterprise Knowledge Graph Platform . . . . . . . . . . . 53
5.4 From StarChat to SPARCoder . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.1 Dataset Selection and Creation . . . . . . . . . . . . . . . . . . . . 56
5.4.2 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4.2.1 Parameter-Efficient Fine-Tuning: LoRA . . . . . . . . . . 57
5.4.2.2 Instruction Fine-Tuning . . . . . . . . . . . . . . . . . . . 58
5.4.2.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4.2.4 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 62
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.6 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.2 Frontend Web Interface . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.3 Flask Server Middleware . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6.4 Backend Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A StarCoder 71
A.1 Glosses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1.1 Employee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B SPARCoder 78
B.1 Interface and Server, Production Considerations . . . . . . . . . . . . . . . 81
B.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.2.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.2.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.2.3 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
B.2.4 Example 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Bibliography 89
Chapter 1
Introduction
In the age of information, our ability to generate and store vast amounts of data has
grown exponentially (Mazumdar et al. 2019). For multinational corporations such as
Safran Group, there is a surging need to store, manage, and access this ever-expanding
information ecosystem. While this large amount of data holds the promise of novel
insights and transformative discoveries through data science, it often remains underutilized.
One of the primary reasons for this underutilization is the complexity and inaccessibility
of data, which hinders its understanding and use.
The power of data is not solely in its volume but in the insights that can be drawn
from it. However, realizing these insights necessitates tools that enable intuitive access
and understanding. Despite being in the golden age of technological advancements in
natural language processing (NLP) and knowledge representation, there still exists a gap
between vast datasets and the ability to extract meaningful information from them. This
disconnect is a pressing concern, especially as the need for data-driven decision-making
becomes pivotal in every sector.
Knowledge Graphs (KGs) have emerged as a promising solution to address this data
fragmentation issue. By consolidating information from multiple sources into a unified
knowledge base, KGs offer a structured representation of data (Studer et al. 1998). However,
while they do consolidate and structure data, the interaction mechanism with these graphs
remains technical, necessitating expertise in querying languages like SPARQL.
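To illustrate the kind of expertise this interaction requires, consider a SPARQL query over a hypothetical employee ontology; the prefix, class, and property names below are invented for illustration and are not taken from any actual enterprise graph:

```sparql
# Hypothetical example: list employees and the trainings they completed
PREFIX ex: <http://example.org/ontology#>

SELECT ?employeeName ?trainingTitle
WHERE {
  ?employee a ex:Employee ;
            ex:name ?employeeName ;
            ex:completedTraining ?training .
  ?training ex:title ?trainingTitle .
}
ORDER BY ?employeeName
```

Writing such a query presupposes knowledge of the graph's vocabulary, of triple patterns, and of SPARQL syntax itself, which is precisely the barrier for non-expert users.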
On the other hand, Large Language Models (LLMs) have demonstrated a strong ability
to understand and generate human-like text (Brown et al. 2020; Devlin et al. 2019), as
well as machine code, making them an excellent bridge between technical data structures
and the end-user (Sun et al. 2023). By integrating the structured approach of KGs with
the intuitive interaction enabled by LLMs, we hope to democratize data accessibility.
In this thesis, our investigation is structured around three primary points. Firstly, we
turn our attention to the cutting-edge domain of Large Language Models, specifically,
the development of a Text-to-SPARQL Assistant Model. This exploration will cover
how to choose an LLM that we can train and use to effectively query an underlying KG,
making the interaction intuitive for end-users. Secondly, we delve into the process of
developing an Enterprise Knowledge Graph, presenting the inherent challenges associated
with its creation and maintenance. This will provide insights into the foundation of our
proposed system, emphasizing the importance of a well-structured and comprehensive KG.
Lastly, we converge our insights from the former domains to present the architecture of our
chatbot-driven search engine, which integrates the enterprise KG with the text-to-SPARQL
model. Thereby, we aim to offer a comprehensive understanding of our system.
1.1 Data Challenges at Safran Aero Boosters
Safran Aero Boosters (SAB) stands as an industry leader in the design, production, and
testing of low-pressure compressors, oil-system equipment, and test benches for aircraft.
As an integral subsidiary of the Safran Group, it underpins the propulsion systems
of countless aircraft around the globe (Figure 1.1). While its primary mission orbits
around the manufacturing of boosters, the intricate realm of aerospace manufacturing
demands high precision and efficiency. In such a complex environment, allocating resources
to continually update and refine state-of-the-art information systems might not always
rank at the top of the priority list. Yet in the competitive arena of aerospace manufacturing,
accentuated by elevated labor costs in Belgium, it’s imperative for SAB to constantly
pioneer in research and development. This ensures not only sustained profitability but
also a competitive edge. Achieving this requires arming the workforce with optimal tools,
both in tangible assets and in cutting-edge information systems. However, this endeavor
presents its own set of challenges.
A primary contributor is the aforementioned variety of data. Given the diverse nature of
data being generated, from structured to unstructured data, it is often more convenient,
or even necessary, to utilize specialized databases tailored to specific data types. For
instance, structured data such as machine recordings might be best managed in SQL-based
relational databases, while unstructured data like quality assurance reports could be better
handled using NoSQL databases.
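As a minimal sketch of the structured side, machine recordings could be stored and queried relationally. The table, columns, and values below are hypothetical, and SQLite stands in for whatever production RDBMS would actually be used:

```python
import sqlite3

# In-memory database standing in for a hypothetical machine-recordings store
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE machine_recordings (
        machine_id TEXT,
        recorded_at TEXT,
        temperature_c REAL
    )
""")
conn.executemany(
    "INSERT INTO machine_recordings VALUES (?, ?, ?)",
    [("M-01", "2023-06-01T10:00", 72.5),
     ("M-01", "2023-06-01T11:00", 75.5),
     ("M-02", "2023-06-01T10:00", 68.0)],
)

# Structured query: average temperature per machine
rows = conn.execute(
    "SELECT machine_id, AVG(temperature_c) FROM machine_recordings "
    "GROUP BY machine_id ORDER BY machine_id"
).fetchall()
print(rows)  # [('M-01', 74.0), ('M-02', 68.0)]
```

The fixed schema is what makes such aggregations trivial; free-text quality reports offer no equivalent handle, which is why they gravitate towards NoSQL stores.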
Another dimension to consider is the historical development of departments or divisions.
Over time, as each unit evolved, they might have adopted or developed systems best
suited to their immediate needs, often without a company-wide strategy in mind. This
decentralized approach, while providing short-term solutions, leads to data fragmentation.
This compartmentalization of data not only hinders a comprehensive view but also poses
challenges in cross-referencing data points across departments. For instance, correlating
insights from the Quality Control Department with data from machine recordings becomes
a laborious task, reducing the potential for data-driven decision-making.
Lastly, the presence of data silos often amplifies redundancy, as the same data might be
stored and managed in multiple locations, leading to inefficiencies in storage and potential
inconsistencies in data interpretation.
In light of these complexities, there’s a pressing need to bridge these silos, ensuring
that data, irrespective of its source or type, is accessible and actionable across the entire
company. Resolving this data fragmentation issue would not only provide a unified view
of data but also lay the foundation for more collaboration across various departments.
1.1.5 Security
Safran Group, being a pivotal player in the industrial landscape, possesses invaluable
intellectual property and industrial secrets that underscore its competitive edge. Given
this unique position, the company is dealing not just with standard corporate data, but
with sensitive information that, if compromised, could undermine its strategic advantage
and market reputation.
Furthermore, Safran’s engagement in defense activities amplifies the importance of
this responsibility. Defense-related data is not only commercially sensitive but can also be
of national security interest. Any breach or unauthorized access to this data could have
consequences beyond the company, potentially impacting national security.
Ensuring maximum security for this data is paramount. This demands a multi-
faceted approach, encompassing rigorous access controls, advanced encryption standards,
and continuous monitoring for potential threats. Additionally, it’s vital for Safran to
foster a security-centric culture, ensuring that every stakeholder, from executives to the
operational workforce, is aware of the importance of data security and is equipped with
the tools and knowledge to uphold it.
1.1.6 Compliance
Since 2018, compliance has extended beyond traditional industrial regulations into
the realm of data protection and privacy. In fact, the emergence of data protection laws,
such as the General Data Protection Regulation (GDPR) in Europe, signifies a global shift
towards safeguarding individual privacy and ensuring responsible data management. For
instance, an important article from the GDPR is Article 5, which stipulates that personal
data shall be processed lawfully, fairly, and in a transparent manner in relation to the data
subject¹, emphasizing principles of data minimization and accuracy (European Commission
2016). Moreover, Article 22 of the GDPR is also particularly pertinent in the context
of modern AI-driven enterprises². It deals with automated individual decision-making,
including profiling, and states that the data subject shall have the right not to be subject
to a decision based solely on automated processing, including profiling, which produces
legal effects concerning him or her or similarly significantly affects him or her (European
Commission 2016).
For Safran, aligning with such regulations is vital, especially given the increasing
integration of AI in various business processes. Adherence implies stringent requirements
on how personal data is collected, stored, processed, and shared. Violations can lead
to severe financial penalties and, more critically, can damage the trust and reputation that
Safran holds with its stakeholders, partners, and customers.
¹ In the context of the GDPR, a data subject refers to an identified or identifiable natural person.
² By modern AI-driven enterprises, we refer to organizations that heavily rely on data to fuel their artificial intelligence and machine learning applications for various processes and decision-making.
Furthermore, the complexity of the data in question can vary. Some users might require
a deep dive into the data, while others could be seeking a broad overview. A tailored data
and information retrieval system must be agile enough to handle both these extremes and
everything in between.
Understanding the context is another pivotal facet. The same piece of data can be
interpreted differently based on the department or the role of the user. For instance, a
sudden increase in raw material costs might be viewed as a procurement challenge by the
buyer’s department, but as a potential price adjustment scenario by the sales department.
A sophisticated retrieval system should not only fetch the data but also provide auxiliary
information or related data points that can aid in contextual understanding.
Lastly, the way information is presented is equally crucial. The interface and user
experience play a significant role in how efficiently users can extract value from the data.
Some might prefer visual representations such as charts or graphs, while others could lean
towards tabulated data or detailed text reports. The retrieval system must be versatile
enough to adapt to these varied presentation preferences.
In essence, the true value of data lies not just in its availability, but in its accessibility.
For SAB, navigating its intricate field of operations and myriad stakeholders, implement-
ing a tailored, context-aware data retrieval mechanism could significantly enhance data
accessibility and utilization.
³ PowerBI (Microsoft) is a tool that is extensively used to join tables from various databases at SAB and in many other companies.
1.2.2 Information Retrieval Systems
In contrast, information retrieval systems navigate the ocean of unstructured or
semi-structured data. Web search engines are the prime exemplars of such systems (Brin
et al. 1998). Unlike SQL-based systems, which rely on structured queries, information
retrieval systems, especially those using keyword-based technologies like ElasticSearch,
scan and index vast amounts of textual data. After indexing, these systems retrieve pertinent
documents or web pages in response to user queries. While their strength lies in pinpointing
data across enormous volumes, their mechanism is predominantly keyword matching,
which may fail to capture nuanced contexts.
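The keyword-matching mechanism can be sketched as a toy inverted index, a drastic simplification of what engines like ElasticSearch do at scale; the documents below are invented for the sketch:

```python
from collections import defaultdict

documents = {
    "doc1": "compressor blade inspection report",
    "doc2": "oil system pressure test results",
    "doc3": "blade surface quality report",
}

# Build the inverted index: term -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def keyword_search(query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index[terms[0]].copy()
    for term in terms[1:]:
        result &= index[term]
    return result

print(sorted(keyword_search("blade report")))  # ['doc1', 'doc3']
```

Note that a query for "rotor" returns nothing even though doc1 and doc3 discuss blades, illustrating the limitation of pure keyword matching.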
Semantic matching furthers this idea by focusing on understanding the meaning or
context behind a query rather than just the exact words. It takes into account synonyms,
related terms, and the broader context of the query to provide more relevant results,
drawing from concepts in natural language processing and machine learning.
1.3.3 SAB’s Vision and Research Question
A key directive from my tutor at SAB was the development of a Chatbot-driven search
engine. This tool was envisioned to be compatible with all the company’s information
systems, ensuring holistic data accessibility. Moreover, it was imperative that this solution
be technologically feasible, cost-effective, and aligned with the SAB data challenges
described in Section 1.1, reflecting a practical approach to enhancing data accessibility.
With this in mind, and given the decision to focus on structured data, as we delved
deeper into potential solutions, we recognized that Enterprise Knowledge Graphs (EKGs)
align well with the constraints presented (Section ??). EKGs, with their structured data
representation and integration capabilities, are designed for large company data. The
availability of commercial solutions and comprehensive EKG platforms suggests their
growing industry acceptance, making them an appealing option for SAB’s requirements.
However, while EKGs offer an organized and integrated data view, their intricacies
might render them less accessible to non-experts. Here’s where LLMs can bridge the gap.
LLMs have the potential to serve as a user-friendly interface, interpreting natural language
queries and retrieving relevant results from the EKGs (Section ??). This combination could
not only meet the constraints set by SAB but also democratize access to the information
housed within EKGs. Thus, my research question crystallized:
Research Question: Can large language models effectively integrate with enterprise
knowledge graphs to enhance data accessibility in enterprises?
1. What architectural factors should be taken into account when selecting the most
suitable Foundation Large Language Model?
Chapter 2
Related Work
approach to linguistic analysis (Chomsky 1957). However, the initial enthusiasm of the
1950s was tempered by the Automatic Language Processing Advisory Committee (ALPAC)
report in 1966, which critiqued the feasibility of MT and recommended reduced funding
(Pierce et al. 1966). Despite this, NLP advancements continued over the following decades:
systems like ELIZA, SHRDLU, and LUNAR showcased the potential of NLP in various
applications (Weizenbaum 1966; Winograd 1971; Woods et al. 1972). The 1980s predominantly
used symbolic approaches, employing complex rules for language parsing (Dyer 1983).
A paradigm shift occurred in the late 1980s and early 1990s, when statistical models,
powered by the rise in computational capabilities and machine learning, began to supplant
traditional rule-based systems, heralding the modern era of NLP research.
a vast array of linguistic tasks, making them a central building block in modern NLP.
This paradigm shift, initiated by foundation models, has redefined the trajectory of NLP.
From originally focusing on crafting tailored architectures for individual tasks, the current
emphasis revolves around maximizing the capabilities of foundation models, steering
research towards more effective adaptation methods (Figure 2.4) and understanding the
subtleties of these models (Ben Zaken et al. 2022; E. J. Hu et al. 2021).
Figure 2.4. Pre-Training and Adaptation of Foundation Models (Bommasani et al. 2021)
Seminal works in the domain explored computational interpretations of these semantic
networks, employing first-order logic (FOL) (Hayes 1981; McCarthy 1989). Initially
grounded in network data models, databases evolved into relational models, sharing the
foundational logic with programming (Taylor et al. 1976; Codd 1982).
Historical KR systems took elements from FOL and semantic networks, demonstrating
the capability to capture and represent diverse knowledge facets, from causality rules to
expert insights. The trajectory of these developments predominantly encapsulated a shift
from explicit representations to expert systems and eventually to extensive common-sense
knowledge bases (Lenat, Ramanathan V Guha, et al. 1995, 1991).
The dawn of the internet age in the mid-1990s fundamentally altered the landscape.
With the information explosion, methods to access, comprehend, and search this
information went through rapid evolution. Algorithms like PageRank marked the initial
breakthroughs (Page et al. 1999). However, the vision of semantic-enhanced searches soon
became a reality with resources like Wikidata and Data Commons, rooted in the principles
of the earlier Meta Content Format (Ramanathan V. Guha 1996). In contrast to earlier
AI systems, modern KGs largely focus on representing vast collections of ground facts,
placing less emphasis on complex inference.
More recently, Knowledge Graphs are regaining attention in the AI domain. Their
evolution, from early directed labeled graphs to today’s sophisticated KGs, shows the
progression and aspirations of the AI domain, striving for more meaningful and comprehensible
representations of knowledge (Chaudhri et al. 2022).
Figure 2.5. LLMs and KGs, Pros and Cons (Pan et al. 2023)
Three main strategies seem to have emerged for unifying LLMs and knowledge graphs
(Pan et al. 2023).
2.3.1 Enriching LLMs with KGs
This modality focuses on infusing LLMs with structured knowledge from KGs during their
formative and operational stages. The primary intent is to make LLMs more factual and
knowledge-aware (Xu et al. 2021).
translating natural language queries into Structured Query Language (SQL) commands
(Sun et al. 2023). We can also cite NSQL from Numbers Station, another Text-to-SQL
family of models based on CodeGen models (S. Wu et al. 2023).
Returning to our case study, this capability helps confirm the feasibility of our research’s
focus: developing an LLM adept at Text-to-SPARQL transformations. Given the structural
similarities between SQL and SPARQL and the demonstrated capabilities of existing
models, there is a strong basis for the potential success of our proposed LLM.
Chapter 3
StarCoder
In the vast landscape of LLMs for code generation, selecting the right technology is the
first step towards the success of our system. The model that we are looking for not
only needs to understand the nuances of human language but also to interact with
programming-centric data structures. This section motivates our model choice, StarCoder
(Li et al. 2023), because it outperforms open-source Code LLMs and is in close contention
for the top spot with closed-source models (Appendix A).
StarCoder stands out as an avant-garde creation from the BigCode community, a
collaborative endeavor dedicated to innovating Large Language Models specifically for
code (Code LLMs). It sits at the intersection of advanced natural language processing and
code understanding, with an interesting ratio between its large context window¹ and its
relatively small size (Appendix A), making it a prime candidate as the foundation model
for our ontology-aware Text-to-SPARQL LLM.
In the following, we explain in depth the motivation behind our choice of StarCoder
as our foundation model and how it aligns with the objectives of our system. In
Section 3.1, we detail its key components; then, in Section 3.2, we present StarCoder
as our foundation model, along with the data it was trained on. Let’s start by discussing
the key architectural components before bringing them together to form StarCoder’s
full architecture.
¹ The context window of an LLM refers to the amount of text it can consider at once.
high-dimensional space. This concept is fundamental for natural language models, as it
provides a dense and continuous representation for words, as opposed to sparse and discrete
representations like one-hot encoding.
This concept was popularized by word2vec (Tomas Mikolov et al. 2013), which
is one of the most popular word embedding techniques. Word2vec uses shallow neural
networks to produce word embeddings by either predicting the context given a word
(Skip-Gram) or predicting a word given its context (Continuous Bag of Words).
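As a rough sketch of the Skip-Gram setup (not the original word2vec implementation), the training pairs are built by sliding a window over the corpus, pairing each center word with its neighbours:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-Gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the compressor feeds the engine".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('the', 'compressor'), ('compressor', 'the'), ('compressor', 'feeds'),
#  ('feeds', 'compressor'), ('feeds', 'the'), ('the', 'feeds'),
#  ('the', 'engine'), ('engine', 'the')]
```

A shallow network is then trained to predict the context token from the center token (or vice versa for CBOW), and the learned input weights become the word embeddings.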
In the context of StarCoder, and more generally of transformer models,
the concept of word embeddings is slightly modified into what are called token embeddings.
Unlike traditional word embeddings that represent individual words, token embeddings
represent tokens, which can be as short as a single character or as long as a word (Figure
3.3). This granularity is crucial for tasks like code generation, where understanding
individual symbols, operators, or short sequences can be as important as understanding
full words.
² Source: https://www.tensorflow.org/tutorials/representation/word2vec
³ Source: https://jalammar.github.io/illustrated-word2vec/
Figure 3.3. OpenAI’s GPT-3 Tokenizer⁴
The tokenization process breaks down input text into these smaller units or tokens.
For instance, a line of code might be tokenized into individual symbols, keywords, and
identifiers. This tokenization allows the model to understand and generate code at a
granular level, ensuring that even subtle nuances in code syntax and semantics are captured.
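A toy illustration of how a line of code might be broken into such units; real tokenizers like StarCoder's use a learned byte-pair encoding vocabulary rather than hand-written rules, so this regex sketch only conveys the idea:

```python
import re

# Hypothetical token pattern: identifiers, numbers, multi-char operators, punctuation
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|[-+*/=<>(){}\[\];:,.]")

def toy_tokenize(code):
    """Split a line of code into coarse tokens."""
    return TOKEN_PATTERN.findall(code)

print(toy_tokenize("if pressure >= 3: valve.open()"))
# ['if', 'pressure', '>=', '3', ':', 'valve', '.', 'open', '(', ')']
```

Even this crude splitter shows why granularity matters: the operator `>=` must survive as one unit, while `valve.open()` decomposes into an identifier, an accessor, and a call.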
Once the text is tokenized, each token is mapped to a vector in a high-dimensional
space, akin to conventional word embeddings. This vector representation is refined during
the training phase. The embeddings encapsulate not only semantic and syntactic details
about the tokens but also morphological information. Morphology delves into the internal
composition of words, breaking them down into fundamental units called morphemes. For
instance, in the word ’compressor’ (Figure 3.3), ’compress’ denotes the root morpheme
conveying the core meaning, while ’or’ is a derivational morpheme that transforms the verb
compress into a noun indicating an entity performing the action. Embeddings can similarly
capture conjugation details.
4. https://platform.openai.com/tokenizer
Figure 3.4. Influence of Positional Encoding on Training (Fleuret 2021)
In practice, positional encodings are added to the word embeddings (Figure 3.5).
These positional encodings are illustrated in Figure 3.6, and their mathematical
formulation is given by:
$$PE(t, 2i) = \sin\left(\frac{t}{10000^{2i/d_{model}}}\right)$$
$$PE(t, 2i+1) = \cos\left(\frac{t}{10000^{2i/d_{model}}}\right)$$
Where:
• P E(t, 2i) represents the positional encoding for even indices 2i of the dimension.
• P E(t, 2i+1) represents the positional encoding for odd indices 2i+1 of the dimension.
5. https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers
3.1.3 Context Length
Context Length, commonly also called Max Context Length or Context Window, refers to
the number of tokens an LLM can consider at once. Since the primary purpose of positional
encodings is to provide the model with the order of tokens, they must be defined so that
they can accommodate the desired context length, giving each potential position in the
input a unique encoding.
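The sinusoidal formulation above can be sketched directly; this is an illustrative NumPy implementation (the context length and model dimension are arbitrary), producing one distinct encoding per position up to the chosen context length:

```python
import numpy as np

def positional_encoding(context_length: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: sin at even indices, cos at odd indices."""
    positions = np.arange(context_length)[:, np.newaxis]     # shape (t, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even dimensions 2i
    angles = positions / np.power(10000.0, dims / d_model)   # t / 10000^(2i/d_model)
    pe = np.zeros((context_length, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices 2i
    pe[:, 1::2] = np.cos(angles)   # odd indices 2i+1
    return pe

pe = positional_encoding(context_length=2048, d_model=512)
print(pe.shape)  # (2048, 512)
```

Each row is the encoding added to the token embedding at that position; the mixture of frequencies is what makes every row distinct.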
$$\mathrm{attention}\big(q, \{(k_i, v_i)\}\big) = \sum_{i=1}^{m} \mathrm{softmax}_i\big(a(q, k_i)\big)\, v_i$$
Here, softmax_i refers to the softmax operation applied to the i-th component. a(q, k)
is the scoring function between the query q ∈ R^q and the key k ∈ R^k. It is given by:
$$a(q, k) = w_v^\top \tanh\big(W_q^\top q + W_k^\top k\big)$$
with
$$w_v \in \mathbb{R}^h, \quad W_q \in \mathbb{R}^{q \times h}, \quad W_k \in \mathbb{R}^{k \times h}$$
6. A sequence-to-sequence model refers to a neural network architecture designed to take a sequence as input and produce a sequence as output.
Figure 3.7. Attention Layer (Louppe 2023; Zhang et al. 2021)
The function a(q, k) essentially computes a scalar that measures the similarity between
the query and the key. The learnable weight matrices Wq and Wk transform the original
query and key into a shared representation space, and the resulting vectors are then
combined using the weight vector wv. The tanh function ensures the output lies between
-1 and 1. The final softmax operation across all keys ensures that the attention weights
sum to 1, allowing the mechanism to distribute its attention across the key-value pairs.
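As a sketch, the additive scoring and softmax steps described above can be written as follows; the dimensions and the random parameter initialisation are illustrative assumptions, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
q_dim, k_dim, h = 4, 6, 8   # illustrative query, key, and hidden dimensions

# Learnable parameters of the additive scoring function (randomly initialised here).
W_q = rng.normal(size=(q_dim, h))
W_k = rng.normal(size=(k_dim, h))
w_v = rng.normal(size=h)

def score(q, k):
    """Additive attention score: w_v . tanh(W_q^T q + W_k^T k)."""
    return w_v @ np.tanh(q @ W_q + k @ W_k)

def attention(q, keys, values):
    """Softmax over the scores, then a weighted sum of the values."""
    scores = np.array([score(q, k) for k in keys])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # attention weights sum to 1
    return weights @ values, weights

q = rng.normal(size=q_dim)
keys = rng.normal(size=(5, k_dim))
values = rng.normal(size=(5, 3))
out, weights = attention(q, keys, values)
print(out.shape)  # (3,)
```

The output is a convex combination of the values, weighted by how well each key matches the query.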
Thereby, given the two input sequences X ∈ R^{n×x} and X′ ∈ R^{m×x′} (n and m are the
sequence lengths and x and x′ are the embedding sizes), the formulation of a classical
attention layer is such that,
$$\mathrm{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} \qquad (3.3)$$
with,
$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q \in \mathbb{R}^{n \times d}, \quad \mathbf{K} = \mathbf{X}'\mathbf{W}_K \in \mathbb{R}^{m \times d}, \quad \mathbf{V} = \mathbf{X}'\mathbf{W}_V \in \mathbb{R}^{m \times v}$$
3.1.4.1 Self-Attention
The standard attention mechanism as described above maps a query to a set of key-value
pairs. Self-attention, the variant of the attention mechanism used in the Transformer
architecture, does not require separate sets of queries, keys, and values: all three
derive from the same input sequence. This enables the model to focus on different parts
of the input sequence when generating its output, capturing the relation between each
token in the sequence.
Mathematically, in self-attention, Q, K, and V are all derived from the same input
sequence X. For a given input sequence X ∈ Rn×d , the queries, keys, and values are
computed as:
$$\mathbf{Q} = \mathbf{X}\mathbf{W}_q \in \mathbb{R}^{n \times d}, \quad \mathbf{W}_q \in \mathbb{R}^{d \times d}$$
$$\mathbf{K} = \mathbf{X}\mathbf{W}_k \in \mathbb{R}^{n \times d}, \quad \mathbf{W}_k \in \mathbb{R}^{d \times d}$$
$$\mathbf{V} = \mathbf{X}\mathbf{W}_v \in \mathbb{R}^{n \times v}, \quad \mathbf{W}_v \in \mathbb{R}^{d \times v}$$
As a result, the attention scores and the output are determined entirely by the input
sequence X. In essence, every token within the sequence gets an opportunity to interact
with every other token, regardless of distance or position. To exemplify, Figures 3.8
and 3.9 show self-attention scores. In Figure 3.8 we can observe that the model gives the
highest attention score to the word internship, which is referred to by it. Similarly, in
Figure 3.9 we can see that employee refers to Paul. This design inherently overcomes the
limitations imposed by predefined context sizes, enabling the model to identify and
leverage dependencies that span across long sequences.
The last important point about the attention mechanism lies in its inherent parallelizability.
Traditional recurrent architectures, such as LSTMs (Hochreiter et al. 1997) or GRUs (Chung
et al. 2014), process input sequences sequentially, which inherently limits the potential for
7. https://github.com/jessevig/bertviz
parallel computation. In contrast, the self-attention mechanism in Transformers treats
each position in the sequence independently and in parallel. This means that all positions
can be computed simultaneously, significantly speeding up training and inference times,
especially when leveraging modern GPU architectures.
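The parallel nature of self-attention is visible in a minimal NumPy sketch: the softmax of QKᵀ/√d is computed for all positions at once, with no sequential loop over the sequence. The dimensions and random projections are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 16                      # sequence length and embedding size (illustrative)
X = rng.normal(size=(n, d))       # the input sequence

# Projection matrices for queries, keys, and values (randomly initialised sketch).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)                        # (n, n) pairwise similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
output = weights @ V                                 # every token attends to all tokens

print(output.shape)  # (5, 16)
```

All n×n interactions are computed in a handful of matrix products, which is exactly what maps well onto GPU hardware.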
While the self-attention mechanism offers significant benefits in terms of capturing
dependencies within sequences, variants have been proposed to expand its power. The
concept of multi-head attention was first introduced by Vaswani et al. (2017) in the
paper Attention Is All You Need, which introduced what are commonly called Transformers.
This mechanism employs multiple sets of learnable weight matrices for queries, keys, and
values, as shown in Figure 3.10.
Formally, for multi-head attention, given h heads H_i, we have:
$$\mathbf{H}_i = \mathrm{attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
with W_i^Q, W_i^K, and W_i^V the weight matrices for the i-th head.
Thus, the output of the multi-head attention mechanism is:
$$\mathrm{multihead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{concat}(\mathbf{H}_1, \ldots, \mathbf{H}_h)\,\mathbf{W}^O$$
with,
$$\mathbf{W}^O \in \mathbb{R}^{h d_v \times d_{model}}$$
Figure 3.10. Multi-Head Attention (Vaswani et al. 2017)
Figure 3.11. Multi-Query Attention
Thus, as for the multi-head attention mechanism, the output of the multi-query
attention mechanism is given by:
$$\mathrm{multiquery}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{concat}(\mathbf{H}_1, \ldots, \mathbf{H}_h)\,\mathbf{W}^O, \qquad \mathbf{H}_i = \mathrm{attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}^K, \mathbf{V}\mathbf{W}^V)$$
with,
$$\mathbf{W}^O \in \mathbb{R}^{h d_v \times d_{model}}$$
where the key and value projections W^K and W^V are shared across all heads.
• Memory Efficiency: Since there’s no need for additional key and value matrices for
each perspective, multi-query attention is more memory efficient than its multi-head
cousin.
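A back-of-the-envelope comparison illustrates this memory saving: during decoding, multi-head attention caches separate keys and values per head, while multi-query attention caches a single shared set. The numbers below are illustrative, not StarCoder's actual configuration:

```python
# Illustrative KV-cache size comparison (element counts, not bytes).
n_heads, d_head, seq_len, n_layers = 16, 64, 2048, 24

# Multi-head: every head stores its own keys and values.
multi_head_kv = 2 * n_layers * seq_len * n_heads * d_head

# Multi-query: one shared key/value set per layer.
multi_query_kv = 2 * n_layers * seq_len * d_head

print(multi_head_kv // multi_query_kv)  # 16
```

The cache shrinks by a factor equal to the number of heads, which is what makes multi-query decoding faster and lighter in memory.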
3.1.5 Transformers
Transformers, introduced by Vaswani et al. (2017) in the paper Attention Is All You Need,
have strongly impacted the field of natural language processing by providing a new paradigm
for sequence modeling. Unlike traditional LSTMs and GRUs, which rely on sequential
computation, Transformers exploit the parallel processing capabilities of modern GPUs
through attention mechanisms, making training more efficient, enabling them to capture
longer-range dependencies, and allowing Transformer models to scale efficiently.
The Transformer model consists of token embeddings and positional encoding blocks,
as well as the core of the architecture, an encoder and a decoder stack, as described in
Figure 3.12. The encoder and decoder each contain multiple blocks. Let's outline the
components of the architecture and their interactions, starting with the encoder.
Figure 3.12. The Transformers - Model Architecture (Vaswani et al. 2017)
3.1.5.1 Encoder
Each encoder block (on the left of Figure 3.12), repeated Nx times, consists of three main
components:
• Multi-head self-attention layers
• Position-wise feed-forward network layers
• Normalization layers
The input to the encoder is passed through the multi-head self-attention block, which
allows the model to weigh the relevance of different parts of the input sequence, as
previously detailed (section 3.1.4). The output of the attention layer then passes
through a feed-forward neural network, applied identically to each position. Additionally,
residual connections surround the multi-head self-attention layers and feed-forward
network layers, followed by layer normalization (He et al. 2016; Ba et al. 2016).
3.1.5.2 Decoder
The decoder (on the right of Figure 3.12), repeated Nx times, has an architecture similar
to the encoder but with one additional multi-head attention layer, which attends to the
encoder's output. The four components of each decoder layer are:
• Masked multi-head self-attention layers
• Multi-head encoder-decoder attention layers
• Position-wise feed-forward network layers
• Normalization layers
Masking in the self-attention mechanism ensures that the prediction for a particular
word does not depend on future words in the sequence: tokens after the current position
are masked, preserving the auto-regressive property. In other words, during the attention
computation, future tokens have no influence on the current or previous tokens. For a
more visual understanding of the Transformer's components, refer to Figure 3.12
accompanying this description.
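A minimal sketch of this causal masking: scores for future positions are set to −∞ before the softmax, so their attention weights become exactly zero (the uniform scores are a placeholder):

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                            # placeholder attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the diagonal = future
scores[mask] = -np.inf                               # future positions are masked out

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax

print(weights.round(2))
# row t spreads its attention over tokens 0..t only; future tokens get weight 0
```

Since exp(−∞) = 0, each token's prediction is computed as if the sequence ended at its own position, which is what preserves the auto-regressive property.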
Positional Encoding
As explained in section 3.1.4, attention layers do not integrate any form of recurrence.
To make up for this, and to give the Transformer some information about the relative and
absolute position of words in a sequence, the authors used positional encodings. These
encodings are added to the input embeddings at the entry of the encoder and decoder. The
encodings use sine and cosine functions of different frequencies (section 3.1.2), which
ensure a unique encoding for each position (Figure 3.6).
7. E.g., for translation tasks
GPTs are first pre-trained on huge corpora of text. During this phase, the model learns
to predict the next word in a sequence, effectively becoming a language model. This
self-supervised learning helps the model capture the structure of our language, as
well as vast amounts of general linguistic knowledge from diverse contexts, without
explicit labels.
$$L = -\frac{1}{T}\sum_{t=1}^{T} \log P(w_t \mid w_1, \ldots, w_{t-1})$$
• For classification, it’s typically Cross-Entropy Loss between the predicted and
true class labels.
• For regression tasks, it might be Mean Squared Error between the predicted
and true values.
However, the core idea remains rooted in the cross-entropy loss, especially for tasks
that involve predicting a probability distribution over discrete outcomes.
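As a toy numeric check of this loss, assuming made-up model probabilities for the correct next word at each position:

```python
import math

# Hypothetical probabilities the model assigns to the correct token at each step.
probs_of_correct_token = [0.9, 0.6, 0.8]   # illustrative values, not a real model
T = len(probs_of_correct_token)

# L = -(1/T) * sum over t of log P(w_t | w_1, ..., w_{t-1})
loss = -sum(math.log(p) for p in probs_of_correct_token) / T
print(round(loss, 4))  # 0.2798
```

A perfect model (probability 1 everywhere) would give a loss of 0; the less probability mass the model places on the correct next token, the larger the loss.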
9. We use words for a better understanding; however, in practice the cross-entropy loss is computed over tokens.
Figure 3.16. GPT Downstream Tasks Fine-tuning (Radford et al. 2018)
Empirical evaluations have showcased the strengths of this approach. Decoding using
the multi-query attention mechanism is notably faster compared to the traditional multi-
head attention mechanism. Crucially, this boost in speed does not come at the expense
of performance; the observed quality degradation is minimal, making it an acceptable
trade-off for many applications.
Having delved into the essential components that constitute StarCoder, from the
initial token embeddings to the innovations in fast decoders, it's clear that each element
plays a crucial role. These are not just isolated concepts; together, they pave the path to
the very essence of our focus, the StarCoder Foundation Model.
3.2.1 Architecture
StarCoder's architecture, based on SantaCoder's architecture (Allal et al. 2023), follows
the principles of the Generative Pre-Trained Transformers family while embracing special
components suited to coding tasks; it also allows a large context window while keeping a
relatively low number of parameters (Appendix A, Figure A.1).
3.2.2 StarCoderBase Data Preparation
Building an effective Code LLM demands meticulous data selection. In this section, we
summarize the data preparation used by the BigCode team to pretrain StarCoderBase.
• Volume and Popularity: They selected languages with data exceeding 500 MB
and those ranking in the top 50 on platforms such as Githut10 and the TIOBE
Index11 2022.
The dataset also made room for data formats like JSON and YAML, but with
restricted volume, given their data-centric nature as opposed to code-centric.
• Filtering: To ensure data quality, various filters were applied: XML filter to
eliminate non-code XML content, an alpha filter targeting non-code files based on
alphabetic character count, an HTML filter focusing on content visibility, and specific
length and character-based filters for data-heavy formats like JSON and YAML.
10. Githut is an analytical tool that presents statistics about programming languages based on the number of repositories and pushes to repositories on GitHub. It provides insights into the popularity and activity levels of programming languages.
11. TIOBE Index is an indicator of the popularity of programming languages.
3.2.3 Training Details
In order to transform this untrained architecture into a foundation model, the BigCode
team performed extensive training. In the subsequent sections, we detail the BigCode
team's training process.
3.2.3.1 Pre-Training
StarCoderBase has been trained over the huge dataset detailed in the previous section
(3.2.2), following these settings:
• Iterations: 250 000
• Batch Size: 4M tokens
• Cumulative Tokens: 1T
• Training Data: Prepared dataset (section 3.2.2)
• Optimizer: Adam (Kingma et al. 2017) with parameters β1 = 0.9, β2 = 0.95, ε = 10−8
• Weight Decay: 0.1
• Learning Rate: Cosine decay, starting at 3 × 10−4 and attenuating to 3 × 10−5
following 2 000 iterations of linear warm-up.
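The learning-rate settings above can be sketched as a schedule function; this is an illustrative reconstruction of linear warm-up followed by cosine decay, not BigCode's actual training code:

```python
import math

def lr_schedule(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, total=250_000):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 at end of warm-up, 1 at the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate ramps up linearly over the first 2,000 iterations, peaks at 3 × 10⁻⁴, and decays smoothly to 3 × 10⁻⁵ by iteration 250,000.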
region and average AWS datacenter metrics, this results in an estimated CO2 emission
of 16.68 tonnes12, underscoring the environmental considerations to keep in mind when
developing LLMs.
• Built-in Special Tokens: One of the striking features that make StarChat a
suitable candidate for conversational code assistance is its familiarity with special
tokens <|assistant|> and <|user|>. These tokens enable the model to understand
and delineate between user prompts and model-generated responses, allowing for
more structured and coherent interactions.
• Versatility: StarChat is designed not only to understand and generate code but
also to converse about it. This dual capability is essential for users who need not
just code generation but also explanations and rephrasing of results.
12. It is equivalent to over two times the average annual CO2 emissions per capita in Belgium, which stood at 8.10 metric tons in 2019 (Data Commons). This underscores the environmental impact of training LLMs. One potential approach to mitigate such environmental costs is to leverage pretrained LLMs, which can be fine-tuned for specific tasks without the need for extensive retraining. Such strategies accentuate the need for sustainable and environmentally conscious approaches in AI research and development.
13. StarCoderEx is a Code Generator extension for VS Code based on StarCoder.
What architectural factors should be taken into account when selecting the most
suitable Foundation Large Language Model? (Underlying Research Question 1.3.3)
The decision to use StarChat as the foundation for our SPARCoder model was driven
by several compelling reasons:
• Context Length: StarCoder boasts the longest context length compared to other
open-source models (Appendix A.1).
• Coding and Natural Language Capabilities: Given its training on code, including
SPARQL (Appendix A.3), and on conversational datasets, a SPARQL-finetuned
StarChat can provide guidance by understanding natural language queries and
translating them into SPARQL.
Chapter 4
As discussed in Section 1.1, the ability to capture, structure, and utilize the
vast amounts of enterprise data has become pivotal. This data, varied in source and
nature, can become overwhelming if not handled efficiently, leading to inefficiencies and
missed opportunities. This is precisely where Knowledge Graphs, especially Enterprise
Knowledge Graphs (EKG), become an appropriate tool.
A Knowledge Graph (KG) can be broadly defined as a graph-structured knowledge base,
designed to store information in nodes (entities) and edges (relationships) to represent and
connect real-world entities and their interrelations in a semantically meaningful manner
(Ehrlinger et al. 2016). While the concept of Knowledge Graphs has been around for some
time, their relevance and utility have skyrocketed with the ascent of large-scale knowledge
bases like Google’s Knowledge Graph, which aims to understand facts about people, places,
and things and how these entities are all interconnected (Singhal 2012).
Enterprise Knowledge Graphs, on the other hand, are specialized versions of knowledge
graphs tailored for the needs of enterprises. They bridge silos of data, provide a unified
view of data sources, and enable advanced analytics, thereby powering more informed
decision-making processes. The use of EKGs aids businesses in recognizing patterns, opti-
mizing operations, fostering innovation, and enhancing customer experiences by leveraging
connections that would have otherwise remained hidden in isolated data sets (Fensel 2011).
Despite their promising benefits, the deployment and maintenance of EKGs are not
without challenges. These include issues related to data integration, scalability, real-time
processing, security, and more. The intricacies of such challenges and the processes of
implementing Enterprise Knowledge Graphs can often be overwhelming. To provide clarity
and a structured approach to these complexities, this section has been largely inspired by
the Knowledge Representation and Reasoning course given by professor Debruyne (2023).
Thereby, we will follow the roadmap of building and maintaining a KG as described in
Figure 4.1. To further elucidate this process, we will use the practical use case which has
been given by SAB, which involves integrating Employee, Training, and Digital Training
databases to pinpoint areas of expertise of each employee, department, etc.
Figure 4.1. Building and Maintaining a KG (Debruyne 2023), based on (H. Wu et al. 2017)
capture this intricate granularity, linking each employee’s professional data with their
training records. Beyond offering a view of their capabilities, the EKG also traces the
professional interconnections among employees. While such a consolidated representation
can significantly aid HR and management decisions, its fundamental purpose is to interface
with SPARCoder, our Text-to-SPARQL LLM. This ensures precise answers concerning
employees, their interrelations, and training.
Having established the foundational need and purpose for our EKG at SAB, let’s now
delve into the ontology development process.
Definition 4.2.2 An Ontology is a [formal,] explicit specification of a [shared] conceptu-
alization.
With this in mind, let’s delve into the ontology development process. In the subsequent
sections, we will walk through each step involved in this process, beginning with the task
of defining the scope of application for our ontologies.
4.2.2 Application
As we have already discussed, defining the scope of application is an essential first step in
ontology development, as it establishes the boundaries of the knowledge domain and guides
the subsequent design and implementation stages. This foundational phase is crucial in
ensuring that the ontology is purposeful, fit for its intended use, and avoids unnecessary
complexity.
Our application is designed with a primary focus on querying the corporate hierarchy
and expertise within a company. It integrates data from three primary sources, namely
the Employee, Training, and Digital Academy Training databases, as described in section
4.2.3.
The Employee database primarily provides data concerning employee details, including
their hierarchical positions within the company, their department, etc. The Training
database is the source for information related to employee training and the specific skills or
knowledge they have gained as a result. Lastly, the Digital Academy Training database
is the source for information related to online training undertaken by employees, offering
additional insight into their skills and knowledge base.
The key constraints in our ontology model pertain to the relationships between different
entities. For example, every employee is assumed to be part of a department and reports to
a manager. Concerning business rules, we can incorporate the rule that an employee
can be part of multiple formations, representing their multidisciplinary training and
expertise. Rules and constraints will be discussed in the ontology development section
4.2.5.
In conclusion, the goal of our application is to leverage the ontology and underlying
databases to represent and facilitate queries pertaining to the company hierarchy and
expertise.
4.2.3 Databases
For the purpose of our proof of concept project, we leverage three separate CSV files,
extracted from databases, as our core databases: the Employee, the Training, and the
Digital Academy Training databases (Figure 4.2). In order to comply with GDPR
regulations, these databases are populated with anonymized data, effectively eliminating
sensitive information such as names, email addresses, and other personally identifiable
details. This approach is adopted in the interest of maintaining data security, ensuring that
our project activities do not pose any potential risk to the confidentiality of the data sub-
jects. Using CSV databases also provides a crucial advantage by eliminating the need for
direct connections to production databases, thereby preventing the risk of unintentionally
affecting the integrity and availability of the operational systems. Furthermore, utilizing
CSV files simplifies the overall setup of the project due to their ubiquitous compatibility
and ease of manipulation. Such an approach also greatly enhances the accessibility and
transferability of our project, allowing for effortless migration, inspection, and sharing of
the datasets.
However, it’s important to note that the approach and methodology we’re utilizing for
this proof of concept project are not strictly confined to CSV files or anonymized databases.
Indeed, the same principles can be efficiently employed when dealing with actual, live
databases, whether they are hosted on AWS, Oracle, or other such platforms. The process is
complemented by tools provided by Knowledge Graph platforms, which offer a multitude
of connectors for various databases1. These connectors facilitate seamless interfacing with
a wide array of databases, enabling us to retrieve, manipulate, and annotate data from
multiple sources. Therefore, while we’re using CSV files for their simplicity and security
advantages in the context of this proof of concept, we could just as easily adapt our
approach to real-world, production databases when the necessity arises.
4.2.4 Namespace
Within the field of ontology development, namespaces stand as a primordial element for
identifying resources. Understanding namespaces is crucial to ensuring the unambiguity
and reliability of ontologies, especially in environments where multiple ontologies coexist.
The namespace, a URI selected for an ontology, plays a role in ensuring its unique
identification; in other words, a URI namespace is used to identify an ontology. To that
end, our ontology follows a systematic approach in determining its namespace, which we
will exemplify with our use case. Note also that a prefix is a short, human-readable label
that stands in for a full namespace URI. Instead of writing the full URI, we can use the
prefix, which acts as an abbreviation. Once a prefix is declared and associated with a
namespace URI, it can be used in its place.
1. E.g., https://www.stardog.com/platform/connectors/
4.2.4.1 URI Selection
A common practice is to represent ontology namespaces as URLs2 . Although it need
not be a live URL, the namespace chosen must be associated with the project’s domain,
offering clarity regarding its source. For instance, if the ontology is developed under the
auspices of a company, the namespace could start with the company’s website URL, such
as http://www.company-example.com/ontology/.
4.2.4.3 Versioning
To accommodate the evolution of the ontology over time, our namespace includes versioning
information. This approach is particularly helpful when updates are made, providing a
clear record of the ontology’s version history, e.g.,
http://www.company-example.com/ontology/example_onto/v0/.
2. A URL (Uniform Resource Locator) is a specific type of URI (Uniform Resource Identifier) that not only allows one to identify a resource but also provides a means to locate it.
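A minimal sketch of how a prefix abbreviates the versioned namespace above (the prefix name `ex` and the helper function are illustrative assumptions):

```python
# Hypothetical prefix table for the example namespace used in this section.
PREFIXES = {
    "ex": "http://www.company-example.com/ontology/example_onto/v0/",
}

def expand(curie: str) -> str:
    """Expand a prefixed name such as 'ex:Employee' into its full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("ex:Employee"))
# http://www.company-example.com/ontology/example_onto/v0/Employee
```

This is exactly the substitution that SPARQL and Turtle perform when a `PREFIX` (or `@prefix`) declaration is in scope.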
4.2.5 Ontologies Engineering
As already mentioned, our project leverages two distinct ontologies to encapsulate concepts,
relations, and instances drawn from our databases.
Relations include:
4.2.6 Glosses
In Annex A.1, we present a glossary of the key concepts used in our ontologies. These
tables serve as a dual-language reference, providing human-readable labels and definitions
in both English and French3. The definitions focus on distinguishing characteristics and
intrinsic properties that differentiate each concept, adhering to guidelines of clear glossary
construction4.
3. It's pertinent to mention that SAB is a company where French is the predominant language of communication. Consequently, to ensure clarity and facilitate comprehension for the stakeholders, it was essential to incorporate French descriptions. This approach aligns with the company's linguistic context and adapts to the native linguistic preferences of its employees.
4. It's essential to note that the descriptions provided in this glossary are general in nature and serve primarily for illustrative purposes in this context. They may not necessarily mirror the exact terminologies or definitions as employed within the applications at SAB.
4.2.7 Visualization
[Figure (diagram): visualization of the ontologies, showing classes such as Office, Company, Employee, Active Employee, Department, Training, SAB Training, and Digital Academy Training (with external counterparts for Person, SpatialThing, Employee, and Company); their datatype properties (e.g., name, surname, email, registration number, phone number, position, latitude, longitude, start date, end date, course level, stars earned); and relations such as works for, employs, belongs to, includes, is located at, is secretary of, is correspondent of, participated in, followed by, and provided by.]
4.3 Data Transformation
Data Transformation refers to the process of converting non-KG data into Knowledge
Graphs. In the context of EKGs, data transformation is crucial, as data often resides
in diverse structured repositories like relational databases. The primary goal of the
transformation process is to ensure that data is in a suitable format and structure to
be ingested and integrated into the knowledge graph, maintaining the semantic meaning
and relationships between data items; this process is also referred to as mapping the data,
or populating the KG. As detailed in Figure 4.1, this process can be largely automated,
as mapping scripts can be written, or generated through graphical tools, in order to
populate the KG automatically.
5. It's worth noting that while SQL databases are a primary example of structured data sources, the term "structured data" isn't exclusive to them. Structured data refers to any data that adheres to a specific format or model, making it easily searchable and queryable. For instance, even data scraped from web pages, such as LinkedIn profiles, can be considered structured if it's consistently organized into defined fields or categories (e.g., Name, Position, Company, etc.). Essentially, any data source that provides information in a predictable, fielded format can be categorized under structured data, irrespective of its underlying storage or retrieval mechanism.
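To illustrate what such a mapping script can look like, here is a minimal sketch that turns a (hypothetical) Employee CSV extract into RDF-style triples; the column names, the relation name, and the namespace are illustrative assumptions, not SAB's actual schema:

```python
import csv
import io

EX = "http://www.company-example.com/ontology/example_onto/v0/"

# Stand-in for a CSV file extracted from the Employee database.
employee_csv = io.StringIO(
    "id,department\n"
    "E001,Finance\n"
    "E002,IT\n"
)

# Map each row to a (subject, predicate, object) triple.
triples = []
for row in csv.DictReader(employee_csv):
    subject = EX + "employee/" + row["id"]
    triples.append((subject, EX + "belongsTo", EX + "department/" + row["department"]))

for s, p, o in triples:
    print(s, p, o)
```

A real pipeline would emit these triples in Turtle or load them through a platform connector, but the row-to-triple logic stays the same.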
Figure 4.4. Stardog Designer Mapping Tool
LLMs, with their adeptness at understanding human text and database structures, are
well suited to significantly ease these processes. Indeed, LLMs can assist in auto-mapping
database structures to knowledge graph ontologies, minimizing manual interventions, as
illustrated by Figure 4.5.
Figure 4.5. Stardog Designer, Suggest Mapping Tool
Furthermore, the massive amount of unstructured data emphasizes the need for its
transformation into structured formats for knowledge graphs. LLMs offer a vision where
unstructured documents will be effortlessly mapped into knowledge graphs. It's worth
noting that the effective integration of such LLMs into enterprise systems is not yet
mature and will demand more research in this direction. However, technologies such as
AIASHI (Poumay 2019) are opening the path.
accuracy, and conducting usability and security assessments. Employing tools like SHACL,
SPARQL, and OntoClean can streamline these checks. Furthermore, continuous monitoring
and establishing feedback loops with end-users are crucial for real-time improvements and
adaptability. In essence, rigorous Quality Assurance ensures EKGs remain trustable tools
for data-driven decision-making.
To conclude, the development of an EKG is a multifaceted process, characterized by
an intricate interplay of data, technologies, and human expertise. The outlined steps,
Ontology Development, Data Transformation, Data Annotations, and Quality Assurance
(Figure 4.1) provide a high-level schematic of the journey. However, every stage has its
unique set of intricacies and nuances that demand in-depth exploration and knowledge
engineering expertise. By discovering these foundational milestones, one gains a footing to
embark on this complex process.
In the specific context of my research and prototype development, it's pertinent to
note that Data Annotations and Quality Assurance were not delved into exhaustively.
The rationale behind this decision is justified by the substantial company resources they
demanded, specifically the time of company experts. Furthermore, these elements, although
primordial in a complete EKG deployment, were not essential for the development of
my KG coupled with the LLM prototype. Nevertheless, a comprehensive EKG strategy
would invariably necessitate a robust mechanism for data annotation and quality assurance.
Chapter 5
SPARCoder
Before diving into the details of our prototype development, it’s crucial to address a
foundational question:
Why not employ a Knowledge Graph or a Large Language Model in isolation? What
advantages arise from their combined utilization?
1. While our SPARCoder prototype does not currently address this capability, it presents an intriguing avenue for future development, especially when an LLM tailored to Knowledge Graphs already exists.
for users. These tools often encompass visualization capabilities to represent the intricate
relationships in an intuitive manner, query builders that allow users without expertise in
SPARQL to extract relevant information, and comprehensive designer tools that aid in
ontology construction and population. Furthermore, they provide reliable solutions for
storing and managing the growing knowledge graph in an efficient and scalable manner.
The value of these tools should not be understated, as they significantly reduce the barriers
to EKG adoption within enterprises. In the subsequent sections, we will delve deeper
into the specific Enterprise Knowledge Graph platform we’ve adopted for our SPARCoder
prototype, discussing its capabilities and the rationale behind our selection.
LLMs and highlights the potential vulnerabilities of such systems. Termed as the
Grandma Hack, this exploit leverages the model’s tendency to act compliantly when
posed with emotionally evocative scenarios. A user demonstrated that by tricking
ChatGPT into believing it is a deceased grandmother speaking to her grandchildren, it
can be manipulated to divulge sensitive information such as Windows activation keys
or even IMEI numbers of phones2 . The exploit not only reflects the potential misuse of
these LLMs but also raises questions about the data they might inadvertently retain.
Given the ever-evolving sophistication of these hacks, ensuring complete
security remains a challenge. The Grandma Hack serves as a cautionary tale,
emphasizing the need for stringent safeguards when employing LLMs in any capacity,
especially within enterprises dealing with sensitive information. It’s imperative for
organizations to be aware of such vulnerabilities.
• Handling Real-time Data: LLMs are static in their knowledge once trained, and
updating them with new data is not a straightforward task. SAB has to deal with
real-time data and needs its information systems to adapt swiftly (section 1.1.3).
An LLM-centric system might struggle to provide timely insights in fast-changing
business environments. LLMs would therefore require continuous fine-tuning to
remain relevant; this constant need for updates can be resource-intensive and may
result in lapses if not managed effectively.
² Source: softonic - Cracking the Code: How to Hack ChatGPT and Activate Grandma Mode
• Virtual Knowledge Graphs: One of Stardog’s standout features is its unique
technology of virtual knowledge graphs. Unlike traditional methods where data
needs to be imported into the knowledge graph, Stardog’s virtual connectors enable
ontology-based data access without the need for data migration. This not only
ensures real-time accuracy but also significantly reduces the overhead associated
with data replication. The distinctiveness of this feature places Stardog a cut above
its competitors, none of which offers a similar technology.
• Diverse API Support: With the provision of APIs including Python and JavaScript,
Stardog ensures integration with other platforms and development environments.
– Stardog Designer: Stardog Designer serves as an efficient tool for those engaged
in ontology development, mapping, and KG population. By offering a visual
interface, it simplifies the typically intricate process of ontology design, making
it more accessible and less error-prone.
– Stardog Explorer: This tool allows users, regardless of their expertise in
SPARQL, to visually explore the knowledge graph and construct queries in an
intuitive manner using the query builder. The user-friendly interface ensures
that a wider demographic within an enterprise can harness the benefits of the
EKG.
• Alignment with LLM Integration: The synergy between our objectives and
Stardog’s roadmap (Figure 5.1) was evident when, during the period of our internship,
they announced their intention to integrate LLM tools similar to our SPARCoder,
thereby proposing a commercial, enterprise-ready solution to the need my prototype
tries to address.
Figure 5.1. Stardog AI Roadmap
In our pursuit of the optimal Enterprise Knowledge Graph platform, we also explored
other potential solutions. Each had its own merits, but also limitations that rendered them
less suitable for our requirements than Stardog. Here’s a brief overview of the platforms
considered and the reasons for their exclusion:
• GraphDB: While it offers a robust system, it does not provide Encryption at Rest.
This raised concerns about the security and integrity of the data stored within,
particularly for sensitive enterprise applications.
• RDFox: Two primary concerns led to RDFox’s exclusion. Firstly, it does not offer
Encryption at Rest. Secondly, RDFox does not offer a comprehensive platform. Its
reliance on external platforms like Metaphacts for complete functionality was deemed
unsuitable for our streamlined needs.
• Apache Jena: Being open source, Apache Jena has its advantages. However, it
does not provide a comprehensive platform, and the absence of Encryption at Rest
further weakened its candidacy for our requirements.
• Amazon Neptune: Amazon Neptune excels as a graph database but falls short
when considering the broader scope of an Enterprise Knowledge Graph platform.
The necessity of integrating an additional enterprise KG platform for complete
functionality was deemed unsuitable for our streamlined needs.
• Neo4j: A significant drawback was its lack of native compliance with W3C standards.
Given the importance of standards in ensuring compatibility and future scalability,
this was a red flag.
Selecting the Enterprise Knowledge Graph platform was a decision shaped by the
examination of available solutions and how well they matched our requirements. Stardog
emerged as the prime choice due to its comprehensive offerings, security features, alignment
with enterprise constraints, and potential for synergies with LLMs through its API. Other
platforms, while having their own strengths, were lacking in one or more critical areas that
were essential for our use-case.
This distinction introduces a potential concern regarding the overall quality and accuracy
of the synthetic data. While leveraging advanced models like text-davinci-003 can
amplify the size of our dataset, it brings challenges in ensuring the reliability of every
generated sample. As such, users and researchers leveraging our dataset must be aware of
and account for this potential variability in data quality when conducting experiments or
assessments.
These ten ontologies collectively formed the primary training data for our
model. Furthermore, for evaluation purposes, the EmployeeTraining Dataset from our use
case was split into two subsets: one for testing and another for validation. This ensures a
rigorous assessment of the model’s performance against both familiar training data and
newer, unseen data, allowing us to observe its generalization capabilities more effectively.
For those interested in exploring or utilizing our dataset, it has been made available on
Hugging Face as a DictDataset. This ensures accessibility and encourages further research
and development in the domain.
5.4.2 Fine-Tuning
5.4.2.1 Parameter-Efficient Fine-tuning: LoRA
Given our constrained computational power and limited GPU bandwidth, we used a
parameter-efficient fine-tuning technique. One such method, which has demonstrated
notably favorable results, is LoRA, as introduced in the publication LoRA: Low-Rank
Adaptation of Large Language Models by E. J. Hu et al. (2021). This approach enables
the fine-tuning of LLMs by training only a small fraction of the model’s total parameters.
Impressively, it retains a performance level comparable to full fine-tuning, where all
parameters are trained.
Transformers largely depend on dense layers to perform essential matrix multiplications.
These weight matrices are typically of full rank. However, during the task-specific
adaptation of pre-trained language models, it is observed that the models exhibit a lower
intrinsic dimension (Aghajanyan et al. 2020). This low intrinsic dimension suggests that
even with a random projection to a smaller subspace, the model can learn efficiently.
Inspired by this observation, LoRA hypothesizes that the updates to the weight matrices
also possess a low intrinsic rank during the adaptation phase. Specifically, if we consider
a pre-trained weight matrix W₀ ∈ ℝ^(d×k), its update can be represented using a low-rank
decomposition:
W₀ + ∆W = W₀ + BA
where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and the rank r is much smaller than d and k.
This method involves freezing W₀ during training while allowing A and B to be
trainable. Such an approach modifies the forward pass to be:
h = W₀x + ∆Wx = W₀x + BAx
(Figure 5.2)
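The decomposition above can be illustrated with a minimal NumPy sketch. The dimensions here are hypothetical and much smaller than in a real model; B is initialized to zero as in LoRA, so the adapted model starts out identical to the base model:

```python
import numpy as np

# Hypothetical dimensions for illustration; SPARCoder itself uses rank r = 16.
d, k, r = 64, 48, 4

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))        # frozen pre-trained weight matrix
B = np.zeros((d, r))                # trainable, zero-initialized as in LoRA
A = rng.normal(size=(r, k)) * 0.01  # trainable

x = rng.normal(size=(k,))

# Modified forward pass: h = W0 x + B A x
h = W0 @ x + B @ (A @ x)

# Since B starts at zero, the adapted model initially matches the base model.
assert np.allclose(h, W0 @ x)

# Parameter savings: only A and B are trained instead of the full W0.
full_params = d * k        # 64 * 48 = 3072
lora_params = r * (d + k)  # 4 * 112 = 448
print(full_params, lora_params)
```

Only A and B receive gradient updates during fine-tuning, which is what makes the approach tractable on limited GPU resources.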
Figure 5.2. LoRA
Data Model :
{ONTOLOGY_MODEL.ttl}
RESPONSE SHAPE : {CSV SHAPE}<|end|>
<|assistant|>
{ASSISTANT ANSWER}<|end|>
<|endoftext|>
• Contextual Information: The instruction should offer the model a context, framing
its role as an assistant expert in Knowledge Engineering. This helps in channeling
the model’s behavior towards a specific domain of expertise.
• Explicit Constraints: We define specific constraints that the model must adhere to
when generating SPARQL queries. This includes considerations for case insensitivity,
flexibility in dealing with missing properties, and where the results should be stored.
• Special Tokens and Tags: The instruction prompt uses the three special tokens
from the StarChat fine-tuning (<|system|>, <|user|>, and <|assistant|>) to demarcate
the turns of the discussion. Additionally, we use the tags SPARQL QUERY and CSV
RESULT, respectively, to signal the underlying Python script to query the Knowledge
Graph, and to indicate a query result to SPARCoder.
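A sketch of how such a prompt might be assembled (the special tokens are from StarChat; the system wording and function name here are illustrative, not our exact prompt):

```python
def build_prompt(ontology_ttl: str, history: list, user_msg: str) -> str:
    """Assemble a StarChat-style prompt using <|system|>, <|user|>, <|assistant|> tokens.

    history is a list of (role, text) tuples for earlier turns.
    """
    system = (
        "<|system|>You are an assistant expert in Knowledge Engineering. "
        "Generate case-insensitive SPARQL queries over the following ontology.\n"
        "Data Model:\n" + ontology_ttl + "<|end|>\n"
    )
    turns = ""
    for role, text in history:
        turns += f"<|{role}|>{text}<|end|>\n"
    # End with the assistant token so the model continues from there.
    return system + turns + f"<|user|>{user_msg}<|end|>\n<|assistant|>"

prompt = build_prompt("@prefix emp: <...> .", [], "How many active Employees?")
print(prompt)
```

At inference time, earlier user/assistant turns are appended to `history`, which is what gives SPARCoder its conversational context.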
5.4.2.3 Training
Training LLMs like ours is a complex process requiring the appropriate selection of
architecture, data, hyperparameters, and fine-tuning techniques. In our endeavor to
train our SPARCoder model, we turned to the HuggingFace platform. Recognized for
its comprehensive set of tools tailored for training and fine-tuning transformer models,
HuggingFace’s Transformers library, with its `Trainer` interface, was an indispensable asset
in our journey. We also used HuggingFace’s Model and Dataset Hub, a convenient
solution for handling LLMs and datasets.
One important strategy we employed is the system and user masking strategy.
Inspired by the masking of user labels used to train StarChat, as explained
by Tunstall et al. (2023), where special tokens facilitate selective masking of user input
in dialogues, we applied a similar methodology. The core idea is to guide the model to
condition its behavior on certain segments of the data, yet train it to predict only the
specific segments that are essential during the inference phase.
In chat models, this masking strategy ensures the model conditions on user input but
is optimized to predict only the assistant’s responses. Such a distinction is pivotal as
it focuses the model’s attention on what is imperative during actual deployment or
inference. In our use case, considering the substantial size of the system prompts, we
further extended this strategy to mask the system prompt as well. This step ensures the model
does not overfit to the specific ontology model in the system prompt, while still absorbing
the important context the ontology provides.
By leveraging this refined masking strategy, the model is trained to condition its
responses on both the user’s input and the system’s cues. However, its parameters
are optimized for the generation of the SPARQL queries, not for guessing the ontology model
or the user questions. It also mitigates data leakage by ensuring limited knowledge retention
of the enterprise ontologies the model is trained on.
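In practice, this kind of masking is implemented by setting the labels of system and user tokens to the ignore index of the loss function, so that only assistant tokens contribute to the gradient. A minimal sketch with illustrative token IDs:

```python
# Minimal sketch of system-and-user masking (token IDs are illustrative).
# Tokens belonging to <|system|> and <|user|> segments get label -100 so the
# cross-entropy loss ignores them; only <|assistant|> tokens are predicted.

IGNORE_INDEX = -100  # value ignored by PyTorch's CrossEntropyLoss by default

def mask_labels(token_roles, token_ids):
    """Return labels where all non-assistant tokens are masked out."""
    return [
        tid if role == "assistant" else IGNORE_INDEX
        for role, tid in zip(token_roles, token_ids)
    ]

roles  = ["system", "system", "user", "user", "assistant", "assistant"]
ids    = [11, 12, 21, 22, 31, 32]
labels = mask_labels(roles, ids)
print(labels)  # [-100, -100, -100, -100, 31, 32]
```

The input IDs are still fed to the model in full, so it conditions on the masked segments; only the loss computation skips them.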
In our training process, it was essential to meticulously choose and set hyperparameters
to ensure optimal performance of the SPARCoder model. These hyperparameters determine
the model’s learning pace, regularization, and overall training dynamics. The specific
values adopted for our model are detailed in Table 5.1. This configuration was selected
based on parameters commonly used for the fine-tuning of StarChat, ensuring that our
model benefits from proven practices in similar settings.
Parameter                     Value
Foundation Model Path         HuggingFaceH4/starchat-alpha
Dataset                       Corentin-tin/text2SPARQL-dataset
Sequence Length               5700
Max Steps                     300
Training Batch Size           1
Evaluation Batch Size         1
Gradient Accumulation Steps   4
LoRA Rank                     16
LoRA Alpha                    32
LoRA Dropout                  0.1
Learning Rate                 2 × 10⁻⁵
LR Scheduler Type             cosine
Number of Warmup Steps        50
Weight Decay                  0.05
Table 5.1. Training Details for SPARCoder Model Fine-tuning
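These hyperparameters map onto a peft/transformers configuration roughly as follows. This is a sketch under the assumption of the standard Hugging Face APIs, not our exact training script, and the output directory is a hypothetical path:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                 # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="sparcoder-checkpoints",  # hypothetical path
    max_steps=300,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    weight_decay=0.05,
)
```

The `lora_config` would then be applied to the base model with `get_peft_model` before handing both to the `Trainer`.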
Figure 5.3. Dynamic Learning Rate - Linear Warmup and Cosine Decay
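The schedule in Figure 5.3 combines a linear warmup over the first 50 steps with a cosine decay over the remaining 250. A small sketch of that computation (decaying to zero, as the standard cosine scheduler does):

```python
import math

BASE_LR, WARMUP_STEPS, MAX_STEPS = 2e-5, 50, 300

def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(learning_rate(0))    # 0.0
print(learning_rate(50))   # 2e-05 (peak, end of warmup)
print(learning_rate(300))  # ~0.0 (fully decayed)
```

The warmup avoids large, destabilizing updates at the start of fine-tuning, while the cosine decay lets the model settle into a minimum towards the end of the 300 steps.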
Upon monitoring the training and evaluation losses during the model’s training process,
a distinct pattern can be observed in Figure 5.4. Both the training and evaluation losses
exhibited a pronounced decreasing phase during the initial 80 steps. This is an anticipated
behavior in the initial stages of training as the model refines its parameters to better
fit the data. The model begins to learn the expected response form with respect to the
provided instruction format, ensuring that its generated outputs are not only accurate but
also conformant to the structural expectations set by the training data.
However, after these initial 80 steps, while the training loss continued its decline at a
slower rate, the evaluation loss appeared to plateau. This stagnation in the evaluation loss,
despite the persistent decline in the training loss, is indicative of an overfitting
scenario. The model, although increasingly fitting the training data well, might not
be generalizing as effectively to unseen data, a concern further compounded by the
intricate nature of our dataset.
Figure 5.4. Training and Evaluation Losses
1. Example 1
User: How many active Employees ?
SPARQL:
SELECT (COUNT(?employee) AS ?count)
WHERE {
  ?employee a emp:ActiveEmployee .
}
RESULT:
   count
0   4204
Assistant: There are 4204 active employees in the Knowledge Graph
User: And at Safran Aero Boosters ?
SPARQL QUERY:
SELECT (COUNT(?employee) AS ?count)
WHERE {
  ?employee a emp:ActiveEmployee .
  ?employee emp:worksFor ?company .
  ?company emp:companyName ?companyName .
  FILTER regex(?companyName, "Safran Aero Boosters", "i") .
}
RESULT:
   count
0   1333
RESPONSE SHAPE: (1, 1)
Assistant: There are 1333 active employees at Safran Aero Boosters
As we can see with these two simple questions, the model acts as expected. We can
also notice that SPARCoder keeps track of the context: just by asking
"And at Safran Aero Boosters ?" as a second question, the model knows that it must
count the number of employees at SAB.
2. Example 2
User: How many training cost more than 1000 ?
SPARQL QUERY:
SELECT (COUNT(?training) AS ?count)
WHERE {
  ?training a tr:Training .
  ?training tr:cost ?cost .
  FILTER (?cost > 1000) .
}
RESULT:
   count
0   4057
RESPONSE SHAPE: (1, 1)
Assistant: There are 4057 trainings that cost more than 1000.
User: Can I have there names ?
SPARQL QUERY:
SELECT ?name
WHERE {
  ?training a tr:Training .
  ?training tr:name ?name .
}
RESULT:
   name
0  The 5 key principles of personal data protection
1  Jump into the Digital Transformation @ Safran
2  Jump in Engineering 4.0
3  Jump in Manufacturing 4.0
4  CoorpCast | Ep 13 | Slash careers: Multiple Jobs, Multiple Lives
RESPONSE SHAPE: (584, 1)
Assistant: The 5 key principles of personal data protection
Jump into the Digital Transformation @ Safran
Jump in Engineering 4.0
Jump in Manufacturing 4.0
CoorpCast | Ep 13 | Slash careers: Multiple Jobs, Multiple Lives
Here we can see that the first SPARQL query is correct. The second, however, while a
valid SPARQL query, does not take the context into account: the result does not
contain the names of trainings costing more than 1000€, but simply the names of
all trainings.
3. Example 3
User: How many training each employee has attended ?
SPARQL QUERY:
SELECT ?employee ?training (COUNT(?training) AS ?count)
WHERE {
  ?employee emp:participatedIn ?training .
}
GROUP BY ?employee ?training
ORDER BY DESC(?count)
RESULT:
   employee  training  count
0       NaN       NaN      0
RESPONSE SHAPE: (1, 3)
In this case, the query is not valid with respect to the ontology because the object
property emp:participatedIn does not exist; the appropriate object property is
defined in the Training ontology and is therefore tr:participatedIn.
This issue is the most frequent, and the most problematic, as it means that
the model does not strictly respect the underlying ontology. Most of the time, when
more than one ontology prefix must be used in the SPARQL query, the model fails
to use the appropriate prefixes.
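A simple mitigation one could envisage is to validate generated queries against the ontology before execution. The check below is a hypothetical post-processing step, not part of the current prototype, and the property sets are an illustrative subset of the ontologies:

```python
import re

# Prefixes declared across the ontologies (illustrative subset).
KNOWN_PREFIXES = {"emp", "tr"}

# prefix:property pairs actually defined (illustrative subset of the ontology).
DEFINED_PROPERTIES = {("tr", "participatedIn"), ("emp", "worksFor"), ("tr", "cost")}

def check_query(query: str) -> list:
    """Return prefix:property pairs used in the query but not defined in the ontology."""
    errors = []
    for prefix, prop in re.findall(r"\b([A-Za-z]\w*):(\w+)", query):
        if prefix in KNOWN_PREFIXES and (prefix, prop) not in DEFINED_PROPERTIES:
            errors.append(f"{prefix}:{prop}")
    return errors

bad = "SELECT ?e WHERE { ?e emp:participatedIn ?t . }"
print(check_query(bad))  # ['emp:participatedIn'] -- wrong prefix, should be tr:
```

Such a check could either reject the query or be fed back to the model as a repair hint before the query ever reaches the triple store.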
4. Example 4
User: What is the name of the manager of 'jt85264'
SELECT ?name
WHERE {
  ?employee emp:id ?id ;
            emp:firstName ?name .
  FILTER regex(?id, "jt85264", "i") .
}
RESULT:
Empty DataFrame
Columns: [name]
Index: []
RESPONSE SHAPE: (0, 1)
The model struggles to generate this basic query, likely because it lacks natural
language comprehension, having been primarily trained on code.
5.5 Discussion
The results obtained from our model, SPARCoder, provide both a window into the model’s
potential capabilities as well as its limitations, particularly in the niche and intricate
domain of EKGs. As we navigate through the outcomes, there are several crucial aspects
that emerge, helping to understand the nuances and challenges inherent to training a
model in this domain.
queries. While it has been trained extensively on code, the bridging of natural
language queries to structured SPARQL queries is evidently still a challenge. This
limitation emphasizes the need for an even blend of natural language and structured
data during the training phase to enhance the model’s versatility.
In conclusion, the results highlight both the promises and the challenges of training
models specialized in ontologies and KGs. The commendable capability of SPARCoder
in contextual understanding sets a positive precedent. However, the inconsistencies in
results and struggles with multiple ontologies indicate the intricate nature of the problem at
hand. These findings, while pointing out areas of improvement, also pave the way for future
research. By understanding these limitations, we can refine our training methodologies,
dataset compositions, and evaluation metrics to create models that are more adept and
consistent in their performance. This journey, while filled with challenges, holds immense
potential to revolutionize the way we interface with domain-specific enterprise KGs.
5.6 Architecture
In order to provide seamless integration between the front-end user interface and the
backend capabilities of SPARCoder and the Knowledge Graph, a well-planned architecture
is essential. Our architecture is designed to take into account the specific requirements
of our prototype.
While our current setup is apt for demonstration purposes, a production-grade applica-
tion demands further enhancements (Appendix B.1).
5.6.1 Overview
At a high level, our system is divided into three primary components. These components
ensure that user queries from the web interface are processed, converted into SPARQL
queries (where necessary), and that results are retrieved from the appropriate
backend sources.
• HTML: To define the structure of our web content, laying out the chat interface,
buttons, and input fields.
• CSS: To dictate the aesthetics of our interface, such as the color scheme, typography,
and responsive design.
• JavaScript: Powers the interactive features of our interface. With JavaScript, we
capture user inputs, communicate with the Flask server, and update the chat interface
in real time.
• Request Handling: Capturing and interpreting user inputs from the frontend.
• Storage Management: As per our current setup, discussions are stored in text
files corresponding to chat sessions. The server manages the reading and writing of
these sessions.
2. Stardog Knowledge Graph: Queries that need factual information or depend
on the enterprise data are directed towards the Stardog query endpoint.
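Directing a query to Stardog amounts to an HTTP call against its SPARQL endpoint. A minimal sketch with the standard library follows; the server URL and database name are hypothetical, and in production one would more likely use Stardog's official client library:

```python
import urllib.parse
import urllib.request

def build_sparql_request(base_url: str, database: str, query: str):
    """Build a POST request for a SPARQL query endpoint, asking for CSV results."""
    url = f"{base_url}/{database}/query"
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={
            "Accept": "text/csv",  # CSV is the shape SPARCoder consumes
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )

req = build_sparql_request(
    "http://localhost:5820",  # hypothetical local Stardog server
    "ekg",                    # hypothetical database name
    "SELECT (COUNT(?e) AS ?count) WHERE { ?e a emp:ActiveEmployee . }",
)
print(req.full_url)  # http://localhost:5820/ekg/query
# urllib.request.urlopen(req) would execute it (requires a running server and auth).
```

Requesting `text/csv` keeps the result in the tabular shape that is echoed back to SPARCoder as the `RESULT` block of the conversation.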
5.6.5 Architecture
A visual representation of our architecture, illustrating the interaction between different
components, can be seen in Figure 5.6. For a more detailed view, refer to the full-size
diagram in Appendix B.5.
[Figure 5.6: architecture diagram. The user's natural language request (1) is turned by SPARCoder into a SPARQL query (2) against the Knowledge Graph platform's query engine and triple store, which retrieves relevant data (3) from internal isolated data silos (PLM, 3DExp, HRM, and ERP databases) and external sources such as commodity market prices; the relevant data is returned as CSV (4) and rendered as a natural language response (5). Ontology model management and access control sit on the platform side, supported by a knowledge expert.]
5.7 Conclusion
This thesis embarked on a journey to address the prevalent challenge of making vast data
accessible and comprehensible to non-expert users, specifically within the realm of large
organizations. We delved into the potential of coupling the Enterprise Knowledge Graphs
with an ontology-aware Text-to-SPARQL fine-tuned Large Language Model. Our prototype,
SPARCoder, sought to showcase the power of marrying structured knowledge graphs
with advanced natural language processing techniques.
Several points emerged from our research. Firstly, the integration of large language
models with knowledge graph querying indeed presents a promising avenue for enhancing
data accessibility for non-experts. Our "structured data search engine" prototype not
only offers a more intuitive data interaction experience but also broadens the horizons for
diverse user interactions, thereby promoting data-driven decision-making.
However, as highlighted in our discussions, the journey is not devoid of challenges. The
model’s inconsistencies in maintaining context across queries, its struggles with ontology
prefixes, and challenges in seamlessly bridging natural language queries with structured
SPARQL requests underscore the complexities of this venture. These challenges, rather
than appearing insurmountable, point towards avenues for future research. They emphasize
the importance of well-curated training datasets, particularly in specialized domains like
KGs, and hint at the potential refinements needed in our training methodologies.
As we conclude, it’s evident that the association between knowledge graphs and large
language models holds great promise. While the current system offers only a modest step
towards bridging the gap between vast and fragmented data sources and end-users, the
frontier to be explored is still vast.
Building on this foundational work and recognizing the immense landscape yet to be
charted, we now turn our attention to potential Future Directions.
objective of generating SPARQL. This choice was underpinned by the presumption that
the nature of SPARQL generation would align closely with typical coding tasks. However,
an open question remains: What if the semantics and structure of KGs and SPARQL are
more congruent with natural language processing rather than classic coding?
Knowledge Graphs inherently capture relations and entities, similar to the way natural
language encapsulates subjects, predicates, and objects. The syntactic nature of SPARQL
might be closer to natural language constructs than traditional code, given its emphasis
on querying relationships and attributes. Recognizing this potential similarity, it might be
beneficial to investigate LLM architectures that are fine-tuned or primarily designed for
natural language processing tasks.
Appendix A
StarCoder
Figure A.2. Overview of the training data for StarCoder. For the selected programming
languages, we show the number of files and data volume after near-deduplication, as well
as after filtering - part 1 (Li et al. 2023)
Figure A.3. Overview of the training data for StarCoder. For the selected programming
languages, we show the number of files and data volume after near-deduplication, as well
as after filtering - part 2 (Li et al. 2023)
A.1 Glosses
A.1.1 Employee
Entities
ConceptID Context Term(en) Term(fr)
... Employee Employee Employé
... Employee Office Bureau
... Employee Company Entreprise
... Employee Department Département
Gloss(en): An individual who works part-time or full-time under a contract of employment.
Gloss(fr): Un individu qui travaille à temps partiel ou à temps plein sous un contrat de travail.

Gloss(en): A location where an employee carries out their work activities.
Gloss(fr): Un lieu où un employé réalise ses activités de travail.

Gloss(en): An organized group of people with a particular purpose, such as a business.
Gloss(fr): Un groupe organisé de personnes ayant un but particulier, comme une entreprise.

Gloss(en): A functional unit within an organization.
Gloss(fr): Une unité fonctionnelle au sein d’une organisation.
Relations
A.1.2 Training
Entities
ConceptID Context Term(en) Term(fr)
... Training Training Formation
... Training DigitalAcademy Training Formation DigitalAcademy
... Training SAB Training Formation SAB
Gloss(en): A structured training and development program in the context of professional development.
Gloss(fr): Un programme structuré de formation et de développement dans le contexte du développement professionnel.

Gloss(en): An online platform offering a variety of training courses.
Gloss(fr): Une plateforme en ligne proposant une variété de formations.

Gloss(en): A development program offering the possibility variety.
Gloss(fr): Un programme structuré de formation et de développement dans le contexte du développement professionnel.
Relations
Appendix B
SPARCoder
Data Model :
{ONTOLOGY_MODEL.ttl}
<|user|>
RESULT:
  employeeName employeeID
0     John Doe     ID3124
1     Jane Doe     ID3125
2   John Smith     ID3126
Do you need any further information about them ?<|end|>
<|endoftext|>
Figure B.3. GPU Memory Allocated
Figure B.4. GPU Power Usage (W)
2. Database Integration: Relying on text files for storage is not scalable or efficient.
In a production setting, integrating a relational or NoSQL database would ensure
faster data retrieval, backup mechanisms, and better data integrity.
3. Security Measures: Protecting user data and ensuring secure interactions between
the frontend, backend, and LLM are paramount. This might involve implementing
HTTPS, data encryption, using secure APIs, and login/password to connect.
[Full-size architecture diagram: the messaging application/server hosts the user interface, SPARCoder (LLM), and the context window; the Knowledge Graph platform provides the query and inference engine, triple store, KG connectors (OBDA), mapping engine, ontology endpoint, and access control; ontology development and model management are carried out by a knowledge expert. The user's natural language request (1) becomes a SPARQL query (2) against the triple store, which draws relevant data (3) from internal isolated data silos (PLM, 3DExp, HRM, ERP databases) and external commodity market prices, returned as CSV (4) and rendered as a natural language response (5).]
B.2 Qualitative Analysis
B.2.1 Example 1
B.2.2 Example 2
Can I have there name ?<|end|>
<|assistant|>
SPARQL QUERY:
```
PREFIX tr: <https://www.safran-group.com/companies/safran-aero-boosters/ontology/training/v0/>
SELECT ?name
WHERE {
  ?training a tr:Training .
  ?training tr:name ?name .
}
```<|end|>
<|user|>
RESULT:
   name
0  The 5 key principles of personal data protection
1  Jump into the Digital Transformation @ Safran
2  Jump in Engineering 4.0
3  Jump in Manufacturing 4.0
4  CoorpCast | Ep 13 | Slash careers: Multiple Jobs, Multiple Lives
B.2.3 Example 3
```<|end|>
<|user|>
RESULT:
   employee  training  count
0       NaN       NaN      0
B.2.4 Example 4
SELECT ?name
WHERE {
  ?employee emp:id ?id ;
            emp:firstName ?name .
  FILTER regex(?id, "jt85264", "i") .
}
```<|end|>
<|user|>
RESULT:
Empty DataFrame
Columns: [name]
Index: []
RESPONSE SHAPE: (0, 1)<|end|>
<|assistant|>
The manager of 'jt85264' is not known.
Bibliography
Li, Raymond et al. (2023). “StarCoder: may the source be with you!” arXiv: 2305.06161
[cs.CL] (pages 2, 14, 17, 31, 71–73).
Mazumdar, Somnath et al. (2019). “A survey on data storage and placement methodologies
for Cloud-Big Data ecosystem”. In: Journal of Big Data 6.1, p. 15. issn: 2196-1115.
doi: 10.1186/s40537-019-0178-3. url: https://doi.org/10.1186/s40537-019-0178-3 (page 1).
Studer, Rudi, V.Richard Benjamins, and Dieter Fensel (1998). “Knowledge engineering:
Principles and methods”. In: Data Knowledge Engineering 25.1, pp. 161–197. issn:
0169-023X. doi: https://doi.org/10.1016/S0169-023X(97)00056-6. url: https://www.sciencedirect.com/science/article/pii/S0169023X97000566 (pages 1, 39).
Brown, Tom B. et al. (2020). “Language Models are Few-Shot Learners”. arXiv: 2005.14165
[cs.CL] (page 1).
Devlin, Jacob et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding”. arXiv: 1810.04805 [cs.CL] (pages 1, 7, 11).
Sun, Ruoxi et al. (2023). “SQL-PaLM: Improved Large Language Model Adaptation for
Text-to-SQL”. arXiv: 2306.00739 [cs.CL] (pages 1, 15).
European Commission (2016). “Regulation (EU) 2016/679 of the European Parliament
and of the Council of 27 April 2016 on the protection of natural persons with regard to
the processing of personal data and on the free movement of such data, and repealing
Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance)”.
url: https://eur-lex.europa.eu/eli/reg/2016/679/oj (pages 4, 52).
Hambarde, Kailash and Hugo Proença (2023). “Information Retrieval: Recent Advances
and Beyond”. In: arXiv preprint arXiv:2301.08801 (page 5).
Chaudhri, Vinay K et al. (2022). “Knowledge graphs: Introduction, history, and perspec-
tives”. In: AI Magazine 43.1, pp. 17–29 (pages 5, 13).
Brin, Sergey and Lawrence Page (1998). “The Anatomy of a Large-Scale Hypertextual
Web Search Engine”. In: Seventh International World-Wide Web Conference (WWW
1998) (page 6).
Bugaje, Maryam and Gobinda Chowdhury (Nov. 2017). “Is Data Retrieval Different from
Text Retrieval? An Exploratory Study”. In: pp. 97–103. isbn: 978-3-319-70231-5. doi:
10.1007/978-3-319-70232-2_8 (page 6).
Tiwari, Anil (2023). “Emergence of Vector Databases with AI wave”. Medium. url: https://tiw-anilk.medium.com/emergence-of-vector-databases-with-ai-wave-dd9976dedc2f (page 6).
Cer, Daniel et al. (2018). “Universal Sentence Encoder”. In: CoRR abs/1803.11175. arXiv:
1803.11175. url: http://arxiv.org/abs/1803.11175 (page 7).
Johnson, Jeff, Matthijs Douze, and Hervé Jégou (2017). “Billion-scale similarity search
with GPUs”. In: CoRR abs/1702.08734. arXiv: 1702.08734. url: http://arxiv.org/
abs/1702.08734 (page 7).
Louis, Antoine (July 2020). “A Brief History of Natural Language Processing”. In: Medium. url: https://medium.com/@antoine.louis/a-brief-history-of-natural-language-processing-part-1-ffbcb937ebce (page 9).
Chiusano, Fabio (Sept. 2022). “A Brief Timeline of NLP: A journey across grammars, expert
systems, ontologies, statistical models, neural networks, word embeddings, transformers,
etc.” In: NLPlanet. https://medium.com/nlplanet/a-brief-timeline-of-nlp-bc45b640f07d
(page 9).
Bommasani, Rishi et al. (2021). “On the Opportunities and Risks of Foundation Models”.
In: arXiv preprint arXiv:2108.07258 (pages 9, 11, 12).
Weaver, Warren (1949). “Translation”. In: Machine Translation of Languages. Ed. by
William N. Locke and A. Donald Boothe. Reprinted from a memorandum written by
Weaver in 1949. Cambridge, MA: MIT Press, pp. 15–23 (page 9).
Chomsky, Noam (1957). “Syntactic Structures”. Mouton (page 10).
Pierce, John R et al. (1966). “Language and machines — computers in translation and
linguistics”. Tech. rep. Washington, DC: National Academy of Sciences, National
Research Council (page 10).
Weizenbaum, Joseph (1966). “ELIZA—A Computer Program for the Study of Natural
Language Communication between Man and Machine”. In: Communications of the
ACM 9.1, pp. 36–45. doi: 10.1145/365153.365168 (page 10).
Winograd, Terry (1971). “Procedures as a representation for data in a computer program
for understanding natural language”. Tech. rep. Massachusetts Institute of Technology,
Cambridge Project (page 10).
Woods, W., R. Kaplan, and B. Nash-Webber (1972). “The Lunar Sciences Natural Language
Information System: Final Report”. Tech. rep. Bolt, Beranek and Newman, Cambridge,
MA (pages 10, 12).
Dyer, Michael G. (1983). “The Role of Affect in Narratives”. In: Cognitive Science 7.3,
pp. 211–242 (page 10).
Bengio, Yoshua et al. (2003). “A Neural Probabilistic Language Model”. In: Journal of
Machine Learning Research 3, pp. 1137–1155 (page 10).
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams (1986). “Learning
representations by back-propagating errors”. In: Nature 323, pp. 533–536 (page 10).
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Memory”. In: Neural
Computation 9.8, pp. 1735–1780 (pages 10, 23).
Mikolov, Tomáš et al. (2010). “Recurrent Neural Network Based Language Model”. In: Pro-
ceedings of the Eleventh Annual Conference of the International Speech Communication
Association (page 10).
Graves, Alex (2013). “Generating Sequences with Recurrent Neural Networks”. In: arXiv
preprint arXiv:1308.0850 (page 10).
Collobert, Ronan and Jason Weston (2008). “A Unified Architecture for Natural Language
Processing: Deep Neural Networks with Multitask Learning”. In: Proceedings of the
25th International Conference on Machine Learning, pp. 160–167 (page 10).
LeCun, Yann et al. (1998). “Gradient-Based Learning Applied to Document Recognition”.
In: Proceedings of the IEEE 86.11, pp. 2278–2324 (page 10).
Mikolov, Tomas et al. (2013). “Efficient Estimation of Word Representations in Vector
Space”. In: arXiv preprint arXiv:1301.3781 (pages 10, 18).
Socher, Richard et al. (2013). “Recursive Deep Models for Semantic Compositionality Over
a Sentiment Treebank”. In: Proceedings of the 2013 Conference on Empirical Methods
in Natural Language Processing, pp. 1631–1642 (page 10).
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le (2014). “Sequence to Sequence Learning with
Neural Networks”. In: Advances in Neural Information Processing Systems, pp. 3104–
3112 (page 10).
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural Machine Trans-
lation by Jointly Learning to Align and Translate”. In: Proceedings of the International
Conference on Learning Representations (ICLR) (page 10).
Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: Advances in Neural
Information Processing Systems (pages 10, 20, 24–27).
Radford, Alec et al. (2018). “Improving Language Understanding by Generative Pre-
Training”. In: OpenAI (pages 11, 28, 30).
Clark, Peter et al. (2020). “From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An
Overview of the Aristo Project”. In: AI Magazine 41.4, pp. 39–53 (page 11).
Ben Zaken, Elad, Shauli Ravfogel, and Yoav Goldberg (2022). “BitFit: Simple Parameter-
efficient Fine-tuning for Transformer-based Masked Language-models”. In: arXiv
preprint arXiv:2106.10199 (page 12).
Hu, Edward J et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models”.
In: arXiv preprint arXiv:2106.09685 (pages 12, 57).
Hayes, Patrick J (1981). “Computing science and statistics: The interface”. In: Journal of
the American Statistical Association 76.374, pp. 7–15 (page 13).
McCarthy, John (1989). “First-order logic and AI”. In: Readings in artificial intelligence.
Morgan Kaufmann Publishers Inc., pp. 13–23 (page 13).
Taylor, R. and M. Frank (1976). “Data base management systems”. In: Computer 9.2,
pp. 38–44 (page 13).
Codd, Edgar F (1982). “Relational database: a practical foundation for productivity”. In:
Communications of the ACM 25.2, pp. 109–117 (page 13).
Lenat, Douglas B, Ramanathan V Guha, et al. (1995). “CYC: A large-scale investment in
knowledge infrastructure.” In: AAAI/IAAI. Vol. 1995, pp. 673–680 (page 13).
Lenat, Douglas B, Ramanathan V Guha, et al. (1991). “Building large knowledge-based
systems; representation and inference in the Cyc project.” In: AAAI/IAAI. Vol. 91,
pp. 1168–1175 (page 13).
Page, Lawrence et al. (1999). “The PageRank Citation Ranking: Bringing Order to the
Web.” In: Technical report. url: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
(page 13).
Guha, Ramanathan V. (1996). “Metadata for the World Wide Web”. In: Proceedings of
the First International Conference on the World-Wide Web. url: http://www.cs.wustl.edu/~schmidt/PDF/meta-www96.pdf
(page 13).
Pan, Shirui et al. (2023). “Unifying Large Language Models and Knowledge Graphs: A
Roadmap”. In: arXiv preprint arXiv:2306.08302 (page 13).
Xu, Yichong et al. (2021). “Fusing context into knowledge graph for commonsense question
answering”. In: Findings of the Association for Computational Linguistics: ACL-
IJCNLP 2021, pp. 1201–1207 (page 14).
Hu, Nan et al. (2023). “An empirical study of pre-trained language models in simple
knowledge graph question answering”. In: arXiv preprint arXiv:2303.10368 (page 14).
Wang, Xiaozhi et al. (2021). “KEPLER: A Unified Model for Knowledge Embedding
and Pre-trained Language Representation”. In: Transactions of the Association for
Computational Linguistics 9, pp. 176–194 (page 14).
Ke, Pei et al. (Aug. 2021). “JointGT: Graph-Text Joint Representation Learning for Text
Generation from Knowledge Graphs”. In: Findings of the Association for Computational
Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics
(page 14).
Kocetkov, Denis et al. (2022). “The Stack: 3 TB of permissively licensed source code”. In:
Preprint (pages 14, 31, 32).
Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
et al. (2021). “Evaluating Large Language Models Trained on Code”. In: arXiv preprint
arXiv:2107.03374 (pages 14, 71).
Nijkamp, Erik et al. (2022). “A Conversational Paradigm for Program Synthesis”. In:
arXiv preprint (page 14).
Wu, Sen, Laurel Orr, and Manasi Ganti (2023). “Introducing NSQL: Open-source SQL
Copilot Foundation Models”. In: Numbers Station. url: https://www.numbersstation.ai/post/introducing-nsql-open-source-sql-copilot-foundation-models
(page 15).
Fleuret, François (2021). “Deep Learning Course 14x050”. University of Geneva, Switzer-
land. Available at: https://fleuret.org/dlc/. Includes slides, recordings, and a
virtual machine. url: https://fleuret.org/dlc/ (page 20).
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). “Neural machine trans-
lation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473
(page 21).
Louppe, Gilles (2023). “Lecture 7: Attention and Transformers”. Lectures for INFO8010 -
Deep Learning, ULiège, Spring 2023. url: https://github.com/glouppe/info8010-deep-learning
(pages 22, 24).
Zhang, Aston et al. (2021). “Dive into Deep Learning”. In: arXiv preprint arXiv:2106.11342
(pages 22, 24).
Chung, Junyoung et al. (2014). “Empirical Evaluation of Gated Recurrent Neural Networks
on Sequence Modeling”. In: CoRR abs/1412.3555. arXiv: 1412.3555. url:
http://arxiv.org/abs/1412.3555 (page 23).
Shazeer, Noam (2019). “Fast Transformer Decoding: One Write-Head is All You Need”.
In: arXiv preprint arXiv:1911.02150 (pages 25, 26, 30).
Ainslie, Joshua et al. (May 2023). “GQA: Training Generalized Multi-Query Transformer
Models from Multi-Head Checkpoints”. In: arXiv preprint arXiv:2305.13245 (page 26).
He, Kaiming et al. (June 2016). “Deep Residual Learning for Image Recognition”. In:
pp. 770–778. doi: 10.1109/CVPR.2016.90 (page 27).
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton (2016). “Layer Normalization”.
arXiv: 1607.06450 [stat.ML] (page 27).
Liu, Peter J. et al. (2018). “Generating Wikipedia by Summarizing Long Sequences”. In:
CoRR abs/1801.10198. arXiv: 1801.10198. url: http://arxiv.org/abs/1801.10198
(page 28).
Allal, Loubna Ben et al. (2023). “SantaCoder: don’t reach for the stars!” arXiv: 2301.03988
[cs.SE] (page 31).
Kingma, Diederik P. and Jimmy Ba (2017). “Adam: A Method for Stochastic Optimization”.
arXiv: 1412.6980 [cs.LG] (page 33).
Gershgorn, Dave (2021). “GitHub and OpenAI launch a new AI tool that generates its own
code: Microsoft gets a taste of OpenAI’s tech”. In: The Verge. url: https://www.theverge.com/2021/6/29/22555777/github-openai-ai-tool-autocomplete-code
(page 34).
Köpf, Andreas et al. (2023). “OpenAssistant Conversations – Democratizing Large Lan-
guage Model Alignment”. arXiv: 2304.07327 [cs.CL] (page 34).
Conover, Mike et al. (2023). “Free Dolly: Introducing the World’s First Truly Open
Instruction-Tuned LLM”. url: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
(visited on 06/30/2023) (page 34).
Ehrlinger, Lisa and Wolfram Wöß (Sept. 2016). “Towards a Definition of Knowledge
Graphs”. In: SEMANTiCS (Posters, Demos, SuCCESS) (page 37).
Singhal, Amit (May 2012). “Introducing the Knowledge Graph: things, not strings”. In:
Google Blog. url: https://blog.google/products/search/introducing-knowledge-graph-things-not/
(page 37).
Fensel, Dieter, ed. (2011). “Foundations for the Web of Information and Services: A Review
of 20 Years of Semantic Web Research”. Springer Berlin Heidelberg (page 37).
Debruyne, Christophe (2023). “Knowledge Representation and Reasoning”. Lecture 2.
Liège, Belgium: University of Liège. url: https://www.programmes.uliege.be/cocoon/20232024/cours/INFO9014-1.html
(pages 37, 38).
Wu, Honghan et al. (Jan. 2017). “Understanding Knowledge Graphs”. English. In: Exploit-
ing Linked Data and Knowledge Graphs in Large Organisations. Switzerland: Springer
International Publishing AG, pp. 147–180. isbn: 9783319456522. doi: 10.1007/978-
3-319-45654-6_6 (page 38).
De Leenheer, Pieter and Tom Mens (2008). “Ontology Evolution”. In: Ontology Management:
Semantic Web, Semantic Web Services, and Business Applications. Ed. by Martin
Hepp et al. Boston, MA: Springer US, pp. 131–176. isbn: 978-0-387-69900-4. doi:
10.1007/978-0-387-69900-4_5 (page 39).
Gruber, Thomas R. (1995). “Toward principles for the design of ontologies used for
knowledge sharing?” In: International Journal of Human-Computer Studies 43.5,
pp. 907–928. issn: 1071-5819. doi: 10.1006/ijhc.1995.1081. url: https://www.sciencedirect.com/science/article/pii/S1071581985710816
(page 39).
Arenas, Marcelo et al. (2012). “A Direct Mapping of Relational Data to RDF”. W3C
Recommendation. url: http://www.w3.org/TR/2012/REC-rdb-direct-mapping-20120927/
(page 47).
Sequeda, Juan et al. (Jan. 2009). “Direct mapping SQL databases to the semantic web: A
survey”. In: (page 47).
Debruyne, Christophe and Declan O’Sullivan (Jan. 2016). “R2RML-F: Towards Sharing
and Executing Domain Logic in R2RML Mappings”. In: (page 47).
Halevy, Alon et al. (June 2005). “Enterprise information integration: successes, challenges
and controversies”. In: pp. 778–787. doi: 10.1145/1066157.1066246 (page 48).
Poumay, J. (2019). “Term extraction from domain specific texts”. Unpublished MA thesis.
Liège, Belgium: Université de Liège. url: https://matheo.uliege.be/handle/2268.2/7487
(page 49).
Trivedi, Priyansh et al. (2017). “LC-QuAD: A Corpus for Complex Question Answering over
Knowledge Graphs”. In: International Semantic Web Conference. Springer, pp. 210–218
(page 56).
Aghajanyan, Armen, Luke Zettlemoyer, and Sonal Gupta (2020). “Intrinsic Dimensionality
Explains the Effectiveness of Language Model Fine-Tuning”. arXiv: 2012.13255
[cs.LG] (page 57).
Tunstall, Lewis et al. (2023). “Creating a Coding Assistant with StarCoder”. In: Hugging
Face Blog. https://huggingface.co/blog/starchat (page 59).
Cassano, Federico et al. (2022). “MultiPL-E: A Scalable and Extensible Approach to
Benchmarking Neural Code Generation”. arXiv: 2208.08227 [cs.LG] (page 71).