Master Thesis
Tutor
Jean-Louis Jacquerie
Advisor
Ashwin Ittoo
Abstract
Increasingly, large organizations are faced with the challenge of making data
accessible and understandable to non-expert users¹. Despite the advances in natural
language processing and knowledge representation, turning data into natural language
responses that can be understood by a general audience remains a significant challenge.
Moreover, this issue is exacerbated by the exponential growth of information and
the fragmentation of data into isolated silos, which underscores the urgent need for
tools to provide more straightforward, single-point data access.
This thesis aims to address these challenges by introducing the use of Enterprise
Knowledge Graphs as a unified data structure for consolidating and representing
disparate data sources, coupled with SPARCoder, our ontology-aware² Text-to-
SPARQL fine-tuned Large Language Model based on StarCoder (Li et al. 2023),
capable of querying knowledge graphs to retrieve data using natural language. The
proposed natural language "search engine" architecture leverages the strengths of
Large Language Models in understanding and generating human-like text, combined
with the structured representation of information provided by knowledge graphs. In
essence, this approach bridges the gap between complex data and end-users, offering
a more accessible interface.
In this work, we undertake a comprehensive description of our proposed system,
contrasting its advantages and drawbacks with those of traditional methods of data
access and retrieval, as well as with other state-of-the-art large language models.
Consequently, we assert that the integration of large language models with
knowledge graph querying significantly improves data accessibility for non-expert
users. The proposed "search engine" prototype not only facilitates a more intuitive
and accessible way of interacting with data but also opens up new possibilities for
user interaction, leading to more informed and data-driven decision making.
¹ By non-expert users, we refer to individuals who may not have formal training in or deep familiarity with database query languages or advanced data analytics.
² Ontology-aware means that our system aims to understand and utilize the ontology’s semantic information and structures to generate more accurate and semantically relevant SPARQL queries.
Acknowledgements
I’d like to express my deepest gratitude to my tutors, both from the University and
from Safran Aero Boosters. Their guidance and expert support have been instrumental
in this research.
I also wish to thank Safran Aero Boosters for the invaluable internship
experience and for providing crucial computational resources, particularly
access to high-end GPUs, which significantly propelled my work forward.
Special thanks to the Plateau Digital team at Safran. Their welcoming nature
and camaraderie made my time there both enriching and memorable.
On a personal note, my profound thanks go to my parents, whose love, guidance,
and belief in me have been the foundation of all my endeavors. Additionally, I
extend heartfelt appreciation to Charline, my girlfriend, for her enduring support
and encouragement throughout this journey.
Contents
Contents i
1 Introduction 1
1.1 Data Challenges at Safran Aero Boosters . . . . . . . . . . . . . . . . . . . 2
1.1.1 Volume and Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Data Silos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Real-time Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Quality and Consistency . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.5 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.6 Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.7 Tailored Data and Information Retrieval . . . . . . . . . . . . . . . 4
1.2 Information and Data Retrieval: Towards Knowledge Retrieval . . . . . . . 5
1.2.1 Data Retrieval Systems . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Information Retrieval Systems . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Knowledge Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.4 Evolving Information Retrieval: Vector Databases . . . . . . . . . . 6
1.3 Internship and Research Questions . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Internship Proposition . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Choosing a Path: Structured vs. Unstructured Data . . . . . . . . 7
1.3.3 SAB’s Vision and Research Question . . . . . . . . . . . . . . . . . 8
2 Related Work 9
2.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Language Processing before the Deep Learning Age . . . . . . . . . 9
2.1.2 Language Processing in the Deep Learning Age . . . . . . . . . . . 10
2.1.3 Language Processing in the Foundation Models Age . . . . . . . . . 11
2.2 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Brief History of Knowledge Graphs . . . . . . . . . . . . . . . . . . 12
2.3 Unifying LLMs and KGs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Enriching LLMs with KGs . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Amplifying KGs using LLMs . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Mutual Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Large Language Models for Code Generation . . . . . . . . . . . . . . . . . 14
2.4.1 State-of-the-Art in Code Generative LLMs . . . . . . . . . . . . . . 14
2.4.2 From Text-to-SQL to Text-to-SPARQL . . . . . . . . . . . . . . . . 14
3 StarCoder 17
3.1 Key Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Word and Token Embeddings . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Positional Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.3 Context Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.4 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.4.1 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.4.2 Multi-Head vs Multi-Query Attention . . . . . . . . . . . 24
3.1.5 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.5.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.5.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.6 Decoder-only Transformers and LLMs . . . . . . . . . . . . . . . . 28
3.1.6.1 Training Process . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.7 Fast Decoders Transformers . . . . . . . . . . . . . . . . . . . . . . 30
3.2 StarCoder Foundation Model . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1.1 Token Embeddings . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1.2 Positional Embeddings . . . . . . . . . . . . . . . . . . . . 31
3.2.1.3 Fast Decoders Transformer Blocks . . . . . . . . . . . . . 31
3.2.1.4 Linear Head . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 StarCoderBase Data Preparation . . . . . . . . . . . . . . . . . . . 32
3.2.2.1 Source Selection . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2.2 Code Language Selection . . . . . . . . . . . . . . . . . . 32
3.2.2.3 Data Quality Assurance . . . . . . . . . . . . . . . . . . . 32
3.2.3 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3.1 Pre-Training . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3.2 Pre-Training for Python . . . . . . . . . . . . . . . . . . . 33
3.2.3.3 Clusters and Carbon Footprint . . . . . . . . . . . . . . . 33
3.2.4 StarChat: From StarCoder to Assistant . . . . . . . . . . . . . . . . 34
3.2.4.1 Training Details . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.4.2 From StarChat to SPARCoder . . . . . . . . . . . . . . . 34
4.3.1 Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Tools and Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.3 Future Directions and Unstructured Data . . . . . . . . . . . . . . 48
4.4 Data Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Quality Assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5 SPARCoder 51
5.1 Limitations of Solely Relying on Enterprise Knowledge Graphs . . . . . . . 51
5.2 Limitations of Solely Relying on LLMs . . . . . . . . . . . . . . . . . . . . 52
5.3 Selection of the Enterprise Knowledge Graph Platform . . . . . . . . . . . 53
5.4 From StarChat to SPARCoder . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.1 Dataset Selection and Creation . . . . . . . . . . . . . . . . . . . . 56
5.4.2 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4.2.1 Parameter-Efficient Fine-Tuning: LoRA . . . . . . . . . . 57
5.4.2.2 Instruction Fine-Tuning . . . . . . . . . . . . . . . . . . . 58
5.4.2.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4.2.4 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 62
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.6 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.2 Frontend Web Interface . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.3 Flask Server Middleware . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6.4 Backend Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A StarCoder 71
A.1 Glosses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1.1 Employee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B SPARCoder 78
B.1 Interface and Server, Production Considerations . . . . . . . . . . . . . . . 81
B.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.2.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.2.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.2.3 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
B.2.4 Example 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Bibliography 89
Chapter 1
Introduction
In the age of information, our ability to generate and store vast amounts of data has
grown exponentially (Mazumdar et al. 2019). For multinational corporations such as
Safran Group, there is a surging need to store, manage, and access this ever-expanding
information ecosystem. While this large amount of data holds the promise of novel
insights and transformative discoveries through data science, it often remains underutilized.
One of the primary reasons for this underutilization is the complexity and inaccessibility
of data, which hinders its understanding and use.
The power of data is not solely in its volume but in the insights that can be drawn
from it. However, realizing these insights necessitates tools that enable intuitive access
and understanding. Despite being in the golden age of technological advancements in
natural language processing (NLP) and knowledge representation, there still exists a gap
between vast datasets and the ability to extract meaningful information from them. This
disconnect is a pressing concern, especially as the need for data-driven decision-making
becomes pivotal in every sector.
Knowledge Graphs (KGs) have emerged as a promising solution to address this data
fragmentation issue. By consolidating information from multiple sources into a unified
knowledge base, KGs offer a structured representation of data (Studer et al. 1998). However,
while they do consolidate and structure data, the interaction mechanism with these graphs
remains technical, necessitating expertise in querying languages like SPARQL.
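To illustrate the kind of expertise this interaction requires, consider a SPARQL query over a hypothetical employee ontology; the prefix, class, and property names below are invented for illustration and are not taken from any actual enterprise graph:

```sparql
# Hypothetical example: list employees and the trainings they completed
PREFIX ex: <http://example.org/ontology#>

SELECT ?employeeName ?trainingTitle
WHERE {
  ?employee a ex:Employee ;
            ex:name ?employeeName ;
            ex:completedTraining ?training .
  ?training ex:title ?trainingTitle .
}
ORDER BY ?employeeName
```

Writing such a query presupposes knowledge of the graph's vocabulary, of triple patterns, and of SPARQL syntax itself, which is precisely the barrier for non-expert users.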
On the other hand, Large Language Models (LLMs) have demonstrated a strong ability
to understand and generate human-like text (Brown et al. 2020; Devlin et al. 2019), as
well as machine code, making them an excellent bridge between technical data structures
and the end-user (Sun et al. 2023). By integrating the structured approach of KGs with
the intuitive interaction enabled by LLMs, we hope to democratize data accessibility.
In this thesis, our investigation is structured around three primary points. Firstly, we
turn our attention to the cutting-edge domain of Large Language Models, specifically,
the development of a Text-to-SPARQL Assistant Model. This exploration will cover
how to choose an LLM that we can train and use to effectively query an underlying KG,
making the interaction intuitive for end-users. Secondly, we delve into the process of
developing an Enterprise Knowledge Graph, presenting the inherent challenges associated
with its creation and maintenance. This will provide insights into the foundation of our
proposed system, emphasizing the importance of a well-structured and comprehensive KG.
Lastly, we converge our insights from the former domains to present the architecture of our
chatbot-driven search engine, which integrates the enterprise KG with the text-to-SPARQL
model. Thereby, we aim to offer a comprehensive understanding of our system.
1.1 Data Challenges at Safran Aero Boosters
Safran Aero Boosters (SAB) stands as an industry leader in the design, production, and
testing of low-pressure compressors, oil-system equipment, and test benches for aircraft.
As an integral subsidiary of the Safran Group, it underpins the propulsion systems
of countless aircraft around the globe (Figure 1.1). While its primary mission orbits
around the manufacturing of boosters, the intricate realm of aerospace manufacturing
demands high precision and efficiency. In such a complex environment, allocating resources
to continually update and refine state-of-the-art information systems might not always
rank at the top of the priority list. Yet in the competitive arena of aerospace manufacturing,
accentuated by elevated labor costs in Belgium, it’s imperative for SAB to constantly
pioneer in research and development. This ensures not only sustained profitability but
also a competitive edge. Achieving this requires arming the workforce with optimal tools,
both in tangible assets and in cutting-edge information systems. However, this endeavor
presents its own set of challenges.
A primary contributor is the aforementioned variety of data. Given the diverse nature of
data being generated, from structured to unstructured data, it is often more convenient,
or even necessary, to utilize specialized databases tailored to specific data types. For
instance, structured data such as machine recordings might be best managed in SQL-based
relational databases, while unstructured data like quality assurance reports could be better
handled using NoSQL databases.
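As a minimal sketch of the structured side, machine recordings could be stored and queried relationally. The table, columns, and values below are hypothetical, and SQLite stands in for whatever production RDBMS would actually be used:

```python
import sqlite3

# In-memory database standing in for a hypothetical machine-recordings store
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE machine_recordings (
        machine_id TEXT,
        recorded_at TEXT,
        temperature_c REAL
    )
""")
conn.executemany(
    "INSERT INTO machine_recordings VALUES (?, ?, ?)",
    [("M-01", "2023-06-01T10:00", 72.5),
     ("M-01", "2023-06-01T11:00", 75.5),
     ("M-02", "2023-06-01T10:00", 68.0)],
)

# Structured query: average temperature per machine
rows = conn.execute(
    "SELECT machine_id, AVG(temperature_c) FROM machine_recordings "
    "GROUP BY machine_id ORDER BY machine_id"
).fetchall()
print(rows)  # [('M-01', 74.0), ('M-02', 68.0)]
```

The fixed schema is what makes such aggregations trivial; free-text quality reports offer no equivalent handle, which is why they gravitate towards NoSQL stores.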
Another dimension to consider is the historical development of departments or divisions.
Over time, as each unit evolved, they might have adopted or developed systems best
suited to their immediate needs, often without a company-wide strategy in mind. This
decentralized approach, while providing short-term solutions, leads to data fragmentation.
This compartmentalization of data not only hinders a comprehensive view but also poses
challenges in cross-referencing data points across departments. For instance, correlating
insights from the Quality Control Department with data from machine recordings becomes
a laborious task, reducing the potential for data-driven decision-making.
Lastly, the presence of data silos often amplifies redundancy, as the same data might be
stored and managed in multiple locations, leading to inefficiencies in storage and potential
inconsistencies in data interpretation.
In light of these complexities, there’s a pressing need to bridge these silos, ensuring
that data, irrespective of its source or type, is accessible and actionable across the entire
company. Resolving this data fragmentation issue would not only provide a unified view
of data but also lay the foundation for more collaboration across various departments.
1.1.5 Security
Safran Group, being a pivotal player in the industrial landscape, possesses invaluable
intellectual property and industrial secrets that underscore its competitive edge. Given
this unique position, the company is dealing not just with standard corporate data, but
with sensitive information that, if compromised, could undermine its strategic advantage
and market reputation.
Furthermore, Safran’s engagement in defense activities amplifies the importance of
this responsibility. Defense-related data is not only commercially sensitive but can also be
of national security interest. Any breach or unauthorized access to this data could have
consequences beyond the company, potentially impacting national security.
Ensuring maximum security for this data is paramount. This demands a multi-
faceted approach, encompassing rigorous access controls, advanced encryption standards,
and continuous monitoring for potential threats. Additionally, it’s vital for Safran to
foster a security-centric culture, ensuring that every stakeholder, from executives to the
operational workforce, is aware of the importance of data security and is equipped with
the tools and knowledge to uphold it.
1.1.6 Compliance
Since 2018, compliance has extended beyond traditional industrial regulations into
the realm of data protection and privacy. In fact, the emergence of data protection laws,
such as the General Data Protection Regulation (GDPR) in Europe, signifies a global shift
towards safeguarding individual privacy and ensuring responsible data management. For
instance, an important article from the GDPR is Article 5, which stipulates that personal
data shall be processed lawfully, fairly, and in a transparent manner in relation to the data
subject¹, emphasizing principles of data minimization and accuracy (European Commission
2016). Moreover, Article 22 of the GDPR is also particularly pertinent in the context
of modern AI-driven enterprises². It deals with automated individual decision-making,
including profiling, and states that the data subject shall have the right not to be subject
to a decision based solely on automated processing, including profiling, which produces
legal effects concerning him or her or similarly significantly affects him or her (European
Commission 2016).
For Safran, aligning with such regulations is vital, especially given the increasing
integration of AI in various business processes. Adherence implies stringent requirements
on how personal data is collected, stored, processed, and shared. Violations can lead
to severe financial penalties and, more critically, can damage the trust and reputation that
Safran holds with its stakeholders, partners, and customers.
¹ In the context of the GDPR, a data subject refers to an identified or identifiable natural person.
² By modern AI-driven enterprises, we refer to organizations that heavily rely on data to fuel their artificial intelligence and machine learning applications for various processes and decision-making.
Furthermore, the complexity of the data in question can vary. Some users might require
a deep dive into the data, while others could be seeking a broad overview. A tailored data
and information retrieval system must be agile enough to handle both these extremes and
everything in between.
Understanding the context is another pivotal facet. The same piece of data can be
interpreted differently based on the department or the role of the user. For instance, a
sudden increase in raw material costs might be viewed as a procurement challenge by the
buyer’s department, but as a potential price adjustment scenario by the sales department.
A sophisticated retrieval system should not only fetch the data but also provide auxiliary
information or related data points that can aid in contextual understanding.
Lastly, the way information is presented is equally crucial. The interface and user
experience play a significant role in how efficiently users can extract value from the data.
Some might prefer visual representations such as charts or graphs, while others could lean
towards tabulated data or detailed text reports. The retrieval system must be versatile
enough to adapt to these varied presentation preferences.
In essence, the true value of data lies not just in its availability, but in its accessibility.
For SAB, navigating its intricate field of operations and myriad stakeholders, implement-
ing a tailored, context-aware data retrieval mechanism could significantly enhance data
accessibility and utilization.
³ PowerBI (Microsoft) is a tool that is extensively used to join tables from various databases at SAB and in many other companies.
1.2.2 Information Retrieval Systems
In contrast, information retrieval systems navigate the ocean of unstructured or
semi-structured data. Web search engines are the prime exemplars of such systems (Brin
et al. 1998). Unlike SQL-based systems, which rely on structured queries, information
retrieval systems, especially those using keyword-based technologies like ElasticSearch,
scan and index vast amounts of textual data. After indexing, these systems retrieve pertinent
documents or web pages in response to user queries. While their strength lies in pinpointing
data across enormous volumes, their mechanism is predominantly keyword matching,
which may fail to capture nuanced contexts.
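The keyword-matching mechanism can be sketched as a toy inverted index, a drastic simplification of what engines like ElasticSearch do at scale; the documents below are invented for the sketch:

```python
from collections import defaultdict

documents = {
    "doc1": "compressor blade inspection report",
    "doc2": "oil system pressure test results",
    "doc3": "blade surface quality report",
}

# Build the inverted index: term -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def keyword_search(query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index[terms[0]].copy()
    for term in terms[1:]:
        result &= index[term]
    return result

print(sorted(keyword_search("blade report")))  # ['doc1', 'doc3']
```

Note that a query for "rotor" returns nothing even though doc1 and doc3 discuss blades, illustrating the limitation of pure keyword matching.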
Semantic matching furthers this idea by focusing on understanding the meaning or
context behind a query rather than just the exact words. It takes into account synonyms,
related terms, and the broader context of the query to provide more relevant results,
drawing from concepts in natural language processing and machine learning.
1.3.3 SAB’s Vision and Research Question
A key directive from my tutor at SAB was the development of a Chatbot-driven search
engine. This tool was envisioned to be compatible with all the company’s information
systems, ensuring holistic data accessibility. Moreover, it was imperative that this solution
be technologically feasible, cost-effective, and aligned with the SAB data challenges
described in Section 1.1, reflecting a practical approach to enhancing data accessibility.
With this in mind, and given the decision to focus on structured data, as we delved
deeper into potential solutions, we recognized that Enterprise Knowledge Graphs (EKGs)
align well with the constraints presented (Section ??). EKGs, with their structured data
representation and integration capabilities, are designed for large company data. The
availability of commercial solutions and comprehensive EKG platforms suggests their
growing industry acceptance, making them an appealing option for SAB’s requirements.
However, while EKGs offer an organized and integrated data view, their intricacies
might render them less accessible to non-experts. Here’s where LLMs can bridge the gap.
LLMs have the potential to serve as a user-friendly interface, interpreting natural language
queries and retrieving relevant results from the EKGs (Section ??). This combination could
not only meet the constraints set by SAB but also democratize access to the information
housed within EKGs. Thus, my research question crystallized:
Research Question: Can large language models effectively integrate with enterprise
knowledge graphs to enhance data accessibility in enterprises?
1. What architectural factors should be taken into account when selecting the most
suitable Foundation Large Language Model?
Chapter 2
Related Work
approach to linguistic analysis (Chomsky 1957). However, the initial enthusiasm of the
1950s was tempered by the Automatic Language Processing Advisory Committee (ALPAC)
report in 1966, which critiqued the feasibility of MT and recommended reduced funding
(Pierce et al. 1966). Despite this, NLP advancements continued over the following decades:
systems like ELIZA, SHRDLU, and LUNAR showcased the potential of NLP in various
applications (Weizenbaum 1966; Winograd 1971; Woods et al. 1972). The 1980s predominantly
used symbolic approaches, employing complex rules for language parsing (Dyer 1983).
A paradigm shift occurred in the late 1980s and early 1990s, when statistical models,
powered by the rise in computational capabilities and machine learning, began to supplant
traditional rule-based systems, heralding the modern era of NLP research.
a vast array of linguistic tasks, making them a central building block in modern NLP.
This paradigm shift, initiated by foundation models, has redefined the trajectory of NLP.
From originally focusing on crafting tailored architectures for individual tasks, the current
emphasis revolves around maximizing the capabilities of foundation models, steering
research towards more effective adaptation methods (Figure 2.4) and understanding the
subtleties of these models (Ben Zaken et al. 2022; E. J. Hu et al. 2021).
Figure 2.4. Pre-Training and Adaptation of Foundation Models (Bommasani et al. 2021)
Seminal works in the domain explored computational interpretations of these semantic
networks, employing first-order logic (FOL) (Hayes 1981; McCarthy 1989). Initially
grounded in network data models, databases evolved into relational models, sharing the
foundational logic with programming (Taylor et al. 1976; Codd 1982).
Historical KR systems took elements from FOL and semantic networks, demonstrating
the capability to capture and represent diverse knowledge facets, from causality rules to
expert insights. The trajectory of these developments predominantly encapsulated a shift
from explicit representations to expert systems and eventually to extensive common-sense
knowledge bases (Lenat, Ramanathan V Guha, et al. 1995, 1991).
The dawn of the internet age in the mid-1990s fundamentally altered the landscape.
With the information explosion, methods to access, comprehend, and search this
information went through rapid evolution. Algorithms like PageRank marked the initial
breakthroughs (Page et al. 1999). However, the vision of semantic-enhanced searches soon
became a reality with resources like Wikidata and Data Commons, rooted in the principles
of the earlier Meta Content Format (Ramanathan V. Guha 1996). In contrast to earlier
AI systems, modern KGs largely focus on representing vast collections of ground facts,
placing less emphasis on complex inference.
More recently, Knowledge Graphs are regaining attention in the AI domain. Their
evolution, from early directed labeled graphs to today’s sophisticated KGs, shows the
progression and aspirations of the AI domain, striving for more meaningful and comprehensible
representations of knowledge (Chaudhri et al. 2022).
Figure 2.5. LLMs and KGs, Pros and Cons (Pan et al. 2023)
Three main strategies seem to have emerged for unifying LLMs and knowledge graphs
(Pan et al. 2023).
2.3.1 Enriching LLMs with KGs
This modality focuses on infusing LLMs with structured knowledge from KGs during their
formative and operational stages. The primary intent is to make LLMs more factual and
knowledge-aware (Xu et al. 2021).
translating natural language queries into Structured Query Language (SQL) commands
(Sun et al. 2023). We can also cite NSQL from Numbers Station, another Text-to-SQL
family of models based on CodeGen models (S. Wu et al. 2023).
Returning to our case study, this capability helps confirm the feasibility of our research’s
focus: developing an LLM adept at Text-to-SPARQL transformations. Given the structural
similarities between SQL and SPARQL and the demonstrated capabilities of existing
models, there is a strong basis for the potential success of our proposed LLM.
Chapter 3
StarCoder
In the vast landscape of LLMs for code generation, selecting the right technology is the
first step towards the success of our system. The model that we are looking for not
only needs to understand the nuances of human language but also to interact with
programming-centric data structures. This section motivates our model choice, StarCoder
(Li et al. 2023), because it outperforms open-source Code LLMs and is in close contention
for the top spot with closed-source models (Appendix A).
StarCoder stands out as an avant-garde creation from the BigCode community, a
collaborative endeavor dedicated to innovating Large Language Models specifically for
code (Code LLMs). It sits at the intersection of advanced natural language processing and
code understanding, with an interesting ratio between its large context window¹ and its
relatively small size (Appendix A), making it a prime candidate as the foundation model
for our ontology-aware Text-to-SPARQL LLM.
In the following, we explain in depth the motivation behind our choice of StarCoder
as our foundation model and how it aligns with the objectives of our system. In
Section 3.1, we detail its key components; then, in Section 3.2, we present StarCoder
as our foundation model, along with the data it was trained on. Let’s start by discussing
the key architectural components before bringing them together to form StarCoder’s
full architecture.
¹ The context window of an LLM refers to the amount of text it can consider at once.
high-dimensional space. This concept is fundamental for natural language models, as it
provides a dense and continuous representation for words, as opposed to sparse and discrete
representations like one-hot encoding.
This concept was popularized by word2vec (Tomas Mikolov et al. 2013), which
is one of the most popular word embedding techniques. Word2vec uses shallow neural
networks to produce word embeddings by either predicting the context given a word
(Skip-Gram) or predicting a word given its context (Continuous Bag of Words).
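As a rough sketch of the Skip-Gram setup (not the original word2vec implementation), the training pairs are built by sliding a window over the corpus, pairing each center word with its neighbours:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-Gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the compressor feeds the engine".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('the', 'compressor'), ('compressor', 'the'), ('compressor', 'feeds'),
#  ('feeds', 'compressor'), ('feeds', 'the'), ('the', 'feeds'),
#  ('the', 'engine'), ('engine', 'the')]
```

A shallow network is then trained to predict the context token from the center token (or vice versa for CBOW), and the learned input weights become the word embeddings.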
In the context of StarCoder, and more generally of transformer models,
the concept of word embeddings is slightly modified into what are called token embeddings.
Unlike traditional word embeddings that represent individual words, token embeddings
represent tokens, which can be as short as a single character or as long as a word (Figure
3.3). This granularity is crucial for tasks like code generation, where understanding
individual symbols, operators, or short sequences can be as important as understanding
full words.
² Source: https://www.tensorflow.org/tutorials/representation/word2vec
³ Source: https://jalammar.github.io/illustrated-word2vec/
Figure 3.3. OpenAI’s GPT-3 Tokenizer⁴
The tokenization process breaks down input text into these smaller units or tokens.
For instance, a line of code might be tokenized into individual symbols, keywords, and
identifiers. This tokenization allows the model to understand and generate code at a
granular level, ensuring that even subtle nuances in code syntax and semantics are captured.
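A toy illustration of how a line of code might be broken into such units; real tokenizers like StarCoder's use a learned byte-pair encoding vocabulary rather than hand-written rules, so this regex sketch only conveys the idea:

```python
import re

# Hypothetical token pattern: identifiers, numbers, multi-char operators, punctuation
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|[-+*/=<>(){}\[\];:,.]")

def toy_tokenize(code):
    """Split a line of code into coarse tokens."""
    return TOKEN_PATTERN.findall(code)

print(toy_tokenize("if pressure >= 3: valve.open()"))
# ['if', 'pressure', '>=', '3', ':', 'valve', '.', 'open', '(', ')']
```

Even this crude splitter shows why granularity matters: the operator `>=` must survive as one unit, while `valve.open()` decomposes into an identifier, an accessor, and a call.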
Once the text is tokenized, each token is mapped to a vector in a high-dimensional
space, akin to conventional word embeddings. This vector representation is refined during
the training phase. The embeddings encapsulate not only semantic and syntactic details
about the tokens but also morphological information. Morphology delves into the internal
composition of words, breaking them down into fundamental units called morphemes. For
instance, in the word ’compressor’ (Figure 3.3), ’compress’ denotes the root morpheme
conveying the core meaning, while ’or’ is a derivational morpheme that transforms the verb
compress into a noun indicating an entity performing the action. Embeddings can similarly
capture conjugation details.
4. https://platform.openai.com/tokenizer
Figure 3.4. Influence of Positional Encoding on Training (Fleuret 2021)
In practice, positional encodings are added to the word embeddings (Figure 3.5).
These positional encodings are illustrated in Figure 3.6, and their mathematical
formulation is given by:
$$PE(t, 2i) = \sin\left(\frac{t}{10000^{2i/d_{model}}}\right)$$
$$PE(t, 2i+1) = \cos\left(\frac{t}{10000^{2i/d_{model}}}\right)$$
Where:
• P E(t, 2i) represents the positional encoding for even indices 2i of the dimension.
• P E(t, 2i+1) represents the positional encoding for odd indices 2i+1 of the dimension.
5. https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers
3.1.3 Context Length
Context Length, commonly also called Max Context Length or Context Window, refers to
the number of tokens an LLM can consider at once. Since the primary purpose of positional
encodings is to provide the model with the order of tokens, they must be defined so that
they can accommodate the desired context length, giving each potential position in the
input a unique encoding.
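The sinusoidal formulation above can be sketched directly; this is an illustrative NumPy implementation (the context length and model dimension are arbitrary), producing one distinct encoding per position up to the chosen context length:

```python
import numpy as np

def positional_encoding(context_length: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: sin at even indices, cos at odd indices."""
    positions = np.arange(context_length)[:, np.newaxis]     # shape (t, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even dimensions 2i
    angles = positions / np.power(10000.0, dims / d_model)   # t / 10000^(2i/d_model)
    pe = np.zeros((context_length, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices 2i
    pe[:, 1::2] = np.cos(angles)   # odd indices 2i+1
    return pe

pe = positional_encoding(context_length=2048, d_model=512)
print(pe.shape)  # (2048, 512)
```

Each row is the encoding added to the token embedding at that position; the mixture of frequencies is what makes every row distinct.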
$$\mathrm{attention}\big(q, \{(k_i, v_i)\}\big) = \sum_{i=1}^{m} \mathrm{softmax}_i\big(a(q, k_i)\big)\, v_i$$
Here, softmax_i refers to the softmax operation applied to the i-th component. a(q, k)
is the scoring function between the query q ∈ R^q and the key k ∈ R^k. It is given by:
$$a(q, k) = w_v^\top \tanh\big(W_q^\top q + W_k^\top k\big)$$
with
$$w_v \in \mathbb{R}^h, \quad W_q \in \mathbb{R}^{q \times h}, \quad W_k \in \mathbb{R}^{k \times h}$$
6. A sequence-to-sequence model refers to a neural network architecture designed to take a sequence as input and produce a sequence as output.
Figure 3.7. Attention Layer (Louppe 2023; Zhang et al. 2021)
The function a(q, k) essentially computes a scalar that measures the similarity between
the query and the key. The learnable weight matrices Wq and Wk transform the original
query and key into a shared representation space, and the resulting vectors are then
combined using the weight vector wv. The tanh function ensures the output lies between
-1 and 1. The final softmax operation across all keys ensures that the attention weights
sum to 1, allowing the mechanism to distribute its attention across the key-value pairs.
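As a sketch, the additive scoring and softmax steps described above can be written as follows; the dimensions and the random parameter initialisation are illustrative assumptions, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
q_dim, k_dim, h = 4, 6, 8   # illustrative query, key, and hidden dimensions

# Learnable parameters of the additive scoring function (randomly initialised here).
W_q = rng.normal(size=(q_dim, h))
W_k = rng.normal(size=(k_dim, h))
w_v = rng.normal(size=h)

def score(q, k):
    """Additive attention score: w_v . tanh(W_q^T q + W_k^T k)."""
    return w_v @ np.tanh(q @ W_q + k @ W_k)

def attention(q, keys, values):
    """Softmax over the scores, then a weighted sum of the values."""
    scores = np.array([score(q, k) for k in keys])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # attention weights sum to 1
    return weights @ values, weights

q = rng.normal(size=q_dim)
keys = rng.normal(size=(5, k_dim))
values = rng.normal(size=(5, 3))
out, weights = attention(q, keys, values)
print(out.shape)  # (3,)
```

The output is a convex combination of the values, weighted by how well each key matches the query.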
Thereby, given the two input sequences X ∈ R^{n×x} and X′ ∈ R^{m×x′} (n and m are the
sequence lengths and x and x′ are the embedding sizes), the formulation of a classical
attention layer is such that,
$$\mathrm{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} \qquad (3.3)$$
with,
$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q \in \mathbb{R}^{n \times d}, \quad \mathbf{K} = \mathbf{X}'\mathbf{W}_K \in \mathbb{R}^{m \times d}, \quad \mathbf{V} = \mathbf{X}'\mathbf{W}_V \in \mathbb{R}^{m \times v}$$
3.1.4.1 Self-Attention
The standard attention mechanism as described above maps a query to a set of key-value
pairs. Self-attention, the variant of the attention mechanism used in the Transformer
architecture, does not require separate sets of queries, keys, and values: all three
derive from the same input sequence. This enables the model to focus on different parts
of the input sequence when generating its output, capturing the relation between each
token in the sequence.
Mathematically, in self-attention, Q, K, and V are all derived from the same input
sequence X. For a given input sequence X ∈ Rn×d , the queries, keys, and values are
computed as:
$$\mathbf{Q} = \mathbf{X}\mathbf{W}_q \in \mathbb{R}^{n \times d}, \quad \mathbf{W}_q \in \mathbb{R}^{d \times d}$$
$$\mathbf{K} = \mathbf{X}\mathbf{W}_k \in \mathbb{R}^{n \times d}, \quad \mathbf{W}_k \in \mathbb{R}^{d \times d}$$
$$\mathbf{V} = \mathbf{X}\mathbf{W}_v \in \mathbb{R}^{n \times v}, \quad \mathbf{W}_v \in \mathbb{R}^{d \times v}$$
As a result, the attention scores and the output are determined entirely by the input
sequence X. In essence, every token within the sequence gets an opportunity to interact
with every other token, regardless of distance or position. To exemplify, Figures 3.8
and 3.9 show self-attention scores. In Figure 3.8 we can observe that the model gives the
highest attention score to the word internship, which is referred to by it. Similarly, in
Figure 3.9 we can see that employee refers to Paul. This design inherently overcomes the
limitations imposed by predefined context sizes, enabling the model to identify and
leverage dependencies that span across long sequences.
The last important point about the attention mechanism lies in its inherent parallelizability.
Traditional recurrent architectures, such as LSTMs (Hochreiter et al. 1997) or GRUs (Chung
et al. 2014), process input sequences sequentially, which inherently limits the potential for
7. https://github.com/jessevig/bertviz
parallel computation. In contrast, the self-attention mechanism in Transformers treats
each position in the sequence independently and in parallel. This means that all positions
can be computed simultaneously, significantly speeding up training and inference times,
especially when leveraging modern GPU architectures.
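The parallel nature of self-attention is visible in a minimal NumPy sketch: the softmax of QKᵀ/√d is computed for all positions at once, with no sequential loop over the sequence. The dimensions and random projections are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 16                      # sequence length and embedding size (illustrative)
X = rng.normal(size=(n, d))       # the input sequence

# Projection matrices for queries, keys, and values (randomly initialised sketch).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)                        # (n, n) pairwise similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
output = weights @ V                                 # every token attends to all tokens

print(output.shape)  # (5, 16)
```

All n×n interactions are computed in a handful of matrix products, which is exactly what maps well onto GPU hardware.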
While the self-attention mechanism offers significant benefits in terms of capturing
dependencies within sequences, variants have been proposed to expand its power. The
concept of multi-head attention was first introduced by Vaswani et al. (2017) in the
paper Attention Is All You Need, which introduced what are commonly called Transformers.
This mechanism employs multiple sets of learnable weight matrices for queries, keys, and
values, as shown in Figure 3.10.
Formally, for multi-head attention, given h heads H_i, we have:
$$\mathbf{H}_i = \mathrm{attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
with W_i^Q, W_i^K, and W_i^V the weight matrices for the i-th head.
Thus, the output of the multi-head attention mechanism is:
$$\mathrm{multihead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{concat}(\mathbf{H}_1, \ldots, \mathbf{H}_h)\,\mathbf{W}^O$$
with,
$$\mathbf{W}^O \in \mathbb{R}^{h d_v \times d_{model}}$$
Figure 3.10. Multi-Head Attention (Vaswani et al. 2017)
Figure 3.11. Multi-Query Attention
Thus, as for the multi-head attention mechanism, the output of the multi-query
attention mechanism is given by:
$$\mathrm{multiquery}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{concat}(\mathbf{H}_1, \ldots, \mathbf{H}_h)\,\mathbf{W}^O, \qquad \mathbf{H}_i = \mathrm{attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}^K, \mathbf{V}\mathbf{W}^V)$$
with,
$$\mathbf{W}^O \in \mathbb{R}^{h d_v \times d_{model}}$$
where the key and value projections W^K and W^V are shared across all heads.
• Memory Efficiency: Since there’s no need for additional key and value matrices for
each perspective, multi-query attention is more memory efficient than its multi-head
cousin.
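A back-of-the-envelope comparison illustrates this memory saving: during decoding, multi-head attention caches separate keys and values per head, while multi-query attention caches a single shared set. The numbers below are illustrative, not StarCoder's actual configuration:

```python
# Illustrative KV-cache size comparison (element counts, not bytes).
n_heads, d_head, seq_len, n_layers = 16, 64, 2048, 24

# Multi-head: every head stores its own keys and values.
multi_head_kv = 2 * n_layers * seq_len * n_heads * d_head

# Multi-query: one shared key/value set per layer.
multi_query_kv = 2 * n_layers * seq_len * d_head

print(multi_head_kv // multi_query_kv)  # 16
```

The cache shrinks by a factor equal to the number of heads, which is what makes multi-query decoding faster and lighter in memory.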
3.1.5 Transformers
Transformers, introduced by Vaswani et al. (2017) in the paper Attention Is All You Need,
have strongly impacted the field of natural language processing by providing a new paradigm
for sequence modeling. Unlike traditional LSTMs and GRUs, which rely on sequential
computation, Transformers exploit the parallel processing capabilities of modern GPUs
through attention mechanisms, making training more efficient, enabling them to capture
longer-range dependencies, and allowing Transformer models to scale efficiently.
The Transformer model consists of token embeddings and positional encoding blocks,
as well as the core of the architecture, an encoder and a decoder stack, as described in
Figure 3.12. The encoder and decoder each contain multiple blocks. Let's outline the
components of the architecture and their interactions, starting with the encoder.
Figure 3.12. The Transformers - Model Architecture (Vaswani et al. 2017)
3.1.5.1 Encoder
Each encoder block (on the left of Figure 3.12), repeated Nx times, consists of three main
components:
• Multi-head self-attention layers
• Position-wise feed-forward network layers
• Normalization layers
The input to the encoder is passed through the multi-head self-attention block, which
allows the model to weigh the relevance of different parts of the input sequence, as
previously detailed (section 3.1.4). The output of the attention layer then passes
through a feed-forward neural network, applied identically to each position. Additionally,
residual connections surround the multi-head self-attention layers and feed-forward
network layers, followed by layer normalization (He et al. 2016; Ba et al. 2016).
3.1.5.2 Decoder
The decoder (on the right of Figure 3.12), repeated Nx times, has an architecture similar
to the encoder but with one additional multi-head attention layer, which attends to the
encoder's output. The four components of each decoder layer are:
• Masked multi-head self-attention layers
• Multi-head encoder-decoder attention layers
• Position-wise feed-forward network layers
• Normalization layers
Masking in the self-attention mechanism ensures that the prediction for a particular
word does not depend on future words in the sequence: tokens after the current position
are masked, preserving the auto-regressive property. In other words, during the attention
computation, future tokens have no influence on the current or previous tokens. For a
more visual understanding of the Transformer's components, refer to Figure 3.12
accompanying this description.
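A minimal sketch of this causal masking: scores for future positions are set to −∞ before the softmax, so their attention weights become exactly zero (the uniform scores are a placeholder):

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                            # placeholder attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the diagonal = future
scores[mask] = -np.inf                               # future positions are masked out

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax

print(weights.round(2))
# row t spreads its attention over tokens 0..t only; future tokens get weight 0
```

Since exp(−∞) = 0, each token's prediction is computed as if the sequence ended at its own position, which is what preserves the auto-regressive property.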
Positional Encoding
As explained in section 3.1.4, attention layers do not integrate any form of recurrence.
To make up for this, and to give the Transformer some information about the relative and
absolute position of words in a sequence, the authors used positional encodings. These
encodings are added to the input embeddings at the entry of the encoder and decoder. The
encodings use sine and cosine functions of different frequencies (section 3.1.2), which
ensure a unique encoding for each position (Figure 3.6).
7. E.g., for translation tasks
GPTs are first pre-trained on huge corpora of text. During this phase, the model learns
to predict the next word in a sequence, effectively becoming a language model. This
self-supervised learning helps the model capture the structure of our language, as
well as vast amounts of general linguistic knowledge from diverse contexts, without
explicit labels.
$$L = -\frac{1}{T}\sum_{t=1}^{T} \log P(w_t \mid w_1, \ldots, w_{t-1})$$
• For classification, it’s typically Cross-Entropy Loss between the predicted and
true class labels.
• For regression tasks, it might be Mean Squared Error between the predicted
and true values.
However, the core idea remains rooted in the cross-entropy loss, especially for tasks
that involve predicting a probability distribution over discrete outcomes.
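As a toy numeric check of this loss, assuming made-up model probabilities for the correct next word at each position:

```python
import math

# Hypothetical probabilities the model assigns to the correct token at each step.
probs_of_correct_token = [0.9, 0.6, 0.8]   # illustrative values, not a real model
T = len(probs_of_correct_token)

# L = -(1/T) * sum over t of log P(w_t | w_1, ..., w_{t-1})
loss = -sum(math.log(p) for p in probs_of_correct_token) / T
print(round(loss, 4))  # 0.2798
```

A perfect model (probability 1 everywhere) would give a loss of 0; the less probability mass the model places on the correct next token, the larger the loss.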
9. We use words for a better understanding; however, in practice the cross-entropy loss is computed over tokens.
Figure 3.16. GPT Downstream Tasks Fine-tuning (Radford et al. 2018)
Empirical evaluations have showcased the strengths of this approach. Decoding using
the multi-query attention mechanism is notably faster compared to the traditional multi-
head attention mechanism. Crucially, this boost in speed does not come at the expense
of performance; the observed quality degradation is minimal, making it an acceptable
trade-off for many applications.
Having delved into the essential components that constitute StarCoder, from the
initial token embeddings to the innovations in fast decoders, it's clear that each element
plays a crucial role. These are not just isolated concepts; together, they pave the path to
the very essence of our focus, the StarCoder Foundation Model.
3.2.1 Architecture
StarCoder's architecture, based on SantaCoder's architecture (Allal et al. 2023), follows
the principles of the Generative Pre-Trained Transformers family while embracing special
components suited to coding tasks; it also allows a large context window while keeping a
relatively low number of parameters (Appendix A, Figure A.1).
3.2.2 StarCoderBase Data Preparation
Building an effective Code LLM demands meticulous data selection. In this section, we
summarize the data preparation used by the BigCode team to pretrain StarCoderBase.
• Volume and Popularity: They selected languages with data exceeding 500 MB
and those ranking in the top 50 on platforms such as Githut10 and the TIOBE
Index11 2022.
The dataset also made room for data formats like JSON and YAML, but with
restricted volume, given their data-centric nature as opposed to code-centric.
• Filtering: To ensure data quality, various filters were applied: XML filter to
eliminate non-code XML content, an alpha filter targeting non-code files based on
alphabetic character count, an HTML filter focusing on content visibility, and specific
length and character-based filters for data-heavy formats like JSON and YAML.
10. Githut is an analytical tool that presents statistics about programming languages based on the number of repositories and pushes to repositories on GitHub. It provides insights into the popularity and activity levels of programming languages.
11. TIOBE Index is an indicator of the popularity of programming languages.
3.2.3 Training Details
In order to transform this untrained architecture into a foundation model, the BigCode
team performed extensive training. In the subsequent sections, we detail the BigCode
team's training process.
3.2.3.1 Pre-Training
StarCoderBase has been trained over the huge dataset detailed in the previous section
(3.2.2), following these settings:
• Iterations: 250 000
• Batch Size: 4M tokens
• Cumulative Tokens: 1T
• Training Data: Prepared dataset (section 3.2.2)
• Optimizer: Adam (Kingma et al. 2017) with parameters β1 = 0.9, β2 = 0.95, ε = 10−8
• Weight Decay: 0.1
• Learning Rate: Cosine decay, starting at 3 × 10−4 and attenuating to 3 × 10−5
following 2 000 iterations of linear warm-up.
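The learning-rate settings above can be sketched as a schedule function; this is an illustrative reconstruction of linear warm-up followed by cosine decay, not BigCode's actual training code:

```python
import math

def lr_schedule(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, total=250_000):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 at end of warm-up, 1 at the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate ramps up linearly over the first 2,000 iterations, peaks at 3 × 10⁻⁴, and decays smoothly to 3 × 10⁻⁵ by iteration 250,000.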
region and average AWS datacenter metrics, this results in an estimated CO2 emission
of 16.68 tonnes12, underscoring the environmental considerations to keep in mind when
developing LLMs.
• Built-in Special Tokens: One of the striking features that make StarChat a
suitable candidate for conversational code assistance is its familiarity with special
tokens <|assistant|> and <|user|>. These tokens enable the model to understand
and delineate between user prompts and model-generated responses, allowing for
more structured and coherent interactions.
• Versatility: StarChat is designed not only to understand and generate code but
also to converse about it. This dual capability is essential for users who need not
just code generation but also explanations and rephrasing of results.
12. It is equivalent to over two times the average annual CO2 emissions per capita in Belgium, which stood at 8.10 metric tons in 2019 (Data Commons). This underscores the environmental impact of training LLMs. One potential approach to mitigate such environmental costs is to leverage pretrained LLMs, which can be fine-tuned for specific tasks without the need for extensive retraining. Such strategies accentuate the need for sustainable and environmentally conscious approaches in AI research and development.
13. StarCoderEx is a Code Generator extension for VS Code based on StarCoder.
What architectural factors should be taken into account when selecting the most
suitable Foundation Large Language Model? (Underlying Research Question 1.3.3)
The decision to use StarChat as the foundation for our SPARCoder model was driven
by several compelling reasons:
• Context Length: StarCoder boasts the longest context length compared to other
open-source models (Appendix A.1).
• Coding and Natural Language Capabilities: Given its training on code, including
SPARQL (Appendix A.3), and on conversational datasets, a SPARQL-finetuned
StarChat can provide guidance by understanding natural language queries and
translating them into SPARQL.
Chapter 4
As discussed in Section 1.1, the ability to capture, structure, and utilize the
vast amounts of enterprise data has become pivotal. This data, varied in source and
nature, can become overwhelming if not handled efficiently, leading to inefficiencies and
missed opportunities. This is precisely where Knowledge Graphs, especially Enterprise
Knowledge Graphs (EKG), become an appropriate tool.
A Knowledge Graph (KG) can be broadly defined as a graph-structured knowledge base,
designed to store information in nodes (entities) and edges (relationships) to represent and
connect real-world entities and their interrelations in a semantically meaningful manner
(Ehrlinger et al. 2016). While the concept of Knowledge Graphs has been around for some
time, their relevance and utility have skyrocketed with the ascent of large-scale knowledge
bases like Google’s Knowledge Graph, which aims to understand facts about people, places,
and things and how these entities are all interconnected (Singhal 2012).
Enterprise Knowledge Graphs, on the other hand, are specialized versions of knowledge
graphs tailored for the needs of enterprises. They bridge silos of data, provide a unified
view of data sources, and enable advanced analytics, thereby powering more informed
decision-making processes. The use of EKGs aids businesses in recognizing patterns, opti-
mizing operations, fostering innovation, and enhancing customer experiences by leveraging
connections that would have otherwise remained hidden in isolated data sets (Fensel 2011).
Despite their promising benefits, the deployment and maintenance of EKGs are not
without challenges. These include issues related to data integration, scalability, real-time
processing, security, and more. The intricacies of such challenges and the processes of
implementing Enterprise Knowledge Graphs can often be overwhelming. To provide clarity
and a structured approach to these complexities, this section has been largely inspired by
the Knowledge Representation and Reasoning course given by professor Debruyne (2023).
Thereby, we will follow the roadmap of building and maintaining a KG as described in
Figure 4.1. To further elucidate this process, we will use the practical use case which has
been given by SAB, which involves integrating Employee, Training, and Digital Training
databases to pinpoint areas of expertise of each employee, department, etc.
Figure 4.1. Building and Maintaining a KG (Debruyne 2023), based on (H. Wu et al. 2017)
capture this intricate granularity, linking each employee’s professional data with their
training records. Beyond offering a view of their capabilities, the EKG also traces the
professional interconnections among employees. While such a consolidated representation
can significantly aid HR and management decisions, its fundamental purpose is to interface
with SPARCoder, our Text-to-SPARQL LLM. This ensures precise answers concerning
employees, their interrelations, and training.
Having established the foundational need and purpose for our EKG at SAB, let’s now
delve into the ontology development process.
Definition 4.2.2 An Ontology is a [formal,] explicit specification of a [shared] conceptu-
alization.
With this in mind, let’s delve into the ontology development process. In the subsequent
sections, we will walk through each step involved in this process, beginning with the task
of defining the scope of application for our ontologies.
4.2.2 Application
As we have already discussed, defining the scope of application is an essential first step in
ontology development, as it establishes the boundaries of the knowledge domain and guides
the subsequent design and implementation stages. This foundational phase is crucial in
ensuring that the ontology is purposeful, fit for its intended use, and avoids unnecessary
complexity.
Our application is designed with a primary focus on querying the corporate hierarchy
and expertise within a company. It integrates data from three primary sources, namely
the Employee, Training, and Digital Academy Training databases, as described in section
4.2.3.
The Employee database primarily provides data concerning employee details, including
their hierarchical positions within the company, their department, etc. The Training
database is the source for information related to employee training and the specific skills or
knowledge they have gained as a result. Lastly, the Digital Academy Training database
is the source for information related to online training undertaken by employees, offering
additional insight into their skills and knowledge base.
The key constraints in our ontology model pertain to the relationships between different
entities. For example, every employee is assumed to be part of a department and reports to
a manager. Concerning business rules, we can incorporate the rule that an employee
can be part of multiple formations, representing their multidisciplinary training and
expertise. Rules and constraints will be discussed in the ontology development section
4.2.5.
In conclusion, the goal of our application is to leverage the ontology and underlying
databases to represent and facilitate queries pertaining to the company hierarchy and
expertise.
4.2.3 Databases
For the purpose of our proof of concept project, we leverage three separate CSV files,
extracted from databases, as our core databases: the Employee, the Training, and the
Digital Academy Training databases (Figure 4.2). In order to comply with GDPR
regulations, these databases are populated with anonymized data, effectively eliminating
sensitive information such as names, email addresses, and other personally identifiable
details. This approach is adopted in the interest of maintaining data security, ensuring that
our project activities do not pose any potential risk to the confidentiality of the data sub-
jects. Using CSV databases also provides a crucial advantage by eliminating the need for
direct connections to production databases, thereby preventing the risk of unintentionally
affecting the integrity and availability of the operational systems. Furthermore, utilizing
CSV files simplifies the overall setup of the project due to their ubiquitous compatibility
and ease of manipulation. Such an approach also greatly enhances the accessibility and
transferability of our project, allowing for effortless migration, inspection, and sharing of
the datasets.
However, it’s important to note that the approach and methodology we’re utilizing for
this proof of concept project are not strictly confined to CSV files or anonymized databases.
Indeed, the same principles can be efficiently employed when dealing with actual, live
databases, whether they are hosted on AWS, Oracle, or other such platforms. The process is
complemented by tools provided by Knowledge Graph platforms, which offer a multitude
of connectors for various databases1. These connectors facilitate seamless interfacing with
a wide array of databases, enabling us to retrieve, manipulate, and annotate data from
multiple sources. Therefore, while we’re using CSV files for their simplicity and security
advantages in the context of this proof of concept, we could just as easily adapt our
approach to real-world, production databases when the necessity arises.
4.2.4 Namespace
Within the field of ontology development, namespaces stand as a primordial element for
identifying resources. Understanding namespaces is crucial to ensuring the unambiguity
and reliability of ontologies, especially in environments where multiple ontologies coexist.
The namespace, a URI selected for an ontology, plays a role in ensuring its unique
identification; in other words, a URI namespace is used to identify an ontology. To that
end, our ontology follows a systematic approach in determining its namespace, which we
will exemplify with our use case. Note also that a prefix is a short, human-readable label
that stands in for a full namespace URI. Instead of writing the full URI, we can use the
prefix, which acts as an abbreviation. Once a prefix is declared and associated with a
namespace URI, it can be used in its place.
1. E.g., https://www.stardog.com/platform/connectors/
4.2.4.1 URI Selection
A common practice is to represent ontology namespaces as URLs2 . Although it need
not be a live URL, the namespace chosen must be associated with the project’s domain,
offering clarity regarding its source. For instance, if the ontology is developed under the
auspices of a company, the namespace could start with the company’s website URL, such
as http://www.company-example.com/ontology/.
4.2.4.3 Versioning
To accommodate the evolution of the ontology over time, our namespace includes versioning
information. This approach is particularly helpful when updates are made, providing a
clear record of the ontology’s version history, e.g.,
http://www.company-example.com/ontology/example_onto/v0/.
2. A URL (Uniform Resource Locator) is a specific type of URI (Uniform Resource Identifier) that not only allows one to identify a resource but also provides a means to locate it.
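A minimal sketch of how a prefix abbreviates the versioned namespace above (the prefix name `ex` and the helper function are illustrative assumptions):

```python
# Hypothetical prefix table for the example namespace used in this section.
PREFIXES = {
    "ex": "http://www.company-example.com/ontology/example_onto/v0/",
}

def expand(curie: str) -> str:
    """Expand a prefixed name such as 'ex:Employee' into its full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("ex:Employee"))
# http://www.company-example.com/ontology/example_onto/v0/Employee
```

This is exactly the substitution that SPARQL and Turtle perform when a `PREFIX` (or `@prefix`) declaration is in scope.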
4.2.5 Ontologies Engineering
As already mentioned, our project leverages two distinct ontologies to encapsulate concepts,
relations, and instances drawn from our databases.
Relations include:
4.2.6 Glosses
In Annex A.1, we present a glossary of the key concepts used in our ontologies. These
tables serve as a dual-language reference, providing human-readable labels and definitions
in both English and French3. The definitions focus on distinguishing characteristics and
intrinsic properties that differentiate each concept, adhering to guidelines of clear glossary
construction4.
3. It's pertinent to mention that SAB is a company where French is the predominant language of communication. Consequently, to ensure clarity and facilitate comprehension for the stakeholders, it was essential to incorporate French descriptions. This approach aligns with the company's linguistic context and adapts to the native linguistic preferences of its employees.
4. It's essential to note that the descriptions provided in this glossary are general in nature and serve primarily for illustrative purposes in this context. They may not necessarily mirror the exact terminologies or definitions as employed within the applications at SAB.
4.2.7 Visualization
[Figure (diagram): visualization of the ontologies, showing classes such as Office, Company, Employee, Active Employee, Department, Training, SAB Training, and Digital Academy Training (with external counterparts for Person, SpatialThing, Employee, and Company); their datatype properties (e.g., name, surname, email, registration number, phone number, position, latitude, longitude, start date, end date, course level, stars earned); and relations such as works for, employs, belongs to, includes, is located at, is secretary of, is correspondent of, participated in, followed by, and provided by.]
4.3 Data Transformation
Data Transformation refers to the process of converting non-KG data into Knowledge
Graphs. In the context of EKGs, data transformation is crucial, as data often resides
in diverse structured repositories like relational databases. The primary goal of the
transformation process is to ensure that data is in a suitable format and structure to
be ingested and integrated into the knowledge graph, maintaining the semantic meaning
and relationships between data items; this process is also referred to as mapping the data,
or populating the KG. As detailed in Figure 4.1, this process can be largely automated,
as mapping scripts can be written, or generated through graphical tools, in order to
populate the KG automatically.
5. It's worth noting that while SQL databases are a primary example of structured data sources, the term "structured data" isn't exclusive to them. Structured data refers to any data that adheres to a specific format or model, making it easily searchable and queryable. For instance, even data scraped from web pages, such as LinkedIn profiles, can be considered structured if it's consistently organized into defined fields or categories (e.g., Name, Position, Company, etc.). Essentially, any data source that provides information in a predictable, fielded format can be categorized under structured data, irrespective of its underlying storage or retrieval mechanism.
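To illustrate what such a mapping script can look like, here is a minimal sketch that turns a (hypothetical) Employee CSV extract into RDF-style triples; the column names, the relation name, and the namespace are illustrative assumptions, not SAB's actual schema:

```python
import csv
import io

EX = "http://www.company-example.com/ontology/example_onto/v0/"

# Stand-in for a CSV file extracted from the Employee database.
employee_csv = io.StringIO(
    "id,department\n"
    "E001,Finance\n"
    "E002,IT\n"
)

# Map each row to a (subject, predicate, object) triple.
triples = []
for row in csv.DictReader(employee_csv):
    subject = EX + "employee/" + row["id"]
    triples.append((subject, EX + "belongsTo", EX + "department/" + row["department"]))

for s, p, o in triples:
    print(s, p, o)
```

A real pipeline would emit these triples in Turtle or load them through a platform connector, but the row-to-triple logic stays the same.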
Figure 4.4. Stardog Designer Mapping Tool
LLMs, with their adeptness at understanding human text and database structures, are
well suited to significantly ease these processes. Indeed, LLMs can assist in auto-mapping
database structures to knowledge graph ontologies, minimizing manual interventions, as
illustrated by Figure 4.5.
Figure 4.5. Stardog Designer, Suggest Mapping Tool
Furthermore, the massive amount of unstructured data emphasizes the need for its
transformation into structured formats for knowledge graphs. LLMs offer a vision where
unstructured documents will be effortlessly mapped into knowledge graphs. It's worth
noting that the effective integration of such LLMs into enterprise systems is not yet
mature and will demand more research in this direction. However, technologies such as
AIASHI (Poumay 2019) are opening the path.
accuracy, and conducting usability and security assessments. Employing tools like SHACL,
SPARQL, and OntoClean can streamline these checks. Furthermore, continuous monitoring
and establishing feedback loops with end-users are crucial for real-time improvements and
adaptability. In essence, rigorous Quality Assurance ensures EKGs remain trustable tools
for data-driven decision-making.
To conclude, the development of an EKG is a multifaceted process, characterized by
an intricate interplay of data, technologies, and human expertise. The outlined steps,
Ontology Development, Data Transformation, Data Annotations, and Quality Assurance
(Figure 4.1) provide a high-level schematic of the journey. However, every stage has its
unique set of intricacies and nuances that demand in-depth exploration and knowledge
engineering expertise. By discovering these foundational milestones, one gains a footing to
embark on this complex process.
In the specific context of my research and prototype development, it's pertinent to
note that Data Annotations and Quality Assurance were not delved into exhaustively.
The rationale behind this decision is justified by the substantial company resources they
demanded, specifically the time of company experts. Furthermore, these elements, although
primordial in a complete EKG deployment, were not essential for the development of
my KG coupled with the LLM prototype. Nevertheless, a comprehensive EKG strategy
would invariably necessitate a robust mechanism for data annotation and quality assurance.
Chapter 5
SPARCoder
Before diving into the details of our prototype development, it’s crucial to address a
foundational question:
Why not employ a Knowledge Graph or a Large Language Model in isolation? What
advantages arise from their combined utilization?
1. While our SPARCoder prototype does not currently address this capability, it presents an intriguing avenue for future development, especially when an LLM tailored to Knowledge Graphs already exists.
for users. These tools often encompass visualization capabilities to represent the intricate
relationships in an intuitive manner, query builders that allow users without expertise in
SPARQL to extract relevant information, and comprehensive designer tools that aid in
ontology construction and population. Furthermore, they provide reliable solutions for
storing and managing the growing knowledge graph in an efficient and scalable manner.
The value of these tools should not be understated, as they significantly reduce the barriers
to EKG adoption within enterprises. In the subsequent sections, we will delve deeper
into the specific Enterprise Knowledge Graph platform we’ve adopted for our SPARCoder
prototype, discussing its capabilities and the rationale behind our selection.
LLMs and highlights the potential vulnerabilities of such systems. Termed as the
Grandma Hack, this exploit leverages the model’s tendency to act compliantly when
posed with emotionally evocative scenarios. A user demonstrated that by tricking
ChatGPT into believing it is a deceased grandmother speaking to her grandchildren, it
can be manipulated to divulge sensitive information such as Windows activation keys
or even IMEI numbers of phones2 . The exploit not only reflects the potential misuse of
these LLMs but also raises questions about the data they might inadvertently retain.
Given the ever-evolving sophistication of these hacks, ensuring complete
security remains a challenge. The Grandma Hack serves as a cautionary tale,
emphasizing the need for stringent safeguards when employing LLMs in any capacity,
especially within enterprises dealing with sensitive information. It’s imperative for
organizations to be aware of such vulnerabilities.
• Handling Real-time Data: LLMs are static in their knowledge once trained, and
updating them with new data is not a straightforward task. SAB has to deal with
real-time data and needs its information systems to adapt swiftly (section 1.1.3).
An LLM-centric system might struggle to provide timely insights in fast-changing
business environments. LLMs would therefore require continuous fine-tuning to
remain relevant; this constant need for updates can be resource-intensive and may
result in lapses if not managed effectively.
² Source: softonic - Cracking the Code: How to Hack ChatGPT and Activate Grandma Mode
• Virtual Knowledge Graphs: One of Stardog’s standout features is its unique
technology of virtual knowledge graphs. Unlike traditional methods where data
needs to be imported into the knowledge graph, Stardog’s virtual connectors enable
ontology-based data access without the need for data migration. This not only
ensures real-time accuracy but also significantly reduces the overhead associated
with data replication. The distinctiveness of this feature places Stardog a cut above
its competitors, none of which offers a similar technology.
• Diverse API Support: With the provision of APIs including Python and JavaScript,
Stardog ensures integration with other platforms and development environments.
– Stardog Designer: Stardog Designer serves as an efficient tool for those engaged
in ontology development, mapping, and KG population. By offering a visual
interface, it simplifies the typically intricate process of ontology design, making
it more accessible and less error-prone.
– Stardog Explorer: This tool allows users, regardless of their expertise in
SPARQL, to visually explore the knowledge graph and construct queries in an
intuitive manner using the query builder. The user-friendly interface ensures
that a wider demographic within an enterprise can harness the benefits of the
EKG.
• Alignment with LLM Integration: The synergy between our objectives and
Stardog’s roadmap (Figure 5.1) was evident when, during the period of our internship,
they announced their intention to integrate LLM tools similar to our SPARCoder,
thereby proposing a commercial, enterprise-ready solution to the need my prototype
tries to address.
Figure 5.1. Stardog AI Roadmap
In our pursuit of the optimal Enterprise Knowledge Graph platform, we also explored
other potential solutions. Each had its own merits, but also limitations that rendered them
less suitable for our requirements than Stardog. Here’s a brief overview of the platforms
considered and the reasons for their exclusion:
• GraphDB: While it offers a robust system, it does not provide Encryption at Rest.
This raised concerns about the security and integrity of the data stored within,
particularly for sensitive enterprise applications.
• RDFox: Two primary concerns led to RDFox’s exclusion. Firstly, it does not offer
Encryption at Rest. Secondly, RDFox does not offer a comprehensive platform. Its
reliance on external platforms like Metaphacts for complete functionality was deemed
unsuitable for our streamlined needs.
• Apache Jena: Being open source, Apache Jena has its advantages. However, it
does not provide a comprehensive platform, and the absence of Encryption at Rest
further weakened its candidacy for our requirements.
• Amazon Neptune: Amazon Neptune excels as a graph database but falls short
when considering the broader scope of an Enterprise Knowledge Graph platform.
The necessity of integrating an additional enterprise KG platform for complete
functionality was deemed unsuitable for our streamlined needs.
• Neo4j: A significant drawback was its lack of native compliance with W3C standards.
Given the importance of standards in ensuring compatibility and future scalability,
this was a red flag.
Selecting the Enterprise Knowledge Graph platform was a decision shaped by the
examination of available solutions and how well they matched our requirements. Stardog
emerged as the prime choice due to its comprehensive offerings, security features, alignment
with enterprise constraints, and potential for synergies with LLMs through its API. Other
platforms, while having their own strengths, were lacking in one or more critical areas that
were essential for our use-case.
This distinction introduces a potential concern regarding the overall quality and accuracy
of the synthetic data. While leveraging advanced models like text-davinci-003 can
amplify the size of our dataset, it brings challenges in ensuring the reliability of every
generated sample. As such, users and researchers leveraging our dataset must be aware of
and account for this potential variability in data quality when conducting experiments or
assessments.
These ten ontologies collectively formed the primary training data for our
model. Furthermore, for evaluation purposes, the EmployeeTraining Dataset from our use
case was split into two subsets: one for testing and another for validation. This ensures a
rigorous assessment of the model’s performance against both familiar training data and
newer, unseen data, allowing us to observe its generalization capabilities more effectively.
For those interested in exploring or utilizing our dataset, it has been made available on
Hugging Face as a DictDataset. This ensures accessibility and encourages further research
and development in the domain.
5.4.2 Fine-Tuning
5.4.2.1 Parameter-Efficient Fine-tuning: LoRA
Given our constrained computational power and limited GPU bandwidth, we used a
parameter-efficient fine-tuning technique. One such method, which has demonstrated
notably favorable results, is LoRA, as introduced in the publication LoRA: Low-Rank
Adaptation of Large Language Models by E. J. Hu et al. (2021). This approach enables
the fine-tuning of LLMs by training only a small fraction of the model’s total parameters.
Impressively, it retains a performance level comparable to full fine-tuning, where all
parameters are trained.
Transformers largely depend on dense layers to perform essential matrix multiplications.
These weight matrices are typically of full rank. However, during the task-specific
adaptation of pre-trained language models, it is observed that the models exhibit a lower
intrinsic dimension (Aghajanyan et al. 2020). This low intrinsic dimension suggests that
even with a random projection to a smaller subspace, the model can learn efficiently.
Inspired by this observation, LoRA hypothesizes that the updates to the weight matrices
also possess a low intrinsic rank during the adaptation phase. Specifically, if we consider
a pre-trained weight matrix W₀ ∈ ℝ^(d×k), its update can be represented using a low-rank
decomposition:
W₀ + ∆W = W₀ + BA
where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and the rank r is much smaller than d and k.
This method involves freezing W₀ during training while allowing A and B to be
trainable. Such an approach modifies the forward pass to be:
h = W₀x + ∆Wx = W₀x + BAx
(Figure 5.2)
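The decomposition above can be illustrated with a minimal NumPy sketch. The dimensions here are hypothetical and much smaller than in a real model; B is initialized to zero as in LoRA, so the adapted model starts out identical to the base model:

```python
import numpy as np

# Hypothetical dimensions for illustration; SPARCoder itself uses rank r = 16.
d, k, r = 64, 48, 4

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))        # frozen pre-trained weight matrix
B = np.zeros((d, r))                # trainable, zero-initialized as in LoRA
A = rng.normal(size=(r, k)) * 0.01  # trainable

x = rng.normal(size=(k,))

# Modified forward pass: h = W0 x + B A x
h = W0 @ x + B @ (A @ x)

# Since B starts at zero, the adapted model initially matches the base model.
assert np.allclose(h, W0 @ x)

# Parameter savings: only A and B are trained instead of the full W0.
full_params = d * k        # 64 * 48 = 3072
lora_params = r * (d + k)  # 4 * 112 = 448
print(full_params, lora_params)
```

Only A and B receive gradient updates during fine-tuning, which is what makes the approach tractable on limited GPU resources.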
Figure 5.2. LoRA
Data Model :
{ONTOLOGY_MODEL.ttl}
RESPONSE SHAPE : {CSV SHAPE}<|end|>
<|assistant|>
{ASSISTANT ANSWER}<|end|>
<|endoftext|>
• Contextual Information: The instruction should offer the model a context, framing
its role as an assistant expert in Knowledge Engineering. This helps in channeling
the model’s behavior towards a specific domain of expertise.
• Explicit Constraints: We define specific constraints that the model must adhere to
when generating SPARQL queries. This includes considerations for case insensitivity,
flexibility in dealing with missing properties, and where the results should be stored.
• Special Tokens and Tags: The instruction prompt uses the three special tokens
from the StarChat fine-tuning (<|system|>, <|user|>, and <|assistant|>) to demarcate
the turns of the discussion. Additionally, we use the tags SPARQL QUERY and CSV
RESULT, respectively, to signal the underlying Python script to query the Knowledge
Graph, and to indicate a query result to SPARCoder.
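A sketch of how such a prompt might be assembled (the special tokens are from StarChat; the system wording and function name here are illustrative, not our exact prompt):

```python
def build_prompt(ontology_ttl: str, history: list, user_msg: str) -> str:
    """Assemble a StarChat-style prompt using <|system|>, <|user|>, <|assistant|> tokens.

    history is a list of (role, text) tuples for earlier turns.
    """
    system = (
        "<|system|>You are an assistant expert in Knowledge Engineering. "
        "Generate case-insensitive SPARQL queries over the following ontology.\n"
        "Data Model:\n" + ontology_ttl + "<|end|>\n"
    )
    turns = ""
    for role, text in history:
        turns += f"<|{role}|>{text}<|end|>\n"
    # End with the assistant token so the model continues from there.
    return system + turns + f"<|user|>{user_msg}<|end|>\n<|assistant|>"

prompt = build_prompt("@prefix emp: <...> .", [], "How many active Employees?")
print(prompt)
```

At inference time, earlier user/assistant turns are appended to `history`, which is what gives SPARCoder its conversational context.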
5.4.2.3 Training
Training LLMs like ours is a complex process requiring the appropriate selection of
architecture, data, hyperparameters, and fine-tuning techniques. In our endeavor to
train our SPARCoder model, we turned to the HuggingFace platform. Recognized for
its comprehensive set of tools tailored for training and fine-tuning transformer models,
HuggingFace’s Transformers library, with its `Trainer` interface, was an indispensable asset
in our journey. We also used HuggingFace’s Model and Dataset Hub, a convenient
solution for handling LLMs and datasets.
One important strategy we employed is the system and user masking strategy.
Inspired by the masking of user labels used to train StarChat, as explained
by Tunstall et al. (2023), where special tokens facilitate selective masking of user input
in dialogues, we applied a similar methodology. The core idea is to guide the model to
condition its behavior on certain segments of the data, yet train it to predict only the
specific segments that are essential during the inference phase.
In chat models, this masking strategy ensures the model conditions on user input but
is optimized to predict only the assistant’s responses. Such a distinction is pivotal as
it focuses the model’s attention on what is imperative during actual deployment or
inference. In our use case, considering the substantial size of the system prompts, we
further extended this strategy to mask the system prompt as well. This step ensures the model
does not overfit to the specific ontology model in the system prompt, while still absorbing
the important context the ontology provides.
By leveraging this refined masking strategy, the model is trained to condition its
responses on both the user’s input and the system’s cues. However, its parameters
are optimized for the generation of the SPARQL queries, not for guessing the ontology model
or the user questions. It also mitigates data leakage by ensuring limited knowledge retention
of the enterprise ontologies the model is trained on.
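In practice, this kind of masking is implemented by setting the labels of system and user tokens to the ignore index of the loss function, so that only assistant tokens contribute to the gradient. A minimal sketch with illustrative token IDs:

```python
# Minimal sketch of system-and-user masking (token IDs are illustrative).
# Tokens belonging to <|system|> and <|user|> segments get label -100 so the
# cross-entropy loss ignores them; only <|assistant|> tokens are predicted.

IGNORE_INDEX = -100  # value ignored by PyTorch's CrossEntropyLoss by default

def mask_labels(token_roles, token_ids):
    """Return labels where all non-assistant tokens are masked out."""
    return [
        tid if role == "assistant" else IGNORE_INDEX
        for role, tid in zip(token_roles, token_ids)
    ]

roles  = ["system", "system", "user", "user", "assistant", "assistant"]
ids    = [11, 12, 21, 22, 31, 32]
labels = mask_labels(roles, ids)
print(labels)  # [-100, -100, -100, -100, 31, 32]
```

The input IDs are still fed to the model in full, so it conditions on the masked segments; only the loss computation skips them.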
In our training process, it was essential to meticulously choose and set hyperparameters
to ensure optimal performance of the SPARCoder model. These hyperparameters determine
the model’s learning pace, regularization, and overall training dynamics. The specific
values adopted for our model are detailed in Table 5.1. This configuration was selected
based on parameters commonly used for the fine-tuning of StarChat, ensuring that our
model benefits from proven practices in similar settings.
Parameter                     Value
Foundation Model Path         HuggingFaceH4/starchat-alpha
Dataset                       Corentin-tin/text2SPARQL-dataset
Sequence Length               5700
Max Steps                     300
Training Batch Size           1
Evaluation Batch Size         1
Gradient Accumulation Steps   4
LoRA Rank                     16
LoRA Alpha                    32
LoRA Dropout                  0.1
Learning Rate                 2 × 10⁻⁵
LR Scheduler Type             cosine
Number of Warmup Steps        50
Weight Decay                  0.05
Table 5.1. Training Details for SPARCoder Model Fine-tuning
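These hyperparameters map onto a peft/transformers configuration roughly as follows. This is a sketch under the assumption of the standard Hugging Face APIs, not our exact training script, and the output directory is a hypothetical path:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                 # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="sparcoder-checkpoints",  # hypothetical path
    max_steps=300,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    weight_decay=0.05,
)
```

The `lora_config` would then be applied to the base model with `get_peft_model` before handing both to the `Trainer`.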
Figure 5.3. Dynamic Learning Rate - Linear Warmup and Cosine Decay
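The schedule in Figure 5.3 combines a linear warmup over the first 50 steps with a cosine decay over the remaining 250. A small sketch of that computation (decaying to zero, as the standard cosine scheduler does):

```python
import math

BASE_LR, WARMUP_STEPS, MAX_STEPS = 2e-5, 50, 300

def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(learning_rate(0))    # 0.0
print(learning_rate(50))   # 2e-05 (peak, end of warmup)
print(learning_rate(300))  # ~0.0 (fully decayed)
```

The warmup avoids large, destabilizing updates at the start of fine-tuning, while the cosine decay lets the model settle into a minimum towards the end of the 300 steps.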
Upon monitoring the training and evaluation losses during the model’s training process,
a distinct pattern can be observed in Figure 5.4. Both the training and evaluation losses
exhibited a pronounced decreasing phase during the initial 80 steps. This is an anticipated
behavior in the initial stages of training as the model refines its parameters to better
fit the data. The model begins to learn the expected response form with respect to the
provided instruction format, ensuring that its generated outputs are not only accurate but
also conformant to the structural expectations set by the training data.
However, after these initial 80 steps, while the training loss continued its decline at a
slower rate, the evaluation loss appeared to plateau. This stagnation in the evaluation loss,
despite the persistent decline in the training loss, is indicative of an overfitting
scenario. The model, although increasingly fitting the training data well, might not
be generalizing as effectively to unseen data, a concern further compounded by the
intricate nature of our dataset.
Figure 5.4. Training and Evaluation Losses
1. Example 1
User: How many active Employees ?
SPARQL:
SELECT (COUNT(?employee) AS ?count)
WHERE {
  ?employee a emp:ActiveEmployee .
}
RESULT:
   count
0   4204
Assistant: There are 4204 active employees in the Knowledge Graph
User: And at Safran Aero Boosters ?
SPARQL QUERY:
SELECT (COUNT(?employee) AS ?count)
WHERE {
  ?employee a emp:ActiveEmployee .
  ?employee emp:worksFor ?company .
  ?company emp:companyName ?companyName .
  FILTER regex(?companyName, "Safran Aero Boosters", "i") .
}
RESULT:
   count
0   1333
RESPONSE SHAPE: (1, 1)
Assistant: There are 1333 active employees at Safran Aero Boosters
As we can see with these two simple questions, the model acts as expected. We can
also notice that SPARCoder keeps track of the context: just by asking
"And at Safran Aero Boosters ?" as a second question, the model knows that it must
count the number of employees at SAB.
2. Example 2
User: How many training cost more than 1000 ?
SPARQL QUERY:
SELECT (COUNT(?training) AS ?count)
WHERE {
  ?training a tr:Training .
  ?training tr:cost ?cost .
  FILTER (?cost > 1000) .
}
RESULT:
   count
0   4057
RESPONSE SHAPE: (1, 1)
Assistant: There are 4057 trainings that cost more than 1000.
User: Can I have there names ?
SPARQL QUERY:
SELECT ?name
WHERE {
  ?training a tr:Training .
  ?training tr:name ?name .
}
RESULT:
   name
0  The 5 key principles of personal data protection
1  Jump into the Digital Transformation @ Safran
2  Jump in Engineering 4.0
3  Jump in Manufacturing 4.0
4  CoorpCast | Ep 13 | Slash careers: Multiple Jobs, Multiple Lives
RESPONSE SHAPE: (584, 1)
Assistant: The 5 key principles of personal data protection
Jump into the Digital Transformation @ Safran
Jump in Engineering 4.0
Jump in Manufacturing 4.0
CoorpCast | Ep 13 | Slash careers: Multiple Jobs, Multiple Lives
Here we can see that the first SPARQL query is correct. The second, however, while a
valid SPARQL query, does not take the context into account: the result does not
contain the names of trainings costing more than 1000€, but simply the names of
all trainings.
3. Example 3
User: How many training each employee has attended ?
SPARQL QUERY:
SELECT ?employee ?training (COUNT(?training) AS ?count)
WHERE {
  ?employee emp:participatedIn ?training .
}
GROUP BY ?employee ?training
ORDER BY DESC(?count)
RESULT:
   employee  training  count
0       NaN       NaN      0
RESPONSE SHAPE: (1, 3)
In this case, the query is not valid with respect to the ontology because the object
property emp:participatedIn does not exist; the appropriate object property is
defined in the Training ontology and is therefore tr:participatedIn.
This issue is the most frequent, and the most problematic, as it means that
the model does not strictly respect the underlying ontology. Most of the time, when
more than one ontology prefix must be used in the SPARQL query, the model fails
to use the appropriate prefixes.
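A simple mitigation one could envisage is to validate generated queries against the ontology before execution. The check below is a hypothetical post-processing step, not part of the current prototype, and the property sets are an illustrative subset of the ontologies:

```python
import re

# Prefixes declared across the ontologies (illustrative subset).
KNOWN_PREFIXES = {"emp", "tr"}

# prefix:property pairs actually defined (illustrative subset of the ontology).
DEFINED_PROPERTIES = {("tr", "participatedIn"), ("emp", "worksFor"), ("tr", "cost")}

def check_query(query: str) -> list:
    """Return prefix:property pairs used in the query but not defined in the ontology."""
    errors = []
    for prefix, prop in re.findall(r"\b([A-Za-z]\w*):(\w+)", query):
        if prefix in KNOWN_PREFIXES and (prefix, prop) not in DEFINED_PROPERTIES:
            errors.append(f"{prefix}:{prop}")
    return errors

bad = "SELECT ?e WHERE { ?e emp:participatedIn ?t . }"
print(check_query(bad))  # ['emp:participatedIn'] -- wrong prefix, should be tr:
```

Such a check could either reject the query or be fed back to the model as a repair hint before the query ever reaches the triple store.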
4. Example 4
User: What is the name of the manager of 'jt85264'
SELECT ?name
WHERE {
  ?employee emp:id ?id ;
            emp:firstName ?name .
  FILTER regex(?id, "jt85264", "i") .
}
RESULT:
Empty DataFrame
Columns: [name]
Index: []
RESPONSE SHAPE: (0, 1)
The model struggles to generate this basic query, likely because it lacks natural
language comprehension, having been primarily trained on code.
5.5 Discussion
The results obtained from our model, SPARCoder, provide both a window into the model’s
potential capabilities as well as its limitations, particularly in the niche and intricate
domain of EKGs. As we navigate through the outcomes, there are several crucial aspects
that emerge, helping to understand the nuances and challenges inherent to training a
model in this domain.
queries. While it has been trained extensively on code, the bridging of natural
language queries to structured SPARQL queries is evidently still a challenge. This
limitation emphasizes the need for an even blend of natural language and structured
data during the training phase to enhance the model’s versatility.
In conclusion, the results highlight both the promises and the challenges of training
models specialized in ontologies and KGs. The commendable capability of SPARCoder
in contextual understanding sets a positive precedent. However, the inconsistencies in
results and struggles with multiple ontologies indicate the intricate nature of the problem at
hand. These findings, while pointing out areas of improvement, also pave the way for future
research. By understanding these limitations, we can refine our training methodologies,
dataset compositions, and evaluation metrics to create models that are more adept and
consistent in their performance. This journey, while filled with challenges, holds immense
potential to revolutionize the way we interface with domain-specific enterprise KGs.
5.6 Architecture
In order to provide seamless integration between the front-end user interface and the
backend capabilities of SPARCoder and the Knowledge Graph, a well-planned architecture
is essential. Our architecture is designed to take into account the specific requirements
of our prototype.
While our current setup is apt for demonstration purposes, a production-grade applica-
tion demands further enhancements (Appendix B.1).
5.6.1 Overview
At a high level, our system is divided into three primary components. These components
ensure that user queries from the web interface are processed, converted into SPARQL
queries (where necessary), and that results are retrieved from the appropriate
backend sources.
• HTML: To define the structure of our web content, laying out the chat interface,
buttons, and input fields.
• CSS: To dictate the aesthetics of our interface, such as the color scheme, typography,
and responsive design.
• JavaScript: Powers the interactive features of our interface. With JavaScript, we
capture user inputs, communicate with the Flask server, and update the chat interface
in real time.
• Request Handling: Capturing and interpreting user inputs from the frontend.
• Storage Management: As per our current setup, discussions are stored in text
files corresponding to chat sessions. The server manages the reading and writing of
these sessions.
2. Stardog Knowledge Graph: Queries that need factual information or depend
on the enterprise data are directed towards the Stardog query endpoint.
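Directing a query to Stardog amounts to an HTTP call against its SPARQL endpoint. A minimal sketch with the standard library follows; the server URL and database name are hypothetical, and in production one would more likely use Stardog's official client library:

```python
import urllib.parse
import urllib.request

def build_sparql_request(base_url: str, database: str, query: str):
    """Build a POST request for a SPARQL query endpoint, asking for CSV results."""
    url = f"{base_url}/{database}/query"
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={
            "Accept": "text/csv",  # CSV is the shape SPARCoder consumes
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )

req = build_sparql_request(
    "http://localhost:5820",  # hypothetical local Stardog server
    "ekg",                    # hypothetical database name
    "SELECT (COUNT(?e) AS ?count) WHERE { ?e a emp:ActiveEmployee . }",
)
print(req.full_url)  # http://localhost:5820/ekg/query
# urllib.request.urlopen(req) would execute it (requires a running server and auth).
```

Requesting `text/csv` keeps the result in the tabular shape that is echoed back to SPARCoder as the `RESULT` block of the conversation.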
5.6.5 Architecture
A visual representation of our architecture, illustrating the interaction between different
components, can be seen in Figure 5.6. For a more detailed view, refer to the full-size
diagram in Appendix B.5.
[Figure 5.6: architecture diagram. The user's natural language request (1) is turned by SPARCoder into a SPARQL query (2) against the Knowledge Graph platform's query engine and triple store, which retrieves relevant data (3) from internal isolated data silos (PLM, 3DExp, HRM, and ERP databases) and external sources such as commodity market prices; the relevant data is returned as CSV (4) and rendered as a natural language response (5). Ontology model management and access control sit on the platform side, supported by a knowledge expert.]
5.7 Conclusion
This thesis embarked on a journey to address the prevalent challenge of making vast data
accessible and comprehensible to non-expert users, specifically within the realm of large
organizations. We delved into the potential of coupling the Enterprise Knowledge Graphs
with an ontology-aware Text-to-SPARQL fine-tuned Large Language Model. Our prototype,
SPARCoder, sought to showcase the power of marrying structured knowledge graphs
with advanced natural language processing techniques.
Several points emerged from our research. Firstly, the integration of large language
models with knowledge graph querying indeed presents a promising avenue for enhancing
data accessibility for non-experts. Our "structured data search engine" prototype not
only offers a more intuitive data interaction experience but also broadens the horizons for
diverse user interactions, thereby promoting data-driven decision-making.
However, as highlighted in our discussions, the journey is not devoid of challenges. The
model’s inconsistencies in maintaining context across queries, its struggles with ontology
prefixes, and challenges in seamlessly bridging natural language queries with structured
SPARQL requests underscore the complexities of this venture. These challenges, rather
than appearing insurmountable, point towards avenues for future research. They emphasize
the importance of well-curated training datasets, particularly in specialized domains like
KGs, and hint at the potential refinements needed in our training methodologies.
As we conclude, it’s evident that the association between knowledge graphs and large
language models holds great promise. While the current system offers only a modest step
towards bridging the gap between vast and fragmented data sources and end-users, the
frontier to be explored is still vast.
Building on this foundational work and recognizing the immense landscape yet to be
charted, we now turn our attention to potential Future Directions.
objective of generating SPARQL. This choice was underpinned by the presumption that
the nature of SPARQL generation would align closely with typical coding tasks. However,
an open question remains: What if the semantics and structure of KGs and SPARQL are
more congruent with natural language processing rather than classic coding?
Knowledge Graphs inherently capture relations and entities, similar to the way natural
language encapsulates subjects, predicates, and objects. The syntactic nature of SPARQL
might be closer to natural language constructs than traditional code, given its emphasis
on querying relationships and attributes. Recognizing this potential similarity, it might be
beneficial to investigate LLM architectures that are fine-tuned or primarily designed for
natural language processing tasks.
Appendix A
StarCoder
Figure A.2. Overview of the training data for StarCoder. For the selected programming
languages, we show the number of files and data volume after near-deduplication, as well
as after filtering - part 1 (Li et al. 2023)
Figure A.3. Overview of the training data for StarCoder. For the selected programming
languages, we show the number of files and data volume after near-deduplication, as well
as after filtering - part 2 (Li et al. 2023)
A.1 Glosses
A.1.1 Employee
Entities
ConceptID Context Term(en) Term(fr)
... Employee Employee Employé
... Employee Office Bureau
... Employee Company Entreprise
... Employee Department Département
Gloss(en): An individual who works part-time or full-time under a contract of employment.
Gloss(fr): Un individu qui travaille à temps partiel ou à temps plein sous un contrat de travail.

Gloss(en): A location where an employee carries out their work activities.
Gloss(fr): Un lieu où un employé réalise ses activités de travail.

Gloss(en): An organized group of people with a particular purpose, such as a business.
Gloss(fr): Un groupe organisé de personnes ayant un but particulier, comme une entreprise.

Gloss(en): A functional unit within an organization.
Gloss(fr): Une unité fonctionnelle au sein d’une organisation.
Relations
A.1.2 Training
Entities
ConceptID Context Term(en) Term(fr)
... Training Training Formation
... Training DigitalAcademy Training Formation DigitalAcademy
... Training SAB Training Formation SAB
Gloss(en): A structured training and development program in the context of professional development.
Gloss(fr): Un programme structuré de formation et de développement dans le contexte du développement professionnel.

Gloss(en): An online platform offering a variety of training courses.
Gloss(fr): Une plateforme en ligne proposant une variété de formations.

Gloss(en): A development program offering the possibility variety.
Gloss(fr): Un programme structuré de formation et de développement dans le contexte du développement professionnel.
Relations
Appendix B
SPARCoder
Data Model :
{ONTOLOGY_MODEL.ttl}
<|user|>
RESULT:
  employeeName employeeID
0     John Doe     ID3124
1     Jane Doe     ID3125
2   John Smith     ID3126
Do you need any further information about them ?<|end|>
<|endoftext|>
Figure B.3. GPU Memory Allocated
Figure B.4. GPU Power Usage (W)
2. Database Integration: Relying on text files for storage is not scalable or efficient.
In a production setting, integrating a relational or NoSQL database would ensure
faster data retrieval, backup mechanisms, and better data integrity.
3. Security Measures: Protecting user data and ensuring secure interactions between
the frontend, backend, and LLM are paramount. This might involve implementing
HTTPS, data encryption, using secure APIs, and login/password to connect.
[Full-size architecture diagram: the messaging application/server hosts the user interface, SPARCoder (LLM), and the context window; the Knowledge Graph platform provides the query and inference engine, triple store, KG connectors (OBDA), mapping engine, ontology endpoint, and access control; ontology development and model management are carried out by a knowledge expert. The user's natural language request (1) becomes a SPARQL query (2) against the triple store, which draws relevant data (3) from internal isolated data silos (PLM, 3DExp, HRM, ERP databases) and external commodity market prices, returned as CSV (4) and rendered as a natural language response (5).]
B.2 Qualitative Analysis
B.2.1 Example 1
B.2.2 Example 2
Can I have there name ?<|end|>
<|assistant|>
SPARQL QUERY:
```
PREFIX tr: <https://www.safran-group.com/companies/safran-aero-boosters/ontology/training/v0/>
SELECT ?name
WHERE {
  ?training a tr:Training .
  ?training tr:name ?name .
}
```<|end|>
<|user|>
RESULT:
   name
0  The 5 key principles of personal data protection
1  Jump into the Digital Transformation @ Safran
2  Jump in Engineering 4.0
3  Jump in Manufacturing 4.0
4  CoorpCast | Ep 13 | Slash careers: Multiple Jobs, Multiple Lives
B.2.3 Example 3
```<|end|>
<|user|>
RESULT:
   employee  training  count
0       NaN       NaN      0
B.2.4 Example 4
SELECT ?name
WHERE {
  ?employee emp:id ?id ;
            emp:firstName ?name .
  FILTER regex(?id, "jt85264", "i") .
}
```<|end|>
<|user|>
RESULT:
Empty DataFrame
Columns: [name]
Index: []
RESPONSE SHAPE: (0, 1)<|end|>
<|assistant|>
The manager of 'jt85264' is not known.
Bibliography
Li, Raymond et al. (2023). “StarCoder: may the source be with you!” arXiv: 2305.06161
[cs.CL] (pages 2, 14, 17, 31, 71–73).
Mazumdar, Somnath et al. (2019). “A survey on data storage and placement methodologies
for Cloud-Big Data ecosystem”. In: Journal of Big Data 6.1, p. 15. issn: 2196-1115.
doi: 10.1186/s40537-019-0178-3. url: https://doi.org/10.1186/s40537-019-0178-3 (page 1).
Studer, Rudi, V.Richard Benjamins, and Dieter Fensel (1998). “Knowledge engineering:
Principles and methods”. In: Data Knowledge Engineering 25.1, pp. 161–197. issn:
0169-023X. doi: https://doi.org/10.1016/S0169-023X(97)00056-6. url: https://www.sciencedirect.com/science/article/pii/S0169023X97000566 (pages 1, 39).
Brown, Tom B. et al. (2020). “Language Models are Few-Shot Learners”. arXiv: 2005.14165
[cs.CL] (page 1).
Devlin, Jacob et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding”. arXiv: 1810.04805 [cs.CL] (pages 1, 7, 11).
Sun, Ruoxi et al. (2023). “SQL-PaLM: Improved Large Language Model Adaptation for
Text-to-SQL”. arXiv: 2306.00739 [cs.CL] (pages 1, 15).
European Commission (2016). “Regulation (EU) 2016/679 of the European Parliament
and of the Council of 27 April 2016 on the protection of natural persons with regard to
the processing of personal data and on the free movement of such data, and repealing
Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance)”.
url: https://eur-lex.europa.eu/eli/reg/2016/679/oj (pages 4, 52).
Hambarde, Kailash and Hugo Proença (2023). “Information Retrieval: Recent Advances
and Beyond”. In: arXiv preprint arXiv:2301.08801 (page 5).
Chaudhri, Vinay K et al. (2022). “Knowledge graphs: Introduction, history, and perspec-
tives”. In: AI Magazine 43.1, pp. 17–29 (pages 5, 13).
Brin, Sergey and Lawrence Page (1998). “The Anatomy of a Large-Scale Hypertextual
Web Search Engine”. In: Seventh International World-Wide Web Conference (WWW
1998) (page 6).
Bugaje, Maryam and Gobinda Chowdhury (Nov. 2017). “Is Data Retrieval Different from
Text Retrieval? An Exploratory Study”. In: pp. 97–103. isbn: 978-3-319-70231-5. doi:
10.1007/978-3-319-70232-2_8 (page 6).
Tiwari, Anil (2023). “Emergence of Vector Databases with AI wave”. Medium. url: https://tiw-anilk.medium.com/emergence-of-vector-databases-with-ai-wave-dd9976dedc2f (page 6).
Cer, Daniel et al. (2018). “Universal Sentence Encoder”. In: CoRR abs/1803.11175. arXiv:
1803.11175. url: http://arxiv.org/abs/1803.11175 (page 7).
Johnson, Jeff, Matthijs Douze, and Hervé Jégou (2017). “Billion-scale similarity search
with GPUs”. In: CoRR abs/1702.08734. arXiv: 1702.08734. url: http://arxiv.org/
abs/1702.08734 (page 7).
Louis, Antoine (July 2020). “A Brief History of Natural Language Processing”. In: Medium. url: https://medium.com/@antoine.louis/a-brief-history-of-natural-language-processing-part-1-ffbcb937ebce (page 9).
Chiusano, Fabio (Sept. 2022). “A Brief Timeline of NLP: A journey across grammars, expert
systems, ontologies, statistical models, neural networks, word embeddings, transformers,
etc.” In: NLPlanet. https://medium.com/nlplanet/a-brief-timeline-of-nlp-bc45b640f07d
(page 9).
Bommasani, Rishi et al. (2021). “On the Opportunities and Risks of Foundation Models”.
In: arXiv preprint arXiv:2108.07258 (pages 9, 11, 12).
Weaver, Warren (1949). “Translation”. In: Machine Translation of Languages. Ed. by
William N. Locke and A. Donald Boothe. Reprinted from a memorandum written by
Weaver in 1949. Cambridge, MA: MIT Press, pp. 15–23 (page 9).
Chomsky, Noam (1957). “Syntactic Structures”. Mouton (page 10).
Pierce, John R et al. (1966). “Language and machines — computers in translation and
linguistics”. Tech. rep. Washington, DC: National Academy of Sciences, National
Research Council (page 10).
Weizenbaum, Joseph (1966). “ELIZA—A Computer Program for the Study of Natural
Language Communication between Man and Machine”. In: Communications of the
ACM 9.1, pp. 36–45. doi: 10.1145/365153.365168 (page 10).
Winograd, Terry (1971). “Procedures as a representation for data in a computer program
for understanding natural language”. Tech. rep. Massachusetts Institute of Technology,
Cambridge Project (page 10).
Woods, W., R. Kaplan, and B. Nash-Webber (1972). “The Lunar Sciences Natural Language
Information System: Final Report”. Tech. rep. Bolt, Beranek and Newman, Cambridge,
MA (pages 10, 12).
Dyer, Michael G. (1983). “The Role of Affect in Narratives”. In: Cognitive Science 7.3,
pp. 211–242 (page 10).
Bengio, Yoshua et al. (2003). “A Neural Probabilistic Language Model”. In: Journal of
Machine Learning Research 3, pp. 1137–1155 (page 10).
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams (1986). “Learning
representations by back-propagating errors”. In: Nature 323, pp. 533–536 (page 10).
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Memory”. In: Neural
Computation 9.8, pp. 1735–1780 (pages 10, 23).
Mikolov, Tomáš et al. (2010). “Recurrent Neural Network Based Language Model”. In: Pro-
ceedings of the Eleventh Annual Conference of the International Speech Communication
Association (page 10).
Graves, Alex (2013). “Generating Sequences with Recurrent Neural Networks”. In: arXiv
preprint arXiv:1308.0850 (page 10).
Collobert, Ronan and Jason Weston (2008). “A Unified Architecture for Natural Language
Processing: Deep Neural Networks with Multitask Learning”. In: Proceedings of the
25th International Conference on Machine Learning, pp. 160–167 (page 10).
LeCun, Yann et al. (1998). “Gradient-Based Learning Applied to Document Recognition”.
In: Proceedings of the IEEE 86.11, pp. 2278–2324 (page 10).
Mikolov, Tomas et al. (2013). “Efficient Estimation of Word Representations in Vector
Space”. In: arXiv preprint arXiv:1301.3781 (pages 10, 18).
Socher, Richard et al. (2013). “Recursive Deep Models for Semantic Compositionality Over
a Sentiment Treebank”. In: Proceedings of the 2013 Conference on Empirical Methods
in Natural Language Processing, pp. 1631–1642 (page 10).
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le (2014). “Sequence to Sequence Learning with
Neural Networks”. In: Advances in Neural Information Processing Systems, pp. 3104–
3112 (page 10).
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural Machine Trans-
lation by Jointly Learning to Align and Translate”. In: Proceedings of the International
Conference on Learning Representations (ICLR) (page 10).
Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: Advances in Neural
Information Processing Systems (pages 10, 20, 24–27).
Radford, Alec et al. (2018). “Improving Language Understanding by Generative Pre-
Training”. In: OpenAI (pages 11, 28, 30).
Clark, Peter et al. (2020). “From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An
Overview of the Aristo Project”. In: AI Magazine 41.4, pp. 39–53 (page 11).
Ben Zaken, Elad, Shauli Ravfogel, and Yoav Goldberg (2022). “BitFit: Simple Parameter-
efficient Fine-tuning for Transformer-based Masked Language-models”. In: arXiv
preprint arXiv:2106.10199 (page 12).
Hu, Edward J et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models”.
In: arXiv preprint arXiv:2106.09685 (pages 12, 57).
Hayes, Patrick J (1981). “Computing science and statistics: The interface”. In: Journal of
the American Statistical Association 76.374, pp. 7–15 (page 13).
McCarthy, John (1989). “First-order logic and AI”. In: Readings in artificial intelligence.
Morgan Kaufmann Publishers Inc., pp. 13–23 (page 13).
Taylor, R. and M. Frank (1976). “Data base management systems”. In: Computer 9.2,
pp. 38–44 (page 13).
Codd, Edgar F (1982). “Relational database: a practical foundation for productivity”. In:
Communications of the ACM 25.2, pp. 109–117 (page 13).
Lenat, Douglas B, Ramanathan V Guha, et al. (1995). “CYC: A large-scale investment in
knowledge infrastructure.” In: AAAI/IAAI. Vol. 1995, pp. 673–680 (page 13).
Lenat, Douglas B, Ramanathan V Guha, et al. (1991). “Building large knowledge-based
systems; representation and inference in the Cyc project.” In: AAAI/IAAI. Vol. 91,
pp. 1168–1175 (page 13).
Page, Lawrence et al. (1999). “The PageRank Citation Ranking: Bringing Order to the
Web.” In: Technical report. url: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
(page 13).
Guha, Ramanathan V. (1996). “Metadata for the World Wide Web”. In: Proceedings of
the First International Conference on the World-Wide Web. url: http://www.cs.wustl.edu/~schmidt/PDF/meta-www96.pdf
(page 13).
Pan, Shirui et al. (2023). “Unifying Large Language Models and Knowledge Graphs: A
Roadmap”. In: arXiv preprint arXiv:2306.08302 (page 13).
Xu, Yichong et al. (2021). “Fusing context into knowledge graph for commonsense question
answering”. In: Findings of the Association for Computational Linguistics: ACL-
IJCNLP 2021, pp. 1201–1207 (page 14).
Hu, Nan et al. (2023). “An empirical study of pre-trained language models in simple
knowledge graph question answering”. In: arXiv preprint arXiv:2303.10368 (page 14).
Wang, Xiaozhi et al. (2021). “KEPLER: A Unified Model for Knowledge Embedding
and Pre-trained Language Representation”. In: Transactions of the Association for
Computational Linguistics 9, pp. 176–194 (page 14).
Ke, Pei et al. (Aug. 2021). “JointGT: Graph-Text Joint Representation Learning for Text
Generation from Knowledge Graphs”. In: Findings of the Association for Computational
Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics
(page 14).
Kocetkov, Denis et al. (2022). “The Stack: 3 TB of permissively licensed source code”. In:
Preprint (pages 14, 31, 32).
Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
et al. (2021). “Evaluating Large Language Models Trained on Code”. In: arXiv preprint
arXiv:2107.03374 (pages 14, 71).
Nijkamp, Erik et al. (2022). “A Conversational Paradigm for Program Synthesis”. In:
arXiv preprint (page 14).
Wu, Sen, Laurel Orr, and Manasi Ganti (2023). “Introducing NSQL: Open-source SQL
Copilot Foundation Models”. In: Numbers Station. url: https://www.numbersstation.ai/post/introducing-nsql-open-source-sql-copilot-foundation-models
(page 15).
Fleuret, François (2021). “Deep Learning Course 14x050”. University of Geneva, Switzer-
land. Available at: https://fleuret.org/dlc/. Includes slides, recordings, and a
virtual machine. url: https://fleuret.org/dlc/ (page 20).
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). “Neural machine trans-
lation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473
(page 21).
Louppe, Gilles (2023). “Lecture 7: Attention and Transformers”. Lectures for INFO8010 -
Deep Learning, ULiège, Spring 2023. url: https://github.com/glouppe/info8010-deep-learning
(pages 22, 24).
Zhang, Aston et al. (2021). “Dive into Deep Learning”. In: arXiv preprint arXiv:2106.11342
(pages 22, 24).
Chung, Junyoung et al. (2014). “Empirical Evaluation of Gated Recurrent Neural Networks
on Sequence Modeling”. In: CoRR abs/1412.3555. arXiv: 1412.3555. url:
http://arxiv.org/abs/1412.3555 (page 23).
Shazeer, Noam (2019). “Fast Transformer Decoding: One Write-Head is All You Need”.
In: arXiv preprint arXiv:1911.02150 (pages 25, 26, 30).
Ainslie, Joshua et al. (May 2023). “GQA: Training Generalized Multi-Query Transformer
Models from Multi-Head Checkpoints”. In: arXiv preprint arXiv:2305.13245 (page 26).
He, Kaiming et al. (June 2016). “Deep Residual Learning for Image Recognition”. In:
pp. 770–778. doi: 10.1109/CVPR.2016.90 (page 27).
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton (2016). “Layer Normalization”.
arXiv: 1607.06450 [stat.ML] (page 27).
Liu, Peter J. et al. (2018). “Generating Wikipedia by Summarizing Long Sequences”. In:
CoRR abs/1801.10198. arXiv: 1801.10198. url: http://arxiv.org/abs/1801.10198
(page 28).
Allal, Loubna Ben et al. (2023). “SantaCoder: don’t reach for the stars!” arXiv: 2301.03988
[cs.SE] (page 31).
Kingma, Diederik P. and Jimmy Ba (2017). “Adam: A Method for Stochastic Optimization”.
arXiv: 1412.6980 [cs.LG] (page 33).
Gershgorn, Dave (2021). “GitHub and OpenAI launch a new AI tool that generates its own
code: Microsoft gets a taste of OpenAI’s tech”. In: The Verge. url: https://www.theverge.com/2021/6/29/22555777/github-openai-ai-tool-autocomplete-code
(page 34).
Köpf, Andreas et al. (2023). “OpenAssistant Conversations – Democratizing Large Lan-
guage Model Alignment”. arXiv: 2304.07327 [cs.CL] (page 34).
Conover, Mike et al. (2023). “Free Dolly: Introducing the World’s First Truly Open
Instruction-Tuned LLM”. url: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
(visited on 06/30/2023) (page 34).
Ehrlinger, Lisa and Wolfram Wöß (Sept. 2016). “Towards a Definition of Knowledge
Graphs”. In: SEMANTiCS (Posters, Demos, SuCCESS) (page 37).
Singhal, Amit (May 2012). “Introducing the Knowledge Graph: things, not strings”. In:
Google Blog. url: https://blog.google/products/search/introducing-knowledge-graph-things-not/
(page 37).
Fensel, Dieter, ed. (2011). “Foundations for the Web of Information and Services: A Review
of 20 Years of Semantic Web Research”. Springer Berlin Heidelberg (page 37).
Debruyne, Christophe (2023). “Knowledge Representation and Reasoning”. Lecture 2.
Liège, Belgium: University of Liège. url: https://www.programmes.uliege.be/cocoon/20232024/cours/INFO9014-1.html
(pages 37, 38).
Wu, Honghan et al. (Jan. 2017). “Understanding Knowledge Graphs”. English. In: Exploit-
ing Linked Data and Knowledge Graphs in Large Organisations. Switzerland: Springer
International Publishing AG, pp. 147–180. isbn: 9783319456522. doi: 10.1007/978-
3-319-45654-6_6 (page 38).
De Leenheer, Pieter and Tom Mens (2008). “Ontology Evolution”. In: Ontology Management:
Semantic Web, Semantic Web Services, and Business Applications. Ed. by Martin
Hepp et al. Boston, MA: Springer US, pp. 131–176. isbn: 978-0-387-69900-4. doi:
10.1007/978-0-387-69900-4_5 (page 39).
Gruber, Thomas R. (1995). “Toward principles for the design of ontologies used for
knowledge sharing?” In: International Journal of Human-Computer Studies 43.5,
pp. 907–928. issn: 1071-5819. doi: 10.1006/ijhc.1995.1081. url: https://www.sciencedirect.com/science/article/pii/S1071581985710816
(page 39).
Arenas, Marcelo et al. (2012). “A Direct Mapping of Relational Data to RDF”. W3C
Recommendation. url: http://www.w3.org/TR/2012/REC-rdb-direct-mapping-20120927/
(page 47).
Sequeda, Juan et al. (Jan. 2009). “Direct mapping SQL databases to the semantic web: A
survey”. In: (page 47).
Debruyne, Christophe and Declan O’Sullivan (Jan. 2016). “R2RML-F: Towards Sharing
and Executing Domain Logic in R2RML Mappings”. In: (page 47).
Halevy, Alon et al. (June 2005). “Enterprise information integration: successes, challenges
and controversies”. In: pp. 778–787. doi: 10.1145/1066157.1066246 (page 48).
Poumay, J. (2019). “Term extraction from domain specific texts”. Unpublished MA thesis.
Liège, Belgium: Université de Liège. url: https://matheo.uliege.be/handle/2268.2/7487
(page 49).
Trivedi, Priyansh et al. (2017). “LC-QuAD: A Corpus for Complex Question Answering over
Knowledge Graphs”. In: International Semantic Web Conference. Springer, pp. 210–218
(page 56).
Aghajanyan, Armen, Luke Zettlemoyer, and Sonal Gupta (2020). “Intrinsic Dimensionality
Explains the Effectiveness of Language Model Fine-Tuning”. arXiv: 2012.13255
[cs.LG] (page 57).
Tunstall, Lewis et al. (2023). “Creating a Coding Assistant with StarCoder”. In: Hugging
Face Blog. https://huggingface.co/blog/starchat (page 59).
Cassano, Federico et al. (2022). “MultiPL-E: A Scalable and Extensible Approach to
Benchmarking Neural Code Generation”. arXiv: 2208.08227 [cs.LG] (page 71).