Preprints202405 0305 v1
doi: 10.20944/preprints202405.0305.v1
Copyright: This is an open access article distributed under the Creative Commons
Attribution License which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Preprints.org (www.preprints.org) | NOT PEER-REVIEWED | Posted: 7 May 2024 doi:10.20944/preprints202405.0305.v1
Disclaimer/Publisher’s Note: The statements, opinions, and data contained in all publications are solely those of the individual author(s) and
contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting
from any ideas, methods, instructions, or products referred to in the content.
Article
Abstract: This study addresses the prevalent gap between formal and informal architectural
methodologies in software engineering. Recognizing the potential of informal architecture artifacts
in analytical processes, we introduce a groundbreaking methodology that efficiently transforms
these informal components into structured formal models. This method facilitates a deeper
understanding and utilization of informal diagrams and enhances analytical capabilities through
graph analysis techniques. By leveraging user-friendly tools like Draw.io, the methodology
democratizes the modeling process, making sophisticated architectural analyses accessible to a
broader spectrum of professionals without requiring deep expertise in formal methods. The
innovative aspects of this methodology lie in its ability to streamline the transformation process,
significantly improving both the efficiency and effectiveness of model creation and analysis. These
enhancements are demonstrated through a practical application involving a sample architecture
diagram, where the resulting model is thoroughly analyzed using advanced graph analysis tools
like Python's NetworkX library and Neo4j. This approach bridges the theoretical and practical
divides in software architecture and sets a new standard for integrating informal artifacts into
systematic engineering workflows. In addition, considerations for Artificial Intelligence
developments are discussed.
1. Introduction
In the evolving landscape of software engineering, a significant gap persists between the
theoretical methodologies proposed in research and the pragmatic approaches applied in practice.
Research often emphasizes the rigor of formal methods, which are meticulously structured and aim
to ensure semantic clarity and programmatic integrity [1][2]. However, these methods require a deep
understanding of specialized modeling languages and tools [3], creating a barrier to widespread
adoption due to the niche expertise required.
Conversely, informal methods predominate in the practical realm. Engineers frequently rely on
natural-language documents, wikis, and simple boxes-and-lines diagrams due to their ease of use
and accessibility [4]. Despite their popularity, these informal methods suffer from significant drawbacks,
including inconsistencies in structure, formatting, and syntax, which complicate further analysis and
integration into formal systems [5][6].
This paper introduces an innovative methodology that bridges this divide by transforming
informal, often chaotic architectural diagrams into structured, formal models. Our approach
leverages the user-friendly diagramming tool Draw.io (also available online as diagrams.net) to
extract data from informal boxes-and-lines diagrams. This data is then structured into graph-like
constructs compatible with advanced analysis tools such as Python's NetworkX library and Neo4j.
This method not only democratizes the creation of formal models, making them accessible to a
broader range of professionals without specialized training in formal methods, but also enhances
the efficiency and effectiveness of architectural analyses.
We present a series of techniques that represent an evolution in handling architectural artifacts—
optimizing the transformation process to be more intuitive and less resource-intensive. Our
methodology is demonstrated through a sample analysis that showcases the complete workflow,
from the initial extraction of data from informal diagrams to their integration into a structured model
ready for formal analysis. This approach not only addresses the current gaps in practice but also
pushes the boundaries of current software engineering methodologies by introducing a scalable, cost-
effective solution that maintains the integrity and utility of formal modeling in a way that is aligned
with everyday engineering practices. In addition, this work is only an initial step: it generates cases
that can later be combined with the “interpretation” capabilities of Large Language Models and
advanced pattern recognition systems to build an even more powerful methodology.
2. Background
Residential private networks connect through a VPN, highlighting the security measures for
remote access, which is increasingly relevant in modern network designs that accommodate
telecommuting. The enterprise cloud section, encased within a Virtual Private Cloud (VPC), features
redundant application servers behind a load balancer, illustrating high availability and fault
tolerance strategies essential for maintaining continuous service delivery.
This diagram style has different syntaxes depending on the language (e.g., UML, SysML), but
the general purpose is the same [8][9]. This type of diagram shows the objects or data entities in a
system. It can be used to represent database tables or class relationships. Similar diagrams can be
used in cyber-physical domains using SysML to show the logical structure of a system [8].
3. Tooling
4. Methodology
This section outlines the general method for creating structured models from informal Draw.io
diagrams. The .drawio.png file format is the primary focus because of its practical use in
documentation [15].
1. Begin with a Draw.io diagram, saved as a .drawio.png file. Those wishing to reproduce this
method may use the image below as it uses this format. In this case, a simple activity diagram is
used.
2. Extract the MxFile data from the image. The MxFile XML is stored as metadata in the image.
This can be extracted and converted into a usable format.
3. Convert to JSON (optional). As an intermediate step, the method used in this paper converts
the MxFile XML to JSON. This is done primarily for the convenience of working with JSON over
XML and can be skipped if needed.
4. Convert to NetworkX. Next, data is converted into a usable model. The primary method used
for the demonstration is Draw.io, but there are no technical limitations to the format at this point.
With a NetworkX model, a usable format is available for analysis or visualization.
5. Infer Additional Information (optional). A final step is to infer additional information that
might be useful in a model that was not present in the original diagram. This is discussed in
detail later in this paper.
The remaining sections of this paper cover each of these stages in detail. First, data extraction
and format conversion are discussed, followed by inferences. Finally, an end-to-end example
demonstrates indexing, query, and analysis concepts.
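Step 2 of the method outlined in this section (recovering the MxFile XML from the image metadata) can be sketched in Python using only the standard library. The sketch below reads the tEXt chunks of a PNG file, which is standard PNG structure. It rests on an assumption not stated in this paper: that the diagram is stored URL-encoded in a tEXt chunk whose keyword is mxfile. The function names are hypothetical and are not taken from the published source code.

```python
import struct
from urllib.parse import unquote

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_text_chunks(data: bytes) -> dict:
    """Return {keyword: text} for every tEXt chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # A tEXt body is "keyword", a NUL byte, then Latin-1 text.
            keyword, _, text = body.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = text.decode("latin-1")
        if chunk_type == b"IEND":
            break
        pos += 12 + length
    return chunks


def extract_mxfile(data: bytes, keyword: str = "mxfile") -> str:
    """Assumption: the diagram XML is URL-encoded under the given keyword."""
    return unquote(read_png_text_chunks(data)[keyword])
```

For a file on disk, the bytes would be read with open(path, "rb").read() and passed to extract_mxfile. Compressed text chunks (zTXt, iTXt) are not handled by this sketch.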
The data value in the content attribute contains escaped HTML. Unescaping the HTML gives an
MxFile; however, the data is still encoded. The data is URL encoded, then deflated, and base-64
encoded. Reversing this encoding gives the MxGraph XML, as shown in Fig. X.
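The decoding chain described above can be sketched in Python as follows. Reversing the stated order means base-64 decoding first, then inflating, then URL decoding. The sketch assumes that the deflate stream is raw (no zlib header), which is why a negative window size is passed to zlib; this detail is an assumption and is not stated in the text. The function names are hypothetical, and encode_diagram exists only so that the round trip can be checked.

```python
import base64
import zlib
from urllib.parse import quote, unquote


def decode_diagram(payload: str) -> str:
    """Reverse the encoding: base-64 decode, inflate, then URL decode."""
    deflated = base64.b64decode(payload)
    # wbits=-15 selects a raw deflate stream (assumption, see lead-in).
    url_encoded = zlib.decompress(deflated, -15).decode("utf-8")
    return unquote(url_encoded)


def encode_diagram(xml_text: str) -> str:
    """Apply the forward encoding: URL encode, deflate, base-64 encode."""
    url_encoded = quote(xml_text, safe="").encode("utf-8")
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    deflated = compressor.compress(url_encoded) + compressor.flush()
    return base64.b64encode(deflated).decode("ascii")
```

The HTML unescaping step mentioned above is separate and could be performed beforehand with html.unescape from the standard library.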
The tool CyberChef [16] was used to represent the complete extraction method in a platform
agnostic way. The CyberChef recipe for converting the original SVG file to the MxGraph XML data
is:
Figure 10. The CyberChef recipe for converting the original SVG file to the MxGraph XML data.
This uses a custom MxGraph class which contains the logic for parsing the MxGraph into a
flattened list of dictionaries.
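As an illustration of what such flattening might look like, the following sketch parses MxGraph XML with the Python standard library and returns one dictionary per mxCell element, merging in the position and size from the nested mxGeometry element. It is not the custom MxGraph class referred to above; the function name flatten_mxgraph and the choice of merged keys are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def flatten_mxgraph(xml_text: str) -> list:
    """Return one flat dictionary per mxCell element in the MxGraph XML."""
    root = ET.fromstring(xml_text)
    cells = []
    for cell in root.iter("mxCell"):
        # Copy the cell attributes (id, value, style, vertex, edge, ...).
        record = dict(cell.attrib)
        geometry = cell.find("mxGeometry")
        if geometry is not None:
            # Lift the bounding box onto the cell record as numbers.
            for key in ("x", "y", "width", "height"):
                if key in geometry.attrib:
                    record[key] = float(geometry.attrib[key])
        cells.append(record)
    return cells
```

Each dictionary then carries the fields needed by later stages: the vertex or edge flag, the source and target identifiers of an edge, and the bounds of a vertex.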
7. Creating Models
NetworkX is used to represent the MxFile contents in a more usable format. It provides functionality
for working with graph data structures, such as traversal and analysis, and visualization capabilities.
The following Python code demonstrates how to convert the MxFile contents to NetworkX. In
short, all diagram elements are traversed and identified as either a node or an edge. The appropriate
NetworkX functions are used to add the elements to a NetworkX graph. Note that the below code
snippet simplifies the actual code to illustrate the methodology. The algorithm used in practice
captures node label information, style, and coloring. The complete code is available online.
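A minimal sketch of this conversion is given here for illustration; it is not the authors' published code. It assumes the diagram has already been flattened into a list of dictionaries whose keys mirror the mxCell attributes (id, value, style, vertex, edge, source, target). The function name cells_to_graph and the attribute names label and relationship are hypothetical.

```python
import networkx as nx


def cells_to_graph(cells: list) -> nx.DiGraph:
    """Build a directed graph from flattened diagram cells."""
    graph = nx.DiGraph()
    # First pass: every cell flagged as a vertex becomes a node.
    for cell in cells:
        if cell.get("vertex") == "1":
            graph.add_node(
                cell["id"],
                label=cell.get("value", ""),
                style=cell.get("style", ""),
            )
    # Second pass: every cell flagged as an edge becomes a graph edge,
    # provided both of its endpoints are known nodes.
    for cell in cells:
        if cell.get("edge") == "1":
            source, target = cell.get("source"), cell.get("target")
            if source in graph and target in graph:
                graph.add_edge(source, target, relationship="EDGE")
    return graph
```

Two passes are used so that an edge appearing before its endpoints in the cell list is still connected. Edges with a missing endpoint, which informal diagrams often contain, are skipped in this sketch rather than repaired.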
8. Inferring Information
After extracting data from the diagram and converting to a NetworkX graph, the graph can be
analyzed to infer information not contained in the diagram data. This section discusses several of
these concepts with examples.
In this example, the graph is analyzed for nodes that are inside other nodes. By comparing each
node to each other node, elements whose bounds lie entirely within the bounds of another element
are identified. If this condition is true, an in relationship is added between those two elements. While
this may be an inefficient approach to this problem at scale, diagrams are designed to be visual and
therefore should not reach a scale where this becomes a computationally hard problem.
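The pairwise comparison described above can be sketched as follows. The sketch assumes each element's bounds are available as a dictionary with x, y, width, and height keys; the function names and the sample element names are hypothetical. It returns (inner, outer) pairs, each of which could then be added to the graph as an in relationship.

```python
def is_inside(inner: dict, outer: dict) -> bool:
    """True when the bounds of inner lie entirely within the bounds of outer."""
    return (
        outer["x"] <= inner["x"]
        and outer["y"] <= inner["y"]
        and inner["x"] + inner["width"] <= outer["x"] + outer["width"]
        and inner["y"] + inner["height"] <= outer["y"] + outer["height"]
    )


def find_containment(boxes: dict) -> list:
    """Compare every element with every other element (quadratic cost)."""
    pairs = []
    for inner_id, inner in boxes.items():
        for outer_id, outer in boxes.items():
            if inner_id != outer_id and is_inside(inner, outer):
                pairs.append((inner_id, outer_id))
    return pairs


boxes = {
    "vpc": {"x": 0, "y": 0, "width": 300, "height": 200},
    "app": {"x": 20, "y": 20, "width": 80, "height": 40},
    "laptop": {"x": 400, "y": 20, "width": 80, "height": 40},
}
pairs = find_containment(boxes)  # [("app", "vpc")]
```

Note that two elements with identical bounds would be reported as containing each other, and an element nested two levels deep is reported as inside both of its enclosing elements.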
In this case, a complementary bi-directional relationship is created between the two elements.
leveraging modern graph databases, this sort of pattern can be identified without the need for
making the inference manually.
Grouping by proximity. Grouping related elements together based on proximity, within some
specified tolerance, has also been suggested.
Element type identification. Identifying different types of diagram elements can also be useful.
This can be done either by adding data fields to elements or considering element style.
Edge types / edge labels. The demonstrated approach ignores edge labels in MxGraph because they
add complexity and distract from the core method being presented. In MxGraph, edge
labels are vertices, which makes their role in the graph more complex. In future work, it is
recommended that this be reconciled so that edge labels are represented in the model.
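As one possible starting point for element type identification based on style, the sketch below parses an MxGraph style string, which is a semicolon-separated list of key=value pairs that may also contain bare style names such as ellipse. The function names, the _names key, and the fallback to rectangle when no shape is given are assumptions made for this illustration and are not part of the demonstrated method.

```python
def parse_style(style: str) -> dict:
    """Split a style string into key/value pairs plus a list of bare names."""
    parsed = {"_names": []}
    for token in style.split(";"):
        token = token.strip()
        if not token:
            continue
        key, separator, value = token.partition("=")
        if separator:
            parsed[key] = value
        else:
            # A token without "=" is a bare style name, e.g. "ellipse".
            parsed["_names"].append(key)
    return parsed


def element_type(style: str) -> str:
    """Guess an element type: explicit shape, else first bare name, else rectangle."""
    parsed = parse_style(style)
    if "shape" in parsed:
        return parsed["shape"]
    if parsed["_names"]:
        return parsed["_names"][0]
    return "rectangle"
```

The returned type string could be stored as a node attribute so that later queries can filter by element type.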
9. End-to-End Example
In this section, an end-to-end example begins with an architecture diagram and uses the data
extraction methods presented to generate models that can be queried. Analysis techniques are
demonstrated to answer representative real-world questions and provide insight into a system.
Figure 17. Simplified representation of a network that might be used in a small business.
This example will demonstrate how one might analyze a network to understand interactions
between systems, analyze impacts, or assess risk. Next, the diagram is converted to a NetworkX
graph, and additional information is inferred, as described in the previous sections. The edges
colored red are the inferred relationships that capture the geometric containment (“in”
relationships) of diagram elements inside the network enclaves. In the following sections, this
information is used to show how to query the model.
Consider a more practical scenario where an enterprise has identified a critical or sensitive asset.
The graph model can be used to identify high-risk components in the architecture (e.g., nodes that
connect to that asset directly or indirectly). The Cypher query below demonstrates how to do this.
In this case, a similar MATCH pattern is used, with one notable exception: the relationship is
specified as a variable-length path (e.g., r:EDGE*0..4) and only includes the EDGE type (ignoring the
inferred IN relationships shown in previous sections). The query then applies a WHERE condition to
limit the database match to only the high-value asset.
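For readers working with the NetworkX model rather than Neo4j, a comparable query can be sketched in Python. This is a NetworkX analogue written for illustration, not the Cypher query itself. It assumes that each edge carries a relationship attribute set to either EDGE or IN, and that the pattern is matched without regard to edge direction; both are assumptions, as are the function and node names.

```python
import networkx as nx


def within_hops(graph: nx.DiGraph, asset, max_hops: int = 4,
                relationship: str = "EDGE") -> dict:
    """Return {node: hop count} for nodes within max_hops of the asset."""
    # Keep only edges of the requested type and ignore their direction.
    connectivity = nx.Graph()
    connectivity.add_nodes_from(graph.nodes)
    connectivity.add_edges_from(
        (u, v)
        for u, v, data in graph.edges(data=True)
        if data.get("relationship") == relationship
    )
    # Breadth-first search from the asset, cut off at max_hops.
    return dict(
        nx.single_source_shortest_path_length(connectivity, asset, cutoff=max_hops)
    )
```

The result includes the asset itself at distance zero, mirroring the lower bound of the 0..4 range. The hop counts give a simple way to rank components by their proximity to the high-value asset.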
Figure 22. Get all items up to four steps from the data lake.
10. Results
A general method for creating models from informal architecture diagrams generated with
Draw.io was presented. The method was effective for creating models that can be queried to analyze
a system's characteristics.
This method contributes to systems and software architecture knowledge by providing a
methodology for addressing the gap between formal and informal architectural artifacts [18]. While
clear gaps remain in the need to keep implemented systems aligned with their architecture [5], this
method makes creating formal architecture artifacts easier and more accessible to implementation
teams based on their current workflows that leverage informal documents rather than formal
architectures [19].
It is important to note that trends in modeling are moving towards more text-based, code-like
formats, as demonstrated by the initial release of SysML 2.0 [20]. This is designed to make MBSE
(Model-Based Systems Engineering) models more semantic and executable and allow them to follow
processes reminiscent of software engineering.
converting informal diagrams into structured, formal models by employing large language models
[21] and advanced pattern recognition systems, such as those utilizing transfer learning [22] from
networks like ResNet-50 [23]. This automation will speed up the process and reduce human error
and the subjectivity of manual interpretation. The methodology proposed in this paper can be used
to build a massive learning book that includes structured and unstructured data.
12. Conclusions
This paper presented a method for creating models from informal architectural artifacts. While
this paper has demonstrated that the problem is feasible, much work remains. First, we need to
explore the feasibility of this approach with other tools. While Draw.io is a popular tool, it is not the
only tool used to create informal diagrams. Second, intelligent inferences about the diagram should
be further explored. Further intelligence can make this a far more practical technique for analysis.
One such approach beyond the inferences presented in this paper is to consider applying this
technique at scale with many diagrams describing a system and exploring the ability to link common
elements across diagrams. Finally, we don’t see a path toward industry adoption without easy-to-
use, robust software tooling. We view that as a critical step towards adopting this technique in
practice rather than a research novelty. Artificial intelligence, particularly pattern recognition and
large language models, has tremendous potential to support and enhance the methodology.
Author Contributions: Mr. Joshua Kaplan contributed to the conceptualization and methodology. Dr. Luis
Rabelo reviewed, enhanced, edited, and added the future of Artificial Intelligence.
Data Availability Statement: The complete source code for this paper is publicly available at
https://github.com/josh-kaplan/extracting-data-from-diagrams. Additional follow-on content will be published
at https://jdkaplan.com/papers/informal-architecture.
References
1. Basili, V.; Briand, L.; Bianculli, D.; Nejati, S.; Pastore, F.; Sabetzadeh, M. Software Engineering Research and
Industry: A Symbiotic Relationship to Foster Impact. IEEE Software 2018, 35, 44-49,
doi:10.1109/MS.2018.290110216.
2. M. Richards and N. Ford. Fundamentals of Software Architecture. O'Reilly Media, Inc.
https://learning.oreilly.com/library/view/fundamentals-of-software/9781492043447/
3. E. Carroll and R. Malins. Systematic Literature Review: How is Model-Based Systems Engineering
Justified? Sandia National Laboratories. 2016. https://doi.org/10.2172/1561164
4. M. Ozkaya. Do the informal & formal software modeling notations satisfy practitioners for software
architecture modeling? Information and Software Technology 2018, 95, 15-33,
doi:10.1016/j.infsof.2017.10.008.
5. J. Keim, Y. Schneider, and A. Koziolek. Towards consistency analysis between formal and informal
software architecture artefacts. In Proceedings of the 2019 IEEE/ACM 2nd International Workshop on
Establishing the Community-Wide Infrastructure for Architecture-Based Software Engineering (ECASE),
27 May 2019; pp. 6-12. https://ieeexplore.ieee.org/document/8815606
6. Ali, N.; Baker, S.; O’Crowley, R.; Herold, S.; Buckley, J. Architecture consistency: State of the practice,
challenges and requirements. Empirical Software Engineering 2018, 23, 224-258,
doi:10.1007/s10664-017-9515-3.
7. M. Fowler. Software Architecture Guide. https://martinfowler.com/architecture/
8. Object Management Group. OMG® Unified Modeling Language® (OMG UML®), Version 2.5.1. 2023.
https://www.omg.org/spec/UML/2.5.1/PDF
9. Object Management Group. OMG Systems Modeling Language™ (SysML®), Version 2.0 Beta, Part 1
Language Specification. 2023. https://www.omg.org/spec/SysML/2.0/Beta1/Language/PDF
10. JGraph Ltd. draw.io. July 2023. https://www.drawio.com/
11. JGraph Ltd. Github - jgraph/drawio-desktop (Source Code). July 2023.
https://github.com/jgraph/drawio-desktop
12. Henning Dieterichs. Github - hediet/vscode-drawio (Source Code). July 2023.
https://github.com/hediet/vscode-drawio
13. Henning Dieterichs. Draw.io Integration - Visual Studio Marketplace. July 2023.
https://marketplace.visualstudio.com/items?itemName=hediet.vscode-drawio
14. JGraph Ltd. MxGraph. https://jgraph.github.io/mxgraph/
15. J. Kaplan. Agile Architecture in Practice. 2023. https://jdkaplan.com/articles/agile-architecture-in-practice
16. GCHQ. CyberChef. https://gchq.github.io/CyberChef/
17. I. Robinson, J. Webber, and E. Eifrem. Graph Databases, 2nd Edition. O'Reilly Media, Inc. 2015.
https://learning.oreilly.com/library/view/graph-databases-2nd/9781491930885/
18. Kassab, M.; Mazzara, M.; Lee, J.; Succi, G. Software architectural patterns in practice: an empirical study.
Innovations in Systems and Software Engineering 2018, 14, 263-271, doi:10.1007/s11334-018-0319-4.
19. Schilling, R.D.; Aier, S.; Winter, R. Designing an Artifact for Informal Control in Enterprise Architecture
Management. In Proceedings of the ICIS, 2019, Munich, Germany, Dec 15-18, 2019.
20. Object Management Group. OMG Systems Modeling Language™ (SysML®), Version 2.0 Beta. 2023.
https://www.omg.org/spec/SysML/2.0/Beta1
21. Zhang, T.; Ladhak, F.; Durmus, E.; Liang, P.; McKeown, K.; Hashimoto, T.B. Benchmarking Large
Language Models for News Summarization. Transactions of the Association for Computational Linguistics 2024,
12, 39-57, doi:10.1162/tacl_a_00632.
22. Hosna, A.; Merry, E.; Gyalmo, J.; Alom, Z.; Aung, Z.; Azim, M.A. Transfer learning: a friendly introduction.
Journal of Big Data 2022, 9, 102, doi:10.1186/s40537-022-00652-w.
23. Zhang, L.; Bian, Y.; Jiang, P.; Zhang, F. A Transfer Residual Neural Network Based on ResNet-50 for
Detection of Steel Surface Defects. Applied Sciences 2023, 13, 5260.