Proc. of International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA 2025)
7-9 August 2025, Antalya-Türkiye

A signature-based model for learning lung cancer stage from multiplex immunofluorescence image data with spatial summary functions
1st Christelle Agonkoui, MODAL Team, Centre Inria de l’Université de Lille, Lille, France
2nd Hua Cao, Department of Health and Environment, Junia HEI-ISEN-ISA, Lille, France
3rd Sophie Dabo-Niang, MODAL Team, Centre Inria de l’Université de Lille, Lille, France
4th Gaurav Dhar, MODAL Team, Centre Inria de l’Université de Lille, Lille, France
5th Nicolas Jankovsky, MODAL Team, SATT Nord, Lille, France
6th Karin Sahmer, ULR 4515 - LGCgE, Laboratoire de Génie Civil et géo-Environnement, Univ. Lille, IMT Lille Douai, Univ. Artois, JUNIA, Lille, France

Abstract—This study investigates the use of signatures as features within a classification model designed to determine lung cancer stages from multiplex immunofluorescence image data utilizing spatial summary functions. We specifically evaluated the performance of three classifiers: LDA, Logistic Regression, and Adaboost, employing various features, including signatures. Our findings indicate that the highest accuracy was attained using signatures as features alongside the Logistic Regression classifier. These outcomes suggest that signatures hold potential as valuable features in medical and cancer diagnosis classification tasks.
Index Terms—Feature extraction, feature selection, functional data analysis, medical diagnosis, spatial statistics, signatures, supervised learning

I. INTRODUCTION
Cancer continues to be a leading cause of mortality worldwide, accounting for nearly 10 million deaths in 2020 [1]. Lung cancer is one of the most prevalent cancers worldwide (with over 2 million new cases in 2020), and its early detection and accurate diagnosis are critical for effective treatment. For lung cancer patients diagnosed in 2003-2009, the 5-year relative survival rate was 54% for stage 1, 26% for stage 2, and 4% for stage 3. In general, cancer diagnosis draws on several kinds of data, including images, biomarkers and blood tests. Single-cell data acquisition methods such as mass cytometry [2] and dynamic profiling of fluorescent fusion proteins [3] were developed years ago and are commonly found in modern biological labs. However, only recently has single-cell data been incorporated into cancer diagnosis, and technologies continue to be developed to obtain more accurate information at the cellular level [4]. Single-cell data has the advantage of giving detailed information about tumors and their microenvironment. This information not only improves cancer diagnosis but the hope is also to gain more insight into tumor heterogeneity and the stage of the cancer. In recent work by Johnson et al., a single-cell multiplex immunofluorescence dataset of non-small cell lung carcinoma tissue was obtained using a Vectra 3.0 imaging system [5]. This dataset is known as the VectraPolarisData and is freely available on Bioconductor/ExperimentHub. Significantly, it was found that differences in the numbers of immune cells and in the location of specific populations within the tumor microenvironment influence patient outcomes.

It is natural to ask whether accurate computational models can be developed to leverage single-cell data for cancer diagnosis. Indeed, machine learning and statistical models have become essential tools for diagnosing cancer from image data [6]. Functional data analysis (FDA) is a field of statistics that treats functional data like curves and images as predictors or outcomes in statistical models [7]. FDA forms an ideal analytic framework for connecting cell spatial relationships to patient outcomes. Recently, Wrobel et al. have used FDA to analyze and represent spatial summary statistics from the VectraPolarisData images. Spatial summary statistics summarize the amount of clustering and co-occurrence of cells in the tumor microenvironment [8]. It has been shown that the degree of clustering of different types of immune cells is significantly associated with overall survival in cancer. It is therefore reasonable to study spatial summary functions as functional data for predicting cancer diagnosis and outcomes.

The theory of signatures is a mathematical tool that can be used in FDA to encode features for statistical models incorporating functional data [9]–[11]. The signature of a path was initially defined by Chen [12], [13] and rediscovered by Lyons in the context of rough path theory [14]. Signatures
have been widely used in many machine learning applications such as character recognition [15], finance [16] and medicine [17]. They have demonstrated good performance for machine learning algorithms as feature extraction methods. A signature-based model for early detection of sepsis won the PhysioNet Challenge 2019, outperforming all other methods [18].

Compared to other feature extraction methods in machine learning, signatures have strong theoretical guarantees, and it is possible to reconstruct the functional data from a representation of its signatures [19]. However, to the best of our knowledge, signatures have not been extensively applied in machine learning or computational models with cancer imaging data. In this paper, we aim to bridge this gap by introducing a signature-based model for learning lung cancer stage from multiplex immunofluorescence image data with spatial summary functions. Specifically, we encode the spatial summary functions from non-small cell lung cancer image data available in the VectraPolarisData package using signatures and perform hyperparameter optimization and tuning to achieve robust performance using machine learning models.

II. METHODS

A. Spatial summary functions for image data
[To do: Sophie, define the random field for the images.]
We use spatial summary metrics from each image that summarize the spatial clustering of immune cells as a function of radius. Spatial methods from the geospatial statistics literature, including Ripley’s K and nearest neighbor G, have become popular for summarizing cell-type clustering in spatial single-cell data [20]. The Ripley’s K function describes the colocalization of two different cell types in an image [21]. Mathematically, the Ripley’s K function is given by:

$$K(r) = \frac{|A|}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i}^{m} \mathbf{1}(d\{c_i, c_j\} \le r)\, e_{ij} \tag{1}$$

where $d\{c_i, c_j\}$ is the pairwise distance between cells $c_i$ and $c_j$, $|A|$ is the tissue area, $\mathbf{1}$ is an indicator function, and $e_{ij}$ is an edge correction to account for bias that occurs for points at the boundary of the tissue region. In practice, Ripley’s K function is often re-scaled to Ripley’s L function:

$$L(r) = \sqrt{\frac{K(r)}{\pi}}. \tag{2}$$

Another spatial summary function, the G function, is given as the probability that the nearest cell of type $c_1$ lies within a radius $r$ of a cell of the same type and is defined as:

$$G(r) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(d_{NN}\{c_{1i}\} \le r) \tag{3}$$

where $d_{NN}\{c_{1i}\}$ is the nearest-neighbor distance for cell type $c_1$, defined as the shortest distance between a specific point in a point pattern and its closest neighboring point.

Spatial summary functions such as the Ripley’s K function are used to determine whether an observed collection of points is distributed differently from complete spatial randomness (CSR). As an example, for a homogeneous Poisson process on an infinite study area, the Ripley’s K function is given by $K(r) = \pi r^2$, while in practice the CSR is computed empirically with Monte Carlo simulations. Unless otherwise stated, we subtract the CSR value from each spatial summary function and use the result as a metric for the clustering and co-occurrence of cells. Specifically, we use the Ripley’s L function and nearest neighbor G as functional data from the images before extracting features using signatures.

B. Functional Data Analysis
Functional data analysis (FDA) is a field of statistics where each sample element is considered to be a random function with desired regularity, generally square-integrable. Once a suitable representation of functional data can be found with a fixed number of basis functions, we can use the coefficients as features or calculate different properties to be used as features in classification and regression tasks. We used different features computed from a B-spline representation of our spatial summary functions, including the maximum and minimum of each summary function, the peak-to-peak value (the maximum absolute difference within each summary function) and the maximum of the derivative computed using the B-spline representation. We also consider another feature that we call the “wigglyness”. Given a function $\gamma$ over $[0, T] \subset \mathbb{R}$, the wigglyness of $\gamma$ is computed as the integral of the square of the derivative of the B-spline representation of our summary functions:

$$\mathrm{wigglyness}(\gamma) = \int_0^T (\gamma'(t))^2 \, dt. \tag{4}$$

We group together the previously mentioned features (the maximum and minimum of each summary function, the peak-to-peak value and the maximum of the derivative), along with the wigglyness, to define the set of features we call the curve features of our data.

C. Signatures
The signature of a path refers to a set of statistics or features that summarize the key characteristics of a continuous, multidimensional path, often used in the context of time series data, stochastic processes, or machine learning for sequential data. The signature is a mathematical tool that encodes the information of a path into a structured format that can be used to analyze or compare paths. This concept originates from rough path theory, which is a branch of mathematics that studies paths in a way that generalizes classical calculus to handle irregular or complex paths.
We recall that:
1. Path: A path is a continuous function $\gamma : [0, T] \to \mathbb{R}^d$ where $\gamma(t)$ represents the state of the system at time $t$ in $d$-dimensional space.
2. Signature of a Path: The signature of a path is a sequence of iterated integrals that captures the geometric and algebraic properties of the path. For a path $\gamma$, the signature $S(\gamma)$ is defined as the collection of all iterated integrals of the path over different levels:

$$S(\gamma) = \left(1, \int_0^T d\gamma(t), \int_0^T \int_0^t d\gamma(s)\, d\gamma(t), \ldots \right) \tag{5}$$

Each term represents a higher-order summary of the path, capturing more complex relationships between different parts of the path.
3. Iterated Integrals: Iterated integrals are the building blocks of the signature. For a path $\gamma : [0, T] \to \mathbb{R}^d$, the first-level (also named order) iterated integral is simply the total increment of the path:

$$\int_0^T d\gamma(t) = \gamma(T) - \gamma(0)$$

The second-level (or order) iterated integral takes into account pairwise interactions:

$$\int_0^T \int_0^t d\gamma(s)\, d\gamma(t)$$

And so on for higher levels.
4. Truncation: Since the full signature is an infinite sequence of iterated integrals, in practical applications it is common to truncate the signature at a certain level. This is because the length of the signature grows exponentially with its level. The length of a signature up to level $k$ of a $d$-dimensional path is given by $\sum_{i=1}^{k} d^i = \frac{d(d^k - 1)}{d - 1}$. Thus, one may only consider up to second- or third-level integrals, depending on the complexity of the problem. The signature of $\gamma$ up to level $k$ is the collection of all iterated integrals of the form:

$$S^k(\gamma) = \left(1, \int_0^T d\gamma(t_1), \ldots, \int_0^T \int_0^{t_1} \cdots \int_0^{t_{k-1}} d\gamma(t_k) \cdots d\gamma(t_1) \right) \tag{6}$$

The signature has several desirable properties, including invariance under certain transformations such as time reparametrization. By encoding a continuous path into a series of iterated integrals, it can represent complex, high-dimensional paths, and it is used in rough path theory, machine learning, and stochastic analysis to study or compare paths in a robust and invariant way.

D. Calibration
Probability calibration is an essential step in machine learning classification tasks that ensures that predicted probabilities accurately reflect the true likelihood of an event. Many machine learning models, including logistic regression, support vector machines, and neural networks, output probability scores, but these scores may not be accurate estimates of the posterior probability for the predicted classes, leading to misleading confidence levels in predictions. Calibration techniques, such as Platt scaling and isotonic regression, help adjust the predicted probabilities to align them with empirical probabilities, improving reliability in decision-making applications. Platt scaling is a popular method that fits the output probabilities using a logistic regression model. Let us recall that, as a classifier, logistic regression estimates class probabilities by minimizing the log loss (the negative log-likelihood) and is particularly effective when the relationship between the predictors and the log-odds of the outcome is linear [?]. Formally, Platt scaling is a function $c(\cdot)$ that takes probabilities $p$ and maps them through the sigmoid:

$$c(p) = \frac{1}{1 + e^{-(ap + b)}} \tag{7}$$

where $c(p)$ is the calibrated probability and $a$ and $b$ are the parameters to be fitted.
Isotonic regression is a non-parametric approach to mapping non-probabilistic classifier scores to probabilities. It relaxes the assumption of a sigmoidal relationship between the model scores and empirical frequencies made by Platt scaling to an isotonic (non-decreasing) one. The following model is used:

$$c(p) = m(p) + \epsilon \tag{8}$$

where the isotonic function $m$ is found by minimizing a squared-loss function [?].

III. CLASSIFICATION OF CANCER STAGE

A. Dataset
We used a single-cell lung cancer imaging dataset, publicly available as part of the VectraPolarisData package on Bioconductor. The lung cancer VectraPolarisData image dataset is a segmented and phenotyped multiplex immunofluorescence dataset of non-small cell lung carcinoma tissue containing spatial coordinates and other sample characteristics for over 1.5 million cells [5], [8]. Below are examples of segmented and phenotyped slices of lung cancer images with immune cells and other cell types. The image on the left is of stage 1 cancer while that on the right is of stage 3 cancer.

Fig. 1: Labelled images of stage 1 and stage 3 lung cancer with immune cells.

In more detail, the dataset contains regions of interest (ROIs) from tissue slices of patients. Cell and tissue segmentation and cell phenotyping were performed using
inForm software. In the lung dataset, individual cells such as immune cells are labeled, and there is also patient metadata such as age, gender, survival days, survival status and cancer stage. [More information on the metadata should be given here; a descriptive summary could be provided.]

The stage counts for different patients show that the different cancer stages 1, 2, and 3 are not equally represented. Furthermore, our motivation is to classify stage 1 cancer, since its accurate diagnosis leads to improved survival rates for patients. Therefore we grouped stages 2 and 3 together for a final count of 32 patients with stage 1 cancer and 15 patients with later stage cancer.

B. Summary Functions and Signature Features
Our goal is to describe the spatial clustering of immune cells in the images corresponding to each patient as a function of radius. To accomplish this, the Ripley’s L and nearest neighbor G summary functions were computed, and the corresponding functions under complete spatial randomness (CSR) were subtracted from them. In more detail, the Ripley’s L and nearest neighbor G functions were computed at 100 different points. We give plots of the Ripley’s L and nearest neighbor G functions below for the two classes: stage 1, and stages 2 and 3 together.

Fig. 2: Ripley’s L function for all patient images.

Fig. 3: Nearest neighbor G function for all patient images.

We computed the signatures of the Ripley’s L functions and nearest neighbor G functions up to level 6 and standardized them by subtracting the mean and dividing by the standard deviation. We note that already at level 6 the signature length is 1111110, and computing higher levels becomes computationally intractable. We used the anomaly detection algorithm Isolation Forest to remove outliers in the L and G functions.
[From here on, the rest of the paper is to be rewritten with the results of our current methodology.]

C. Model training and Hyperparameter optimization
We trained the following classification models: Linear discriminant analysis (LDA), Logistic Regression (LogRegression) and Adaptive Boosting (AdaBoost) using the raw data of the Ripley’s L and nearest neighbor G functions. We performed repeated random sub-sampling validation with all levels of signatures up to level 6, with 100 runs each using an 80% training and 20% testing split. We determined which level of signatures gave the best accuracy from these results. We then compared the accuracy obtained with the raw functional data, signatures, curve features and wigglyness as features.

IV. RESULTS AND DISCUSSION
We first obtained results for the hyperparameter optimization on the level of the signature. Fig. 4 plots the results with the LDA, LogRegression and Adaboost classifiers.

Fig. 4: Average accuracy with different levels (also called orders) of the signature.

We see that the best accuracy was obtained with the signatures of level 4 and the Logistic Regression classifier. In order to select the best level of signatures, we then compared the average accuracy of all three classifiers combined over different levels of signatures, as shown in Fig. 5.
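To make the notion of signature level compared here more concrete, the following is a minimal sketch of how a truncated signature can be computed for a sampled curve. It is not the implementation used in this study: it assumes the path is piecewise linear between sample points, stores each coefficient in a dictionary keyed by its index word, and all function names (`segment_signature`, `chen_product`, `path_signature`, `signature_length`) are hypothetical. It relies on two standard facts: the level-n term of a straight segment with increment Δ is Δ^{⊗n}/n!, and the signature of a concatenated path is the tensor product of the signatures of its pieces (Chen's identity).

```python
from itertools import product
from math import factorial


def segment_signature(delta, level):
    """Signature of a straight segment with increment `delta`, truncated at `level`.

    The level-n coefficient for the index word (i1, ..., in) is
    delta[i1] * ... * delta[in] / n!.
    """
    d = len(delta)
    sig = {(): 1.0}  # level 0 is always 1
    for n in range(1, level + 1):
        for word in product(range(d), repeat=n):
            coeff = 1.0
            for i in word:
                coeff *= delta[i]
            sig[word] = coeff / factorial(n)
    return sig


def chen_product(a, b, level):
    """Truncated tensor product of two signatures (Chen's identity)."""
    out = {}
    for u, x in a.items():
        for v, y in b.items():
            if len(u) + len(v) <= level:
                out[u + v] = out.get(u + v, 0.0) + x * y
    return out


def path_signature(points, level):
    """Signature, truncated at `level`, of the piecewise-linear path through `points`."""
    sig = {(): 1.0}
    for p, q in zip(points, points[1:]):
        delta = [qi - pi for pi, qi in zip(p, q)]
        sig = chen_product(sig, segment_signature(delta, level), level)
    return sig


def signature_length(d, k):
    """Number of coefficients in levels 1..k for a d-dimensional path."""
    return sum(d ** i for i in range(1, k + 1))


if __name__ == "__main__":
    # An L-shaped path in the plane: move right by 1, then up by 1.
    corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
    sig = path_signature(corner, level=2)
    for word in sorted(sig, key=lambda w: (len(w), w)):
        print(word, sig[word])
    print("length for d=10, k=6:", signature_length(10, 6))
```

For the L-shaped example, the level-1 terms are the total increments in each coordinate, and the two mixed level-2 terms differ (1 for the word (0, 1) and 0 for (1, 0)), which is how the signature records the order in which the coordinates move. Resampling the same curve at more points leaves the result unchanged, illustrating invariance under reparametrization. The helper `signature_length(10, 6)` returns 1111110, which agrees with the level-6 length quoted in Section III-B if the underlying path has ten channels; that dimension is an inference from the formula, not something stated in the text.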
Fig. 5: Average accuracy of all three classifiers combined over different levels (also called orders) of the signature.

We then obtained the average accuracies and standard deviations for the three classifiers with the different features over 100 runs using an 80% training and 20% testing split. Table I gives the accuracy obtained using signatures (Sig) at level 4 as well as the other features (raw data, curve features (CFeatures) and wigglyness). The average accuracies are followed by the standard deviations in parentheses.

TABLE I: Comparison of classifier accuracy with level 4 signatures and the other features.

Classifier/Features    Raw Data     Sig          CFeatures    Wigglyness
LDA                    0.64 (.15)   0.64 (.13)   0.49 (.15)   0.68 (.11)
Logistic Regression    0.58 (.12)   0.70 (.11)   0.57 (.13)   0.62 (.13)
AdaBoost               0.58 (.11)   0.65 (.13)   0.56 (.12)   0.49 (.13)

From the table, we find that signatures give the highest accuracy for the Logistic Regression and AdaBoost classifiers, and the highest accuracy overall (0.70 with Logistic Regression), compared with the other features (the raw data, curve features and the wigglyness). For LDA, signatures match the raw data (0.64) and the wigglyness scores slightly higher (0.68). The standard deviations (0.11 to 0.15) are of the same order as these differences.

V. CONCLUSION
We have used signatures, a recent tool from the theory of rough paths, as features for learning lung cancer stage from multiplex immunofluorescence image data with spatial summary functions. Specifically, we encoded the spatial summary functions, Ripley’s L functions and nearest neighbor G functions, from non-small cell lung cancer image data available in the VectraPolarisData package by computing their signatures. We then performed hyperparameter optimization to determine the best level of signatures and compared the performance of three different classifiers: LDA, Logistic Regression and Adaboost. We found that the best accuracy was obtained with signatures of level 4 and the Logistic Regression classifier. These results show the promise of using signatures as features in classification tasks for medicine and cancer diagnosis.

ACKNOWLEDGMENT
We want to thank the ANR for allotting us the ANR-21-CE42-0011 research grant to fund this research.

REFERENCES
[1] Sung, Hyuna, et al. “Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries.” CA: A Cancer Journal for Clinicians 71.3 (2021): 209-249.
[2] Bendall, Sean C., et al. “Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum.” Science 332.6030 (2011): 687-696.
[3] Taniguchi, Yuichi, et al. “Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells.” Science 329.5991 (2010): 533-538.
[4] Saadatpour, Assieh, et al. “Single-cell analysis in cancer genomics.” Trends in Genetics 31.10 (2015): 576-586.
[5] Johnson, Amber M., et al. “Cancer cell-specific major histocompatibility complex II expression as a determinant of the immune infiltrate organization and function in the NSCLC tumor microenvironment.” Journal of Thoracic Oncology 16.10 (2021): 1694-1704.
[6] Swanson, Kyle, et al. “From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment.” Cell 186.8 (2023): 1772-1791.
[7] Ramsay, James O., and Bernard W. Silverman, eds. Applied functional data analysis: methods and case studies. New York, NY: Springer New York, 2002.
[8] Wrobel, Julia, et al. “mxfda: a comprehensive toolkit for functional data analysis of single-cell spatial data.” Bioinformatics Advances 4.1 (2024): vbae155.
[9] Fermanian, Adeline. “Functional linear regression with truncated signatures.” Journal of Multivariate Analysis 192 (2022): 105031.
[10] Yap, Zhong Jing, Dharini Pathmanathan, and Sophie Dabo-Niang. “Forecasting mortality rates with functional signatures.” ASTIN Bulletin: The Journal of the IAA 55.1 (2025): 97-120.
[11] Frévent, Camille. “A functional spatial autoregressive model using signatures.” arXiv preprint arXiv:2303.12378 (2023).
[12] Chen, Kuo-Tsai. “Integration of paths, geometric invariants and a generalized Baker-Hausdorff formula.” Annals of Mathematics 65.1 (1957): 163-178.
[13] Chen, Kuo-Tsai. “Integration of paths–A faithful representation of paths by noncommutative formal power series.” Transactions of the American Mathematical Society 89.2 (1958): 395-407.
[14] Lyons, Terry J. “Differential equations driven by rough signals.” Revista Matemática Iberoamericana 14.2 (1998): 215-310.
[15] Yang, Weixin, Lianwen Jin, and Manfei Liu. “Chinese character-level writer identification using path signature feature, DropStroke and deep CNN.” 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015.
[16] Buehler, Hans, et al. “Generating financial markets with signatures.” Available at SSRN 3657366 (2020).
[17] Perez Arribas, Imanol, et al. “A signature-based machine learning model for distinguishing bipolar disorder and borderline personality disorder.” Translational Psychiatry 8.1 (2018): 274.
[18] Morrill, James, et al. “The signature-based model for early detection of sepsis from electronic health records in the intensive care unit.” 2019 Computing in Cardiology (CinC). IEEE, 2019.
[19] Fermanian, Adeline, et al. “The insertion method to invert the signature of a path.” Recent Advances in Econometrics and Statistics: Festschrift in Honour of Marc Hallin. Cham: Springer Nature Switzerland, 2024. 575-595.
[20] Vu, Thao, et al. “SPF: a spatial and functional data analytic approach to cell imaging data.” PLoS Computational Biology 18.6 (2022): e1009486.
[21] Ripley, Brian D. “Modelling spatial patterns.” Journal of the Royal Statistical Society: Series B (Methodological) 39.2 (1977): 172-192.
