Extended Abstracts Fall 2015 Biomedical Big Data; Statistics for Low Dose Radiation Research
Volume 7
Series editors
Enric Ventura
Antoni Guillamon
Since 1984 the Centre de Recerca Matemàtica (CRM) has been organizing scien-
tific events such as conferences or workshops which span a wide range of
cutting-edge topics in mathematics and present outstanding new results. In the fall
of 2012, the CRM decided to publish extended conference abstracts originating
from scientific events hosted at the center. The aim of this initiative is to quickly
communicate new achievements, contribute to a fluent update of the state of the art,
and enhance the scientific benefit of the CRM meetings. The extended abstracts are
published in the subseries Research Perspectives CRM Barcelona within the Trends
in Mathematics series. Volumes in the subseries will include a collection of revised
written versions of the communications, grouped by events.
Editors
Guadalupe Gómez
Pere Puig
M. Luz Calle
Editors
Elizabeth A. Ainsbury
Elisabeth Cardis
Pere Puig
Jochen Einbeck
Editors
Elizabeth A. Ainsbury, Chemical and Environmental Hazards, Public Health England, Chilton, UK
Jochen Einbeck, Mathematical Sciences, Durham University, Durham, UK
Mathematics Subject Classification (2010): 62M10, 62N01, 62P10, 92B15, 92C60, 92D30
Foreword
In the last quarter of 2015, from September 8 to November 27, over 100 biosta-
tisticians, statisticians and mathematicians from 45 different institutions visited the
Centre de Recerca Matemàtica (CRM) in Bellaterra to participate in the Intensive
Research Programme on Statistical Advances for Complex Data. The local orga-
nizers of this research semester were Alejandra Cabaña (Universitat Autònoma de
Barcelona), Malu Calle (Universitat de Vic), Pedro Delicado (Universitat Politècnica
de Catalunya), Anna Espinal (Universitat Autònoma de Barcelona), Guadalupe
Gómez (Universitat Politècnica de Catalunya), Rosa Lamarca (Almirall SA), Pere
Puig (Universitat Autònoma de Barcelona), Montserrat Rué (Universitat de Lleida),
and Àlex Sánchez (Universitat de Barcelona). The program brought together sci-
entists, from enthusiastic Ph.D. students to respected senior professors, working in
relevant areas such as Modeling and analysis of biological and biomedical data,
Biostatistical methods for clinical trials and for complex time-to-event data, and
Statistics and Big Data. The very dynamic and productive atmosphere we enjoyed
translated into equally active courses, seminars and a workshop on Biomedical (Big)
Data, held on November 26 and 27, closing the program.
The workshop was a meeting point for the researchers who are members of
BIOSTATNET, a pioneering Spanish network of biostatisticians. BIOSTATNET has
almost two hundred members organized around eight nodes, led by statisticians
from different universities, with their own research projects and teaching experience
in biostatistics, and working closely with biomedical researchers. The workshop
included five invited talks, a roundtable, eleven contributed oral presentations
and ten posters.
In this volume of the subseries Research Perspectives CRM-Barcelona (published
by Birkhäuser inside the series Trends in Mathematics), we present ten extended
abstracts corresponding to selected talks given by participants in the workshop on
Biomedical (Big) Data. The variety of topics presented bears testimony to the rich
activity that made a success of the workshop, and also of the Intensive Research
Programme. The selected topics include methodological biostatistical and bioinfor-
2 Biomedical Big Data
July 2016
Barcelona, Spain Guadalupe Gómez
Pere Puig
M. Luz Calle
Extreme Observations in Biomedical Data
1 Introduction
In current biomedical research, genetic studies are extensively used to identify the
causes of human diseases and they provide insights for the eventual development of
therapeutic strategies. Integration of different types of data sets, such as gene expres-
sion data, genotype data or clinical information is needed to capture information that
may otherwise be lost in separate analyses. Furthermore, it is crucial to be able to
detect extreme observations, since an extreme value may indicate an individual who
was misdiagnosed, who presents particular clinical features, or who lies at the extreme
end of the disease spectrum. Moreover, the usual scenario with current data is the lack of
C. Arenas (B)
Department of Statistics, University of Barcelona, Barcelona, Spain
e-mail: [email protected]
I. Irigoien
Department of Computer Sciences and Artificial Intelligence,
University of the Basque Country, Leioa, Spain
e-mail: [email protected]
F. Mestres · B. Cormand
Department of Genetics, University of Barcelona, Barcelona, Spain
e-mail: [email protected]
B. Cormand
e-mail: [email protected]
C. Toma
Neuroscience Research Australia, Sydney, NSW, Australia
e-mail: [email protected]
2 Methods
The starting point is an $n \times p$ data matrix ($p$ can be much larger than the
sample size $n$), where the rows correspond to observations (individuals, samples, ...)
and the columns correspond to the measured features, which can be continuous,
binary or multiattribute data (genes, clinical/pathological features, ...). Let $G$ be a
group represented by a $p$-dimensional random vector $Y = (Y_1, \ldots, Y_p)$, with values
in a metric space $S \subset \mathbb{R}^p$ and a probability density $f$ with respect to a
suitable measure $\lambda$. Let $\delta$ be a distance function between any pair of
observations, $\delta_{ij} = \delta(y_i, y_j)$.
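The geometric variability $V(G)$ of $G$ with respect to $\delta$ and the proximity function
$\phi^2(y, G)$ of an observation $y$ to $G$ are defined in [1]. When $\delta$ is the Euclidean
distance, $V(G) = \operatorname{tr}(\Sigma)$ with $\Sigma = \operatorname{cov}(Y)$. The
geometric variability is a variant of Rao's diversity coefficient; see [8].
Given a sample $y_1, \ldots, y_n$ from $G$, these quantities are estimated by
\[ \hat{V}(G) = \frac{1}{2n^2} \sum_{i,j} \delta^2(y_i, y_j) \]
and
\[ \hat{\phi}^2(y, G) = \frac{1}{n} \sum_{i} \delta^2(y, y_i) - \hat{V}(G), \]
respectively. See [3] for a review of these concepts, and for applications see [4, 5]
and references therein.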
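Both estimators need only the matrix of squared distances, so they can be computed directly. The following is a minimal Python sketch, not the authors' implementation: the function names `geometric_variability` and `proximity` and the four-point toy data set are illustrative assumptions, and the squared Euclidean distance is used only so that the result can be checked against tr(Σ).

```python
import numpy as np

def geometric_variability(D2):
    """V_hat(G) = (1 / (2 n^2)) * sum over all ordered pairs of delta^2(y_i, y_j)."""
    n = D2.shape[0]
    return D2.sum() / (2.0 * n * n)

def proximity(d2_to_group, v_hat):
    """phi_hat^2(y, G) = (1 / n) * sum_i delta^2(y, y_i) - V_hat(G)."""
    return d2_to_group.mean() - v_hat

# Toy data (made up for illustration): the four corners of a square of side 2.
Y = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])

# Matrix of squared Euclidean distances between all pairs of rows.
D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

v_hat = geometric_variability(D2)

# Proximity of the centroid (1, 1) to the group.
center = np.array([1.0, 1.0])
phi2_center = proximity(((Y - center) ** 2).sum(axis=1), v_hat)
```

For this toy square, `v_hat` coincides with the trace of the covariance matrix normalised by 1/n, and the proximity of the centroid to the group is zero, as expected for the Euclidean distance.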
Definition 3 For each observation $y_i$, the depth function $I(y_i, G)$ is defined by
\[ I(y_i, G) = \left[ 1 + \frac{\phi^2(y_i, G)}{V(G)} \right]^{-1}; \tag{1} \]
see [6].
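As an illustration of Definition 3, the depth of each sample point can be obtained by plugging the sample estimators of the proximity function and the geometric variability into (1). This is a sketch under assumptions: `depth_I` is a hypothetical helper name, the data are a made-up one-dimensional sample, and the squared Euclidean distance stands in for whatever distance suits the data at hand.

```python
import numpy as np

def depth_I(D2):
    """Depth I(y_i, G) = 1 / (1 + phi^2(y_i, G) / V(G)) for every sample point.

    D2 is the n x n matrix of squared distances delta^2(y_i, y_j).
    """
    n = D2.shape[0]
    v_hat = D2.sum() / (2.0 * n * n)   # estimated geometric variability
    phi2 = D2.mean(axis=1) - v_hat     # estimated proximity of each y_i to G
    return 1.0 / (1.0 + phi2 / v_hat)

# Made-up sample with one value far from the rest.
Y = np.array([0.0, 1.0, 2.0, 10.0])
D2 = (Y[:, None] - Y[None, :]) ** 2

depth = depth_I(D2)
deepest = int(np.argmax(depth))        # most central observation
shallowest = int(np.argmin(depth))     # candidate extreme observation
```

In this sample the value 2.0 is the deepest point and the value 10.0 has the smallest depth, which is the behaviour the definition is meant to capture.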
Proposition 4 The function $I$ takes values in $[0, 1]$ and, according to [10], it is a
type C depth function. Furthermore, it satisfies the following desirable properties: (i)
maximality at center; (ii) monotonicity relative to the deepest observation; (iii)
vanishing at infinity; and (iv) depending on the data and the selected distance,
affine invariance.
As $I$ is a depth function, it assigns to any observation a degree of centrality; thus a
small value of $I$, or equivalently a large value of $O = 1/I$, suggests a possible extreme
observation. Note that, by (1), $\hat{O}(y_i, G) = n \sum_j \delta^2(y_i, y_j) / \sum_{j<k} \delta^2(y_j, y_k)$,
where the sum in the denominator runs over pairs $j < k$.
However, with only one observation taking a very large value, $\hat{O}$ already gives
aberrant values. For this reason, we propose the following version of $\hat{O}(y_i, G)$
where, for robustness, the mean is replaced by the median.
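The expression for $\hat{O}$ can be verified by substituting the sample estimators of $\phi^2$ and $V$ into (1). The short derivation below is a sketch that assumes $\delta$ is symmetric with $\delta(y, y) = 0$, so that the sum over all ordered pairs is twice the sum over pairs $j < k$:

```latex
\begin{align*}
\hat{O}(y_i, G)
  &= 1 + \frac{\hat{\phi}^2(y_i, G)}{\hat{V}(G)}
   = \frac{\hat{V}(G) + \frac{1}{n}\sum_{j} \delta^2(y_i, y_j) - \hat{V}(G)}{\hat{V}(G)} \\
  &= \frac{\frac{1}{n}\sum_{j} \delta^2(y_i, y_j)}
          {\frac{1}{2n^2}\sum_{j,k} \delta^2(y_j, y_k)}
   = \frac{2n \sum_{j} \delta^2(y_i, y_j)}{2 \sum_{j<k} \delta^2(y_j, y_k)}
   = \frac{n \sum_{j} \delta^2(y_i, y_j)}{\sum_{j<k} \delta^2(y_j, y_k)} .
\end{align*}
```

Both the numerator and the denominator are sums of squared distances, which is why a single very large distance is enough to distort $\hat{O}$.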
Definition 5 For each observation $y_i$ a new statistic $O_R(y_i, G)$ is defined by
\[ O_R(y_i, G) = \frac{\mathrm{med}_{\delta,i}}{\mathrm{med}_{\delta}}, \tag{2} \]
where $Q_3$ and $M$ are the third quartile and the median of all the $O_R$ values.
Our simulation studies show that the procedure is robust against the masking effect
and can properly identify most of the outliers when mixed data are analyzed.
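A robust statistic of this kind can be sketched in a few lines of Python. Several details are assumptions made only for illustration, because the text does not spell them out here: the numerator of (2) is taken as the median of the squared distances from $y_i$ to the other observations, the denominator as the median of the squared distances over all pairs $j < k$, and the cut-off `q3 + c * (q3 - m)` with a user-chosen constant `c` is a placeholder for rule (3), not the rule itself. The names `robust_extremeness` and `flag_extremes` are hypothetical.

```python
import numpy as np

def robust_extremeness(D2):
    """O_R-type statistic: median distance from y_i over the overall median distance.

    D2 is the n x n matrix of squared distances. The two medians below are
    assumed definitions (see the lead-in), not taken from the paper.
    """
    n = D2.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    med_i = np.array([np.median(D2[i, off_diag[i]]) for i in range(n)])
    med_all = np.median(D2[np.triu_indices(n, k=1)])
    return med_i / med_all

def flag_extremes(o_r, c=3.0):
    """Flag values above q3 + c * (q3 - m); c is an illustrative constant."""
    q3 = np.quantile(o_r, 0.75)
    m = np.median(o_r)
    threshold = q3 + c * (q3 - m)
    return threshold, np.flatnonzero(o_r > threshold)

# Simulated data (made up): 30 standard normal points in 5 dimensions
# plus one point placed far away at (8, ..., 8), stored at index 30.
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(size=(30, 5)), np.full((1, 5), 8.0)])
D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

o_r = robust_extremeness(D2)
threshold, extremes = flag_extremes(o_r)
```

Because medians are used in both the numerator and the denominator, the single distant point barely changes the reference scale, so its own value of the statistic stands out clearly.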
Now consider the study [9], in which 10 autism multiplex families were
analyzed (nine with two affected sibs and one with three affected sibs). First, in
a clinical study, five features were measured in 21 affected individuals: two were
continuous (age and non-verbal intelligence quotient (NVIQ)), and three were
categorical (gender, language delay and autism spectrum category). Using (3) and the
Gower distance [3], the threshold value was $\lambda = 1.519$, and four individuals could
be considered extreme observations. Three of them were males (13, 17 and 20 years
old) with autism and language delay, and they presented NVIQ values indicative of
mental retardation. The most emblematic extreme value corresponded to another
male (25 years old), also with an autism diagnosis and language delay, and presenting
the smallest NVIQ value. Thus, our method highlighted the four individuals from our
study who had the most severe clinical presentation of the disorder. In a second study,
a genetic analysis was performed in the 21 affected individuals and in their parents.
The full exome sequence (the fraction of the genome that encodes proteins,
approximately $3.4 \times 10^7$ nucleotide positions from 20,000 genes) of all family members
was determined. We selected those rare genetic variants (infrequent in the general
population) leading to an amino acid change in the encoded protein that were
transmitted from one parent to the two (or three) affected sibs. The identified mutations,
an average of 36.3 per family, were ranked according to their predicted damaging
effect using the SIFT and PolyPhen-2 tools. In this case, no extreme observations
were detected. This result is consistent with the fact that this type of mutation may
not have a major role in the aetiology of the disorder (as compared to mutations
leading to truncated proteins, not considered here) in the sample of multiplex families
reported previously in [9].
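The clinical analysis above combines continuous and categorical features through the Gower distance. A minimal Python sketch of Gower's similarity coefficient follows, under stated assumptions: the three records are invented and are not the study data, every categorical feature is treated as a simple match or mismatch, there are no missing values, and the squared distance is taken as one minus the similarity, which is one common convention and not necessarily the one used in the study. The name `gower_similarity` is hypothetical.

```python
import numpy as np

def gower_similarity(X_num, X_cat):
    """Gower similarity for mixed data without missing values.

    X_num: (n, p1) float array of continuous features.
    X_cat: (n, p2) array of category labels.
    Continuous features score 1 - |x_i - x_j| / range; categorical ones score
    1 for a match and 0 otherwise; the result is the average over features.
    """
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant column: every pair scores 1
    num_part = 1.0 - np.abs(X_num[:, None, :] - X_num[None, :, :]) / ranges
    cat_part = (X_cat[:, None, :] == X_cat[None, :, :]).astype(float)
    p = X_num.shape[1] + X_cat.shape[1]
    return (num_part.sum(axis=-1) + cat_part.sum(axis=-1)) / p

# Invented records: (age, NVIQ) and (gender, language delay, category).
X_num = np.array([[13.0, 60.0], [25.0, 40.0], [8.0, 100.0]])
X_cat = np.array([["M", "yes", "autism"],
                  ["M", "yes", "autism"],
                  ["F", "no", "ASD"]])

S = gower_similarity(X_num, X_cat)
D2 = 1.0 - S  # squared Gower distance under the assumed convention
```

The matrix `D2` can then be passed to any distance-based statistic that takes squared distances as input.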
Table 1 Columns: cancer data sets; classes (k); samples (n); original genes (p); extreme genes
selected by our criterion (NG); total leave-one-out classification rate, in percentage, using all genes
(CR_all) and using the reduced list of genes (CR_sel)
Data set            k    n     p   NG  CR_all  CR_sel
Alizadeh-2000-v1    2   42  1095  118   90.48   92.86
Alizadeh-2000-v2    3   62  2093  306   98.39   98.39
Armstrong-2002-v1   2   72  1081  193   91.67   98.61
Armstrong-2002-v2   3   72  2194  391   88.89   91.67
Bittner-2000-V1     2   38  2201  279   76.32   84.21
Bittner-2000-V2     3   38  2201  279   63.16   65.79
Bredel-2005         3   50  1739  238   84.00   84.00
Dyrskjot-2003       3   40  1203  217   75.00   82.50
Garber-2001         4   66  4553  391   81.82   83.33
Golub-1999-v1       2   72  1877  321   88.89   90.28
Golub-1999-v2       3   72  1877  321   84.72   88.89
Gordon-2002         2  181  1626  290  100.00   96.69
Khan-2001           4   83  1069  165   53.01   65.06
Laiho-2007          2   37  2202  414   81.08   86.49
Lapointe-2004-v1    3   69  1625  170   81.16   72.46
Lapointe-2004-v2    4  110  2496  249   80.91   70.00
Liang-2005          3   37  1411  179   94.59  100.00
Nutt-2003-v1        4   50  1377  320   72.00   70.00
Nutt-2003-v2        2   28  1070  173   89.29  100.00
Nutt-2003-v3        2   22  1152  246  100.00   90.91
Pomeroy-2002-v1     2   34   857  126   76.47   79.41
Risinger-2003       4   42  1771  255   71.43   71.43
Shipp-2002-v1       2   77   798  137   85.71   75.32
Tomlins-2006-v2     4   92  1288  129   83.70   84.78
West-2001           2   49  1198  180   75.51   75.51
Yeoh-2002-v1        2  248  2526  315   87.90   95.97
Acknowledgements This research was partially supported by Grant 2014 SGR 464 (GRBIO)
from the Departament d’Economia i Coneixement de la Generalitat de Catalunya; by the Basque
Government Research Team Grant (IT313-10) SAIOTEK Project SA-2013/00397; and by the
University of the Basque Country UPV/EHU (Grant UFI11/45 (BAILab)).
References
1. C. Arenas and C.M. Cuadras, “Some recent statistical methods based on distances”, Cont. Sci. 2
(2002), 183–191.
2. C.M. Cuadras and J. Fortiana, “A continuous metric scaling solution for a random variable”,
J. Multivariate Anal. 52 (1995), 1–14.
3. J.C. Gower, “A general coefficient of similarity and some of its properties”, Biometrics 27
(1971), 857–871.
4. I. Irigoien and C. Arenas, “INCA: new statistics for estimating the number of clusters and
identifying atypical units”, Stat. Med. 27 (2008), 2948–2973.
5. I. Irigoien, F. Mestres, and C. Arenas, “The depth problem: identifying the most representative
units in a data group”, IEEE ACM T Comput Bi 10 (2013), 161–172.
6. I. Irigoien, B. Sierra, and C. Arenas, “ICGE: an R package for detecting relevant clusters and
atypical units in gene expression”, BMC Bioinformatics 13 (2013), 30–41.
7. A.C. Kimber, “Exploratory data analysis for possibly censored data from skewed distributions”,
Appl. Stat. 39 (1990), 21–30.
8. C.R. Rao, “Diversity: its measurement, decomposition, apportionment and analysis”, Sankhya
Indian J. Stat. 44 (1982), 1–22.
9. C. Toma, B. Torrico, A. Hervás, R. Valdés-Mas, A. Tristán-Noguero, V. Padillo, M. Maristany,
M. Salgado, C. Arenas, X.S. Puente, M. Bayés, and B. Cormand, “Exome sequencing in
multiplex autism families suggests a major role for heterozygous truncating mutations”, Mol
Psychiatr 19 (2014), 784–790.
10. R. Serfling and S. Zuo, “General notions of statistical depth function”, Ann. Stat. 28 (2000),
461–482.
An Ordinal Joint Model for Breast Cancer