
task2_random-data

February 24, 2023

1 Task 2: Random Data?


1.1 Question
I ran the following code for a binary classification task with an SVM in both R (first
example) and Python (second example).
Given randomly generated data (X) and responses (Y), this code performs leave-group-out
cross-validation 1000 times. Each entry of the resulting prediction vector is therefore
the mean prediction for that row across the CV iterations in which it was held out.
Computing the area under the ROC curve should give ~0.5, since X and Y are completely
random. However, this is not what we see: the area under the curve is frequently
significantly higher than 0.5. The number of rows of X is very small, which can obviously
cause problems.
Any idea what could be happening here? I know that I can either increase the number
of rows of X or decrease the number of columns to mitigate the problem, but I am
looking for other issues.
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
    # get train
    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
        next

    # test from train
    test=seq(length(Y))[-train]

    # train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
When I implement the same thing in Python, I get similar results.

[10]: import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # skip iterations where the training set contains only one class
    # (checks Y[train], not Y, matching the R version)
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))

0.8367346938775511
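A natural follow-up check (a minimal sketch, not from the original question; the helper
name run_experiment and the repetition counts are my own) is to wrap the whole Python
pipeline in a function and rerun it on fresh random data many times, to see how wide
the null distribution of this averaged-prediction AUC actually is:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

def run_experiment(n_iter=1000, n_rows=28, n_cols=100, seed=None):
    """One full run of the pipeline above on fresh random data.

    Returns the single AUC computed from row-wise averaged predictions.
    """
    rng = np.random.default_rng(seed)
    Y = np.array([1, 2] * (n_rows // 2))
    X = rng.uniform(size=(len(Y), n_cols))
    ansMat = np.full((len(Y), n_iter), np.nan)
    for i in range(n_iter):
        train = rng.choice(len(Y), size=len(Y) // 2, replace=False)
        if len(np.unique(Y[train])) == 1:
            continue
        test = np.setdiff1d(np.arange(len(Y)), train)
        mod = SVC(probability=False)
        mod.fit(X[train, :], Y[train])
        ansMat[test, i] = mod.predict(X[test, :])
    ans = np.nanmean(ansMat, axis=1)
    fpr, tpr, _ = roc_curve(Y, ans, pos_label=1)
    return auc(fpr, tpr)

# X and Y are independent by construction, so an unbiased procedure
# would concentrate these AUC values near 0.5; inspect the spread.
aucs = [run_experiment(n_iter=200, seed=s) for s in range(30)]
print(np.mean(aucs), np.percentile(aucs, [5, 95]))

For reference, even a plain empirical AUC on 14-vs-14 independent scores has a null
standard deviation of roughly sqrt((n1+n2+1)/(12*n1*n2)) ≈ 0.11, so single runs well
away from 0.5 are expected to be common at this sample size.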

1.2 Your answer


[ ]:

[ ]:

1.3 Feedback
Was this exercise difficult or not? In either case, briefly describe why.

[ ]:

[ ]:
