information retrieval evaluation, statistical significance, evaluation measures, effect sizes, NTCIR, test collections, statistical power, short text conversation, natural language processing, confidence intervals, relevance assessments, graded relevance, inter-assessor agreement, topic set size design, sample sizes, replicability, reproducibility, dialogues, topic set sizes, web search, SIGIR, STC, measures, power analysis, failure analysis, preference assessments, dialogue systems, user preferences, Tukey HSD test, ANOVA, t-tests, multiple comparison procedures, TREC, CLEF, unanimity, gain values, Lancers, information retrieval, Bayesian inference, power, information access, progress monitoring