1997-00 Listing of Working Papers

This document contains summaries of working papers from 2000 related to text mining and machine learning. The papers discuss topics like using compression to identify acronyms, text categorization using compression models, interactive machine learning approaches, keyphrase extraction, correlation-based feature selection, and benchmarking attribute selection techniques.

2000/1 Using compression to identify acronyms in text
Stuart Yeates, David Bainbridge, Ian H. Witten
Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. In previous work, we claimed that compression is a key technology for text mining, and backed this up with a study that showed how particular kinds of lexical tokens (names, dates, locations, etc.) can be identified and located in running text, using compression models to provide the leverage necessary to distinguish different token types (Witten et al., 1999).

2000/2 Text categorization using compression models
Eibe Frank, Chang Chui, Ian H. Witten
Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a "supervised learning" approach to categorization, where already-classified articles, which effectively define the categories, are used as "training data" to build a model that can be used for classifying new articles that comprise the "test data". This contrasts with "unsupervised" learning, where there is no training data and clusters of like documents are sought amongst the test articles. With supervised learning, meaningful labels (such as keyphrases) are attached to the training documents, and appropriate labels can be assigned automatically to test documents depending on which category they fall into.

2000/3 Reserved for Sally Jo

2000/4 Interactive machine learning: letting users build classifiers
Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, Ian H. Witten
According to standard procedure, building a classifier is a fully automated process that follows data preparation by a domain expert.
In contrast, interactive machine learning engages users in actually generating the classifier themselves. This offers a natural way of integrating background knowledge into the modeling stage, so long as interactive tools can be designed that support efficient and effective communication. This paper shows that appropriate techniques can empower users to create models that compete with classifiers built by state-of-the-art learning algorithms. It demonstrates that users, even users who are not domain experts, can often construct good classifiers, without any help from a learning algorithm, using a simple two-dimensional visual interface. Experiments demonstrate that, not surprisingly, success hinges on the domain: if a few attributes can support good predictions, users generate accurate classifiers, whereas domains with many high-order attribute interactions favor standard machine learning techniques. The future challenge is to achieve a symbiosis between human user and machine learning algorithm.

2000/5 KEA: Practical automatic keyphrase extraction
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, Craig G. Nevill-Manning
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.
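
The Kea abstract says that feature values are calculated for each candidate phrase but does not name the features. As a minimal sketch only, the code below assumes two features often used for keyphrase extraction, TF×IDF and the relative position of the first occurrence. The function names, the toy document, and the corpus counts are all hypothetical, and the learning step that would consume these features is left out.

```python
import math

def phrase_positions(phrase, tokens):
    """Return the start indexes at which the phrase occurs in the token list."""
    words = phrase.lower().split()
    k = len(words)
    return [i for i in range(len(tokens) - k + 1) if tokens[i:i + k] == words]

def candidate_features(phrase, tokens, doc_freq, n_docs):
    """Hypothetical feature pair (tfidf, first_occurrence) for one candidate.

    doc_freq is the number of corpus documents containing the phrase
    (assumed >= 1); n_docs is the corpus size. Both are assumed inputs.
    """
    positions = phrase_positions(phrase, tokens)
    if not positions:
        return None  # the phrase is not a candidate in this document
    tf = len(positions) / len(tokens)       # relative frequency in the document
    idf = -math.log2(doc_freq / n_docs)     # rarer phrases score higher
    first = positions[0] / len(tokens)      # 0.0 means the very start
    return tf * idf, first

# Toy document of eight tokens; the corpus statistics are made up.
tokens = "text mining uses compression models for text mining".split()
features = candidate_features("text mining", tokens, doc_freq=1, n_docs=4)
```

A learner trained on documents with known keyphrases would then take such feature pairs as input and score each candidate.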
2000/6 μ-Charts and Z: hows, whys and wherefores
Greg Reeve, Steve Reeves
In this paper we show, by a series of examples, how the μ-chart formalism can be translated into Z. We give reasons for why this is an interesting and sensible thing to do and what it might be used for.

2000/7 One dimensional non-uniform rational B-splines for animation control
Abdelaziz Mahoui
Most 3D animation packages use graphical representations called motion graphs to represent the variation in time of the motion parameters. Many use two-dimensional B-splines as animation curves because of their power to represent free-form curves. In this project, we investigate the possibility of using One-dimensional Non-Uniform Rational B-Spline (NURBS) curves for the interactive construction of animation control curves. One-dimensional NURBS curves present the potential of solving some problems encountered in motion graphs when two-dimensional B-splines are used. The study focuses on the properties of the One-dimensional NURBS mathematical model. It also investigates the algorithms and shape modification tools devised for two-dimensional curves and their port to the One-dimensional NURBS model. It also looks at the issues related to the user interface used to interactively modify the shape of the curves.

2000/8 Correlation-based feature selection of discrete and numeric class machine learning
Mark A. Hall
Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often out-performs the well-known ReliefF attribute estimator when used as a preprocessing step for naïve Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does, reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the preprocessed data are often significantly smaller.

2000/9 A development environment for predictive modelling in foods
G. Holmes, M.A. Hall
WEKA (Waikato Environment for Knowledge Analysis) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning/data mining algorithms. Non-programmers interact with the software via a user interface component called the Knowledge Explorer. Applications constructed from the WEKA class libraries can be run on any computer with a web browsing capability, allowing users to apply machine learning techniques to their own data regardless of computer platform.
This paper describes the user interface component of the WEKA system in reference to previous applications in the predictive modeling of foods.

2000/10 Benchmarking attribute selection techniques for data mining
Mark A. Hall, Geoffrey Holmes
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building phase can result in poor predictive performance and increased computation. Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted. This paper presents a benchmark comparison of several attribute selection methods. All the methods produce an attribute ranking, a useful device for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the rankings with respect to a learning scheme to find the best attributes. Results are reported for a selection of standard data sets and two learning schemes, C4.5 and naïve Bayes.

2000/11
Steve Reeves, Greg Reeve

2000/12
Malika Mahoui, Sally Jo Cunningham
Transaction logs are invaluable sources of fine-grained information about users' search behavior. This paper compares the searching behavior of users across two WWW-accessible digital libraries: the New Zealand Digital Library's Computer Science Technical Reports collection (CSTR), and the Karlsruhe Computer Science Bibliographies (CSBIB) collection.
Since the two collections are designed to support the same type of users (researchers/students in computer science), a comparative log analysis is likely to uncover common searching preferences for that user group. The two collections differ in their content, however; the CSTR indexes a full text collection, while the CSBIB is primarily a bibliographic database. Differences in searching behavior between the two systems may indicate the effect of differing search facilities and content type.

99/1 Lexical attraction for text compression
Joscha Bach, Ian H. Witten
New methods of acquiring structural information in text documents may support better compression by identifying an appropriate prediction context for each symbol. The method of "lexical attraction" infers syntactic dependency structures from statistical analysis of large corpora. We describe the generation of a lexical attraction model, discuss its application to text compression, and explore its potential to outperform fixed-context models such as word-level PPM. Perhaps the most exciting aspect of this work is the prospect of using compression as a metric for structure discovery in text.

99/2 Generating rule sets from model trees
Geoffrey Holmes, Mark Hall, Eibe Frank
Knowledge discovered in a database must be represented in a form that is easy to understand. Small, easy to interpret nuggets of knowledge from data are one requirement and the ability to induce them from a variety of data sources is a second. The literature abounds with classification algorithms, and in recent years with algorithms for time sequence analysis, but relatively little has been published on extracting meaningful information from problems involving continuous classes (regression). Model trees (decision trees with linear models at the leaf nodes) have recently emerged as an accurate method for numeric prediction that produces understandable models. However, it is well known that decision lists (ordered sets of If-Then rules) have the potential to be more compact and therefore more understandable than their tree counterparts. In this paper we present an algorithm for inducing simple, yet accurate rule sets from model trees. The algorithm works by repeatedly building model trees and selecting the best rule at each iteration. It produces rule sets that are, on the whole, as accurate as but smaller than the model tree constructed from the entire dataset. Experimental results for various heuristics which attempt to find a compromise between rule accuracy and rule coverage are reported. We also show empirically that our method produces more accurate and smaller rule sets than the commercial state-of-the-art rule learning system Cubist.
99/3 A diagnostic tool for tree based supervised classification learning algorithms
Leonard Trigg, Geoffrey Holmes
The process of developing applications of machine learning and data mining that employ supervised classification algorithms includes the important step of knowledge verification. Interpretable output is presented to a user so that they can verify that the knowledge contained in the output makes sense for the given application. As the development of an application is an iterative process it is quite likely that a user would wish to compare models constructed at various times or stages. One crucial stage where comparison of models is important is when the accuracy of a model is being estimated, typically using some form of cross-validation. This stage is used to establish an estimate of how well a model will perform on unseen data. This is vital information to present to a user, but it is also important to show the degree of variation between models obtained from the entire dataset and models obtained during cross-validation. In this way it can be verified that the cross-validation models are at least structurally aligned with the model garnered from the entire dataset. This paper presents a diagnostic tool for the comparison of tree-based supervised classification models. The method is adapted from work on approximate tree matching and applied to decision trees. The tool is described together with experimental results on standard datasets.

99/4 Feature selection for discrete and numeric class machine learning
Mark A. Hall
Algorithms for feature selection fall into two broad categories: wrappers use the learning algorithm itself to evaluate the usefulness of features, while filters evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster.
However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naïve Bayes, instance-based learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector: it reduces the data in dimensionality by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the pre-processed data are often significantly smaller.

99/5 Browsing tree structures
Mark Apperley, Robert Spence, Stephen Hodge, Michael Chester
Graphic representations of tree structures are notoriously difficult to create, display, and interpret, particularly when the volume of information they contain, and hence the number of nodes, is large. The problem of interactively browsing information held in tree structures is examined, and the implementation of an innovative tree browser described. This browser is based on distortion-oriented display techniques and intuitive direct manipulation interaction. The tree layout is automatically generated, but the location and extent of detail shown is controlled by the user. It is suggested that these techniques could be extended to the browsing of more general networks.

99/6 Facilitating multiple copy/paste operations
Mark Apperley, Jay Baker, Dale Fletcher, Bill Rogers
Copy and paste, or cut and paste, using a clipboard or paste buffer has long been the principal facility provided to users for transferring data between and within GUI applications. We argue that this mechanism can be clumsy in circumstances where several pieces of information must be moved systematically. In two situations (extraction of data fields from unstructured data found in a directed search process, and reorganisation of computer program source text) we present alternative, more natural, user interface facilities to make the task less onerous, and to provide improved visual feedback during the operation. For the data extraction task we introduce the Stretchable Selection Tool, a semi-transparent overlay augmenting the mouse pointer to automate paste operations and provide information to prompt the user. We describe a prototype implementation that functions in a collaborative software environment, allowing users to cooperate on a multiple copy/paste operation. For text reorganisation, we present an extension to Emacs, providing similar functionality, but without the collaborative features.

99/7 Automating iterative tasks with programming by demonstration: a user evaluation
Gordon W. Paynter, Ian H. Witten
Computer users often face iterative tasks that cannot be automated using the tools and aggregation techniques provided by their application program: they end up performing the iteration by hand, repeating user interface actions over and over again. We have implemented an agent, called Familiar, that can be taught to perform iterative tasks using programming by demonstration (PBD). Unlike other PBD systems, it is domain independent and works with unmodified, widely-used applications in a popular operating system. In a formal evaluation, we found that users quickly learned to use the agent to automate iterative tasks.
Generally, the participants preferred to use multiple selection where possible, but could and did use PBD in situations involving iteration over many commands, or when other techniques were unavailable.

99/8 A survey of software requirements specification practices in the New Zealand software industry
Lindsay Groves, Ray Nickson, Greg Reeve, Steve Reeves, Mark Utting
We report on the software development techniques used in the New Zealand software industry, paying particular attention to requirements gathering. We surveyed a selection of software companies with a general questionnaire and then conducted in-depth interviews with four companies. Our results show a wide variety in the kinds of companies undertaking software development, employing a wide range of software development techniques. Although our data are not sufficiently detailed to draw statistically significant conclusions, it appears that larger software development groups typically have more well-defined software development processes, spend proportionally more time on requirements gathering, and follow more rigorous testing regimes.

99/9 The LRU*WWW proxy cache document replacement algorithm
Chung-yi Chang, Tony McGregor, Geoffrey Holmes
Obtaining good performance from WWW proxy caches is critically dependent on the document replacement policy used by the proxy. This paper validates the work of other authors by reproducing their studies of proxy cache document replacement algorithms. From this basis a cross-trace study is mounted. This demonstrates that the performance of most document replacement algorithms is dependent on the type of workload that they are presented with. Finally we propose a new algorithm, LRU*, that consistently performs well across all our traces.

99/10 Reduced-error pruning with significance tests
Eibe Frank, Ian H. Witten
When building classification models, it is common practice to prune them to counter spurious effects of the training data: this often improves performance and reduces model size. "Reduced-error pruning" is a fast pruning procedure for decision trees that is known to produce small and accurate trees. Apart from the data from which the tree is grown, it uses an independent "pruning" set, and pruning decisions are based on the model's error rate on this fresh data. Recently it has been observed that reduced-error pruning overfits the pruning data, producing unnecessarily large decision trees. This paper investigates whether standard statistical significance tests can be used to counter this phenomenon. The problem of overfitting to the pruning set highlights the need for significance testing. We investigate two classes of test, "parametric" and "non-parametric." The standard chi-squared statistic can be used both in a parametric test and as the basis for a non-parametric permutation test. In both cases it is necessary to select the significance level at which pruning is applied. We show empirically that both versions of the chi-squared test perform equally well if their significance levels are adjusted appropriately. Using a collection of standard datasets, we show that significance testing improves on standard reduced-error pruning if the significance level is tailored to the particular dataset at hand using cross-validation, yielding consistently smaller trees that perform at least as well and sometimes better.

99/11 Weka: Practical machine learning tools and techniques with Java implementations
Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, Sally Jo Cunningham
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

99/12 Pace Regression
Yong Wang, Ian H. Witten
This paper articulates a new method of linear regression, "pace regression", that addresses many drawbacks of standard regression reported in the literature, particularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperforms (in a remarkably general sense) other linear modeling techniques in the literature, including subset selection procedures, which seek a reduction in dimensionality that falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.

99/13 A compression-based algorithm for Chinese word segmentation
W.J. Teahan, Yingying Wen, Rodger McNab, Ian H. Witten
The Chinese language is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries.
Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation.

99/14 Clustering with finite data from semi-parametric mixture distributions
Yong Wang, Ian H. Witten
Existing clustering methods for the semi-parametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters. This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.

99/15

99/16 The Niupepa Collection: Opening the blinds on a window to the past
Te Taka Keegan, Sally Jo Cunningham, Mark Apperley
This paper describes the building of a digital library collection of historic newspapers. The newspapers (Niupepa in Maori), which were published in New Zealand during the period 1842 to 1933, form a unique historical record of the Maori language, and of events from an historical perspective.
Images of these newspapers have been converted to digital form, electronic text extracted from these, and the collection is now being made available over the Internet as a part of the New Zealand Digital Library (NZDL) project at the University of Waikato.

98/1 Boosting trees for cost-sensitive classifications
Kai Ming Ting, Zijian Zheng
This paper explores two boosting techniques for cost-sensitive tree classification in the situation where misclassification costs change very often. Ideally, one would like to have only one induction, and use the induced model for different misclassification costs. Thus, it demands robustness of the induced model against cost changes. Combining multiple trees gives robust predictions against this change. We demonstrate that ordinary boosting combined with the minimum expected cost criterion to select the prediction class is a good solution under this situation. We also introduce a variant of the ordinary boosting procedure which utilizes the cost information during training. We show that the proposed technique performs better than the ordinary boosting in terms of misclassification cost. However, this technique requires inducing a set of new trees every time the cost changes. Our empirical investigation also reveals some interesting behavior of boosting decision trees for cost-sensitive classification.
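
The minimum expected cost criterion mentioned in 98/1 can be illustrated with a small sketch. It assumes the combined trees yield a class probability estimate for an instance, and that a cost matrix gives the cost of predicting class i when the true class is j. The function name and all numbers are illustrative and are not taken from the paper.

```python
def min_expected_cost_class(probs, cost):
    """Pick the class with the lowest expected misclassification cost.

    probs[j]   : estimated probability that the true class is j
    cost[j][i] : cost of predicting class i when the true class is j
    Returns (best_class, expected_costs).
    """
    n = len(probs)
    expected = [sum(probs[j] * cost[j][i] for j in range(n)) for i in range(n)]
    best = min(range(n), key=lambda i: expected[i])
    return best, expected

# One fixed probability estimate, standing in for a single induced model.
probs = [0.7, 0.3]

# Uniform costs: the most probable class wins.
uniform = [[0, 1], [1, 0]]
best_uniform, _ = min_expected_cost_class(probs, uniform)

# Missing class 1 is now five times as costly: the decision flips,
# although the model (and so probs) has not been retrained.
skewed = [[0, 1], [5, 0]]
best_skewed, costs_skewed = min_expected_cost_class(probs, skewed)
```

The point of the sketch is that only the cost matrix changes between the two calls, which matches the abstract's goal of reusing one induced model under different misclassification costs.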

98/2 Generating accurate rule sets without global optimization
Eibe Frank, Ian H. Witten
The two dominant schemes for rule-learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation: creating rules from decision trees and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant: despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER's. Moreover, it operates efficiently, and because it avoids postprocessing, does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.

98/3 VQuery: a graphical user interface for Boolean query specification and dynamic result preview
Steve Jones
Textual query languages based on Boolean logic are common amongst the search facilities of on-line information repositories. However, there is evidence to suggest that the syntactic and semantic demands of such languages lead to user errors and adversely affect the time that it takes users to form queries. Additionally, users are faced with user interfaces to these repositories which are unresponsive and uninformative, and consequently fail to support effective query refinement. We suggest that graphical query languages, particularly Venn-like diagrams, provide a natural medium for Boolean query specification which overcomes the problems of textual query languages.
Also, dynamic result previews can be seamlessly integrated with graphical query specification to increase the effectiveness of query refinements. We describe VQuery, a query interface to the New Zealand Digital Library which exploits querying by Venn diagrams and integrated query result previews.

98/4 Revising Z: semantics and logic
Martin C. Henson, Steve Reeves
We introduce a simple specification logic Z_C comprising a logic and semantics (in ZF set theory). We then provide an interpretation for (a rational reconstruction of) the specification language Z within Z_C. As a result we obtain a sound logic for Z, including the schema calculus. A consequence of our formalisation is a critique of a number of concepts used in Z. We demonstrate that the complications and confusions which these concepts introduce can be avoided without compromising expressibility.

98/5 A logic for the schema calculus
Martin C. Henson, Steve Reeves
In this paper we introduce and investigate a logic for the schema calculus of Z. The schema calculus is arguably the reason for Z's popularity but so far no true calculus (a sound system of rules for reasoning about schema expressions) has been given. Presentations thus far have either failed to provide a calculus (e.g. the draft standard [3]) or have fallen back on informal descriptions at a syntactic level (most text books e.g. [7]). Once the calculus is established we introduce a derived equational logic which enables us to formalise properly the informal notations of schema expression equality to be found in the literature.

98/6 New foundations for Z
Martin C. Henson, Steve Reeves
We provide a constructive and intensional interpretation for the specification language Z in a theory of operations and kinds T. The motivation is to facilitate the development of an integrated approach to program construction. We illustrate the new foundations for Z with examples.
98/7 Predicting apple bruising relationships using machine learning
G. Holmes, S.J. Cunningham, B.T. Dela Rue, A.F. Bollen
Many models have been used to describe the influence of internal or external factors on apple bruising. Few of these have addressed the application of derived relationships to the evaluation of commercial operations. From an industry perspective, a model must enable fruit to be rejected on the basis of a commercially significant bruise and must also accurately quantify the effects of various combinations of input features (such as cultivar, maturity, size, and so on) on bruise prediction. Input features must in turn have characteristics which are measurable commercially; for example, the measure of force should be impact energy rather than energy absorbed. Further, as the commercial criteria for acceptable damage levels change, the model should be versatile enough to regenerate new bruise thresholds from existing data.
Machine learning is a burgeoning technology with a vast range of potential applications, particularly in agriculture where large amounts of data can be readily collected [1]. The main advantage of using a machine learning method in an application is that the models built for prediction can be viewed and understood by the owner of the data who is in a position to determine the usefulness of the model, an essential component in a commercial environment.

98/8 An evaluation of passage-level indexing strategies for a technical report archive
Michael Williams
Past research has shown that using evidence from document passages rather than complete documents is an effective way of improving the precision of full-text database searches. However, passage-level indexing has yet to be widely adopted for commercial or online databases. This paper reports on experiments designed to test the efficacy of passage-level indexing with a particular collection of a full-text online database, the New Zealand Digital Library. Discourse passages and word-window passages are used for the indexing process. Both ranked and Boolean searching are used to test the resulting indexes. Overlapping window passages are shown to offer the best retrieval performance with both ranked and Boolean queries. Modifications may be necessary to the term weighting methodology in order to ensure optimal ranked query performance.

98/9 Managing multiple collections, multiple languages, and multiple media in a distributed digital library
Ian H. Witten, Rodger McNab, Steve Jones, Sally Jo Cunningham, David Bainbridge, Mark Apperley
Managing the organizational and software complexity of a comprehensive digital library presents a significant challenge. Different library collections each have their own distinctive features. Different presentation languages have structural implications such as left-to-right writing order and text-only interfaces for the visually impaired.
Different media involve different file formats, and, more importantly, radically different search strategies are required for non-textual media. In a distributed library, new collections can appear asynchronously on servers in different parts of the world. And as searching interfaces mature from the command-line era exemplified by current Web search engines into the age of reactive visual interfaces, experimental new interfaces must be developed, supported, and tested. This paper describes our experience, gained from operating a substantial digital library service over several years, in solving these problems by designing an appropriate software architecture.

98/10 Experiences with a weighted decision tree learner John G. Cleary, Leonard E. Trigg

Machine learning algorithms for inferring decision trees typically choose a single best tree to describe the training data. Recent research has shown that classification performance can be significantly improved by voting predictions of multiple, independently produced decision trees. This paper describes an algorithm, OB1, that makes a weighted sum over many possible models. We describe one instance of OB1 that includes all possible decision trees as well as naïve Bayesian models. OB1 is compared with a number of other decision tree and instance-based learning algorithms on some of the data sets from the UCI repository. Both an information gain and an accuracy measure are used for the comparison. On the information gain measure OB1 performs significantly better than all the other algorithms. On the accuracy measure it is significantly better than all the algorithms except naïve Bayes, which performs comparably to OB1.
98/11 An entropy gain measure of numeric prediction performance Leonard Trigg

Categorical classifier performance is typically evaluated with respect to error rate, expressed as a percentage of test instances that were not correctly classified. When a classifier produces multiple classifications for a test instance, the prediction is counted as incorrect (even if the correct class was one of the predictions). Although commonly used in the literature, error rate is a coarse measure of classifier performance, as it is based only on a single prediction offered for a test instance. Since many classifiers can produce a class distribution as a prediction, we should use this to provide a better measure of how much information the classifier is extracting from the domain. Numeric classifiers are a relatively new development in machine learning, and as such there is no single performance measure that has become standard. Typically these machine learning schemes predict a single real number for each test instance, and the error between the predicted and actual value is used to calculate a myriad of performance measures such as correlation coefficient, root mean squared error, mean absolute error, relative absolute error, and root relative squared error. With so many performance measures it is difficult to establish an overall performance evaluation. The next section describes a performance measure for machine learning schemes that attempts to overcome the problems with current measures. In addition, the same evaluation measure is used for categorical and numeric classifiers.
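The five conventional measures named in this abstract can be sketched as follows. This is a minimal illustration that assumes the usual textbook definitions, with the relative measures normalised against a baseline that always predicts the mean of the actual values. The function name `numeric_error_measures` is hypothetical, and the entropy gain measure proposed in the paper is not reproduced here because the abstract does not define it.

```python
import math


def numeric_error_measures(predicted, actual):
    """Compute five common error measures for numeric prediction.

    Assumes textbook definitions; raises ZeroDivisionError if either
    sequence is constant, because the baseline terms are then zero.
    """
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("need two equal-length, non-empty sequences")

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    # Errors of the scheme being evaluated.
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))

    # Errors of the baseline that always predicts the mean actual value.
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    base_abs = sum(abs(a - mean_a) for a in actual)

    # Pearson correlation between predictions and actual values.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "correlation": cov / math.sqrt(var_p * base_sq),
        "rmse": math.sqrt(sq_err / n),
        "mae": abs_err / n,
        "rae": abs_err / base_abs,
        "rrse": math.sqrt(sq_err / base_sq),
    }
```

Returning all five values together makes the abstract's point concrete: one set of predictions yields five numbers on different scales, and none of them alone gives an overall evaluation.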

98/12 Proceedings of CBISE '98 CaiSE*98 Workshop on Component Based Information Systems Engineering Edited by John Grundy

Component-based information systems development is an area of research and practice of increasing importance. Information Systems developers have realised that traditional approaches to IS engineering produce monolithic, difficult to maintain, difficult to reuse systems. In contrast, the use of software components, which embody data, functionality and well-specified and understood interfaces, makes interoperable, distributed and highly reusable IS components feasible. Component-based approaches to IS engineering can be used at strategic and organisational levels, to model business processes and whole IS architectures, in development methods which utilise component-based models during analysis and design, and in system implementation. Reusable components can allow end users to compose and configure their own Information Systems, possibly from a range of suppliers, and to more tightly couple their organisational workflows with their IS support.
This workshop proceedings contains a range of papers addressing one or more of the above issues relating to the use of component models for IS development. All of these papers were refereed by at least two members of an international workshop committee comprising industry and academic researchers and users of component technologies. Strategic uses of components are addressed in the first three papers, while the following three address uses of components for systems design and workflow management. Systems development using components, and the provision of environments for component management are addressed in the following group of five papers. The last three papers in this proceedings address component management and analysis techniques. All of these papers provide new insights into the many varied uses of component technology for IS engineering. I hope you find them as interesting and useful as I have when collating this proceedings and organising the workshop.

98/13 An analysis of usage of a digital library Steve Jones, Sally Jo Cunningham, Rodger McNab

As experimental digital library testbeds gain wider acceptance and develop significant user bases, it becomes important to investigate the ways in which users interact with the systems in practice. Transaction logs are one source of usage information, and the information on user behaviour can be culled from them both automatically (through calculation of summary statistics) and manually (by examining query strings for semantic clues on search motivations and searching strategy). We conduct a transaction log analysis on user activity in the Computer Science Technical Reports Collection of the New Zealand Digital Library, and report insights gained and identify resulting search interface design issues.
98/14 Measuring ATM traffic: final report for New Zealand Telecom John Cleary, Ian Graham, Murray Pearson, Tony McGregor

The report describes the development of a low-cost ATM monitoring system, hosted by a standard PC. The monitor can be used remotely, returning information on ATM traffic flows to a central site. The monitor is interfaced to a GPS timing receiver, which provides an absolute time accuracy of better than 1 usec. By monitoring the same traffic flow at different points in a network it is possible to measure cell delay and delay variation in real time, and with existing traffic. The monitoring system characterises cells by a CRC calculated over the cell payload, so special measurement cells are not required. Delays in both local area and wide-area networks have been measured using this system. It is possible to measure delay in a network that is not end-to-end ATM, as long as some cells remain identical at the entry and exit points. Examples are given of traffic and delay measurements in both wide and local area network systems, including delays measured over the Internet from Canada to New Zealand.
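The matching idea in this abstract, identifying the same cell at two monitoring points by a CRC of its payload, can be sketched roughly as below. This is an illustration under stated assumptions and not the report's implementation: the abstract does not say which CRC is used, so `zlib.crc32` stands in for it; the function name `match_cell_delays`, the log format, and the first-in-first-out pairing of cells with identical payloads are all hypothetical. Timestamps are assumed to come from clocks already synchronised (as the GPS receivers would provide) and are given as integer microseconds.

```python
import zlib


def match_cell_delays(entry_log, exit_log):
    """Pair cells seen at two monitoring points and return their delays.

    entry_log, exit_log: iterables of (timestamp_usec, payload_bytes).
    Cells are identified by a CRC over the payload (zlib.crc32 is an
    assumed stand-in). Cells with equal CRCs are paired in arrival order.
    """
    pending = {}
    for ts, payload in entry_log:
        key = zlib.crc32(payload)
        pending.setdefault(key, []).append(ts)

    delays = []
    for ts, payload in exit_log:
        queue = pending.get(zlib.crc32(payload))
        if queue:
            # Oldest unmatched entry timestamp for this payload signature.
            delays.append(ts - queue.pop(0))
        # Cells never seen at the entry point are ignored.
    return delays


def delay_variation(delays):
    """Peak-to-peak delay variation over the matched cells."""
    return max(delays) - min(delays) if delays else 0
```

Because only payload signatures are compared, the sketch works on ordinary traffic and needs no special measurement cells, which is the property the abstract highlights.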

98/15

Despite its simplicity, the naïve Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates.

This paper shows how to apply the naïve Bayes methodology to numeric prediction (i.e. regression) tasks, and compares it to linear regression, instance-based learning, and a method that produces "model trees": decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naïve Bayes is the method of choice, on real-world datasets it is almost uniformly worse than model trees. The comparison with linear regression depends on the error measure: for one measure naïve Bayes performs similarly, for another it is worse. Compared to instance-based learning, it performs similarly with respect to both measures. These results indicate that the simplistic statistical assumption that naïve Bayes makes is indeed more restrictive for regression than for classification.

98/16 Link as you type: using key phrases for automated dynamic link generation Steve Jones

When documents are collected together from diverse sources they are unlikely to contain useful hypertext links to support browsing amongst them. For large collections of thousands of documents it is prohibitively resource intensive to manually insert links into each document. Users of such collections may wish to relate documents within them to text that they are themselves generating. This process, often involving keyword searching, distracts from the authoring process and results in material related to query terms but not necessarily to the author's document. Query terms that are effective in one collection might not be so in another. We have developed Phrasier, a system that integrates authoring (of text and hyperlinks), browsing, querying and reading in support of information retrieval activities. Phrasier exploits key phrases which are automatically extracted from documents in a collection, and uses them as link anchors and to identify candidate destinations for hyperlinks.
This system suggests links into existing collections for purposes of authoring and retrieval of related information, creates links between documents in a collection and provides supportive document and link overviews.

98/17 Melody based tune retrieval over the World Wide Web David Bainbridge, Rodger J. McNab, Lloyd A. Smith

In this paper we describe the steps taken to develop a Web-based version of an existing stand-alone, single-user digital library application for melodical searching of a collection of music. For the three key components (input, searching, and output) we assess the suitability of various Web-based strategies that deal with the now distributed software architecture and explain the decisions we made. The resulting melody indexing service, known as MELDEX, has been in operation for one year, and the feedback we have received has been favorable.

98/18 Making oral history accessible over the World Wide Web David Bainbridge, Sally Jo Cunningham

We describe a multimedia, WWW-based oral history collection constructed from off-the-shelf or publicly available software. The source materials for the collection include audio tapes of interviews and summary transcripts of each interview, as well as photographs illustrating episodes mentioned in the tapes. Sections of the transcripts are manually matched to associated segments of the tapes, and the tapes are digitized. Users search a full-text retrieval system based on the text transcripts to retrieve relevant transcript sections and their associated audio recordings and photographs. It is also possible to search for photos by matching text queries against text descriptions of the photos in the collection, where the located photos link back to their respective interview transcript and audio recordings.

1997

97/1 A dynamic and flexible representation of social relationships in CSCW Steve Jones, Steve Marsh

CSCW system designers lack effective support in addressing the social issues and interpersonal relationships which are linked with the use of CSCW systems. We present a formal description of trust to support CSCW system designers in considering the social aspects of group work, embedding those considerations in systems and analysing computer supported group processes. We argue that trust is a critical aspect in group work, and describe what we consider to be the building blocks of trust. We then present a formal notation for the building blocks, their use in reasoning about social interactions and how they are amended over time. We then consider how the formalism may be used in practice, and present some insights from initial analysis of the behaviour of the formalism. This is followed by a description of possible amendments and extensions to the formalism. We conclude that it is possible to formalise a notion of trust and to model the formalisation by a computational mechanism.

97/2 Design issues for World Wide Web navigation visualisation tools Andy Cockburn, Steve Jones

The World Wide Web (WWW) is a successful hypermedia information space used by millions of people, yet it suffers from many deficiencies and problems in support for navigation around its vast information space. In this paper we identify the origins of these navigation problems, namely WWW browser design, WWW page design, and WWW page description languages. Regardless of their origins, these problems are eventually represented to the user at the browser's user interface. To help overcome these problems, many tools are being developed which allow users to visualise WWW subspaces. We identify five key issues in the design and functionality of these visualisation systems: characteristics of the visual representation, the scope of the subspace representation, the mechanisms for generating the visualisation, the degree of browser independence, and the navigation support facilities. We provide a critical review of the diverse range of WWW visualisation tools with respect to these issues.

97/3 Stacked generalization: when does it work? Kai Ming Ting, Ian H. Witten

Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a 'black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms, and also for combining models of the same type derived from a single learning algorithm in a multiple-data-batches scenario. We also compare the performance of stacked generalization with published results of arcing and bagging.

97/4 Browsing in digital libraries: a phrase-based approach Craig Nevill-Manning, Ian H. Witten, Gordon W.
Paynter

A key question for digital libraries is this: how should one go about becoming familiar with a digital collection, as opposed to a physical one? Digital collections generally present an appearance which is extremely opaque: a screen, typically a Web page, with no indication of what, or how much, lies beyond; whether a carefully-selected collection or a morass of worthless ephemera, whether half a dozen documents or many millions. At least physical collections occupy physical space, present a physical appearance, and exhibit tangible physical organization. When standing on the threshold of a large library one gains a sense of presence and permanence that reflects the care taken in building and maintaining the collection inside. No-one could confuse it with a dung-heap! Yet in the digital world the difference is not so palpable.

97/5 A graphical notation for the design of information visualisations Matthew C. Humphrey

Visualisations are coherent, graphical expressions of complex information that enhance people's ability to communicate and reason about that information. Yet despite the importance of visualisations in helping people to understand and solve a wide variety of problems, there is a dearth of formal tools and methods for discussing, describing and designing them. Although simple visualisations, such as bar charts and scatterplots, are easily produced by modern interactive software, novel visualisations of multivariate, multirelational data must be expressed in a programming language. The Relational Visualisation Notation is a new, graphical language for designing such highly expressive visualisations that does not use programming constructs. Instead, the notation is based on relational algebra, which is widely used in database query languages, and it is supported by a suite of direct manipulation tools. This article presents the notation and examines the designs of some interesting visualisations.

97/6 Applications of machine learning in information retrieval Sally Jo Cunningham, James Littin, Ian H. Witten

Information retrieval systems provide access to collections of thousands, or millions, of documents, from which, by providing an appropriate description, users can recover any one. Typically, users iteratively refine the descriptions they provide to satisfy their needs, and retrieval systems can utilize user feedback on selected documents to indicate the accuracy of the description at any stage. The style of description required from the user, and the way it is employed to search the document database, are consequences of the indexing method used for the collection. The index may take different forms, from storing keywords with links to individual documents, to clustering documents under related topics.

97/7 Computer concepts without computers: a first course in computer science Geoffrey Holmes, Tony C. Smith, William J. Rogers

While some institutions seek to make CS1 curricula more enjoyable by incorporating specialised educational software [1] or by setting more enjoyable programming assignments [2], we have joined the growing number of Computer Science departments that seek to improve the quality of the CS1 experience by focusing student attention away from the computer monitor [3,4]. Sophisticated computing concepts usually reserved for senior level courses are presented in a popular science manner, and given equal time alongside the essential introductory programming material. By exposing students to a broad range of specific computational problems we endeavour to make the introductory course more interesting and enjoyable, and instil in students a sense of vision for areas they might specialise in as computing majors.

97/8 A sight-singing tutor Lloyd A. Smith, Rodger J. McNab

This paper describes a computer program designed to aid its users in learning to sight-sing. Sight-singing, the ability to sing music from a score without prior study, is an important skill for musicians and holds a central place in most university music curricula. Its importance to vocalists is obvious; it is also an important skill for instrumentalists and conductors because it develops the aural imagination necessary to judge how the music should sound when played (Benward and Carr 1991). Furthermore, it is an important skill for amateur musicians, who can save a great deal of rehearsal time through an ability to sing music at sight.

97/9 Stacking bagged and dagged models Kai Ming Ting, I. H. Witten

In this paper, we investigate the method of stacked generalization in combining models derived from different subsets of a training dataset by a single learning algorithm, as well as different algorithms. The simplest way to combine predictions from competing models is majority vote, and the effect of the sampling regime used to generate training subsets has already been studied in this context: when bootstrap samples are used the method is called bagging, and for disjoint samples we call it dagging. This paper extends these studies to stacked generalization, where a learning algorithm is employed to combine the models. This yields new methods dubbed bag-stacking and dag-stacking. We demonstrate that bag-stacking and dag-stacking can be effective for classification tasks even when the training samples cover just a small fraction of the full dataset. In contrast to earlier bagging results, we show that bagging and bag-stacking work for stable as well as unstable learning algorithms, as do dagging and dag-stacking. We find that bag-stacking (dag-stacking) almost always has higher predictive accuracy than bagging (dagging), and we also show that bag-stacking models derived using two different algorithms is more effective than bagging.

97/10 Extracting text from PostScript Craig Nevill-Manning, Todd Reed, Ian H. Witten

We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbyte of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity.

97/11 Gathering and indexing rich fragments of the World Wide Web Geoffrey Holmes, William J. Rogers

While the World Wide Web (WWW) is an attractive option as a resource for teaching and research it does have some undesirable features. The cost of allowing students unlimited access can be high, both in money and time; students may become addicted to 'surfing' the web, exploring purely for entertainment, and jeopardise their studies. Students are likely to discover undesirable material because large scale search engines index sites regardless of their merit. Finally, the explosive growth of WWW usage means that servers and networks are often overloaded, to the extent that a student may gain a very negative view of the technology. We have developed a piece of software which attempts to address these issues by capturing rich fragments of the WWW onto local storage media.
It is possible to put a collection onto CD ROM, providing portability and inexpensive storage. This enables the presentation of the WWW to distance learning students, who do not have internet access. The software interfaces to standard, commonly available web browsers, acting as a proxy server to the files stored on the local media, and provides a search engine giving full text searching capability within the collection.

97/12 Using model trees for classification Eibe Frank, Yong Wang, Stuart Inglis, Geoffrey Holmes, Ian H. Witten

Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5', based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.

97/13 Discovering inter-attribute relationships Geoffrey Holmes

It is important to discover relationships between attributes being used to predict a class attribute in supervised learning situations for two reasons. First, any such relationship will be potentially interesting to the provider of a dataset in its own right. Second, it would simplify a learning algorithm's search space, and the related irrelevant feature and subset selection problem, if the relationships were removed from datasets ahead of learning. An algorithm to discover such relationships is presented in this paper. The algorithm is described and a surprising number of inter-attribute relationships are discovered in datasets from the University of California at Irvine (UCI) repository.

97/14 Learning from batched data: model combination vs data combination Kai Ming Ting, Boon Toh Low, Ian H. Witten

When presented with multiple batches of data, one can either combine them into a single batch before applying a machine learning procedure or learn from each batch independently and combine the resulting models. The former procedure, data combination, is straightforward; this paper investigates the latter, model combination.
Given an appropriate combination method, one might expect model combination to prove superior when the data in each batch was obtained under somewhat different conditions or when different learning algorithms were used on the batches. Empirical results show that model combination often outperforms data combination even when the batches are drawn randomly from a single source of data and the same learning method is used on each. Moreover, this is not just an artifact of one particular method of combining models: it occurs with several different combination methods. We relate this phenomenon to the learning curve of the classifiers being used. Early in the learning process when the learning curve is steep there is much to gain from data combination, but later when it becomes shallow there is less to gain and model combination achieves a greater reduction in variance and hence a lower error rate. The practical implication of these results is that one should consider using model combination rather than data combination, especially when multiple batches of data for the same task are readily available. It is often superior even when the batches are drawn randomly from a single sample, and we expect its advantage to increase if genuine statistical differences between the batches exist.

97/15 Information seeking retrieval, reading and storing behaviour of library users Turner K.

In the interest of digital libraries, it is advisable that designers be aware of the potential behaviour of the users of such a system. There are two distinct parts under investigation: the interaction between traditional libraries involving the seeking and retrieval of relevant material, and the reading and storage behaviours ensuing. Through this analysis, the findings could be incorporated into digital library facilities. There have been copious amounts of research on information seeking, leading to the development of behavioural models to describe the process.
Often research on the information seeking practices of individuals is based on the task and field of study. The information seeking model, presented by Ellis et al. (1993), characterises the format of this study where it is used to compare various research on the information seeking practices of groups of people (from academics to professionals). It is found that, although researchers do make use of library facilities, they tend to rely heavily on their own collections and primarily use the library as a source for previously identified information, browsing and interloan. It was found that there are significant differences in user behaviour between the groups analysed. When looking at the reading and storage of material it was hard to draw conclusions, due to the lack of substantial research and information on the topic. However, through the use of reading strategies, a general idea on how readers behave can be developed. Designers of digital libraries can benefit from the guidelines presented here to better understand their audience.

97/16 Proceedings of the INTERACT97 Combined Workshop on CSCW in HCI-Worldwide Matthias Rauterberg, Lars Oestreicher, John Grundy

This is the proceedings for the INTERACT97 combined workshop on "CSCW in HCI-worldwide". The position papers in this proceedings are those selected from topics relating to HCI community development worldwide and to CSCW issues. Originally these were to be two separate INTERACT workshops, but were combined to ensure sufficient participation for a combined workshop to run. The combined workshop has been split into two separate sessions to run in the morning of July 15th, Sydney, Australia: one to discuss the issues relating to the position papers focusing on general CSCW systems, the other to the development of HCI communities in a worldwide context. The CSCW session uses as a case study a proposed groupware tool for facilitating the development of an HCI database with a worldwide geographical distribution. The HCI community session focuses on developing the content for such a database, in order for it to foster the continued development of HCI communities. The afternoon session of the combined workshop involves a joint discussion of the case study groupware tool, in terms of its content and likely groupware facilities. The position papers have been grouped into those focusing on HCI communities and hence content issues for a groupware database, and those focusing on CSCW and groupware issues, and hence likely groupware support in the proposed HCI database/collaboration tools. We hope that you find the position papers in this proceedings offer a wide range of interesting reports of HCI community development worldwide, leading CSCW system research, and that a groupware tool supporting aspects of a worldwide HCI database can draw upon the varied work reported.

97/17 Internationalising a spreadsheet for Pacific Basin languages Robert Barbour, Alvin Yeo

As people trade and engage in commerce, an economically dominant culture tends to migrate language into other recently contacted cultures. Information technology (IT) can accelerate enculturation and promote the expansion of western hegemony in IT. Equally, IT can present a culturally appropriate interface to the user that promotes the preservation of culture and language with very little additional effort. In this paper a spreadsheet is internationalised to accept languages from the Latin-1 character set such as English, Maori and Bahasa Melayu (Malaysia's national language). A technique that allows a non-programmer to add a new language to the spreadsheet is described. The technique could also be used to internationalise other software at the point of design by following the steps we outline.

97/18 Localising a spreadsheet: an Iban example Alvin Yeo, Robert Barbour

Presently, there is little localisation of software to smaller cultures if it is not economically viable. We believe software should also be localised to the languages of small cultures in order to sustain and preserve these small cultures. As an example, we localised a spreadsheet from English to Iban. The process in which we carried out the localisation can be used as a framework for the localisation of software to languages of small ethnic minorities. Some problems faced during the localisation process are also discussed.

97/19 Strategies of internationalisation and localisation: a postmodernist/s perspective Alvin Yeo, Robert Barbour

Many software companies today are developing software not only for local consumption but for the rest of the world. We introduce the concepts of internationalisation and localisation and discuss some techniques using these processes. An examination of postmodern critique with respect to the software industry is also reported. In addition, we feature our proposed internationalisation technique, which was inspired by the research of postmodern philosophers and mathematicians. As illustrated in our prototype, the technique empowers non-programmers to localise their own software. Further development of the technique and its implications for user interfaces and the future of software internationalisation and localisation are discussed.

97/20 Language use in software Alvin Yeo, Robert Barbour

Much of the popular software we use today is in English. Very few software applications are available in minority languages. Besides economic goals, we justify why software should be made available to smaller cultures. Furthermore, there is evidence that people learn and progress faster in software in their mother tongue (Griffiths et al., 1994) (Krock, 1996). We hypothesise that experienced users of an English spreadsheet can easily migrate to a spreadsheet in their native tongue, i.e. Bahasa Melayu (Malaysia's national language). Observations made in the study suggest that the native speakers of Bahasa Melayu had difficulties with the Bahasa Melayu interface. The subjects' main difficulty was their unfamiliarity with computing terminology in Bahasa Melayu. We present possible strategies to increase the use of Bahasa Melayu in IT. These strategies may also be used to promote the use of other minority languages in IT.

97/21 Usability testing: a Malaysian study
Alvin Yeo, Robert Barbour, Mark Apperley
An exploratory study of software assessment techniques is conducted in Malaysia. Subjects in the study comprised staff members of a Malaysian university with a high Information Technology (IT) presence. The subjects assessed a spreadsheet tool with a Bahasa Melayu (Malaysia's national language) interface. Software evaluation techniques used include the think aloud method, interviews and the System Usability Scale. The responses in the various techniques used are reported and initial results indicate idiosyncratic behaviour of Malaysian subjects. The implications of the findings are also discussed.

97/22 Inducing cost-sensitive trees via instance-weighting
Kai Ming Ting
We introduce an instance-weighting method to induce cost-sensitive trees in this paper. It is a generalization of the standard tree induction process where only the initial instance weights determine the type of tree (i.e., minimum error trees or minimum cost trees) to be induced. We demonstrate that it can be easily adapted to an existing tree learning algorithm. Previous research gave insufficient evidence to support the claim that the greedy divide-and-conquer algorithm can effectively induce a truly cost-sensitive tree directly from the training data. We provide this empirical evidence in this paper. The algorithm employing the instance-weighting method is found to be comparable to or better than both C4.5 and C5 in terms of total misclassification costs, tree size and the number of high cost errors. The instance-weighting method is also simpler and more effective in implementation than a method based on altered priors.

97/23 Fast convergence with a greedy tag-phrase dictionary
Ross Peeters, Tony C. Smith

The best general-purpose compression schemes make their gains by estimating a probability distribution over all possible next symbols given the context established by some number of previous symbols. Such context models typically obtain good compression results for plain text by taking advantage of regularities in character sequences. Frequent words and syllables can be incorporated into the model quickly and thereafter used for reasonably accurate prediction. However, the precise context in which frequent patterns emerge is often extremely varied, and each new word or phrase immediately introduces new contexts which can adversely affect the compression rate. A great deal of the structural regularity in a natural language is given rather more by properties of its grammar than by the orthographic transcription of its phonology. This implies that access to a grammatical abstraction might lead to good compression. While grammatical models have been used successfully for compressing computer programs [4], grammar-based compression of plain text has received little attention, primarily because of the difficulties associated with constructing a suitable natural language grammar. But even without a precise formulation of the syntax of a language, there is a linguistic abstraction which is easily accessed and which demonstrates a high degree of regularity which can be exploited for compression purposes: namely, lexical categories.

97/24 Tag based models of English text
W. J. Teahan, John G. Cleary
The problem of compressing English text is important both because of the ubiquity of English as a target for compression and because of the light that compression can shed on the structure of English. English text is examined in conjunction with additional information about the parts of speech of each word in the text (these are referred to as "tags"). It is shown that the tags plus the text can be compressed more than the text alone. Essentially the tags can be compressed for nothing or even a small net saving in size. A comparison is made of a number of different ways of integrating compression of tags and text using an escape mechanism similar to PPM. These are also compared with standard word based and character based compression programs. The result is that the tag character and word based schemes always outperform the character based schemes. Overall, the tag based schemes outperform the word based schemes. We conclude by conjecturing that tags chosen for compression rather than linguistic purposes would perform even better.

97/25 Musical image compression
David Bainbridge, Stuart Inglis
Optical music recognition aims to convert the vast repositories of sheet music in the world into an on-line digital format [Bai97]. In the near future it will be possible to assimilate music into digital libraries and users will be able to perform searches based on a sung melody in addition to typical text-based searching [MSW+96]. An important requirement for such a system is the ability to reproduce the original score as accurately as possible. Due to the huge amount of sheet music available, the efficient storage of musical images is an important topic of study. This paper investigates whether the "knowledge" extracted from the optical music recognition (OMR) process can be exploited to gain higher compression than the JBIG international standard for bi-level image compression. We present a hybrid approach where the primitive shapes of music extracted by the optical music recognition process (note heads, note stems, staff lines and so forth) are fed into a graphical symbol based compression scheme originally designed for images containing mainly printed text. Using this hybrid approach the average compression rate for a single page is improved by 3.5% over JBIG. When multiple pages with similar typography are processed in sequence, the file size is decreased by 4-8%. Section 2 presents the relevant background to both optical music recognition and textual image compression. Section 3 describes the experiments performed on 66 test images, outlining the combinations of parameters that were examined to give the best results. The initial results and refinements are presented in Section 4, and we conclude in the last section by summarizing the findings of this work.

97/26 Correcting English text using PPM models
W. J. Teahan, S. Inglis, J. G. Cleary, G. Holmes
An essential component of many applications in natural language processing is a language modeler able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized; while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or missed out. This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 95.9% to 96.6%, a decrease of about 10 errors per page.

97/27 Constraints on parallelism beyond 10 instructions per cycle
John G. Cleary, Richard H. Littin, J. A. David McWha, Murray W. Pearson
The problem of extracting Instruction Level Parallelism at levels of 10 instructions per clock and higher is considered. Two different architectures which use speculation on memory accesses to achieve this level of performance are reviewed. It is pointed out that while this form of speculation gives high potential parallelism it is necessary to retain execution state so that incorrect speculation can be detected and subsequently squashed. Simulation results show that the space to store such state is a critical resource in obtaining good speedup. To make good use of the space it is essential that state be stored efficiently and that it be retired as soon as possible. A number of techniques for extracting the best usage from the available state storage are introduced.

97/28 Effects of re-ordered memory operations on parallelism
Richard H. Littin, John G. Cleary
The performance effect of permitting different memory operations to be re-ordered is examined. The available parallelism is computed using a machine code simulator. A range of possible restrictions on the re-ordering of memory operations is considered: from the purely sequential case where no re-ordering is permitted; to the completely permissive one where memory operations may occur in any order so that the parallelism is restricted only by data dependencies. A general conclusion is drawn that to reliably obtain parallelism beyond 10 instructions per clock will require an ability to re-order all memory instructions. A brief description of a feasible architecture capable of this is given.
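To make the two extremes described in 97/28 concrete, here is a minimal sketch of how available parallelism could be measured on an instruction trace. It is not the authors' simulator: the `Instr` record, the `parallelism` function, the unit latency, the unlimited functional units and the tiny trace are all illustrative assumptions, and only read-after-write dependencies are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One traced instruction (hypothetical format, for illustration only)."""
    reads: tuple = ()    # registers read
    writes: tuple = ()   # registers written
    mem: str = ""        # "" (no memory access), "load" or "store"
    addr: int = -1       # address touched by a load or store


def parallelism(trace, policy):
    """Instructions per cycle, assuming unit latency and unlimited units.

    policy="sequential": memory operations execute in program order.
    policy="permissive": a load waits only for the last store to its address.
    """
    if policy not in ("sequential", "permissive"):
        raise ValueError(policy)
    reg_ready = {}      # register -> cycle when its value becomes available
    store_done = {}     # address  -> cycle when the last store completes
    last_mem_done = 0   # completion cycle of the previous memory operation
    depth = 0           # length of the schedule in cycles
    for ins in trace:
        # Register data dependencies apply under both policies.
        start = max((reg_ready.get(r, 0) for r in ins.reads), default=0)
        if ins.mem and policy == "sequential":
            start = max(start, last_mem_done)
        elif ins.mem == "load":
            start = max(start, store_done.get(ins.addr, 0))
        finish = start + 1
        for r in ins.writes:
            reg_ready[r] = finish
        if ins.mem == "store":
            store_done[ins.addr] = finish
        if ins.mem:
            last_mem_done = finish
        depth = max(depth, finish)
    return len(trace) / depth if depth else 0.0


# Four independent loads, a small reduction tree, then one store.
trace = [
    Instr(writes=("r1",), mem="load", addr=0),
    Instr(writes=("r2",), mem="load", addr=1),
    Instr(writes=("r3",), mem="load", addr=2),
    Instr(writes=("r4",), mem="load", addr=3),
    Instr(reads=("r1", "r2"), writes=("r5",)),
    Instr(reads=("r3", "r4"), writes=("r6",)),
    Instr(reads=("r5", "r6"), writes=("r7",)),
    Instr(reads=("r7",), mem="store", addr=4),
]

print(parallelism(trace, "sequential"))  # 8 instructions over 7 cycles
print(parallelism(trace, "permissive"))  # 8 instructions over 4 cycles
```

On this toy trace the sequential policy serialises the four loads, while the permissive policy lets them issue together, which is the kind of gap the abstract refers to.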

97/29 OZCHI'96 Industry Session: Sixth Australian Conference on Human-Computer Interaction
Edited by Chris Phillips, Janis McKauge
The idea for a specific industry session at OZCHI was first mooted at the 1995 conference in Wollongong, during questions following a session of short papers which happened (serendipitously) to be presented by people from industry. An animated discussion took place, most of which was about how OZCHI could be made more relevant to people in industry, be it working as usability consultants, or working within organisations either as usability professionals or as 'champions of the cause'. The discussion raised more questions than answers, about the format of such a session, about the challenges of attracting industry participation, and about the best way of publishing the results. Although no real solutions were arrived at, it was enough to place an industry session on the agenda for OZCHI'96.

97/30 Adaptive models of English text
W. J. Teahan, John G. Cleary
High quality models of English text with performance approaching that of humans are important for many applications including spelling correction, speech recognition, OCR, and encryption. A number of different statistical models of English are compared with each other and with previous estimates from human subjects. It is concluded that the best current models are word based with part of speech tags. Given sufficient training text, they are able to attain performance comparable to humans.
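As a rough illustration of how adaptive statistical models of text can be compared, the sketch below scores a string in bits per character using a simple adaptive context model. It is an assumption-laden stand-in, not the models from the paper: the function name `bits_per_char`, the add-one (Laplace) estimate and the fixed `alphabet_size` are illustrative choices, and no PPM-style escape mechanism or part-of-speech tagging is attempted.

```python
import math
from collections import defaultdict


def bits_per_char(text, order=1, alphabet_size=256):
    """Average code length, in bits per character, of an adaptive model.

    Each character is predicted from the preceding `order` characters,
    and the counts are updated only after the prediction is made.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        # Add-one estimate of P(ch | ctx) before the model has seen ch.
        p = (counts[ctx][ch] + 1) / (totals[ctx] + alphabet_size)
        bits -= math.log2(p)
        counts[ctx][ch] += 1
        totals[ctx] += 1
    return bits / len(text) if text else 0.0


sample = "ab" * 50
print(bits_per_char(sample, order=0, alphabet_size=2))
print(bits_per_char(sample, order=1, alphabet_size=2))
```

A lower figure means better prediction. On the alternating sample, the order-1 model scores far fewer bits per character than the order-0 model because the previous character determines the next one.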

97/31 A graphical user interface for Boolean query specification
Steve Jones, Shona McInnes
On-line information repositories commonly provide keyword search facilities via textual query languages based on Boolean logic. However, there is evidence to suggest that the syntactical demands of such languages can lead to user errors and adversely affect the time that it takes users to form queries. Users also face difficulties because of the conflict in semantics between AND and OR when used in Boolean logic and English language. We suggest that graphical query languages, in particular Venn-like diagrams, can alleviate the problems that users experience when forming Boolean expressions with textual languages. We describe Vquery, a Venn-diagram based user interface to the New Zealand Digital Library (NZDL). The design of Vquery has been partly motivated by analysis of NZDL usage. We found that few queries contain more than three terms, use of the intersection operator dominates and that query refinement is common. A study of the utility of Venn diagrams for query specification indicates that with little or no training users can interpret and form Venn-like diagrams which accurately correspond to Boolean expressions. The utility of Vquery is considered and directions for future work are proposed.
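The correspondence between Venn-like diagrams and Boolean expressions mentioned in 97/31 can be sketched as follows. This is a hypothetical illustration and not Vquery's implementation: each selected region of the diagram is represented as the set of query terms whose circles contain it, and the names `region_to_clause`, `regions_to_query` and `matches` are invented for this example.

```python
def region_to_clause(terms, inside):
    """One Venn region as a conjunction: terms inside it, NOT the others."""
    parts = [t if t in inside else f"NOT {t}" for t in terms]
    return "(" + " AND ".join(parts) + ")"


def regions_to_query(terms, selected):
    """Selected regions as a disjunction of region clauses."""
    ordered = sorted(selected, key=lambda region: sorted(region))
    return " OR ".join(region_to_clause(terms, r) for r in ordered)


def matches(doc_words, terms, selected):
    """A document matches if the region it falls in has been selected."""
    region = frozenset(t for t in terms if t in doc_words)
    return region in selected


terms = ["compression", "music"]

# Shading only the overlap of the two circles is a plain intersection.
overlap = {frozenset({"compression", "music"})}
print(regions_to_query(terms, overlap))

# Shading the whole "compression" circle selects two regions.
circle = {frozenset({"compression"}), frozenset({"compression", "music"})}
print(regions_to_query(terms, circle))
print(matches({"compression", "text"}, terms, circle))
```

Because every region is either shaded or not, the user never has to choose between AND and OR explicitly, which is the ambiguity the abstract attributes to textual query languages.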
