Metadata Talk MPI

This document discusses metadata for electronic language resources. It provides background on library catalogues and early metadata initiatives. The key functions of descriptive metadata for language resources are described for users, depositors/managers, and researchers. Components of existing metadata infrastructure are outlined, including schemas, tools, and exchange protocols. Experience with current initiatives is discussed, noting gaps in coverage, usage, and support. The document concludes with lessons learned, emphasizing the need for a flexible framework using registered vocabularies and persistent identifiers to support localization and different resource types. Standards and emerging trends are also briefly mentioned.

Uploaded by

swords900302
Copyright
© Attribution Non-Commercial (BY-NC)
Relevance, Standards and Usage of Metadata for Electronic Language Resources

Daan Broeder, Peter Wittenburg
MPI for Psycholinguistics
CLARIN Research Infrastructure

HT: there is not (yet) one agreed descriptive system for LRT. Let’s
limit the damage!

Library History

• concept of descriptive metadata is of course very old


• library catalogues were used to easily manage and find books
stored somewhere on the shelves
• some liked the catalogues – others liked to look at the book instances
• these catalogues typically had very limited information
• finding the right book (title, author, year, etc)
• quick inspection (citation, older versions, statistics, etc)
• managing the library holding (overview, reorganization, missing, etc)
• not the place for deep characterization
• in some catalogues content classification
• genre
• subject (LCSH, IconClass , …)
• libraries the first to introduce/push electronic catalogues
and exchange formats (MARC, etc)
• Dublin Core to describe any authored web-resource was
pushed forward also by librarians.

Motivation in Language Resource Domain

• Constantly more language resources of all types are created.
• At MPI about 500.000 digital objects deposited from a large group
of researchers independently of each other with a high annual
increase
• The sheer quantity requires new methods to prevent
Digital Chaos or Data Cemetery
1. need good and stable repositories/archives
2. need a good Descriptive Metadata infrastructure
• Several of us realized this
• Early approaches
• TEI header tags (deep descriptive intention)
were used in various projects (Dutch Spoken Corpus)
• CHILDES annotation file header tags (search, filtering etc)
• …

Initiatives for Descriptive Metadata for LRT

• Dublin Core MD initiative for all types of authored web-resources
• 1998 TEI header
• May 2000 ISLE MD White Paper (IMDI) presented at LREC in
Athens & establishment of an IMDI working group
• May 2000 LREC necessity of language classification system
(Ethnologue) now an ISO standard
• December 2000 Presentation of the OLAC initiative
• 2000 DFKI/ACL Registry of tools
• important activities in other fields
• LOM: DMD for learning objects
• MPEG7: complex integrated approach (DMD + content)
• ISO 19115: geographic information
• Indecs
• …
• social tagging as alternative for expert metadata, but usability for
our domain may be limited

Functions of Descriptive Metadata for LRT I

Differences in the approaches with respect to different interest groups
• users
• search big catalogues with a large number of descriptions
• browsing through linked hierarchies or networks of DMD
• faceted browsing as a combination
• geographic browsing based on GIS coordinates
• quick inspection of metadata to check suitability
• virtual collection building and workflow creation (process journal)
• creating relations between LRs of various sorts
• creating different views including dynamic web-sites
• research questions vs. discovery
• give me frequencies of correct usage of 3rd person plural
inflected form for children of different age and sex
• give me lexicons for Trumai
• granularity
• let me find a specific individual object
• let me find a corpus
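
The search and faceted-browsing functions listed above can be illustrated with a minimal Python sketch. The records, the field names (`language`, `type`, `genre`) and the two helper functions are hypothetical and not taken from IMDI or OLAC. The sketch only shows that a facet is a count over descriptor values and that selecting a value narrows the result set.

```python
from collections import Counter

# Hypothetical descriptive metadata records; field names are illustrative only.
RECORDS = [
    {"id": "r1", "language": "Trumai", "type": "lexicon", "genre": "unspecified"},
    {"id": "r2", "language": "Trumai", "type": "corpus", "genre": "narrative"},
    {"id": "r3", "language": "Dutch", "type": "corpus", "genre": "conversation"},
    {"id": "r4", "language": "Dutch", "type": "corpus", "genre": "narrative"},
]


def search(records, **criteria):
    """Return the records whose fields match all given key-value criteria."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def facets(records, field):
    """Count how many records carry each value of one descriptor."""
    return Counter(r[field] for r in records if field in r)


# Discovery query from the slide: "give me lexicons for Trumai".
trumai_lexicons = search(RECORDS, language="Trumai", type="lexicon")

# Faceted browsing: narrow to corpora, then show the remaining genre counts.
corpora = search(RECORDS, type="corpus")
genre_counts = facets(corpora, "genre")
```

A real catalogue would run such queries against an index or database. The in-memory list is used here only to keep the example self-contained.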

Functions of Descriptive Metadata for LRT II

• depositors/managers
• canonical hierarchy according to linguistic criteria
and resource bundling (container building)
• for resource management (migration, moving, etc)
• for simple access and license management
• adding valuable information/knowledge about resources
• for copying parts (access, long-term archiving)

Example Scenario:
• all copied to computer centers
• only parts exchanged between MPI
and regional centers in both directions

DMD Infrastructure Components (until now)

• metadata provider <-> service provider
• one major difference: DMD Data vs. DMD Service Provider
• DMD Service Provider has no resource management task
• the DMD specification
• a schema (flat or structured - until now is one of the main pillars)
• a vocabulary of descriptor elements with key-value pairs
• per element value sets (closed, open, semi-closed)
• special profiles to include new sub-disciplines
• the tools
• editor, browser(s), search engine (structured vs. unstructured)
• DBMSs (relational or XML based)
• OAI-PMH protocol (gateway, harvesting)
• linker and virtual collection builder
• view generators
• APIs (Web services: SOAP & REST)
• ..
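
The harvesting side of the OAI-PMH protocol listed above can be sketched in a few lines of Python. The verb `ListRecords`, the `metadataPrefix` argument and the XML namespaces come from the OAI-PMH and Dublin Core specifications. The base URL, the sample response and the function names are assumptions made for illustration, and no network request is issued.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the ListRecords request a service provider sends to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hand-written, abbreviated response standing in for a real harvest.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:rec1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample word list</dc:title>
          <dc:language>nl</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Extract identifier, titles and languages from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    harvested = []
    for record in root.iterfind(".//oai:record", NS):
        harvested.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
            "languages": [e.text for e in record.iterfind(".//dc:language", NS)],
        })
    return harvested


url = list_records_url("https://example.org/oai")  # hypothetical endpoint
records = parse_records(SAMPLE_RESPONSE)
```

A complete harvester would also follow `resumptionToken` elements to page through large result sets, which is left out here.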

DMD Experience

• some initiatives have done an excellent job
• IMDI, OLAC have stabilized and offer services
• DC moved from 15 broad categories to qualified concepts
• vocabularies are registered (community sites, ISO DCR)
• OAI-PMH is widely accepted for metadata exchange
• XML harvesting as a less expensive alternative is accepted
• but total coverage is not at all sufficient
• too few repositories are ready/willing to participate
• DMD usage is not at all satisfying (see IMDI usage*)
• necessity not believed despite evangelization
• DMD generation costs money, but is not budgeted
• some researchers still don’t want to share
• some researchers would like to participate, but …
• lot of legacy material – how to get that in?
• DMD is open – some have ethical/political problems
• user friendliness (what is this?) to be improved
• not all functions supported

DMD Lessons learned

• schemas are secondary - let everyone create his/her own schema
• primary are registered and suitable vocabularies and persistent IDs
• a registry for schemas to allow re-usage and look-up

• need a flexible component based framework for DMD (similar to LMF)
• a REQUIREMENT to use registered vocabularies
• need to support localization and sub-discipline terminology
• a registry allowing to re-use existing schemas or blocks (schema
fragments)
• easy registration of new schemas (using registered vocabularies)
• full support of PIDs at all relevant levels: concepts, resources and
other metadata
• possibility to register useful relations between concepts (pragmatic
ontologies)
• next generation tools should support such a framework
• need thorough studies of resource types / descriptor sets per type
• need to include web services so others may interact with it.
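
The lessons above (schemas built freely from reusable components, with a requirement that every element refers to a registered concept via a persistent ID) can be sketched as follows. This is a minimal illustration and not the CLARIN component model or the ISOcat API: the `pid:example/...` identifiers, the registry contents and all class and function names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical concept registry: persistent identifier -> concept name.
CONCEPT_REGISTRY = {
    "pid:example/language-id": "languageId",
    "pid:example/genre": "genre",
    "pid:example/actor-sex": "actorSex",
}


@dataclass(frozen=True)
class Element:
    name: str
    concept_pid: str
    closed_values: Optional[frozenset] = None  # None means an open value set


@dataclass
class Component:
    name: str
    elements: list = field(default_factory=list)


def unregistered(components):
    """Names of elements whose concept PID is not in the registry."""
    return [e.name for c in components for e in c.elements
            if e.concept_pid not in CONCEPT_REGISTRY]


def validate(record, components):
    """Return (element, value) pairs that violate a closed value set."""
    errors = []
    for c in components:
        for e in c.elements:
            value = record.get(e.name)
            if value is None or e.closed_values is None:
                continue  # missing values and open value sets are accepted
            if value not in e.closed_values:
                errors.append((e.name, value))
    return errors


# Reusable components; a schema is just a chosen list of them.
actor = Component("Actor", [
    Element("sex", "pid:example/actor-sex", frozenset({"female", "male", "unknown"})),
])
content = Component("Content", [
    Element("genre", "pid:example/genre"),
    Element("language", "pid:example/language-id"),
])
local = Component("Local", [Element("dialect", "pid:example/not-registered")])

my_schema = [actor, content]
```

Because interoperability rests on the concept PIDs rather than on element names, two schemas built from different components remain comparable as long as their elements point to the same registered concepts.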

Standards and other Trends

• which standards/suggestions are there
• ISO TC37/SC4: ISOcat as DCR standard on the way
• of course reuse trusted registries such as DCMI
• W3C, ISO, IETF: PID standard
• TEI ODD component framework

• which standards/suggestions are missing
• exhaustive LRT taxonomy and description per data resource type
• a feasible suggestion for WS description (UDDI, ebXML did not work)
• an accepted model for new generation DMD
• DMD is in the focus of large initiatives such as DRIVER
(European project to create a Digital Repository Infrastructure)
• will someone take care?
• in the CLARIN project all of this will be one of the main issues
• a flexible component model for MD is on the CLARIN list
• poster on Thursday P15

Thank you for your kind attention.

IMDI Usage

IMDI statistics on 27.000 records:
• many creators did not use content fields (Genre, Subject)
• difficulties with classification, laziness – why should I invest time, etc
• now after 6 years special web pages with dynamic REST-based
content generation, motivation increases

[Figure: chart of IMDI field usage in % (axis 0 to 120), one entry per field of the written resource, media file, actor, content, communication context, project and session groups]

dg2: they already know the data
time investment is for other users
data is considered own property and not that of the funder or research community

REST -> more IMDI ????
but it makes them more aware of the
possibilities of metadata
broeder; 23.05.2008
