www.atmire.com
Metadata based
usage statistics
OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?
1 - WHY DSPACE STATISTICS?
Statistics solution that knows DSpace:
Structure
“Which are the most downloaded bitstreams in a collection?”
Metadata
“Who are the most popular authors in terms of downloads?”
USAGE EVENT VS. ITEM METADATA
2 types of metadata:
Usage event metadata
Additional information about the usage event
Item metadata
Additional information about the target of the usage event
USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...
ITEM METADATA
Relate usage event to information stored in
your repository.
Allows statistics queries based on item
metadata.
→ Not possible with a statistics solution that
is not tied to the repository.
GENERATING METADATA BASED STATISTICS
How many downloads did
author “Barnes, Douglas F.”
get in the last year, grouped
by month?
Metadata based statistics for DSpace
LINKING METADATA TO USAGE EVENTS
Solr Query
http://localhost:8080/solr/statistics/select?
facet=true&facet.offset=0&facet.mincount=1&facet.sort=false
&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum
&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view
&fq=-isBot:true&fq=-isInternal:true
&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]
&fq=+(author_mtdt:Barnes,+Douglas+F.)+&wt=javabin&rows=0
LINKING METADATA TO USAGE EVENTS
facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include stats that are between Jul 1st 2014
and Jun 6th 2015
fq=+(author_mtdt:Barnes,+Douglas+F.)+
only include usage events for items by
author Barnes, Douglas F.
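The query above could also be assembled in code. A minimal sketch in Python, assuming the Solr core lives at the URL shown on the slide. The function name build_author_downloads_url and the choice to match the author as a quoted phrase are illustrative assumptions, not taken from the original module.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8080/solr/statistics/select"  # URL from the slide


def build_author_downloads_url(author, start, end, base_url=SOLR_SELECT):
    """Build a Solr URL that counts downloads per month for one author."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                      # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),   # group by year-month
        ("facet.mincount", "1"),
        ("facet.limit", "24"),
        ("fq", "type:0"),                   # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        # Assumption: the author value is matched as a quoted phrase.
        ("fq", f'author_mtdt:"{author}"'),
    ]
    # urlencode accepts a list of pairs, so "fq" can repeat.
    return base_url + "?" + urlencode(params)


url = build_author_downloads_url(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Each filter stays a separate fq parameter, as on the slide.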
<response>
<lst name="responseHeader">
...
</lst>
<result name="response" numFound="164" start="0"></result>
<lst name="facet_counts">
<lst name="facet_fields">
<lst name="dateYearMonth">
<int name="2014-07">15</int>
<int name="2014-08">19</int>
<int name="2014-09">15</int>
<int name="2014-10">10</int>
<int name="2014-11">7</int>
<int name="2014-12">13</int>
<int name="2015-01">13</int>
<int name="2015-02">15</int>
<int name="2015-03">21</int>
<int name="2015-04">22</int>
<int name="2015-05">12</int>
<int name="2015-06">2</int>
</lst>
</lst>
</lst>
</response>
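A response like the one above could be turned into a month-to-count mapping. A minimal sketch using Python's standard XML parser; the function name monthly_counts is hypothetical, and the embedded XML is a trimmed copy of the slide's response with the responseHeader left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above (responseHeader omitted).
RESPONSE_XML = """
<response>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>
""".strip()


def monthly_counts(xml_text, field="dateYearMonth"):
    """Return {month: count} from a Solr XML facet response."""
    root = ET.fromstring(xml_text)
    facet = root.find(f".//lst[@name='facet_fields']/lst[@name='{field}']")
    return {node.get("name"): int(node.text) for node in facet.findall("int")}


counts = monthly_counts(RESPONSE_XML)
num_found = int(ET.fromstring(RESPONSE_XML).find("result").get("numFound"))
print(counts["2015-04"])                    # 22
print(sum(counts.values()) == num_found)    # True: the months add up to 164
```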
LINKING METADATA TO USAGE EVENTS
In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no
metadata
• The metadata is stored in the database
PROPOSED SOLUTION
1. Query the database for bitstream IDs
based on the author metadata
2. Use those IDs to query Solr for statistics
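Step 2 could be sketched as follows. This is an illustration only: the function name bitstream_filter is hypothetical, the ID list stands in for the result of the database query in step 1, and the type and id field names are the ones visible in the usage event documents in this deck.

```python
def bitstream_filter(bitstream_ids):
    """Build a Solr filter matching download events for the given bitstreams."""
    ids = list(bitstream_ids)
    if not ids:
        raise ValueError("at least one bitstream ID is required")
    id_clause = " OR ".join(str(i) for i in ids)
    return f"type:0 AND id:({id_clause})"


# Stand-in for step 1: in practice these IDs would come from a database query.
ids_for_author = [54, 55, 61]
print(bitstream_filter(ids_for_author))   # type:0 AND id:(54 OR 55 OR 61)

# The filter gets longer with every bitstream that matches the author.
print(len(bitstream_filter(range(1, 5001))))
```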
PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The Solr query can get very long and
inefficient to execute
• Inefficient, but still possible
PROPOSED SOLUTION: DOWNSIDES
What if we want to show the 10 authors with
the most downloads?
• query the database for all authors
• query Solr to get the number of usage events
for each author
• sort those counts, and return the 10 highest
PROPOSED SOLUTION: DOWNSIDES
Very inefficient!
• do a lot of queries
• throw away most of the results: we only
need top 10
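The cost of this approach can be made visible with a small simulation. A sketch only: the dictionaries below are hypothetical in-memory stand-ins for the database and the statistics core, and the counter shows that the number of queries grows with the number of authors.

```python
# Hypothetical in-memory stand-ins for the database and the statistics core.
AUTHOR_BITSTREAMS = {
    "Author A": [1, 2],
    "Author B": [3],
    "Author C": [4, 5, 6],
}
DOWNLOAD_EVENTS = [1, 1, 2, 3, 3, 3, 3, 4, 6]  # bitstream ID per usage event


def top_authors_naive(limit=10):
    """Count downloads with one statistics query per author, then sort."""
    queries = 1  # one database query to list all authors
    totals = {}
    for author, bitstreams in AUTHOR_BITSTREAMS.items():
        queries += 1  # one statistics query for this author
        wanted = set(bitstreams)
        totals[author] = sum(1 for b in DOWNLOAD_EVENTS if b in wanted)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit], queries  # everything past `limit` is thrown away


top, query_count = top_authors_naive(limit=2)
print(top)          # [('Author B', 4), ('Author A', 3)]
print(query_count)  # 4: one per author plus the author listing
```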
SOLR FACETS
To do a facet query:
• specify “facet.field” along with the
regular query
• results will be grouped by the values they have
for that field
SOLR FACETS: EXAMPLE
q=type:0&facet.field=owningItem
q=type:0
search for all usage events that are bitstream downloads
facet.field=owningItem
group these by item
count the number of records in each group
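Conceptually, a facet is a filter followed by a grouped count. A minimal in-memory sketch of that idea in Python; it does not call Solr, the events are made up, and owningItem is simplified to a single value per event (in the Solr documents it is a multi-valued field).

```python
from collections import Counter

# Hypothetical usage events; field names follow the Solr documents in this deck.
events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 2, "owningItem": 1652},   # not a bitstream download
    {"type": 0, "owningItem": 1700},
]


def facet_counts(docs, query, facet_field):
    """Filter docs with `query`, then count them per value of `facet_field`."""
    matching = (d for d in docs if query(d))
    return Counter(d[facet_field] for d in matching)


counts = facet_counts(events, lambda d: d["type"] == 0, "owningItem")
print(counts.most_common())   # [(1652, 2), (1700, 1)]
```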
OUR SOLUTION
• Add item metadata to Solr.
• Use built-in filtering and grouping
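With item metadata stored in every usage event, the "10 authors with the most downloads" question could be answered by a single facet request. A sketch of the parameters, assuming the author_mtdt field shown elsewhere in this deck; the function name top_authors_params is hypothetical.

```python
from urllib.parse import urlencode


def top_authors_params(limit=10):
    """Solr parameters for the most downloaded authors in one request."""
    return [
        ("q", "*:*"),
        ("rows", "0"),                    # only the facet counts are needed
        ("fq", "type:0"),                 # bitstream downloads only
        ("fq", "-isBot:true"),
        ("facet", "true"),
        ("facet.field", "author_mtdt"),   # group by author metadata
        ("facet.sort", "count"),          # highest counts first
        ("facet.limit", str(limit)),      # keep only the top N
    ]


query_string = urlencode(top_authors_params())
print(query_string)
```

Sorting and truncating happen inside Solr, so no per-author queries are needed.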
CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges:
Metadata is duplicated in every statistical record
→ that takes up a lot of space
→ it needs to be kept in sync
SIZE OF SINGLE USAGE EVENT
<doc>
<str name="ip">177.21.194.80</str>
<arr name="ip_search"><str>177.21.194.80</str></arr>
<arr name="ip_ngram"><str>177.21.194.80</str></arr>
<int name="type">0</int>
<int name="id">54</int>
<date name="time">2015-05-11T04:33:49.077Z</date>
<str name="dateYearMonth">2015-05</str>
<str name="dateYear">2015</str>
<str name="continent">SA</str>
<str name="countryCode">BR</str>
<float name="latitude">-10.0</float>
<float name="longitude">-55.0</float>
<arr name="bundleName"><str>ORIGINAL</str></arr>
<arr name="containerBitstream"><int>54</int></arr>
<arr name="owningItem"><int>1652</int></arr>
<arr name="containerItem"><int>1652</int></arr>
<arr name="owningColl"><int>14</int></arr>
<arr name="containerCollection"><int>14</int></arr>
<arr name="owningComm"><int>1</int></arr>
<arr name="containerCommunity"><int>1</int></arr>
<str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
<bool name="isBot">false</bool>
<bool name="isInternal">false</bool>
<str name="statistics_type">view</str>
<long name="_version_">1501767933804675072</long>
</doc>
25 elements
<doc>
<str name="ip">177.21.194.80</str>
...
<arr name="author_mtdt">
<str>Khandker, Shahidur R.</str>
<str>Barnes, Douglas F.</str>
<str>Samad, Hussain A.</str>
</arr>
<arr name="subject_mtdt">
<str>ACCESS TO LIGHTING</str>
<str>ACCESS TO MODERN ENERGY</str>
<str>AGRICULTURAL LAND</str>
<str>AGRICULTURAL RESIDUE</str>
<str>AIR CONDITIONERS</str>
<str>AIR POLLUTION</str>
<str>ALTERNATIVE ENERGY</str>
<str>ALTERNATIVE SOURCES OF ENERGY</str>
<str>APPROACH</str>
<str>ATMOSPHERE</str>
<str>AVAILABILITY</str>
<str>BASIC ENERGY</str>
<str>BIOMASS</str>
<str>BIOMASS BURNING</str>
<str>BIOMASS COLLECTION</str>
<str>BIOMASS CONSUMPTION</str>
<str>BIOMASS ENERGY</str>
...
<str>WORLD ENERGY</str>
<str>WORLD ENERGY OUTLOOK</str>
</arr>
...
</doc>
SIZE OF SINGLE USAGE EVENT WITH METADATA
3 authors
140 subjects
KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be
updated as well
KEEPING METADATA IN SYNC
Item with 7,000 page visits and 5,000 downloads
→ that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other
statistical reports
PERFORMANCE
Size of single usage event
Metadata updates
Amount of events
Live search queries
PERFORMANCE ENHANCEMENT: SYNCING
Try to keep the load created by syncing
metadata in the statistics as low as possible:
→ only sync while Solr is idle
→ interrupt the operation when a search request
can’t be handled in time
→ interrupt the operation when Solr’s memory
usage nears its max
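The interruptible sync could be sketched as a batch loop that checks a condition before every batch. This is an assumption-laden illustration: sync_in_batches, update_batch and solr_is_idle are hypothetical names, and how idleness or memory pressure is actually measured is left to the callback.

```python
def sync_in_batches(event_ids, update_batch, solr_is_idle, batch_size=500):
    """Update events batch by batch; stop early when Solr is no longer idle.

    Returns the number of events updated so the caller can resume later.
    """
    done = 0
    while done < len(event_ids):
        if not solr_is_idle():
            break  # yield to live search requests
        batch = event_ids[done:done + batch_size]
        update_batch(batch)
        done += len(batch)
    return done


# Usage with stand-in callbacks: Solr becomes busy after three idle checks.
idle_checks = iter([True, True, True, False])
updated = []
done = sync_in_batches(
    list(range(12000)),               # 7,000 page visits + 5,000 downloads
    update_batch=updated.extend,
    solr_is_idle=lambda: next(idle_checks),
)
print(done)   # 1500: three batches of 500 before the interruption
```

Returning the progress count lets a later run pick up where the interrupted one stopped.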
PERFORMANCE ENHANCEMENT: CACHING
Caching
store generated reports in a separate Solr core
retrieving them is very fast
invalidate cached reports after a set time
(e.g. 24 hours)
PERFORMANCE ENHANCEMENT: CACHING
Don’t delete expired cached reports
If a user requests a report that is cached
→ show the outdated version
In the meantime
→ generate a new version
Automatically show new report when it’s done
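The "show the outdated version, regenerate in the meantime" behaviour could look roughly like this. A sketch under stated assumptions: the slides store reports in a separate Solr core, while this example uses a plain dictionary, and the class name ReportCache and its methods are hypothetical. Regeneration is triggered explicitly here; a real setup would run it in the background.

```python
import time


class ReportCache:
    """Serve cached reports, even expired ones, and queue expired keys for refresh."""

    def __init__(self, generate, max_age_seconds=24 * 3600, clock=time.time):
        self._generate = generate
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}            # key -> (report, created_at)
        self.pending_refresh = set()  # keys whose report is outdated

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            # Cache miss: nothing to show yet, so generate right away.
            return self._store(key)
        report, created_at = entry
        if self._clock() - created_at > self._max_age:
            self.pending_refresh.add(key)  # keep showing the old report for now
        return report

    def refresh_pending(self):
        """Regenerate outdated reports; meant to run in the background."""
        for key in list(self.pending_refresh):
            self._store(key)
            self.pending_refresh.discard(key)

    def _store(self, key):
        report = self._generate(key)
        self._entries[key] = (report, self._clock())
        return report


# Usage with a fake clock so expiry can be demonstrated without waiting.
now = [0.0]
calls = []


def generate(key):
    calls.append(key)
    return f"{key} v{len(calls)}"


cache = ReportCache(generate, max_age_seconds=60, clock=lambda: now[0])
first = cache.get("top-authors")   # cache miss: generated immediately
now[0] = 120.0
stale = cache.get("top-authors")   # expired: old version is still returned
cache.refresh_pending()
fresh = cache.get("top-authors")   # new version after the refresh
print(first, "|", stale, "|", fresh)
```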
EXAMPLE: CACHE MISS
PROBLEM SOLVED?
Additional complexity
Number of usage events
keeps growing
Name variants
Different names for one author
“Who are the Most
Popular Authors in terms
of downloads?”
NAME VARIANTS USE CASE
https://openknowledge.worldbank.org/most-popular/author
3 name variants:
Ferreira, Francisco H. G.
Ferreira, Francisco H.G.
Ferreira, Francisco
SOLUTION FOR NAME VARIANTS
Include all name variants in the Solr query:
author_mtdt:
(Ferreira, Francisco H. G.) OR
(Ferreira, Francisco H.G.) OR
(Ferreira, Francisco)
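Building such a filter from a list of variants could be sketched like this. The function name author_variants_filter is hypothetical, and wrapping each variant in double quotes as a phrase is an assumption of this example; the slide groups each variant with parentheses instead.

```python
def author_variants_filter(variants, field="author_mtdt"):
    """Build a Solr filter matching any of the given name variants."""
    if not variants:
        raise ValueError("at least one name variant is required")

    def quote(value):
        # Escape backslashes and double quotes, then wrap as a phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return f"{field}:(" + " OR ".join(quote(v) for v in variants) + ")"


print(author_variants_filter([
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
]))
```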
ALTERNATIVE SOLUTION
If you have unique IDs (e.g. ORCID):
index them, and search for them instead
www.atmire.com
Thank you!
Questions?
Desktop view Phone view
Desktop view
Phone view
Desktop view
Phone view
Ad

More Related Content

What's hot (20)

Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Getting started with DSpace 7 REST API
Getting started with DSpace 7 REST APIGetting started with DSpace 7 REST API
Getting started with DSpace 7 REST API
4Science
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
Open Data Support
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
ETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft AzureETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft Azure
Mark Kromer
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Introduction to CKAN
Introduction to CKANIntroduction to CKAN
Introduction to CKAN
OKCon2013
 
Hive partitioning best practices
Hive partitioning  best practicesHive partitioning  best practices
Hive partitioning best practices
Nabeel Moidu
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
Rostislav Pashuto
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
Edureka!
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
Trey Grainger
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for Everyone
Angel Abundez
 
Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!
Visual_BI
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기
NAVER D2
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Getting started with DSpace 7 REST API
Getting started with DSpace 7 REST APIGetting started with DSpace 7 REST API
Getting started with DSpace 7 REST API
4Science
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
ETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft AzureETL in the Cloud With Microsoft Azure
ETL in the Cloud With Microsoft Azure
Mark Kromer
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Introduction to CKAN
Introduction to CKANIntroduction to CKAN
Introduction to CKAN
OKCon2013
 
Hive partitioning best practices
Hive partitioning  best practicesHive partitioning  best practices
Hive partitioning best practices
Nabeel Moidu
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
Rostislav Pashuto
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
Edureka!
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
Trey Grainger
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for Everyone
Angel Abundez
 
Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!Snowflake: The most cost-effective agile and scalable data warehouse ever!
Snowflake: The most cost-effective agile and scalable data warehouse ever!
Visual_BI
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기
NAVER D2
 

Viewers also liked (20)

DSpace in Belgium and beyond
DSpace in Belgium and beyondDSpace in Belgium and beyond
DSpace in Belgium and beyond
Bram Luyten
 
Working for Atmire
Working for AtmireWorking for Atmire
Working for Atmire
Bram Luyten
 
DSpace repositories today and tomorrow
DSpace repositories today and tomorrowDSpace repositories today and tomorrow
DSpace repositories today and tomorrow
Bram Luyten
 
DSpace UI prototype dsember
DSpace UI prototype dsemberDSpace UI prototype dsember
DSpace UI prototype dsember
Bram Luyten
 
Durable Item Relations for DSpace
Durable Item Relations for DSpaceDurable Item Relations for DSpace
Durable Item Relations for DSpace
Bram Luyten
 
Email deposit
Email depositEmail deposit
Email deposit
Bram Luyten
 
Git and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshopGit and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshop
Bram Luyten
 
So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?
Bram Luyten
 
Enterprize aws
Enterprize awsEnterprize aws
Enterprize aws
mamoru tateoka
 
Tarea unidad II
Tarea unidad  II Tarea unidad  II
Tarea unidad II
Angela De Jesus Castro
 
¿Cómo organizar una estrategia de investigación?
¿Cómo organizar una estrategia de investigación?¿Cómo organizar una estrategia de investigación?
¿Cómo organizar una estrategia de investigación?
Grial - University of Salamanca
 
Pilicolayi
PilicolayiPilicolayi
Pilicolayi
Rafig Valiyev
 
Límite de una función
Límite de una funciónLímite de una función
Límite de una función
mariofriedman
 
Price list
Price listPrice list
Price list
Gunaep
 
Classroom20 precentation
Classroom20 precentationClassroom20 precentation
Classroom20 precentation
aivanoulis
 
Rubanomics - Corporate Presentation
Rubanomics - Corporate PresentationRubanomics - Corporate Presentation
Rubanomics - Corporate Presentation
Rheetam Mitra
 
Private Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to SolarPrivate Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to Solar
Don Buchanan
 
Presentation3-One Pound
Presentation3-One PoundPresentation3-One Pound
Presentation3-One Pound
ChaseTomlinson
 
Ingles isabel mª
Ingles isabel mªIngles isabel mª
Ingles isabel mª
miguelingp
 
DSpace in Belgium and beyond
DSpace in Belgium and beyondDSpace in Belgium and beyond
DSpace in Belgium and beyond
Bram Luyten
 
Working for Atmire
Working for AtmireWorking for Atmire
Working for Atmire
Bram Luyten
 
DSpace repositories today and tomorrow
DSpace repositories today and tomorrowDSpace repositories today and tomorrow
DSpace repositories today and tomorrow
Bram Luyten
 
DSpace UI prototype dsember
DSpace UI prototype dsemberDSpace UI prototype dsember
DSpace UI prototype dsember
Bram Luyten
 
Durable Item Relations for DSpace
Durable Item Relations for DSpaceDurable Item Relations for DSpace
Durable Item Relations for DSpace
Bram Luyten
 
Git and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshopGit and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshop
Bram Luyten
 
So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?
Bram Luyten
 
Límite de una función
Límite de una funciónLímite de una función
Límite de una función
mariofriedman
 
Price list
Price listPrice list
Price list
Gunaep
 
Classroom20 precentation
Classroom20 precentationClassroom20 precentation
Classroom20 precentation
aivanoulis
 
Rubanomics - Corporate Presentation
Rubanomics - Corporate PresentationRubanomics - Corporate Presentation
Rubanomics - Corporate Presentation
Rheetam Mitra
 
Private Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to SolarPrivate Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to Solar
Don Buchanan
 
Presentation3-One Pound
Presentation3-One PoundPresentation3-One Pound
Presentation3-One Pound
ChaseTomlinson
 
Ingles isabel mª
Ingles isabel mªIngles isabel mª
Ingles isabel mª
miguelingp
 
Ad

Similar to Metadata based statistics for DSpace (20)

Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Tech Triveni
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by Salesforce
Thinqloud
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiences
Cidar Mendizabal
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
Jayant Shekhar
 
Database
DatabaseDatabase
Database
nationalmobileapps
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Codemotion
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFS
BowenDing4
 
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondAutomated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
JeremyOtt5
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
Bharvi Dixit
 
Integrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI EnvironmentIntegrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI Environment
Cloudera, Inc.
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
ALTER WAY
 
DataCite How To: Use the MDS
DataCite How To: Use the MDSDataCite How To: Use the MDS
DataCite How To: Use the MDS
Frauke Ziedorn
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
Nishant Gandhi
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016
Arun Karthick Manoharan
 
SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803
Andreas Grabner
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
DataWorks Summit
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
6535ANURAGANURAG
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Databricks
 
LDV.pptx
LDV.pptxLDV.pptx
LDV.pptx
Shams Pirzada
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Tech Triveni
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by Salesforce
Thinqloud
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiences
Cidar Mendizabal
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
Jayant Shekhar
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Codemotion
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFS
BowenDing4
 
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondAutomated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
JeremyOtt5
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
Bharvi Dixit
 
Integrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI EnvironmentIntegrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI Environment
Cloudera, Inc.
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
ALTER WAY
 
DataCite How To: Use the MDS
DataCite How To: Use the MDSDataCite How To: Use the MDS
DataCite How To: Use the MDS
Frauke Ziedorn
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
Nishant Gandhi
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016
Arun Karthick Manoharan
 
SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803
Andreas Grabner
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
DataWorks Summit
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Databricks
 
Ad

More from Bram Luyten (12)

Archiving Sensitive Data
Archiving Sensitive DataArchiving Sensitive Data
Archiving Sensitive Data
Bram Luyten
 
Update on DSpace 7
Update on DSpace 7Update on DSpace 7
Update on DSpace 7
Bram Luyten
 
DSpace 5.7 and 6.1 Preview
Metadata based statistics for DSpace

  • 2. OVERVIEW 1. Why DSpace statistics? 2. Usage event vs. Item metadata 3. Generating metadata based statistics 4. Linking metadata to usage events 5. Performance 6. Problem solved?
  • 3. 1 - WHY DSPACE STATISTICS? Statistics solution that knows DSpace: • Structure: “Which are the most downloaded bitstreams in a collection?” • Metadata: “Who are the most popular authors in terms of downloads?”
  • 4. USAGE EVENT VS. ITEM METADATA 2 types of metadata: • Usage event metadata: additional information about the usage event • Item metadata: additional information about the target of the usage event
  • 5. USAGE EVENT METADATA Additional information about the usage event Not related to repository Also possible with other statistics solutions: • IP address • Country • User Agent • HTTP Referrer • ...
  • 6. ITEM METADATA Relate usage event to information stored in your repository. Allows statistics queries based on item metadata. → Not possible with a statistics solution that is not tied to the repository.
  • 7. GENERATING METADATA BASED STATISTICS How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?
  • 13. LINKING METADATA TO USAGE EVENTS Solr Query: http://localhost:8080/solr/statistics/select?facet=true&facet.offset=0&facet.mincount=1&facet.sort=false&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view&fq=-isBot:true&fq=-isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,+Douglas+F.)+&wt=javabin&rows=0
  • 14. LINKING METADATA TO USAGE EVENTS • facet.field=dateYearMonth → group by the field dateYearMonth • fq=type:+0 → only include bitstream downloads • fq=bundleName:ORIGINAL → only include files in bundle “ORIGINAL” • fq=-isBot:true → filter out all bot statistics • fq=-isInternal:true → filter out all internal statistics • fq=time:[2014-07-01+TO+2015-06-06] → only include statistics between Jul 1st 2014 and Jun 6th 2015 • fq=+(author_mtdt:Barnes,+Douglas+F.)+ → only include statistics for items by author Barnes, Douglas F.
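
A query like the one on slide 13 can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_author_downloads_url` and its signature are illustrative assumptions, while the parameter names and values follow the slide. The author filter is passed through in the same shape the slide uses; depending on how `author_mtdt` is defined in the Solr schema, the value may need to be quoted as a phrase.

```python
from urllib.parse import urlencode

def build_author_downloads_url(author, start, end,
                               base="http://localhost:8080/solr/statistics/select"):
    """Build the faceted statistics query from slide 13 (hypothetical helper)."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                      # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),   # group by month
        ("facet.mincount", "1"),
        ("facet.limit", "24"),
        ("fq", "type:0"),                   # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f"(author_mtdt:{author})"),  # same shape as on the slide
    ]
    # urlencode accepts a list of pairs, so the repeated "fq" keys are preserved.
    return base + "?" + urlencode(params)

url = build_author_downloads_url(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Keeping the filters as a list of pairs rather than a dict matters here, because a dict could hold only one `fq` entry.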
  • 15. <response> <lst name="responseHeader"> ... </lst> <result name="response" numFound="164" start="0"></result> <lst name="facet_counts"> <lst name="facet_fields"> <lst name="dateYearMonth"> <int name="2014-07">15</int> <int name="2014-08">19</int> <int name="2014-09">15</int> <int name="2014-10">10</int> <int name="2014-11">7</int> <int name="2014-12">13</int> <int name="2015-01">13</int> <int name="2015-02">15</int> <int name="2015-03">21</int> <int name="2015-04">22</int> <int name="2015-05">12</int> <int name="2015-06">2</int> </lst> </lst> </lst> </response>
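
The facet counts in a response like the one on slide 15 can be read with the standard library. This sketch assumes the XML rendering shown on the slide (the query on slide 13 asks for `wt=javabin`, a binary format meant for Java clients); the function name `monthly_counts` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

RESPONSE = """<response>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>"""

def monthly_counts(xml_text, field="dateYearMonth"):
    """Return ({month: downloads}, numFound) from a Solr XML facet response."""
    root = ET.fromstring(xml_text)
    facet = root.find(
        f"lst[@name='facet_counts']/lst[@name='facet_fields']/lst[@name='{field}']"
    )
    counts = {node.get("name"): int(node.text) for node in facet.findall("int")}
    num_found = int(root.find("result").get("numFound"))
    return counts, num_found

counts, num_found = monthly_counts(RESPONSE)
total = sum(counts.values())
print(total, num_found)
```

The twelve monthly counts add up to the `numFound` value of 164, which is a quick sanity check that no facet bucket was cut off by `facet.limit`.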
  • 16. LINKING METADATA TO USAGE EVENTS In a vanilla DSpace installation: • Usage statistics only contain bitstream IDs: no metadata • The metadata is stored in the database
  • 17. PROPOSED SOLUTION 1. Query the database for bitstream IDs based on the author metadata 2. Use those IDs to query Solr for statistics
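
The two steps can be sketched as follows. The dictionary standing in for the DSpace database, the sample IDs, and both function names are illustrative assumptions; the `id` field is the one that holds the bitstream ID in the usage event on slide 25.

```python
# Hypothetical stand-in for the database: bitstream ID -> authors of the owning item.
BITSTREAM_AUTHORS = {
    54: ["Khandker, Shahidur R.", "Barnes, Douglas F.", "Samad, Hussain A."],
    55: ["Barnes, Douglas F."],
    56: ["Ferreira, Francisco H. G."],
}

def bitstream_ids_for_author(author):
    """Step 1: find the bitstreams whose item lists this author."""
    return sorted(
        bitstream_id
        for bitstream_id, authors in BITSTREAM_AUTHORS.items()
        if author in authors
    )

def solr_filter_for_ids(ids):
    """Step 2: turn the IDs into a Solr filter on the usage event's `id` field."""
    if not ids:
        raise ValueError("no bitstreams found, nothing to query")
    return "id:(" + " OR ".join(str(i) for i in ids) + ")"

ids = bitstream_ids_for_author("Barnes, Douglas F.")
fq = solr_filter_for_ids(ids)
print(fq)
```

The filter grows by one `OR` clause per bitstream, so an author with thousands of files produces a very long query string.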
  • 18. PROPOSED SOLUTION: DOWNSIDES • Two queries to answer one question • The Solr query can get very long and inefficient to execute • Inefficient but still possible
  • 19. PROPOSED SOLUTION: DOWNSIDES What if we want to show the 10 authors with the most downloads? • query the database for all authors • query Solr to get the number of usage events for each author • sort those counts, and return the 10 highest
  • 20. PROPOSED SOLUTION: DOWNSIDES Very inefficient! • do a lot of queries • throw away most of the results: we only need top 10
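
The cost of that approach can be made concrete with a small in-memory sketch. The sample events, the author list, and the function names are illustrative assumptions; `count_downloads` stands in for a single Solr query.

```python
import heapq

# Illustrative data only: each usage event carries the authors of the downloaded item.
EVENTS = [
    ["Barnes, Douglas F.", "Khandker, Shahidur R."],
    ["Barnes, Douglas F."],
    ["Barnes, Douglas F.", "Samad, Hussain A."],
    ["Khandker, Shahidur R."],
]
AUTHORS = ["Barnes, Douglas F.", "Khandker, Shahidur R.", "Samad, Hussain A."]

def count_downloads(author, events):
    """Stand-in for one Solr query: count the usage events for a single author."""
    return sum(1 for authors in events if author in authors)

def top_authors_naive(authors, events, n=10):
    """One count query per author, then keep only the n highest counts."""
    counts = [(author, count_downloads(author, events)) for author in authors]
    queries_issued = 1 + len(counts)  # one database query plus one Solr query per author
    top = heapq.nlargest(n, counts, key=lambda pair: pair[1])
    return top, queries_issued

top, queries_issued = top_authors_naive(AUTHORS, EVENTS, n=2)
print(top, queries_issued)
```

The number of queries scales with the number of authors in the repository, while only `n` of the results are ever shown.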
  • 21. SOLR FACETS To do a facet query: • specify “facet.field” along with the regular query • results will be grouped by the values they have for that field
  • 22. SOLR FACETS: EXAMPLE q=type:0&facet.field=owningItem • q=type:0 → search for all usage events that are bitstream downloads • facet.field=owningItem → group these by item and count the number of records in each group
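
What a facet query computes can be mimicked in a few lines of Python. This is only a conceptual sketch of the grouping and counting, not how Solr implements it; the sample events are made up, `owningItem` is simplified to a single value (it is an array in the real usage event), and the value used for the non-download event type is an assumption.

```python
from collections import Counter

events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 2, "owningItem": 1652},  # some other event type (assumed value), not a download
    {"type": 0, "owningItem": 1700},
]

def facet_counts(events, query, field):
    """Filter events like `q=...`, then count them per value of `field` like `facet.field=...`."""
    matching = (e for e in events if all(e.get(k) == v for k, v in query.items()))
    return Counter(e[field] for e in matching)

counts = facet_counts(events, {"type": 0}, "owningItem")
print(counts.most_common())
```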
  • 23. OUR SOLUTION • Add item metadata to Solr • Use built-in filtering and grouping
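
Adding item metadata to a usage event could look roughly like this. The function `enrich_event` and the shape of the metadata dictionary are illustrative assumptions; only the `_mtdt` field suffix comes from the slides (`author_mtdt`, `subject_mtdt`).

```python
def enrich_event(event, item_metadata):
    """Return a copy of the usage event with the item's metadata added as *_mtdt fields."""
    doc = dict(event)
    for field, values in item_metadata.items():
        doc[f"{field}_mtdt"] = list(values)
    return doc

event = {"type": 0, "id": 54, "owningItem": [1652], "dateYearMonth": "2015-05"}
metadata = {
    "author": ["Khandker, Shahidur R.", "Barnes, Douglas F.", "Samad, Hussain A."],
    "subject": ["ACCESS TO LIGHTING", "AIR POLLUTION"],
}
doc = enrich_event(event, metadata)
print(sorted(doc))
```

Once every usage event carries `author_mtdt`, a single facet query on that field can group downloads by author without consulting the database.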
  • 24. CHALLENGE: SIZE OF THE SOLR CORE That solution creates new challenges: metadata is duplicated in every statistical record • that takes up a lot of space • and it needs to be kept in sync
  • 25. SIZE OF SINGLE USAGE EVENT <doc> <str name="ip">177.21.194.80</str> <arr name="ip_search"><str>177.21.194.80</str></arr> <arr name="ip_ngram"><str>177.21.194.80</str></arr> <int name="type">0</int> <int name="id">54</int> <date name="time">2015-05-11T04:33:49.077Z</date> <str name="dateYearMonth">2015-05</str> <str name="dateYear">2015</str> <str name="continent">SA</str> <str name="countryCode">BR</str> <float name="latitude">-10.0</float> <float name="longitude">-55.0</float> <arr name="bundleName"><str>ORIGINAL</str></arr> <arr name="containerBitstream"><int>54</int></arr> <arr name="owningItem"><int>1652</int></arr> <arr name="containerItem"><int>1652</int></arr> <arr name="owningColl"><int>14</int></arr> <arr name="containerCollection"><int>14</int></arr> <arr name="owningComm"><int>1</int></arr> <arr name="containerCommunity"><int>1</int></arr> <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str> <bool name="isBot">false</bool> <bool name="isInternal">false</bool> <str name="statistics_type">view</str> <long name="_version_">1501767933804675072</long> </doc> 25 elements
  • 26. SIZE OF SINGLE USAGE EVENT WITH METADATA (3 authors, 140 subjects) <doc> <str name="ip">177.21.194.80</str> ... <arr name="author_mtdt"> <str>Khandker, Shahidur R.</str> <str>Barnes, Douglas F.</str> <str>Samad, Hussain A.</str> </arr> <arr name="subject_mtdt"> <str>ACCESS TO LIGHTING</str> <str>ACCESS TO MODERN ENERGY</str> <str>AGRICULTURAL LAND</str> <str>AGRICULTURAL RESIDUE</str> <str>AIR CONDITIONERS</str> <str>AIR POLLUTION</str> <str>ALTERNATIVE ENERGY</str> <str>ALTERNATIVE SOURCES OF ENERGY</str> <str>APPROACH</str> <str>ATMOSPHERE</str> <str>AVAILABILITY</str> <str>BASIC ENERGY</str> <str>BIOMASS</str> <str>BIOMASS BURNING</str> <str>BIOMASS COLLECTION</str> <str>BIOMASS CONSUMPTION</str> <str>BIOMASS ENERGY</str> ... <str>WORLD ENERGY</str> <str>WORLD ENERGY OUTLOOK</str> </arr> ... </doc>
  • 27. KEEPING METADATA IN SYNC When the metadata of an item changes (a mistake was corrected, extra info was added), the statistical records for that item need to be updated as well
  • 28. KEEPING METADATA IN SYNC Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events. • That takes time • During that time, it takes longer to view other statistical reports
  • 29. PERFORMANCE • Size of single usage event • Metadata updates • Amount of events • Live search queries
  • 30. PERFORMANCE ENHANCEMENT: SYNCING Try to keep the load created by syncing metadata in the statistics as low as possible: → only sync while Solr is idle • interrupt the operation when a search request can’t be handled in time • interrupt the operation when Solr’s memory usage nears its max
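
An interruptible sync loop along those lines might be structured as below. This is a sketch under assumptions: `update_batch` and `solr_is_idle` are hypothetical callables (the latter would combine the response-time and memory checks from the slide), and the batch size of 500 is arbitrary. The 12,000 events are the example from slide 28.

```python
def sync_metadata(pending, update_batch, solr_is_idle, batch_size=500):
    """Update stale usage events in batches and stop as soon as Solr is busy.

    Returns the events that still need updating, so a later run can resume.
    """
    while pending and solr_is_idle():
        batch, pending = pending[:batch_size], pending[batch_size:]
        update_batch(batch)
    return pending

# Simulated run: Solr is idle for three checks, then a search request comes in.
idle_checks = iter([True, True, True, False])
updated = []
remaining = sync_metadata(
    pending=list(range(12000)),
    update_batch=updated.extend,
    solr_is_idle=lambda: next(idle_checks),
)
print(len(updated), len(remaining))
```

Checking for idleness between batches rather than between single records keeps the overhead of the check low while still bounding how long a search request has to wait.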
  • 31. PERFORMANCE ENHANCEMENT: CACHING • store generated reports in a separate Solr core • retrieving them is very fast • invalidate cached reports after a set time (e.g. 24 hours)
  • 32. PERFORMANCE ENHANCEMENT: CACHING Don’t delete expired cached reports. If a user requests a report that is cached → show the outdated version. In the meantime → generate a new version. Automatically show the new report when it’s done.
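
The caching behaviour described on slides 31 and 32 (serve the expired report, regenerate it on the side) can be sketched as follows. The slides store reports in a separate Solr core; this sketch uses an in-memory dictionary instead, and the class `ReportCache`, its methods, and the explicit refresh queue are all illustrative assumptions.

```python
import time

class ReportCache:
    def __init__(self, generate, max_age_seconds=24 * 3600, clock=time.time):
        self._generate = generate
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}        # key -> (report, created_at)
        self.refresh_queue = []   # keys waiting to be regenerated

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return self._store(key)          # nothing cached yet: generate now
        report, created_at = entry
        expired = self._clock() - created_at > self._max_age
        if expired and key not in self.refresh_queue:
            self.refresh_queue.append(key)   # keep serving the old copy meanwhile
        return report

    def refresh_pending(self):
        """Regenerate every queued report (would run in the background)."""
        while self.refresh_queue:
            self._store(self.refresh_queue.pop(0))

    def _store(self, key):
        report = self._generate(key)
        self._entries[key] = (report, self._clock())
        return report

# Simulated usage with a fake clock and a generator that numbers its versions.
now = [0.0]
versions = []
def generate(key):
    versions.append(key)
    return f"{key} v{len(versions)}"

cache = ReportCache(generate, max_age_seconds=10, clock=lambda: now[0])
first = cache.get("top-authors")
now[0] = 11.0
stale = cache.get("top-authors")     # expired, but still served immediately
queued = list(cache.refresh_queue)
cache.refresh_pending()
fresh = cache.get("top-authors")
print(first, stale, fresh)
```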
  • 35. PROBLEM SOLVED? • Additional complexity • Number of usage events keeps growing • Name variants: different names for one author
  • 36. NAME VARIANTS USE CASE “Who are the most popular authors in terms of downloads?”
  • 38. 3 name variants: • Ferreira, Francisco H. G. • Ferreira, Francisco H.G. • Ferreira, Francisco
  • 40. SOLUTION FOR NAME VARIANTS include all name variants in Solr query: author_mtdt: (Ferreira, Francisco H. G.) OR (Ferreira, Francisco H.G.) OR (Ferreira, Francisco)
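
A filter covering all variants can be generated from a list. The slide groups each variant in parentheses; this sketch instead quotes each variant as a phrase, on the assumption that an exact match on the full name is intended. The function name `author_variants_filter` is hypothetical.

```python
def author_variants_filter(variants, field="author_mtdt"):
    """Build one Solr filter that matches any of the given name variants."""
    quoted = [
        '"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"'  # escape \ and " inside the phrase
        for v in variants
    ]
    return f"{field}:(" + " OR ".join(quoted) + ")"

fq = author_variants_filter([
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
])
print(fq)
```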
  • 41. ALTERNATIVE SOLUTION If you have unique IDs (e.g. ORCID): index and search for them instead