From my joint talk with Alisa Zhila at Lucene/Solr Revolution 2016 in Boston. The talk covers the following:
- Hierarchical Data/Nested Documents
- Indexing Nested Documents
- Querying Nested Documents
- Faceting on Nested Documents
Synchronicity: Just-In-Time Discovery of Lost Web Pages - Michael Nelson
The document discusses techniques for discovering lost web pages using lexical signatures. It finds that lexical signatures generated from page titles and content evolve over time, with terms dropping out. Signatures perform best with 5-7 terms. Combining titles with signatures provides better discovery results than either alone. Future work includes predicting "good" titles and augmenting signatures with tags and link neighborhoods.
Invited talk at USEWOD2014 (http://people.cs.kuleuven.be/~bettina.berendt/USEWOD2014/)
A tremendous amount of machine-interpretable information is available in the Linked Open Data Cloud. Unfortunately, much of this data remains underused as machine clients struggle to use the Web. I believe this can be solved by giving machines interfaces similar to those we offer humans, instead of separate interfaces such as SPARQL endpoints. In this talk, I'll discuss the Linked Data Fragments vision on machine access to the Web of Data, and indicate how this impacts usage analysis of the LOD Cloud. We all can learn a lot from how humans access the Web, and those strategies can be applied to querying and analysis. In particular, we have to focus first on solving those use cases that humans can do easily, and only then consider tackling others.
Computer study lesson - Internet Search (25 Mar 2020) - wmsklang
Here are the answers to your homework questions:
1. Magnets work by the alignment of atomic or subatomic particles called domains that are polarized (given a magnetic "charge"). The magnetic fields of these polarized domains interact and attract or repel other magnetic materials.
2. A spark plug is a device for delivering electric current from an ignition system to the combustion chamber of a spark-ignition engine to ignite the compressed fuel-air mixture by an electric spark, thereby initiating combustion.
3. A light year is the distance that light travels in one year. Since light travels at about 300,000 kilometers (186,000 miles) per second, one light year equals about 9.46 trillion kilometers or 5.88 trillion miles.
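The light-year figure in answer 3 can be checked with a few lines of arithmetic. A rough sketch, using the rounded 300,000 km/s value quoted above:

```python
# Rough light-year arithmetic with the rounded speed of light
# (about 300,000 km/s) quoted in the answer above.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # ~31.6 million seconds
speed_km_s = 300_000

light_year_km = speed_km_s * SECONDS_PER_YEAR
print(f"{light_year_km:.3g} km")  # roughly 9.46-9.47 trillion km
```

Using the exact speed of light (299,792.458 km/s) instead gives the commonly cited 9.4607 trillion km.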
LANL Research Library
March 12, 2009
Martin Klein & Michael L. Nelson
Department of Computer Science
Old Dominion University
Norfolk VA
www.cs.odu.edu/~{mklein,mln}
This document introduces Linked Data Fragments, which is an approach to querying Linked Data in a scalable and reliable way by moving intelligence from centralized servers to distributed clients. It describes how basic Linked Data Fragments can be used to answer SPARQL queries by retrieving and combining relevant fragments. The vision is for clients to be able to query different Linked Data sources across the web using various types of fragments. All Linked Data Fragments software is available as open source.
This document summarizes research into discovering lost web pages using techniques from digital preservation and information retrieval. Key points include:
- Web pages are frequently lost due to broken links or content being moved/removed, but copies may still exist in search engine caches or archives.
- Techniques like lexical signatures (representing a page's content in a few keywords) and analyzing page titles, tags and link neighborhoods can help characterize lost pages and find similar replacement content.
- Experiments showed that lexical signatures degrade over time but page titles are more stable, and combining techniques improves performance in locating replacement content. The goal is to develop a browser extension to help users find lost web pages.
The document discusses the principles of linked open data and Resource Description Framework (RDF). It introduces RDF, SPARQL, and ontologies as standards for the semantic web. It emphasizes using URIs as names for things and linking data to enable discovery on the web. Triples are presented as the basic format for expressing statements about resources in a graph.
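To make the triple model concrete, here is a minimal sketch in plain Python (not a real RDF library; the example URIs and book data are illustrative, though the Dublin Core and FOAF namespaces are real vocabularies) of statements as subject-predicate-object tuples, with URIs used as names for things:

```python
# Each RDF statement is a (subject, predicate, object) triple; a set of
# triples forms a graph. URIs name both the things and the relationships.
triples = [
    ("http://example.org/book/1", "http://purl.org/dc/terms/title", "Weaving the Web"),
    ("http://example.org/book/1", "http://purl.org/dc/terms/creator", "http://example.org/person/timbl"),
    ("http://example.org/person/timbl", "http://xmlns.com/foaf/0.1/name", "Tim Berners-Lee"),
]

def objects(subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow the creator link from the book to the person, then read the name.
creator = objects("http://example.org/book/1", "http://purl.org/dc/terms/creator")[0]
print(objects(creator, "http://xmlns.com/foaf/0.1/name"))  # ['Tim Berners-Lee']
```

Because the object of one triple can be the subject of another, linked data can be traversed for discovery, which is exactly the point the document emphasizes.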
This document discusses various approaches for building applications that consume linked data from multiple datasets on the web. It describes characteristics of linked data applications and generic applications like linked data browsers and search engines. It also covers domain-specific applications, faceted browsers, SPARQL endpoints, and techniques for accessing and querying linked data including follow-up queries, querying local caches, crawling data, federated query processing, and on-the-fly dereferencing of URIs. The advantages and disadvantages of each technique are discussed.
The document introduces Hierarchy, a new technology that adds hierarchical data structures to Java. It allows defining hierarchical data like XML and JSON in Java code. Hierarchy provides benefits like easier creation and use of hierarchical data, a dedicated data structure for it, and a way to define fields universally across different usages. The technology is still in development but shows potential to improve how hierarchical data is handled in Java applications and enable new architectural styles.
Presentation of the paper "On Using JSON-LD to Create Evolvable RESTful Services" at the 3rd International Workshop on RESTful Design (WS-REST 2012) at WWW2012 in Lyon, France
It's not rocket surgery - Linked In: ALA 2011 - Ross Singer
This document provides a brief introduction to linked library data and linked data concepts. It explains the core principles of linked data, including using URIs as names for things and including links between URIs so that additional related data can be discovered. It also discusses common vocabularies and schemas used in linked data like Dublin Core, Bibliontology, and RDA Elements. The document uses a sample book record to demonstrate how linked data can be modeled and interconnected using these vocabularies and external data sources like VIAF, LOC, and Geonames.
Presentation given at Barcamp Chiang Mai 4 on the basics of Semantic Web. A simple introduction with examples, aimed for those with a little Web development experience.
Raises questions about the true identity of Tim Berners-Lee.
This document analyzes data collected from two sources - Hacker Web forums and the Shodan search engine - to answer research questions about cybersecurity topics. For Hacker Web, it finds that discussions mentioning "victims" or "targets" have increased over time. The most discussed topics across forums are Windows, government, malware, and botnets. For Shodan, it estimates that over 2% of Samsung SmartTVs are publicly accessible and potentially exploitable. It also finds that traffic signal systems in Louisiana, especially in Metairie and New Orleans, are the most vulnerable to hacking in the US.
The document provides an overview of how Google works including how it indexes web pages, performs searches, and ranks search results. It also describes various search techniques like using quotation marks, Boolean operators, and limits to refine searches. Tips are provided for using Google as a calculator, dictionary, phone directory, and for getting weather or facts. Links to additional resources on searching Google and Google Scholar are also included.
Effective and efficient Google searching PowerPoint tutorial - Jaclyn Lee Parrott
This document provides guidance on effective Google searching. It discusses Google's mission to organize the world's information and make it accessible. It also notes that Google profiles users to target advertising and its products may change. The document then provides examples of basic Google searches and demonstrates more advanced search techniques. It stresses evaluating sources and avoiding plagiarism. Finally, it includes an exercise for readers to practice advanced Google searches.
This document discusses various techniques, called "Google hacks", for efficiently searching Google. It covers basic operators like plus, minus, quotes, and, or signs. It also covers advanced operators like movie, define, weather, and site restrictions. The document provides examples of interesting searches and tools for anonymous Googling and protecting yourself from Google searches.
The document provides an overview and introduction to Neo4j, a graph database. It discusses what graphs and Neo4j are, how to model data in a graph versus SQL, the Cypher query language to interact with Neo4j, and demonstrates Neo4j through the browser. It concludes by suggesting next steps to download Neo4j, choose a driver, join the community, and attend upcoming events.
1) There are several general methods for acquiring web data through R, including reading files directly, scraping HTML/XML/JSON, and using APIs that serve XML/JSON.
2) Scraping web data involves extracting structured information from unstructured HTML/XML pages when no API is available. Packages like rvest and XML can be used to parse and extract the desired data.
3) Many data sources have APIs that allow programmatic access to search, retrieve, or submit data through a set of methods. R packages like taxize and dryad interface with specific APIs to access taxonomic and research data.
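The scrape-parse-extract pattern described in point 2 can be sketched with nothing but the standard library. Since the examples in this listing use Python, the sketch below is in Python rather than R's rvest/XML; the HTML fragment is made up, standing in for a fetched page:

```python
# Extract structured rows from an unstructured HTML table when no API
# is available. (Illustrative only: the HTML is a hard-coded stand-in
# for a page you would normally download.)
from html.parser import HTMLParser

html = """
<table>
  <tr><td>Panthera leo</td><td>lion</td></tr>
  <tr><td>Canis lupus</td><td>wolf</td></tr>
</table>
"""

class CellExtractor(HTMLParser):
    """Collect the text content of every <td> cell, in document order."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed(html)
# Pair each scientific name with its common name.
rows = list(zip(parser.cells[0::2], parser.cells[1::2]))
print(rows)  # [('Panthera leo', 'lion'), ('Canis lupus', 'wolf')]
```

Dedicated packages (rvest in R, or similar Python libraries) do the same job with CSS/XPath selectors instead of a hand-written parser.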
Open Source Community Metrics - LibreOffice Conference - Dawn Foster
Open Source Community Metrics: Tips and Techniques for Measuring Participation
Do you know what people are really doing in your open source project? Having good community data and metrics for your open source project is a great way to understand what works and what needs improvement over time, and metrics can also be a nice way to highlight contributions from key project members. This session will focus on tips and techniques for collecting and analyzing metrics from tools commonly used by open source projects. It's like people watching, but with data.
The document describes an upcoming security conference titled "First Improvised Security Testing Conference" to be held on August 8th, 2003 in Madrid. It then provides details about a talk to be given by speaker Vicente Aceituno titled "Advanced Google Searching: Google as a hacking tool". The talk will cover various advanced search techniques using Google to find vulnerable servers, files, and other useful information for security testing purposes. These techniques include directory listings, common default pages, language translations, and the potential use of autonomous "robots" to identify targets.
Sustainable queryable access to Linked Data - Ruben Verborgh
This document discusses sustainable queryable access to Linked Data through the use of Triple Pattern Fragments (TPF). TPFs provide a low-cost interface that allows clients to query datasets through triple patterns. Intelligent clients can execute SPARQL queries over TPFs by breaking queries into triple patterns and aggregating the results. TPFs also enable federated querying across multiple datasets by treating them uniformly as fragments that can be retrieved. The document demonstrates federated querying over DBpedia, VIAF, and Harvard Library datasets using TPF interfaces.
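The client-side idea behind Triple Pattern Fragments — the server answers only single triple patterns, and the client joins the results itself — can be shown with a toy in-memory sketch (the data and prefixed URIs below are made up for illustration, not fetched from a real TPF server):

```python
# Toy Triple Pattern Fragments: the "server" answers one triple pattern
# at a time (None = wildcard); the "client" decomposes a query into
# patterns and joins the fragment results.
data = [
    ("dbr:Antwerp", "dbo:country", "dbr:Belgium"),
    ("dbr:Ghent",   "dbo:country", "dbr:Belgium"),
    ("dbr:Antwerp", "rdfs:label",  "Antwerp"),
    ("dbr:Ghent",   "rdfs:label",  "Ghent"),
]

def fragment(s=None, p=None, o=None):
    """Server side: all triples matching a single pattern."""
    return [t for t in data
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Client side: "labels of cities in Belgium" as two patterns joined on ?city.
cities = [s for s, _, _ in fragment(p="dbo:country", o="dbr:Belgium")]
labels = sorted(o for city in cities for _, _, o in fragment(s=city, p="rdfs:label"))
print(labels)  # ['Antwerp', 'Ghent']
```

The real TPF client does the same decomposition for full SPARQL queries, choosing pattern evaluation order by estimated result counts; federation falls out naturally because every source is just another fragment interface.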
The document discusses the benefits of a federated and decentralized approach to knowledge and data on the web. It argues that centralized approaches like Big Data fail at web scale, as knowledge is inherently distributed and heterogeneous. A federated future based on light interfaces like Triple Pattern Fragments is envisioned, one where clients can query multiple data sources simultaneously for better performance and reliability compared to centralized endpoints. Serendipity and realistic expectations are important principles for this vision.
1. The document discusses various methods for collecting data from websites, including scraping, using APIs, and contacting site owners. It provides examples of projects that used different techniques.
2. Scraping involves programmatically extracting structured data from websites and can be complicated due to legal and ethical issues. APIs provide a safer alternative as long as rate limits are respected.
3. The document provides tips for scraping courteously and effectively, avoiding burdening websites. It also covers common scraping challenges and potential workarounds or alternatives like using APIs or contracting data collection.
Open Source Community Metrics for FOSDEM - Dawn Foster
Presented in the Community DevRoom at FOSDEM 2013. A longer version of this presentation is available at http://fastwonderblog.com/2012/11/05/open-source-community-metrics-linuxcon-barcelona/
The document discusses Triple Pattern Fragments (TPF), which is an alternative approach to publishing Linked Data compared to SPARQL endpoints and data dumps. TPF servers are simpler and have lower processing costs than SPARQL endpoints. This allows TPF interfaces to have very high availability for clients. The document analyzes usage statistics of the DBpedia TPF interface which show it has been widely used with high uptime. It advocates for TPF as a way to make it easier and more realistic to build applications on live Linked Data.
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupta - Lucidworks
The document discusses querying nested documents in Apache Solr. It provides examples of indexing nested XML and JSON documents in Solr, and demonstrates various ways to query the nested documents, including:
- Finding all documents that mention a keyword
- Returning specific document types (comments, replies) that match a query
- Cross-level queries that search across different levels of nesting
- Block join parent and child queries to return parents or children of matching documents
- Returning all descendants of a document by using the ChildDocTransformer
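The block-join query shapes listed above can be sketched as Solr request parameters. This is a minimal sketch: the `{!parent}`, `{!child}`, and ChildDocTransformer syntaxes are standard Solr, but the `doc_type`/`comment_text` field names are assumptions to be adapted to your schema:

```python
# Assemble Solr block-join request parameters as plain dicts.
# Field names (doc_type, comment_text) are illustrative assumptions.

def parent_query(child_clause, parent_filter="doc_type:post"):
    """Parents whose children match child_clause (block join parent query)."""
    return {"q": f'{{!parent which="{parent_filter}"}}{child_clause}'}

def child_query(parent_clause, parent_filter="doc_type:post"):
    """Children whose parents match parent_clause (block join child query)."""
    return {"q": f'{{!child of="{parent_filter}"}}{parent_clause}'}

def with_descendants(q, parent_filter="doc_type:post"):
    """Return matching parents along with all their descendants,
    using the ChildDocTransformer in the fl parameter."""
    return {"q": q, "fl": f"*,[child parentFilter={parent_filter}]"}

print(parent_query("comment_text:solr"))
# {'q': '{!parent which="doc_type:post"}comment_text:solr'}
```

Note that block joins require the parent and its children to be indexed together as one block, which is why the indexing examples in the talk matter as much as the query syntax.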
A presentation given at the Lucene/Solr Revolution 2014 conference to show Solr and Elasticsearch features side by side. The presentation time was only 30 minutes, so only the core usability features were compared. The full video is embedded on the last slide.
Anshum Gupta is an Apache Lucene/Solr committer and Lucidworks employee with over 9 years of experience in search and related technologies. He has been involved with Apache Lucene since 2006 and Apache Solr since 2010, focusing on contributions, releases, and communities around Solr. The document then provides an overview of the major new features and improvements in Apache Solr 4.10, including ease of use enhancements, distributed pivot faceting, core, SolrCloud, and development tool updates.
This document summarizes a talk on search given at Search Camp United Nations in NYC on July 10, 2016. The talk showcases and details examples of different types of search, including rules, typeahead/suggest, signals, and location awareness, and how they can be brought together into a cohesive search experience. It provides information on the speaker, Erik Hatcher, and covers the anatomy of search results and features such as relevancy ranking, faceting, highlighting, grouping, spellchecking, autocomplete, and more.
Scaling SolrCloud to a large number of Collections - Anshum Gupta
Anshum Gupta presented on scaling SolrCloud to support thousands of collections. Some challenges included limitations on the cluster state size, overseer performance issues under high load, and difficulties moving or exporting large amounts of data. Solutions involved splitting the cluster state, improving overseer performance through optimizations and dedicated nodes, enabling finer-grained shard splitting and data migration between collections, and implementing distributed deep paging for large result sets. Testing was performed on an AWS infrastructure to validate scaling to billions of documents and thousands of queries/updates per second. Ongoing work continues to optimize and benchmark SolrCloud performance at large scales.
Managing a SolrCloud cluster using APIs - Anshum Gupta
The document discusses managing large SolrCloud clusters through APIs. It begins with background on SolrCloud and its terminology. It then demonstrates various APIs for creating and modifying collections, adding/deleting replicas, splitting shards, and monitoring cluster status. It provides recipes for common management tasks like shard splitting, ensuring high availability, and migrating infrastructure. Finally, it mentions upcoming backup/restore capabilities and encourages connecting on social media.
Webinar: Building Conversational Search with Fusion - Lucidworks
Traditional approaches put the burden on the user to specify fields and learn more about how the information is stored before composing a query. New approaches enabled by Fusion allow the end user to type in their normal everyday business language and get back meaningful results.
Battle of the giants: Apache Solr vs ElasticSearch - Rafał Kuć
Elasticsearch and Apache Solr are both distributed search engines that provide full text search capabilities and real-time analytics on large volumes of data. The document compares their architectures, data models, query languages, and other features. Key differences include Elasticsearch having a more dynamic schema while Solr relies more on predefined schemas, and Elasticsearch natively supports features like nested objects and parent/child relationships that require additional configuration in Solr.
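As an illustration of the native nested-object support mentioned above, an Elasticsearch index mapping can declare a field as `nested` so that each inner object is indexed as its own hidden document (a sketch; the `comments` field and its sub-fields are assumed names, not from the talk):

```json
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "text":   { "type": "text" }
        }
      }
    }
  }
}
```

Queries against such a field then use Elasticsearch's `nested` query, whereas achieving the equivalent in Solr involves block-join indexing and the `{!parent}`/`{!child}` query parsers.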
Solr and Elasticsearch, a performance study - Charlie Hull
The document summarizes a performance comparison study conducted between Elasticsearch and SolrCloud. It found that SolrCloud was slightly faster at indexing and querying large datasets, and was able to support a significantly higher queries per second. However, the document notes limitations to the study and concludes that both Elasticsearch and SolrCloud showed acceptable performance, so the best option depends on the specific search application requirements.
Solr as your search and suggest engine - Karan Nangru - IndicThreads
Session presented at the 6th IndicThreads.com Conference on Java held in Pune, India on 2-3 Dec. 2011.
http://Java.IndicThreads.com
Adapting Ajax-Solr to compare different sets of documents - Joan Codina - lucenerevolution
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
One of the main features of Solr is faceted search. Facets are the top terms present in the results of a query, but they do not indicate the most statistically relevant terms of a query, that is, the terms that appear more often in the documents selected by the query than in the rest of the collection. A critical factor in making such statistical insights broadly useful is to make them visual, i.e., using charts and graphs that display these quantitative relationships. We will present how to adapt Ajax-Solr to find the most prominent terms of a query compared to the full set or to another query, and give an example of how this can be used to find current topics in the news and extract that information into visually communicative charts and graphs.
Jay Hill from Lucid Imagination will be giving a presentation on common "sins" or anti-patterns that are seen in Lucene and Solr implementations. The document introduces Hill and Lucid Imagination, which provides commercial support for Lucene and Solr. It notes that there will be time for questions and discusses some of the sins that will be covered, including sloth, greed, pride, lust, envy, gluttony, and wrath.
The document compares and contrasts the Apache Solr and Elasticsearch search engines. It discusses their approaches to indexing structure, configuration, discovery, querying, filtering, faceting, data handling, updates, and cluster monitoring. While both use Lucene for indexing and querying, Elasticsearch has a more dynamic schema, easier configuration changes, and more flexible sharding and replication compared to Solr.
Proposal for nested document support in Lucene - Mark Harwood
Nested Documents in Lucene provides a solution for representing complex nested data structures in Lucene by allowing multiple "nested" documents to represent related items. It introduces a new NestedDocumentQuery class that understands document relationships and can execute child searches using arbitrary Lucene queries. This allows for efficient joins between parent and child documents when querying nested data.
Practical Implementation of Faceted Search
Talk at FrOSCon 2013
http://programm.froscon.org/2013/events/1206.html
Faceted search has become an important tool for making large amounts of data accessible in a user-friendly way. But how can a faceted search be implemented, and what needs to be considered? The goal of the talk is to answer these questions and give practical advice.
The Apache Lucene project provides two powerful open source tools for implementing search engines: Lucene Core, the Java-based indexing and search framework, and Solr, the high-performance, configurable search server.
The talk introduces both approaches and shows how each can be used to implement a faceted search. It presents configuration-based faceting in Solr as well as the more involved approach via the Lucene framework, and compares the two methods.
Beyond the technical approach, it also covers general aspects of faceted search, from the structure of the data to be searched and the selection of facets to advice on presentation in the user interface.
Automotive Information Research Driven by Apache Solr: Presented by Mario-Lea... (Lucidworks)
This document summarizes a presentation about using Apache Solr for automotive information research. The presentation covers using Solr for reverse data engineering, aftersales information research, solving the problem of combinatorial explosion in data, ensuring data consistency and timeliness, and using Solr for bill of materials explosions and demand forecasts. It provides examples of how Solr was used to integrate vehicle data from multiple systems, perform full-text search across structured and unstructured data, handle complex data relationships, and optimize performance for an application calculating bill of material explosions.
The document outlines an agenda for a conference on search and recommenders hosted by Lucidworks, including presentations on use cases for ecommerce, compliance, fraud and customer support; a demo of Lucidworks Fusion which leverages signals from user engagement to power both search and recommendations; and a discussion of future directions including ensemble and click-based recommendation approaches.
Downtown SF Lucene/Solr Meetup: Developing Scalable User Search for PlayStati... (Lucidworks)
The document discusses developing a scalable user search feature for the PlayStation 4. It describes setting up a SolrCloud cluster with 300 million user documents distributed across 4 shards. Personalized search ranks results based on friendship connections by using a Lucene index to store close connections for each user. Challenges included instability in the initial Solr 4.8 cluster which was addressed through configuration changes. An upgrade to Solr 5.4 required fully reindexing the data due to schema changes.
Webinar: Fusion for Business Intelligence (Lucidworks)
Lucidworks Senior Systems Engineer Allan Syiek discusses simple querying vs. data mining and intelligent search, and how Lucidworks Fusion can help you turn raw data into insight.
Back to Basics Webinar 3 - Thinking in Documents (Joe Drumgoole)
- The document discusses modeling data in MongoDB based on cardinality and access patterns.
- It provides examples of embedding related data for one-to-one and one-to-many relationships, and references for large collections.
- The document recommends considering read/write patterns and embedding objects for efficient access, while breaking out data if it grows too large.
Presentation about working with the Activity Stream in IBM Connections 4+: what the concepts behind the Activity Stream are, how to work with it, and how to perform many of the tasks you would need to do, such as marking/unmarking as actionable, etc.
Mikkel Heisterberg - An introduction to developing for the Activity Stream (LetsConnect)
The future of business is social, and the activity stream is the way events and messages are communicated in the social business. In this session you’ll learn all there is to know about the activity stream, including exactly what it is and how to interact with it using your favorite development environment, whether that be JavaScript, XPages, Java, or even the plain vanilla HTTP-based REST API. This session is for you if you want to start working with the Activity Stream.
Back to Basics Webinar 3: Schema Design Thinking in Documents (MongoDB)
This is the third webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will explain the architecture of document databases.
Jim Gray gave a presentation on Microsoft SQL Server and database research. He discussed SQL Server's goals of being easy to use and scalable. He outlined enhancements to SQL Server 7 including improved replication, query processing, and data warehousing capabilities. Gray also discussed challenges around managing the growing volume of data being created and the importance of data analysis. He concluded by previewing new capabilities for future versions of SQL Server like support for XML and object-relational features.
I want to know more about computerized text analysis (Luke Czarnecki)
This document provides an overview of computerized text analysis and discusses ethical considerations related to using social media data for social science research. It begins with an introduction to the speaker's research analyzing ideological differences through language use. It then covers the history and current capabilities of computerized text analysis. A major theme is the need for rigorous ethics applications and approval processes as technology has outpaced philosophy. The document concludes with a demonstration of basic functions and capabilities in R for collecting, preprocessing, and analyzing text data.
Basic Concepts. Webinar 3: Schema Design Thinking in Documents (MongoDB)
This is the third webinar in the Basic Concepts series, which introduces the MongoDB database. This webinar explains the architecture of document databases.
Scaling Recommendations, Semantic Search, & Data Analytics with Solr (Trey Grainger)
This presentation is from the inaugural Atlanta Solr Meetup held on 2014/10/21 at Atlanta Tech Village.
Description: CareerBuilder uses Solr to power their recommendation engine, semantic search, and data analytics products. They maintain an infrastructure of hundreds of Solr servers, holding over a billion documents and serving over a million queries an hour across thousands of unique search indexes. Come learn how CareerBuilder has integrated Solr into their technology platform (with assistance from Hadoop, Cassandra, and RabbitMQ) and walk through api and code examples to see how you can use Solr to implement your own real-time recommendation engine, semantic search, and data analytics solutions.
Speaker: Trey Grainger is the Director of Engineering for Search & Analytics at CareerBuilder.com and is the co-author of Solr in Action (2014, Manning Publications), the comprehensive example-driven guide to Apache Solr. His search experience includes handling multi-lingual content across dozens of markets/languages, machine learning, semantic search, big data analytics, customized Lucene/Solr scoring models, data mining and recommendation systems. Trey is also the Founder of Celiaccess.com, a gluten-free search engine, and is a frequent speaker at Lucene and Solr-related conferences.
How Graphs Help Investigative Journalists to Connect the Dots (jexp)
Investigative journalists use graphs and graph databases like Neo4j to connect disparate pieces of data and uncover hidden relationships. The Panama Papers investigation involved loading over 2.6 TB of leaked data into Neo4j to allow over 370 journalists from 80 countries to collaborate and find connections between entities, addresses, intermediaries and officers. Visualizing the data in Neo4j helped journalists tell the full story and have a global impact, exposing offshore dealings of world leaders and others.
This document discusses the evolution of the web from a web of documents to a web of linked data. It outlines the principles of linked data, which involve using URIs to identify things and linking those URIs to other URIs so that machines can discover more data. RDF is introduced as a standard data model for publishing linked data on the web using triples. Examples of linked data applications and datasets are provided to illustrate how linked data allows the web to function as a global database.
This document summarizes the different types of data that can be saved by an app created with Vizwik, including global script values, local browser storage, simple data, complex data, table data, media, and web data. It describes how each type of data is stored and accessed, such as using scripts to get, set, and use simple data values or asynchronous calls and callbacks to manage table data and rows. The document also covers sharing data privately or publicly and accessing user media and web data.
Ten to fifteen years ago, we picked between a few major SQL databases. Now our apps have a variety of needs, and an overwhelming selection of database platforms. There are 5 main database families. In this talk we’ll survey all 5: Relational (SQL), Key/Value (NoSQL), Columnar (NoSQL), Document (NoSQL), and Graph (NoSQL). We’ll cover what scenarios each family handles well. In addition, we’ll discuss the most popular members of each family. So, the next time you need to pick a database, you’ll know which one - or ones - are the best fit.
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012 (exponential-inc)
The document discusses transitioning from relational to NoSQL databases. Relational databases have rigid schemas and cannot scale out easily, while NoSQL databases offer more flexibility through document and other data models. NoSQL databases include document, key-value, column, and graph databases. Document databases store data as documents with flexible, independent structures and support auto-sharding and replication for scaling. They provide an alternative to the rigid structure of relational databases.
LESSON 1- MICROSOFT ACCESS CREATING DATABASE.pdf (JoshCasas1)
Microsoft Access is a software application that helps students create databases and organize data using database tools such as reports, modules, tables, and queries. A relational database organizes data by its relationships (one-to-one, one-to-many, and many-to-many).
The document discusses adding new data sources to the Evergreen Reporter. It describes the amount of existing data in the reporter and the process for adding new tables, including creating the tables, uploading data, and configuring the field mapper to integrate the new tables. An example is provided of adding award nomination data and querying it to answer a reference question. The challenges of ongoing data maintenance and staff training are also addressed.
This document summarizes Michael Hunger's presentation on how graphs make databases fun again. Some key points:
- Traditional relational databases have issues modeling connected data and performing complex queries over relationships. Graph databases like Neo4j can more naturally represent connected data as nodes and relationships.
- Neo4j was originally created to solve issues modeling connected data for a digital asset management system. It uses a graph data model and allows complex relationship queries through its Cypher query language.
- The document demonstrates importing meetup data into Neo4j and running queries to find connections between users, groups, and topics. It also shows examples of querying actor relationships and movie data.
- Tools are presented
This document provides instructions for creating a simple BI Publisher report using real data from a PeopleSoft query. The key steps are:
1. Download real data from an existing PeopleSoft query in XML format.
2. Create a BI Publisher template in Word, linking it to the real data XML file. Format and preview the template.
3. Associate the template with a new BI Publisher report definition in PeopleSoft, linking it to the original query data source.
4. View the final formatted report by publishing it from the report definition in PeopleSoft.
This document discusses SolrCloud cluster management APIs. It provides a brief history of SolrCloud and how cluster management has evolved since its introduction in Solr 4.0 when there were no APIs for managing distributed clusters. It outlines several key SolrCloud cluster management APIs for creating and managing collections, replica placement strategies, scaling up clusters, moving data between shards and nodes, monitoring cluster status, managing leader elections, and migrating cluster infrastructure. It envisions rule-based automation for tasks like monitoring disk usage and automatically adding/removing replicas based on cluster status.
Anshum Gupta presented on the Apache Solr security framework. He began with an introduction of himself and overview of Apache Lucene and Solr. The presentation then covered the need for security in Solr, available security options which include SSL, ZooKeeper ACLs, and authentication and authorization frameworks. Gupta discussed the authentication and authorization plugin architectures, available plugins like BasicAuth and Kerberos, and benefits of the security frameworks like enabling multi-tenant and access controlled features. He concluded with recommendations on writing custom plugins and next steps to improve Solr security.
Talk given at airbnb HQ in San Francisco on July 8th, 2015 at the Downtown SF Apache Lucene/Solr meetup.
This talk covers an overview of both, the authentication and authorization frameworks in Apache Solr, and how they work together. It also provides an overview of existing plugins and how to enable them to restrict user access to resources within Solr.
Anshum Gupta is an Apache Lucene/Solr committer who works at Lucidworks. He discusses the history and capabilities of Apache Lucene, an open source information retrieval library, and Apache Solr, an enterprise search platform built on Lucene. Solr has over 8 million downloads and is used by many large companies for search capabilities including indexing, faceting, auto-complete, and scalability to handle large datasets. Major updates in Solr 5 include improved performance, security features, and analytics capabilities.
This document discusses deploying and managing Apache Solr at scale. It introduces the Solr Scale Toolkit, an open source tool for deploying and managing SolrCloud clusters in cloud environments like AWS. The toolkit uses Python tools like Fabric to provision machines, deploy ZooKeeper ensembles, configure and start SolrCloud clusters. It also supports benchmark testing and system monitoring. The document demonstrates using the toolkit and discusses lessons learned around indexing and query performance at scale.
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:... (Raffi Khatchadourian)
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code that supports symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce DL code that is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, less error-prone imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance. While hybrid approaches aim for the "best of both worlds," the challenges in applying them in the real world are largely unknown. We conduct a data-driven analysis of challenges---and resultant bugs---involved in writing reliable yet performant imperative DL code by studying 250 open-source projects, consisting of 19.7 MLOC, along with 470 and 446 manually examined code patches and bug reports, respectively. The results indicate that hybridization: (i) is prone to API misuse, (ii) can result in performance degradation---the opposite of its intention, and (iii) has limited application due to execution mode incompatibility. We put forth several recommendations, best practices, and anti-patterns for effectively hybridizing imperative DL code, potentially benefiting DL practitioners, API designers, tool developers, and educators.
Viam product demo_ Deploying and scaling AI with hardware.pdf (camilalamoratta)
Building AI-powered products that interact with the physical world often means navigating complex integration challenges, especially on resource-constrained devices.
You'll learn:
- How Viam's platform bridges the gap between AI, data, and physical devices
- A step-by-step walkthrough of computer vision running at the edge
- Practical approaches to common integration hurdles
- How teams are scaling hardware + software solutions together
Whether you're a developer, engineering manager, or product builder, this demo will show you a faster path to creating intelligent machines and systems.
Resources:
- Documentation: https://on.viam.com/docs
- Community: https://discord.com/invite/viam
- Hands-on: https://on.viam.com/codelabs
- Future Events: https://on.viam.com/updates-upcoming-events
- Request personalized demo: https://on.viam.com/request-demo
Generative Artificial Intelligence (GenAI) in Business (Dr. Tathagat Varma)
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business: benefits, opportunities, and limitations. I also discussed how my research on the Theory of Cognitive Chasms helps address some of these issues.
Slides for the session delivered at Devoxx UK 2025 - London.
Discover how to seamlessly integrate AI LLM models into your website using cutting-edge techniques like new client-side APIs and cloud services. Learn how to execute AI models in the front-end without incurring cloud fees by leveraging Chrome's Gemini Nano model using the window.ai inference API, or utilizing WebNN, WebGPU, and WebAssembly for open-source models.
This session dives into API integration, token management, secure prompting, and practical demos to get you started with AI on the web.
Unlock the power of AI on the web while having fun along the way!
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C... (Markus Eisele)
We keep hearing that “integration” is old news, with modern architectures and platforms promising frictionless connectivity. So, is enterprise integration really dead? Not exactly! In this session, we’ll talk about how AI-infused applications and tool-calling agents are redefining the concept of integration, especially when combined with the power of Apache Camel.
We will discuss the role of enterprise integration in an era where Large Language Models (LLMs) and agent-driven automation can interpret business needs, handle routing, and invoke Camel endpoints with minimal developer intervention. You will see how these AI-enabled systems help weave business data, applications, and services together, giving us flexibility and freeing us from hardcoding boilerplate integration flows.
You’ll walk away with:
An updated perspective on the future of “integration” in a world driven by AI, LLMs, and intelligent agents.
Real-world examples of how tool-calling functionality can transform Camel routes into dynamic, adaptive workflows.
Code examples how to merge AI capabilities with Apache Camel to deliver flexible, event-driven architectures at scale.
Roadmap strategies for integrating LLM-powered agents into your enterprise, orchestrating services that previously demanded complex, rigid solutions.
Join us to see why rumours of integration’s demise have been greatly exaggerated, and see first hand how Camel, powered by AI, is quietly reinventing how we connect the enterprise.
Train Smarter, Not Harder – Let 3D Animation Lead the Way!
Discover how 3D animation makes inductions more engaging, effective, and cost-efficient.
Check out the slides to see how you can transform your safety training process!
Slide 1: Why 3D animation changes the game
Slide 2: Site-specific induction isn’t optional—it’s essential
Slide 3: Visitors are most at risk. Keep them safe
Slide 4: Videos beat text—especially when safety is on the line
Slide 5: TechEHS makes safety engaging and consistent
Slide 6: Better retention, lower costs, safer sites
Slide 7: Ready to elevate your induction process?
Can an animated video make a difference to your site's safety? Let's talk.
AI and Data Privacy in 2025: Global Trends (InData Labs)
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up-to-date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
The cost benefit of implementing a Dell AI Factory solution versus AWS and Azure
Our research shows that hosting GenAI workloads on premises, either in a traditional Dell solution or using managed Dell APEX Subscriptions, could significantly lower your GenAI costs over 4 years compared to hosting these workloads in the cloud. In fact, we found that a Dell AI Factory on-premises solution could reduce costs by as much as 71 percent vs. a comparable AWS SageMaker solution and as much as 61 percent vs. a comparable Azure ML solution. These results show that organizations looking to implement GenAI and reap the business benefits to come can find many advantages in an on-premises Dell AI Factory solution, whether they opt to purchase and manage it themselves or engage with Dell APEX Subscriptions. Choosing an on-premises Dell AI Factory solution could save your organization significantly over hosting GenAI in the cloud, while giving you control over the security and privacy of your data as well as any updates and changes to the environment, and while ensuring your environment is managed consistently.
TrsLabs - Leverage the Power of UPI Payments (Trs Labs)
Revolutionize your Fintech growth with UPI Payments
"Riding the UPI strategy" refers to leveraging the Unified Payments Interface (UPI) to drive digital payments in India and beyond. This involves understanding UPI's features, benefits, and potential, and developing strategies to maximize its usage and impact. Essentially, it's about strategically utilizing UPI to promote digital payments, financial inclusion, and economic growth.
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea... (Raffi Khatchadourian)
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code—supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution—avoiding performance bottlenecks and semantically inequivalent results. We discuss the engineering aspects of a refactoring tool that automatically determines when it is safe and potentially advantageous to migrate imperative DL code to graph execution and vice-versa.
UiPath Automation Suite – Use case from an international NGO based in Geneva (UiPathCommunity)
We invite you to a new session of the UiPath community in French-speaking Switzerland.
This session will be devoted to an experience report from a non-governmental organization based in Geneva. The team in charge of the UiPath platform for this NGO will present the variety of automations implemented over the years: from donation management to supporting teams in the field.
Beyond the use cases, this session will also be an opportunity to discover how this organization deployed UiPath Automation Suite and Document Understanding.
This session was broadcast live on May 7, 2025 at 1:00 p.m. (CET).
Find all our past and upcoming UiPath community sessions at: https://community.uipath.com/geneva/.
Vaibhav Gupta BAML: AI workflows without Hallucinations (john409870)
Shipping Agents
Vaibhav Gupta
Cofounder @ Boundary
in/vaigup
boundaryml/baml
Imagine if every API call you made
failed only 5% of the time
Imagine if every LLM call you made
failed only 5% of the time
Fault tolerant systems are hard
but now everything must be
fault tolerant
We need to change how we
think about these systems
Aaron Villalpando
Cofounder @ Boundary
Boundary
Combinator
We used to write websites like this:
But now we do this:
Problems web dev had:
Strings. Strings everywhere.
State management was impossible.
Dynamic components? forget about it.
Reuse components? Good luck.
Iteration loops took minutes.
Low engineering rigor
React added engineering rigor
The syntax we use changes how we
think about problems
We used to write agents like this:
Problems agents have:
Strings. Strings everywhere.
Context management is impossible.
Changing one thing breaks another.
New models come out all the time.
Iteration loops take minutes.
Low engineering rigor
Agents need
the expressiveness of English,
but the structure of code
F*** You, Show Me The Prompt.
<show don’t tell>
Less prompting +
More engineering
=
Reliability +
Maintainability
BAML
Sam, Greg Antonio, Chris
turned down OpenAI to join
ex-founder, one of the earliest BAML users
MIT PhD
20+ years in compilers
made his own database, 400k+ YouTube views
Vaibhav Gupta
in/vaigup
[email protected]
Thank you!
Working with deeply nested documents in Apache Solr
1. O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A
2. Working with deeply nested documents in Apache Solr
Anshum Gupta, Alisa Zhila
IBM Watson
3. 3
Anshum Gupta
• Apache Lucene/Solr committer and PMC member
• Search guy @ IBM Watson.
• Interested in search and related stuff.
• Apache Lucene since 2006 and Solr since 2010.
4. 4
Alisa Zhila
• Apache Lucene/Solr supporter :)
• Natural Language Processing technologies @ IBM Watson
• Interested in search and related stuff
7. 7
• Social media comments, email threads, annotated data (AI)
• Relationship between documents
• Possibility to flatten
Need for nested data
EXAMPLE: Blog Post with Comments
Peter Navarro outlines the Trump economic plan
Tyler Cowen, September 27, 2016 at 3:07am
Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased
exports and reduced imports.
1 Ray Lopez September 27, 2016 at 3:21 am
I’ll be the first to say this, but the analysis is flawed.
{negative}
2 Brian Donohue September 27, 2016 at 9:20 am
The math checks out. Solid.
{positive}
examples from http://marginalrevolution.com
8. 8
• Cannot flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about
Trump' -- IMPOSSIBLE!!!
Nested Documents
EXAMPLE: Data Flattening
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased exports and
reduced imports.
Comment_authors: [Ray Lopez, Brian Donohue]
Comment_dates: [September 27, 2016 at 3:21 am,
September 27, 2016 at 9:20 am]
Comment_texts: ["I’ll be the first to say this, but the analysis is
flawed.", "The math checks out. Solid."]
Comment_sentiments: [negative, positive]
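The cross-matching failure in this flattened example can be shown in a few lines of Python (a hedged sketch; field names follow the flattened document above): a search for a positive comment that calls the analysis flawed matches the document even though no single comment satisfies both conditions.

```python
# Flattened blog post: child fields collapsed into parallel arrays,
# losing which sentiment belongs to which comment.
flattened = {
    "title": "Peter Navarro outlines the Trump economic plan",
    "comment_texts": [
        "I'll be the first to say this, but the analysis is flawed.",
        "The math checks out. Solid.",
    ],
    "comment_sentiments": ["negative", "positive"],
}

# Naive cross-field match: "a positive comment that calls the analysis flawed".
# No such comment exists, yet the flattened document matches, because each
# condition is checked against its array independently.
false_positive = (
    "positive" in flattened["comment_sentiments"]
    and any("flawed" in t for t in flattened["comment_texts"])
)
print(false_positive)  # True, even though the "flawed" comment is negative
```

This is exactly why the slide calls the query "IMPOSSIBLE" on flattened data: the parent-child pairing is gone.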
9. 9
• Cannot flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about
Trump' -- POSSIBLE!!! (stay tuned)
Nested Documents
EXAMPLE: Hierarchical Documents
Type: Post
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased exports and
reduced imports.
Type: Comment
Author: Ray Lopez
Date: September 27, 2016 at 3:21 am
Text: I’ll be the first to say this, but the analysis is flawed.
Sentiment: negative
Type: Comment
Author: Brian Donohue
Date: September 27, 2016 at 9:20 am
Text: The math checks out. Solid.
Sentiment: positive
10. 10
• Blog Post Data with Comments and Replies
from http://marginalrevolution.com (curated)
• 2 posts, 2-3 comments per post, 0-3 replies
per comment
• Extracted keywords & sentiment data
• 4 levels of "nesting"
• Too big to show on slides
• Data + Scripts + Demo Queries:
• https://github.com/alisa-ipn/solr-revolution-2016-nested-demo
Running Example
12. 12
• Nested XML
• JSON Documents
• Add _childDocuments_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
• JSON Document endpoint (6x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
Sending Documents to Solr
13. 13
solr-6.2.1$ bin/post -c demo-xml ./data/example-data.xml
Sending Documents to Solr: Nested XML
<add>
<doc>
<field name="type">post</field>
<field name="author"> "Alex Tabarrok"</field>
<field name="title">"The Irony of Hillary Clinton’s Data Analytics"</field>
<field name="body">"Barack Obama’s campaign adopted data but
Hillary Clinton’s campaign has been molded by data from birth."</field>
<field name="id">"12015-24204"</field>
<doc>
<field name="type">comment</field>
<field name="author">"Todd"</field>
<field name="text">"Clinton got out data-ed and out organized in
2008 by Obama. She seems at least to learn over time, and apply the
lessons learned to the real world."</field>
<field name="sentiment">"positive"</field>
<field name="id">"29798-24171"</field>
<doc>
<field name="type">reply</field>
<field name="author">"The Other Jim"</field>
<field name="text">"No, she lost because (1) she is thoroughly
detested person and (2) the DNC decided Obama should therefore
win."</field>
<field name="sentiment">"negative"</field>
<field name="id">"29798-21232"</field>
</doc>
</doc>
</doc>
</add>
14. 14
• Add _childDocuments_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr
Sending Documents to Solr: JSON Documents
[{ "path": "1.posts",
"id": "28711",
"author": "Alex Tabarrok",
"title": "The Irony of Hillary Clinton’s Data Analytics",
"body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign
has been molded by data from birth.",
"_childDocuments_": [
{
"path": "2.posts.comments",
"id": "28711-19237",
"author": "Todd",
"text": "Clinton got out data-ed and out organized in 2008 by Obama. She
seems at least to learn over time, and apply the lessons learned to the real world.",
"sentiment": "positive",
"_childDocuments_": [
{
"path": "3.posts.comments.replies",
"author": "The Other Jim",
"id": "28711-12444",
"sentiment": "negative",
"text": "No, she lost because (1) she is thoroughly detested person and
(2) the DNC decided Obama should therefore win."
}]}]}]
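The pre-processing the bullets above call for can be sketched in a few lines. This is an illustration, not the talk's actual tooling: `CHILD_KEYS` (the raw field names holding child records) and the `depth.dotted.path` format are assumptions based on this deck's example data.

```python
CHILD_KEYS = ("comments", "replies")  # assumed raw field names holding child records

def to_solr(doc, path, depth=1):
    """Convert one raw record (and its children) to Solr block-join JSON,
    adding a depth-prefixed "path" field and "_childDocuments_" arrays."""
    out = {"path": f"{depth}.{path}"}
    children = []
    for key, value in doc.items():
        if key in CHILD_KEYS:
            children += [to_solr(c, f"{path}.{key}", depth + 1) for c in value]
        else:
            out[key] = value
    if children:
        out["_childDocuments_"] = children
    return out

raw = {"id": "28711", "author": "Alex Tabarrok",
       "comments": [{"id": "28711-19237", "author": "Todd",
                     "replies": [{"id": "28711-12444", "author": "The Other Jim"}]}]}
doc = to_solr(raw, "posts")
```

Running this over the sample record yields the same shape as the JSON above: `"1.posts"` at the top, `"2.posts.comments"` one level down, and so on.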
15. 15
• JSON Document endpoint (6x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs?split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data-binary @small-example-data.json -H 'Content-type:application/json'
NOTE: All documents must contain a unique ID.
Sending Documents to Solr: JSON Endpoint
16. 16
• Update Request Processors don’t work with nested documents
• Example:
• UUID update processor does not auto-add an id for a child document.
• Workarounds:
• Handle the id generation for nested documents at the client layer.
• Change the update processor in Solr to handle nested documents.
Update Processors and Nested Documents
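The client-layer workaround for the UUID case above can be sketched as follows. This is an illustrative sketch, assuming children live under the `_childDocuments_` key as in the examples in this deck:

```python
import uuid

def assign_ids(doc):
    """Recursively assign an id to any document in the block that lacks one,
    since Solr's UUID update processor does not reach child documents."""
    doc.setdefault("id", str(uuid.uuid4()))
    for child in doc.get("_childDocuments_", []):
        assign_ids(child)
    return doc

block = assign_ids({"type": "post",
                    "_childDocuments_": [{"type": "comment",
                                          "_childDocuments_": [{"type": "reply"}]}]})
```

After this call every document in the block has an `id`, so the whole block can be sent to Solr as-is.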
17. 17
• The entire block needs reindexing
• Forgot to add a meta-data field that might be useful? Complete reindex.
• Store everything in Solr IF
• it’s too expensive to reconstruct the doc from the original data source
• you no longer have access to the data, e.g. streaming data
Re-Indexing Your Documents
18. 18
• Various ways to index nested documents
• Need to re-index entire block
Nested Document Indexing Summary
21. 21
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]},
{
"text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."],
"path":["2.posts.comments"]}
Returning certain types of documents
Find all comments and replies that mention Trump
q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump
Recipe:
At the data pre-processing stage, add a field that indicates the document type
and also its path in the hierarchy (stay tuned):
25. 25
Returning parents by querying children:
Block Join Parent Query
Find all comments whose keywords detected positive sentiment towards Hillary
q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive
Query
Level 3
Result
Level 2
{
"author":["Brian Donohue"],
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering,
but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]},
{
"author":["Todd"],
"text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to
learn over time, and apply the lessons learned to the real world."],
"path":["2.posts.comments"]}
26. 26
{
"sentiment":["negative"],
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"sentiment":["neutral"],
"text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S.
asset values?"],
"path":["3.posts.comments.replies"]},
{
"sentiment":["positive"],
"text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see
a fantasy in person?"],
"path":["3.posts.comments.replies"]}
Returning children by querying parents:
Block Join Child Query
Find replies to negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies
Query
Level 2
Result
Level 3
27. 27
Returning children by querying parents:
Block Join Child Query
Find replies to negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies
Block Join Child Query + Filtering Query
A bit counterintuitive and non-symmetrical to the BJPQ
28. 28
{
"path":["4.posts.comments.replies.keywords"],
"id":"17413-13550",
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"],
"id":"17413-66188"},
{
"path":["3.posts.comments.keywords"],
"id":"12413-12487",
"text":["Hillary"]},
{
"text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see
a fantasy in person?"],
"path":["3.posts.comments.replies"],
"id":"12413-10998"}
Returning all of a document's descendants
Block Join Child Query
Find all descendants of negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative
Query
Level 2
Results
Level 3
Results
Level 4
29. 29
Returning all of a document's descendants
Block Join Child Query
Find all descendants of negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative
Issue: no grouping by parent
What if we want to bring the whole sub-structure?
30. 30
Find all negative comments and return them with all their descendants
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*]
Query
Level 2
Result
Level 2
sub-hierarchy
Returning document with all descendants:
ChildDocTransformer
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"path":["4.posts.comments.replies.keywords"],
"text":["U.S."]},
{
"text":["So then I guess he will also eliminate the current account surplus? What
will happen to U.S. asset values?"],
"path":["3.posts.comments.replies"]}
]
},
...
Issue: the "sub-hierarchy" is flat
31. • Returns all descendant documents along with the queried document
• Flattens the sub-hierarchy
• Workarounds:
• Reconstruct the document using the path information ("path":["3.posts.comments.replies"]) when you want the entire subtree (result post-processing)
• Use childFilter when you want a specific level
31
"This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document." (ChildDocTransformer cwiki)
Returning document with all descendants:
ChildDocTransformer
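The reconstruction workaround can be sketched with a small stack-based walk over the flat `_childDocuments_` list, using the depth prefix of the `path` field (e.g. `"3.posts.comments.replies"` → depth 3). This sketch assumes, as the ChildDocTransformer output on these slides shows, that the flat list is in index order, so a document's descendants appear before it:

```python
def depth(doc):
    # "3.posts.comments.replies" -> 3
    return int(doc["path"][0].split(".")[0])

def rebuild(parent):
    """Nest a flat _childDocuments_ list back into a tree."""
    stack = []  # completed subtrees still waiting for their parent
    for doc in parent.get("_childDocuments_", []):
        node = dict(doc, _childDocuments_=[])
        # any deeper docs sitting on the stack are this node's descendants
        while stack and depth(stack[-1]) > depth(node):
            node["_childDocuments_"].insert(0, stack.pop())
        stack.append(node)
    return dict(parent, _childDocuments_=stack)

flat = {"path": ["2.posts.comments"], "_childDocuments_": [
    {"path": ["4.posts.comments.replies.keywords"], "text": ["Trump"]},
    {"path": ["3.posts.comments.replies"], "text": ["reply one"]},
    {"path": ["4.posts.comments.replies.keywords"], "text": ["U.S."]},
    {"path": ["3.posts.comments.replies"], "text": ["reply two"]},
]}
tree = rebuild(flat)
```

On this sample, the two level-3 replies end up as direct children of the comment, each carrying its own level-4 keyword document.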
32. 32
Find all negative comments and return them with all replies to them
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*
childFilter=path:3.posts.comments.replies]
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is
funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["So then I guess he will also eliminate the current account surplus? What
will happen to U.S. asset values?"],
"path":["3.posts.comments.replies"]}
]
},
...
Returning document with specific descendants:
ChildDocTransformer + childFilter
Query
Level 2:comments
Result
Level 2:comments
+ Level 3:replies
33. 33
Find all negative comments and return them with all their descendants that mention Trump
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump]
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is
funnier."],
"path":["3.posts.comments.replies"]}
]
},
...
Returning document with queried descendants:
ChildDocTransformer + childFilter
Query
Level 2:comments
Result
Level 2:comments
+ sub-levels
Issue: cannot use boolean expressions in childFilter query
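One workaround for this limitation is to fetch the full flat `_childDocuments_` list (no childFilter, or a broad one) and apply the boolean logic on the client. A sketch, using field names from this deck's example data:

```python
def filter_children(doc, keep):
    """Client-side replacement for a boolean childFilter: keep only the
    descendants for which `keep` returns True."""
    return dict(doc, _childDocuments_=[
        c for c in doc.get("_childDocuments_", []) if keep(c)])

result = {"sentiment": ["negative"], "path": ["2.posts.comments"], "_childDocuments_": [
    {"path": ["4.posts.comments.replies.keywords"], "text": ["Trump"]},
    {"path": ["3.posts.comments.replies"],
     "text": ["So then I guess he will also eliminate the current account surplus?"]},
    {"path": ["4.posts.comments.replies.keywords"], "text": ["U.S."]},
]}

# "is a reply OR mentions Trump" -- an OR that childFilter cannot express
filtered = filter_children(
    result,
    lambda c: c["path"][0] == "3.posts.comments.replies" or "Trump" in c["text"][0])
```

The trade-off is extra transfer and client CPU, but any predicate expressible in code becomes possible.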
34. 34
Cross-Level Querying Mechanisms:
• Block Join Parent Query
• Block Join Child Query
• ChildDocTransformer
Good points:
• Overlapping & complementary features
• Good capabilities for querying direct ancestors/descendants
• Possible to query on siblings of different types
Drawbacks:
• Need for data pre-processing for better querying flexibility
• Limited support for querying over non-directly-related branches (overcome with graphs?)
• Flattening of nested data (additional post-processing is needed for reconstruction)
Nested Document Querying Summary
44. 44
• Experimental Feature
• Needs to be turned on explicitly in solrconfig.xml
More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting
Block Join Faceting
47. 47
Output Comparison
Block Join Facet JSON Facet API
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
{
"val":"Obama",
"count":2,
"counts_by_comments":1},
{
"val":"U.S.",
"count":1,
"counts_by_comments":1}
]}
Distribution of keywords that appear in comments and replies by the comments
48. 48
Output Comparison
Block Join Facet JSON Facet API
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
...
Distribution of keywords that appear in comments and replies by the comments
Output is sorted in alphabetical order; this cannot be changed.
facet:{
top_keywords : {
...
sort: "counts_by_comments desc"
}}}
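For context, the JSON Facet request behind the "top_keywords" output above might look like the sketch below. The `domain: {blockChildren: ...}` switch and `unique(...)` aggregation are real JSON Facet API features, but the talk's exact request is not shown on the slide, and `comment_id` is a hypothetical per-branch identifier field added at indexing time so that counting by comment (rather than by top-level post) is possible:

```python
import json

# Hedged sketch of a JSON Facet API request body; "comment_id" is an
# assumed pre-processed field, not part of the talk's shown schema.
request = {
    "query": "path:2.posts.comments",
    "facet": {
        "top_keywords": {
            "type": "terms",
            "field": "text",
            # facet over the child documents of comments
            "domain": {"blockChildren": "path:2.posts.comments"},
            "sort": "counts_by_comments desc",
            "facet": {"counts_by_comments": "unique(comment_id)"},
        }
    },
}
body = json.dumps(request)
```

Unlike the Block Join Facet output, this form lets you sort by the nested aggregation.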
49. 49
JSON Facet API:
• Experimental, but more mature and established
• Bulky JSON syntax
• Faceting on children by non-top-level ancestors requires introducing unique branch identifiers similar to "_root_" on each level
Block Join Facet:
• Experimental feature
• Lacks controls: sorting, limit, ...
• Traditional query-style syntax
• Proper handling of faceting on children by non-top-level ancestors
Hierarchical Faceting Summary
50. 50
• Returning hierarchical structure
• JSON facet rollups are in the works - SOLR-8998
• Graph querying might replace much of the cross-level querying functionality - no distributed support right now
• There’s more, but the community would love to have more people involved!
Community Roadmap