A talk about the (hidden) document processing capability built right into Apache Solr. We show you what it is, how to use it, how to write your own plugins, and suggest some future improvements.
2. What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion
8. Why document processing?
But what if you want to:
Add or remove fields?
Make decisions based on other fields?
We need a way to modify the Document
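To make this concrete, here is a minimal sketch (not from the original deck) of an UpdateRequestProcessor that modifies a document in flight. It uses Solr's standard UpdateRequestProcessor contract; the class and field names ("type", "format", "tmp_raw") are hypothetical examples.

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Sketch: add or remove fields based on another field's value.
public class TypeClassifyProcessor extends UpdateRequestProcessor {

  public TypeClassifyProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    // Decide based on an existing field...
    Object type = doc.getFieldValue("type");
    if (type != null && "pdf".equals(type.toString())) {
      doc.setField("format", "document");   // ...add a field...
    }
    doc.removeField("tmp_raw");             // ...or remove one.
    super.processAdd(cmd);                  // hand the document to the next processor
  }
}

Each processor in a chain wraps the next one; calling super.processAdd(cmd) is what keeps the document flowing down the chain.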
23. Other examples
Entity extraction: a processor tags entities such as Company, Location, and Date in free text, e.g. “The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.”
29. Writing your own processor
•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
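As a hedged sketch of these points, a parameterized factory might look as follows. The init() and getInstance() hooks are the standard UpdateRequestProcessorFactory API, but the class name and the prefixed parameter name ("typeclassify.field") are invented for illustration. Implementing SolrCoreAware, SchemaAware or ResourceLoaderAware on the factory gives you callbacks once the core, schema or resource loader is available.

import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical factory: reads prefixed parameters once from solrconfig.xml,
// then hands out lightweight per-request processor instances.
public class TypeClassifyProcessorFactory extends UpdateRequestProcessorFactory {

  private String typeField = "type"; // default, overridable via config

  @Override
  public void init(NamedList args) {
    // Prefixing the parameter name avoids clashes with other processors
    // configured in the same chain.
    Object field = args.get("typeclassify.field");
    if (field != null) {
      typeField = field.toString();
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new TypeClassifyProcessor(next); // typeField could be passed along
  }
}

For testability, keep the actual field manipulation in small methods that unit tests can call without a running core.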
39. Donations back to Apache
SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)
Many thanks for the donations!
43. Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub-chains
Does not scale very well
No native scripting language support
44. Improvements
Pain:
Potentially expensive initialization
StaticRankProcessor: read & parse 50,000 lines
Proposed cure:
Keep a persistent state object in the factory:
private final Map<Object,Object> sharedObjCache;
new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
The processor uses sharedObjCache for its state
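Until such a sharedObjCache exists in the API, one can approximate the cure today, since a factory is created once per core while processors are created per request: do the expensive parsing lazily in the factory and share the result. A minimal sketch, assuming a hypothetical tab-separated rank file and the StaticRankProcessor constructor shown above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class StaticRankProcessorFactory extends UpdateRequestProcessorFactory {

  // Persistent shared state: the factory outlives individual requests.
  private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    // Parse the (hypothetical) 50,000-line rank file once, not per request.
    sharedObjCache.computeIfAbsent("ranks", k -> loadRanks("ranks.txt"));
    return new StaticRankProcessor(next, sharedObjCache); // hypothetical ctor
  }

  private Map<String, Float> loadRanks(String file) {
    Map<String, Float> ranks = new HashMap<>();
    try {
      for (String line : Files.readAllLines(Paths.get(file))) {
        String[] parts = line.split("\t");
        if (parts.length == 2) {
          ranks.put(parts[0], Float.parseFloat(parts[1]));
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not load rank file " + file, e);
    }
    return ranks;
  }
}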
45. Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub-chains
Does not scale very well
No native scripting language support
46. Improvements
Pain:
Multiple chains often need identical Processors
UiO’s two chains share 80% -> copy/paste
Proposed cure:
Allow sharing of named instances
Define:
<processor name="langid" class="..">
Refer:
<processor ref="langid" />
See SOLR-2823
47. Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub-chains
Does not scale very well
No native scripting language support
48. Improvements
Pain:
Chains are linear only
Hard to do branching, sub-chains, conditionals...
Proposed cure (SOLR-2841):
New scriptable Update Chain - alternative to XML
Script chain logic in solr/conf/updateproc.groovy
Full flexibility:
chain myChain {
  if (doc.getFieldValue("type").equals("pdf"))
    process(tikaproc)
}
49. Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub-chains
Does not scale very well
No native scripting language support
50. Improvements
Pain:
Single threaded
Heavy processing is not efficient
Proposed cure:
Local: Use multi-threaded update requests (see the client sketch below)
SolrCloud: Dedicated nodes, role=“processor” ?
Wrap an external pipeline in UpdateProcessor
Example: OpenPipelineUpdateProcessor ?
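On the “Local” cure above, multi-threaded update requests are straightforward from the client side. A minimal sketch, assuming a SolrJ version (7.x/8.x era) that ships ConcurrentUpdateSolrClient; the URL and field names are placeholders:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelFeeder {
  public static void main(String[] args) throws Exception {
    // Buffers documents and sends them with 4 parallel threads, so several
    // update chain invocations run concurrently on the server.
    try (ConcurrentUpdateSolrClient client =
             new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycore")
                 .withQueueSize(1000)
                 .withThreadCount(4)
                 .build()) {
      for (int i = 0; i < 100_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title", "Document " + i);
        client.add(doc);
      }
      client.commit();
    }
  }
}

With several sender threads, the chain runs in several request threads on the Solr side, which is often enough for moderately heavy processing.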
51. Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub-chains
Does not scale very well
No native scripting language support
52. Improvements
Pain:
Not really a “problem” :-)
Nice to write processors in Python, Groovy, JS...
Proposed cure:
Now: Finish SOLR-1725: Script-based Processor
Later: Make scripts first-class processors
<processor script="myScript.py" />
or
<processor ref="myScript" />
54. New standalone framework?
•The UpdateChain is Solr-specific
•Interest in a pure pipeline framework
•Search engine independent
•Scalable
•Rich pool of processors
•Several existing candidates
•Some initial thoughts:
http://wiki.apache.org/solr/DocumentProcessing
56. Summary
•Document-centric vs. field-centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!
59. Alternative pipelines
•OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache commons pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitalPebble)
•FindWise and TwigKit also have some technology
60. Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr. The main benefit of such a method is that you can continue to feed content with SolrJ, DIH, or other Update Request Handlers.
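A hedged sketch of such a call-out processor, assuming Java 11+ and a hypothetical external pipeline service reachable over HTTP; the endpoint, payload format, and field names are all invented for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical processor that sends each document's text to an external
// pipeline service and stores the enriched result in a new field.
public class ExternalPipelineProcessor extends UpdateRequestProcessor {

  private final URL endpoint;

  public ExternalPipelineProcessor(UpdateRequestProcessor next, URL endpoint) {
    super(next);
    this.endpoint = endpoint;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object text = doc.getFieldValue("text");
    if (text != null) {
      doc.setField("enriched_text", callPipeline(text.toString()));
    }
    super.processAdd(cmd);
  }

  private String callPipeline(String body) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    try (InputStream in = conn.getInputStream()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }
}

The trade-off is an extra network hop per document, so batching and keeping the service close to Solr matter in practice.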
61. Scaling with external pipeline
Here is a more advanced, distributed case, where a Solr node is dedicated to processing, and the entry-point Solr only dispatches the requests.