Solr + Hadoop: Interactive Search for Hadoop
Gregory Chanan (gchanan AT cloudera.com)
OC Big Data Meetup 07/16/14
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion
Why Search?
• Hadoop for everyone
• Typical case:
• Ingest data to storage engine (HDFS, HBase, etc)
• Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!
Why Search?
An Integrated Part of the Hadoop System
One pool of data
One security framework
One set of system resources
One management interface
Benefits of Search
• Improved Big Data ROI
• An interactive experience without technical knowledge
• Faster time to insight
• Exploratory analysis, esp. unstructured data
• Broad range of indexing options to accommodate needs
• Cost efficiency
• Single scalable platform; no incremental investment
• No need for separate systems, storage
What is Cloudera Search?
• Full-text, interactive search with faceted navigation
• Apache Solr integrated with CDH
• Established, mature search with vibrant community
• In production environments for years
• Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
• Batch, near real-time, and on-demand indexing
• Available for CDH4 and CDH5
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion
Apache Hadoop
• Apache HDFS
• Distributed file system
• High reliability
• High throughput
• Apache MapReduce
• Parallel, distributed programming model
• Allows processing of large datasets
• Fault tolerant
Apache Lucene
• Full text search library
• Indexing
• Querying
• Traditional inverted index
• Batch and Incremental indexing
• We are using version 4.4 in current release
Apache Solr
• Search service built using Lucene
• Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
• Indexing
• Query
• Administrative interface
• Also rich web admin GUI via HTTP
Apache SolrCloud
• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
• partition index for size
• replicate for query performance
• Uses ZooKeeper for coordination
• No split-brain issues
• Simplifies operations
SolrCloud Architecture
• Updates automatically sent to the correct shard
• Replicas handle queries, forward updates to the leader
• Leader indexes the document for the shard, and forwards the index notation to itself and any replicas.
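The update path above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SolrCloud's implementation: the shard is chosen by hashing the document id with `zlib.crc32` as a stand-in for SolrCloud's own router, and the `Shard` class and the `route` and `index` names are hypothetical.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical shard: one leader plus replicas, each holding a copy of the index."""
    name: str
    leader: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)

    def index(self, doc_id, doc):
        # The leader indexes the document first, then forwards it to every replica.
        self.leader[doc_id] = doc
        for replica in self.replicas:
            replica[doc_id] = doc


def route(doc_id, shards):
    """Pick a shard deterministically from the document id (stand-in hash)."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


# Three shards, each with two replicas modeled as plain dicts.
shards = [Shard(f"shard{i}", replicas=[{}, {}]) for i in range(3)]
for doc_id in ("doc-1", "doc-2", "doc-3", "doc-4"):
    route(doc_id, shards).index(doc_id, {"id": doc_id})
```

Because routing depends only on the document id, any node can compute the target shard without asking a central coordinator.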
SolrCloud Architecture
Visual representation via admin UI
Distributed Search on Hadoop
[Diagram labels: Flume, Hue UI, Custom UI, Custom App, three Solr servers (SolrCloud), "query" and "index" arrows, Hadoop Cluster (MR, HDFS, HBase), ZK]
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Conclusion
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
Near Real Time Indexing with Flume
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
[Diagram labels: Log File, Other, Flume Agent, Indexer, HDFS]
Apache Flume - MorphlineSolrSink
• A Flume Source…
• Receives/gathers events
• A Flume Channel…
• Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
• Sends the events on to the next location
• Flume MorphlineSolrSink
• Integrates Cloudera Morphlines library
• ETL, more on that in a bit
• Does batching
• Results sent to Solr for indexing
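The source, channel, and sink roles above can be illustrated with a small batching sketch. It is not Flume code: `BatchingSink`, its parameters, and the in-memory `deque` channel are hypothetical stand-ins, and the "send" step simply collects batches instead of calling Solr.

```python
from collections import deque


class BatchingSink:
    """Hypothetical sink: drains events from a channel and sends them on in batches."""

    def __init__(self, channel, transform, send, batch_size=100):
        self.channel = channel        # queue of raw events (the "channel")
        self.transform = transform    # ETL step applied to each event
        self.send = send              # callable that receives a list of documents
        self.batch_size = batch_size

    def process(self):
        """Take up to batch_size events, transform them, and send one batch."""
        batch = []
        while self.channel and len(batch) < self.batch_size:
            batch.append(self.transform(self.channel.popleft()))
        if batch:
            self.send(batch)
        return len(batch)


channel = deque(f"event {i}" for i in range(5))
sent_batches = []
sink = BatchingSink(channel, lambda e: {"text": e}, sent_batches.append, batch_size=2)

# Keep processing until the channel is empty.
while sink.process():
    pass
```

Batching trades a little latency for fewer round trips to the indexing service.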
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
Near Real Time Indexing of Apache HBase
• Planet-sized tabular data
• Immediate access & updates
• Fast & flexible information discovery
• Big data management
[Diagram labels: interactive load, HBase, HDFS, Replication, HBase Indexer(s), Solr servers, Search]
Lily HBase Indexer
• Collaboration between NGData & Cloudera
• NGData are creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as a HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates Cloudera Morphlines library for ETL of rows
• AL2 licensed on GitHub: https://github.com/ngdata
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
Scalable Batch Indexing
Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
[Diagram labels: Files, HDFS, Indexer, Index shard, Solr server]
MapReduce Indexer
MapReduce Job with two parts
1) Scan HDFS for files to be indexed
• Much like Unix “find” – see HADOOP-8989
• Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
• Mapper extracts content via Cloudera Morphlines
• Reducer indexes documents via embedded Solr server
• Originally based on SOLR-1301
• Many modifications to enable linear scalability
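The mapper/reducer split above can be sketched without Hadoop. This is a toy model under stated assumptions: the real mapper runs a morphline and the real reducer writes through an embedded Solr server, whereas here `mapper` only strips whitespace, `reducer` builds a tiny inverted index, and the shard key comes from `zlib.crc32`. All names and sample files are hypothetical.

```python
import zlib
from collections import defaultdict


def mapper(path, content, num_shards):
    """Extract a document from a file and emit it keyed by its target shard."""
    doc = {"id": path, "text": content.strip()}
    shard = zlib.crc32(path.encode("utf-8")) % num_shards
    return shard, doc


def reducer(docs):
    """Build a tiny inverted index (term -> sorted document ids) for one shard."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}


files = {
    "/logs/a.txt": "Disk full on node1\n",
    "/logs/b.txt": "Node1 restarted\n",
    "/logs/c.txt": "Disk replaced\n",
}
num_shards = 2

# Shuffle phase: group mapper output by shard key.
grouped = defaultdict(list)
for path, content in files.items():
    shard, doc = mapper(path, content, num_shards)
    grouped[shard].append(doc)

# One reducer per shard produces that shard's index.
shard_indexes = {shard: reducer(docs) for shard, docs in grouped.items()}
```

Each reducer works on its own shard only, which is what lets the indexing step scale with the number of reducers.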
MapReduce Indexer “golive”
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
• Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
• No downtime for users
• No NRT expense
• Linear scale out to the size of your MR cluster
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch
• MapReduceIndexerTool
• HBaseBatchIndexer
HBase + MapReduce
• Run MapReduce job over HBase tables
• Same architecture as running over HDFS
• Similar to HBase’s CopyTable
• Support for go-live
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Conclusion
Cloudera Morphlines
• Open Source framework for simple ETL
• Simplify ETL
• Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
• Configuration over coding
• Standardize ETL
• Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)
• It’s a Java library
• AL2 licensed on GitHub: https://github.com/kite-sdk
Cloudera Morphlines Architecture
• Logs, tweets, social media, HTML, images, PDF, text… Anything you want to index
• Flume, MR Indexer, HBase Indexer, etc... Or your application!
• Morphlines can be embedded in any application…
[Diagram labels: Morphline Library, three Solr servers (SolrCloud)]
Extraction and Mapping
• Modeled after Unix pipelines (records instead of lines)
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads
[Diagram labels: syslog, Flume Agent, Solr sink, Morphline Library (Command: readLine, Command: grok, Command: loadSolr), Event, Record, Document, Solr]
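The "Unix pipeline, but with records" idea can be sketched as a chain of functions that each take one record and yield zero or more records. This is an illustration only, not the Morphlines API: `read_line`, `add_length`, and `run_pipeline` are hypothetical names, and `add_length` is a made-up transformation step.

```python
def read_line(record):
    """Split the raw event body into one record per line."""
    for line in record["body"].splitlines():
        yield {"message": line}


def add_length(record):
    """Stand-in transformation step: annotate each record with its length."""
    yield {**record, "length": len(record["message"])}


def run_pipeline(commands, record):
    """Feed a record through each command in turn, like a Unix pipeline."""
    records = [record]
    for command in commands:
        records = [out for rec in records for out in command(rec)]
    return records


event = {"body": "first line\nsecond"}
documents = run_pipeline([read_line, add_length], event)
```

Because every command has the same shape, the same chain can be reused by any host that can hand it a record.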
Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      # Emit one record per input line
      { readLine {} }
      {
        # Extract named syslog fields from the "message" field
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      # Send the resulting record to Solr
      { loadSolr {} }
    ]
  }
]
Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.
Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.
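To see what the grok expression does, here is a rough plain-regex equivalent. It is a sketch under stated assumptions: the named groups mirror the grok field names, but the sub-patterns are simplified approximations of the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, and so on), and `parse_syslog` is a hypothetical helper. The input is a syslog line shaped like the example above.

```python
import re

# Simplified stand-ins for the grok dictionary patterns used in the morphline.
SYSLOG_PATTERN = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse_syslog(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = SYSLOG_PATTERN.match(line)
    return match.groupdict() if match else None


record = parse_syslog(
    "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22."
)
```

The pid group is optional, so a line without `[607]` still matches and `syslog_pid` comes back as `None`.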
Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
Current Command Library (cont)
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Conclusion
Querying
• Built-in Solr web UI
• Write your own
• Hue
Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion
Security
• Upstream Solr doesn’t deal with security
• Cloudera Search supports Kerberos authentication
• Similar to Oozie / WebHDFS
• Collection-Level Authorization via Apache Sentry
• Document-Level Authorization via Apache Sentry (new in CDH5.1)
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Collection-Level Authorization
• Document-Level Authorization
• Conclusion
Collection-Level Authorization
• Sentry supports role-based granting of privileges
• Each role can be granted QUERY, UPDATE, and/or administrative privileges on an index (collection)
• Privileges stored in a “policy file” on HDFS
Policy File
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, \
  collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
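How a policy file like this maps groups to privileges can be sketched with a small parser. This is not Sentry's parser: `parse_policy` and `is_allowed` are hypothetical helpers, the policy text is written with each role on a single line to keep the sketch short, and action names are compared case-insensitively as an assumption.

```python
POLICY = """\
[groups]
dev_ops = engineer_role, ops_role
[roles]
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""


def parse_policy(text):
    """Return (groups, roles): group -> role names, role -> {(collection, action)}."""
    groups, roles, section = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        name, _, value = line.partition("=")
        items = [item.strip() for item in value.split(",")]
        if section == "groups":
            groups[name.strip()] = items
        elif section == "roles":
            privileges = set()
            for item in items:
                # Each item looks like "collection = NAME->action=ACTION".
                target, _, action = item.partition("->")
                collection = target.split("=", 1)[1].strip()
                privileges.add((collection, action.split("=", 1)[1].strip().lower()))
            roles[name.strip()] = privileges
    return groups, roles


def is_allowed(groups, roles, group, collection, action):
    """True if any role assigned to the group grants the action on the collection."""
    return any(
        (collection, action.lower()) in roles.get(role, set())
        for role in groups.get(group, [])
    )


groups, roles = parse_policy(POLICY)
```

For example, `is_allowed(groups, roles, "dev_ops", "hbase_logs", "Update")` is false here because neither role grants Update on `hbase_logs`.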
Integrating Sentry and Solr
• Solr Request Handlers:
• Specified per collection in solrconfig.xml:
• Request to: http://localhost:8983/solr/collection1/select is dispatched to an instance of solr.SearchHandler
Sentry Request Handlers
• Sentry ships with its own version of solrconfig.xml with secure handlers, called solrconfig.xml.secure
• Use a SearchComponent to implement the checking
• Update Requests handled in a similar way
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Indexing
• ETL - morphlines
• Querying
• Security
• Collection-Level Authorization
• Document-Level Authorization
• Conclusion
Document-level authorization Motivation
• Index-level authorization is useful when access control requirements for documents are homogeneous
• Security requirements may require restricting access to a subset of documents
Document-level authorization Motivation
• Consider “Confidential” and “Secret” documents. How to store with only index-level authorization?
• Pushes complexity to the application. Doc-level authorization is designed to solve this problem
Document-level authorization model
• Instead of storing in HDFS Policy File:
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, \
  collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
• Store authorization tokens in each document
• Many more documents than collections; doesn’t scale to store document-level info in Policy File
• Can use Solr’s built-in filtering capabilities to restrict access
Document-level authorization model
• A configurable token field stores the authorization tokens
• The authorization tokens are Sentry roles, e.g. “ops_role”
[roles]
ops_role = collection = hbase_logs->action=Query
• The token field represents the roles that are allowed to view the document. To view a document, the querying user must belong to at least one role whose token is stored in the token field
• Can modify document permissions without restarting Solr
• Can modify role memberships without reindexing
Document-level authorization impl
• Intercepts the request via a SearchComponent
• SearchComponent adds an “fq” or FilterQuery
• Filter out all documents that don’t have “role1” or “role2” in authField
• Multiple “fq”s are intersected, so a malicious user can’t bypass the check by injecting their own fq
• Filters are cached, so the construction cost is paid only once
• Note: does not supersede index-level authorization
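The filter-query approach above can be sketched as a function that appends an authorization `fq` to the user's request parameters. This is an illustration, not Sentry's SearchComponent: `build_auth_filter` and `secure_params` are hypothetical names, and the filter is written in standard Solr query syntax, which may differ from the syntax the real component generates.

```python
def build_auth_filter(auth_field, roles):
    """Return an fq string matching documents tagged with at least one role."""
    if not roles:
        # Assumption for this sketch: a user without roles is rejected outright.
        raise ValueError("user has no roles")
    clauses = " OR ".join(f'"{role}"' for role in sorted(roles))
    return f"{auth_field}:({clauses})"


def secure_params(user_params, auth_field, roles):
    """Copy the user's parameters and append the authorization fq."""
    params = dict(user_params)
    params["fq"] = list(user_params.get("fq", [])) + [
        build_auth_filter(auth_field, roles)
    ]
    return params


# A user-supplied fq cannot widen the result set: every fq must match.
request = {"q": "*:*", "fq": ["authField:*"]}
secured = secure_params(request, "authField", {"role1", "role2"})
```

The user's own `fq` is kept as-is; the appended filter still has to match, so the permissive `authField:*` clause gains nothing.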
Document-level authorization config
• Configuration via solrconfig.xml.secure (per collection):
<!-- Set to true to enable document-level authorization -->
<bool name="enabled">false</bool>
<!-- Field where the auth tokens are stored in the document -->
<str name="sentryAuthField">sentry_auth</str>
<!-- Auth token defined to allow any role to access the document.
Uncomment to enable. -->
<!--<str name="allRolesToken">*</str>-->
• For backwards compatibility, not enabled by default
• No tokens = no access. To allow all users to access a document, use the allRolesToken. Useful for getting started
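The access rule described above fits in one small function. This is a sketch, not Sentry code: `can_view` is a hypothetical helper, and it assumes that a document carrying the all-roles token is visible to any user, a case the slides do not spell out for users with no roles at all.

```python
def can_view(doc_tokens, user_roles, all_roles_token=None):
    """Decide whether a user may see a document, given its authorization tokens."""
    tokens = set(doc_tokens)
    # A document tagged with the all-roles token is visible to everyone
    # (only when that token is configured).
    if all_roles_token is not None and all_roles_token in tokens:
        return True
    # Otherwise the user needs at least one role stored in the token field.
    # An empty token field therefore means no access.
    return bool(tokens & set(user_roles))
```

With `all_roles_token` left unset, a literal `*` in the token field is just another token and grants nothing by itself.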
Conclusion
• Cloudera Search
• Free Download
• Extensive documentation
• Send your questions and feedback to search-user@cloudera.org
• Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
• Simple management of Search
• Free Download
• QuickStart VM also available!
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
Download Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With LatestDownload Wondershare Filmora Crack [2025] With Latest
Download Wondershare Filmora Crack [2025] With Latest
tahirabibi60507
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 

Solr + Hadoop: Interactive Search for Hadoop

  • 1. Solr + Hadoop: Interactive Search for Hadoop Gregory Chanan (gchanan AT cloudera.com) OC Big Data Meetup 07/16/14
  • 2. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Security • Conclusion
  • 3. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Security • Conclusion
  • 4. Why Search? • Hadoop for everyone • Typical case: • Ingest data to storage engine (HDFS, HBase, etc) • Process data (MapReduce, Hive, Impala) • Experts know MapReduce • Savvy people know SQL • Everyone knows Search!
  • 5. Why Search? An Integrated Part of the Hadoop System One pool of data One security framework One set of system resources One management interface
  • 6. Benefits of Search • Improved Big Data ROI • An interactive experience without technical knowledge • Faster time to insight • Exploratory analysis, esp. unstructured data • Broad range of indexing options to accommodate needs • Cost efficiency • Single scalable platform; no incremental investment • No need for separate systems, storage
  • 7. What is Cloudera Search? • Full-text, interactive search with faceted navigation • Apache Solr integrated with CDH • Established, mature search with vibrant community • In production environments for years • Open Source • 100% Apache, 100% Solr • Standard Solr APIs • Batch, near real-time, and on-demand indexing • Available for CDH4 and CDH5
  • 8. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Security • Conclusion
  • 9. Apache Hadoop • Apache HDFS • Distributed file system • High reliability • High throughput • Apache MapReduce • Parallel, distributed programming model • Allows processing of large datasets • Fault tolerant
  • 10. Apache Lucene • Full text search library • Indexing • Querying • Traditional inverted index • Batch and Incremental indexing • We are using version 4.4 in current release
  • 11. Apache Solr • Search service built using Lucene • Ships with Lucene (same TLP at Apache) • Provides XML/HTTP/JSON/Python/Ruby/… APIs • Indexing • Query • Administrative interface • Also rich web admin GUI via HTTP
  • 12. Apache SolrCloud • Provides distributed Search capability • Part of Solr (not a separate library/codebase) • Shards – provide scalability • partition index for size • replicate for query performance • Uses ZooKeeper for coordination • No split-brain issues • Simplifies operations
  • 13. SolrCloud Architecture • Updates automatically sent to the correct shard • Replicas handle queries, forward updates to the leader • Leader indexes the document for the shard, and forwards the index notation to itself and any replicas.
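The routing described on the SolrCloud slides can be sketched roughly as follows. This is a minimal illustration, not Solr's code: SolrCloud's own router assigns hash ranges to shards, whereas the MD5-and-modulo step, the function names, and the "replica 0 is the leader" convention here are all assumptions made for the example.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def route_update(doc_id: str, num_shards: int, replicas_per_shard: int):
    """Return the (shard, replica) pair acting as leader, plus its followers."""
    shard = shard_for(doc_id, num_shards)
    leader = (shard, 0)  # hypothetical convention: replica 0 leads the shard
    followers = [(shard, r) for r in range(1, replicas_per_shard)]
    return leader, followers


leader, followers = route_update("doc-1", num_shards=4, replicas_per_shard=3)
```

Whichever node receives the update, the same id always resolves to the same shard, which is what lets any replica forward the update to the right leader.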
  • 15. Distributed Search on Hadoop [Diagram: Hue UI, a custom UI, and a custom app send queries to a SolrCloud cluster of Solr servers; Flume, MR, HDFS, and HBase in the Hadoop cluster feed the index; ZK is also shown]
  • 16. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Indexing • ETL - morphlines • Querying • Security • Conclusion
  • 17. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch • MapReduceIndexerTool • HBaseBatchIndexer
  • 18. Near Real Time Indexing with Flume Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest [Diagram labels: Log File, Other Log File, Flume Agent, Indexer, HDFS]
  • 19. Apache Flume - MorphlineSolrSink • A Flume Source… • Receives/gathers events • A Flume Channel… • Carries the event – MemoryChannel or reliable FileChannel • A Flume Sink… • Sends the events on to the next location • Flume MorphlineSolrSink • Integrates Cloudera Morphlines library • ETL, more on that in a bit • Does batching • Results sent to Solr for indexing
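The source, channel, and sink roles on the Flume slide can be mimicked in a few lines. This is a toy sketch only: it does not use Flume's API, the in-process queue stands in for a MemoryChannel, and the `delivered` list stands in for Solr. All names are hypothetical.

```python
import queue


def source(lines, channel):
    """Source: gathers events and puts them on the channel."""
    for line in lines:
        channel.put({"body": line})


def batching_sink(channel, batch_size, deliver):
    """Sink: drains the channel and hands events on in batches."""
    batch = []
    while True:
        try:
            event = channel.get_nowait()
        except queue.Empty:
            break
        batch.append(event)
        if len(batch) == batch_size:
            deliver(batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        deliver(batch)


channel = queue.Queue()  # stands in for a MemoryChannel
delivered = []           # stands in for the Solr indexing endpoint
source(["a", "b", "c", "d", "e"], channel)
batching_sink(channel, 2, delivered.append)
```

Batching trades a little latency for fewer round trips to the indexer, which is presumably why the sink does it.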
  • 20. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch • MapReduceIndexerTool • HBaseBatchIndexer
  • 21. Near Real Time Indexing of Apache HBase • HBase + Search = Big Data data management • HBase: planet-sized tabular data, immediate access & updates • Search: fast & flexible information discovery [Diagram labels: HDFS, HBase, interactive load, replication, HBase Indexer(s), Solr servers]
  • 22. Lily HBase Indexer • Collaboration between NGData & Cloudera • NGData are creators of the Lily data management platform • Lily HBase Indexer • Service which acts as an HBase replication listener • HBase replication features, such as filtering, supported • Replication updates trigger indexing of updates (rows) • Integrates Cloudera Morphlines library for ETL of rows • AL2 licensed on github https://github.com/ngdata
  • 23. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch • MapReduceIndexerTool • HBaseBatchIndexer
  • 24. Scalable Batch Indexing Solr and MapReduce • Flexible, scalable batch indexing • Start serving new indices with no downtime • On-demand indexing, cost-efficient re-indexing [Diagram labels: HDFS, Files, Indexer, Index shard, Solr server]
  • 25. MapReduce Indexer MapReduce Job with two parts 1) Scan HDFS for files to be indexed • Much like Unix “find” – see HADOOP-8989 • Output is NLineInputFormat’ed file 2) Mapper/Reducer indexing step • Mapper extracts content via Cloudera Morphlines • Reducer indexes documents via embedded Solr server • Originally based on SOLR-1301 • Many modifications to enable linear scalability
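The two-part job can be pictured with a small local stand-in. This sketch runs on an ordinary directory rather than HDFS and does not use MapReduce, Morphlines, or an embedded Solr server; the function names and the CRC32 partitioning are assumptions chosen only to show the shape of the steps.

```python
import os
import tempfile
import zlib
from collections import defaultdict


def list_files(root):
    """Step 1: walk the tree like Unix find, yielding one path per entry."""
    paths = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def map_extract(path):
    """Step 2a (mapper role): turn a file into a document."""
    with open(path, encoding="utf-8") as handle:
        return {"id": path, "text": handle.read()}


def reduce_index(docs, num_shards):
    """Step 2b (reducer role): group document ids into per-shard indexes."""
    shards = defaultdict(list)
    for doc in docs:
        shard = zlib.crc32(doc["id"].encode("utf-8")) % num_shards
        shards[shard].append(doc["id"])
    return dict(shards)


def run_job(root, num_shards):
    docs = [map_extract(p) for p in list_files(root)]
    return reduce_index(docs, num_shards)


with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt", "c.txt"):
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write("hello " + name)
    file_list = list_files(root)
    shard_map = run_job(root, num_shards=2)
```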
  • 26. MapReduce Indexer “golive” • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster • No downtime for users • No NRT expense • Linear scale out to the size of your MR cluster
  • 27. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch • MapReduceIndexerTool • HBaseBatchIndexer
  • 28. HBase + MapReduce • Run MapReduce job over HBase tables • Same architecture as running over HDFS • Similar to HBase’s CopyTable • Support for go-live
  • 29. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Indexing • ETL - morphlines • Querying • Security • Conclusion
  • 30. Cloudera Morphlines • Open Source framework for simple ETL • Simplify ETL • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages) • Configuration over coding • Standardize ETL • Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK) • It’s a Java library • AL2 licensed on github https://github.com/kite-sdk
  • 31. Cloudera Morphlines Architecture • Input: logs, tweets, social media, HTML, images, PDF, text… anything you want to index • Morphline library embedded in Flume, MR Indexer, HBase Indexer, etc., or your own application • Output: SolrCloud (Solr servers) • Morphlines can be embedded in any application
  • 32. Extraction and Mapping • Modeled after Unix pipelines (records instead of lines) • Simple and flexible data transformation • Reusable across multiple index workloads • Over time, extend and re-use across platform workloads [Diagram: syslog → Flume agent → Solr sink → Solr; inside the sink, the morphline library runs the commands readLine → grok → loadSolr, turning an event into records and finally a document]
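The "Unix pipeline, but with records" idea can be sketched as a chain of functions that each take one record and yield zero or more records. This is not the Morphlines Java API; `read_line`, `to_upper`, and `run_pipeline` are hypothetical names used only for illustration.

```python
def read_line(record):
    """Split the raw event body into one record per line."""
    for line in record["body"].splitlines():
        yield {"message": line}


def to_upper(record):
    """Hypothetical transform: upper-case the message field."""
    yield {**record, "message": record["message"].upper()}


def run_pipeline(commands, record):
    """Feed a record through each command in order, like a Unix pipe."""
    records = [record]
    for command in commands:
        records = [out for rec in records for out in command(rec)]
    return records


result = run_pipeline([read_line, to_upper], {"body": "one\ntwo"})
```

Because every command has the same record-in, records-out shape, the same chain can be reused by any ingest path that can produce a record.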
  • 33. Morphline Example – syslog with grok morphlines : [ { id : morphline1 importCommands : ["com.cloudera.**", "org.apache.solr.**"] commands : [ { readLine {} } { grok { dictionaryFiles : [/tmp/grok-dictionaries] expressions : { message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}""" } } } { loadSolr {} } ] } ] Example Input <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22 Output Record syslog_pri:164 syslog_timestamp:Feb 4 10:46:14 syslog_hostname:syslog syslog_program:sshd syslog_pid:607 syslog_message:listening on 0.0.0.0 port 22
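To see what the grok expression extracts, here is an approximate plain-regex equivalent applied to the slide's example input. The sub-patterns are simplified stand-ins for the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), not their real definitions, and the optional pid group matches literal square brackets.

```python
import re

# Simplified stand-ins for the grok dictionary patterns (assumptions).
SYSLOG = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)

line = "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
record = SYSLOG.match(line).groupdict()
```

Each named group becomes one field of the output record, which is the same mapping the `%{PATTERN:field}` syntax expresses in the morphline.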
  • 34. Current Command Library • Integrate with and load into Apache Solr • Flexible log file analysis • Single-line record, multi-line records, CSV files • Regex based pattern matching and extraction • Integration with Avro • Integration with Apache Hadoop Sequence Files • Integration with SolrCell and all Apache Tika parsers • Auto-detection of MIME types from binary data using Apache Tika
  • 35. Current Command Library (cont) • Scripting support for dynamic Java code • Operations on fields for assignment and comparison • Operations on fields with list and set semantics • if-then-else conditionals • A small rules engine (tryRules) • String and timestamp conversions • slf4j logging • Yammer metrics and counters • Decompression and unpacking of arbitrarily nested container file formats • Etc…
  • 36. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Indexing • ETL - morphlines • Querying • Security • Conclusion
  • 37. Querying • Built-in Solr web UI • Write your own • Hue
  • 38. Simple, Customizable Search Interface Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
  • 39. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Security • Conclusion
  • 40. Security • Upstream Solr doesn’t deal with security • Cloudera Search supports Kerberos authentication • Similar to Oozie / WebHDFS • Collection-Level Authorization via Apache Sentry • Document-Level Authorization via Apache Sentry (new in CDH5.1)
  • 41. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Indexing • ETL - morphlines • Querying • Security • Collection-Level Authorization • Document-Level Authorization • Conclusion
  • 42. Collection-Level Authorization • Sentry supports role-based granting of privileges • each role can be granted QUERY, UPDATE, and/or administrative privileges on an index (collection) • Privileges stored in a “policy file” on HDFS
  • 43. Policy File [groups] # Assigns each Hadoop group to its set of roles dev_ops = engineer_role, ops_role [roles] # Assigns each role to its set of privileges engineer_role = collection = source_code->action=Query, collection = source_code->action=Update ops_role = collection = hbase_logs->action=Query
  • 44. Integrating Sentry and Solr • Solr Request Handlers: • Specified per collection in solrconfig.xml: • Request to: http://localhost:8983/solr/collection1/select Is dispatched to an instance of solr.SearchHandler
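How the group → role → privilege mapping in the policy file resolves can be sketched with a small parser. This is an illustration of the file's logic under the assumption that it can be read as INI-style text; it is not Sentry's implementation, and `parse_privilege`, `load_policy`, and `is_allowed` are hypothetical helpers.

```python
import configparser

POLICY = """
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role

[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""


def parse_privilege(text):
    """Turn 'collection = X->action=Y' into the pair (X, Y)."""
    collection_part, action_part = text.split("->")
    collection = collection_part.split("=", 1)[1].strip()
    action = action_part.split("=", 1)[1].strip()
    return collection, action


def load_policy(text):
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    groups = {group: [name.strip() for name in value.split(",")]
              for group, value in parser["groups"].items()}
    roles = {role: {parse_privilege(p) for p in value.split(",")}
             for role, value in parser["roles"].items()}
    return groups, roles


def is_allowed(groups, roles, group, collection, action):
    """True if any role of the group holds the (collection, action) privilege."""
    return any((collection, action) in roles.get(role, set())
               for role in groups.get(group, []))


groups, roles = load_policy(POLICY)
```

With the slide's file, the `dev_ops` group can query and update `source_code` but can only query `hbase_logs`.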
  • 45. Sentry Request Handlers • Sentry ships with its own version of solrconfig.xml with secure handlers, called solrconfig.xml.secure • Use a SearchComponent to implement the checking • Update Requests handled in a similar way
  • 46. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component Deep Dive • Indexing • ETL - morphlines • Querying • Security • Collection-Level Authorization • Document-Level Authorization • Conclusion
  • 47. Document-level authorization Motivation • Index-level authorization useful when access control requirements for documents are homogeneous • Security requirements may require restricting access to a subset of documents
  • 48. Document-level authorization Motivation • Consider “Confidential” and “Secret” documents. How to store with only index-level authorization? • Pushes complexity to application. Doc-level authorization designed to solve this problem
  • 49. Document-level authorization model • Instead of storing in HDFS Policy File: [groups] # Assigns each Hadoop group to its set of roles dev_ops = engineer_role, ops_role [roles] # Assigns each role to its set of privileges engineer_role = collection = source_code->action=Query, collection = source_code->action=Update ops_role = collection = hbase_logs->action=Query • Store authorization tokens in each document • Many more documents than collections; doesn’t scale to store document-level info in Policy File • Can use Solr’s built-in filtering capabilities to restrict access
  • 50. Document-level authorization model • A configurable token field stores the authorization tokens • The authorization tokens are Sentry roles, e.g. “ops_role” [roles] ops_role = collection = hbase_logs->action=Query • The tokens represent the roles that are allowed to view the document. To view a document, the querying user must belong to at least one role whose token is stored in the token field • Can modify document permissions without restarting Solr • Can modify role memberships without reindexing
  • 51. Document-level authorization impl • Intercepts the request via a SearchComponent • SearchComponent adds an “fq” or FilterQuery • Filter out all documents that don’t have “role1” or “role2” in authField • Multiple “fq”s work as intersection, so a malicious user can’t bypass the check by injecting their own fq • Filters are cached, so the construction cost is paid only once • Note: does not supersede index-level authorization
  • 52. Document-level authorization config • Configuration via solrconfig.xml.secure (per collection): <!-- Set to true to enable document-level authorization --> <bool name="enabled">false</bool> <!-- Field where the auth tokens are stored in the document --> <str name="sentryAuthField">sentry_auth</str> <!-- Auth token defined to allow any role to access the document. Uncomment to enable. --> <!--<str name="allRolesToken">*</str>--> • For backwards compatibility, not enabled by default • No tokens = no access. To allow all users to access a document, use the allRolesToken. Useful for getting started
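The fq-based filtering can be sketched in two parts: appending a role filter to the request parameters, and showing that several filters intersect. The exact filter string Sentry builds is not shown in the slides, so the `field:("role1" OR "role2")` form, the `sentry_auth` field name, and the helper names are assumptions; the in-memory predicates only imitate Solr's behavior.

```python
from urllib.parse import parse_qs, urlencode


def add_auth_filter(params, roles, auth_field="sentry_auth"):
    """Append a filter query restricting results to the user's roles."""
    clause = " OR ".join('"%s"' % role for role in roles)
    return params + [("fq", "%s:(%s)" % (auth_field, clause))]


def matches(doc, fq_predicates):
    """A document is returned only if every fq accepts it (intersection)."""
    return all(predicate(doc) for predicate in fq_predicates)


# The user tries to widen access by supplying a match-everything fq.
user_params = [("q", "error"), ("fq", "sentry_auth:*")]
secured = add_auth_filter(user_params, ["role1", "role2"])
query_string = urlencode(secured)

docs = [
    {"id": 1, "sentry_auth": ["role1"]},
    {"id": 2, "sentry_auth": ["admin_role"]},
]
user_fq = lambda doc: True  # the injected fq accepts everything
auth_fq = lambda doc: bool({"role1", "role2"} & set(doc["sentry_auth"]))
visible = [doc["id"] for doc in docs if matches(doc, [user_fq, auth_fq])]
```

The injected filter cannot add documents back, because a document must pass every fq, including the one added on the server side.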
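The access rules stated on the config slide ("no tokens = no access", plus the optional all-roles token) can be written down as one small function. This is a reading of the slide, not Sentry's code: `can_view` and its parameters are hypothetical, and when the feature is disabled the collection-level check would still apply elsewhere.

```python
def can_view(doc_tokens, user_roles, enabled=False, all_roles_token=None):
    """Decide whether a user may see a document under the slide's rules."""
    if not enabled:
        return True  # feature off: no document-level check is made
    if all_roles_token is not None and all_roles_token in doc_tokens:
        return True  # document is explicitly open to every role
    # Otherwise the user needs at least one role stored on the document.
    return bool(set(doc_tokens) & set(user_roles))
```

For example, with the feature enabled, a document carrying no tokens is hidden from everyone, while a document tagged with the configured all-roles token is visible to any user.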
  • 53. Conclusion • Cloudera Search • Free Download • Extensive documentation • Send your questions and feedback to search- [email protected] • Take the Search online training • Cloudera Manager Standard (i.e. the free version) • Simple management of Search • Free Download • QuickStart VM also available!

Editor's Notes

  • #6: This is the “Big Picture”