
E6893 Big Data Analytics:

Speech Analytics Software Library

Team Members: Kyle White

December 9, 2014
Motivation

• A need exists in the speech processing and recognition community to adapt or create
tools for coping with increasing amounts of available data.

• Example: A speech recognition project can require on the order of 50 GB or more of
training audio for acoustic modeling before popular approaches, such as decoding speech
with neural network classifiers, show gains. High-performance language models likewise
need in-domain transcribed text on the order of gigabytes. This scenario calls for speech
processing solutions that can store, access, and process data at this volume.

Software Library Goals

• The goal of the speech analytics library is to assemble, adapt, or develop capabilities to
condition, decode, and understand speech content in an archive of speech-related data.

• The initial goal of the library, for this course project, is to utilize an open-source language
modeling toolkit in combination with the MapReduce paradigm to enable distributed
training of n-gram language models.

Library Overview

• The Speech Analytics Library is intended to be a collection of tools; the single tool
currently in the library performs distributed training of n-gram language models for speech
recognition. In the first phase, Hadoop is used to count n-grams across the input
transcriptions. The intermediate output is an n-gram count document in a simple,
RSA-internal format. Phase two compiles the intermediate n-gram count documents and
uses berkeleylm to create an n-gram language model in the commonly accepted ARPA format.

Language Model Generation Tool:

Transcripts → Phase I → n-gram count documents → Phase II → ARPA-format language model

Language Model Generation: Phase I

• Phase one processing centers on the Hadoop distributed computing and Lucene text
analysis libraries.
• The Hadoop library is used to build a Java MapReduce application that takes textual
transcriptions as input and produces n-gram count documents as output. Custom
Client, Mapper, Reducer, and Partitioner classes are used in this step.

• The Lucene core analysis/tokenization capabilities are used by the custom Mapper
classes to generate n-grams from the transcript strings. The sequence of Lucene
TokenStream filters, and a sketch of how a Mapper might apply it, is shown below.

WhitespaceTokenizer → LowerCaseFilter → ShingleFilter
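
To make Phase I concrete, the minimal sketch below shows how a Mapper could run this exact
filter chain over each transcript line and emit (n-gram, 1) pairs, with a companion Reducer
that sums the counts. It is an illustration only: the class names (NgramCountSketch,
NgramMapper, NgramSumReducer), the shingle orders (2-3), and the Lucene 5/6-style
constructors are assumptions, not the project's actual Client, Mapper, Reducer, and
Partitioner classes.

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NgramCountSketch {

    // Mapper: one transcript line in, one (n-gram, 1) pair out per token/shingle.
    public static class NgramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // WhitespaceTokenizer -> LowerCaseFilter -> ShingleFilter (orders 2-3 here;
            // ShingleFilter also passes unigrams through by default).
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            tokenizer.setReader(new StringReader(value.toString()));
            TokenStream stream = new ShingleFilter(new LowerCaseFilter(tokenizer), 2, 3);
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                ngram.set(term.toString());
                context.write(ngram, ONE);
            }
            stream.end();
            stream.close();
        }
    }

    // Reducer: sum the 1s for each n-gram to produce the n-gram count document.
    public static class NgramSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}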

Language Model Generation: Phase II

• Phase two processing centers on the berkeleylm library to generate a language model. At
this time the second phase simply writes the language model to an ARPA-format file.
Beyond that, berkeleylm excels at efficiently storing and querying large language models
at runtime inside a speech recognizer, and it was chosen with this shared focus on large
data sets and those possible future uses in mind.
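
As a pointer toward that future use, the short sketch below loads an ARPA file with
berkeleylm and queries a trigram log-probability. It is a minimal sketch under assumptions:
the file name lm.arpa and the query words are hypothetical, and the LmReaders /
NgramLanguageModel calls are the ones berkeleylm documents for reading and scoring ARPA
models, not code from this project.

import java.util.Arrays;

import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm;
import edu.berkeley.nlp.lm.io.LmReaders;

public class QueryArpaLmSketch {
    public static void main(String[] args) {
        // Load the ARPA-format model into berkeleylm's compact array-encoded representation.
        ArrayEncodedProbBackoffLm<String> lm =
                LmReaders.readArrayEncodedLmFromArpa("lm.arpa", false /* no compression */);

        // Log-probability of the final word given the preceding context words.
        float logProb = lm.getLogProb(Arrays.asList("big", "data", "analytics"));
        System.out.println("log p(analytics | big data) = " + logProb);
    }
}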

(Figure: example input n-gram count document and example output ARPA-format language model)
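
For readers unfamiliar with the ARPA layout, the snippet below is a generic illustration of
how such a file is organized (n-gram counts per order, then one log-probability / n-gram /
backoff-weight line per entry); the words and numbers are made up and are not the actual
example output from this project.

\data\
ngram 1=4
ngram 2=3

\1-grams:
-1.20  <s>    -0.40
-0.85  big    -0.30
-0.85  data   -0.30
-1.50  </s>

\2-grams:
-0.45  <s> big
-0.20  big data
-0.60  data </s>

\end\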

Library Dependencies

• The library utilizes the following existing tools to fulfill its goals.

Capability                            Leveraged Tool
HDFS                                  Apache Hadoop
MapReduce                             Apache Hadoop
Text Segmentation and Tokenization    Apache Lucene
Language Modeling                     berkeleylm

• Hadoop – Chosen for its solid distributed file system and MapReduce implementation, and
for its strong fit with batch processing an audio archive of speech-related data.
• Lucene – Top-level Apache project with a proven track record in text analysis and
tokenization.
• berkeleylm – Library written for language modeling with large data volumes; meant for
estimating, storing, and accessing large n-gram language models.

Usage

• The library currently runs as a two-phase process, where each phase must be launched
from the command line. Simplification to a single phase is planned. The library is built
via Maven, but a pre-built JAR is included in the repository.
• Example scripts to run the two phases are included in the repository /scripts folder.

Conclusion

• The work completed so far enables distributed training of n-gram language models and
opens the door to further work that better uses berkeleylm as a runtime language model
manager for operations that depend on large language models.

E6893 Big Data Analytics – Final Project Presentation © 2014 CY Lin, Columbia University
