The document discusses the development of an automated machine learning solution to identify duplicate questions in Q&A communities, addressing challenges such as semantic variation and high dimensionality. It outlines the importance of data preprocessing, feature engineering, and various machine learning models to improve detection accuracy. The expected outcomes include enhanced user experience and content quality in Q&A platforms, with future work focusing on advanced neural architectures and real-time processing capabilities.


Identifying Duplicate Questions in Q&A Communities Using Machine Learning

Submitted by:

Ishak Gauri (QID: 22030522)

Rahul Kumar
Subhash Yadav (QID: 22030574)
Aayush Kumar Jha
Mukul Sharma (QID: 23030191)

Department of Computer Science Engineering, Quantum University, Roorkee

Introduction to Q&A Communities
Q&A communities like Quora and Stack Exchange are invaluable resources. They facilitate knowledge sharing and problem-solving among users.

Duplicate questions degrade content quality. They create fragmented information. Manual detection is impractical due to the immense scale of these platforms.

Our primary objective is to develop an automated solution. This system will leverage machine learning. It aims to identify
duplicate question pairs efficiently.
The Problem: Duplicate Questions
Fragmented Information
Duplicate questions lead to scattered answers. This makes finding comprehensive information difficult.

Poor User Experience
Users encounter redundant content. This can frustrate and deter engagement within the community.

Scalability Challenges
Manually identifying duplicates is impractical at scale. Q&A platforms host millions of questions, with more arriving daily.

Addressing this issue is crucial. It ensures a clean and effective knowledge base. An automated solution is essential.
Challenges in Duplicate Detection
Semantic Variation
Questions can have the same meaning. However, they use vastly different phrasing and vocabulary. This poses a significant challenge for detection.

High Dimensionality
Processing natural language text creates high-dimensional data. This requires advanced feature engineering techniques.

Computational Load
Large datasets demand efficient algorithms. Real-time detection
systems need optimized computational power.

These challenges necessitate sophisticated machine learning models. We need methods capable of understanding context and meaning, not just keywords.
Quora Question Pairs Dataset
Feature              Description                          Example
Question 1 Text      First question in the pair           "What is the meaning of life?"
Question 2 Text      Second question in the pair          "What's the purpose of existence?"
Is_Duplicate Label   Binary: 1 for duplicate, 0 for not   1 (for the example above)

This dataset is crucial for training. It provides a real-world benchmark. Its roughly 400,000 labeled pairs support robust model training.
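
As a minimal sketch of how such pairs might be read into memory, the snippet below parses a CSV with the standard library. The column names (question1, question2, is_duplicate) and the two sample rows are assumptions made for illustration; the slides do not specify the file layout.

```python
import csv
import io
from typing import List, NamedTuple


class QuestionPair(NamedTuple):
    question1: str
    question2: str
    is_duplicate: int  # 1 = duplicate, 0 = not duplicate


def load_pairs(csv_file) -> List[QuestionPair]:
    """Read question pairs from an open CSV file object (assumed column names)."""
    reader = csv.DictReader(csv_file)
    return [
        QuestionPair(row["question1"], row["question2"], int(row["is_duplicate"]))
        for row in reader
    ]


# Tiny in-memory sample standing in for the real file (hypothetical rows).
SAMPLE_CSV = """question1,question2,is_duplicate
What is the meaning of life?,What's the purpose of existence?,1
How do I learn Python?,What is the capital of France?,0
"""

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
```

Passing an open file handle instead of io.StringIO would work the same way, since load_pairs only needs an iterable of lines.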
Data Preprocessing Pipeline
Text Cleaning
Remove special characters, punctuation, and normalize casing.

Tokenization
Break text into individual words or sub-word units.

Stop Word Removal

Eliminate common words (e.g., "the," "a") that offer little semantic value.

Stemming/Lemmatization
Reduce words to their root form. This normalizes variations (e.g., "running" to "run").

This preprocessing ensures data quality. It prepares the text for effective feature extraction. Clean data improves model
accuracy.
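
The four steps above can be sketched in a few lines of Python. This is an illustrative toy, not the project's actual pipeline: the stop-word set and the suffix-stripping stemmer are assumptions chosen to keep the example self-contained, and a real system would more likely use an established stemmer or lemmatizer.

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "what", "how", "do", "i"}

# Naive suffix list for a toy stemmer; a stand-in for real stemming or lemmatization.
SUFFIXES = ("ing", "ed", "es", "s")


def clean(text: str) -> str:
    """Lowercase and replace anything that is not a letter, digit, or whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text: str) -> list:
    """Split cleaned text on whitespace."""
    return text.split()


def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stemmed = token[: -len(suffix)]
            # After "ing"/"ed", collapse a doubled final consonant: "runn" -> "run".
            if (
                suffix in ("ing", "ed")
                and len(stemmed) >= 2
                and stemmed[-1] == stemmed[-2]
                and stemmed[-1] not in "aeiou"
            ):
                stemmed = stemmed[:-1]
            return stemmed
    return token


def preprocess(text: str) -> list:
    """Run cleaning, tokenization, stop-word removal, and stemming in order."""
    return [stem(t) for t in remove_stop_words(tokenize(clean(text)))]


tokens = preprocess("What is the best way to start running?")
```

The toy stemmer over-strips some words (for example, it shortens "falling" to "fal"), which is one reason a dictionary-based lemmatizer is often preferred in practice.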
Feature Engineering Approaches
Bag-of-Words (BoW)
Represents text as a bag of word occurrences. Ignores grammar and word order.

TF-IDF
Weighs words by frequency. It also considers their inverse document frequency. This highlights important terms.

Syntactic Features
Includes features like sentence length, shared n-grams, and edit distance. This captures structural similarities.

Word Embeddings
Dense vector representations of words. They capture semantic relationships (e.g., Word2Vec, GloVe).

Effective feature engineering is vital. It translates raw text into numerical representations. These features are
understandable by machine learning models.
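
To make a few of these features concrete, here is a standard-library sketch of token overlap, edit distance, and TF-IDF cosine similarity for a question pair. The function names, the smoothed IDF formula, and the sample token lists are illustrative assumptions rather than the project's implementation.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b) -> float:
    """Shared-token ratio: |A & B| / |A | B| over unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF so terms present in every document keep a nonzero weight.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical preprocessed questions.
q1 = ["learn", "python", "fast"]
q2 = ["learn", "python", "quickly"]
q3 = ["capital", "france"]
v1, v2, v3 = tfidf_vectors([q1, q2, q3])

features = {
    "jaccard": jaccard(q1, q2),
    "len_diff": abs(len(q1) - len(q2)),
    "edit_distance": edit_distance(" ".join(q1), " ".join(q2)),
    "tfidf_cosine": cosine(v1, v2),
}
```

Each pair ends up as a small numeric vector, which is the form the classifiers in the next section expect. Word embeddings are omitted here because they require pretrained vectors.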
Machine Learning Models
Deep Learning (Siamese Networks)
Advanced neural networks for learning similarities. They compare two inputs
using shared weights.

Ensemble Methods (XGBoost, Random Forest)
Combine multiple models for improved performance. Boosting (XGBoost) mainly reduces bias, while bagging (Random Forest) mainly reduces variance.

Traditional ML (SVM, Logistic Regression)

Baseline models for classification. They provide a
foundational understanding of data separability.

We will evaluate different model complexities. This ranges from classic algorithms to deep learning. Each model offers
unique strengths for text similarity.
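
As an illustration of the simplest tier, the sketch below trains a logistic regression baseline by batch gradient descent on per-pair similarity features. The feature values, learning rate, and epoch count are hypothetical choices for a toy example. In practice a library implementation would likely be used instead.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on mean log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            # Prediction error for this example: p(duplicate) - true label.
            error = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias) - label
            for k, x in enumerate(features):
                grad_w[k] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(weights, bias, features, threshold=0.5) -> int:
    """Return 1 (duplicate) when the predicted probability reaches the threshold."""
    score = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
    return 1 if score >= threshold else 0


# Hypothetical pair features: [token overlap, TF-IDF cosine similarity].
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(X, y)
train_predictions = [predict(weights, bias, features) for features in X]
```

A linear baseline like this only sees the hand-built similarity features, so it cannot recover paraphrases that share no surface tokens. That gap is the motivation for the embedding-based and Siamese models higher in the list.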
Evaluation Metrics
F1-Score
Harmonic mean of precision and recall. It balances false positives and false negatives.

Accuracy
Proportion of correctly classified instances. Useful for balanced datasets.

Precision
Proportion of true positives among all positive predictions. Minimizes false alarms.

Recall
Proportion of true positives among all actual positives. Minimizes missed duplicates.

These metrics collectively assess model performance. They provide a comprehensive view of how well the model identifies duplicate questions.
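
The four metrics follow directly from the confusion counts. The sketch below computes them for a small set of made-up labels, where 1 means duplicate. The helper name and the sample labels are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred) -> dict:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = duplicate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 true duplicates, 6 non-duplicates.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
```

Here accuracy is 0.7 while recall is only 0.5, which shows why accuracy alone can flatter a model when non-duplicates outnumber duplicates.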
Expected Outcomes and Future Work
Automated Duplicate Detection
A robust system capable of automatically identifying duplicate
questions.

Improved User Experience

Cleaner Q&A communities with less redundancy. This enhances
information discovery.

Enhanced Content Quality

Consolidated information. This reduces fragmentation and
improves overall platform value.

Future work includes exploring more advanced neural architectures. We will also investigate real-time processing capabilities for live platforms. Cross-lingual duplicate detection is another promising avenue.
