Data Analysis Summary
Precision = the fraction of the retrieved docs that is relevant to the information need (selectivity).
Recall = the fraction of the relevant docs in the collection that is retrieved (sensitivity).
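A minimal sketch (not from the lecture notes) of how these two measures are computed for one query, given a set of retrieved doc IDs and the set of relevant doc IDs; the IDs are made up:

def precision_recall(retrieved: set, relevant: set):
    hits = retrieved & relevant                              # relevant docs that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned docs are relevant; 6 docs are relevant in total.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))   # (0.75, 0.5)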
Boolean retrieval -> basic model, queries with AND, OR, NOT and parentheses (library catalogues).
Sparse matrix approach -> inverted index on docIDs: Hash Table (no range queries) or B-Tree for the dictionary, array/linked list for the postings. Q = t1 AND t2 -> locate postings lists p1 and p2 for t1 and t2, calc. the intersection of p1 and p2 by list merging (see the sketch below) -> √n skip pointers, with n the length of the postings list.
Phrase Q -> positional index. Wildcard Q -> permuterm index.
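Sketch of AND-query evaluation by merging two sorted postings lists (docIDs ascending), as referenced above; this is the plain linear merge without skip pointers, and the postings are made-up examples:

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                      # advance the list with the smaller docID
        else:
            j += 1
    return answer

# Q = t1 AND t2, with the postings of t1 and t2:
print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 21]))   # [3, 8]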
HC2
MapReduce: Map and Reduce workers do the processing in parallel (impl. Hadoop). Data are <k, v> pairs. A map worker scans its own input once and does one uniform calc. on each k-v pair; its output is a set of >= 0 k-v pairs, which is first grouped on key and then given to a Reduce worker. The combined output of all Reduce workers is the result of the calc. Map always works on one <k, v> tuple; Reduce always works on one <k, [v1, v2, ..., vn]> tuple.
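A toy single-process imitation (not Hadoop code) of this Map -> group-by-key -> Reduce flow, using the classic word-count example with made-up documents:

from collections import defaultdict

def map_fn(key, value):
    # key = doc ID, value = doc text; emit >= 0 (word, 1) pairs
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key = word, values = [1, 1, ...]; emit the total count
    return (key, sum(values))

docs = {1: "to be or not to be", 2: "to do is to be"}

# Map phase: each <k, v> tuple is processed independently.
mapped = [pair for k, v in docs.items() for pair in map_fn(k, v)]

# Shuffle: group the intermediate pairs on key.
grouped = defaultdict(list)
for k, v in mapped:
    grouped[k].append(v)

# Reduce phase: each <k, [v1, ..., vn]> tuple is processed independently.
print(dict(reduce_fn(k, vs) for k, vs in grouped.items()))
# {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}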
HC3
scoring function: s: <q, d> -> v, with q(uery), d(ocument), v ∈ [0, 1] or v ∈ R+; it expresses the quality of the match between q and d and enables us to calc. the top-k.
term frequency: higher score if t occurs more often in d; should contribute to the score function.
inverse document frequency: measure of rareness. Define df_t = number of docs containing t, N = total number of docs.
idf_t = log(N / df_t). weight(t, d) = tf_t,d × idf_t. idf does not change the ranking for one-term queries, and summing raw weights favours long docs;
solution = vector space model.
We look for the document vector with the smallest angle to the query vector; this is a generalization of tf-idf scoring.
x · y = 0 means the two vectors mismatch on all terms.
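A small sketch of tf-idf weighting plus cosine scoring in the vector space model; the corpus, query and helper names are made up for illustration:

import math
from collections import Counter

docs = ["data analysis and retrieval", "boolean retrieval model", "data mining"]
N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if df[t] > 0}

def cosine(x, y):
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())   # 0 if the vectors mismatch on all terms
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

query = tfidf_vector("data retrieval".split())
doc_vectors = [tfidf_vector(t) for t in tokenized]
# Rank docs by decreasing cosine = smallest angle with the query vector.
print(sorted(range(N), key=lambda i: cosine(query, doc_vectors[i]), reverse=True))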
Frequent item set mining: for an association rule X → Y, define:
- the support as s(X ∪ Y)
- the confidence as s(X ∪ Y) / s(X).
An item set is frequent if its support is bigger than a user-specified minimum support threshold. The Apriori property: a set is a candidate frequent set only if all its subsets are frequent, because:
- If X is frequent, then all its subsets are also frequent.
- If X has a subset that is not frequent, then X itself cannot be frequent.
naïve complexity = O(2^m); Apriori's worst-case complexity is still O(2^m) (when all item sets are frequent), but pruning keeps it far smaller in practice.
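A compact sketch of Apriori's level-wise candidate generation, using the property above (only sets whose subsets are all frequent become candidates); the transactions and minsup are toy values:

from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    support = lambda s: sum(s <= t for t in transactions)
    # Level 1: frequent single items.
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # Candidates: (k+1)-sets built from frequent k-sets ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # ... pruned unless every k-subset is frequent (Apriori property) ...
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # ... then checked against the minimum support on the data.
        frequent.append({c for c in candidates if support(c) >= minsup})
        k += 1
    return [s for level in frequent for s in level]

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], minsup=2))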
HC5
PageRank = using the link structure to define the importance of a web site:
- When many sites refer to you, you are important.
- When important sites refer to you, you are important.
- When a site referring to you has many outgoing links, this decreases the weight of the reference.
HP = P; solving this directly costs O(n^3) and is all-or-nothing (no usable intermediate result).
alternative: fixpoint iteration:
- Start with the vector P^(0) = (1/n, 1/n, ..., 1/n)^T.
- Calculate P^(k) = HP^(k−1) for a certain (large enough) k.
- To solve dangling nodes (no outgoing edges), use teleportation:
G = αS + (1 − α)T, with S = H corrected for dangling nodes (dangling columns replaced by 1/n) and T the uniform teleportation matrix (all entries 1/n).
- For α = 1 we cannot guarantee convergence & it is slower;
- for α = 0 we get results that completely ignore the structure of the web: all pages are equal.
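A sketch of the fixpoint (power) iteration on G = αS + (1 − α)T for a made-up 3-page link graph with α = 0.85; column j of H holds 1/outdegree(j) for the pages that j links to:

import numpy as np

H = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.5, 0.5, 0.0]])

n = H.shape[0]
S = H.copy()
# Dangling-node fix: columns without outgoing links teleport uniformly.
S[:, S.sum(axis=0) == 0] = 1.0 / n

alpha = 0.85
T = np.full((n, n), 1.0 / n)            # uniform teleportation matrix
G = alpha * S + (1 - alpha) * T

P = np.full(n, 1.0 / n)                 # P(0) = (1/n, ..., 1/n)^T
for _ in range(100):                    # P(k) = G P(k-1)
    P = G @ P
print(P)                                # approximates the PageRank vector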
HC6
BLOSUM represents log-odds ratios.
dynamic programming: O(mn)
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
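A sketch of the O(mn) dynamic-programming table for global alignment (Needleman-Wunsch), with made-up match/mismatch/gap scores instead of a BLOSUM matrix; it returns only the optimal score, not the traceback:

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap                    # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))   # optimal global alignment score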
BLAST: if size of k