Data Analysis Summary
Precision = the fraction of the retrieved docs that is relevant to the information need (selectivity).
Recall = the fraction of the relevant docs in the collection that is retrieved (sensitivity).
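A minimal sketch (not from the lecture notes) of how these two measures are computed for one query, given a set of retrieved doc IDs and the set of relevant doc IDs; the IDs are made up:

def precision_recall(retrieved: set, relevant: set):
    hits = retrieved & relevant                              # relevant docs that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned docs are relevant; 6 docs are relevant in total.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))   # (0.75, 0.5)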
Boolean retrieval -> basic model, queries with AND, OR, NOT and parentheses (library catalogues).
Sparse matrix approach -> inverted index on docIDs: Hash Table (no range queries) or B-Tree for the dictionary, array/linked list for the postings. Q = t1 AND t2 -> locate postings lists p1 and p2 for t1 and t2, calc. the intersection of p1 and p2 by list merging (see the sketch below) -> √n skip pointers, with n the length of the postings list.
Phrase Q -> positional index. Wildcard Q -> permuterm index.
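Sketch of AND-query evaluation by merging two sorted postings lists (docIDs ascending), as referenced above; this is the plain linear merge without skip pointers, and the postings are made-up examples:

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                      # advance the list with the smaller docID
        else:
            j += 1
    return answer

# Q = t1 AND t2, with the postings of t1 and t2:
print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 21]))   # [3, 8]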
HC2
MapReduce: Map and Reduce workers do the processing in parallel (impl. Hadoop). Data are <k, v> pairs. A map worker scans its own input once and does one uniform calc. on each k-v pair; its output is a set of >= 0 k-v pairs, which is first grouped on key and then given to a Reduce worker. The combined output of all Reduce workers is the result of the calc. Map always works on one <k, v> tuple; Reduce always works on one <k, [v1, v2, ..., vn]> tuple.
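A toy single-process imitation (not Hadoop code) of this Map -> group-by-key -> Reduce flow, using the classic word-count example with made-up documents:

from collections import defaultdict

def map_fn(key, value):
    # key = doc ID, value = doc text; emit >= 0 (word, 1) pairs
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key = word, values = [1, 1, ...]; emit the total count
    return (key, sum(values))

docs = {1: "to be or not to be", 2: "to do is to be"}

# Map phase: each <k, v> tuple is processed independently.
mapped = [pair for k, v in docs.items() for pair in map_fn(k, v)]

# Shuffle: group the intermediate pairs on key.
grouped = defaultdict(list)
for k, v in mapped:
    grouped[k].append(v)

# Reduce phase: each <k, [v1, ..., vn]> tuple is processed independently.
print(dict(reduce_fn(k, vs) for k, vs in grouped.items()))
# {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}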
HC3
scoring function: s: <q, d> -> v, with q(uery), d(ocument), v ∈ [0, 1] or v ∈ R+; it expresses the quality of the match between q and d and enables us to calc. the top-k.
term frequency: higher score if t occurs more often in d; should contribute to the score function.
inverse document frequency: measure of rareness. Define df_t = number of docs containing t, N = total number of docs.
idf_t = log(N / df_t). weight(t, d) = tf_t,d × idf_t. idf does not change the ranking for one-term queries, and summing raw weights favours long docs;
solution = vector space model.
We look for the document vector with the smallest angle to the query vector; this is a generalization of tf-idf scoring.
x · y = 0 means the two vectors mismatch on all terms.
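A small sketch of tf-idf weighting plus cosine scoring in the vector space model; the corpus, query and helper names are made up for illustration:

import math
from collections import Counter

docs = ["data analysis and retrieval", "boolean retrieval model", "data mining"]
N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if df[t] > 0}

def cosine(x, y):
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())   # 0 if the vectors mismatch on all terms
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

query = tfidf_vector("data retrieval".split())
doc_vectors = [tfidf_vector(t) for t in tokenized]
# Rank docs by decreasing cosine = smallest angle with the query vector.
print(sorted(range(N), key=lambda i: cosine(query, doc_vectors[i]), reverse=True))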
Frequent item set mining: for an association rule X → Y, define:
- the support as s(X ∪ Y)
- the confidence as s(X ∪ Y) / s(X).
An item set is frequent if its support is bigger than a user-specified minimum support threshold. The Apriori property: a set is a candidate frequent set only if all its subsets are frequent, because:
- If X is frequent, then all its subsets are also frequent.
- If X has a subset that is not frequent, then X itself cannot be frequent.
naïve complexity = O(2^m); Apriori's worst-case complexity is still O(2^m) (when all item sets are frequent), but pruning keeps it far smaller in practice.
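A compact sketch of Apriori's level-wise candidate generation, using the property above (only sets whose subsets are all frequent become candidates); the transactions and minsup are toy values:

from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    support = lambda s: sum(s <= t for t in transactions)
    # Level 1: frequent single items.
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # Candidates: (k+1)-sets built from frequent k-sets ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # ... pruned unless every k-subset is frequent (Apriori property) ...
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # ... then checked against the minimum support on the data.
        frequent.append({c for c in candidates if support(c) >= minsup})
        k += 1
    return [s for level in frequent for s in level]

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], minsup=2))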
HC5
PageRank = using the link structure to define the importance of a web site:
- When many sites refer to you, you are important.
- When important sites refer to you, you are important.
- When a site referring to you has many outgoing links, this decreases the weight of the reference.
HP = P; solving this directly costs O(n^3) and is all-or-nothing (no usable intermediate result).
alternative: fixpoint iteration:
- Start with the vector P^(0) = (1/n, 1/n, ..., 1/n)^T.
- Calculate P^(k) = HP^(k−1) for a certain (large enough) k.
- To solve dangling nodes (no outgoing edges), use teleportation:
G = αS + (1 − α)T, with S = H corrected for dangling nodes (dangling columns replaced by 1/n) and T the uniform teleportation matrix (all entries 1/n).
- For α = 1 we cannot guarantee convergence & it is slower;
- for α = 0 we get results that completely ignore the structure of the web: all pages are equal.
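A sketch of the fixpoint (power) iteration on G = αS + (1 − α)T for a made-up 3-page link graph with α = 0.85; column j of H holds 1/outdegree(j) for the pages that j links to:

import numpy as np

H = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.5, 0.5, 0.0]])

n = H.shape[0]
S = H.copy()
# Dangling-node fix: columns without outgoing links teleport uniformly.
S[:, S.sum(axis=0) == 0] = 1.0 / n

alpha = 0.85
T = np.full((n, n), 1.0 / n)            # uniform teleportation matrix
G = alpha * S + (1 - alpha) * T

P = np.full(n, 1.0 / n)                 # P(0) = (1/n, ..., 1/n)^T
for _ in range(100):                    # P(k) = G P(k-1)
    P = G @ P
print(P)                                # approximates the PageRank vector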
HC6
BLOSUM represents log-odds ratios.
dynamic programming: O(mn)
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
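A sketch of the O(mn) dynamic-programming table for global alignment (Needleman-Wunsch), with made-up match/mismatch/gap scores instead of a BLOSUM matrix; it returns only the optimal score, not the traceback:

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap                    # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))   # optimal global alignment score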
BLAST: if size of k