# Information Retrieval

## Similarity Measures

Let documents be represented by term vectors in the following form: $D_i = (t_1, t_2, …, t_n)$.

• Dice coefficient: $Sim(D_i,D_j) = \frac{2 \times (D_i \cdot D_j)}{||D_i|| + ||D_j||} \quad \quad \textrm{For sets:} \quad Sim(X,Y) = \frac{2 \times |X \cap Y|}{|X| + |Y|}$
• Jaccard coefficient: $Sim(D_i,D_j) = \frac{D_i \cdot D_j}{||D_i|| + ||D_j|| - D_i \cdot D_j} \quad \quad \textrm{For sets:} \quad Sim(X,Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} = \frac{|X \cap Y|}{|X \cup Y|}$
• Cosine coefficient: $Sim(D_i,D_j) = \frac{D_i \cdot D_j}{|D_i| \times |D_j|}$

## Evaluation

The purpose of evaluation (Keen 1971):

• The need for measures with which to make merit comparisons within a single test situation
• The need for measures with which to make comparisons between results obtained in different test situations
• The need for assessing the merit of a real-life system.

To measure IR effectivenes in a standard way, we need a test collection consists of:

1. A document collection
2. A test suite of information needs, expressible as querys
3. A set of relevance judgement (e.g. a binary assessment of either relevant or non-relevant for each query-document pair.
• Gold standard / Ground Truth: a binary classification of a document as either relevant or non-relevant in the collection
• Relevance: for assessing relative to information need, NOT a query.
• Standard test collections (see Manning et al. (2008) pp.153-154):
• Cranfield: 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and relevance judgments of all pair; too small nowadays for evaluation.
• TREC (Test Retrieval Conference)
• Lancaster (1979) suggested that statistics of documents in an IR system can be summarised in a 2 x 2 matrix:
• Precision: The proportion of retrieved documents that are relevant: $P = \frac{TP}{TP + FP}$
• Recall: The proportion of relevant documents retrieved: $R = \frac{TP}{TP + FN}$
• Fallout: The proportion of non-relevant documents retrieved: $F = \frac{FP}{FP + TN}$
• Accuracy: The fraction of correctly classified documents: $Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$
Relevant Not Relevant
Retrieved True Positive
(TP)
False Positive
(FP)
Not Retrieved False Negative
(FN)
True Negative
(TN)
• Weakness of accuracy: It is not appropriate because in almost all circumstances, the data is extremely skewed: over 99.9% of the documents are in the non-relevant category. A system aims at maximising accuracy can appear to perform well by simply deeming all documents as irrelevant. However, it is totally unsatisfying for the system to label everything as non-relevant because users always want to see some documents, and can be assumed to have a certain tolerance for seeing some false positive. (Manning et al., p.155)
• Generality: The proportion of relevant documents per query: $G = \frac{TP + FN}{TP + FP + TN + FN}$
• F-measure: A combined measure which assesses the tradeoff between precision and recall. It is a weighted harmonic mean. In its general form, it is defined as: $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$, where $\beta = \frac{1 - \alpha}{\alpha}$, $\alpha \in [0,1]$. A balanced measure with equal weights on precision and recall is called an $F_1$ measure: $F_1 = \frac{2 \times P \times R}{P + R}$

### Why harmonic mean is used?

As we can always achieve 100% recall if we return all the documents, an arithmetic mean of $P$ and $R$ can at least achieve 50% in this case. This suggests that it is an unsuitable measure. The harmonic mean, on the other hand, returns a value closer to the minimum of two numbers than arithmetic or geometric means. (Manning et al., pp.156-157)

• Limitations of precision and recall (Chowdhury 2004, pp.251-252)
• Different users may want different levels of recall , but users are often unable to specify exactly how many items they want to be retrieved
• Recall assumes that all relevant documents have the same value (importance), which is not always true. Different documents have different degrees of relevance.

## References

• Manning, D. C., Raghavan, P., and SchutzeIntroduction H. (2008) Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press.
• Chowdhury, G. G. (2004) Introduction to Modern Information Retrieval, 2nd Edition. London: Facet Publishing.
• Keen, E. M. (1971) ‘Evaluation parameters’. In: Salton, G. (ed.), The SMART retrieval system: experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall, pp.74-111.
• Lancaster, F. W. (1979) Information retrieval systems: characteristics, testing and evaluation. New York: John Wiley.
• van Rijsbergen, C. J. (1979) Information retrieval, 2nd Edition. London: Butterworth.