## Similarity Measures

Let documents be represented by term vectors in the following form: $D_i = (t_1, t_2, …, t_n)$.

**Dice coefficient**: $ Sim(D_i,D_j) = \frac{2 \times (D_i \cdot D_j)}{||D_i|| + ||D_j||} \quad \quad \textrm{For sets:} \quad Sim(X,Y) = \frac{2 \times |X \cap Y|}{|X| + |Y|} $**Jaccard coefficient**: $ Sim(D_i,D_j) = \frac{D_i \cdot D_j}{||D_i|| + ||D_j|| - D_i \cdot D_j} \quad \quad \textrm{For sets:} \quad Sim(X,Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} = \frac{|X \cap Y|}{|X \cup Y|} $**Cosine coefficient**: $ Sim(D_i,D_j) = \frac{D_i \cdot D_j}{|D_i| \times |D_j|} $

## Evaluation

The purpose of evaluation (Keen 1971):

- The need for measures with which to make merit comparisons within a single test situation
- The need for measures with which to make comparisons between results obtained in different test situations
- The need for assessing the merit of a real-life system.

To measure IR effectivenes in a standard way, we need a test collection consists of:

- A document collection
- A test suite of information needs, expressible as querys
- A set of relevance judgement (e.g. a binary assessment of either relevant or non-relevant for each query-document pair.

**Gold standard / Ground Truth**: a binary classification of a document as either relevant or non-relevant in the collection**Relevance**: for assessing relative to information need, NOT a query.**Standard test collections**(see Manning et al. (2008) pp.153-154):- Cranfield: 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and relevance judgments of all pair; too small nowadays for evaluation.
- TREC (Test Retrieval Conference)

- Lancaster (1979) suggested that statistics of documents in an IR system can be summarised in a 2 x 2 matrix:
**Precision**: The proportion of retrieved documents that are relevant: $P = \frac{TP}{TP + FP}$**Recall**: The proportion of relevant documents retrieved: $R = \frac{TP}{TP + FN}$**Fallout**: The proportion of non-relevant documents retrieved: $F = \frac{FP}{FP + TN}$**Accuracy**: The fraction of correctly classified documents: $Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$

Relevant | Not Relevant | |
---|---|---|

Retrieved | True Positive (TP) |
False Positive (FP) |

Not Retrieved | False Negative (FN) |
True Negative (TN) |

**Weakness of accuracy**: It is not appropriate because in almost all circumstances, the data is extremely skewed: over 99.9% of the documents are in the non-relevant category. A system aims at maximising accuracy can appear to perform well by simply deeming all documents as irrelevant. However, it is totally unsatisfying for the system to label everything as non-relevant because users always want to see some documents, and can be assumed to have a certain tolerance for seeing some false positive. (Manning et al., p.155)**Generality**: The proportion of relevant documents per query: $G = \frac{TP + FN}{TP + FP + TN + FN}$**F-measure**: A combined measure which assesses the tradeoff between precision and recall. It is a weighted harmonic mean. In its general form, it is defined as: $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$, where $\beta = \frac{1 - \alpha}{\alpha}$, $\alpha \in [0,1]$. A balanced measure with equal weights on precision and recall is called an $F_1$ measure: $F_1 = \frac{2 \times P \times R}{P + R}$

### Why harmonic mean is used?

As we can always achieve 100% recall if we return all the documents, an arithmetic mean of $P$ and $R$ can at least achieve 50% in this case. This suggests that it is an unsuitable measure. The harmonic mean, on the other hand, returns a value closer to the minimum of two numbers than arithmetic or geometric means. (Manning et al., pp.156-157)

- Limitations of precision and recall (Chowdhury 2004, pp.251-252)
- Different users may want different levels of recall , but users are often unable to specify exactly how many items they want to be retrieved
- Recall assumes that all relevant documents have the same value (importance), which is not always true. Different documents have different degrees of relevance.

## References

- Manning, D. C., Raghavan, P., and SchutzeIntroduction H. (2008) Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press.
- Chowdhury, G. G. (2004) Introduction to Modern Information Retrieval, 2nd Edition. London: Facet Publishing.
- Keen, E. M. (1971) ‘Evaluation parameters’. In: Salton, G. (ed.), The SMART retrieval system: experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall, pp.74-111.
- Lancaster, F. W. (1979) Information retrieval systems: characteristics, testing and evaluation. New York: John Wiley.
- van Rijsbergen, C. J. (1979) Information retrieval, 2nd Edition. London: Butterworth.

PREVIOUSGraph Emebeddings