Similarity Measure of Order 2: specification and implementations (#334) · Issues · gargantext / haskell-gargantext

Similarity Measure of Order 2: specification and implementations

Several distributional measures have been tested in the literature (Weeds & Weir, 2003; 2005), one of the objectives being to find the terms that are most structurally equivalent to a target term i, or that have a similar term-term co-occurrence distribution.

The performance of these measures also depends on the frequency of the target term.

First implementation

Based on Weeds & Weir (2005), GarganText implements the Additive MI-based CRM metric, which has the best performance in the WordNet Prediction Task while performing fairly well at minimizing the α-skew divergence measure between the two distributions.

Here is the description of the Additive MI-based CRM metric, which is called Order2 in GarganText.

Notations:

N total number of documents
n_{ij} the number of co-occurrences of i and j
n_{i} the number of documents containing i.
I_{ik} = \log(\frac{\frac{n_{ik}}{N}}{\frac{n_{i}}{N} \times \frac{n_{k}}{N}}) = \log(\frac{N \times n_{ik}}{n_{i} \times n_{k}}) the mutual information between i and k.
sim_{mi}(i,j) = \frac{\sum_{k \neq i,j ; (I_{ik} > 0 \wedge I_{jk} > 0)} I_{jk}}{\sum_{k \neq i,j ; I_{jk} > 0} I_{jk}}
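
To make the mutual information term concrete, here is a small worked calculation. The counts are made up for illustration, and the natural logarithm is an assumption, since the notation above does not fix the base of the log.

```latex
% Illustrative counts (not from GarganText data): N = 100, n_i = 10, n_k = 20
% Case 1: n_{ik} = 5
I_{ik} = \log\left(\frac{100 \times 5}{10 \times 20}\right) = \log(2.5) \approx 0.92 > 0
% Case 2: n_{ik} = 1
I_{ik} = \log\left(\frac{100 \times 1}{10 \times 20}\right) = \log(0.5) \approx -0.69 < 0
```

In the first case k satisfies the condition I_{ik} > 0; in the second it does not, so k would not contribute to the numerator of sim_{mi}(i,j).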

Warning:

For the numerator, beware the condition I_{ik} > 0 \wedge I_{jk} > 0: we sum only over terms k that co-occur with both i and j, which requires in particular n_{ik} > 0 and n_{jk} > 0.
The condition k \neq i,j is important

Alternative implementation

It would also be very interesting to implement the Difference-Weighted metric sim_{mi}^{dw}:

sim_{mi}^{dw}(i,j) = \frac{\sum_{k \neq i,j ; (I_{ik} > 0 \wedge I_{jk} > 0)} \min(I_{jk}, I_{ik})}{\sum_{k \neq i,j ; I_{jk} > 0} I_{jk}}

Although the first one is state of the art for retrieving structurally equivalent terms, the task of the graph explorer is a little different, since we also try to provide a hierarchical organization of the terms. This second metric might be more suitable in that respect. Since the code should be quite similar to that of the metric above, I would like to make it available in the code for further tests in the dev version, to see whether it could be an interesting addition to the graph toolbox after the 0.0.7 release.

References:

Weeds, Julie, and David Weir. 2003. "A General Framework for Distributional Similarity." In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 81–88.
Weeds, Julie, and David Weir. 2005. "Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity." Computational Linguistics 31 (4): 439–75. https://doi.org/10.1162/089120105775299122.

Edited Mar 25, 2024 by delanoe