Similarity Measure of Order 2: specification and implementations
Several distributional measures have been tested in the literature (Weeds & Weir, 2003; 2005), one of the objectives being to find the terms that are most structurally equivalent to a target term i, or that have a similar term-term co-occurrence distribution.
The performance of these measures also depends on the frequency of the target term.
First implementation
Based on Weeds & Weir (2005), GarganText implements the Additive MI-based CRM metric, which has the best performance in the WordNet Prediction Task while performing fairly well at minimizing the α-skew divergence measure between the two distributions.
Here is the description of the Additive MI-based CRM metric, which is called Order2 in GarganText.
Notations:
- N: the total number of documents.
- n_{ij}: the number of co-occurrences of i and j.
- n_{i}: the number of documents containing i.
- I_{ik} = \log\left(\frac{\frac{n_{ik}}{N}}{\frac{n_{i}}{N} \times \frac{n_{k}}{N}}\right) = \log\left(\frac{N \times n_{ik}}{n_{i} \times n_{k}}\right): the mutual information between i and k.

The similarity is then defined as:

sim_{mi}(i,j) = \frac{\sum_{k \neq i,j ;\ (I_{ik} > 0 \wedge I_{jk} > 0)} I_{jk}}{\sum_{k \neq i,j ;\ I_{jk} > 0} I_{jk}}
Warning:
- For the numerator, beware of the condition I_{ik} > 0 \wedge I_{jk} > 0: we are summing over the terms k that co-occur with both i and j, which requires in particular n_{ik} > 0 and n_{jk} > 0 (otherwise the logarithm is undefined).
- The condition k \neq i,j is important.
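The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the GarganText implementation: the function names `mutual_information` and `order2_similarity`, the dense list-of-lists input, and the toy counts are all hypothetical. It also makes two assumptions that the formula leaves open: the denominator sums I_{jk} over the terms k with I_{jk} > 0 (the feature set of j in Weeds & Weir, 2005), and the similarity is set to 0 when that denominator is empty.

```python
import math

def mutual_information(n, doc_counts, N):
    """Return I[i][k] = log(N * n_ik / (n_i * n_k)).

    Pairs that never co-occur (n_ik == 0) get -inf, so they can never
    satisfy the I > 0 condition used by the similarity.
    """
    size = len(doc_counts)
    I = [[-math.inf] * size for _ in range(size)]
    for i in range(size):
        for k in range(size):
            if n[i][k] > 0:
                I[i][k] = math.log(N * n[i][k] / (doc_counts[i] * doc_counts[k]))
    return I

def order2_similarity(I, i, j):
    """Additive MI-based similarity sim_mi(i, j), under the assumptions above."""
    shared = 0.0  # numerator: I_jk over k with I_ik > 0 and I_jk > 0
    total = 0.0   # denominator (assumed): I_jk over k with I_jk > 0
    for k in range(len(I)):
        if k == i or k == j:  # the condition k != i, j
            continue
        if I[j][k] > 0:
            total += I[j][k]
            if I[i][k] > 0:
                shared += I[j][k]
    # Assumption: return 0 when j has no positive-MI context at all.
    return shared / total if total > 0 else 0.0

# Hypothetical toy corpus: 4 terms, N = 10 documents.
N = 10
doc_counts = [4, 4, 5, 2]   # n_i
n = [                       # n_ij (symmetric, diagonal = n_i)
    [4, 2, 3, 0],
    [2, 4, 3, 2],
    [3, 3, 5, 1],
    [0, 2, 1, 2],
]
I = mutual_information(n, doc_counts, N)
print(order2_similarity(I, 0, 1))
print(order2_similarity(I, 1, 0))
```

On this toy data the measure is asymmetric: `order2_similarity(I, 1, 0)` is 1 because every positive-MI context of term 0 is shared with term 1, whereas `order2_similarity(I, 0, 1)` is lower because term 1 also has a context (term 3) that term 0 never co-occurs with.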
Alternative implementation
It would also be very interesting to implement the difference-weighted metric sim_{mi}^{dw}:

sim_{mi}^{dw}(i,j) = \frac{\sum_{k \neq i,j ;\ (I_{ik} > 0 \wedge I_{jk} > 0)} \min(I_{ik}, I_{jk})}{\sum_{k \neq i,j ;\ I_{jk} > 0} I_{jk}}
Although the first metric is state of the art for retrieving structurally equivalent terms, the task of the graph explorer is slightly different, since we also try to provide a hierarchical organization of the terms. This second metric might be more suitable in that respect. Since the code should be quite similar to that of the metric above, I would like to make it available in the code for further testing in the dev version, to see whether it could be an interesting addition to the graph toolbox after the 0.0.7 release.
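To illustrate how close the two metrics are, here is a minimal, self-contained Python sketch of the difference-weighted variant. It is not the GarganText code: the names `mi` and `order2_dw_similarity`, the input layout, and the toy counts are hypothetical. As an assumption, the denominator sums I_{jk} over the terms k with I_{jk} > 0, and the result is 0 when that sum is empty. The only change with respect to the additive metric is the numerator's summand, which becomes min(I_{ik}, I_{jk}).

```python
import math

def mi(n_ik, n_i, n_k, N):
    """Mutual information log(N * n_ik / (n_i * n_k)); -inf if no co-occurrence."""
    return math.log(N * n_ik / (n_i * n_k)) if n_ik > 0 else -math.inf

def order2_dw_similarity(n, doc_counts, N, i, j):
    """Difference-weighted similarity sim_mi^dw(i, j), under the assumptions above."""
    shared = 0.0  # numerator: min(I_ik, I_jk) over k with both > 0
    total = 0.0   # denominator (assumed): I_jk over k with I_jk > 0
    for k in range(len(doc_counts)):
        if k == i or k == j:  # the condition k != i, j
            continue
        I_ik = mi(n[i][k], doc_counts[i], doc_counts[k], N)
        I_jk = mi(n[j][k], doc_counts[j], doc_counts[k], N)
        if I_jk > 0:
            total += I_jk
            if I_ik > 0:
                shared += min(I_ik, I_jk)
    # Assumption: return 0 when j has no positive-MI context at all.
    return shared / total if total > 0 else 0.0

# Hypothetical toy corpus: 4 terms, N = 10 documents.
N = 10
doc_counts = [4, 4, 5, 2]   # n_i
n = [                       # n_ij (symmetric, diagonal = n_i)
    [4, 2, 3, 1],
    [2, 4, 4, 2],
    [3, 4, 5, 1],
    [1, 2, 1, 2],
]
print(order2_dw_similarity(n, doc_counts, N, 0, 1))
print(order2_dw_similarity(n, doc_counts, N, 1, 0))
```

In this example terms 0 and 1 share all their positive-MI contexts, so the additive metric would give 1 in both directions. The difference-weighted score of (0, 1) is lower than 1 because term 0 is more weakly associated with those contexts than term 1, which is the kind of asymmetry that could help with a hierarchical organization of the terms.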
References:
- Weeds, Julie, and David Weir. 2003. "A general framework for distributional similarity". In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 81-88.
- Weeds, Julie, and David Weir. 2005. "Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity". Computational Linguistics 31 (4): 439-75. https://doi.org/10.1162/089120105775299122.