haskell-gargantext
Closed
Open
Opened Mar 21, 2024 by david Chavalarias@davidchavalarias

Similarity Measure of Order 2: specification and implementations

Several distributional measures have been tested in the literature (Weeds & Weir, 2003; 2005), one of the objectives being to find the terms that are most structurally equivalent to a target term i, or that have a similar term-term co-occurrence distribution.

The performance of these measures also depends on the frequency of the target term.

First implementation

Based on Weeds & Weir (2005), GarganText implements the Additive MI-based CRM metric, which has the best performance in the WordNet Prediction Task while performing fairly well at minimizing the α-skew divergence measure between the two distributions.

Here is the description of the Additive MI-based CRM metric, which is called Order2 in GarganText.

Notations:

  • N total number of documents
  • n_{ij} the number of co-occurrences of i and j
  • n_{i} the number of documents containing i.
  • I_{ik} = \log\left(\frac{n_{ik}/N}{(n_{i}/N) \times (n_{k}/N)}\right) = \log\left(\frac{N \times n_{ik}}{n_{i} \times n_{k}}\right) the mutual information between i and k.
  • sim_{mi}(i,j) = \frac{\sum_{k \neq i,j ;\ I_{ik} > 0 \wedge I_{jk} > 0} I_{jk}}{\sum_{k \neq i,j ;\ I_{jk} > 0} I_{jk}}

Warning:

  • For the numerator, beware of the condition I_{ik} > 0 \wedge I_{jk} > 0: we sum only over terms k that co-occur with both i and j, which requires n_{ik} > 0 and n_{jk} > 0.
  • The condition k \neq i,j is important: the terms i and j themselves must be excluded from the sums.
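The definition above can be sketched in a few lines of Python. This is only an illustration, not the GarganText (Haskell) implementation: the names `cooc`, `doc_count`, `n_docs`, `mutual_information` and `sim_mi` are hypothetical, the denominator is assumed to run over the terms k with I_{jk} > 0, and returning 0.0 when that denominator is empty is an arbitrary convention chosen for the sketch.

```python
import math

def mutual_information(cooc, doc_count, n_docs, i, k):
    """I_{ik} = log(N * n_ik / (n_i * n_k)); -inf when i and k never co-occur."""
    n_ik = cooc.get(frozenset((i, k)), 0)
    if n_ik == 0:
        return float("-inf")
    return math.log(n_docs * n_ik / (doc_count[i] * doc_count[k]))

def sim_mi(cooc, doc_count, n_docs, i, j):
    """Additive MI-based similarity sim_mi(i, j), as written in the notations."""
    numerator = 0.0
    denominator = 0.0
    for k in doc_count:
        if k == i or k == j:          # condition k != i, j
            continue
        i_ik = mutual_information(cooc, doc_count, n_docs, i, k)
        i_jk = mutual_information(cooc, doc_count, n_docs, j, k)
        if i_jk > 0:                  # assumed denominator condition
            denominator += i_jk
            if i_ik > 0:              # numerator needs both I_ik > 0 and I_jk > 0
                numerator += i_jk
    # Assumption: similarity is 0.0 when j has no positively associated term.
    return numerator / denominator if denominator > 0 else 0.0

# Toy corpus (made-up counts): 10 documents, 4 terms.
N_DOCS = 10
DOC_COUNT = {"a": 4, "b": 4, "c": 2, "d": 2}
COOC = {
    frozenset(("a", "b")): 3,
    frozenset(("a", "c")): 2,
    frozenset(("b", "c")): 2,
    frozenset(("b", "d")): 2,
}

print(sim_mi(COOC, DOC_COUNT, N_DOCS, "a", "b"))  # 0.5
print(sim_mi(COOC, DOC_COUNT, N_DOCS, "b", "a"))  # 1.0
```

On this toy data the measure is not symmetric: only half of the mutual-information mass of b is shared with a (through c, but not d), whereas everything a is positively associated with is also associated with b.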

Alternative implementation

It would also be very interesting to implement the Difference-Weighted metric sim_{mi}^{dw}:

sim_{mi}^{dw}(i,j) = \frac{\sum_{k \neq i,j ;\ I_{ik} > 0 \wedge I_{jk} > 0} \min(I_{jk}, I_{ik})}{\sum_{k \neq i,j ;\ I_{jk} > 0} I_{jk}}

Although the first metric is state-of-the-art for retrieving structurally equivalent terms, the task of the graph explorer is a little different, since we also try to provide a hierarchical organization of the terms. The second metric might be more suitable in this respect. Since the code should be quite similar to that of the metric above, I would like to make it available in the dev version for further testing, to see whether it could be an interesting addition to the graph toolbox after the 0.0.7 release.
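To illustrate how small the change is, here is a minimal Python sketch of the difference-weighted variant. It is not the GarganText (Haskell) code: `mi`, `sim_mi_dw`, `cooc`, `doc_count` and `n_docs` are hypothetical names, the denominator is assumed to run over the terms k with I_{jk} > 0, and returning 0.0 for an empty denominator is a convention of the sketch. The only difference from the additive metric is the `min(...)` in the numerator.

```python
import math

def mi(cooc, doc_count, n_docs, a, b):
    """I_{ab} = log(N * n_ab / (n_a * n_b)); -inf when a and b never co-occur."""
    n_ab = cooc.get(frozenset((a, b)), 0)
    if n_ab == 0:
        return float("-inf")
    return math.log(n_docs * n_ab / (doc_count[a] * doc_count[b]))

def sim_mi_dw(cooc, doc_count, n_docs, i, j):
    """Difference-weighted MI-based similarity sim_mi^dw(i, j)."""
    numerator = 0.0
    denominator = 0.0
    for k in doc_count:
        if k in (i, j):               # condition k != i, j
            continue
        i_ik = mi(cooc, doc_count, n_docs, i, k)
        i_jk = mi(cooc, doc_count, n_docs, j, k)
        if i_jk > 0:                  # assumed denominator condition
            denominator += i_jk
            if i_ik > 0:
                # Shared terms contribute only the smaller of the two weights.
                numerator += min(i_jk, i_ik)
    # Assumption: similarity is 0.0 when j has no positively associated term.
    return numerator / denominator if denominator > 0 else 0.0

# Toy corpus (made-up counts): "c" is weakly tied to "a" and strongly tied to "b".
TOY_N_DOCS = 10
TOY_DOC_COUNT = {"a": 4, "b": 4, "c": 2, "d": 2}
TOY_COOC = {
    frozenset(("a", "c")): 1,
    frozenset(("b", "c")): 2,
    frozenset(("b", "d")): 2,
}

dw_score = sim_mi_dw(TOY_COOC, TOY_DOC_COUNT, TOY_N_DOCS, "a", "b")
print(dw_score)  # log(1.25) / (2 * log(2.5)), roughly 0.12
```

With the additive metric the same pair would score 0.5, because the shared term c would count with its full weight I_{bc}; here it counts only with the weaker association I_{ac}.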

References:

  • Weeds, Julie, and David Weir. 2003. "A general framework for distributional similarity". In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 81-88.
  • Weeds, Julie, and David Weir. 2005. "Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity". Computational Linguistics 31 (4): 439-75. https://doi.org/10.1162/089120105775299122.
Edited Mar 25, 2024 by delanoe
Milestone: Stabilisation
Labels: map/Graph
Reference: gargantext/haskell-gargantext#334