Similarité distributionnelle

Ditrance ditributionelle

tags: `gargantext`

Objective : We want to compute with matrices processing the similarity between term $i$ and term $j$ : prox(i,j)=\frac{\Sigma_{k \neq i,j} min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})}{\Sigma_{k \neq i,j}\frac{n_{ik}^2}{ n_{ii}n_{kk}}}

where n_{ij} is the number of cooccurrences between term i and term j

Definitions and Operators for multi-dimensional arrays

Definitions

operator : .* and ./ cell by cell multiplication and division of the matrix
operator * is the matrix multiplication
operator : Diag(M)=[n_{ii}]_i (vecteur)
Matrice M=[n_{ij}]_{i,j}
Id = identity Matrix (ones on the diagonal, zeros everywhere else)
Matrice ones : O=[1]_{i,j} (one everywhere)
D(M)=Id .* M (the square matrix that has zeros everywhere except on its diagonal where it is equal to M)

Operators

repmat :: Matrix -> Int -> Int -> Int -> 3 dimensional array repmat M $n_1$ $n_2$ $n_3$ is the 3 dimensional array that replicates M along dimensions 1 2 and 3 respectively n_1, n_2 and n_3 times.

Example : if M=[m_{i,j}]_{i,j \leq n} is a n x n Matrix, M'=repmat M 1 1 n = [m'_{i,j,k}]_{i,j,k \leq n} is a 3 dimentional cube shaped array such that \forall i,j,k, m'_{i,j,k}=m_{i,j}

Type definition :: Vect123 [i j k] is of type Vect123 iff :

i,j,k\in\{1,~2,~3\}
sort asc {i,j,k}={1, 2, 3}

proj :: Int \in {1 2 3} -> Vect123 -> Int

Noted proj n [i j k] it returns the pth element of [i j k] Operator that permute the numbers 1 2 3. Example :

proj 1 [3 1 2] = 3 ;
proj 2 [3 1 2] = 1

permute :: 3 dimentional array -> Vect123 -> 3 dimensional array permute M' [x y z] = M" where

M'=[m'_{i,j,k}]_{i,j,k}
M"=[m'_{proj~x~[i~j~k],proj~y~[i~j~k],proj~z~[i~j~k]}]_{i,j,k}

sum3 :: 3 dimentional array -> Matrix -- projection of a 3 dimentional array onto a matrix by summation over one of the dimensions sum3 M'=M where
M'=[m'_{i,j,k}]_{i,j,k} M=[m_{i,j}]_{i,j} where m_{i,j}=\Sigma_k m'_{i,j,k}

Main operations

We first build the matrix that will guaranty the condition k\neq i,j

I=repmat (O-Id) 1 1 N
I_1=permute~I~[1~3~2] has 0 where i=k and ones elsewhere
I_2=permute~I~[3~2~1] has 0 where j=k and ones elsewhere
II=I_1.*I_2=[r_{i,j,k}] is a matrix such that k= i,j \implies r_{i,j,k}=0 ; r_{i,j,k}=1 otherwise.

Then Let

D_1=O * D(M) =[n_{jj}]_{i,j}
D_2=D(M) * O =[n_{ii}]_{i,j}
MI=(M ./ D_1) .* (M ./ D_2 )=[\frac{n_{ij}^2}{n_{ii}n_{jj}}]_{ij}
W=repmat MI 1 1 N=[\frac{n_{ij}^2}{n_{ii}n_{jj}}]_{ijk}
W_1=permute~W~[1~3~2]=[\frac{n_{ik}^2}{n_{ii}n_{kk}}]_{ijk}
W_2=permute~W~[3~2~1]=[\frac{n_{jk}^2}{n_{jj}n_{kk}}]_{ijk}
W'=min(W_1,W_2)=[min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})]_{ijk}
Z_1=sum3~(W'.*II)=[\Sigma_{k \neq i,j} min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})]_{i,j}
Z_2=sum3~(W_1.*II)=[\Sigma_{k \neq i,j} \frac{n_{ik}^2}{n_{ii}n_{kk}}]_{i,j}

Then prox(M)=[prox(i,j)]_{i,j}=Z_1./Z_2

Log Distributionnal Similarity

In order to be aligned with the V3 and also to take into account terms Zipf law distribution,the final implementation of the distributional similarity should be a variant of the above.

We will consider

To=\Sigma_{i,j}n_{ij}

s_i=\Sigma_{i\neq k}n_{ik}

For i\neq j, m_{i,j}=log(To*\frac{n_{ij}}{s_i*s_j})

sumMin_{ij}=\Sigma_{k \neq i,j} min(m_{ik},m_{jk})

SumM_i=\Sigma_{k\neq i,j} m_{ik}

And the similarity LogSim will be defined by:

LogSim(i,j)=\frac{sumMin_{ij}}{SumM_i}

Ditrance ditributionelle
===
###### tags: `gargantext`

__Objective :__ We want to compute with matrices processing the similarity between term $i$ and term $j$ :
 prox(i,j)=$`\frac{\Sigma_{k \neq i,j} min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})}{\Sigma_{k \neq i,j}\frac{n_{ik}^2}{ n_{ii}n_{kk}}}`$

where $`n_{ij}`$ is the number of cooccurrences between term $`i`$ and term $`j`$

### Definitions and Operators for multi-dimensional arrays

#### Definitions
* operator : .* and ./ cell by cell multiplication and division of the matrix
* operator * is the matrix multiplication
* operator : Diag(M)=[$`n_{ii}]_i`$ (vecteur)
* Matrice $`M=[n_{ij}]_{i,j}`$
* $`Id = identity Matrix`$ (ones on the diagonal, zeros everywhere else)
* Matrice ones : $`O=[1]_{i,j}`$ (one everywhere)
* $`D(M)=Id .* M`$ (the square matrix that has zeros everywhere except on its diagonal where it is equal to M)

#### Operators
__repmat :: Matrix -> Int -> Int -> Int -> 3 dimensional array__
repmat M $n_1$ $n_2$ $n_3$ is the 3 dimensional array that replicates M along dimensions 1 2 and 3 respectively $`n_1`$, $`n_2`$ and $`n_3`$ times.

Example : if $`M=[m_{i,j}]_{i,j \leq n}`$ is a n x n Matrix,  M'=repmat M 1 1 n = $`[m'_{i,j,k}]_{i,j,k \leq n}`$ is a 3 dimentional cube shaped array such that  $`\forall i,j,k, m'_{i,j,k}=m_{i,j}`$

Type definition :: Vect123
[i j k] is of type Vect123 iff :
* $`i,j,k\in\{1,~2,~3\}`$
* sort asc {i,j,k}={1, 2, 3}

__proj :: Int $`\in`$ {1 2 3} -> Vect123 ->  Int__

Noted proj n [i j k] it returns the pth element of [i j k] 
Operator that permute the numbers 1 2 3. Example :
* proj 1 [3 1 2] = 3 ;
* proj 2 [3 1 2] = 1

__permute :: 3 dimentional array -> Vect123 -> 3 dimensional array__
permute M' [x y z] = M" 
where      
* $`M'=[m'_{i,j,k}]_{i,j,k}`$

* $`M"=[m'_{proj~x~[i~j~k],proj~y~[i~j~k],proj~z~[i~j~k]}]_{i,j,k}`$

__sum3 :: 3 dimentional array -> Matrix__
-- projection of a 3 dimentional array onto a matrix by summation over one of the dimensions
sum3 M'=M 
where  
    $`M'=[m'_{i,j,k}]_{i,j,k}`$ 
    $`M=[m_{i,j}]_{i,j}`$ where $`m_{i,j}=\Sigma_k m'_{i,j,k}`$

### Main operations
We first build the matrix that will guaranty the condition $`k\neq i,j`$
* $`I`$=repmat (O-Id)  1 1 N
* $`I_1=permute~I~[1~3~2]`$ has 0 where i=k and ones elsewhere
* $`I_2=permute~I~[3~2~1]`$ has 0 where j=k and ones elsewhere
* $`II=I_1.*I_2=[r_{i,j,k}]`$ is a matrix such that $`k= i,j \implies r_{i,j,k}=0 ; r_{i,j,k}=1`$ otherwise.

Then Let 
* $`D_1=O * D(M) =[n_{jj}]_{i,j}`$
* $`D_2=D(M) * O =[n_{ii}]_{i,j}`$
* $`MI=(M ./ D_1) .* (M ./ D_2 )=[\frac{n_{ij}^2}{n_{ii}n_{jj}}]_{ij}`$
* W=repmat MI 1 1 N=$`[\frac{n_{ij}^2}{n_{ii}n_{jj}}]_{ijk}`$
* $`W_1=permute~W~[1~3~2]=[\frac{n_{ik}^2}{n_{ii}n_{kk}}]_{ijk}`$
* $`W_2=permute~W~[3~2~1]=[\frac{n_{jk}^2}{n_{jj}n_{kk}}]_{ijk}`$
* $`W'=min(W_1,W_2)=[min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})]_{ijk}`$
* $`Z_1=sum3~(W'.*II)=[\Sigma_{k \neq i,j} min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})]_{i,j}`$
* $`Z_2=sum3~(W_1.*II)=[\Sigma_{k \neq i,j} \frac{n_{ik}^2}{n_{ii}n_{kk}}]_{i,j}`$

Then $`prox(M)=[prox(i,j)]_{i,j}=Z_1./Z_2`$

### Log Distributionnal Similarity 
In order to be aligned with the V3 and also to take into account terms Zipf law distribution,the final implementation of the distributional similarity should be a variant of the above.

We will consider

$`To=\Sigma_{i,j}n_{ij}`$

$`s_i=\Sigma_{i\neq k}n_{ik}`$

For $`i\neq j`$,  $`m_{i,j}=log(To*\frac{n_{ij}}{s_i*s_j})`$

$`sumMin_{ij}=\Sigma_{k \neq i,j} min(m_{ik},m_{jk})`$

$`SumM_i=\Sigma_{k\neq i,j} m_{ik}`$