Similarité distributionnelle
Ditrance ditributionelle
gargantext
tags: Objective : We want to compute with matrices processing the similarity between term $i$ and term $j$ :
prox(i,j)=\frac{\Sigma_{k \neq i,j} min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})}{\Sigma_{k \neq i,j}\frac{n_{ik}^2}{ n_{ii}n_{kk}}}
where n_{ij}
is the number of cooccurrences between term i
and term j
Definitions and Operators for multi-dimensional arrays
Definitions
- operator : .* and ./ cell by cell multiplication and division of the matrix
- operator * is the matrix multiplication
- operator : Diag(M)=[
n_{ii}]_i
(vecteur) - Matrice
M=[n_{ij}]_{i,j}
-
Id = identity Matrix
(ones on the diagonal, zeros everywhere else) - Matrice ones :
O=[1]_{i,j}
(one everywhere) -
D(M)=Id .* M
(the square matrix that has zeros everywhere except on its diagonal where it is equal to M)
Operators
repmat :: Matrix -> Int -> Int -> Int -> 3 dimensional array
repmat M $n_1$ $n_2$ $n_3$ is the 3 dimensional array that replicates M along dimensions 1 2 and 3 respectively n_1
, n_2
and n_3
times.
Example : if M=[m_{i,j}]_{i,j \leq n}
is a n x n Matrix, M'=repmat M 1 1 n = [m'_{i,j,k}]_{i,j,k \leq n}
is a 3 dimentional cube shaped array such that \forall i,j,k, m'_{i,j,k}=m_{i,j}
Type definition :: Vect123 [i j k] is of type Vect123 iff :
i,j,k\in\{1,~2,~3\}
- sort asc {i,j,k}={1, 2, 3}
proj :: Int \in
{1 2 3} -> Vect123 -> Int
Noted proj n [i j k] it returns the pth element of [i j k] Operator that permute the numbers 1 2 3. Example :
- proj 1 [3 1 2] = 3 ;
- proj 2 [3 1 2] = 1
permute :: 3 dimentional array -> Vect123 -> 3 dimensional array permute M' [x y z] = M" where
-
M'=[m'_{i,j,k}]_{i,j,k}
-
M"=[m'_{proj~x~[i~j~k],proj~y~[i~j~k],proj~z~[i~j~k]}]_{i,j,k}
sum3 :: 3 dimentional array -> Matrix
-- projection of a 3 dimentional array onto a matrix by summation over one of the dimensions
sum3 M'=M
where
M'=[m'_{i,j,k}]_{i,j,k}
M=[m_{i,j}]_{i,j}
where m_{i,j}=\Sigma_k m'_{i,j,k}
Main operations
We first build the matrix that will guaranty the condition k\neq i,j
-
I
=repmat (O-Id) 1 1 N -
I_1=permute~I~[1~3~2]
has 0 where i=k and ones elsewhere -
I_2=permute~I~[3~2~1]
has 0 where j=k and ones elsewhere -
II=I_1.*I_2=[r_{i,j,k}]
is a matrix such thatk= i,j \implies r_{i,j,k}=0 ; r_{i,j,k}=1
otherwise.
Then Let
D_1=O * D(M) =[n_{jj}]_{i,j}
D_2=D(M) * O =[n_{ii}]_{i,j}
MI=(M ./ D_1) .* (M ./ D_2 )=[\frac{n_{ij}^2}{n_{ii}n_{jj}}]_{ij}
- W=repmat MI 1 1 N=
[\frac{n_{ij}^2}{n_{ii}n_{jj}}]_{ijk}
W_1=permute~W~[1~3~2]=[\frac{n_{ik}^2}{n_{ii}n_{kk}}]_{ijk}
W_2=permute~W~[3~2~1]=[\frac{n_{jk}^2}{n_{jj}n_{kk}}]_{ijk}
W'=min(W_1,W_2)=[min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})]_{ijk}
Z_1=sum3~(W'.*II)=[\Sigma_{k \neq i,j} min(\frac{n_{ik}^2}{n_{ii}n_{kk}},\frac{n_{jk}^2}{n_{jj}n_{kk}})]_{i,j}
Z_2=sum3~(W_1.*II)=[\Sigma_{k \neq i,j} \frac{n_{ik}^2}{n_{ii}n_{kk}}]_{i,j}
Then prox(M)=[prox(i,j)]_{i,j}=Z_1./Z_2
Log Distributionnal Similarity
In order to be aligned with the V3 and also to take into account terms Zipf law distribution,the final implementation of the distributional similarity should be a variant of the above.
We will consider
To=\Sigma_{i,j}n_{ij}
s_i=\Sigma_{i\neq k}n_{ik}
For i\neq j
, m_{i,j}=log(To*\frac{n_{ij}}{s_i*s_j})
sumMin_{ij}=\Sigma_{k \neq i,j} min(m_{ik},m_{jk})
SumM_i=\Sigma_{k\neq i,j} m_{ik}
And the similarity LogSim will be defined by:
LogSim(i,j)=\frac{sumMin_{ij}}{SumM_i}