[Solved] Can Pandas DataFrame efficiently calculate PMI (Pointwise Mutual Information)?
I’ve looked around and, surprisingly, haven’t found a framework or existing code that makes it easy to calculate Pointwise Mutual Information (Wiki PMI), even though libraries like scikit-learn offer a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas!
My problem:
I have a DataFrame with a series of [x,y] examples in each row and wish to calculate a series of PMI values as per the formula (or a simpler one):
PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )
So far my approach is:
def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )
Would this give a valid and/or efficient computation?
Sample I/O:
x y PMI
0 0 0.176
0 0 0.176
0 1 0
Solution #1:
I would add three bits.
import numpy as np

def pmi(dff, x, y):
    df = dff.copy()
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']) )
    return df
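- df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') should be used so that only the count is returned (a single Series rather than one count column per remaining column).
- np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) should be used so that probabilities, not raw counts, enter the formula. With N = len(df.index), p(x,y) / (p(x) * p(y)) = (f_xy / N) / ((f_x / N) * (f_y / N)) = N * f_xy / (f_x * f_y).
- Work on a copy of the DataFrame rather than modifying the input DataFrame.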
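As a quick sanity check of the three points above, here is a minimal sketch that runs the same steps on a small made-up DataFrame. The data and the names toy and result are hypothetical and chosen only for illustration; the natural logarithm is assumed, as in np.log.

```python
import numpy as np
import pandas as pd

def pmi(dff, x, y):
    # Same steps as the function above, applied to a copy of the input.
    df = dff.copy()
    n = len(df.index)
    f_x = df.groupby(x)[x].transform('count')        # rows sharing this x value
    f_y = df.groupby(y)[y].transform('count')        # rows sharing this y value
    f_xy = df.groupby([x, y])[x].transform('count')  # rows sharing this (x, y) pair
    df['pmi'] = np.log(n * f_xy / (f_x * f_y))
    return df

# Hypothetical toy data: 4 rows, 3 distinct (x, y) pairs.
toy = pd.DataFrame({'x': ['a', 'a', 'b', 'b'],
                    'y': ['u', 'u', 'u', 'v']})
result = pmi(toy, 'x', 'y')
print(result['pmi'].round(3).tolist())
# [0.288, 0.288, -0.405, 0.693]
```

For the first row, N = 4, f_x = 2, f_y = 3 and f_xy = 2, so PMI = log(4 * 2 / (2 * 3)) = log(4/3), which is about 0.288. The input frame toy is left without a pmi column because the function works on a copy.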
Solution #2:
Solution (with a scikit-learn KDE alternative as well).
Please comment for review.
import numpy as np
from sklearn.neighbors import KernelDensity

# pmi function (count-based); adds a 'pmi' column to df in place
def pmi_func(df, x, y):
    # select a single column so that transform returns one Series of counts
    freq_x = df.groupby(x)[x].transform('count')
    freq_y = df.groupby(y)[y].transform('count')
    freq_x_y = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log( len(df.index) * (freq_x_y / (freq_x * freq_y)) )

# pmi with kernel density estimation; adds a 'pmi' column to df in place
def kernel_pmi_func(df, x, y):
    # reshape data
    x = np.array(df[x])
    y = np.array(df[y])
    x_y = np.stack((x, y), axis=-1)
    # kernel density estimation
    kde_x = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x[:, np.newaxis])
    kde_y = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(y[:, np.newaxis])
    kde_x_y = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x_y)
    # score: score_samples returns log-densities, so exponentiate them
    p_x = np.exp(kde_x.score_samples(x[:, np.newaxis]))
    p_y = np.exp(kde_y.score_samples(y[:, np.newaxis]))
    p_x_y = np.exp(kde_x_y.score_samples(x_y))
    # plain arrays are assigned by position, so df's index does not matter
    df['pmi'] = np.log( p_x_y / (p_x * p_y) )
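The count-based functions in both solutions produce one PMI value per row, so repeated (x, y) pairs repeat the same value. If only one value per distinct pair is needed, a possible variant is to aggregate the counts first. The sketch below is one way to do that, not part of the original answers: the name pmi_table and the toy data are hypothetical, and it assumes the two columns contain no missing values.

```python
import numpy as np
import pandas as pd

def pmi_table(df, x, y):
    """Return one row per distinct (x, y) pair with its PMI (natural log).

    Assumes df[x] and df[y] contain no missing values.
    """
    n = len(df)
    # Joint counts: one row per observed (x, y) pair.
    pairs = df.groupby([x, y]).size().rename('f_xy').reset_index()
    # Marginal counts, looked up for each pair by value.
    f_x = pairs[x].map(df[x].value_counts())
    f_y = pairs[y].map(df[y].value_counts())
    pairs['pmi'] = np.log(n * pairs['f_xy'] / (f_x * f_y))
    return pairs

# Hypothetical toy data: 4 rows, 3 distinct (x, y) pairs.
toy = pd.DataFrame({'x': ['a', 'a', 'b', 'b'],
                    'y': ['u', 'u', 'u', 'v']})
table = pmi_table(toy, 'x', 'y')
print(table)
```

The result has three rows, one each for (a, u), (b, u) and (b, v). If the per-row values are still wanted, the table can be joined back onto the original frame with a merge on the two columns.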
The answers/resolutions are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.