Manual n-gram Implementation
import math
from collections import defaultdict
corpus = r"""ppap ppap ppap
I have a pen , I have an apple .
( Eh~ ) Apple pen !
I have a pen , I have a pineapple .
( Eh~ ) pineapple pen !
apple pen ~ pineapple pen ( Eh~ )
pen pineapple Apple pen !
pen pineapple Apple pen .
you don't have an apple .
I don't have a pen .
he has a pineapple .
he doesn't have an apple !
"""

class LessThanWin(object):
    # returned by Que.put while the window still holds fewer than `maxsize` words
    def __init__(self, s: list):
        self.present = s

    def size(self):
        return len(self.present)

    def hash(self):
        return ' '.join(self.present)

class NNode(object):
    # returned by Que.put once the window is full: `given` is the (n-1)-word context, `present` the current word
    def __init__(self, s: list):
        self.given = s[:-1]
        self.present = [s[-1]]

    def size(self):
        return len(self.present)

    def hash_given(self):
        return ' '.join(self.given)

    def hash(self):
        return ' '.join(self.given + self.present)

class Que(object):
    # a fixed-size sliding window over the words of a sentence
    def __init__(self, maxsize: int):
        self.q = []
        self.maxsize = maxsize
        self.size = 0

    def put(self, item):
        self.size += 1
        self.q.append(item)
        if self.size < self.maxsize:
            return LessThanWin(self.q)
        else:
            if self.size > self.maxsize:
                self.q = self.q[1:]
                self.size -= 1
            return NNode(self.q)
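
# Illustration of how Que slides a window of size 3 over a sentence:
#   q = Que(3)
#   q.put('I')     -> LessThanWin(['I'])                            (fewer than 3 words so far)
#   q.put('have')  -> LessThanWin(['I', 'have'])                    (size == 2 == window-1: an initial (n-1)-gram)
#   q.put('a')     -> NNode with given=['I', 'have'], present=['a']
#   q.put('pen')   -> NNode with given=['have', 'a'], present=['pen']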

class model_args(object):
    def __init__(self, counter_given, counter_present, counter_present_n_1, length, win):
        self.counter_given = counter_given
        self.counter_present_n_1 = counter_present_n_1
        self.counter_present = counter_present
        self.char_len = length  # vocabulary size (number of distinct words)
        self.win = win
        self.sum_n_1 = sum(counter_present_n_1.values())  # total number of initial (n-1)-grams

def train(text: str, window: int):
    counter_present = defaultdict(int)  # counts of full n-grams
    # counts of the initial (n-1)-gram of each sentence; I skip the step of adding n-1 BOS tokens and
    # only compute the share of this initial (n-1)-gram among all initial (n-1)-grams -- it is
    # equivalent, see the "Proof" section
    counter_present_n_1 = defaultdict(int)
    counter_given = defaultdict(int)  # counts of (n-1)-word contexts (the denominators for the n-gram counts)
    all_char = set()

    for line in text.split('\n'):
        chars = line.strip().split()  # chars here are actually words; "word" can also mean "sentence", which confused me, so I use chars for words
        q = Que(window)
        for ch in chars:
            all_char.add(ch)
            node = q.put(ch)
            if type(node).__name__ == 'LessThanWin':
                if node.size() == window - 1:
                    counter_present_n_1[node.hash()] += 1
            else:
                counter_given[node.hash_given()] += 1
                counter_present[node.hash()] += 1

    return model_args(counter_given, counter_present, counter_present_n_1, len(all_char), window)
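
# For win = 3 on this corpus, counter_present_n_1 holds the first two words of each line,
# e.g. counter_present_n_1['I have'] == 2 and counter_present_n_1['pen pineapple'] == 2,
# and sum_n_1 == 12 (one initial 2-gram per non-empty line).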

def predict(s: str, model: model_args):
    # this line just reads the parameters out of model_args, nothing more
    counter_given, counter_present, counter_present_n_1, charV, window = \
        model.counter_given, model.counter_present, model.counter_present_n_1, model.char_len, model.win
    chars = s.split()
    q = Que(window)
    p = 0
    for ch in chars:
        node = q.put(ch)
        if type(node).__name__ == 'LessThanWin':
            if node.size() == window - 1:
                p += math.log10((counter_present_n_1[node.hash()] + 1.0) / (model.sum_n_1 + charV))  # add-one smoothing
        else:
            p += math.log10((counter_present[node.hash()] + 1.0) / (counter_given[node.hash_given()] + charV))

    return p
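
# What predict computes, in log10 space:
#   log p(sentence) = log p(first n-1 words at the sentence start)
#                     + sum over the remaining words of log p(word | previous n-1 words)
# with each probability add-one smoothed using the vocabulary size charV:
#   p(prefix)         = (count(prefix) + 1) / (sum of all initial (n-1)-gram counts + charV)
#   p(word | context) = (count(context word) + 1) / (count(context) + charV)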

if __name__ == "__main__":
    test_list = [
        'I have a pen .',
        'you have a pen .',
        'I am a pen .',
        'he are a apple .',
        'he has a pen .',
        'ppap ppap ppap']
    win = 3  # win is the n of the n-gram; do not make it too large, since my corpus is very small
    model = train(corpus, win)

    for sentence in test_list:
        p = predict(sentence, model)
        print(sentence, ' log10(p)=', p)

Proof

What I want to prove is that the program's approach of "only computing the share of this initial (n-1)-gram among all initial (n-1)-grams" is equivalent to prepending n-1 <BOS> tokens.

First look at the 2-gram case (using the sentence "you look so good" as an example):
p(you look so good) = p(you|<BOS>) * p(look|you) * p(so|look) * p(good|so) * p(<EOS>|good)

Here p(you|<BOS>) is the probability of "you" appearing after <BOS>; obviously every sentence begins with <BOS>, so p(you|<BOS>) = p(you), the probability that "you" appears at the beginning of a sentence.

In the 3-gram case, if we add two <BOS> tokens:
p(you|<BOS> <BOS>) * p(look|<BOS> you) = p(you) * p(look|you). To explain: p(you) is the probability that "you" appears at the beginning of a sentence, and p(look|you) is the probability that "look" comes next given that the sentence begins with "you";
= p(you look), i.e. the probability that "you look" appears at the beginning of a sentence.
So for an n-gram grammar, we do not need to prepend n-1 <BOS> tokens to the sentence; it is enough to compute directly the probability that its (n-1)-gram prefix appears at the beginning of a sentence.
For example:
pen pineapple Apple pen !
pen pineapple Apple pen .
you don't have an apple .
I don't have a pen .
he has a pineapple .
he doesn't have an apple !

In the 3-gram case, p(pen pineapple) = 2/6; p(pen pineapple Apple pen !) = p(pen pineapple) * p(Apple | pen pineapple) * p(pen | pineapple Apple) * p(! | Apple pen)

I have not considered the role of <EOS>.
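
To make the equivalence concrete, here is a minimal sketch (not part of the original program; the names p_bos and p_direct are just illustrative) that checks it numerically on the six sentences above: the BOS-padded product p(pen | <BOS> <BOS>) * p(pineapple | <BOS> pen) comes out equal to the directly counted fraction of sentences that start with "pen pineapple", namely 2/6.

from collections import defaultdict

sentences = [s.split() for s in [
    "pen pineapple Apple pen !",
    "pen pineapple Apple pen .",
    "you don't have an apple .",
    "I don't have a pen .",
    "he has a pineapple .",
    "he doesn't have an apple !",
]]

BOS = '<BOS>'
tri = defaultdict(int)  # 3-gram counts over BOS-padded sentences
bi = defaultdict(int)   # 2-gram context counts over BOS-padded sentences

for s in sentences:
    padded = [BOS, BOS] + s
    for i in range(2, len(padded)):
        tri[tuple(padded[i - 2:i + 1])] += 1
        bi[tuple(padded[i - 2:i])] += 1

w1, w2 = 'pen', 'pineapple'

# padded version: p(w1 | <BOS> <BOS>) * p(w2 | <BOS> w1)
p_bos = (tri[(BOS, BOS, w1)] / bi[(BOS, BOS)]) * (tri[(BOS, w1, w2)] / bi[(BOS, w1)])

# direct version: fraction of sentences whose first two words are "w1 w2"
p_direct = sum(s[:2] == [w1, w2] for s in sentences) / len(sentences)

print(p_bos, p_direct)  # both equal 2/6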
