Code source: https://github.com/RMSnow/KG-Course. The directory structure of the code is as follows:
```text
- EventExtraction/
  - data/
    - preprocess/        (data preprocessing)
      - CEC/             (raw data files)
      - dataset.json     (the experimental dataset)
      - preprocess.ipynb (preprocessing and data-analysis code)
    - data_load.ipynb    (builds the input/output matrices required by the models)
    - *.npy              (model inputs and outputs)
  - model/
    - img/               (model architecture diagrams auto-generated by Keras)
    - model/             (trained model weight files)
    - predict/           (matrices produced by model prediction)
    - dataset_split.py   (train/test split)
    - DMCNN.py           (DMCNN and CNN models)
    - TextCNN.py         (TextCNN model)
    - train.py           (helper functions for training and prediction)
    - *.ipynb            (training runs, model prediction, performance results, etc.)
  - readme.md
```
The data is in XML format, as shown below. Each `Event` tag records one annotated event, and the `Denoter` tag nested inside an `Event` marks the trigger word.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Body>
  <Title>澳大利亚2014年火灾:高温致一夜间发生几百起火灾</Title>
  <ReportTime type="absTime">2014年1月15日</ReportTime>
  <Content>
    <Paragraph>
      <Sentence>
        <Event eid="e1" type="thoughtevent">
          <Time tid="t1" type="relTime">1月15日,</Time>
          据外媒
          <Participant sid="s1">《俄罗斯报》</Participant>
          报道
          <Denoter type="statement" did="d1">称</Denoter>
          ,
        </Event>
        <Event eid="e2">
          位于
          <Location lid="l2">北半球</Location>
          的
          <Participant sid="s2">澳大利亚</Participant>
          现在正
          <Denoter did="d2" type="movement">处于</Denoter>
          <Object oid="o2">炎热的夏季</Object>
          ,
        </Event>
        <Event eid="e3">
          而近日也到了高温酷暑的时候,当地时间
          <Time tid="t3" type="relTime">1月14日晚</Time>
          ,
          <Location lid="l3">澳大利亚南部</Location>
          一夜间发生至少250起
          <Denoter type="emergency" did="d3">火灾</Denoter>
          。
        </Event>
      </Sentence>
      <Sentence>
        受炎热天气及雷雨天气影响,
        <Event eid="e4">
          <Location lid="l4">澳大利亚南部</Location>
          一夜间发生至少250起
          <Denoter did="d4" type="emergency">火灾</Denoter>
          ,灾情多集中在维多利亚州。
        </Event>
      </Sentence>
      <Sentence>
        <Event eid="e5">
          火灾发生后,
          <Participant sid="s5">救援人员</Participant>
          立即
          <Denoter did="d5" type="operation">展开</Denoter>
          <Object oid="o5">救灾行动</Object>
          。
        </Event>
      </Sentence>
      <Sentence>目前,大部分起火点火势已被控制。</Sentence>
    </Paragraph>
  </Content>
  <eRelation relType="Thoughtcontent" thoughtevent_eid="e1" thoughtcontent_eids="e2-e5"/>
  <eRelation relType="Follow" bevent_eid="e4" aevent_eid="e5"/>
</Body>
```
### Converting XML to JSON

#### The parse_xml_string function

The `parse_xml_string` function extracts all sentences from an XML file: it parses the document with BeautifulSoup, finds all `Sentence` tags, pulls the plain text out of each one, and collects the results into a list.
```python
import xmltodict
import json
import collections
import re
from bs4 import BeautifulSoup as BS
import os


def parse_xml_string(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        doc_string = f.read()

    soup = BS(doc_string)
    sentence_elements = soup.find_all('sentence')

    sentences = []
    for i, elem in enumerate(sentence_elements):
        elem = str(elem)
        elem = elem.replace('\n', '').replace('\t', '').replace('\r', '')
        pattern = re.compile('>[\u4e00-\u9fa50-9A-Za-z.,。!?:;“”"()《》]+<')
        sentence = ' '.join([x.replace('<', '').replace('>', '')
                             for x in pattern.findall(elem)])
        sentences.append(sentence)

    return sentences
```
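The regular expression only keeps character runs that sit directly between a `>` and a `<` and consist entirely of Chinese characters, digits, Latin letters, and a small set of punctuation marks. Spaces are not in the character class, so in the real CEC files any text that is left surrounded by leftover indentation spaces fails to match and silently disappears from the extracted sentence. A minimal sketch of the extraction step, using a hand-written fragment rather than a real CEC file:

```python
# Illustrative only: apply the same pattern used in parse_xml_string to a
# hand-written fragment (real <Sentence> elements also contain indentation,
# which changes which runs survive).
import re

elem = '<participant sid="s1">《俄罗斯报》</participant>报道<denoter type="statement" did="d1">称</denoter>'

pattern = re.compile('>[\u4e00-\u9fa50-9A-Za-z.,。!?:;“”"()《》]+<')
chunks = [x.replace('<', '').replace('>', '') for x in pattern.findall(elem)]
print(chunks)            # ['《俄罗斯报》', '报道', '称']
print(' '.join(chunks))  # 《俄罗斯报》 报道 称
```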
#### The parse_xml function

The `parse_xml` function converts the XML data into JSON and saves the result to disk.
```python
def parse_xml(file_path, save_dir):
    json_sentences = []

    with open(file_path, 'r', encoding='utf-8') as f:
        doc_string = f.read()
    doc = xmltodict.parse(doc_string)

    # xmltodict returns a single mapping when there is only one <Paragraph>;
    # normalize it to a list so the loop below always works.
    paragraphs = doc['Body']['Content']['Paragraph']
    if type(paragraphs) != list:
        paragraphs = collections.OrderedDict(paragraphs)
        assert type(paragraphs) == collections.OrderedDict
        paragraphs = [paragraphs]

    for i, paragraph in enumerate(paragraphs):
        try:
            sentences = paragraph['Sentence']
        except:
            continue
        if type(sentences) != list:
            sentences = [sentences]

        for j, sentence in enumerate(sentences):
            json_sentence = collections.OrderedDict()

            # A sentence without any <Event> child is kept as an empty record.
            try:
                events = sentence['Event']
            except:
                assert type(sentence) == str
                json_sentences.append(json_sentence)
                continue

            if type(events) != list:
                events = collections.OrderedDict(events)
                assert type(events) == collections.OrderedDict
                events = [events]

            for e, event in enumerate(events):
                json_event = collections.OrderedDict()
                for k, v in event.items():
                    if k in ['@eid', '#text']:
                        continue
                    if type(v) == collections.OrderedDict or type(v) == dict:
                        # Keep the whole Denoter dict; for other nested tags keep only the text.
                        if k == 'Denoter':
                            json_event[k] = v
                        else:
                            try:
                                json_event[k] = v['#text']
                            except:
                                continue
                    else:
                        json_event[k] = v
                json_sentence['event{}'.format(e)] = json_event

            json_sentences.append(json_sentence)

    # Attach the raw sentence text extracted by parse_xml_string.
    raw_sentences = parse_xml_string(file_path)
    try:
        assert len(raw_sentences) == len(json_sentences)
    except:
        # For one file the raw-sentence extraction yields an extra entry;
        # drop it so the two lists align.
        assert (file_path == './CEC/食物中毒/印度发生假酒集体中毒事件.xml'
                or file_path == './CEC/食物中毒\印度发生假酒集体中毒事件.xml')
        del raw_sentences[3]
        assert len(raw_sentences) == len(json_sentences)

    for i, json_sentence in enumerate(json_sentences):
        json_sentence['sentence'] = raw_sentences[i]

    file_name = file_path.split('/')[-1].split('.xml')[0]
    dir_path, _ = os.path.split(os.path.join(save_dir, "{}.json".format(file_name)))
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    with open('{}/{}.json'.format(save_dir, file_name), 'w', encoding='utf-8') as f:
        json.dump(json_sentences, f, indent=4, ensure_ascii=False, sort_keys=True)


# Collect every .xml file under ./CEC/ and convert it.
xml_files = []
for path, dir_list, file_list in os.walk('./CEC/'):
    for file_name in file_list:
        if '.xml' in file_name:
            xml_files.append(os.path.join(path, file_name))
len(xml_files)

for xml_file in xml_files:
    parse_xml(xml_file, save_dir='./CEC-xml2json/')
```
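The repeated `if type(...) != list` checks exist because xmltodict collapses a single child element into a dict-like mapping (an OrderedDict in older versions) but turns repeated children into a list. A minimal sketch with hand-written XML strings, not real CEC files:

```python
# Illustrative only: show xmltodict's single-child vs. repeated-child behavior,
# which is why parse_xml normalizes Paragraph / Sentence / Event to lists.
import xmltodict

one = xmltodict.parse('<Body><Sentence><Event eid="e1">称</Event></Sentence></Body>')
many = xmltodict.parse('<Body><Sentence>句子一</Sentence><Sentence>句子二</Sentence></Body>')

print(type(one['Body']['Sentence']))   # a dict-like mapping (single child element)
print(type(many['Body']['Sentence']))  # a list (repeated children)
```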
After converting XML to JSON, the data looks like this:
```json
[
    {
        "event0": {
            "@type": "thoughtevent",
            "Denoter": {
                "#text": "称",
                "@did": "d1",
                "@type": "statement"
            },
            "Participant": "《俄罗斯报》",
            "Time": "1月15日,"
        },
        "event1": {
            "Denoter": {
                "#text": "处于",
                "@did": "d2",
                "@type": "movement"
            },
            "Location": "北半球",
            "Object": "炎热的夏季",
            "Participant": "澳大利亚"
        },
        "event2": {
            "Denoter": {
                "#text": "火灾",
                "@did": "d3",
                "@type": "emergency"
            },
            "Location": "澳大利亚南部",
            "Time": "1月14日晚"
        },
        "sentence": "1月15日, 《俄罗斯报》 称 , 北半球 澳大利亚 处于 炎热的夏季 , 1月14日晚 澳大利亚南部 火灾 。"
    },
    {
        "event0": {
            "Denoter": {
                "#text": "火灾",
                "@did": "d4",
                "@type": "emergency"
            },
            "Location": "澳大利亚南部"
        },
        "sentence": "澳大利亚南部 火灾 ,灾情多集中在维多利亚州。"
    },
    {
        "event0": {
            "Denoter": {
                "#text": "展开",
                "@did": "d5",
                "@type": "operation"
            },
            "Object": "救灾行动",
            "Participant": "救援人员"
        },
        "sentence": "救援人员 展开 救灾行动 。"
    },
    {
        "sentence": "目前,大部分起火点火势已被控制。"
    }
]
```
First, collect all the JSON files and append every sentence from each file to the `sentences` list:
```python
import json
import pandas as pd
import os
import jieba
import random

json_files = []
for path, dir_list, file_list in os.walk('./CEC-xml2json/'):
    for file_name in file_list:
        if '.json' in file_name:
            json_files.append(os.path.join(path, file_name))
len(json_files)

sentences = []
for json_file in json_files:
    with open(json_file, 'r', encoding='utf-8') as f:
        sentences += json.load(f)
len(sentences)
```
#### cut_sentence

The `cut_sentence` function segments a sentence into words with jieba; the resulting tokens are separated by spaces.
```python
def cut_sentence(text):
    cut_text = ''
    texts = text.split()
    for t in texts:
        cut_text += ' '.join(list(jieba.cut(t))) + ' '
    return cut_text[:-1]


print(sentences[2]['sentence'])
cut_sentence(sentences[2]['sentence'])
```
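For reference, running `cut_sentence` on the sample sentence shown further below produces the tokenization that is later stored in `sentence_words` (the exact segmentation may vary slightly with the jieba version and dictionary):

```python
# Example taken from the dataset sample shown later in this post.
text = '到 晚上9 时左右 , 事故现场 基本 清理 完毕 , 104 国道小溪岭段 恢复了 通车 。'
print(cut_sentence(text))
# 到 晚上 9 时 左右 , 事故现场 基本 清理 完毕 , 104 国道 小溪 岭段 恢复 了 通车 。
```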
### Filtering usable sentences

Next we filter out the usable sentences. For each sentence we keep the trigger words together with their event types and arguments. Only `Participant` or `Object` is kept as an argument, with `Participant` taking priority; `Object` is used only when no `Participant` is present. Both are stored under the common name `event_arguments`.
```python
valid_sentences = []

for sentence in sentences:
    if len(sentence) == 1:
        continue

    valid_sentence = dict()
    text = sentence['sentence']
    cut_text = cut_sentence(text)
    words = cut_text.split()

    valid_sentence['sentence'] = text
    valid_sentence['sentence_words'] = cut_text

    triggers = []
    for key, value in sentence.items():
        if 'event' not in key:
            continue

        trigger = dict()
        try:
            trigger['event'] = value['Denoter']['@type']
            if trigger['event'] == 'thoughtevent':
                continue
            trigger['event_trigger'] = value['Denoter']['#text']
        except:
            continue
        if trigger['event_trigger'] not in words:
            continue

        if 'Participant' in value.keys():
            participants = value['Participant']
            if type(participants) == list:
                for participant in participants:
                    if participant not in words:
                        continue
                    if 'event_arguments' not in trigger.keys():
                        trigger['event_arguments'] = [participant]
                    else:
                        trigger['event_arguments'].append(participant)
            else:
                assert type(participants) == str
                if participants not in words:
                    continue
                trigger['event_arguments'] = [participants]
        elif 'Object' in value.keys():
            participants = value['Object']
            if participants not in words:
                continue
            trigger['event_arguments'] = [participants]

        triggers.append(trigger)

    if len(triggers) == 0:
        continue
    valid_sentence['triggers'] = triggers
    valid_sentences.append(valid_sentence)

len(sentences), len(valid_sentences)

with open('./dataset.json', 'w', encoding='utf-8') as f:
    json.dump(valid_sentences, f, sort_keys=True, indent=4, ensure_ascii=False)
```
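Note two filtering choices baked into this loop: sentences that contain nothing but raw text (`len(sentence) == 1`, i.e. no annotated events) are skipped entirely, and a trigger or argument is only kept when its surface form appears as one of the jieba tokens of the sentence, presumably so that labels can later be aligned with word positions when building the model inputs.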
A valid sentence looks like this:
```python
{
    'sentence': '到 晚上9 时左右 , 事故现场 基本 清理 完毕 , 104 国道小溪岭段 恢复了 通车 。',
    'sentence_words': '到 晚上 9 时 左右 , 事故现场 基本 清理 完毕 , 104 国道 小溪 岭段 恢复 了 通车 。',
    'triggers': [
        {
            'event': 'operation',
            'event_trigger': '清理',
            'event_arguments': ['事故现场']
        },
        {
            'event': 'stateChange',
            'event_trigger': '通车'
        }
    ]
}
```
### Data statistics

Load the data from the dataset file:
```python
import json
import pandas as pd

with open('./dataset.json', 'r', encoding='utf-8') as f:
    sentences = json.load(f)
len(sentences)
```
#### Number of triggers per sentence

`triggers_num` records the number of trigger words in each sentence. Converting it to a DataFrame shows that a sentence contains about 2 triggers on average, with at least 1 and at most 8.
```python
triggers_num = [len(s['triggers']) for s in sentences]
len(triggers_num)

triggers_df = pd.DataFrame({'triggers_num': triggers_num})
triggers_df.describe()
```
|       | triggers_num |
| ----- | ------------ |
| count | 1665.000000  |
| mean  | 2.045646     |
| std   | 1.298235     |
| min   | 1.000000     |
| 25%   | 1.000000     |
| 50%   | 2.000000     |
| 75%   | 3.000000     |
| max   | 8.000000     |
#### 1/1 ratio (sentences containing exactly one event)

```python
len(triggers_df[triggers_df['triggers_num'] == 1])

len(triggers_df[triggers_df['triggers_num'] == 1]) / len(triggers_df)
```
#### 1/N ratio (sentences containing more than one event)

```python
len(triggers_df[triggers_df['triggers_num'] > 1])

len(triggers_df[triggers_df['triggers_num'] > 1]) / len(triggers_df)
```
#### Event type statistics: 7 types

There are 7 distinct event types; the code below also counts how many times each type occurs.
```python
event_type = dict()
for sent in sentences:
    for trigger in sent['triggers']:
        t = trigger['event']
        if t not in event_type.keys():
            event_type[t] = 1
        else:
            event_type[t] += 1

len(event_type)
event_type
```
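The same tally can be written more compactly with `collections.Counter`; this is just an equivalent sketch, not what the original notebook uses:

```python
# Equivalent to the loop above: count event types across all triggers.
from collections import Counter

event_type_counter = Counter(trigger['event']
                             for sent in sentences
                             for trigger in sent['triggers'])
print(len(event_type_counter))  # 7 event types
print(event_type_counter)
```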
#### Argument count statistics

`arguments_num` stores the number of arguments attached to each trigger. Converting it to a DataFrame shows that each trigger has 0.4495 arguments on average.
```python
arguments_num = []
for sent in sentences:
    for trigger in sent['triggers']:
        if 'event_arguments' not in trigger.keys():
            arguments_num.append(0)
        else:
            arguments_num.append(len(trigger['event_arguments']))

len(arguments_num)

arguments_df = pd.DataFrame({'num': arguments_num})
arguments_df.describe()
```
|       | num         |
| ----- | ----------- |
| count | 3406.000000 |
| mean  | 0.449501    |
| std   | 0.497516    |
| min   | 0.000000    |
| 25%   | 0.000000    |
| 50%   | 0.000000    |
| 75%   | 1.000000    |
| max   | 1.000000    |
Counting the values: 1531 triggers have exactly one argument, while 1875 triggers have none.
```python
arguments_df['num'].value_counts()
```
Since every trigger has at most one argument, the type of `event_arguments` is converted from `list` to `str`:
```python
for sent in sentences:
    for trigger in sent['triggers']:
        if 'event_arguments' in trigger.keys():
            trigger['event_arguments'] = trigger['event_arguments'][0]

with open('./dataset.json', 'w', encoding='utf-8') as f:
    json.dump(sentences, f, sort_keys=True, indent=4, ensure_ascii=False)
```
#### Maximum sentence length: 85

Computing the maximum sentence length gives 85 (measured in words).
```python
nums = []
for piece in sentences:
    nums.append(len(piece['sentence_words'].split()))

max(nums), nums.index(max(nums))
```
### Train / Test split

`triggers_num` holds the number of triggers in each sentence, and `y` holds the sentence indices.
```python
import json
from sklearn.model_selection import train_test_split
import numpy as np

with open('./dataset.json', 'r', encoding='utf-8') as f:
    dataset = json.load(f)
len(dataset)

triggers_num = [len(p['triggers']) for p in dataset]
y = np.arange(len(dataset))
y.shape
```
Split the dataset with `train_test_split`, using 20% of it as the test set:
```python
train_index, test_index = train_test_split(y, test_size=0.2, stratify=triggers_num, random_state=0)
train_index.shape, test_index.shape
```
The `stratify` parameter of `train_test_split` keeps the distribution of the stratification labels (here, the number of triggers per sentence) in the train and test splits consistent with the full dataset.
```python
def check_triggers_num(indexes):
    num = np.array(triggers_num)
    chosen_num = num[indexes]

    events = list(set(chosen_num))
    events.sort()

    num2len = dict()
    for e in events:
        num2len[e] = len(chosen_num[chosen_num == e])
    print(num2len)


check_triggers_num(y)
# {1: 765, 2: 429, 3: 249, 4: 131, 5: 55, 6: 20, 7: 11, 8: 5}

check_triggers_num(train_index)
# {1: 612, 2: 343, 3: 199, 4: 105, 5: 44, 6: 16, 7: 9, 8: 4}

check_triggers_num(test_index)
# {1: 153, 2: 86, 3: 50, 4: 26, 5: 11, 6: 4, 7: 2, 8: 1}
```
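As the three dictionaries show, every trigger-count bucket is split roughly 4 : 1 between train and test (for example 612 vs. 153 single-trigger sentences and 343 vs. 86 two-trigger sentences), which is exactly the behavior that `stratify` guarantees.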