The code comes from https://github.com/RMSnow/KG-Course. The directory structure is as follows:
```
- EventExtraction/
    - data/
        - preprocess/            (data preprocessing)
            - CEC/               (raw data files)
            - dataset.json       (the experimental dataset)
            - preprocess.ipynb   (preprocessing and data-analysis code)
        - data_load.ipynb        (builds the input/output matrices the model needs)
        - *.npy                  (model inputs and outputs)
    - model/
        - img/                   (model architecture diagrams generated automatically by keras)
        - model/                 (trained model weight files)
        - predict/               (matrices produced by model prediction)
        - dataset_split.py       (train/test split)
        - DMCNN.py               (the DMCNN and CNN models)
        - TextCNN.py             (the TextCNN model)
        - train.py               (functions needed for training and prediction)
        - *.ipynb                (training runs, model prediction, performance results, etc.)
    - readme.md
```
The data is in XML format, as shown below. Each Event tag records one annotated event; within an Event, the Denoter tag marks the trigger word.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Body>
    <Title>澳大利亚2014年火灾:高温致一夜间发生几百起火灾</Title>
    <ReportTime type="absTime">2014年1月15日</ReportTime>
    <Content>
        <Paragraph>
            <Sentence>
                <Event eid="e1" type="thoughtevent">
                    <Time tid="t1" type="relTime">1月15日,</Time>据外媒
                    <Participant sid="s1">《俄罗斯报》</Participant>报道
                    <Denoter type="statement" did="d1">称</Denoter>,
                </Event>
                <Event eid="e2">位于
                    <Location lid="l2">北半球</Location>的
                    <Participant sid="s2">澳大利亚</Participant>现在正
                    <Denoter did="d2" type="movement">处于</Denoter>
                    <Object oid="o2">炎热的夏季</Object>,
                </Event>
                <Event eid="e3">而近日也到了高温酷暑的时候,当地时间
                    <Time tid="t3" type="relTime">1月14日晚</Time>,
                    <Location lid="l3">澳大利亚南部</Location>一夜间发生至少250起
                    <Denoter type="emergency" did="d3">火灾</Denoter>。
                </Event>
            </Sentence>
            <Sentence>受炎热天气及雷雨天气影响,
                <Event eid="e4">
                    <Location lid="l4">澳大利亚南部</Location>一夜间发生至少250起
                    <Denoter did="d4" type="emergency">火灾</Denoter>,灾情多集中在维多利亚州。
                </Event>
            </Sentence>
            <Sentence>
                <Event eid="e5">火灾发生后,
                    <Participant sid="s5">救援人员</Participant>立即
                    <Denoter did="d5" type="operation">展开</Denoter>
                    <Object oid="o5">救灾行动</Object>。
                </Event>
            </Sentence>
            <Sentence>目前,大部分起火点火势已被控制。</Sentence>
        </Paragraph>
    </Content>
    <eRelation relType="Thoughtcontent" thoughtevent_eid="e1" thoughtcontent_eids="e2-e5"/>
    <eRelation relType="Follow" bevent_eid="e4" aevent_eid="e5"/>
</Body>
```
## Converting XML to JSON

### The parse_xml_string function

The parse_xml_string function extracts every sentence from an XML file: it parses the document with BeautifulSoup, finds all Sentence tags, and then pulls the text content out of each tag into a list.
```python
import xmltodict
import json
import collections
import re
from bs4 import BeautifulSoup as BS
import os

def parse_xml_string(file_path):
    """Extract the raw text of every <Sentence> element in a CEC XML file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        doc_string = f.read()

    # html.parser lowercases tag names, so <Sentence> is matched by 'sentence'
    soup = BS(doc_string, 'html.parser')
    sentence_elements = soup.find_all('sentence')

    sentences = []
    for i, elem in enumerate(sentence_elements):
        elem = str(elem)
        # Strip the layout whitespace left over from the XML formatting
        elem = elem.replace('\n', '').replace('\t', '').replace('\r', '')
        # Keep only text between tags: Chinese characters, digits,
        # Latin letters, and common punctuation
        pattern = re.compile('>[\u4e00-\u9fa50-9A-Za-z.,。!?:;“”"()《》]+<')
        sentence = ' '.join([x.replace('<', '').replace('>', '') for x in pattern.findall(elem)])
        sentences.append(sentence)

    return sentences
```
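As a quick usage sketch, the function can be run on a single document; the path below is one of the CEC files referenced later in parse_xml and is only illustrative, so adjust it to your local copy:

```python
raw_sentences = parse_xml_string('./CEC/食物中毒/印度发生假酒集体中毒事件.xml')
print(len(raw_sentences))
print(raw_sentences[0])
```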
### The parse_xml function

The parse_xml function converts the XML data to JSON and saves the result to disk.
```python
def parse_xml(file_path, save_dir):
    json_sentences = []

    with open(file_path, 'r', encoding='utf-8') as f:
        doc_string = f.read()

    doc = xmltodict.parse(doc_string)
    paragraphs = doc['Body']['Content']['Paragraph']
    # A document with a single paragraph parses as a dict rather than a list
    if type(paragraphs) != list:
        paragraphs = collections.OrderedDict(paragraphs)
        assert type(paragraphs) == collections.OrderedDict
        paragraphs = [paragraphs]

    for i, paragraph in enumerate(paragraphs):
        try:
            sentences = paragraph['Sentence']
        except:
            continue

        if type(sentences) != list:
            sentences = [sentences]

        for j, sentence in enumerate(sentences):
            json_sentence = collections.OrderedDict()

            try:
                events = sentence['Event']
            except:
                # A sentence without annotated events is plain text
                assert type(sentence) == str
                json_sentences.append(json_sentence)
                continue

            if type(events) != list:
                events = collections.OrderedDict(events)
                assert type(events) == collections.OrderedDict
                events = [events]

            for e, event in enumerate(events):
                json_event = collections.OrderedDict()

                for k, v in event.items():
                    if k in ['@eid', '#text']:
                        continue

                    if type(v) == collections.OrderedDict or type(v) == dict:
                        if k == 'Denoter':
                            # Keep the full Denoter dict: its @type is the event type
                            json_event[k] = v
                        else:
                            try:
                                json_event[k] = v['#text']
                            except:
                                continue
                    else:
                        json_event[k] = v

                json_sentence['event{}'.format(e)] = json_event

            json_sentences.append(json_sentence)

    raw_sentences = parse_xml_string(file_path)
    try:
        assert len(raw_sentences) == len(json_sentences)
    except:
        # One file in the corpus yields an extra raw sentence; drop it by hand
        assert file_path == './CEC/食物中毒/印度发生假酒集体中毒事件.xml' or file_path == './CEC/食物中毒\印度发生假酒集体中毒事件.xml'
        del raw_sentences[3]
        assert len(raw_sentences) == len(json_sentences)

    for i, json_sentence in enumerate(json_sentences):
        json_sentence['sentence'] = raw_sentences[i]

    file_name = file_path.split('/')[-1].split('.xml')[0]
    dir_path, _ = os.path.split(os.path.join(save_dir, "{}.json".format(file_name)))
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

    with open('{}/{}.json'.format(save_dir, file_name), 'w', encoding='utf-8') as f:
        json.dump(json_sentences, f, indent=4, ensure_ascii=False, sort_keys=True)


xml_files = []
for path, dir_list, file_list in os.walk('./CEC/'):
    for file_name in file_list:
        if '.xml' in file_name:
            xml_files.append(os.path.join(path, file_name))
len(xml_files)

for xml_file in xml_files:
    parse_xml(xml_file, save_dir='./CEC-xml2json/')
```
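Since the output directory is flat, a quick check like the following sketch (not part of the original code) can confirm the conversion covered every document; the counts may differ if two CEC documents share a file name:

```python
# Count the JSON files produced by the conversion above
converted = []
for path, dir_list, file_list in os.walk('./CEC-xml2json/'):
    converted += [f for f in file_list if f.endswith('.json')]
print(len(converted), len(xml_files))
```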
After the XML-to-JSON conversion, the data looks like this:
```json
[
    {
        "event0": {
            "@type": "thoughtevent",
            "Denoter": {
                "#text": "称",
                "@did": "d1",
                "@type": "statement"
            },
            "Participant": "《俄罗斯报》",
            "Time": "1月15日,"
        },
        "event1": {
            "Denoter": {
                "#text": "处于",
                "@did": "d2",
                "@type": "movement"
            },
            "Location": "北半球",
            "Object": "炎热的夏季",
            "Participant": "澳大利亚"
        },
        "event2": {
            "Denoter": {
                "#text": "火灾",
                "@did": "d3",
                "@type": "emergency"
            },
            "Location": "澳大利亚南部",
            "Time": "1月14日晚"
        },
        "sentence": "1月15日, 《俄罗斯报》 称 , 北半球 澳大利亚 处于 炎热的夏季 , 1月14日晚 澳大利亚南部 火灾 。"
    },
    {
        "event0": {
            "Denoter": {
                "#text": "火灾",
                "@did": "d4",
                "@type": "emergency"
            },
            "Location": "澳大利亚南部"
        },
        "sentence": "澳大利亚南部 火灾 ,灾情多集中在维多利亚州。"
    },
    {
        "event0": {
            "Denoter": {
                "#text": "展开",
                "@did": "d5",
                "@type": "operation"
            },
            "Object": "救灾行动",
            "Participant": "救援人员"
        },
        "sentence": "救援人员 展开 救灾行动 。"
    },
    {
        "sentence": "目前,大部分起火点火势已被控制。"
    }
]
```
First collect all the JSON files and gather every sentence from every file into the sentences list:
```python
import json
import pandas as pd
import os
import jieba
import random

json_files = []
for path, dir_list, file_list in os.walk('./CEC-xml2json/'):
    for file_name in file_list:
        if '.json' in file_name:
            json_files.append(os.path.join(path, file_name))
len(json_files)

sentences = []
for json_file in json_files:
    with open(json_file, 'r', encoding='utf-8') as f:
        sentences += json.load(f)
len(sentences)
```
The cut_sentence function segments a sentence into words with jieba; the resulting tokens are separated by single spaces.
```python
def cut_sentence(text):
    """Segment each space-separated span with jieba, joining tokens by spaces."""
    cut_text = ''
    texts = text.split()
    for t in texts:
        cut_text += ' '.join(list(jieba.cut(t))) + ' '
    return cut_text[:-1]

print(sentences[2]['sentence'])
cut_sentence(sentences[2]['sentence'])
```
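Splitting on the existing spaces first keeps each annotated span intact, so a trigger such as 展开 can later be matched against whole tokens. For the sample sentence from the JSON above, the expected behaviour looks roughly like this (illustrative only; actual segmentation depends on jieba's dictionary and version):

```python
cut_sentence('救援人员 展开 救灾行动 。')
# e.g. '救援 人员 展开 救灾 行动 。'
```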
## Filtering usable sentences

Now filter the sentences that can actually be used. For each sentence we keep the trigger words together with their event type and argument. Only a Participant or an Object is kept as the argument, with Participant taking priority: the Object is used only when no Participant is present. Both are referred to uniformly as event_arguments.
```python
valid_sentences = []

for sentence in sentences:
    if len(sentence) == 1:
        # Only a 'sentence' key: no annotated event, skip
        continue

    valid_sentence = dict()

    text = sentence['sentence']
    cut_text = cut_sentence(text)
    words = cut_text.split()
    valid_sentence['sentence'] = text
    valid_sentence['sentence_words'] = cut_text

    triggers = []
    for key, value in sentence.items():
        if 'event' not in key:
            continue

        trigger = dict()

        try:
            trigger['event'] = value['Denoter']['@type']
            # Discard reporting events such as "X stated that ..."
            if trigger['event'] == 'thoughtevent':
                continue
            trigger['event_trigger'] = value['Denoter']['#text']
        except:
            continue

        # The trigger must survive word segmentation as a single token
        if trigger['event_trigger'] not in words:
            continue

        if 'Participant' in value.keys():
            participants = value['Participant']
            if type(participants) == list:
                for participant in participants:
                    if participant not in words:
                        continue
                    if 'event_arguments' not in trigger.keys():
                        trigger['event_arguments'] = [participant]
                    else:
                        trigger['event_arguments'].append(participant)
            else:
                assert type(participants) == str
                if participants not in words:
                    continue
                trigger['event_arguments'] = [participants]
        elif 'Object' in value.keys():
            participants = value['Object']
            if participants not in words:
                continue
            trigger['event_arguments'] = [participants]

        triggers.append(trigger)

    if len(triggers) == 0:
        continue

    valid_sentence['triggers'] = triggers
    valid_sentences.append(valid_sentence)

len(sentences), len(valid_sentences)

with open('./dataset.json', 'w', encoding='utf-8') as f:
    json.dump(valid_sentences, f, sort_keys=True, indent=4, ensure_ascii=False)
```
A valid sentence has the following format:
```python
{
    'sentence': '到 晚上9 时左右 , 事故现场 基本 清理 完毕 , 104 国道小溪岭段 恢复了 通车 。',
    'sentence_words': '到 晚上 9 时 左右 , 事故现场 基本 清理 完毕 , 104 国道 小溪 岭段 恢复 了 通车 。',
    'triggers': [
        {
            'event': 'operation',
            'event_trigger': '清理',
            'event_arguments': ['事故现场']
        },
        {
            'event': 'stateChange',
            'event_trigger': '通车'
        }
    ]
}
```
## Data statistics

Read the data back from the dataset:
```python
import json
import pandas as pd

with open('./dataset.json', 'r', encoding='utf-8') as f:
    sentences = json.load(f)
len(sentences)
```
### Trigger count per sentence

triggers_num records the number of trigger words in each sentence. Converting it to a DataFrame shows that a sentence contains on average about 2 triggers, with a minimum of 1 and a maximum of 8.
```python
triggers_num = [len(s['triggers']) for s in sentences]
len(triggers_num)

triggers_df = pd.DataFrame({'triggers_num': triggers_num})
triggers_df.describe()
```
|       | triggers_num |
|-------|--------------|
| count | 1665.000000  |
| mean  | 2.045646     |
| std   | 1.298235     |
| min   | 1.000000     |
| 25%   | 1.000000     |
| 50%   | 2.000000     |
| 75%   | 3.000000     |
| max   | 8.000000     |
#### 1/1 ratio (sentences with exactly one event)

```python
len(triggers_df[triggers_df['triggers_num'] == 1])
len(triggers_df[triggers_df['triggers_num'] == 1]) / len(triggers_df)
```
#### 1/N ratio (sentences with more than one event)

```python
len(triggers_df[triggers_df['triggers_num'] > 1])
len(triggers_df[triggers_df['triggers_num'] > 1]) / len(triggers_df)
```
### Event type statistics: 7 types

There are 7 distinct event types in the dataset; the code below also counts how many times each type occurs.
```python
event_type = dict()
for sent in sentences:
    for trigger in sent['triggers']:
        t = trigger['event']
        if t not in event_type.keys():
            event_type[t] = 1
        else:
            event_type[t] += 1

len(event_type)
event_type
```
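The manual dictionary bookkeeping above can equivalently be written with collections.Counter from the standard library; a minimal alternative sketch:

```python
from collections import Counter

# Count each trigger's event type across the whole dataset
event_type = Counter(
    trigger['event'] for sent in sentences for trigger in sent['triggers']
)
len(event_type), dict(event_type)
```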
### Argument count statistics

arguments_num stores the number of arguments attached to each trigger word. Converting it to a DataFrame shows that a trigger has on average 0.4495 arguments.
```python
arguments_num = []
for sent in sentences:
    for trigger in sent['triggers']:
        if 'event_arguments' not in trigger.keys():
            arguments_num.append(0)
        else:
            arguments_num.append(len(trigger['event_arguments']))
len(arguments_num)

arguments_df = pd.DataFrame({'num': arguments_num})
arguments_df.describe()
```
|       | num         |
|-------|-------------|
| count | 3406.000000 |
| mean  | 0.449501    |
| std   | 0.497516    |
| min   | 0.000000    |
| 25%   | 0.000000    |
| 50%   | 0.000000    |
| 75%   | 1.000000    |
| max   | 1.000000    |
Counting the values shows that 1531 triggers have exactly one argument and 1875 triggers have none:
```python
arguments_df['num'].value_counts()
```
Since no trigger ever has more than one argument, the event_arguments field is converted from a list to a str:
```python
for sent in sentences:
    for trigger in sent['triggers']:
        if 'event_arguments' in trigger.keys():
            trigger['event_arguments'] = trigger['event_arguments'][0]

with open('./dataset.json', 'w', encoding='utf-8') as f:
    json.dump(sentences, f, sort_keys=True, indent=4, ensure_ascii=False)
```
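A small sanity check (a sketch, not part of the original code) confirms the conversion took effect:

```python
# After the rewrite, every event_arguments value should be a plain str
for sent in sentences:
    for trigger in sent['triggers']:
        if 'event_arguments' in trigger:
            assert isinstance(trigger['event_arguments'], str)
```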
### Maximum sentence length: 85

Counting the length of every sentence gives a maximum of 85 (measured in words).
```python
nums = []
for piece in sentences:
    nums.append(len(piece['sentence_words'].split()))

max(nums), nums.index(max(nums))
```
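This maximum matters later because the model consumes fixed-length input. A minimal sketch of padding to that length, assuming keras-style preprocessing (the constant name SENTENCE_LENGTH and the toy seqs are hypothetical, not taken from the repo):

```python
from keras.preprocessing.sequence import pad_sequences

SENTENCE_LENGTH = 85  # the maximum found above

# seqs stands in for sentences already mapped to word-index lists
seqs = [[4, 12, 7], [9, 3]]
padded = pad_sequences(seqs, maxlen=SENTENCE_LENGTH, padding='post')
padded.shape  # (2, 85)
```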
## Train / Test split

triggers_num holds the trigger count of each sentence, and y is the array of sentence indices.
```python
import json
from sklearn.model_selection import train_test_split
import numpy as np

with open('./dataset.json', 'r', encoding='utf-8') as f:
    dataset = json.load(f)
len(dataset)

triggers_num = [len(p['triggers']) for p in dataset]
y = np.arange(len(dataset))
y.shape
```
Split the dataset with train_test_split, reserving 20% of it for the test set:
```python
train_index, test_index = train_test_split(y, test_size=0.2, stratify=triggers_num, random_state=0)
train_index.shape, test_index.shape
```
The stratify parameter of train_test_split keeps the distribution of the stratification labels (here, the per-sentence trigger counts) in the train and test sets consistent with their distribution in the full dataset, as the check below confirms.
 
```python
def check_triggers_num(indexes):
    num = np.array(triggers_num)

    chosen_num = num[indexes]
    events = list(set(chosen_num))
    events.sort()

    num2len = dict()
    for e in events:
        num2len[e] = len(chosen_num[chosen_num == e])

    print(num2len)

check_triggers_num(y)
# {1: 765, 2: 429, 3: 249, 4: 131, 5: 55, 6: 20, 7: 11, 8: 5}
check_triggers_num(train_index)
# {1: 612, 2: 343, 3: 199, 4: 105, 5: 44, 6: 16, 7: 9, 8: 4}
check_triggers_num(test_index)
# {1: 153, 2: 86, 3: 50, 4: 26, 5: 11, 6: 4, 7: 2, 8: 1}
```
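The repository's layout lists dataset_split.py and *.npy files, so the split indices are presumably persisted for the later notebooks; a hedged sketch of doing so with numpy (the file names are illustrative, not necessarily those the repo uses):

```python
# Persist the split so later notebooks can reuse exactly the same partition
np.save('./train_index.npy', train_index)
np.save('./test_index.npy', test_index)

# Later: reload the indices and materialize the training set
train_index = np.load('./train_index.npy')
train_set = [dataset[i] for i in train_index]
```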