Updated: 2022-02-09 17:41 Source: 乐鱼电竞
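Text data analysis helps us understand a corpus, quickly spot problems it may contain, and guide the choice of some hyperparameters during later model training.
We will use a real corpus of Chinese hotel reviews to walk through several common text data analysis methods.
The corpus is a binary-classification Chinese sentiment analysis dataset stored in the "./cn_data" directory. train.tsv is the training set and dev.tsv is the validation set; both files share the same format.
Format of train.tsv: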
sentence	label
早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好,餐厅不分吸烟区.房间不分有无烟房.	0
去的时候 ,酒店大厅和餐厅在装修,感觉大厅有点挤.由于餐厅装修本来该享受的早饭,也没有享受(他们是8点开始每个房间送,但是我时间来不及了)不过前台服务员态度好!	1
有很长时间没有在西藏大厦住了，以前去北京在这里住的较多。这次住进来发现换了液晶电视，但网络不是很好，他们自己说是收费的原因造成的。其它还好。	1
非常好的地理位置，住的是豪华海景房，打开窗户就可以看见栈桥和海景。记得很早以前也住过，现在重新装修了。总的来说比较满意，以后还会住	1
交通很方便，房间小了一点，但是干净整洁，很有香港的特色，性价比较高，推荐一下哦	1
酒店的装修比较陈旧，房间的隔音，主要是卫生间的隔音非常差，只能算是一般的	0
酒店有点旧，房间比较小，但酒店的位子不错，就在海边，可以直接去游泳。8楼的海景打开窗户就是海。如果想住在热闹的地带，这里不是一个很好的选择，不过威海城市真的比较小，打车还是相当便宜的。晚上酒店门口出租车比较少。	1
位置很好，走路到文庙、清凉寺5分钟都用不了，周边公交车很多很方便，就是出租车不太爱去（老城区路窄爱堵车），因为是老宾馆所以设施要陈旧些，	1
酒店设备一般，套房里卧室的不能上网，要到客厅去。	0
The data in train.tsv has 2 columns. The first column holds the review text, which carries sentiment. The second column is 0 or 1 and marks each review as negative or positive: 0 means negative, 1 means positive.
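As a quick check of the two-column layout described above, the sketch below parses a small in-memory sample with the standard csv module. The sample rows and the parse_tsv helper are assumptions made for illustration; they are not part of the corpus or of the code in this article.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as train.tsv:
# a header line, then tab-separated sentence and label.
SAMPLE_TSV = "sentence\tlabel\n酒店设备一般\t0\n交通很方便\t1\n"

def parse_tsv(text):
    """Return a list of (sentence, label) pairs from TSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["sentence"], int(row["label"])) for row in reader]

rows = parse_tsv(SAMPLE_TSV)
for sentence, label in rows:
    print(label, sentence)
```

Converting the label to int early makes later comparisons such as label == 1 behave as expected.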
# Import the required packages
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Set the plot style
plt.style.use('fivethirtyeight')
# Read the training and validation TSV files
train_data = pd.read_csv("./cn_data/train.tsv", sep="\t")
valid_data = pd.read_csv("./cn_data/dev.tsv", sep="\t")
# Plot the label count distribution of the training data
# (recent seaborn versions require the column to be passed as x=...)
sns.countplot(x="label", data=train_data)
plt.title("train_data")
plt.show()
# Plot the label count distribution of the validation data
sns.countplot(x="label", data=valid_data)
plt.title("valid_data")
plt.show()


×¢Ò⣺ÔÚÉî¶ÈѧϰģÐÍÆÀ¹ÀÖÐ, ÎÒÃÇÒ»°ãʹÓÃACC×÷ΪÆÀ¹ÀÖ¸±ê, ÈôÏ뽫ACCµÄ»ùÏß¶¨ÒåÔÚ50%×óÓÒ, ÔòÐèÒªÎÒÃǵÄÕý¸ºÑù±¾±ÈÀýά³ÖÔÚ1:1×óÓÒ, ·ñÔò¾ÍÒª½øÐбØÒªµÄÊý¾ÝÔöÇ¿»òÊý¾Ýɾ¼õ. ÉÏͼÖÐѵÁ·ºÍÑéÖ¤¼¯Õý¸ºÑù±¾¶¼ÉÔÓв»¾ùºâ, ¿ÉÒÔ½øÐÐһЩÊý¾ÝÔöÇ¿¡£
# Add a sentence-length column to the training data; each value is the length of the corresponding sentence
train_data["sentence_length"] = train_data["sentence"].map(len)
# Plot the count distribution of the sentence-length column
sns.countplot(x="sentence_length", data=train_data)
# This plot is read for its y-axis (counts), so the x ticks are hidden; the x range is read from the distribution plot
plt.xticks([])
plt.show()
# Plot the length distribution (sns.distplot is deprecated; histplot with kde=True replaces it)
sns.histplot(train_data["sentence_length"], kde=True, stat="density")
# This plot is read for its x-axis, so the y ticks are hidden
plt.yticks([])
plt.show()
# Add a sentence-length column to the validation data; each value is the length of the corresponding sentence
valid_data["sentence_length"] = valid_data["sentence"].map(len)
# Plot the count distribution of the sentence-length column
sns.countplot(x="sentence_length", data=valid_data)
# This plot is read for its y-axis (counts), so the x ticks are hidden; the x range is read from the distribution plot
plt.xticks([])
plt.show()
# Plot the length distribution (sns.distplot is deprecated; histplot with kde=True replaces it)
sns.histplot(valid_data["sentence_length"], kde=True, stat="density")
# This plot is read for its x-axis, so the y ticks are hidden
plt.yticks([])
plt.show()


ͨ¹ý»æÖƾä×Ó³¤¶È·Ö²¼Í¼, ¿ÉÒÔµÃÖªÎÒÃǵÄÓïÁÏÖд󲿷־ä×Ó³¤¶ÈµÄ·Ö²¼·¶Î§, ÒòΪģÐ͵ÄÊäÈëÒªÇóΪ¹Ì¶¨³ß´çµÄÕÅÁ¿£¬ºÏÀíµÄ³¤¶È·¶Î§¶ÔÖ®ºó½øÐоä×ӽضϲ¹Æë(¹æ·¶³¤¶È)Æðµ½¹Ø¼üµÄÖ¸µ¼×÷ÓÃ. ÉÏͼÖд󲿷־ä×Ó³¤¶ÈµÄ·¶Î§´óÖÂΪ20-250Ö®¼ä¡£
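# Draw a scatter (strip) plot of sentence length by label for the training set
sns.stripplot(y='sentence_length', x='label', data=train_data)
plt.show()
# Draw a scatter (strip) plot of sentence length by label for the validation set
sns.stripplot(y='sentence_length', x='label', data=valid_data)
plt.show()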


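The length scatter plots for positive and negative samples make it easy to locate outliers, which helps us review the corpus manually with more precision. In the figures above, an outlier appears among the positive samples of the training set: its sentence length is about 3500, and it needs manual review.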
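Once the plot has revealed an outlier, the offending rows can be listed for manual review with a simple length threshold. The find_length_outliers helper, the sample sentences, and the threshold of 1000 are assumptions for illustration only.

```python
def find_length_outliers(sentences, max_len):
    """Return (index, length) pairs for sentences longer than max_len."""
    return [(i, len(s)) for i, s in enumerate(sentences) if len(s) > max_len]

# Hypothetical sentences; the third one stands in for an abnormally long review.
sentences = ["房间很好", "服务一般", "好" * 3500]

outliers = find_length_outliers(sentences, max_len=1000)
print(outliers)  # [(2, 3500)]
```

Returning the index alongside the length makes it straightforward to look the row up in the original file.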
# Import jieba for word segmentation
# Import chain to flatten the list of token lists
import jieba
from itertools import chain
# Segment every sentence in the training set and count the distinct words
train_vocab = set(chain.from_iterable(jieba.lcut(s) for s in train_data["sentence"]))
print("Number of distinct words in the training set:", len(train_vocab))
# Segment every sentence in the validation set and count the distinct words
valid_vocab = set(chain.from_iterable(jieba.lcut(s) for s in valid_data["sentence"]))
print("Number of distinct words in the validation set:", len(valid_vocab))
Output:
Number of distinct words in the training set: 12147
Number of distinct words in the validation set: 6857
# Use jieba's part-of-speech tagging
import jieba.posseg as pseg
def get_a_list(text):
    """Return the list of adjectives in the text."""
    # pseg.lcut segments the text and returns objects with a part-of-speech
    # attribute (flag) and a word attribute (word); keep the words whose flag
    # is "a" (adjective)
    r = []
    for g in pseg.lcut(text):
        if g.flag == "a":
            r.append(g.word)
    return r
# Import the word cloud package
from wordcloud import WordCloud
def get_word_cloud(keywords_list):
    # Instantiate the word cloud class. font_path is the path to a font file,
    # needed so that Chinese characters can be displayed; max_words is the
    # maximum number of words shown; background_color is the background color
    wordcloud = WordCloud(font_path="./SimHei.ttf", max_words=100, background_color="white")
    # Join the incoming words into the space-separated string the generator expects
    keywords_string = " ".join(keywords_list)
    # Generate the word cloud
    wordcloud.generate(keywords_string)
    # Draw and show the image
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
# Get the positive samples of the training set
p_train_data = train_data[train_data["label"]==1]["sentence"]
# Get the adjectives of every positive sentence
train_p_a_vocab = chain(*map(lambda x: get_a_list(x), p_train_data))
# Get the negative samples of the training set
n_train_data = train_data[train_data["label"]==0]["sentence"]
# Get the adjectives of every negative sentence
train_n_a_vocab = chain(*map(lambda x: get_a_list(x), n_train_data))
# Call the word cloud function
get_word_cloud(train_p_a_vocab)
get_word_cloud(train_n_a_vocab)
Adjective word cloud for the positive samples of the validation set:


Get adjective word clouds for the positive and negative samples of the validation set
# Get the positive samples of the validation set
p_valid_data = valid_data[valid_data["label"]==1]["sentence"]
# Get the adjectives of every positive sentence
valid_p_a_vocab = chain(*map(lambda x: get_a_list(x), p_valid_data))
# Get the negative samples of the validation set
n_valid_data = valid_data[valid_data["label"]==0]["sentence"]
# Get the adjectives of every negative sentence
valid_n_a_vocab = chain(*map(lambda x: get_a_list(x), n_valid_data))
# Call the word cloud function
get_word_cloud(valid_p_a_vocab)
get_word_cloud(valid_n_a_vocab)
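The high-frequency adjective word clouds give a simple assessment of corpus quality. They also show which words contradict the meaning of their label, so that these can be reviewed and corrected manually and the vast majority of the corpus meets the training standard. In the figures above, the positive samples are mostly commendatory words and the negative samples mostly derogatory words, which basically meets the requirement. However, the negative word cloud also contains commendatory words such as "便利" (convenient), so a manual review is worthwhile.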
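The visual check above can be complemented by a frequency count that flags high-frequency adjectives whose polarity contradicts the label. This is only a sketch: the suspicious_words helper, the sample adjectives, and the tiny commendatory word list are all assumptions for illustration; with the real data the adjectives would come from get_a_list.

```python
from collections import Counter

def suspicious_words(adjectives, opposite_lexicon, top_k=10):
    """Return the top_k most frequent adjectives that appear in a lexicon of opposite polarity."""
    top = Counter(adjectives).most_common(top_k)
    return [(word, count) for word, count in top if word in opposite_lexicon]

# Hypothetical adjectives extracted from negative reviews, and a small
# hand-made list of commendatory words.
negative_adjectives = ["差", "旧", "便利", "差", "脏", "便利", "差"]
commendatory = {"便利", "干净", "方便"}

flagged = suspicious_words(negative_adjectives, commendatory)
print(flagged)  # [('便利', 2)]
```

The flagged words are only candidates: a negative review can legitimately contain a commendatory adjective, so the sentences behind them still need a human look.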