
What is text data analysis for? Commonly used text data analysis methods


The role of text data analysis:

Îı¾Êý¾Ý·ÖÎöÄܹ»ÓÐЧ°ïÖúÎÒÃÇÀí½âÊý¾ÝÓïÁÏ, ¿ìËÙ¼ì²é³öÓïÁÏ¿ÉÄÜ´æÔÚµÄÎÊÌâ, ²¢Ö¸µ¼Ö®ºóÄ£ÐÍѵÁ·¹ý³ÌÖÐһЩ³¬²ÎÊýµÄÑ¡Ôñ.

Several commonly used text data analysis methods:

  • Label count distribution
  • Sentence length distribution
  • Word frequency statistics and keyword word clouds

We will walk through these methods using a real Chinese hotel review corpus.

The Chinese hotel review corpus:

It is a binary-classification Chinese sentiment analysis corpus stored in the "./cn_data" directory, where train.tsv is the training set and dev.tsv is the validation set; both files share the same format.

Sample of the train.tsv data:

sentence    label
早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好,餐厅不分吸烟区.房间不分有无烟房.    0
去的时候 ,酒店大厅和餐厅在装修,感觉大厅有点挤.由于餐厅装修本来该享受的早饭,也没有享受(他们是8点开始每个房间送,但是我时间来不及了)不过前台服务员态度好!    1
有很长时间没有在西藏大厦住了，以前去北京在这里住的较多。这次住进来发现换了液晶电视，但网络不是很好，他们自己说是收费的原因造成的。其它还好。  1
非常好的地理位置，住的是豪华海景房，打开窗户就可以看见栈桥和海景。记得很早以前也住过，现在重新装修了。总的来说比较满意，以后还会住   1
交通很方便，房间小了一点，但是干净整洁，很有香港的特色，性价比较高，推荐一下哦 1
酒店的装修比较陈旧，房间的隔音，主要是卫生间的隔音非常差，只能算是一般的    0
酒店有点旧，房间比较小，但酒店的位子不错，就在海边，可以直接去游泳。8楼的海景打开窗户就是海。如果想住在热闹的地带，这里不是一个很好的选择，不过威海城市真的比较小，打车还是相当便宜的。晚上酒店门口出租车比较少。   1
位置很好，走路到文庙、清凉寺5分钟都用不了，周边公交车很多很方便，就是出租车不太爱去（老城区路窄爱堵车），因为是老宾馆所以设施要陈旧些，    1
酒店设备一般，套房里卧室的不能上网，要到客厅去。    0

Explanation of the train.tsv data format:

The data in train.tsv has two columns. The first column is the review text carrying sentiment; the second column is 0 or 1, indicating whether the review is negative or positive: 0 means negative, 1 means positive.
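
As a quick sanity check, here is a minimal sketch (assuming the ./cn_data/train.tsv path described above) that peeks at the two columns and the label values with pandas:

# Minimal sketch: confirm the sentence/label layout described above
import pandas as pd

sample = pd.read_csv("./cn_data/train.tsv", sep="\t")
print(sample.head())              # first few sentence/label pairs
print(sample["label"].unique())   # expected values: 0 and 1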

Obtaining the label count distribution of the training and validation sets

# Import the required packages
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Set the plotting style
plt.style.use('fivethirtyeight')

# Read the training tsv and the validation tsv
train_data = pd.read_csv("./cn_data/train.tsv", sep="\t")
valid_data = pd.read_csv("./cn_data/dev.tsv", sep="\t")

# Plot the label count distribution of the training data
# (pass the column name as a keyword argument; recent seaborn no longer accepts it positionally)
sns.countplot(x="label", data=train_data)
plt.title("train_data")
plt.show()

# Plot the label count distribution of the validation data
sns.countplot(x="label", data=valid_data)
plt.title("valid_data")
plt.show()

Training set label count distribution:

[Figure: count plot of label values in train_data]

Validation set label count distribution:

[Figure: count plot of label values in valid_data]

Note: In deep learning model evaluation we generally use accuracy (ACC) as the metric. If we want the ACC baseline to sit around 50%, the positive/negative sample ratio needs to stay close to 1:1; otherwise data augmentation or data pruning is required. In the figures above, both the training and validation sets are slightly imbalanced, so some data augmentation could be applied.
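
To quantify the imbalance mentioned above, a short sketch reusing the train_data and valid_data frames loaded earlier can print the normalized class counts:

# Sketch: check the positive/negative ratio before deciding on augmentation
print(train_data["label"].value_counts(normalize=True))  # class fractions in the training set
print(valid_data["label"].value_counts(normalize=True))  # class fractions in the validation set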

Obtaining the sentence length distribution of the training and validation sets

# Add a sentence_length column to the training data; each value is the length of the corresponding sentence
train_data["sentence_length"] = list(map(lambda x: len(x), train_data["sentence"]))

# Plot the count distribution of the sentence_length column
sns.countplot(x="sentence_length", data=train_data)
# Here we mainly care about the y axis (counts); the x-axis range is examined with the dist plot below
plt.xticks([])
plt.show()

# Plot the dist (histogram + KDE) of sentence lengths
# (distplot is deprecated in newer seaborn; histplot/displot are the current equivalents)
sns.distplot(train_data["sentence_length"])
# Here we mainly care about the x axis (length range), so hide the y ticks
plt.yticks([])
plt.show()


# Add a sentence_length column to the validation data; each value is the length of the corresponding sentence
valid_data["sentence_length"] = list(map(lambda x: len(x), valid_data["sentence"]))

# Plot the count distribution of the sentence_length column
sns.countplot(x="sentence_length", data=valid_data)
# Here we mainly care about the y axis (counts); the x-axis range is examined with the dist plot below
plt.xticks([])
plt.show()

# Plot the dist (histogram + KDE) of sentence lengths
sns.distplot(valid_data["sentence_length"])
# Here we mainly care about the x axis (length range), so hide the y ticks
plt.yticks([])
plt.show()

Training set sentence length distribution:

[Figure: count plot of sentence_length in train_data]

[Figure: dist plot of sentence_length in train_data]

The sentence length distribution plots show the range in which most sentence lengths in the corpus fall. Because the model input must be a fixed-size tensor, a reasonable length range provides key guidance for the later truncation and padding (length normalization) of sentences. In the figures above, most sentence lengths fall roughly between 20 and 250.
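
One hedged way to turn the observed 20-250 range into a concrete cutoff (a sketch, not part of the original tutorial; the 0.95 quantile and the "<pad>" token are illustrative choices) is:

# Sketch: pick a cutoff covering ~95% of sentences, then truncate/pad token lists to it
cutoff = int(train_data["sentence_length"].quantile(0.95))
print("cutoff length:", cutoff)

def pad_or_truncate(tokens, max_len, pad_token="<pad>"):
    """Truncate a token list to max_len, or right-pad it with pad_token."""
    tokens = tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))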

Obtaining the scatter distribution of positive/negative sample lengths for the training and validation sets

# Scatter plot of sentence length by label for the training set
sns.stripplot(y='sentence_length', x='label', data=train_data)
plt.show()

# Scatter plot of sentence length by label for the validation set
sns.stripplot(y='sentence_length', x='label', data=valid_data)
plt.show()

Scatter distribution of positive/negative sample lengths on the training set:

[Figure: strip plot of sentence_length by label for train_data]

Scatter distribution of positive/negative sample lengths on the validation set:

[Figure: strip plot of sentence_length by label for valid_data]

The scatter plots of positive/negative sample lengths make it easy to locate outliers, which helps us review the corpus manually with more precision. In the figure above, an outlier with a sentence length of nearly 3500 appears among the positive samples of the training set and needs manual review.
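
To pull that outlier out for manual review, a short sketch can filter the sentence_length column added earlier (the 1000-character threshold is an arbitrary illustrative choice):

# Sketch: list training sentences whose length is far above the normal 20-250 range
outliers = train_data[train_data["sentence_length"] > 1000]
print(outliers[["sentence_length", "label"]])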

Obtaining the count of distinct tokens in the training and validation sets

# Import jieba for word segmentation
# Import chain to flatten nested lists
import jieba
from itertools import chain

# Segment every training-set sentence and count the distinct tokens
train_vocab = set(chain(*map(lambda x: jieba.lcut(x), train_data["sentence"])))
print("The training set contains this many distinct tokens:", len(train_vocab))

# Segment every validation-set sentence and count the distinct tokens
valid_vocab = set(chain(*map(lambda x: jieba.lcut(x), valid_data["sentence"])))
print("The validation set contains this many distinct tokens:", len(valid_vocab))

Output:

The training set contains this many distinct tokens: 12147
The validation set contains this many distinct tokens: 6857
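
Beyond the distinct-token totals, the "word frequency statistics" item from the method list above can be sketched with collections.Counter over the same jieba output (showing the top 10 is an arbitrary choice):

# Sketch: most frequent tokens in the training set (word frequency statistics)
from collections import Counter

train_counter = Counter(chain(*map(jieba.lcut, train_data["sentence"])))
print(train_counter.most_common(10))   # top-10 tokens with their counts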

Obtaining word clouds of high-frequency adjectives for positive and negative samples on the training set

# Use jieba's part-of-speech tagging
import jieba.posseg as pseg

def get_a_list(text):
    """Return the list of adjectives in the given text."""
    # jieba's POS tagger splits the text into objects carrying a POS attribute (flag)
    # and a token attribute (word); keep only tokens whose flag is "a" (adjective).
    r = []
    for g in pseg.lcut(text):
        if g.flag == "a":
            r.append(g.word)
    return r

# Import the word cloud toolkit
from wordcloud import WordCloud

def get_word_cloud(keywords_list):
    # Instantiate the word cloud class: font_path points to a font able to render Chinese,
    # max_words caps how many words the image shows, background_color sets the background
    wordcloud = WordCloud(font_path="./SimHei.ttf", max_words=100, background_color="white")
    # Convert the incoming list into the single string the generator expects
    keywords_string = " ".join(keywords_list)
    # Generate the word cloud
    wordcloud.generate(keywords_string)

    # Draw and display the image
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# Positive samples in the training set
p_train_data = train_data[train_data["label"]==1]["sentence"]

# Collect the adjectives from every positive-sample sentence
train_p_a_vocab = chain(*map(lambda x: get_a_list(x), p_train_data))

# Negative samples in the training set
n_train_data = train_data[train_data["label"]==0]["sentence"]

# Collect the adjectives from every negative-sample sentence
train_n_a_vocab = chain(*map(lambda x: get_a_list(x), n_train_data))

# Draw the two word clouds
get_word_cloud(train_p_a_vocab)
get_word_cloud(train_n_a_vocab)

Training set positive-sample adjective word cloud:

[Figure: word cloud of adjectives in training-set positive samples]

Training set negative-sample adjective word cloud:

[Figure: word cloud of adjectives in training-set negative samples]

Obtaining adjective word clouds for positive and negative samples on the validation set

# Positive samples in the validation set
p_valid_data = valid_data[valid_data["label"]==1]["sentence"]

# Collect the adjectives from every positive-sample sentence
valid_p_a_vocab = chain(*map(lambda x: get_a_list(x), p_valid_data))

# Negative samples in the validation set
n_valid_data = valid_data[valid_data["label"]==0]["sentence"]

# Collect the adjectives from every negative-sample sentence
valid_n_a_vocab = chain(*map(lambda x: get_a_list(x), n_valid_data))

# Draw the two word clouds
get_word_cloud(valid_p_a_vocab)
get_word_cloud(valid_n_a_vocab)

The high-frequency adjective word clouds let us make a rough assessment of the corpus quality and manually review and correct words that contradict the meaning of their labels, so that the vast majority of the corpus meets the training standard. In the word clouds above, most adjectives in the positive samples are commendatory and most in the negative samples are derogatory, which basically meets the requirement; however, the negative-sample cloud also contains a commendatory word such as "便利" (convenient), which warrants manual review.
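
For that manual review step, a simple hedged sketch (reusing the n_train_data series of negative-sample sentences defined in the training-set word cloud code) is to pull out the negative reviews containing a flagged commendatory word and read them directly:

# Sketch: list negative-sample sentences containing a word that looks positive ("便利" / convenient)
suspect = n_train_data[n_train_data.str.contains("便利")]
for s in suspect:
    print(s)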




