Counting Chinese word frequencies with Python
Ⅰ Counting word frequencies with Python
def statistics(astr):
    # Split one tab-separated line into tokens and drop repeats within the line,
    # so each token is counted at most once per line.
    # The newline is stripped first so the last token compares equal to earlier ones.
    slist = astr.rstrip("\n").split("\t")
    alist = []
    for i in slist:
        if i not in alist:
            alist.append(i)
    return alist


if __name__ == "__main__":
    code_doc = {}  # token -> number of lines that contain it
    with open("test_data.txt", "r", encoding="utf-8") as fs:
        for ln in fs:
            for t in statistics(ln):
                code_doc[t] = code_doc.get(t, 0) + 1
    for key, count in code_doc.items():
        print(key + " " + str(count))
Ⅱ How to count the occurrences of a given Chinese word in a txt file with Python
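No answer is given for this question, so here is only a minimal sketch of one possible approach: read the whole file and use str.count. The function name count_word, the sample text, and the UTF-8 encoding are assumptions made for illustration.

```python
import os
import tempfile


def count_word(path, word, encoding="utf-8"):
    """Return how many times `word` occurs (non-overlapping) in the text file."""
    with open(path, "r", encoding=encoding) as f:
        return f.read().count(word)


# Self-contained demo: write a small sample file, then count a word in it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("我愛Python,Python也愛我。\n統計Python出現的次數。\n")
    demo_count = count_word(sample_path, "Python")

print(demo_count)
```

Note that str.count matches raw substrings, so a word such as 中國 is also counted inside 中國人. If that matters, segment the text first and count tokens instead.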
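Ⅲ Counting word frequencies with Python and visualizing the result
Have a look at the ECharts website: it offers more than 100 chart types to choose from, including ones suited to word frequencies. All you need to do is read the example code and replace its data with your own. You do of course need to import ECharts first (from Python this is typically done through a wrapper such as pyecharts).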
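As a rough illustration of "replace the data with your own", the sketch below builds the option object that an ECharts bar chart example expects and prints it as JSON, which can then be pasted over the option in the example code. The helper name bar_option and the sample counts are made up for this example; only the Python standard library is used.

```python
import json


def bar_option(word_counts, top_n=10):
    """Build an ECharts bar-chart option dict from a {word: count} mapping."""
    # Highest count first; ties are broken by the word itself.
    top = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return {
        "xAxis": {"type": "category", "data": [w for w, _ in top]},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": [c for _, c in top]}],
    }


sample_counts = {"數據": 5, "分詞": 3, "詞頻": 8}  # hypothetical data
option = bar_option(sample_counts)
print(json.dumps(option, ensure_ascii=False, indent=2))
```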
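Ⅳ How to count word frequencies with Python and jieba segmentation
#! python3
# -*- coding: utf-8 -*-
import jieba
from collections import Counter


def get_words(txt):
    seg_list = jieba.cut(txt)
    c = Counter()
    for x in seg_list:
        # Skip single-character tokens and whitespace-only tokens such as '\r\n'
        if len(x) > 1 and x.strip():
            c[x] += 1
    print('Word frequency results')
    for (k, v) in c.most_common(100):
        # Word padded on the left to 5 characters, a bar of '*' (one per 3 occurrences), then the count
        print('%s%s %s %d' % (' ' * (5 - len(k)), k, '*' * int(v / 3), v))


if __name__ == '__main__':
    with open('19d.txt', 'r', encoding='utf8') as f:
        txt = f.read()
    get_words(txt)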
Ⅴ How to segment a Chinese article and count word frequencies with Python
Function
[root@skatedb55 ~]# vi op_log_file.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
#Author:Skate
import os,time
def op_log(log):
    f=file(log_file,'a')
    date=time.strftime('%Y-%m-%d %H:%M:%S')
    record = '%s %s\n' %(date,log)
    ...
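Ⅵ Counting and displaying the high-frequency words in a novel with Python
Use jieba for segmentation and the wordcloud package for the word cloud.
# Read the file contents
file = 'd:/艾薩克·阿西莫夫/奇妙的航程.TXT'
with open(file, 'r', encoding='gbk') as f:
    text = f.read()
# Segment with jieba and join with spaces, because wordcloud uses spaces as word boundaries
import jieba
text = ' '.join(jieba.cut(text))
# Mask image; a single-color image is enough
# (scipy.misc.imread has been removed from SciPy, so Pillow + NumPy load the image here)
import numpy as np
from PIL import Image
color_mask = np.array(Image.open('D:/Pictures/7218.png'))
# Create the word cloud object; the text is Chinese, so specify a Chinese font to avoid garbled characters
# WordCloud's parameters control many things; see the package documentation
import wordcloud
w = wordcloud.WordCloud(font_path='C:/Windows/Fonts/msyh.ttc',
                        max_words=100,
                        mask=color_mask)
# Load the space-separated string
w.generate(text)
# Write the image
w.to_file('d:/img1.png')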
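Ⅶ A txt document has already been segmented with jieba; how can a Python tool compute word frequencies for it? Script wanted, …
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
from collections import Counter

# Assume the file to read is named aa.txt and sits in the current directory
filename = 'aa.txt'
dirname = os.getcwd()
f_n = os.path.join(dirname, filename)
# Commented-out block for testing the script: it generates 20 lines of data,
# each line holding 1-20 random numbers, each number drawn from 1-20
'''
import random
test = ''
for i in range(20):
    for j in range(random.randint(1, 20)):
        test += str(random.randint(1, 20)) + ' '
    test += '\n'
with open(f_n, 'w', encoding='utf-8') as wf:
    wf.write(test)
'''
# UTF-8 is assumed here; adjust the encoding to match your file
with open(f_n, encoding='utf-8') as f:
    s = f.readlines()
# Strip leading/trailing spaces and newlines from each line, split on whitespace,
# and collect everything into one flat list
words = []
for line in s:
    words.extend(line.strip().split())


# Format one output row: the first and last columns are 8 wide, the middle one 18 wide
def geshi(a, b, c):
    return alignment(str(a)) + alignment(str(b), 18) + alignment(str(c)) + '\n'


# Mixed Chinese/English alignment, see http://bbs.fishc.com/thread-67465-1-1.html, second post
# Padding with str.format misaligns text that mixes Chinese characters with letters or digits;
# alignment() pads by display width instead
def alignment(str1, space=8, align='left'):
    # In GBK a Chinese character takes 2 bytes and an ASCII character 1, matching its display width
    # (gbk rather than gb2312, so that traditional characters do not raise an encoding error)
    length = len(str1.encode('gbk', errors='replace'))
    space = space - length if space >= length else 0
    if align in ['left', 'l', 'L', 'Left', 'LEFT']:
        str1 = str1 + ' ' * space
    elif align in ['right', 'r', 'R', 'Right', 'RIGHT']:
        str1 = ' ' * space + str1
    elif align in ['center', 'c', 'C', 'Center', 'CENTER', 'centre']:
        str1 = ' ' * (space // 2) + str1 + ' ' * (space - space // 2)
    return str1


w_s = geshi('No.', 'Word', 'Freq')
# List of (word, frequency) tuples, sorted by frequency descending, then by word ascending
wordcount = sorted(Counter(words).items(), key=lambda l: (-l[1], l[0]))
# Each output row is: rank (8 wide) + word (18 wide) + frequency (8 wide) + '\n'
for rank, (w, c) in enumerate(wordcount, start=1):
    w_s += geshi(rank, w, c)
# Write the result to ar.txt
writefile = 'ar.txt'
w_n = os.path.join(dirname, writefile)
with open(w_n, 'w', encoding='utf-8') as wf:
    wf.write(w_s)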
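Ⅷ Python data mining: text analysis
Author | zhouyue65
Source | 君泉計量
Text mining: the process of extracting valuable knowledge from large amounts of text data and using that knowledge to reorganize information.
1. Corpus
A corpus is the collection of all the documents we want to analyze.
2. Chinese word segmentation
2.1 Concepts:
Chinese word segmentation: splitting a sequence of Chinese characters into individual words.
e.g. 我的家鄉是廣東省湛江市 --> 我/的/家鄉/是/廣東省/湛江市
Stop words:
Characters or words that need to be filtered out during data processing:
√ Overused words, such as web, 網站 (website), etc.
√ Modal particles, adverbs, prepositions, conjunctions, etc., such as 的, 地, 得.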
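Stop-word filtering can be sketched in a few lines. The token list reuses the segmentation example above; the stop-word list and the function name remove_stop_words are illustrative assumptions, not a standard list or API.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are neither stop words nor blank."""
    stop_set = set(stop_words)  # set lookup keeps the filter fast for long lists
    return [t for t in tokens if t not in stop_set and t.strip()]


tokens = ["我", "的", "家鄉", "是", "廣東省", "湛江市"]
stop_words = ["的", "地", "得", "是"]  # hypothetical stop-word list
kept = remove_stop_words(tokens, stop_words)
print("/".join(kept))
```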
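2.2 Installing the jieba segmentation package:
The simplest way is to install it directly from CMD by typing pip install jieba, but that did not seem to work on my computer.
Later I downloaded jieba 0.39 from https://pypi.org/project/jieba/#files, unpacked it into Python36\Lib\site-packages, and then ran pip install jieba in CMD again. This time the installation succeeded, though I do not know why.
I then installed jieba in my Anaconda environment as well: I first put the unpacked jieba 0.39 files under Anaconda3\Lib and then ran pip install jieba in Anaconda Prompt, as shown in the figure below: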
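2.3 Hands-on code:
The main method in jieba is cut.
The two main input parameters of jieba.cut are:
1) the first parameter is the string to be segmented
2) the cut_all parameter controls whether full mode is used
jieba.cut_for_search accepts one parameter, the string to be segmented. It is suited to segmentation for building a search engine's inverted index and produces finer-grained tokens.
Note: the string to be segmented can be a GBK string, a UTF-8 string, or unicode.
Both jieba.cut and jieba.cut_for_search return an iterable generator. You can use a for loop to get each segmented word (unicode), or convert the result to a list with list(jieba.cut(...)).
Code example (segmentation):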
The output is:
我 愛 Python
工信處 女幹事 每月 經過 下屬 科室 都 要 親口 交代 24 口 交換機 等 技術性 器件 的 安裝 工作
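The original example code is not preserved, so the following is only a sketch of calls that would produce output of this kind. The helper names segment and show are invented here; jieba.cut and jieba.cut_for_search are the functions described above. jieba is imported inside the function so that the formatting helper can be tried without it.

```python
def segment(text, cut_all=False, for_search=False):
    """Segment `text` with jieba and return the tokens as a list."""
    import jieba  # third-party package: pip install jieba

    if for_search:
        # Finer-grained segmentation intended for search-engine indexing
        return list(jieba.cut_for_search(text))
    # jieba.cut returns a generator; list() materializes it
    return list(jieba.cut(text, cut_all=cut_all))


def show(tokens):
    """Join tokens with single spaces, dropping whitespace-only tokens."""
    return " ".join(t for t in tokens if t.strip())


# Example calls (these need jieba installed):
# print(show(segment("我愛Python")))
# print(show(segment("我愛Python", cut_all=True)))      # full mode
# print(show(segment("我愛Python", for_search=True)))   # search-engine mode
```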
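Using segmentation in specialized domains:
Terms such as 真武七截陣 and 天罡北斗陣 end up split into several words. To improve this, we import a dictionary of terms.
However, when many words need to be added, adding them one at a time with jieba.add_word() is inefficient.
Instead we can call jieba.load_userdict('D:\PDM\2.2\金庸武功招式.txt') to import a whole dictionary at once; the txt file contains one term per line.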
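The two approaches can be sketched side by side. The helper names write_userdict and register_terms, and the temporary file path, are assumptions for illustration; jieba.add_word and jieba.load_userdict are the calls named above. The dictionary file is written as UTF-8, which is the encoding jieba expects for user dictionaries.

```python
import os
import tempfile


def write_userdict(words, path):
    """Write one term per line, skipping blanks and duplicates; return the terms written."""
    seen = []
    for w in words:
        w = w.strip()
        if w and w not in seen:
            seen.append(w)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(seen) + "\n")
    return seen


def register_terms(words=(), userdict_path=None):
    """Add a few terms one by one, and/or load a whole dictionary file."""
    import jieba  # third-party package: pip install jieba

    for w in words:
        jieba.add_word(w)  # fine for a handful of terms
    if userdict_path:
        jieba.load_userdict(userdict_path)  # better for a long list


terms = ["真武七截陣", "天罡北斗陣", "真武七截陣"]
dict_path = os.path.join(tempfile.gettempdir(), "userdict_demo.txt")  # hypothetical path
written = write_userdict(terms, dict_path)
with open(dict_path, encoding="utf-8") as f:
    lines_on_disk = f.read().splitlines()
print(lines_on_disk)
# With jieba installed: register_terms(userdict_path=dict_path)
```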
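2.3.1 Segmenting a large number of articles
First build the corpus:
After segmentation we also need to keep track of which article each token came from.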
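A minimal sketch of both steps, under the assumption that the corpus is a folder of UTF-8 text files. The function names, the demo files, and the default whitespace tokenizer are placeholders; with jieba installed, jieba.lcut could be passed as the tokenize argument instead.

```python
import os
import tempfile


def build_corpus(root_dir, encoding="utf-8"):
    """Walk root_dir and return a list of (file path, file content) pairs."""
    corpus = []
    for folder, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            path = os.path.join(folder, name)
            with open(path, "r", encoding=encoding) as f:
                corpus.append((path, f.read()))
    return corpus


def segment_corpus(corpus, tokenize=str.split):
    """Return (source path, token) rows so every token keeps its article of origin."""
    rows = []
    for path, content in corpus:
        for token in tokenize(content):
            rows.append((path, token))
    return rows


# Self-contained demo with two tiny, pre-segmented files.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "數據 挖掘"), ("b.txt", "文本 分析 數據")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    demo_corpus = build_corpus(tmp)
    demo_rows = segment_corpus(demo_corpus)

for path, token in demo_rows:
    print(os.path.basename(path), token)
```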
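3. Word frequency statistics
3.1 Term frequency:
The number of times a word appears in a given document.
3.2 Word frequency statistics with Python
3.2.1 Another way to remove stop words: add an if check
Some common methods used in the code:
Group-by counting:
Checking whether the values in a DataFrame column match any value in an array:
Negation (of boolean values):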
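The original snippets for these three operations are missing. In pandas they usually correspond to groupby, isin, and the ~ operator, so the sketch below combines them under that assumption. The column name "segment", the function name, and the sample tokens are invented for the example.

```python
import pandas as pd


def count_segments(tokens, stop_words):
    """Count tokens with pandas after dropping stop words."""
    seg = pd.DataFrame({"segment": tokens})
    # isin: True where the column value matches any entry of stop_words
    is_stop = seg["segment"].isin(stop_words)
    # ~ negates the boolean Series, keeping only the non-stop-word rows
    kept = seg[~is_stop]
    # groupby + size: number of rows per distinct token, most frequent first
    return kept.groupby("segment").size().sort_values(ascending=False)


tokens = ["數據", "的", "挖掘", "數據", "的", "了"]
freq = count_segments(tokens, ["的", "了"])
print(freq)
```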
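4. Drawing a word cloud
Word cloud: visually emphasizes the more frequent tokens in a text, rendering the keywords prominently and filtering out the bulk of the text, so that a reader can grasp the gist at a glance.
4.1 Installing the word cloud package
This site, https://www.lfd.uci.edu/~gohlke/pythonlibs/, hosts builds of almost every Python library; download the one that matches your system and Python version.
Installation under plain Python was easy. Under Anaconda it took some effort, and it only succeeded after I placed the word cloud file in C:\Users\Administrator.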
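5. Beautifying the word cloud (fitting it into the shape of an image)
6. Keyword extraction
The result is as follows:
7. Implementing keyword extraction
Term frequency (TF): the number of times a given word appears in the document.
Formula: TF = number of times the word appears in the document
Inverse document frequency (IDF): the weight of each word; it is inversely related to how common the word is.
Formula: IDF = log(total number of documents / (number of documents containing the word + 1))
TF-IDF (Term Frequency-Inverse Document Frequency): a measure of whether a token is a keyword; the larger the value, the more likely the token is a keyword.
Formula: TF-IDF = TF × IDF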
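The three formulas can be sketched directly in Python. This version assumes the common smoothed form of IDF, with + 1 in the denominator so that it is never zero; the function names and the four toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf(word, doc_tokens):
    """TF: number of times the word appears in the document."""
    return Counter(doc_tokens)[word]


def idf(word, all_docs):
    """IDF: log(total documents / (documents containing the word + 1))."""
    containing = sum(1 for doc in all_docs if word in doc)
    return math.log(len(all_docs) / (containing + 1))


def tf_idf(word, doc_tokens, all_docs):
    """TF-IDF = TF * IDF."""
    return tf(word, doc_tokens) * idf(word, all_docs)


# Four tiny, already segmented documents (hypothetical data)
docs = [
    ["數據", "挖掘", "數據"],
    ["文本", "分析"],
    ["詞頻", "統計"],
    ["詞雲", "繪制"],
]
score = tf_idf("數據", docs[0], docs)
print(round(score, 4))
```

Here 數據 appears twice in the first document and in one of four documents overall, so the score is 2 × log(4 / 2).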
7.1 Document vectorization
7.2 Hands-on code
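The original code for this part is not preserved. As a rough sketch of what document vectorization means, each segmented document can be turned into a row of word counts over a shared vocabulary. The function name vectorize and the sample documents are assumptions for illustration.

```python
def vectorize(docs):
    """Turn tokenized documents into count vectors over a shared, sorted vocabulary."""
    vocabulary = sorted({token for doc in docs for token in doc})
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc:
            row[index[token]] += 1  # one column per vocabulary word
        matrix.append(row)
    return vocabulary, matrix


docs = [["數據", "挖掘", "數據"], ["文本", "數據"]]  # hypothetical segmented documents
vocabulary, matrix = vectorize(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so rows from different documents can be compared column by column or fed into a TF-IDF weighting step.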
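Ⅸ Help reading Python code that counts Chinese word frequencies: there is one part I don't understand
First, one concept needs explaining: in a GBK-encoded byte string (a Python 2 str), one Chinese character has a "length" of 2.
str = '中國'  # GBK-encoded
To get the character '中', you need the slice str[0:2], not the index str[0].
Taking z4 as an example, the code below behaves as follows.
x = '同舟共濟與時俱進艱苦奮斗'
i += z4.findall(x)      # returns ['同舟共濟', '與時俱進', '艱苦奮斗']
i += z4.findall(x[2:])  # returns ['舟共濟與', '時俱進艱']
i += z4.findall(x[4:])  # returns ['共濟與時', '俱進艱苦']
i += z4.findall(x[6:])  # returns ['濟與時俱', '進艱苦奮']
The goal is to collect every run of 4 consecutive Chinese characters.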
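In Python 3, str is Unicode and one Chinese character has length 1, so the byte offsets 2, 4, and 6 are no longer needed. The sketch below shows one way to get the same result in a single pass with a lookahead regex. The pattern FOUR_CJK is a hypothetical stand-in for z4, whose definition is not shown above, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Zero-width lookahead: tried at every position, capturing the next 4 CJK characters
FOUR_CJK = re.compile(r"(?=([\u4e00-\u9fff]{4}))")


def four_char_runs(text):
    """Return every (overlapping) run of 4 consecutive Chinese characters."""
    return FOUR_CJK.findall(text)


x = "同舟共濟與時俱進艱苦奮斗"
runs = four_char_runs(x)
print(len(runs))
print(runs[:3])
```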
Ⅹ How to count word frequencies with Python
Code:
passage="""Editor』s Note: Looking through VOA's listener mail, we came across a letter that asked a simple question. "What do Americans think about China?" We all care about the perceptions of others. It helps us better understand who we are. VOA Reporter Michael Lipin begins a series providing some answers to our listener's question. His assignment: present a clearer picture of what Americans think about their chief world rival, and what drives those perceptions.
Two common American attitudes toward China can be identified from the latest U.S. public opinion surveys published by Gallup and Pew Research Center in the past year.
First, most of the Americans surveyed have unfavorable opinions of China as a whole, but do not view the country as a threat toward the United States at the present time.
Second, most survey respondents expect China to pose an economic and military threat to the United States in the future, with more Americans worried about the perceived economic threat than the military one.
Most Americans view China unfavorably
To understand why most Americans appear to have negative feelings about China, analysts interviewed by VOA say a variety of factors should be considered. Primary among them is a lack of familiarity.
"Most Americans do not have a strong interest in foreign affairs, Chinese or otherwise," says Robert Daly, director of the Kissinger Institute on China and the United States at the Washington-based Wilson Center.
Many of those Americans also have never traveled to China, in part because of the distance and expense. "That means that like most human beings, they take short cuts to understanding China," Daly says.
Rather than make the effort to regularly consume a wide range of U.S. media reports about China, analysts say many Americans base their views on widely-publicized major events in China's recent history."""
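passage = passage.replace(",", " ").replace(".", " ").replace(":", " ").replace("』", "'") \
    .replace('"', " ").replace("?", " ").replace("!", " ").replace("\n", " ")  # turn punctuation and line breaks into spaces
passagelist = passage.split(" ")  # split into individual words
pc = passagelist.copy()  # make a copy to iterate over
for i in range(len(pc)):
    pi = pc[i]  # this string
    if pi.count(" ") == len(pi):  # empty, or nothing but spaces
        passagelist.remove(pi)  # remove this item
worddict = {}
for j in range(len(passagelist)):
    pj = passagelist[j]  # this word
    if pj not in worddict:  # not counted yet
        worddict[pj] = 1  # add the word with a count of 1
    else:  # already counted
        worddict[pj] += 1  # increase the count by 1
# Output in alphabetical order, separated by tabs
worddictlist = list(worddict.keys())  # collect all the words
worddictlist.sort()  # sort (uppercase sorts before lowercase, which is a known problem)
worddict2 = {}
for k in worddictlist:
    worddict2[k] = worddict[k]  # dictionary in sorted order
print("Word\tCount")
for m in worddict2:  # print each entry
    tabs = (23 - len(m)) // 8  # number of tabs depends on word length; if pasting into a spreadsheet, change this line to tabs=2
    print("%s%s%d" % (m, "\t" * tabs, worddict[m]))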
Note: the part in bold is the passage to be counted; replace it with your own. The output I get is:
American 1
Americans 9
Center 2
China 10
China's 1
Chinese 1
Daly 2
Editor's 1
First 1
Gallup 1
His 1
Institute 1
It 1
Kissinger 1
Lipin 1
Looking 1
Many 1
Michael 1
Most 2
Note 1
Pew 1
Primary 1
Rather 1
Reporter 1
Research 1
Robert 1
S 2
Second 1
States 3
That 1
To 1
Two 1
U 2
United 3
VOA 2
VOA's 1
Washington-based1
We 1
What 1
Wilson 1
a 10
about 6
across 1
affairs 1
all 1
also 1
among 1
an 1
analysts 2
and 5
answers 1
appear 1
are 1
as 2
asked 1
assignment 1
at 2
attitudes 1
base 1
be 2
because 1
begins 1
beings 1
better 1
but 1
by 2
came 1
can 1
care 1
chief 1
clearer 1
common 1
considered 1
consume 1
country 1
cuts 1
director 1
distance 1
do 3
drives 1
economic 2
effort 1
events 1
expect 1
expense 1
factors 1
familiarity 1
feelings 1
foreign 1
from 1
future 1
have 4
helps 1
history 1
human 1
identified 1
in 5
interest 1
interviewed 1
is 1
lack 1
latest 1
letter 1
like 1
listener 1
listener's 1
mail 1
major 1
make 1
many 1
means 1
media 1
military 2
more 1
most 4
negative 1
never 1
not 2
of 10
on 2
one 1
opinion 1
opinions 1
or 1
others 1
otherwise 1
our 1
part 1
past 1
perceived 1
perceptions 2
picture 1
pose 1
present 2
providing 1
public 1
published 1
question 2
range 1
recent 1
regularly 1
reports 1
respondents 1
rival 1
say 2
says 2
series 1
short 1
should 1
simple 1
some 1
strong 1
survey 1
surveyed 1
surveys 1
take 1
than 2
that 2
the 16
their 2
them 1
they 1
think 2
those 2
threat 3
through 1
time 1
to 7
toward 2
traveled 1
understand 2
understanding 1
unfavorable 1
unfavorably 1
us 1
variety 1
view 2
views 1
we 2
what 2
who 1
whole 1
why 1
wide 1
widely-publicized1
with 1
world 1
worried 1
year 1
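(The columns should line up, but the alignment breaks here.)
Note: known issues that are hard to fix for now
1. Case: the script cannot tell which words must always be capitalized and which are capitalized only at the start of a sentence
2. 's: a word containing 's can currently only be counted as a separate word of its own
3. Sorting: it is hard to sort by number of occurrences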
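These three issues can be partly worked around with the standard library. The sketch below is one possible simplification, not a full fix: it lowercases everything (so proper nouns lose their capitals), strips a trailing 's (so "it's" also becomes "it"), and sorts by count. The function name count_words and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def count_words(text, strip_possessive=True):
    """Count words case-insensitively; optionally fold "China's" into "china"."""
    text = text.lower().replace("’", "'")
    # A word is a run of letters, optionally joined by hyphens or apostrophes
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text)
    if strip_possessive:
        words = [w[:-2] if w.endswith("'s") else w for w in words]
    return Counter(words)


sample = "Most Americans view China unfavorably. China's history: most analysts agree."
counts = count_words(sample)
# Sort by count descending, then alphabetically
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for word, n in ranked:
    print(f"{word:<20}{n}")
```

Padding with a fixed width in the f-string also avoids the tab arithmetic that misaligns long words such as Washington-based.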