Counting Chinese word frequency with Python
Ⅰ Counting word frequency with Python
def statistics(astr):
    # split one line of the (already tokenized) file on tabs
    slist = astr.split("\t")
    # keep only the first occurrence of each token within the line
    alist = []
    [alist.append(i) for i in slist if i not in alist]
    # strip the trailing newline from the last token
    alist[-1] = alist[-1].replace("\n", "")
    return alist

if __name__ == "__main__":
    code_doc = {}
    with open("test_data.txt", "r", encoding='utf-8') as fs:
        for ln in fs.readlines():
            l = statistics(ln)
            for t in l:
                if t not in code_doc:
                    code_doc[t] = 1
                else:
                    code_doc[t] += 1
    for key in code_doc:
        print(key + ' ' + str(code_doc[key]))
Ⅱ How to count the number of times a specific Chinese word appears in a txt file with Python
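A minimal sketch, assuming the file is UTF-8 encoded; the file name and the target word are placeholders:

import jieba

# 'file.txt' and the target word '中国' are placeholder assumptions
with open("file.txt", "r", encoding="utf-8") as f:
    text = f.read()

# simple substring count (may over-count when the word is part of a longer word)
print(text.count("中国"))

# a more precise variant: segment with jieba first, then count exact tokens
tokens = jieba.lcut(text)
print(tokens.count("中国"))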
Ⅲ Counting word frequency with Python and visualizing the result
Go to the ECharts official site; it offers more than 100 kinds of charts to choose from, including word-frequency charts. All you need to do is look at the example code and replace the data in it with your own data. You will of course need to import the ECharts package.
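As a concrete illustration, a minimal sketch using the pyecharts package (the Python wrapper for ECharts is an assumption here; the words and counts are made up):

from pyecharts.charts import Bar

words = ["中国", "发展", "经济"]     # placeholder words
counts = [120, 85, 60]              # placeholder frequencies

bar = Bar()
bar.add_xaxis(words)
bar.add_yaxis("frequency", counts)
bar.render("word_freq.html")        # writes an interactive HTML chart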
Ⅳ How to segment text with Python and jieba and count word frequency
#!python3
# -*- coding: utf-8 -*-
import os, codecs
import jieba
from collections import Counter

def get_words(txt):
    seg_list = jieba.cut(txt)            # segment the text with jieba
    c = Counter()
    for x in seg_list:
        if len(x) > 1 and x != '\r\n':   # skip single characters and line breaks
            c[x] += 1
    print('常用词频度统计结果')
    for (k, v) in c.most_common(100):    # the 100 most common words
        print('%s%s%s%d' % (' ' * (5 - len(k)), k, '*' * int(v / 3), v))

if __name__ == '__main__':
    with codecs.open('19d.txt', 'r', 'utf8') as f:
        txt = f.read()
    get_words(txt)
Ⅴ How to segment a Chinese article with Python and count word frequency
Function
[root@skatedb55 ~]# vi op_log_file.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
#Author:Skate
import os,time
def op_log(log):
    f=file(log_file,'a')
    date=time.strftime('%Y-%m-%d %H:%M:%S')
    record = '%s %s\n' %(date,log)
    ...
Ⅵ Counting and displaying the high-frequency words of a novel with Python
Use jieba for word segmentation and the wordcloud package to build the word cloud.
# read the file contents
file = 'd:/艾萨克·阿西莫夫/奇妙的航程.TXT'
f = open(file, 'r', encoding='gbk')
text = f.read()
f.close()

# segment with jieba, because wordcloud uses spaces to recognize word boundaries
import jieba
text = ' '.join(jieba.cut(text))

# mask image; a single-color image is enough
from scipy.misc import imread    # removed in newer SciPy; imageio.imread works there
color_mask = imread('D:/Pictures/7218.png')

# build the WordCloud object; since the text is Chinese, specify a Chinese font,
# otherwise the output may be garbled. WordCloud has many parameters; see the package docs.
import wordcloud
w = wordcloud.WordCloud(font_path='C:/Windows/Fonts/msyh.ttc',
                        max_words=100,
                        mask=color_mask)

# feed it the space-separated string
w.generate(text)
# write the image
w.to_file('d:/img1.png')
Ⅶ A txt file has already been segmented with jieba; how can I use Python to compute word frequencies for the segmented file? Looking for a script.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os, random

# assume the file to read is named aa.txt and sits in the current directory
filename = 'aa.txt'
dirname = os.getcwd()
f_n = os.path.join(dirname, filename)

# the commented-out block below is for testing the script: it generates 20 lines,
# each with 1-20 random numbers, every number a random value between 1 and 20
'''
test = ''
for i in range(20):
    for j in range(random.randint(1, 20)):
        test += str(random.randint(1, 20)) + ' '
    test += '\n'
with open(f_n, 'w') as wf:
    wf.write(test)
'''

with open(f_n) as f:
    s = f.readlines()

# strip the leading/trailing spaces and newline of every line,
# split on spaces, and collect everything into one flat list
words = []
for line in s:
    words.extend(line.strip().split(' '))

# format one output row: the first and last fields take 8 columns, the middle one 18
def geshi(a, b, c):
    return alignment(str(a)) + alignment(str(b), 18) + alignment(str(c)) + '\n'

# mixed Chinese/English alignment, see http://bbs.fishc.com/thread-67465-1-1.html (2nd post);
# plain format() misaligns columns when Chinese characters and ASCII are mixed
def alignment(str1, space=8, align='left'):
    length = len(str1.encode('gb2312'))
    space = space - length if space >= length else 0
    if align in ['left', 'l', 'L', 'Left', 'LEFT']:
        str1 = str1 + ' ' * space
    elif align in ['right', 'r', 'R', 'Right', 'RIGHT']:
        str1 = ' ' * space + str1
    elif align in ['center', 'c', 'C', 'Center', 'CENTER', 'centre']:
        str1 = ' ' * (space // 2) + str1 + ' ' * (space - space // 2)
    return str1

w_s = geshi('序号', '词', '频率')
# build a list of (word, count) tuples, multi-level sort: count descending, then word ascending
wordcount = sorted([(w, words.count(w)) for w in set(words)], key=lambda l: (-l[1], l[0]))
# each output row is: index (8 columns) word (18 columns) count (8 columns) + '\n';
# index = wordcount.index(element) + 1
for (w, c) in wordcount:
    w_s += geshi(wordcount.index((w, c)) + 1, w, c)

# write the result to ar.txt
writefile = 'ar.txt'
w_n = os.path.join(dirname, writefile)
with open(w_n, 'w') as wf:
    wf.write(w_s)
Ⅷ Python data mining: text analysis
Author | zhouyue65
Source | 君泉计量
Text mining: the process of extracting valuable knowledge from large amounts of text data and using that knowledge to reorganize information.
1. Corpus
A corpus is the collection of all the documents we want to analyze.
2. Chinese word segmentation
2.1 Concept:
Chinese word segmentation: cutting a sequence of Chinese characters into individual words.
e.g. 我的家乡是广东省湛江市 --> 我/的/家乡/是/广东省/湛江市
Stop words:
During data processing, certain characters or words need to be filtered out:
√ Overly common words, such as "web", "website", etc.
√ Modal particles, adverbs, prepositions, conjunctions and the like, such as 的, 地, 得.
2.2 Installing the jieba segmentation package:
The simplest way is to install it directly from CMD: type pip install jieba. On my machine, however, that did not seem to work.
I later downloaded jieba 0.39 from https://pypi.org/project/jieba/#files, unpacked it into Python36\Lib\site-packages, then ran pip install jieba from cmd again, and this time it installed successfully; I am not sure why.
I also installed jieba in the Anaconda environment: first put the unpacked jieba 0.39 folder under Anaconda3\Lib, then typed pip install jieba in the Anaconda Prompt.
2.3 Hands-on code:
jieba's most important method is cut:
The jieba.cut method takes two parameters:
1) the string to be segmented;
2) the cut_all parameter, which controls whether full mode is used.
The jieba.cut_for_search method takes one parameter, the string to be segmented. It is suited to building inverted indexes for search engines and segments at a finer granularity.
Note: the string to be segmented can be a GBK string, a UTF-8 string, or Unicode.
Both jieba.cut and jieba.cut_for_search return an iterable generator: you can use a for loop to get every segmented word (unicode), or convert the result to a list with list(jieba.cut(...)). Code example (segmentation):
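A minimal sketch that produces output of the kind shown below; the two input sentences are assumed from that output:

import jieba

# precise mode (the default); the sentences are assumptions inferred from the output
print(' '.join(jieba.cut('我爱Python')))
print(' '.join(jieba.cut('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作')))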
The output is: 我 爱
Python
工信处
女干事
每月 经过 下属 科室 都 要 亲口
交代
24 口 交换机 等 技术性 器件 的 安装
工作
Using segmentation in specialized domains:
Terms such as 真武七截阵 and 天罡北斗阵 get split into several words. To improve this, we load a custom dictionary.
However, if many words need to be added, calling jieba.add_word() for each one is inefficient.
Instead we can use jieba.load_userdict('D:PDM2.2金庸武功招式.txt') to load an entire dictionary at once; the txt file contains one term per line.
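A minimal sketch of both approaches; the example terms come from the text above and the dictionary file name is a placeholder:

import jieba

# add a few terms one by one (fine for a handful of words)
jieba.add_word('真武七截阵')
jieba.add_word('天罡北斗阵')

# or load a whole user dictionary at once, one term per line in the txt file
# ('userdict.txt' is a placeholder path)
jieba.load_userdict('userdict.txt')

print(jieba.lcut('真武七截阵和天罡北斗阵'))   # the custom terms now stay whole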
2.3.1 Segmenting a large number of articles
First build the corpus:
After segmentation we also need to keep track of which article each word came from.
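A minimal sketch of this step, assuming the articles are .txt files under a placeholder corpus/ directory and that pandas is available:

import os
import jieba
import pandas as pd

file_paths, contents = [], []
for root, dirs, files in os.walk('corpus'):        # 'corpus' is a placeholder directory
    for name in files:
        if name.endswith('.txt'):
            path = os.path.join(root, name)
            with open(path, encoding='utf-8') as f:
                file_paths.append(path)
                contents.append(f.read())

# the corpus: one row per article
corpus_df = pd.DataFrame({'filePath': file_paths, 'content': contents})

# segment every article, remembering which file each word came from
rows = []
for _, article in corpus_df.iterrows():
    for word in jieba.cut(article['content']):
        rows.append({'filePath': article['filePath'], 'word': word})
seg_df = pd.DataFrame(rows)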
3. Word frequency statistics
3.1 Term frequency:
The number of times a given word appears in a document.
3.2 Counting word frequency with Python
3.2.1 Another way to remove stop words: add an if check
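A minimal sketch of the if-based filtering; the stop-word set and the sample sentence are placeholders:

import jieba

stop_words = {'的', '了', '是', '在'}              # a placeholder stop-word set

words = []
for word in jieba.cut('我的家乡是广东省湛江市'):
    if word not in stop_words:                     # the if check drops stop words
        words.append(word)
print(words)                                       # ['我', '家乡', '广东省', '湛江市']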
Some common methods used in the code (illustrated in the sketch after this list):
Group-by counting:
Checking whether the values of a DataFrame column are contained in a given array:
Negation (of boolean values):
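A minimal sketch of these three operations with pandas; the seg_df layout and the stop-word list are assumptions:

import pandas as pd

seg_df = pd.DataFrame({
    'filePath': ['a.txt', 'a.txt', 'b.txt', 'b.txt'],
    'word':     ['家乡', '的', '家乡', '湛江市'],
})
stop_words = pd.Series(['的', '了', '是'])

# group-by counting: frequency of every word
word_counts = seg_df.groupby('word').size().sort_values(ascending=False)

# are the values of the 'word' column contained in the stop-word array?
is_stop = seg_df['word'].isin(stop_words)

# negating the boolean mask keeps only the non-stop-words
clean_df = seg_df[~is_stop]
print(word_counts)
print(clean_df)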
4. Drawing a word cloud
Word cloud: a visualization that highlights the high-frequency words of a text ("keyword rendering"), filtering out the bulk of the text so that a reader can grasp its theme at a glance.
4.1 Installing the word cloud package
This address, https://www.lfd.uci.edu/~gohlke/pythonlibs/, lists essentially every Python library; download the build that matches your system and Python version.
Installation under plain Python is easy; under Anaconda it took some effort, and it only succeeded after I put the word cloud file under C:\Users\Administrator.
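A minimal sketch of drawing a word cloud from a frequency dictionary; the font path and the frequencies are placeholders:

import wordcloud

freq = {'家乡': 20, '广东省': 12, '湛江市': 8}    # placeholder frequencies

wc = wordcloud.WordCloud(font_path='C:/Windows/Fonts/msyh.ttc',   # a Chinese font
                         background_color='white')
wc.generate_from_frequencies(freq)
wc.to_file('wordcloud.png')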
5. Beautifying the word cloud (shaping it with an image mask)
6. Keyword extraction
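A minimal sketch of keyword extraction with jieba's analyse module; the input text and the topK value are assumptions:

import jieba.analyse

text = '我的家乡是广东省湛江市,湛江市是一座美丽的海滨城市'   # placeholder text

# extract the top 5 keywords together with their TF-IDF weights
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)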
7. Implementing keyword extraction
Term frequency (TF): the number of times a given word appears in the document.
Formula: TF = the number of times the word appears in the document
Inverse document frequency (IDF): the weight of each word; it is inversely related to how common the word is.
Formula: IDF = log(total number of documents / (number of documents containing the word + 1))
TF-IDF: a measure of whether a word is a keyword; the larger the value, the more likely the word is a keyword.
Formula: TF-IDF = TF * IDF
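As a small worked example with made-up numbers: if a word appears 3 times in a document, and 9 out of a corpus of 1000 documents contain it, then IDF = log(1000 / (9 + 1)) ≈ 4.61 (natural logarithm) and TF-IDF = 3 * 4.61 ≈ 13.8.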
7.1 Document vectorization
7.2 Hands-on code
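A minimal sketch of vectorizing documents and computing TF-IDF, assuming scikit-learn is available and using jieba.lcut as the tokenizer; the sample documents are placeholders:

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    '我的家乡是广东省湛江市',
    '湛江市是一座美丽的海滨城市',
    '广东省位于中国南部',
]   # placeholder documents

# vectorize the documents: each row is a document, each column a word's TF-IDF weight
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut)
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the vocabulary
print(tfidf.toarray())                      # the TF-IDF matrix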
Ⅸ Looking at Python code that counts Chinese word frequency, there is one part I don't understand
First, one concept needs to be explained: in GBK encoding, the 'length' of one Chinese character is 2.
str = '中国'  # a GBK-encoded string
To get the character '中', you need the slice str[0:2], not the index str[0].
Taking z4 as an example, the lines below behave as follows.
x = '同舟共济与时俱进艰苦奋斗'
i += z4.findall(x)       # returns ['同舟共济', '与时俱进', '艰苦奋斗']
i += z4.findall(x[2:])   # returns ['舟共济与', '时俱进艰']
i += z4.findall(x[4:])   # returns ['共济与时', '俱进艰苦']
i += z4.findall(x[6:])   # returns ['济与时俱', '进艰苦奋']
The goal is to collect every run of 4 consecutive Chinese characters.
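The question does not show how z4 is defined. As a Python 3 equivalent of the idea (strings are Unicode there, so one character has length 1 and the shift is 1 instead of 2), a sketch with an assumed regular expression:

import re

# match runs of exactly 4 consecutive Chinese characters (an assumed definition of z4)
z4 = re.compile('[\u4e00-\u9fa5]{4}')

x = '同舟共济与时俱进艰苦奋斗'
i = []
for offset in range(4):      # shift by one character at a time
    i += z4.findall(x[offset:])
print(i)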
Ⅹ How to count the frequency of words with Python
Code:
passage="""Editor’s Note: Looking through VOA's listener mail, we came across a letter that asked a simple question. "What do Americans think about China?" We all care about the perceptions of others. It helps us better understand who we are. VOA Reporter Michael Lipin begins a series providing some answers to our listener's question. His assignment: present a clearer picture of what Americans think about their chief world rival, and what drives those perceptions.
Two common American attitudes toward China can be identified from the latest U.S. public opinion surveys published by Gallup and Pew Research Center in the past year.
First, most of the Americans surveyed have unfavorable opinions of China as a whole, but do not view the country as a threat toward the United States at the present time.
Second, most survey respondents expect China to pose an economic and military threat to the United States in the future, with more Americans worried about the perceived economic threat than the military one.
Most Americans view China unfavorably
To understand why most Americans appear to have negative feelings about China, analysts interviewed by VOA say a variety of factors should be considered. Primary among them is a lack of familiarity.
"Most Americans do not have a strong interest in foreign affairs, Chinese or otherwise," says Robert Daly, director of the Kissinger Institute on China and the United States at the Washington-based Wilson Center.
Many of those Americans also have never traveled to China, in part because of the distance and expense. "That means that like most human beings, they take short cuts to understanding China," Daly says.
Rather than make the effort to regularly consume a wide range of U.S. media reports about China, analysts say many Americans base their views on widely-publicized major events in China's recent history."""
passage = passage.replace(",", " ").replace(".", " ").replace(":", " ").replace("’", "'") \
    .replace('"', " ").replace("?", " ").replace("!", " ").replace("\n", " ")  # turn punctuation into spaces
passagelist = passage.split(" ")       # split into individual words
pc = passagelist.copy()                # make a copy
for i in range(len(pc)):
    pi = pc[i]                         # the current string
    if pi.count(" ") == len(pi):       # if it is nothing but spaces (or empty)
        passagelist.remove(pi)         # remove it
worddict = {}
for j in range(len(passagelist)):
    pj = passagelist[j]                # the current word
    if pj not in worddict:             # not counted yet
        worddict[pj] = 1               # add it with a count of 1
    else:                              # already counted
        worddict[pj] += 1              # increase the count by 1
output = ""                            # output in alphabetical order, tab-separated
worddictlist = list(worddict.keys())   # all the words
worddictlist.sort()                    # sort them (case causes some oddities)
worddict2 = {}
for k in worddictlist:
    worddict2[k] = worddict[k]         # the sorted dictionary
print("单词 次数")
for m in worddict2:                    # print every entry
    tabs = (23 - len(m)) // 8          # tabs based on word length; if you paste the output into a spreadsheet, change this line to tabs = 2
    print("%s%s%d" % (m, "\t" * tabs, worddict[m]))
Note: the passage assigned to the passage variable above is the text to analyze; replace it with your own. My output looks like this:
American 1
Americans 9
Center 2
China 10
China's 1
Chinese 1
Daly 2
Editor's 1
First 1
Gallup 1
His 1
Institute 1
It 1
Kissinger 1
Lipin 1
Looking 1
Many 1
Michael 1
Most 2
Note 1
Pew 1
Primary 1
Rather 1
Reporter 1
Research 1
Robert 1
S 2
Second 1
States 3
That 1
To 1
Two 1
U 2
United 3
VOA 2
VOA's 1
Washington-based1
We 1
What 1
Wilson 1
a 10
about 6
across 1
affairs 1
all 1
also 1
among 1
an 1
analysts 2
and 5
answers 1
appear 1
are 1
as 2
asked 1
assignment 1
at 2
attitudes 1
base 1
be 2
because 1
begins 1
beings 1
better 1
but 1
by 2
came 1
can 1
care 1
chief 1
clearer 1
common 1
considered 1
consume 1
country 1
cuts 1
director 1
distance 1
do 3
drives 1
economic 2
effort 1
events 1
expect 1
expense 1
factors 1
familiarity 1
feelings 1
foreign 1
from 1
future 1
have 4
helps 1
history 1
human 1
identified 1
in 5
interest 1
interviewed 1
is 1
lack 1
latest 1
letter 1
like 1
listener 1
listener's 1
mail 1
major 1
make 1
many 1
means 1
media 1
military 2
more 1
most 4
negative 1
never 1
not 2
of 10
on 2
one 1
opinion 1
opinions 1
or 1
others 1
otherwise 1
our 1
part 1
past 1
perceived 1
perceptions 2
picture 1
pose 1
present 2
providing 1
public 1
published 1
question 2
range 1
recent 1
regularly 1
reports 1
respondents 1
rival 1
say 2
says 2
series 1
short 1
should 1
simple 1
some 1
strong 1
survey 1
surveyed 1
surveys 1
take 1
than 2
that 2
the 16
their 2
them 1
they 1
think 2
those 2
threat 3
through 1
time 1
to 7
toward 2
traveled 1
understand 2
understanding 1
unfavorable 1
unfavorably 1
us 1
variety 1
view 2
views 1
we 2
what 2
who 1
whole 1
why 1
wide 1
widely-publicized1
with 1
world 1
worried 1
year 1
(The columns should be aligned; they got scrambled when pasted here.)
Note: limitations that are hard to fix at the moment
1. Case: there is no way to tell which words must be capitalized and which are only capitalized because they start a sentence.
2. 's: currently a word containing 's is counted as part of a single word.
3. Sorting: it is difficult to sort the output by number of occurrences.