Extracting keywords with Python

Published: 2022-04-02 09:46:59

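Ⅰ How to extract all consecutive words from text in Python

Extracting keywords from text is a common step in text analysis with Python. In practice the corpus is usually large, so a single process is slow and it is worth considering multiple processes.
Python's multiprocessing module is all that is needed. When there are many tasks, use the module's process pool, Pool, and hand the work to its worker processes asynchronously with apply_async.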
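
The Pool plus apply_async pattern described above can be sketched as follows. This is only an illustration under stated assumptions: top_words is a hypothetical stand-in for the real keyword extractor (it counts whitespace-separated tokens so the sketch runs without any third-party package), and extract_all and the sample lines are made up for the example.

```python
from collections import Counter
from multiprocessing import Pool


def top_words(line, k=3):
    # Hypothetical stand-in for a real keyword extractor: return the k most
    # frequent whitespace-separated tokens of one line.
    return [word for word, _ in Counter(line.split()).most_common(k)]


def extract_all(lines, workers=2):
    # Submit one task per line without blocking, then collect the results.
    with Pool(processes=workers) as pool:
        pending = [pool.apply_async(top_words, (line,)) for line in lines]
        # .get() blocks until that task has finished; reading the handles in
        # submission order keeps the results aligned with the input lines.
        return [task.get() for task in pending]


if __name__ == "__main__":
    docs = ["a b a c a b", "x y x z x"]
    print(extract_all(docs))
```

apply_async returns a result handle immediately, which is what makes the submission non-blocking; pool.map is a shorter alternative when every task takes the same kind of argument and all results are wanted in order.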

Test corpus: message.txt holds 581 lines of text, 7 MB of data in total; 100 keywords are extracted from each line.
The code is as follows:

# coding: utf-8
import sys
import time
import random
import multiprocessing as mp
from multiprocessing import Pool

import jieba.analyse

# Load the custom stop-word list used during keyword extraction.
jieba.analyse.set_stop_words("yy_stop_words.txt")


def extract_keyword(input_string):
    # Return the top 100 keywords of one document.
    return jieba.analyse.extract_tags(input_string, topK=100)


def parallel_extract_keyword(input_string):
    # Same extraction, used as the task function for the process pool.
    tags = jieba.analyse.extract_tags(input_string, topK=100)
    # time.sleep(random.random())
    return tags


if __name__ == "__main__":
    data_file = sys.argv[1]
    # One document per line; the file is assumed to be UTF-8 encoded.
    with open(data_file, encoding="utf-8") as f:
        lines = f.readlines()

    # Output path for the keywords (not written in this timing test).
    out_put = data_file.split('.')[0] + "_tags.txt"

    # Serial baseline.
    t0 = time.time()
    for line in lines:
        extract_keyword(line)
    print("Serial processing took {t}s".format(t=time.time() - t0))

    # Parallel run on about 70% of the CPU cores (at least one process).
    pool = Pool(processes=max(1, int(mp.cpu_count() * 0.7)))
    t1 = time.time()
    # pool.map keeps the results in input order, which makes them easy
    # to write to a file afterwards.
    res = pool.map(parallel_extract_keyword, lines)
    pool.close()
    pool.join()
    print("Parallel processing took {t}s".format(t=time.time() - t1))

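Run:
python data_process_by_multiprocess.py message.txt
message.txt holds one document per line, 581 lines and 7 MB of data in total.

Running time:

Not suspending the processes with sleep, that is, commenting out time.sleep(random.random()), saves a great deal of running time.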

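Ⅱ How to do keyword extraction with Python

What exactly is the keyword?
Plain string matching is enough.
For HTML, use BeautifulSoup or regular expressions.
JSON is even simpler.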
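
As a rough illustration of the two standard-library cases mentioned above, the sketch below checks a text for known keywords with plain substring matching and reads a field out of a JSON string. The sample text, the keyword list and the JSON payload are all made up for the example.

```python
import json

# Plain string matching: keep the keywords that occur in the text.
text = "Python makes keyword extraction simple"
keywords = ["keyword", "regex"]
found = [k for k in keywords if k in text]
print(found)

# JSON: parse the string, then index into the resulting dict.
payload = json.loads('{"title": "demo", "tags": ["python", "jieba"]}')
tags = payload["tags"]
print(tags)
```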

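Ⅲ How to look up the keywords in Python

1. In all the time I have used Python I have never needed to look up its keywords; there are so few of them that you remember them after reading the list a few times. Besides, whether you write code in vim, gedit or PyCharm, keywords are highlighted in a different color.

2. In interactive mode, the following was tried and works. The module is called __builtin__ in Python 2 and builtins in Python 3, and it lists the built-in names rather than the reserved words:

import builtins
dir(builtins)
help(builtins)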
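
For the reserved words themselves, the standard library's keyword module is the more direct route. A minimal sketch:

```python
import keyword

# kwlist is the list of reserved words for the running interpreter.
print(keyword.kwlist)

# iskeyword answers the question for a single name.
print(keyword.iskeyword("for"))    # a reserved word
print(keyword.iskeyword("print"))  # a built-in function, not a reserved word in Python 3
```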

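Ⅳ How to use Python to extract the whole Excel rows that contain any of several keywords

No data was provided, so a few made-up records are used to show the usual approach.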

a = ['PGSC1', 'PGSC3', 'PGSC6', 'PGSC7']

b = [['PGSC1', 'A', 555], ['PGSC2', 'B', 988], ['PGSC3', 'C', 7666], ['PGSC7', 'P', 8767], ['PGSC1', 'A', 567]]

data = []

for x in a:
    for y in b:
        if x == y[0]:
            data.append(y)

print(data)

Of course, numpy or pandas would make this more convenient.
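
One possible variant of the loop above, still using only built-in types: put the keywords in a set and filter the rows in a single pass. Unlike the nested loop, which groups the result by keyword, this keeps the rows in their original order. The names keys, rows and matched are chosen for the example; the data is the same made-up sample.

```python
keys = {'PGSC1', 'PGSC3', 'PGSC6', 'PGSC7'}

rows = [['PGSC1', 'A', 555], ['PGSC2', 'B', 988], ['PGSC3', 'C', 7666],
        ['PGSC7', 'P', 8767], ['PGSC1', 'A', 567]]

# Set membership is a constant-time check, so each row is inspected once.
matched = [row for row in rows if row[0] in keys]
print(matched)
```

With pandas, the same filter is usually written with Series.isin on the key column after the sheet has been loaded into a DataFrame.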

Ⅴ Extracting text keywords with Python and outputting the link-id

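Ⅵ How to extract keywords with Python

Hello. The pattern r'.*?(' + lste + ').*?' also matches the text before and after your keyword, so when the keyword occurs several times you get that repeated-occurrence error.
You can simply use
hh = re.findall(lste, gg)
Or is there something else you need to match? It looks as if you want to join the results afterwards, but everything you match is a keyword, so joining them directly just gives you several keywords concatenated.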
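
To make the difference concrete, the sketch below applies both patterns to a made-up string. The names lste and gg follow the answer above; their values are assumptions for illustration, since the original question's data is not shown.

```python
import re

gg = "python keyword extraction: python makes keyword search easy"
lste = "python|keyword"  # hypothetical alternation of the keywords to find

# Matching the keywords directly returns one entry per occurrence.
hh = re.findall(lste, gg)
print(hh)

# With lazy wildcards around the group, each whole match also swallows
# the text that precedes the keyword.
whole = [m.group(0) for m in re.finditer(r'.*?(' + lste + r').*?', gg)]
print(whole)
```

If the keywords come from user input, passing each one through re.escape before joining them with | avoids accidental regex metacharacters.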

Ⅶ In Python, using jieba word segmentation to load text automatically, segment it and extract keywords: could someone provide a script?

# -*- coding: UTF-8 -*-

import jieba

__author__ = 'lpe234'


# Full mode (cut_all=True) lists every word the dictionary can find.
seg_list = jieba.cut("我來到北京天安門", cut_all=True)
print(','.join(seg_list))

...
Loading model from cache /var/folders/sv//T/jieba.cache
我,來到,北京,天安,天安門
Loading model cost 0.433 seconds.
.

Process finished with exit code 0

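Ⅷ How to extract the sentences that contain keywords in Python

High-frequency word extraction:
#!/usr/bin/python3
# coding: utf-8

import jieba.analyse

jieba.load_userdict('dict.txt')  # dict.txt: custom dictionary

# kw.txt is assumed to be UTF-8 encoded.
with open('kw.txt', encoding='utf-8') as f:
    content = f.read()
tags = jieba.analyse.extract_tags(content, topK=10)  # topK: number of high-frequency words to return
print("\n".join(tags))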
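
The snippet above returns the keywords but not the sentences they occur in. One way to get from a keyword list to the matching sentences is sketched below with the standard library only. The function name sentences_with_keywords, the choice of sentence-ending punctuation and the sample text are assumptions made for the example; in practice the keyword list could be the tags produced above.

```python
import re


def sentences_with_keywords(text, keywords):
    # Split right after Chinese or Western sentence-ending punctuation
    # (zero-width split, requires Python 3.7 or later).
    parts = re.split(r'(?<=[。！？!?])', text)
    sentences = [p.strip() for p in parts if p.strip()]
    # Keep a sentence if at least one keyword occurs in it.
    return [s for s in sentences if any(k in s for k in keywords)]


text = "我來到北京。今天天氣很好！我喜歡天安門。"
print(sentences_with_keywords(text, ["北京", "天安門"]))
```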

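Ⅸ Extracting multiple words from multiple variables in Python

Form of the raw data: put the articles whose keywords are to be extracted in one folder. Then use Python to read the name of each file, so that it can be matched with its keywords later; read each file's contents and preprocess them by removing letters, digits and underscores; extract the keywords of each article; and write the results to a CSV file.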
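
The steps above could be sketched roughly as follows. Everything named here is an assumption for illustration: the functions clean and keywords_to_csv, the .txt file pattern, the CSV column names, and the extract parameter, which is any callable that maps a string to a list of keywords (for example a small wrapper around jieba.analyse.extract_tags).

```python
import csv
import re
from pathlib import Path


def clean(text):
    # Preprocessing step: drop ASCII letters, digits and underscores.
    return re.sub(r'[A-Za-z0-9_]+', '', text)


def keywords_to_csv(folder, out_csv, extract):
    # Write one CSV row per article: its file name and its keywords.
    with open(out_csv, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['filename', 'keywords'])
        for path in sorted(Path(folder).glob('*.txt')):
            text = clean(path.read_text(encoding='utf-8'))
            writer.writerow([path.name, ' '.join(extract(text))])
```

Passing the extractor in as a parameter keeps the file handling independent of the keyword method, so the same loop works with any extraction function.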
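Python's input function normally returns a single value per call, and that value is a string. To pass in several values, call the string's split method to split the input on spaces; it returns a list.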
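
A minimal sketch of that idea. The helper name parse_values is made up for the example, and a fixed string stands in for the user's input so the snippet runs non-interactively.

```python
def parse_values(raw):
    # str.split() with no argument splits on any run of whitespace
    # and returns the pieces as a list of strings.
    return raw.split()


# Interactive use would be: values = parse_values(input("Enter values: "))
first, second, third = parse_values("10 20 30")
print(first, second, third)
```

The pieces are still strings; convert them with int() or float() if numbers are needed.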

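Ⅹ Extracting keywords with Python and inserting them into a list

import re

# Verbose pattern for one log line: date, time, numeric fields,
# a one-character level, a flag name, then the message content.
patt = re.compile(r'''
    (?P<dt>\d{2}-\d{2})
    \s
    (?P<tm>\d{1,2}:\d{2}:\d{2}\.\d{3})
    \s+
    (\d+\s*)+
    (?P<errorkey>\w)
    \s
    (?P<flag>\w+)
    \s*
    \:\s
    (?P<content>.*)
    ''', re.I | re.U | re.X)

keyset = set()

with open("bug.log") as f:
    # Skip the lines that do not match the pattern.
    for m in filter(None, map(patt.match, f)):
        flagk = m.group('flag')
        print(flagk)
        keyset.add(flagk)

# Print the distinct flags as a quoted, comma-separated list.
print(','.join(['"%s"' % k for k in keyset]))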
