word2vec source code in Python

Published: 2022-04-02 18:47:32

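A. How can word2vec training be paused and resumed in Python?

Word's style gallery contains so many styles that rarely used ones are hidden by default. On the toolbar choose "Home", then click the small icon in the lower-right corner of the "Styles" box to open a long style list. Click "Options" at the bottom right, choose "All styles" in the first drop-down of "Style Pane Options", and click "OK"; every style then appears in that list. There is also no need to hunt for a built-in style: type some text, format it the way you want, select it, right-click, choose "Save Selection as a New Quick Style" under "Styles", and give the new style a name.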

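B. How do I use a model trained with python word2vec in sklearn?

There are two formats for writing Excel tables, xls and csv, but csv is best used sparingly: after you adjust data formats in the sheet, every save asks whether to keep the new format, which is tedious. When reading data, if a specific sheet is specified, the output in PyCharm is misaligned.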

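C. How does word2vec generate word vectors in python?

This is the layer that converts a one-hot vector into a low-dimensional word vector (most people do not call it a layer, but to me it is one), because the input to word2vec is one-hot. A one-hot vector can be seen as a 1*N matrix (N is the total number of words). Multiplying it by the coefficient matrix (N*M, where M is the word2vec vector dimension) gives a 1*M vector, which is the word vector for that word. Each row of the N*M matrix is therefore the word vector of one word. The result then goes into the neural network, and training keeps updating this matrix.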
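The multiplication described above can be checked on a toy example. This is only a sketch: the three-word vocabulary and the numbers in W are made up, and the helper names are hypothetical. It shows that multiplying a one-hot row by the N*M matrix simply picks out one row of that matrix.

```python
# Toy vocabulary and embedding matrix (values invented for illustration).
vocab = ["king", "queen", "apple"]   # N = 3 words
W = [
    [0.1, 0.2],   # row 0: vector for "king"
    [0.3, 0.4],   # row 1: vector for "queen"
    [0.5, 0.6],   # row 2: vector for "apple"
]                                    # N x M matrix with M = 2


def one_hot(index, size):
    """Return a 1 x N one-hot row vector as a plain list."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(row, matrix):
    """Multiply a 1 x N row vector by an N x M matrix."""
    n_cols = len(matrix[0])
    return [
        sum(row[i] * matrix[i][j] for i in range(len(row)))
        for j in range(n_cols)
    ]


idx = vocab.index("queen")
projected = vec_times_matrix(one_hot(idx, len(vocab)), W)
print(projected == W[idx])  # the product equals row `idx` of W
```

This is why implementations usually skip the multiplication and index the row directly.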

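D. Where can the word2vec source code be downloaded for Ubuntu?

Google's hosted copy of the w2v source was shut down long ago... but a bit more searching online still turns it up.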

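E. Computing the similarity between word vectors with word2vec

Here string is the word whose vector you want:
double[] array = vec.getWordVector(string);
array is that word's vector.
When creating vec, make sure to set .minWordFrequency(1); otherwise some words will have no vector. This method sets the minimum frequency a word must have to be kept.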
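The answer stops at fetching a vector. Once two vectors are in hand, the usual similarity measure is the cosine of the angle between them. Below is a minimal sketch in plain Python; the function name is my own and the inputs are assumed to be two equal-length lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
```

Values near 1 mean the two words are used in similar contexts; values near 0 mean they are unrelated.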

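F. Checking whether a word exists in python gensim.models.word2vec

The corpus may be the problem. 6.5M is too little. word2vec is at best weakly supervised, and the predicted word vectors depend heavily on context, so you need a corpus that is tightly focused on your domain to train on.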
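For the question as asked, a membership test against the vocabulary is usually enough. The helper below is a hypothetical sketch: it assumes the object passed in behaves like gensim's KeyedVectors, which exposes the vocabulary as key_to_index in gensim 4.x and as vocab in older releases.

```python
def has_word(keyed_vectors, word):
    """Return True if `word` has a trained vector."""
    # gensim >= 4.0 stores the vocabulary in `key_to_index`;
    # older releases used a dict called `vocab`.
    vocabulary = getattr(keyed_vectors, "key_to_index", None)
    if vocabulary is None:
        vocabulary = getattr(keyed_vectors, "vocab", {})
    return word in vocabulary
```

With a trained gensim model this would be called as has_word(model.wv, "some_word"). A word that was dropped by min_count, or never appeared in the corpus, returns False.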

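G. How to use word2vec with python gensim

The original word2vec code is written in C. A Python version also exists, integrated into the excellent gensim framework.

I use these features in my own open-source semantic network project graph-mind (really just a toy I wrote). You can use the wrappers I built there to run some operations with almost no effort. Below I share how to call them and some notes on the code.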

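1. Some class member variables:

[python]

import multiprocessing

def __init__(self, modelPath, _size=100, _window=5, _minCount=1, _workers=multiprocessing.cpu_count()):
    self.modelPath = modelPath    # file on disk where the model is saved
    self._size = _size            # word-vector dimensionality
    self._window = _window        # context window size
    self._minCount = _minCount    # minimum word frequency
    self._workers = _workers      # number of training workers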

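modelPath is the file on disk where the trained word2vec model is stored (keeping a model only in memory never feels safe). _size is the dimensionality of the word vectors, and _window is the size of the context window scanned during training. _minCount is passed to gensim's min_count: words that occur fewer times than this in the corpus are ignored, and the default is fine to start with. _workers is the number of worker threads used for training (threads rather than processes), defaulting to the number of processor cores on the current machine. For now it is enough to remember these parameters.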
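To make the window and minimum-count parameters concrete, here is a small sketch in plain Python. It is not gensim's implementation: the function names are hypothetical, and real implementations add refinements such as randomly shrinking the window and subsampling frequent words.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Keep only words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def context_pairs(sentence, window):
    """List (center, context) pairs within `window` positions either side."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
kept = prune_vocabulary(sentences, min_count=2)
pairs = context_pairs(["the", "cat", "sat"], window=1)
print(sorted(kept))
print(pairs)
```

With min_count=2 the words that appear only once ("dog", "a", "ran") never get a vector, which is the usual reason a lookup fails on a small corpus.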

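2. Initializing and training the word2vec model for the first time

The core function for this is initTrainWord2VecModel. It takes two parameters, corpusFilePath and safe_model: the path of the training corpus and whether to use "safe mode" for the initial training. Safe mode is explained later; first, the code: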

[python]

def initTrainWord2VecModel(self, corpusFilePath, safe_model=False):
    '''
    Init and train a new w2v model.
    corpusFilePath can be an opened file object, the path of a single
    corpus file, or the path of a directory.
    About safe_model:
    if safe_model is True, the model is trained on the first file and then
    updated file by file; this keeps memory usage safe but is slow.
    If safe_model is False, all sentences are loaded and trained in one go.
    '''
    extraSegOpt().reLoadEncoding()
    fileType = localFileOptUnit.checkFileState(corpusFilePath)
    if fileType == u'error':
        warnings.warn('load file error!')
        return None
    else:
        model = None
        # NOTE: gensim >= 4.0 renames the `size` argument to `vector_size`.
        if fileType == u'opened':
            print('training model from singleFile!')
            model = Word2Vec(LineSentence(corpusFilePath), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'file':
            corpusFile = open(corpusFilePath, u'r')
            print('training model from singleFile!')
            model = Word2Vec(LineSentence(corpusFile), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'directory':
            corpusFiles = localFileOptUnit.listAllFileInDirectory(corpusFilePath)
            print('!')
            if safe_model:
                # Train on the first file, then update with the rest one by one.
                model = Word2Vec(LineSentence(corpusFiles[0]), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
                for path in corpusFiles[1:]:
                    model = self.updateW2VModelUnit(model, path)
            else:
                sentences = self.loadSetencesFromFiles(corpusFiles)
                model = Word2Vec(sentences, size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'other':
            # TODO add sentences list directly
            pass
        if model is None:
            # Nothing was trained (e.g. the unimplemented 'other' branch).
            return None
        model.save(self.modelPath)
        model.init_sims()
        print('producing word2vec model ... ok!')
        return model

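The first part is housekeeping: it determines what kind of thing the input path refers to and handles each type differently, which should be easy to follow. Taking the case where corpusFilePath is an already opened file object, the code that creates the word2vec model is:

[python]

model = Word2Vec(LineSentence(corpusFilePath), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)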

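It really is that simple; the longer version above just makes the code more robust. The concern is that when a path holds a very large number of training documents, loading them all into memory at once may not be reliable. (I did not study in detail whether gensim's Word2Vec constructor accounts for this; it is just habitual caution. LineSentence itself does read a file line by line rather than loading it whole.) So I added a parameter, safe_model, that decides whether the initial training runs in "safe mode". In safe mode only the first corpus file is loaded at the start, and the remaining initial training documents are merged into the model through incremental learning.

In the code above, corpusFilePath can be an already opened file object, the path of a single file, or the path of a directory; checkFileState determines which. Another function, updateW2VModelUnit, incrementally trains and updates the w2v model and is described below. loadSetencesFromFiles loads every sentence of every corpus file in a directory; it is in the source code and is simple, so I will not go into it.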
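An alternative to loading every sentence into a list is to stream them from disk. The class below is a hypothetical sketch, not part of graph-mind or gensim: it assumes each file holds one whitespace-tokenized sentence per line. It is written as a class with __iter__ rather than a generator function because training makes several passes over the corpus, so the iterable has to be restartable.

```python
import os


class DirectorySentences:
    """Stream token lists from every file in a directory, line by line."""

    def __init__(self, dir_path, encoding="utf-8"):
        self.dir_path = dir_path
        self.encoding = encoding

    def __iter__(self):
        # Sorted so that every pass visits the files in the same order.
        for name in sorted(os.listdir(self.dir_path)):
            path = os.path.join(self.dir_path, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding=self.encoding) as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:          # skip blank lines
                        yield tokens
```

Only one line is held in memory at a time, so memory use no longer grows with the size of the corpus directory.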

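3. Incrementally training and updating the word2vec model

One reason for training the w2v model incrementally was mentioned above: to avoid loading the entire training corpus into memory at once. The other is to cope with a corpus that keeps growing. gensim naturally provides a solution, called as follows: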

[python]

def updateW2VModelUnit(self, model, corpusSingleFilePath):
    '''
    Update a w2v model with new corpus (only can be a single file).
    '''
    fileType = localFileOptUnit.checkFileState(corpusSingleFilePath)
    if fileType == u'directory':
        warnings.warn('can not deal a directory!')
        return model
    # NOTE: newer gensim versions also require total_examples and epochs
    # in train(), and build_vocab(..., update=True) to add unseen words.
    if fileType == u'opened':
        trainedWordCount = model.train(LineSentence(corpusSingleFilePath))
        print('update model, update words num is: ' + str(trainedWordCount))
    elif fileType == u'file':
        corpusSingleFile = open(corpusSingleFilePath, u'r')
        trainedWordCount = model.train(LineSentence(corpusSingleFile))
        print('update model, update words num is: ' + str(trainedWordCount))
    else:
        # TODO add sentences list directly (same as last function)
        pass
    return model

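After a simple check of the file type, calling the model object's train method updates the model. It takes the sentences of the new corpus and returns the number of words processed during training (not the number of new vocabulary entries; train by itself does not add unseen words to the vocabulary). When the function finishes, it returns the updated model. In the source code, below this function there is an enhanced method that handles multiple kinds of file arguments (as in section 2), which I will not cover here.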
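The listing above uses the older one-argument train call. As far as I know, in gensim 4.x an incremental update takes two steps: extend the vocabulary, then train with explicit counts. The function below is a sketch under that assumption; the name update_model is my own, and model is assumed to be a gensim Word2Vec instance.

```python
def update_model(model, new_sentences):
    """Continue training `model` on `new_sentences` (a list of token lists)."""
    # Step 1: add words from the new corpus to the existing vocabulary.
    model.build_vocab(new_sentences, update=True)
    # Step 2: train on the new corpus, passing the counts explicitly.
    model.train(
        new_sentences,
        total_examples=len(new_sentences),
        epochs=model.epochs,
    )
    return model
```

Words that first appear in new_sentences are still subject to the model's min_count, so a rare new word may not get a vector even after the update.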

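4. Basic queries

Once you are sure the model is fully trained and will not be updated again, you can lock it. Doing so precomputes the unit-normalized vectors used for similarity queries, which speeds up later queries, but from then on the model is read-only.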

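[python]

def finishTrainModel(self, modelFilePath=None):
    '''
    warning: after this, the model is read-only (can't be update)
    '''
    if modelFilePath is None:
        modelFilePath = self.modelPath
    model = self.loadModelfromFile(modelFilePath)
    model.init_sims(replace=True)
    # Assumed addition: the source listing ends at the line above; returning
    # the locked model lets callers actually query it.
    return model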

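As you can see, the so-called locking method is just init_sims with its replace parameter set to True, which overwrites the raw vectors with their normalized versions so that training cannot continue.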
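Why normalizing once helps can be seen in a few lines of plain Python. This is an illustration with made-up vectors, not gensim code: after every vector is scaled to length 1, the cosine similarity of two words is just their dot product, so no norms need to be recomputed per query.

```python
import math


def unit_normalize(vector):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Made-up raw vectors.
vectors = {"cat": [3.0, 4.0], "dog": [4.0, 3.0]}

# Done once, up front. Replacing `vectors` with `normalized` is the
# equivalent of replace=True: the raw lengths are lost.
normalized = {word: unit_normalize(vec) for word, vec in vectors.items()}

# Each later query is a plain dot product.
similarity = sum(a * b for a, b in zip(normalized["cat"], normalized["dog"]))
print(similarity)
```

The price is that the original vector lengths are gone, which is why the locked model can no longer be trained.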

Next come some query methods for the word2vec model:

[python]

def getWordVec(self, model, wordStr):
    '''
    get the word's vector
    '''
    # gensim >= 4.0: use model.wv[wordStr]
    return model[wordStr]

[python]

def queryMostSimilarWordVec(self, model, wordStr, topN=20):
    '''
    return 2-dim List [0] is word [1] is double-prob
    '''
    # most_similar returns a list of (word, similarity) pairs.
    similarPairList = model.most_similar(wordStr.decode('utf-8'), topn=topN)
    return similarPairList

[python]

def culSimBtwWordVecs(self, model, wordStr1, wordStr2):
    '''
    return double-prob
    '''
    similarValue = model.similarity(wordStr1.decode('utf-8'), wordStr2.decode('utf-8'))
    return similarValue

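These methods are all simple, essentially one line each. In the source code each function is again paired with a version that works from a model file. getWordVec returns the word2vec vector of the query word itself, which prints as a purely numeric array; queryMostSimilarWordVec returns the N words most related to the query word along with their similarities, as a list of (word, similarity) pairs (the comment spells this out); culSimBtwWordVecs returns the similarity of two given words as a float.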

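5. Arithmetic on Word2Vec word vectors

Anyone who has studied the theory behind w2v knows that word vectors support addition and subtraction. gensim provides a method based on this property, called as follows: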

[python]

def queryMSimilarVecswithPosNeg(self, model, posWordStrList, negWordStrList, topN=20):
    '''
    pos - neg
    return 2-dim List [0] is word [1] is double-prob
    '''
    posWordList = []
    negWordList = []
    # py27: decode the incoming byte strings to unicode first.
    for wordStr in posWordStrList:
        posWordList.append(wordStr.decode('utf-8'))
    for wordStr in negWordStrList:
        negWordList.append(wordStr.decode('utf-8'))
    pnSimilarPairList = model.most_similar(positive=posWordList, negative=negWordList, topn=topN)
    return pnSimilarPairList

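Because this is py27, the incoming word lists are first decoded to Unicode (in Python 3, strings are already Unicode and the decode calls are unnecessary). posWordList can be seen as the set of words that contribute positively to the result and negWordList as the set that contributes negatively. Both are passed to most_similar together with the topN of answers to return, and the result has the same form as queryMostSimilarWordVec in section 4. Mathematically, the vectors of the positive words are added, those of the negative words are subtracted, and the words closest to the resulting vector are returned.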
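The arithmetic can be spelled out in plain Python. The sketch below reflects my understanding of what a positive/negative query does (add unit-normalized positive vectors, subtract negative ones, rank the remaining words by cosine similarity); it is not gensim's code, and the toy three-dimensional vectors are invented so that the classic king, man, woman example works.

```python
import math


def _unit(vector):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def most_similar(vectors, positive, negative, topn=20):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(_unit(vectors[word])):
            query[i] += sign * x
    query = _unit(query)
    excluded = set(positive) | set(negative)   # never return the inputs
    scored = [
        (word, sum(q * x for q, x in zip(query, _unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Invented toy vectors: [royalty, male, female].
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}
result = most_similar(toy, positive=["king", "woman"], negative=["man"], topn=1)
print(result[0][0])
```

Here "king" plus "woman" minus "man" lands closest to "queen", which is the behaviour the positive and negative word lists are meant to produce.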

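The next operation is my own invention. Suppose I want to express the relation between two words, or two groups of words, in the same top-N "word, relatedness" form as above. This is how I do it: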

[python]

# The function name was lost in the source listing; this one is a placeholder.
def queryMSimilarVecsBtwWordLists(self, model, wordStrList1, wordStrList2, topN_rev=20, topN=20):
    '''
    relate a src-wordList and a tag-wordList:
    first, use the tag-wordList as neg-wordList to get the rev-wordList,
    then use the src-wordList and the rev-wordList as the new src-tag-wordList
    topN_rev is topN of the rev-wordList query
    '''
    # queryMSimilarVecswithPosNeg decodes its inputs itself, so pass the raw
    # byte strings; decoding twice fails on non-ASCII words under py27.
    revSimilarPairList = self.queryMSimilarVecswithPosNeg(model, [], wordStrList2, topN_rev)
    # most_similar returns unicode words; re-encode them for the second call.
    revWordList = [pair[0].encode('utf-8') for pair in revSimilarPairList]
    stSimilarPairList = self.queryMSimilarVecswithPosNeg(model, wordStrList1, revWordList, topN)
    return stSimilarPairList

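The idea is to first pass one of the two groups as negWordList to queryMSimilarVecswithPosNeg above, obtaining a top-N group of intermediate words, and then run queryMSimilarVecswithPosNeg again with those intermediate words and the other group. The first step yields the inverse result of one group acting as negWordList; combining this inverse result with the other group gives a "two negatives make a positive" effect. In this way a top-N list of "word, relatedness" pairs expresses the relation between the two groups.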

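H. How to load a word2vec model in python

Install the Visio version that matches your Word version; it makes this simple and gives good-looking results. If the flowchart is not too complex, drawing it directly in Word also works. Your problem is caused by a "fixed line spacing" setting or by spacing set before and after the paragraph. Select the text, go to Format > Paragraph, set the line spacing to "Single", and set the spacing before and after the paragraph...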

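I. Problem training a Seq2Seq translation model on data vectorized with Word2Vec

LED drivers need a constant-current power supply, not an ordinary regulated voltage supply.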

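J. Training python word2vec() on Chinese sentences shows the following error; advice appreciated:

A function that has been declared still has to be called. For example:
def fun():
    ...

fun()  # the code inside the function only runs when it is called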
