Removing HTML tags with Python

Published: 2023-07-11 16:43:34

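㈠ Using Python regular expressions to replace special symbols inside an HTML pre tag

There are only 7 symbols in total, so just write 7 replace calls.

Whether or not you use regular expressions hardly matters; there are not many of them.

You can also skip regular expressions entirely: once the page has been parsed, innerText gives you the normal text, while innerHTML gives you the content with the special symbols you describe.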
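
The replacement described above can also be left to the standard library. This is a minimal sketch, assuming Python 3 and assuming the "special symbols" are HTML character references such as &lt; and &amp;; the sample string is made up for illustration.

```python
import html

# Hypothetical sample: text as it might appear in the inner HTML of a <pre> element.
escaped = "if a &lt; b &amp;&amp; b &gt; c: print(&quot;ok&quot;)"

# html.unescape converts named and numeric character references back to characters.
plain = html.unescape(escaped)
print(plain)
```

This avoids maintaining one replace call per symbol by hand, and it also covers numeric references such as &#39;.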

㈡ Which package is good for parsing HTML in Python

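Parsing HTML is the foundation of web scraping: you analyze the fetched result to find the content or tags you want.
HTMLParser is the Python module for parsing HTML (in Python 3 it lives in html.parser). It can pick out the tags, data and so on in an HTML document, which makes it a simple way to process HTML. HTMLParser is event-driven: when it encounters a particular piece of markup, it calls a user-defined function to notify the program. The main callbacks all have names starting with handle_ and are all methods of HTMLParser. To use it, derive a new class from HTMLParser and override the handle_ methods you need. They include: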
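handle_startendtag: handles a tag that is both a start tag and an end tag, such as <xx/>
handle_starttag: handles a start tag, such as <xx>; tag names are case-insensitive
handle_endtag: handles an end tag, such as </xx>
handle_charref: handles numeric character references, which begin with &# and usually represent a character by its code point
handle_entityref: handles named entity references, which begin with &
handle_data: handles data, that is, the text between the tags in <xx>data</xx>
handle_comment: handles comments
handle_decl: handles declarations beginning with <!, such as <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi: handles processing instructions of the form <?instruction>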
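def handle_starttag(self, tag, attrs):
    # Note: tag names are case-insensitive, so an <A ...> tag is matched here too.
    # Like SGMLParser, HTMLParser converts attribute names to lowercase when it builds attrs.
    if tag == 'a':
        for name, value in attrs:
            if name == 'href':
                pass  # value holds the link target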
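
The handle_starttag fragment above can be fleshed out into a small, complete parser. The following is only a sketch, assuming Python 3; the class name LinkCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> matches too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><A HREF="http://www.163.com">163</A> <a href="/about">about</a></p>')
collector.close()
print(collector.links)
```

Storing the results on the instance keeps the callback short and lets the caller read them after feed() returns.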

1. Basic parsing: finding start and end tags

# coding: utf-8
from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser


# The handle_* callbacks overridden here are among those listed above.
class MyHtmlParser(HTMLParser):
    # Handles declarations that begin with <!
    def handle_decl(self, decl):
        print('Encounter some declaration:' + decl)

    def handle_starttag(self, tag, attrs):
        print('Encounter the beginning of a %s tag' % tag)

    def handle_endtag(self, tag):
        print('Encounter the end of a %s tag' % tag)

    # Handles comments
    def handle_comment(self, comment):
        print('Encounter some comments:' + comment)


if __name__ == '__main__':
    a = ('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">'
         '<html><head><!--insert javaScript here!--><title>test</title><body>'
         '<a href="http://www.163.com">Link to 163</a></body></html>')
    m = MyHtmlParser()
    m.feed(a)
    m.close()

Output:

Encounter some declaration:DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"
Encounter the beginning of a html tag
Encounter the beginning of a head tag
Encounter some comments:insert javaScript here!
Encounter the beginning of a title tag
Encounter the end of a title tag
Encounter the beginning of a body tag
Encounter the beginning of a a tag
Encounter the end of a a tag
Encounter the end of a body tag
Encounter the end of a html tag

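㈢ Writing a Python program that tries to parse XML/HTML tags

You need to wrap this text in a root tag and then iterate over the nodes inside it. The name of the root tag is arbitrary (but one must be added); I use root here, and any other name works the same way. If you are reading an XML file directly rather than a string, open the file, pass the file handle to ElementTree.parse(), and then iterate over the root element of the tree it returns (getroot()).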

from xml.etree import ElementTree

parsed = ElementTree.XML('''<root>
<composer>Wolfgang Amadeus Mozart</composer><author>Samuel Beckett</author><city>London</city>
</root>''')
outstr = []
for node in parsed:
    outstr += ['%s: %s' % (node.tag, node.text)]
print('\n'.join(outstr))
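
The file-based variant mentioned in the answer might look like the sketch below. It assumes Python 3, and io.StringIO stands in for a real opened file so that the example is self-contained.

```python
import io
from xml.etree import ElementTree

# io.StringIO is a stand-in for an opened XML file; with a real file, pass the
# object returned by open() (or simply the path) to ElementTree.parse().
xml_file = io.StringIO(
    "<root>"
    "<composer>Wolfgang Amadeus Mozart</composer>"
    "<author>Samuel Beckett</author>"
    "<city>London</city>"
    "</root>"
)

tree = ElementTree.parse(xml_file)  # returns an ElementTree object
root = tree.getroot()               # the <root> element; iterating yields its children
lines = ["%s: %s" % (node.tag, node.text) for node in root]
print("\n".join(lines))
```

Note that parse() returns the tree rather than the root element, which is why getroot() is called before iterating.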

㈣ How Python's json.loads handles strings that contain HTML

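Double quotes inside a JSON string value must be escaped as \". In Python source code, write the text as a raw string (or double the backslash as \\") so that the backslash survives into the text that json.loads receives:

s = r'''[{"level": 1, "value": ["<p>aaa\"b\"ccc</p>"]}]'''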
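
To see the escaping in action, here is a short sketch assuming Python 3; the variable names and the second sample value are made up for illustration.

```python
import json

# Raw string, so the backslashes reach the JSON parser unchanged.
raw = r'''[{"level": 1, "value": ["<p>aaa\"b\"ccc</p>"]}]'''
data = json.loads(raw)
print(data[0]["value"][0])

# Going the other way, json.dumps adds the escaping automatically.
encoded = json.dumps({"value": ['<p class="x">hi</p>']})
print(encoded)
```

When the JSON text is produced by json.dumps rather than typed by hand, the quotes in the HTML are escaped for you.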

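㈤ Stripping unwanted tags and links from text in Python, and extracting the Chinese text and image links

import re and extract with regular expressions; it is simple and convenient.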

import re

text = ''  # the text to extract from
result1 = re.findall(r'([\u4e00-\u9fa5]+)', text)  # extract runs of Chinese characters
result2 = re.findall(r'<img.*?src="(.*?)"[^>]*?>', text, re.S)  # extract image links
print(result1)
print(result2)
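
Because text is left empty above, both lists come out empty. The sketch below runs the same two patterns on a made-up input, assuming Python 3; the markup and URLs are hypothetical.

```python
import re

# Hypothetical input; the original answer leaves `text` empty.
text = ('<p>寧波大學 <a href="http://example.com">校園</a>'
        '<img class="pic" src="http://example.com/a.jpg" alt="x"></p>')

# Runs of characters in the range U+4E00 to U+9FA5 (common CJK ideographs).
chinese = re.findall(r'[\u4e00-\u9fa5]+', text)
# The src attribute of every <img> tag that quotes it with double quotes.
images = re.findall(r'<img.*?src="(.*?)"[^>]*?>', text, flags=re.S)
print(chinese)
print(images)
```

The image pattern only recognizes src values in double quotes; markup that uses single quotes or no quotes would need a parser or a broader pattern.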

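㈥ Removing p tags from text with Python

Python's re.sub() function can do the replacement you are after.

The program is as follows (assuming you read one line at a time from the file into the variable line):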

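import re

line = '<p>寧波大學</p>'

regex = r'</?p>'

# Pass the flag by keyword: the fourth positional argument of re.sub is count, not flags.
result = re.sub(regex, "", line, flags=re.I)

print(result)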
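
The pattern above removes only bare <p> and </p> tags. For stripping every tag, attributes included, one option is the HTMLParser approach from section ㈡. This is a sketch assuming Python 3; TagStripper and strip_tags are hypothetical names.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps the text content and drops every tag."""

    def __init__(self):
        # convert_charrefs=True (the default) delivers entities as plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()
    return "".join(stripper.parts)


print(strip_tags('<P class="intro">寧波大學 &amp; <b>friends</b></P>'))
```

This sketch keeps all text nodes, including the contents of script and style elements, so those would need extra handling on real pages.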
