java爬蟲源碼

發布時間: 2025-04-09 10:23:53

1. 如何用java寫一個知乎爬蟲

下面說明知乎爬蟲的源碼和涉及主要技術點：
（1）程序package組織

（2）模擬登錄（爬蟲主要技術點1）
要爬去需要登錄的網站數據，模擬登錄是必要可少的一步，而且往往是難點。知乎爬蟲的模擬登錄可以做一個很好的案例。要實現一個網站的模擬登錄，需要兩大步驟是：（1）對登錄的請求過程進行分析，找到登錄的關鍵請求和步驟，分析工具可以有IE自帶(快捷鍵F12)、Fiddler、HttpWatcher；（2）編寫代碼模擬登錄的過程。

（3）網頁下載（爬蟲主要技術點2）
模擬登錄後，便可下載目標網頁html了。知乎爬蟲基於HttpClient寫了一個網路連接線程池，並且封裝了常用的get和post兩種網頁下載的方法。

（4）自動獲取網頁編碼（爬蟲主要技術點3）
自動獲取網頁編碼是確保下載網頁html不出現亂碼的前提。知乎爬蟲中提供方法可以解決絕大部分亂碼下載網頁亂碼問題。

（5）網頁解析和提取（爬蟲主要技術點4）
使用Java寫爬蟲，常見的網頁解析和提取方法有兩種：利用開源Jar包Jsoup和正則。一般來說，Jsoup就可以解決問題，極少出現Jsoup不能解析和提取的情況。Jsoup強大功能，使得解析和提取異常簡單。知乎爬蟲採用的就是Jsoup。

（6）正則匹配與提取（爬蟲主要技術點5）
雖然知乎爬蟲採用Jsoup來進行網頁解析，但是仍然封裝了正則匹配與提取數據的方法，因為正則還可以做其他的事情，如在知乎爬蟲中使用正則來進行url地址的過濾和判斷。

（7）數據去重（爬蟲主要技術點6）
對於爬蟲，根據場景不同，可以有不同的去重方案。（1）少量數據，比如幾萬或者十幾萬條的情況，使用Map或Set便可；（2）中量數據，比如幾百萬或者上千萬，使用BloomFilter（著名的布隆過濾器）可以解決；（3）大量數據，上億或者幾十億，Redis可以解決。知乎爬蟲給出了BloomFilter的實現，但是採用的Redis進行去重。

（8）設計模式等Java高級編程實踐
除了以上爬蟲主要的技術點之外，知乎爬蟲的實現還涉及多種設計模式，主要有鏈模式、單例模式、組合模式等，同時還使用了Java反射。除了學習爬蟲技術，這對學習設計模式和Java反射機制也是一個不錯的案例。
4. 一些抓取結果展示

2. java爬蟲讀取某一張指定圖片的url，求解答

package pers.jiaming.download.main;import java.io.*; //io包import java.util.regex.*; //正則包import java.net.*; //網路包/** 下載圖片類* */public final class DownloadPictures implements Runnable{
private URL url = null; //URL private URLConnection urlConn = null; //url連接 private BufferedReader bufIn = null; //緩沖讀取器，讀取網頁信息
private static final String IMG_REG = "<img.*src\\s*=\\s*(.*?)[^>]*?>"; //img標簽正則 private static final String IMG_SRC_REG = "src\\s*=\\s*\"?(.*?)(\"|>|\\s+)"; //img src屬性正則
private String downloadPath = null; //保存路徑
//構造，參數：想要下載圖片的網址、下載到的圖片存放的文件路徑 public DownloadPictures(String urlStr, String downloadPath)
{
createFolder(downloadPath); //創建文件夾
try {
url = new URL(urlStr);
urlConn = url.openConnection();
//設置請求屬性，有部分網站不加這句話會拋出IOException: Server returned HTTP response code: 403 for URL異常 //如：b站 urlConn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
bufIn = new BufferedReader(new InputStreamReader(urlConn.getInputStream()));
}
catch (Exception e) {
e.printStackTrace();
}

this.downloadPath = downloadPath;
}

//檢測路徑是否存在，不存在則創建 private void createFolder(String path)
{
File myPath = new File(path);

if (!myPath.exists()) //不存在則創建文件夾 myPath.mkdirs();
}

//下載函數 public void Download()
{
final int N = 20; //每一次處理的文本行數，這個數越小越容易遺漏圖片鏈接，越大效率越低 (理論上)
String line = "";
String text = "";

while (line != null) //網頁內容被讀完時結束循環 {
for(int i = 0; i < N; i++) //讀取N行網頁信息存入到text當中，因為src內容可能分為多行，所以使用這種方法 try {
line = bufIn.readLine(); //從網頁信息中獲取一行文本
if(line != null) //判斷防止把null也累加到text中 text += line;
}
catch (IOException e) {
e.printStackTrace();
}

//將img標簽正則封裝對象再調用matcher方法獲取一個Matcher對象 final Matcher imgM = Pattern.compile(IMG_REG).matcher(text);

if(!imgM.find()) //如果在當前text中沒有找到img標簽則結束本次循環 continue;

//將img src正則封裝對象再調用matcher方法獲取一個Matcher對象 //用於匹配的文本為找到的整個img標簽 final Matcher imgSrcM = Pattern.compile(IMG_SRC_REG).matcher(imgM.group());

while (imgSrcM.find()) //從img標簽中查找src內容 {
String imageLink = imgSrcM.group(1); //從正則中的第一個組中得到圖片鏈接
print(imageLink); //列印一遍鏈接
//如果得到的src內容沒有寫協議，則添加上// if(!imageLink.matches("https://[\\s\\S]*")) //這里有問題// imageLink = "https://" + imageLink;
print(imageLink); //列印一遍鏈接
try
{
//緩沖輸入流對象，用於讀取圖片鏈接的圖片數據 //在鏈接的圖片不存在時會拋出未找到文件異常 final BufferedInputStream in = new BufferedInputStream(new URL(imageLink).openStream());

//文件輸出流對象用於將從url中讀取到的圖片數據寫入到本地 //保存的路徑為downloadPath，保存的圖片名為時間戳+".png" final FileOutputStream file = new FileOutputStream(new File(downloadPath + System.currentTimeMillis() + ".png"));

int temp; //用於保存in從圖片連接中獲取到的數據 while ((temp = in.read()) != -1)
file.write(temp); //將數據寫入到本地路徑中
//關閉流 file.close();
in.close();

//下載完一張圖片後休息一會 try {
Thread.sleep(800);
}
catch (InterruptedException e) {
e.printStackTrace();
}
}
catch (Exception e)
{
e.printStackTrace();
}
}

//將text中的文本清空 text = "";
}
}

//run @Override
public void run()
{
Download(); //下載函數 }

//列印語句 public void print(Object obj)
{
System.out.println(obj);
}}

3. 如何使用Java語言實現一個網頁爬蟲

我給你代碼
public class DEmo {
public static void match(String s1) {
Pattern p = Pattern.compile("<a(.*)>.*</a>");
Matcher m = p.matcher(s1);
while (m.find()) {
System.out.println(m.group(1));
}
}

public static void main(String args[]) {
URL url;
int responsecode;
HttpURLConnection urlConnection;
BufferedReader reader;
String line;
try {
// 生成一個URL對象，要獲取源代碼的網頁地址為：http://www.sina.com.cn
url = new URL("http://www.jb51.net/article/97787.htm");
// 打開URL
urlConnection = (HttpURLConnection) url.openConnection();
// 獲取伺服器響應代碼
responsecode = urlConnection.getResponseCode();
String temp = "";
if (responsecode == 200) {
// 得到輸入流，即獲得了網頁的內容
reader = new BufferedReader(new InputStreamReader(
urlConnection.getInputStream(), "GBK"));
while ((line = reader.readLine()) != null) {
temp = temp + line;
}
System.out.println(temp);
match(temp);

} else {
System.out.println("獲取不到網頁的源碼，伺服器響應代碼為：" + responsecode);
}
} catch (Exception e) {
System.out.println("獲取不到網頁的源碼,出現異常：" + e);
}

}
}

閱讀全文

熱點內容

scratch少兒編程課程發布：2025-04-16 17:11:44 瀏覽：619

榮耀x10從哪裡設置密碼發布：2025-04-16 17:11:43 瀏覽：347

java從入門到精通視頻發布：2025-04-16 17:11:43 瀏覽：62

php微信介面教程發布：2025-04-16 17:07:30 瀏覽：288

android實現陰影發布：2025-04-16 16:50:08 瀏覽：781

粉筆直播課緩存發布：2025-04-16 16:31:21 瀏覽：334

機頂盒都有什麼配置發布：2025-04-16 16:24:37 瀏覽：197

編寫手游反編譯都需要學習什麼發布：2025-04-16 16:19:36 瀏覽：791

proteus編譯文件位置發布：2025-04-16 16:18:44 瀏覽：350

土壓縮的本質發布：2025-04-16 16:13:21 瀏覽：578

java爬蟲源碼

與java爬蟲源碼相關的資訊