Converting Word to HTML in Java

Published: 2023-08-21 20:54:38

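(1) What are the ways to convert Word, Excel, and PDF to HTML in Java?

Notes on converting Word/Excel/PDF files to HTML in Java

During a project, the requirements called for converting various documents to HTML or another format that displays easily on a web page. The approaches used are collected below:
I. Converting Word and Excel to HTML with Jacob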

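"JACOB is a Java-COM bridge. With this component, a Java application can call COM components and Win32 libraries."

First download the Jacob package. JDK 1.5 and later need Jacob 1.9 (not yet tested on JDK 1.6); it differs little from the earlier Jacob 1.7.

1. Unpack the archive and add Jacob.jar to Libraries;

2. Put Jacob.dll under "WINDOWS\SYSTEM32".

Note:
[When the web server is started from an IDE, the system cannot find Jacob.dll. For example, when Tomcat is started from MyEclipse, the DLL has to be copied into "jre\bin" under the MyEclipse installation directory.
When Jacob.dll has not been loaded, the usual error message is: "java.lang.UnsatisfiedLinkError: no jacob in java.library.path"]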
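
Since the error message names java.library.path, one way to diagnose it is to print that property and check which of its directories actually contain the DLL. The following is a minimal sketch and not part of the original post: the class NativeLibraryLocator and the method findOnPath are hypothetical helpers, and the file name jacob.dll is taken from the text, so adjust it if your DLL is named differently.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Diagnostic sketch: finds the directories on a library path that contain a given file. */
class NativeLibraryLocator {

    /**
     * Returns every directory in libraryPath that contains fileName.
     * libraryPath uses the platform separator (";" on Windows, ":" elsewhere).
     */
    static List<File> findOnPath(String libraryPath, String fileName) {
        List<File> hits = new ArrayList<>();
        if (libraryPath == null || libraryPath.isEmpty()) {
            return hits;
        }
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate.getParentFile());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path");
        System.out.println("java.library.path = " + path);
        // "jacob.dll" is the name used in the text; this is an assumption about your setup.
        List<File> dirs = findOnPath(path, "jacob.dll");
        System.out.println(dirs.isEmpty()
                ? "jacob.dll not found on java.library.path"
                : "jacob.dll found in: " + dirs);
    }
}
```

Running this inside the same JVM that fails (for example from a test servlet in the IDE-launched Tomcat) shows the path that JVM really uses, which may differ from the one you get on the command line.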

Create a new class:
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class JacobUtil
{
    // Save-format codes passed to SaveAs
    public static final int WORD_HTML = 8;

    public static final int WORD_TXT = 7;

    public static final int EXCEL_HTML = 44;

    /**
     * Convert Word to HTML
     * @param docfile full path of the Word file
     * @param htmlfile path where the converted HTML is stored
     */
    public static void wordToHtml(String docfile, String htmlfile)
    {
        ActiveXComponent app = new ActiveXComponent("Word.Application"); // start Word
        try
        {
            app.setProperty("Visible", new Variant(false));
            Dispatch docs = app.getProperty("Documents").toDispatch();
            // Open(FileName, ConfirmConversions = false, ReadOnly = true)
            Dispatch doc = Dispatch.invoke(
                    docs,
                    "Open",
                    Dispatch.Method,
                    new Object[] { docfile, new Variant(false),
                            new Variant(true) }, new int[1]).toDispatch();
            Dispatch.invoke(doc, "SaveAs", Dispatch.Method, new Object[] {
                    htmlfile, new Variant(WORD_HTML) }, new int[1]);
            // Close without saving changes to the source document
            Variant f = new Variant(false);
            Dispatch.call(doc, "Close", f);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        finally
        {
            // Always quit Word so no background process is left running
            app.invoke("Quit", new Variant[] {});
        }
    }

    /**
     * Convert Excel to HTML
     * @param xlsfile full path of the Excel file
     * @param htmlfile path where the converted HTML is stored
     */
    public static void excelToHtml(String xlsfile, String htmlfile)
    {
        ActiveXComponent app = new ActiveXComponent("Excel.Application"); // start Excel
        try
        {
            app.setProperty("Visible", new Variant(false));
            Dispatch excels = app.getProperty("Workbooks").toDispatch();
            // Open(Filename, UpdateLinks = false, ReadOnly = true)
            Dispatch excel = Dispatch.invoke(
                    excels,
                    "Open",
                    Dispatch.Method,
                    new Object[] { xlsfile, new Variant(false),
                            new Variant(true) }, new int[1]).toDispatch();
            Dispatch.invoke(excel, "SaveAs", Dispatch.Method, new Object[] {
                    htmlfile, new Variant(EXCEL_HTML) }, new int[1]);
            // Close without saving changes to the source workbook
            Variant f = new Variant(false);
            Dispatch.call(excel, "Close", f);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        finally
        {
            // Always quit Excel so no background process is left running
            app.invoke("Quit", new Variant[] {});
        }
    }

}

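While I was looking for a conversion component, I found that NetEase had also reposted a Jacob usage guide, but it contains a fairly serious error: String htmlfile = "C:\\AA";
specifies only a folder. The correct form is String htmlfile = "C:\\AA\\xxx.html";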
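
To avoid passing a bare folder by accident, the target path can be derived from the source file name. This is only an illustrative sketch: HtmlTargetPath and forSource are hypothetical names that do not appear in the original post.

```java
import java.io.File;

/** Sketch: builds the full path of the HTML output file from the source file name. */
class HtmlTargetPath {

    /** Returns outputDir joined with the source file's base name plus ".html". */
    static String forSource(String sourceFile, String outputDir) {
        String name = new File(sourceFile).getName();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension (or only a leading dot).
        String base = dot > 0 ? name.substring(0, dot) : name;
        return new File(outputDir, base + ".html").getPath();
    }
}
```

The result can then be passed as the htmlfile argument, for example JacobUtil.wordToHtml(docfile, HtmlTargetPath.forSource(docfile, "C:\\AA")).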

That is about all there is to converting Word/Excel to HTML; it should be clear by now :)

II. Converting PDF to HTML with XPDF

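1. Download the latest version of xpdf from http://www.foolabs.com/xpdf/download.html
I downloaded xpdf-3.02pl2-win32.zip

2. Download the Chinese support package
I downloaded xpdf-chinese-simplified.tar.gz

3. Download the pdftohtml package
From: http://sourceforge.net/projects/pdftohtml/
I downloaded pdftohtml-0.39-win32.tar.gz

4. Unpack and set up
1) First unpack xpdf-3.02pl2-win32.zip. The contents can be trimmed as needed: if you only need conversion to txt, the other exe files can be deleted, keeping only pdftotext.exe, and so on;
2) Then unpack xpdf-chinese-simplified.tar.gz into the directory where xpdf-3.02pl2-win32.zip was unpacked;
3) Unpack pdftohtml-0.39-win32.tar.gz and put pdftohtml.exe into the directory where xpdf-3.02pl2-win32.zip was unpacked;
4) Directory layout: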
+---[X:\xpdf]
|-------the exe files used for the various conversions
|
|-------xpdfrc
|
+------[X:\xpdf\xpdf-chinese-simplified]
|
|
+-------the many character-map files needed during conversion

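xpdfrc: this file declares the paths of the character-set mappings used during conversion

5) Edit the xpdfrc file (originally named sample-xpdfrc)
Change its contents to:
Txt code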

#----- begin Chinese Simplified support package
cidToUnicode Adobe-GB1 xpdf-chinese-simplified\Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN xpdf-chinese-simplified\ISO-2022-CN.unicodeMap
unicodeMap EUC-CN xpdf-chinese-simplified\EUC-CN.unicodeMap
unicodeMap GBK xpdf-chinese-simplified\GBK.unicodeMap
cMapDir Adobe-GB1 xpdf-chinese-simplified\CMap
toUnicodeDir xpdf-chinese-simplified\CMap
fontDir C:\WINDOWS\Fonts
displayCIDFontTT Adobe-GB1 C:\WINDOWS\Fonts\simhei.ttf
#----- end Chinese Simplified support package

6) Create the batch file pdftohtml.bat (the path where it is placed must not contain spaces)
Contents:
Txt code

@echo off
set folderPath=%1
set filePath=%2
cd /d %folderPath%
pdftohtml -enc GBK %filePath%
exit
7) Create the class

Java code

import java.io.File;
import java.io.IOException;

public class ConvertPdf
{
    private static String INPUT_PATH;
    private static String PROJECT_PATH;

    public static void convertToHtml(String file, String project)
    {
        INPUT_PATH = file;
        PROJECT_PATH = project;
        if (checkContentType() == 0)
        {
            toHtml();
        }
    }

    // Returns 0 when the input file has a .pdf extension, 9 otherwise
    private static int checkContentType()
    {
        String type = INPUT_PATH.substring(INPUT_PATH.lastIndexOf(".") + 1, INPUT_PATH.length())
                .toLowerCase();
        if (type.equals("pdf"))
            return 0;
        else
            return 9;
    }

    private static void toHtml()
    {
        if (new File(INPUT_PATH).isFile())
        {
            try
            {
                // Launch the batch file with the working folder and the PDF path as its two arguments
                String cmd = "cmd /c start X:\\pdftohtml.bat \"" + PROJECT_PATH + "\" \"" + INPUT_PATH + "\"";
                Runtime.getRuntime().exec(cmd);
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }

}
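
The class above launches the batch file with "cmd /c start" and never waits for it, so the Java side cannot tell when the conversion has finished or whether it failed. As a hedged alternative, the sketch below calls the converter directly through ProcessBuilder and blocks until it exits. PdfToHtmlRunner, buildCommand and run are hypothetical names; the "-enc GBK" arguments are the ones used in pdftohtml.bat, and the location of the executable is an assumption you must supply.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch: runs pdftohtml directly and waits for it instead of going through a batch file. */
class PdfToHtmlRunner {

    /** Builds the same command line the batch file runs: pdftohtml -enc GBK <pdf>. */
    static List<String> buildCommand(String pdfToHtmlExe, File pdf) {
        return Arrays.asList(pdfToHtmlExe, "-enc", "GBK", pdf.getAbsolutePath());
    }

    /** Runs the converter with workDir as its working directory and returns its exit code. */
    static int run(String pdfToHtmlExe, File workDir, File pdf)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdfToHtmlExe, pdf));
        pb.directory(workDir); // same effect as "cd /d %folderPath%" in the batch file
        pb.inheritIO();        // show the tool's own output in this process's console
        return pb.start().waitFor();
    }
}
```

Passing the arguments as a list also avoids the manual quoting in the original command string, so paths containing spaces need no special handling. A call might look like PdfToHtmlRunner.run("X:\\xpdf\\pdftohtml.exe", new File(projectPath), new File(inputPath)), where the executable path is only an example based on the directory layout above.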

(2) How to use OpenOffice from Java to create a Word document and set its content

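Converting Word to HTML works like this:
1. The client uploads a Word document to the server
2. The server calls OpenOffice to open the uploaded Word document
3. OpenOffice saves the Word document in HTML format
4. Done
So the server needs OpenOffice installed. MS Office would also work, but OpenOffice has the advantage of being cross-platform. Note that the tests in this post were run on MS Win7 Ultimate X64.
Here is the step-by-step implementation.
1. Download OpenOffice.
2. Download Jodconverter, a third-party jar that drives OpenOffice to perform format conversion.
3. Make a cup of hot tea and wait for the downloads.

4. Install OpenOffice. When the installation finishes, open cmd and start OpenOffice as a service: C:\Program Files (x86)\OpenOffice.org 3\program>soffice -headless -accept="socket,port=8100;urp;"

5. Open Eclipse
6. Drink the tea while Eclipse starts.
7. Create a new Eclipse project and import the jars under Jodconverter/lib.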

* commons-io
* jodconverter
* juh
* jurt
* ridl
* slf4j-api
* slf4j-jdk14
* unoil
* xstream

8. Coding...

package com.mzule.doc2html.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ConnectException;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.artofsolving.jodconverter.DocumentConverter;
import com.artofsolving.jodconverter.openoffice.connection.OpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.connection.SocketOpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.converter.OpenOfficeDocumentConverter;

/**
 * Utility class that converts a Word document to an HTML string
 *
 * @author MZULE
 *
 */
public class Doc2Html {

    public static void main(String[] args) {
        System.out
                .println(toHtmlString(new File("C:/test/test.doc"), "C:/test"));
    }

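    /**
     * Convert a Word document to an HTML document
     *
     * @param docFile
     *            the Word document to convert
     * @param filepath
     *            directory where the converted HTML is stored
     * @return the converted HTML file
     */
    public static File convert(File docFile, String filepath) {
        // Create the file that will hold the HTML
        File htmlFile = new File(filepath + "/" + new Date().getTime()
                + ".html");
        // Create the OpenOffice connection (port 8100, as started in step 4)
        OpenOfficeConnection con = new SocketOpenOfficeConnection(8100);
        try {
            // Connect
            con.connect();
        } catch (ConnectException e) {
            // Without a connection the conversion cannot proceed, so stop here
            throw new IllegalStateException("Failed to connect to OpenOffice", e);
        }
        try {
            // Create the converter
            DocumentConverter converter = new OpenOfficeDocumentConverter(con);
            // Convert the document to HTML
            converter.convert(docFile, htmlFile);
        } finally {
            // Close the OpenOffice connection even if the conversion fails
            con.disconnect();
        }
        return htmlFile;
    }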

    /**
     * Convert a Word document to an HTML file and return the HTML source.
     *
     * @param docFile
     *            the document to convert
     * @param filepath
     *            location where the document's images are saved
     * @return the converted HTML source
     */
    public static String toHtmlString(File docFile, String filepath) {
        // Convert the Word document
        File htmlFile = convert(docFile, filepath);
        // Read the HTML file
        StringBuffer htmlSb = new StringBuffer();
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new FileInputStream(htmlFile)));
            try {
                String line;
                // Read to end of file; lines are joined without separators
                while ((line = br.readLine()) != null) {
                    htmlSb.append(line);
                }
            } finally {
                br.close();
            }
            // Delete the temporary file
            htmlFile.delete();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // The HTML file as a string
        String htmlStr = htmlSb.toString();
        // Return the cleaned-up HTML text
        return clearFormat(htmlStr, filepath);
    }

    /**
     * Remove HTML markup that is not needed
     *
     * @param htmlStr
     *            HTML text with complex markup
     * @param docImgPath
     *            path prefixed to image addresses
     * @return the text with the unneeded markup removed
     */
    protected static String clearFormat(String htmlStr, String docImgPath) {
        // Regex that captures the body content
        String bodyReg = "<BODY .*</BODY>";
        Pattern bodyPattern = Pattern.compile(bodyReg);
        Matcher bodyMatcher = bodyPattern.matcher(htmlStr);
        if (bodyMatcher.find()) {
            // Take the BODY content and turn the BODY tag into a DIV
            htmlStr = bodyMatcher.group().replaceFirst("<BODY", "<DIV")
                    .replaceAll("</BODY>", "</DIV>");
        }
        // Adjust image addresses (quoteReplacement keeps "\" and "$" in the path literal)
        htmlStr = htmlStr.replaceAll("<IMG SRC=\"", "<IMG SRC=\""
                + Matcher.quoteReplacement(docImgPath) + "/");
        // Alternative: convert <P></P> to <div></div> and keep the styles
        // content = content.replaceAll("(<P)([^>]*>.*?)(<\\/P>)",
        // "<div$2</div>");
        // Rewrite <P ...></P> as <p></p>, dropping the attributes
        htmlStr = htmlStr.replaceAll("(<P)([^>]*)(>.*?)(<\\/P>)", "<p$3</p>");
        // Remove unwanted tags
        htmlStr = htmlStr
                .replaceAll(
                        "<[/]?(font|FONT|span|SPAN|xml|XML|del|DEL|ins|INS|meta|META|[ovwxpOVWXP]:\\w+)[^>]*?>",
                        "");
        // Remove unwanted attributes (single-quoted, double-quoted, or unquoted values)
        htmlStr = htmlStr
                .replaceAll(
                        "<([^>]*)(?:lang|LANG|class|CLASS|style|STYLE|size|SIZE|face|FACE|[ovwxpOVWXP]:\\w+)=(?:'[^']*'|\"[^\"]*\"|[^>]+)([^>]*)>",
                        "<$1$2>");
        return htmlStr;
    }

}
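
Step 4 above starts the OpenOffice service on port 8100, and the conversion fails if nothing is listening there. A small pre-flight check can make that failure easier to recognize. The following is a sketch under that assumption only: OfficeServiceProbe and isListening are hypothetical names, and the check merely confirms that a TCP connection can be opened, not that the listener really is OpenOffice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Sketch: checks whether anything accepts TCP connections on the given host and port. */
class OfficeServiceProbe {

    /** Returns true if a connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable
            return false;
        }
    }

    public static void main(String[] args) {
        // 8100 is the port passed to "soffice -accept" in the text.
        System.out.println("OpenOffice service reachable: "
                + isListening("localhost", 8100, 1000));
    }
}
```

Calling isListening before Doc2Html.convert lets the application report "service not started" separately from genuine conversion errors.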

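(3) How to convert a Word document to an HTML document with Java

This can be done with Spire.Doc for Java.

First install Spire.Doc for Java by adding its JAR file to the Java project as a dependency. The JAR file can be downloaded manually. If you use Maven, you can add the following to the project's pom.xml to pull the JAR into the application.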

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>5.2.3</version>
    </dependency>
</dependencies>

The Java code is as follows:

import com.spire.doc.*;

public class WordtoHtml {
    public static void main(String[] args) {
        // Create a Document object
        Document doc = new Document();

        // Load the Word document
        doc.loadFromFile("inputfile.docx");

        // Save in HTML format
        doc.saveToFile("ToHtml.html", FileFormat.Html);
        doc.dispose();
    }

}

Hope this helps.

(4) A Java program calls OpenOffice to convert a doc file to HTML, but after conversion everything is left-aligned

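1. Download Jacob from the official site.
2. Unpack the archive and add Jacob.jar to Libraries (copy it into the project directory first, then right-click the jar and choose Build Path -> Add to Build Path);
3. Put Jacob.dll under the "jre\bin" directory used by the current project (for example, if Eclipse is using the JRE at C:\Java\jdk1.7.0_17\jre\bin).
PS: Following the steps above generally works, but some machines may still report an error such as java.lang.UnsatisfiedLinkError: no jacob in java.library.path. This means the system did not load jacob.dll; the fix found online is to put Jacob.dll under "WINDOWS\SYSTEM32".

Java code: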
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class JacobUtil {
    // 8 means "save the Word document as HTML"
    public static final int WORD_HTML = 8;

    public static void main(String[] args) {
        String docfile = "C:\\Users\\無名\\Desktop\\xxx.doc";
        String htmlfile = "C:\\Users\\無名\\Desktop\\xxx.html";
        JacobUtil.wordToHtml(docfile, htmlfile);
    }

    /**
     * Convert Word to HTML
     * @param docfile full path of the Word file
     * @param htmlfile path where the converted HTML is stored
     */
    public static void wordToHtml(String docfile, String htmlfile)
    {
        // Start the Word application (Microsoft Office Word 2003)
        ActiveXComponent app = new ActiveXComponent("Word.Application");
        System.out.println("*****Converting...*****");
        try
        {
            // Make the Word application invisible
            app.setProperty("Visible", new Variant(false));
            // Documents is the collection of all document windows (Word is a multi-document application)
            Dispatch docs = app.getProperty("Documents").toDispatch();
            // Open the Word file to convert
            Dispatch doc = Dispatch.invoke(
                    docs,
                    "Open",
                    Dispatch.Method,
                    new Object[] { docfile, new Variant(false),
                            new Variant(true) }, new int[1]).toDispatch();
            // Save it in HTML format to the target file
            Dispatch.invoke(doc, "SaveAs", Dispatch.Method, new Object[] {
                    htmlfile, new Variant(WORD_HTML) }, new int[1]);
            // Close the Word file
            Dispatch.call(doc, "Close", new Variant(false));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Quit the Word application
            app.invoke("Quit", new Variant[] {});
        }
        System.out.println("*****Conversion finished********");
    }
}

(5) When using Jacob in Java to convert Word to HTML, how do I set the encoding of the generated HTML? I want utf-8, not gb2312.

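Transcode it explicitly.
Read the generated file's bytes as gb2312 and write the text back out as utf-8. Note that the one-liner line = new String(line.getBytes("gb2312"), "utf-8"); does not do this: it encodes the text as gb2312 and then reads those bytes as if they were utf-8, which garbles Chinese characters.
Also, before writing the file, change the page's encoding declaration in the HTML so that it says utf-8.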
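
A minimal sketch of that transcoding step follows. It is an illustration rather than the answerer's code: HtmlReencoder, declareUtf8 and reencode are hypothetical names, and it assumes the generated file really is encoded as GB2312 and declares its charset with a charset=... attribute.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: re-encodes an HTML file as UTF-8 and updates its charset declaration. */
class HtmlReencoder {

    /** Rewrites any charset=... declaration to charset=utf-8, keeping an opening quote if present. */
    static String declareUtf8(String html) {
        return html.replaceAll("(?i)(charset=[\"']?)[\\w-]+", "$1utf-8");
    }

    /** Decodes source using sourceCharset and writes the result to target as UTF-8. */
    static void reencode(Path source, Path target, Charset sourceCharset) throws IOException {
        // Decode the raw bytes with the encoding the file was actually written in
        String html = new String(Files.readAllBytes(source), sourceCharset);
        // Encode the same text as UTF-8 and fix the declaration to match
        Files.write(target, declareUtf8(html).getBytes(StandardCharsets.UTF_8));
    }
}
```

For the file produced by Jacob, a call might look like HtmlReencoder.reencode(Paths.get(htmlfile), Paths.get(htmlfile), Charset.forName("GB2312")); reading everything into memory first is what makes overwriting the same file safe here.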
