Convert Word document into HTML without losing original
I am currently developing a program that needs to display a Word document as HTML, but keep track of what is where across the HTML and the original file.
In order to do that, when the Word document is initially loaded, IDs are generated for every element in the document.
foreach (Table t in document.Tables)
{
    t.ID = GUID();
    Range range = t.Range;
    foreach (Cell c in range.Cells)
    {
        c.ID = t.ID + TableIDSeparator + GUID();
    }
}
foreach (Paragraph p in document.Paragraphs)
{
    p.ID = GUID();
}
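The cell ID scheme above (table ID, then a separator, then a fresh GUID) can be sketched without Word as follows. This is only an illustration under assumptions: the `ElementIds` class, the `"_"` separator value, and the `"id"` prefix are hypothetical, since the actual `GUID()` helper and `TableIDSeparator` value are not shown here.

```csharp
using System;

// Build a composite cell ID and split it back into its parts.
string tableId = ElementIds.NewId();
string cellId = ElementIds.ForCell(tableId);
var (parsedTable, parsedCell) = ElementIds.SplitCellId(cellId);
Console.WriteLine($"{parsedTable} / {parsedCell}");

static class ElementIds
{
    // Hypothetical separator; it must never occur inside a generated ID.
    public const string TableIdSeparator = "_";

    // The "N" format yields 32 hex digits with no dashes, so "_" cannot appear in it.
    public static string NewId() => "id" + Guid.NewGuid().ToString("N");

    // Cell IDs embed the owning table's ID, mirroring t.ID + TableIDSeparator + GUID().
    public static string ForCell(string tableId) => tableId + TableIdSeparator + NewId();

    // Recover the table ID and the cell-specific part from a composite cell ID.
    public static (string TableId, string CellId) SplitCellId(string id)
    {
        int pos = id.IndexOf(TableIdSeparator, StringComparison.Ordinal);
        if (pos < 0)
        {
            throw new FormatException("Not a composite cell ID: " + id);
        }
        return (id.Substring(0, pos), id.Substring(pos + TableIdSeparator.Length));
    }
}
```

Splitting on the first separator only works if the separator can never appear inside a generated ID, which is why the sketch uses the dash-free `"N"` GUID format.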
Then I can save the document as HTML this way:
document.SaveAs2(tempFileName, WdSaveFormat.wdFormatFilteredHTML);
But then the document object becomes the HTML document, and not the original Word document (just as when using Save As from the Word menu the current window displays the freshly saved document and not the original).
So I tried to save the document to HTML this way:
Document temp = new Document();
string x = document.Range().XML;
temp.Range().InsertXML(x);
temp.SaveAs2(fn, WdSaveFormat.wdFormatFilteredHTML);
temp.Close(false);
But now the new temp document is missing all the IDs I've created in the original document, so I cannot find what is where in the HTML file according to the original document.
Am I missing something important here, or is there some way to save a Word document as HTML without losing the reference to the original file?
Thanks for the suggestion Cindy, but I was already able to solve this — see answer below.
– Vladislav Korotnev
Aug 10 at 15:36
2 Answers
Since the documents turn out identical, I used the following approach to copy the IDs to the new document.
Please note the Paragraphs/Tables/etc. arrays begin from element index 1, not 0.
string fn = Path.GetTempPath() + TmpPrefix + GUID() + ".html";
Document temp = new Document();
// Copy the whole old document into the new document
temp.Range().InsertXML(doc.Range().XML);
// Copy IDs, assuming the documents are identical and have the same number of elements
for (int i = 1; i <= temp.Tables.Count; i++)
{
    temp.Tables[i].ID = doc.Tables[i].ID;
    Range sRange = doc.Tables[i].Range;
    Range tRange = temp.Tables[i].Range;
    for (int j = 1; j <= tRange.Cells.Count; j++)
    {
        tRange.Cells[j].ID = sRange.Cells[j].ID;
    }
}
for (int i = 1; i <= temp.Paragraphs.Count; i++)
{
    temp.Paragraphs[i].ID = doc.Paragraphs[i].ID;
}
// Save the new temp document as HTML
temp.SaveAs2(fn, WdSaveFormat.wdFormatFilteredHTML);
temp.Close();
return fn;
Since I don't need the IDs in the resulting DOCX file (I only use the IDs to keep track between the DOCX file loaded in memory and its HTML representation displayed in my app), this works great for my case.
However, this approach turned out to be extremely slow on large documents, so I had to do it differently:
public static string RenderHTMLFile(Document doc)
{
    string fn = Path.GetTempPath() + TmpPrefix + GUID() + ".html";
    // Add the macro source to the document's VBA project as a standard module
    var vba = doc.VBProject;
    var module = vba.VBComponents.Add(Microsoft.Vbe.Interop.vbext_ComponentType.vbext_ct_StdModule);
    var code = Properties.Resources.HTMLRenderer;
    module.CodeModule.AddFromString(code);
    // Run the macro, passing the output path as its argument
    var dataMacro = Word.Run("renderHTMLCopy", fn);
    return fn;
}
Where Properties.Resources.HTMLRenderer is a txt file with the following VB code:
Sub renderHTMLCopy(ByVal path As String)
'
' renderHTMLCopy Macro
'
'
    Selection.WholeStory
    Selection.Copy
    Documents.Add
    Selection.PasteAndFormat wdPasteDefault
    ActiveDocument.SaveAs2 path, WdSaveFormat.wdFormatFilteredHTML
    ActiveDocument.Close False
End Sub
The previous version took about 1500 ms for a small document; this one renders the same document in roughly 400 ms!
Mmmm... Cell.ID and Paragraph.ID are only valid if the document is saved as a web page - so they will tend to be stripped whenever the file is saved or opened as a Word document. Word has a round-trip HTML file format - save as "full" HTML, no filter - I think that would be your best bet. If you need to mark certain things, bookmarks would probably work.
– Cindy Meister
Aug 9 at 14:20
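If the bookmark route suggested in this comment is tried, each element needs a name Word will accept: bookmark names must start with a letter, may contain only letters, digits and underscores, and are limited to 40 characters. The following is a minimal sketch of deriving such names from GUIDs. The `BookmarkNames` class and the `bm_` prefix are hypothetical, and the ASCII-only check is deliberately stricter than Word's own rule. Whether the bookmarks survive a particular HTML save format would need to be verified.

```csharp
using System;
using System.Text.RegularExpressions;

// Generate a bookmark-safe name and check it against the naming rules.
string name = BookmarkNames.FromGuid(Guid.NewGuid());
Console.WriteLine($"{name} valid: {BookmarkNames.IsValid(name)}");
// With the Word interop assembly referenced, the name could then be attached with
// something like: document.Bookmarks.Add(name, paragraph.Range);

static class BookmarkNames
{
    // Hypothetical prefix; it guarantees that the name starts with a letter.
    private const string Prefix = "bm_";

    // "N" gives 32 hex digits without dashes, so the result is 35 characters long.
    public static string FromGuid(Guid id) => Prefix + id.ToString("N");

    // Conservative check: ASCII letter first, then letters, digits or underscores, max 40 chars.
    public static bool IsValid(string name) =>
        name.Length <= 40 && Regex.IsMatch(name, "^[A-Za-z][A-Za-z0-9_]*$");
}
```

A raw `Guid.ToString()` would not pass this check, because its default format contains dashes and may start with a digit.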