I ran into a lot of problems converting the HTML content of a page to PDF and DOC while making sure the images also appear in the converted article, and I failed.

I understand that XML is something like a foundation for other formats.

So is it?

And how do I use it?

I mean, is there any guide on how to generate the XML of a page and then change its extension to the needed format (PDF, DOC)?

I am using VS 2008, ASP.NET, and C#.

+4  A: 

The short answer is no.
If there were such a format, why wouldn't all applications use it in the first place?

A note on different formats

Almost all document applications understand plain text (image applications and the like do not). The problem with plain text is that it carries no formatting: no pictures, no font sizes, no margins, nothing except text. Formatting is also the root cause of why there are so many different formats.

Take HTML, for example. HTML is good for flowing text on web sites, where a continuous block of text is navigated with a scrollbar. It has no page breaks and can adapt to different column widths depending on screen size. HTML is also very dynamic: pages can expand sections, replace content, and react to user input.

In contrast, take PDF. PDF is page oriented, with a fixed width and height for each page. It is also targeted at viewing only. Text wrapping is fixed with explicit line breaks. (Copy the text from a PDF into a Word document and insert some text in the middle of a line, and the line breaks will be a real mess.) PDF emulates a printed page, margins and all.

Somewhere in the middle is the Word document. It is page oriented like PDF, but not as fixed in shape, in order to support a nice editing experience. Sections of text reflow nicely when text is inserted in the middle. It is quite flexible when editing, but the final result is as strict in form as PDF: when you print a Word document, the printout looks just like it did on the screen.
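The difference between fixed line breaks and reflowing text can be sketched in a few lines. This is only an illustration, written in Python for brevity and using its standard textwrap module; the sample paragraph, the 40-character width, and the helper names are made up for the example and do not describe how PDF or Word work internally.

```python
import textwrap

WIDTH = 40  # assumed line width for this illustration

paragraph = (
    "PDF stores each line where it was placed, "
    "while a word processor reflows the paragraph when the text changes."
)

def hard_wrap(text, width):
    """Break the text into fixed lines once, like a page-oriented format."""
    return textwrap.wrap(text, width=width)

def insert_into_fixed_lines(lines, line_index, extra):
    """Add text to one line while leaving every other line break untouched."""
    edited = list(lines)
    edited[line_index] = edited[line_index] + " " + extra
    return edited

def reflow(lines, width):
    """Join the lines into one paragraph again and wrap it from scratch."""
    return textwrap.wrap(" ".join(lines), width=width)

fixed = hard_wrap(paragraph, WIDTH)
edited = insert_into_fixed_lines(fixed, 0, "(inserted words go here)")
reflowed = reflow(edited, WIDTH)

print("\n".join(edited))    # first line now overflows the fixed width
print()
print("\n".join(reflowed))  # every line fits within the width again
```

The first printout is what happens when text is edited inside fixed line breaks; the second is what an editor that reflows paragraphs produces from the same words.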

XML

XML is a very general format; you can think of it as a format for formats. XML in itself does not say anything about the content, so you need a separate description of how to interpret the XML for a given application. There are specifications like DocBook that specify how to describe a document in XML. But that is not an exact description of how the document will look. It separates content from layout. You need to apply a layout/template to generate a visible output format. From DocBook XML you can generate PDF, HTML, etc.

There is no given way of converting an arbitrary document format to XML, not even to a specific XML format like DocBook. XML-based formats can, however, be used as a source format from which to generate different viewable formats.
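To make the idea of XML as a source format concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The element names (article, title, para, image) are a tiny vocabulary invented for this example, not DocBook, and the two renderers are hypothetical helpers. The point is only that the same XML can be turned into more than one viewable format once something decides how each element should look.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical XML vocabulary invented for this sketch (not DocBook).
SOURCE = """
<article>
  <title>Formats &amp; conversion</title>
  <para>XML describes content, not appearance.</para>
  <image src="diagram.png" alt="Conversion diagram"/>
  <para>A template decides how it looks.</para>
</article>
"""

def to_html(root):
    """Render the article as an HTML fragment."""
    parts = []
    for node in root:
        if node.tag == "title":
            parts.append("<h1>" + escape(node.text or "") + "</h1>")
        elif node.tag == "para":
            parts.append("<p>" + escape(node.text or "") + "</p>")
        elif node.tag == "image":
            src = escape(node.get("src", ""), quote=True)
            alt = escape(node.get("alt", ""), quote=True)
            parts.append('<img src="' + src + '" alt="' + alt + '">')
    return "\n".join(parts)

def to_text(root):
    """Render the same article as plain text with an image placeholder."""
    parts = []
    for node in root:
        if node.tag == "title":
            title = node.text or ""
            parts.append(title + "\n" + "=" * len(title))
        elif node.tag == "para":
            parts.append(node.text or "")
        elif node.tag == "image":
            parts.append("[image: " + node.get("alt", "") + "]")
    return "\n\n".join(parts)

root = ET.fromstring(SOURCE.strip())
html_output = to_html(root)
text_output = to_text(root)
print(html_output)
print(text_output)
```

Note that nothing here converts an existing HTML page into this XML; the XML has to be the starting point, which is the limitation described above.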

A note on conversion

The problem of converting formats to each other comes from the different purposes and strengths of each format. One format is often simply not suitable for describing the properties of another format correctly, or not even able to. There is no general method of converting between formats, because formats like PDF do not reveal the document structure in a reusable way.

How to publish to different formats

The key to success when publishing to different formats is to separate content from layout. You need to specify what text you have, how it is structured (headers, sections, etc.), what images you have, and how they relate to your sections of text. The text and structure description may be in XML, in a database, or somewhere else.

Then you need a tool that generates each output format from that description, using a template for each format.
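A rough sketch of that workflow, in Python with the standard string.Template class, could look like the following. The article dictionary is a hypothetical stand-in for content loaded from XML or a database, and the template set and the render function are names invented for this example. Only HTML and plain text are shown; producing real PDF or DOC output would need a format-specific library, which is not covered here.

```python
from html import escape
from string import Template

# Content and structure only: no fonts, margins, or page sizes.
# Hypothetical stand-in for data loaded from XML or a database.
article = {
    "title": "Quarterly report",
    "sections": [
        {"heading": "Summary", "body": "Sales grew.", "image": "sales.png"},
        {"heading": "Outlook", "body": "Growth should continue.", "image": None},
    ],
}

# One set of templates per output format; the layout lives only here.
TEMPLATES = {
    "html": {
        "page": Template("<h1>$title</h1>\n$sections"),
        "section": Template("<h2>$heading</h2>\n<p>$body</p>$image"),
        "image": Template('\n<img src="$src">'),
    },
    "text": {
        "page": Template("$title\n\n$sections"),
        "section": Template("$heading\n$body$image"),
        "image": Template("\n[image: $src]"),
    },
}

def render(content, fmt):
    """Fill the templates for one output format with the shared content."""
    templates = TEMPLATES[fmt]
    # Only HTML needs escaping of characters such as < and &.
    clean = escape if fmt == "html" else (lambda value: value)
    rendered = []
    for section in content["sections"]:
        image = ""
        if section["image"]:
            image = templates["image"].substitute(src=clean(section["image"]))
        rendered.append(
            templates["section"].substitute(
                heading=clean(section["heading"]),
                body=clean(section["body"]),
                image=image,
            )
        )
    return templates["page"].substitute(
        title=clean(content["title"]), sections="\n".join(rendered)
    )

html_output = render(article, "html")
text_output = render(article, "text")
print(html_output)
print(text_output)
```

Adding another output format in this sketch means adding another entry to TEMPLATES, while the content stays untouched.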

Side note on image formats

Image formats, on the other hand, are much easier to convert between (as long as you convert pixel-based formats to pixel-based formats and vector-based formats to vector-based formats), since the end result is exactly the same. The difference between image formats is mainly the compression algorithm used to compress the image. When the image is uncompressed, the original image with all of its information is restored (except for minor compression artifacts).
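The round trip can be illustrated with Python's standard zlib module. This is a sketch under stated assumptions: the 8x8 grayscale "image" is made-up raw pixel data, and zlib stands in for a lossless image codec (PNG uses the same DEFLATE algorithm). It shows only the lossless case; lossy formats would restore the pixels approximately rather than exactly.

```python
import zlib

WIDTH, HEIGHT = 8, 8

# Hypothetical 8x8 grayscale image: one byte per pixel, a simple gradient.
pixels = bytes((x * 32 + y) % 256 for y in range(HEIGHT) for x in range(WIDTH))

# Two different "formats": the same pixels compressed with different settings.
compressed_fast = zlib.compress(pixels, 1)
compressed_small = zlib.compress(pixels, 9)

# Uncompressing either one restores the identical pixel data.
restored_fast = zlib.decompress(compressed_fast)
restored_small = zlib.decompress(compressed_small)

same_pixels = restored_fast == pixels and restored_small == pixels
print(same_pixels)  # prints True
```

Because both compressed forms decode to the same pixels, converting between them is just a matter of decoding one and encoding the other.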

Albin Sunnanbo