@Solved

The two subquestions I created have both been solved (yay for splitting this one up!), so this one is solved too. I'll award the check mark to samjudson, since his answer was the closest. For actual working solutions, though, see the subquestions below, which contain both my implemented solutions and the checked answers.

@Deprecated

I am splitting this question into two separate questions, since this is a fairly complicated problem. Answers are still welcome though.

The subquestions are:

  1. XSLT: Convert base64 data into image files
  2. XSLT: Obtaining or matching hashes for base64 encoded data


Hi, just wondering if anyone here has had any success in converting Evernote's export format, which is XML, to HTML including the pictures. I do know that Evernote has an export to HTML function which does this, but I eventually want to do more fancy stuff with it.

I have managed to accomplish getting the text only using the following XSLT:

Sample code removed

See child questions for implemented solutions.

However, at the moment this simply ignores any pictures, and this is where I need help.

Stumbling block #1: Evernote stores its pictures as GIFs or PNGs, and when exported, it embeds these GIFs and PNGs directly in the XML using what appears to be base64 (I could be wrong). I need to be able to reconstitute the pictures. If you open the file in a text editor, look for the huge blocks of data in **//note/resource/data**. For example (indents added manually):

<resource>
<data encoding="base64">
R0lGODlhEAAQAPMAMcDAwP/crv/erbigfVdLOyslHQAAAAECAwECAwECAwECAwECAwECAwECAwEC
AwECAyH/C01TT0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAHgfNAGQAh/wtNU09GRklDRTku
MBUAAAAJcEhZcwAACxMAAAsTAQCanBgAIf8LTVNPRkZJQ0U5LjATAAAAB3RJTUUH1AkWBTYSQXe8
fQAh+QQBAAAAACwAAAAAEAAQAAADSQhgpv7OlDGYstCIMqsZAXYJJEdRQRWRrHk2I9t28CLfX63d
ZEXovJ7htwr6dIQB7/hgJGXMzFApOBYgl6n1il0Mv5xuhBEGJAAAOw==
</data>
<mime>image/gif</mime>
<resource-attributes>
    <file-name>clip_image001.gif</file-name>
</resource-attributes>
</resource>
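
As a rough illustration of what "reconstituting" the pictures involves outside of XSLT, here is a minimal Python sketch. It assumes the export is well-formed XML with the element names shown above (resource, data, resource-attributes/file-name); the function names decode_resources and write_resources are made up for this example and are not part of any Evernote tool.

```python
import base64
import xml.etree.ElementTree as ET
from pathlib import Path


def decode_resources(export_xml: str) -> dict[str, bytes]:
    """Return {file name: decoded bytes} for every <resource> in the export."""
    root = ET.fromstring(export_xml)
    decoded = {}
    for resource in root.iter("resource"):
        payload = resource.findtext("data")
        name = resource.findtext("resource-attributes/file-name")
        if not payload or not name:
            continue  # skip resources with no data or no file name
        # Drop the line breaks and indentation that wrap the base64 text.
        decoded[name] = base64.b64decode("".join(payload.split()))
    return decoded


def write_resources(resources: dict[str, bytes], out_dir: str) -> list[str]:
    """Write each decoded resource into out_dir and return the paths written."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, raw in resources.items():
        # Keep only the final path component so a name cannot escape out_dir.
        path = target / Path(name).name
        path.write_bytes(raw)
        paths.append(str(path))
    return paths
```

Decoding and writing are kept separate so the decoded bytes can be inspected (or hashed) before anything touches the disk.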

Stumbling block #2: Evernote stores the file names of each picture under the resource node
**//note/resource/resource-attributes/file-name**
however, in the actual note in which it refers to the picture, it references the picture not by the filename, but by its hash, for example:

<en-media hash="4aaafc3e14314027bb1d89cf7d59a06c" type="image/gif" border="0" width="16" height="16" alt="Alt Text"/>
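
One way to bridge the hash and the file name, sketched in Python: this assumes the hash attribute is the MD5 digest (as lowercase hex) of the decoded resource bytes, which is how I understand Evernote describes it, but I have not verified that against the sample data in this post. The function name hash_to_filename is hypothetical.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def hash_to_filename(export_xml: str) -> dict[str, str]:
    """Map the MD5 hex digest of each decoded resource to its file name."""
    root = ET.fromstring(export_xml)
    mapping = {}
    for resource in root.iter("resource"):
        payload = resource.findtext("data")
        name = resource.findtext("resource-attributes/file-name")
        if not payload or not name:
            continue  # nothing to hash, or nothing to map the hash to
        # Assumption: the hash covers the decoded bytes, not the base64 text.
        raw = base64.b64decode("".join(payload.split()))
        mapping[hashlib.md5(raw).hexdigest()] = name
    return mapping
```

If the digests produced this way do not match the hash attributes in a real export, the assumption about MD5 over the decoded bytes is the first thing to re-check.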

Can anyone shed some light on how to deal with (base64) encoded binary data inside XML?

Edit

I understand from the comments and answers that plain ol' XSLT won't get the job done when it comes to handling images. The XSLT processor I am using is Xalan; however, if it is not good enough for the purposes of image processing or base64, then please suggest one that is!

Also, as requested, here is a sample Evernote export file. The code clips above are merely selected parts of this. I have stripped it down such that it contains just one note and edited most of the text out of it, and added indents for clarity.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export.dtd">
<en-export export-date="20091029T063411Z" application="Evernote/Windows" version="3.0">

<note>
    <title>A title here</title>
    <content><![CDATA[
     <?xml version="1.0" encoding="UTF-8"?>
     <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">
     <en-note bgcolor="#FFFFFF">
      <p>Some text here (followed by the picture)
      <p><en-media hash="4aaafc3e14314027bb1d89cf7d59a06c" type="image/gif" border="0" width="16" height="16" alt="A picture"/></p>
      <p>Some more text here (preceded by the picture)
     </en-note>
    ]]></content>
    <created>20090925T063154Z</created>
    <note-attributes>
     <author/>
    </note-attributes>
    <resource>
     <data encoding="base64">
R0lGODlhEAAQAPMAMcDAwP/crv/erbigfVdLOyslHQAAAAECAwECAwECAwECAwECAwECAwECAwEC
AwECAyH/C01TT0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAHgfNAGQAh/wtNU09GRklDRTku
MBUAAAAJcEhZcwAACxMAAAsTAQCanBgAIf8LTVNPRkZJQ0U5LjATAAAAB3RJTUUH1AkWBTYSQXe8
fQAh+QQBAAAAACwAAAAAEAAQAAADSQhgpv7OlDGYstCIMqsZAXYJJEdRQRWRrHk2I9t28CLfX63d
ZEXovJ7htwr6dIQB7/hgJGXMzFApOBYgl6n1il0Mv5xuhBEGJAAAOw==
     </data>
     <mime>image/gif</mime>
     <resource-attributes>
      <file-name>clip_image001.gif</file-name>
     </resource-attributes>
    </resource>
</note>

</en-export>

And this needs to be transformed into this:

<html>
    <body>
     <p>Some text here (followed by the picture)
     <p><img src="clip_image001.gif" border="0" width="16" height="16" alt="A picture"/></p>
     <p>Some more text here (preceded by the picture)
    </body>
</html>

With the file clip_image001.gif being generated and saved.
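
The tag rewrite itself (en-media to img) could look something like the Python sketch below. It uses regular expressions rather than an XML parser because the note body in the sample has unclosed p tags; it assumes a hash-to-file-name dictionary has already been built somehow, and that attribute values never contain a ">" character. The names en_media_to_img and names_by_hash are invented for this example.

```python
import html
import re

# Matches a whole <en-media ...> or <en-media .../> tag and captures its attributes.
EN_MEDIA = re.compile(r"<en-media\b([^>]*?)/?>", re.IGNORECASE)
HASH_ATTR = re.compile(r'\s*\bhash="([0-9a-fA-F]+)"')
TYPE_ATTR = re.compile(r'\s*\btype="[^"]*"')


def en_media_to_img(enml: str, names_by_hash: dict[str, str]) -> str:
    """Replace each <en-media> whose hash is known with an <img> tag."""

    def replace(match: re.Match) -> str:
        attrs = match.group(1)
        found = HASH_ATTR.search(attrs)
        if found is None or found.group(1).lower() not in names_by_hash:
            return match.group(0)  # leave unknown media untouched
        src = html.escape(names_by_hash[found.group(1).lower()], quote=True)
        # Drop hash and type, which <img> does not use; keep the other attributes.
        rest = TYPE_ATTR.sub("", HASH_ATTR.sub("", attrs)).strip()
        return f'<img src="{src}" {rest}/>' if rest else f'<img src="{src}"/>'

    return EN_MEDIA.sub(replace, enml)
```

Tags whose hash is missing from the dictionary are passed through unchanged, so a failed lookup stays visible in the output instead of silently producing a broken img.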

+2  A: 

There is a new Data URI scheme (http://en.wikipedia.org/wiki/Data_URI_scheme) which may be of some help, provided you only intend to support modern browsers and your images are small (for example, IE8 only supports images smaller than 32k).

Other than that the only other thing you can do is use some external scripts to export the image data to file and use them. This would depend greatly on what XSLT processor you are using.
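
For the Data URI route, the value for an img src attribute can be assembled from the mime and data elements of a resource without decoding anything. A minimal Python sketch (the function name to_data_uri is made up for illustration):

```python
def to_data_uri(mime: str, base64_payload: str) -> str:
    """Build a data: URI from a MIME type and a (possibly line-wrapped) base64 string."""
    # Remove the line breaks and indentation that the export wraps around the payload.
    compact = "".join(base64_payload.split())
    return f"data:{mime};base64,{compact}"
```

The size limits mentioned above still apply, so this only suits small images.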

samjudson
Hi Sam, thanks for your suggestion. However, my Evernote documents are largely annotated clips of various websites, and they do contain images > 32k, so I don't think the Data URI scheme is going to help here (despite looking very similar). Please suggest the XSLT processor and external scripts you would use, as I am flexible with the environment.
bguiz
Well personally I'd process them in C# because I'm a .Net developer, but you could use Java (in which case there are almost unlimited XSLT processors - Saxon being the best IMHO). Each processor has its own way of implementing extension methods however.
samjudson
A: 

There is a pure XSLT answer to this issue; look at this page

Erlock
Yeah, I've come across that site before; unfortunately, it uses base64 to encode strings (not binary data like images). Also, I need a way to actually generate images (write to file).
bguiz