How to find where the browser breaks a paragraph of text.

+1 Q:

I need to add line breaks at the positions where the browser naturally wraps a paragraph of text.

For example:

This is some very long text \n that spans a number of lines in the paragraph.

This is a paragraph that the browser chose to break at the position of the \n

I need to find this position and insert a <br /> there.

Does anyone know of any JS libraries or functions that are able to do this?

The only solution that I have found so far is to remove tokens from the paragraph and observe the clientHeight property to detect a change in element height. I don't have time to finish this and would like to find something that's already tested.
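A minimal sketch of the height-based approach described above, for readers who want to see its shape. It is not the asker's actual code. The function name `findLineStartsByHeight` and the `measureHeight(text)` callback are hypothetical: the callback is assumed to put the given text into the paragraph and return the element's `clientHeight`.

```javascript
// Remove words from the end one at a time. Whenever the measured height
// drops, the word that was just removed must have started a line.
// measureHeight is a hypothetical callback supplied by the caller.
function findLineStartsByHeight(words, measureHeight) {
  const lineStarts = [];
  let previousHeight = measureHeight(words.join(" "));
  for (let count = words.length - 1; count >= 1; count--) {
    const height = measureHeight(words.slice(0, count).join(" "));
    if (height < previousHeight) {
      // words[count] was the first word of a wrapped line.
      lineStarts.unshift(count);
    }
    previousHeight = height;
  }
  return lineStarts; // indices of words that begin a new line
}

// In a browser the callback might look like this (assumption, untested here):
//   const measureHeight = (text) => { el.textContent = text; return el.clientHeight; };
```

Each step forces a reflow of the paragraph, so this is likely to be slow on long text, and any rounding in the reported height could hide a break.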

Edit: The reason I need to do this is that I need to accurately convert HTML to PDF. Acrobat renders text narrower than the browser does. This results in text that breaks in different positions. I need an identical ragged edge and the same number of lines in the converted PDF.

Edit:

@dtsazza: Thanks for your considered answer. It's not impossible to produce a layout editor that almost exactly replicates HTML. I've written 99% of one ;)

The app I'm working on allows a user to create a product catalogue by dragging on 'tiles'. The tiles are fixed-width, absolutely positioned divs that contain images and text. All elements are styled so the font size is fixed. My solution for finding \n in a paragraph is OK 80% of the time, and when it works with a given paragraph the resulting PDF is so close to the on-screen version that the differences do not matter. Paragraphs are the same height (to the pixel), images are replaced with high-res versions and all bitmap artwork is replaced with SVGs generated server side.

The only slight difference between my HTML and PDF is that Acrobat renders text slightly more narrowly, which results in a slightly shorter line length.

Diodeus's solution of adding spans and finding their coords is a very good one and should give me the location of the BRs. Please remember that the user will never see the HTML with the inserted BRs - these are added so that the PDF conversion produces a paragraph that is exactly the same size.

There are lots of people that seem to think this is impossible. I already have a working app that creates an extremely accurate HTML->PDF conversion of our docs - I just need a better way of adding BRs because my solution sometimes misses a BR. BTW, when it does work my paragraphs are the same height as the HTML equivalents, which is the result we are after.

If anyone is interested in the type of doc I'm converting then you can check out this screencast:

http://www.localsa.com.au/brochure/brochure.html

Edit: Many thanks to Diodeus - your suggestion was spot on.

Solution: for my situation it made more sense to wrap the words in spans instead of the spaces.

var text = paragraphElement.innerHTML.replace(/ /g, '</span> <span>');

text = "<span>" + text + "</span>"; // wrap first and last words.

This wraps each word in a span. I can now query the document to get all the words, iterate and compare their y positions. When the y position changes, add a br.

This works flawlessly and gives me the results I need - Thank you!
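The last step (iterate, compare y, add a br) could look roughly like the sketch below. It is an illustration, not the code actually used: the function name `insertLineBreaks` and the `{ text, top }` record shape are assumptions, with `top` standing for each word span's vertical position.

```javascript
// Rebuild the paragraph markup from measured words, emitting <br />
// instead of a space wherever the vertical position changes.
// Each entry is assumed to be { text: string, top: number }.
function insertLineBreaks(words) {
  let html = "";
  words.forEach((word, i) => {
    if (i > 0) {
      // A different top than the previous word means the browser wrapped here.
      html += word.top !== words[i - 1].top ? "<br />" : " ";
    }
    html += word.text;
  });
  return html;
}

// In a browser the records might be collected like this (assumption):
//   const words = Array.from(paragraphElement.querySelectorAll("span"),
//     (span) => ({ text: span.innerHTML, top: span.offsetTop }));

console.log(insertLineBreaks([
  { text: "This", top: 0 },
  { text: "is", top: 0 },
  { text: "long", top: 18 },
])); // prints "This is<br />long"
```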

+1 A:

I would suggest wrapping all spaces in a span tag and finding the coordinates of each tag. When the Y-value changes, you're on a new line.
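As a rough sketch of the comparison step, assuming the Y-values of the wrapped spans have already been read into an array in document order (the function name `findLineBreakIndices` is made up for this example):

```javascript
// Given the Y positions of consecutive spans, return the indices of the
// spans that sit on a different line than the span before them.
function findLineBreakIndices(tops) {
  const breaks = [];
  for (let i = 1; i < tops.length; i++) {
    if (tops[i] !== tops[i - 1]) {
      breaks.push(i);
    }
  }
  return breaks;
}

// In a browser the positions might come from offsetTop (assumption):
//   const tops = Array.from(paragraph.querySelectorAll("span"), (s) => s.offsetTop);

console.log(findLineBreakIndices([0, 0, 0, 18, 18, 36])); // prints [ 3, 5 ]
```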

Diodeus 2009-01-15 14:40:49

But Y?</I'll be here all night>

Mike Robinson 2009-01-15 14:45:56

Please see the edit

Eli_s 2009-01-15 14:49:16

@Diodeus: Great idea, will give it a shot.

Eli_s 2009-01-15 14:52:13

I came across the same problem when I was building an in-place editor using a bitmap of each character in a fancy non-browser font. I had to figure out the word-wrapping myself. Ugh.

Diodeus 2009-01-15 15:02:36

Thanks for your help, Diodeus. I'm halfway through implementing your idea and it is working like a charm :)

Eli_s 2009-01-15 16:35:51

+3 A:

I don't think there's going to be a very clean solution to this one, if any at all. The browser will flow a paragraph to fit the available space, linebreaking where needed. Consider that if a user resizes the browser window, all the paragraphs will be rerendered and almost certainly will change their break positions. If the user changes the size of the text on the page, the paragraphs will be rerendered with different line break points. If you (or some script on your page) change the size of another element on the page, this will change the amount of space available to a floating paragraph and again - different line break points.

Besides, changing the actual markup of your page to mimic something that the browser does for you (and does very well) seems like the wrong approach to whatever you're doing. What's the actual problem you're trying to solve here? There's probably a better way to achieve it.

Edit: OK, so you want to render to PDF the same as "the screen version". Do you have a specific definitive screen version nominated - in terms of browser window dimensions, user stylesheets, font preferences and adjusted font size? The critical thing about HTML is that it deliberately does not specify a specific layout. It simply describes which elements are on the page, what they are and where they are in relation to one another.

I've seen several misguided attempts before to produce some HTML that will exactly replicate a printed creative, designed in something like a DTP application where a definitive absolute layout is essential. Those efforts were doomed to failure because of the nature of HTML, and doing it the other way round (as you're trying to) will be even worse because you don't even have a definitive starting point to work from.

On the assumption that this is all out of your hands and you'll have to do it anyway, my suggestion would be to give up on the idea of mangling the HTML. Look at the PDF conversion software - if it's any good it should give you some options for font kerning and similar settings. Playing around with the details here should get you something that approximates the font rendering in the browser and thus breaks lines at the same places.

Failing that, all I can suggest is taking screenshots of the browser and parsing these with OCR to work out where the lines break (it shouldn't require a very accurate OCR since you know what the raw text is anyway, it essentially just has to count spaces). Or perhaps just embed the screenshot in the PDF if text search/selection isn't a big deal.

Finally, doing it by hand is likely the only way to make this work definitively and reliably.

But really, this is still just wrong and any attempts to revise the requirements would be better. Keep going up one step in the chain - why does the PDF have to have the exact same ragged edge as some arbitrary browser rendering? Can you achieve that purpose in another (better) way?

Andrzej Doyle 2009-01-15 14:43:53

Thanks for your reply. Please see my edit above.

Eli_s 2009-01-15 15:50:44

Eli_s 2009-01-15 16:02:30

Another thing I'm worried about with adding BRs is whether you'll be able to update them when the size changes. If the user resizes their browser etc. you'll need to take out those that you previously put in, else they'll have weird unnatural line breaks in addition to the browser's own ones.

Andrzej Doyle 2009-01-15 16:22:10

The line breaks are only added to the version that is saved to the server. On the client end the line breaks are added when the user saves a brochure then removed on save complete.

Eli_s 2009-01-15 16:34:30

Sounds like a bad idea when you account for user-set font sizes, MS Windows accessibility mode, and the hundreds of different mobile devices. Let the browser do its thing - trying to have exact control over the rendering will only cause you hours of frustration.

Mike Robinson 2009-01-15 14:45:25

Please see the edit

Eli_s 2009-01-15 14:53:04

I don't think you'll be able to do this with any kind of accuracy without embedding Gecko/WebKit/Trident or essentially recreating them.

annakata 2009-01-15 14:51:44

The approach I'm using at the moment (removing tokens and measuring height) works 80% of the time, but I don't have the time to polish it. Also, Diodeus's suggestion is a great one which I think will work well :)

Eli_s 2009-01-15 14:55:18

This is impossible because it goes against the fundamental differences of HTML and PDF.

HTML is rendered based on settings on the reader's side, i.e. preferred font size, screen resolution, browser window size/geometry etc. -- these settings are never known to the author and change from reader to reader, and this is a wanted feature since not everyone has the same technical utilities. PDF is rendered based on the settings the author prescribes and looks the same in every reader; this is especially useful if you want to print something on a given paper size. The two formats target totally different media, and things that look good in one do not necessarily look good in the other.

The only thing you can do is adopt analogous styles for your web page and the PDF.

Svante 2009-01-15 15:05:15

It's not impossible. I'm using a great HTML->PDF converter called PrinceXML. It does an amazing job of turning styled HTML into the equivalent PDF doc. Font rendering is slightly different between the engines, but all I need to do for this project is get the same ragged edge in both versions.

Eli_s 2009-01-15 15:12:15

OK, let me rephrase: it is impossible unless you break the flexibility of the HTML display.

Svante 2009-01-16 01:56:28

But this means that you "fix" the HTML, not the PDF.

Svante 2009-01-16 01:57:13

Maybe you're not understanding what I am trying to accomplish. A user designs a brochure using HTML. When the user saves the layout the BRs are added, the layout is saved and then the BRs are removed. To the end user this process is transparent. Thanks to Diodeus's suggestion I now have a working version.

Eli_s 2009-01-16 02:24:15

Maybe an alternative: do all line-breaks yourself, instead of relying on the browser. Place all text in pre tags, and add your own linebreaks. Now at least you don't have to figure out where the browser put them.
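A minimal sketch of what doing the line breaks yourself might involve, assuming a simple greedy wrap. The function name `wrapText` and the `measureWidth(text)` callback are hypothetical; how the width is measured (for example with a canvas 2D context's `measureText(text).width`) is left to the caller, and the result would only match the PDF if both sides used the same metrics.

```javascript
// Greedy word wrap: keep adding words to the current line until the
// measured width would exceed maxWidth, then start a new line.
// measureWidth is a hypothetical callback supplied by the caller.
function wrapText(text, maxWidth, measureWidth) {
  const lines = [];
  let current = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? current + " " + word : word;
    if (current && measureWidth(candidate) > maxWidth) {
      lines.push(current);
      current = word; // an over-long single word still gets its own line
    } else {
      current = candidate;
    }
  }
  if (current) lines.push(current);
  return lines.join("\n"); // suitable for placing inside a pre element
}

// Toy usage with a one-unit-per-character measure:
console.log(wrapText("aaaa bbbb cccc dddd ee", 10, (s) => s.length));
// prints three lines: "aaaa bbbb", "cccc dddd", "ee"
```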

Andrej 2009-01-15 15:42:49

Great idea! Will have to try this one out.

Eli_s 2009-01-15 15:48:28
