In web spiders/crawlers, how can I get the actual initial rendered size of the font a user sees in an HTML document, taking CSS into account?

+3  A: 

Rendered text size? A user can change the text size at will using his/her browser settings. Not to mention that different browsers render the same content slightly differently.

Jin Kim
The browser's default size is used unless a size is set somewhere, and in practice the font is nearly always set at least once in the CSS. The crawler should probably walk the DOM hierarchy to work out which CSS rule applies to an element, unless it is overridden by inline CSS in the HTML itself. Quite a lot of work, but possible. It's probably easier, though, to separate headers from normal text to get a better idea of what's what.
Alec
I agree, you'd pretty much have to replicate the DOM on the server by parsing all the HTML/CSS/JS to get the actual size of the rendered text. Sounds like a helluva project.
Jason Watts
Definitely a helluva project. Worse if you want to know about IE pixel heights. Less bad if just Gecko (or whatever Firefox uses these days) and WebKit will suffice; then my thoughts below seem more tractable for server-side execution.
Tetsujin no Oni
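To make the DOM-walk idea from the comments above a bit more concrete, here is a minimal sketch in Python of resolving a font size down a chain of ancestors. It is only an illustration under loud assumptions: the 16px default, the function name `resolve_font_size`, and the input format are all made up for this example, and it presumes selector matching and specificity have already been resolved so that each element has at most one declared `font-size`. Keywords such as `small` or `larger`, and anything set by JavaScript, are not handled.

```python
# Simplified font-size resolution along an ancestor chain (root element first).
# Assumptions: 16px browser default; only px, pt, em, rem and % are handled;
# the winning declaration per element is already known.
DEFAULT_PX = 16.0


def resolve_font_size(chain, default_px=DEFAULT_PX):
    """Return the font size in px for the last element in `chain`.

    `chain` lists the declared font-size of each element from <html> down to
    the target element; None means the element declares nothing and inherits.
    """
    root_px = default_px
    current = default_px  # inherited size starts at the browser default
    for depth, declared in enumerate(chain):
        if declared is not None:
            value = declared.strip().lower()
            if value.endswith("px"):
                current = float(value[:-2])
            elif value.endswith("pt"):
                current = float(value[:-2]) * 96.0 / 72.0  # 1pt = 1/72 inch
            elif value.endswith("rem"):  # must be checked before "em"
                base = default_px if depth == 0 else root_px
                current = float(value[:-3]) * base
            elif value.endswith("em"):
                current = float(value[:-2]) * current  # relative to parent
            elif value.endswith("%"):
                current = float(value[:-1]) / 100.0 * current
            else:
                raise ValueError(f"unsupported font-size value: {declared!r}")
        if depth == 0:
            root_px = current  # remember the root size for rem units
    return current
```

For example, `resolve_font_size(["62.5%", None, "1.6em"])` walks `html { font-size: 62.5% }`, an element with no declaration, and a target with `1.6em`. Even this toy version shows why the commenters call the full job a big project: real pages add specificity, keywords, media queries, and script-driven changes on top.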
A: 

If you are satisfied with an answer for the 'default' view, with no user customization (which seems likely for this purpose), I believe you are looking at a fairly painful scenario:

  • Embed a rendering engine with CSS support in your spider. Prefer an engine that matches most of your users, or alternatively use all three common engines and store the information for each of them. The ease of embedding varies widely depending on your consuming technology.

  • Load the URI being spidered in the rendering engine(s).

  • Using the engine's API, query its font metrics for an element containing what you consider representative text (choosing this is an exercise for which I won't even begin to predict a strategy). How you access this will depend entirely on the embedding scenario for your engine.

I expect this is the 'hard way', but I'm not sure there is an 'easy' way.
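As a rough illustration of the three steps above, here is a sketch in Python that drives a headless browser and asks it for the computed font size. Several things here are assumptions and not part of the answer itself: it uses the Playwright library as one possible way to embed an engine, it only covers Chromium, and the helper names `parse_px` and `rendered_font_size` as well as the default selector `"p"` are placeholders for whatever you decide counts as representative text.

```python
def parse_px(computed):
    """Convert a computed CSS length such as '16px' to a float."""
    if not computed.endswith("px"):
        raise ValueError(f"expected a px value, got {computed!r}")
    return float(computed[:-2])


def rendered_font_size(url, selector="p"):
    """Load `url` in headless Chromium and return the computed font size (px)
    of the first element matching `selector`, or None if nothing matches."""
    # Imported lazily so parse_px stays usable without Playwright installed.
    # Assumption: Playwright and its Chromium build are installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            # getComputedStyle reports the resolved size as a px string.
            computed = page.evaluate(
                """sel => {
                    const el = document.querySelector(sel);
                    return el ? getComputedStyle(el).fontSize : null;
                }""",
                selector,
            )
        finally:
            browser.close()
    return None if computed is None else parse_px(computed)
```

A call would look like `rendered_font_size("https://example.com", "h1")`. The result reflects that engine's defaults only; covering other engines would mean repeating the query with `p.firefox` or `p.webkit` and storing one value per engine, as suggested in the first step.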

Tetsujin no Oni