Hello there again,

I'm working on a project where I need to store web pages in a database. I'm using crawler4j to crawl, and Proxool along with the MySQL Java Connector to connect to my database.

When I tested the application I got: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'HTMLData'.

The HTMLData column was TEXT.

When I changed the HTMLData column to LONGTEXT the error went away, but I'm afraid it might come back in the future.

Any idea how to handle this properly so I don't have to worry about this error (or any similar one) in the future?

Thanks :)

+3  A: 

LONGTEXT can hold 4,294,967,295 bytes, see http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html

I'd say you don't want to store an HTML document bigger than 4GB, do you?

(Edit: I overshot the byte count by 1 byte; it's 2^32 - 1, of course.)
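
For context on why TEXT overflowed while LONGTEXT did not, here is a minimal sketch that picks the smallest MySQL text type able to hold a given payload. The class and method names (`TextColumnSizes`, `smallestTypeFor`, `utf8Length`) are made up for illustration. The limits are the documented maximums in bytes, not characters, so the sketch assumes the column uses a UTF-8 character set and measures the string's UTF-8 encoded length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of MySQL Connector/J or crawler4j.
class TextColumnSizes {
    // Maximum payload sizes in bytes for MySQL's text types.
    static final long TINYTEXT_MAX = 255L;             // 2^8  - 1
    static final long TEXT_MAX = 65_535L;              // 2^16 - 1
    static final long MEDIUMTEXT_MAX = 16_777_215L;    // 2^24 - 1
    static final long LONGTEXT_MAX = 4_294_967_295L;   // 2^32 - 1

    // Returns the name of the smallest text type that can hold byteLength bytes.
    static String smallestTypeFor(long byteLength) {
        if (byteLength < 0) {
            throw new IllegalArgumentException("byteLength must not be negative");
        }
        if (byteLength <= TINYTEXT_MAX) return "TINYTEXT";
        if (byteLength <= TEXT_MAX) return "TEXT";
        if (byteLength <= MEDIUMTEXT_MAX) return "MEDIUMTEXT";
        if (byteLength <= LONGTEXT_MAX) return "LONGTEXT";
        throw new IllegalArgumentException("Too large for any MySQL text type");
    }

    // Assumption: the column stores UTF-8, so the limit applies to this length.
    static long utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Calling `TextColumnSizes.smallestTypeFor(TextColumnSizes.utf8Length(html))` before an insert is one way to log pages that would not fit the current column type, rather than finding out from a `MysqlDataTruncation` exception.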

Wrikken
But see my answer below -- you may not actually get 4GB into one through the JDBC connector.
Neil Coffey
My point was more that it is more than enough for HTML; even 1GB is way overdone for any reasonable HTML document. Hitting the 65K limit, OK, but MEDIUMTEXT should be more than enough; 16MB for the standard max_allowed_packet is already pushing it very far for plain HTML.
Wrikken
+1  A: 

This doesn't sound like a good design to me. Why do you have to store HTML in a database? It feels like it couples every tier, from view to persistence, through and through.

JSPs are dynamic templates for HTML pages; why not just use JSPs?

This is a design worth re-thinking.

duffymo
Wrikken
As Wrikken said it is for crawling. :)
mpcabd
Wrikken
If you're doing any kind of serious crawling, storing the whole web page will be very costly. The Internet Archive requires ~2 petabytes (http://www.archive.org/about/faqs.php) to store everything it archives, and that's expensive. You should process what you crawl and strip out everything you don't need, to minimize the disk space required. You can also look into something like Lucene (http://lucene.apache.org/) to build indexes of the data you're crawling, which will do a lot of that work for you.
dimo414
+5  A: 

In principle, a LONGTEXT field can hold 4GB of data; however, other, smaller restrictions probably apply. E.g. from the MySQL documentation: "The largest possible packet that can be transmitted to or from a MySQL 5.1 server or client is 1GB." I think this effectively means you'll get up to about 1GB into a LONGTEXT (and even then, I think you'll have to reconfigure the maximum packet size from its default).

Irrespective of this limit, HTML generally compresses well, so if your frameworks allow it, I would suggest you consider a LONGBLOB and run the data through a Deflater before storage (and through an Inflater on retrieval).
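
A rough sketch of that compress-before-store idea, using `java.util.zip.Deflater` and `Inflater` from the standard library. The class and method names (`HtmlCompression`, `compress`, `decompress`) are invented for this example, and it assumes the page is handled as a UTF-8 string and that the resulting `byte[]` is what gets written to the LONGBLOB column (e.g. via `PreparedStatement.setBytes`).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper; the names are illustrative only.
class HtmlCompression {

    // Compresses the page's UTF-8 bytes; the result is what would go into the BLOB.
    static byte[] compress(String html) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(html.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    // Reverses compress(); rejects data that is truncated or not in zlib format.
    static String decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or invalid compressed data");
                }
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid compressed data", e);
        } finally {
            inflater.end();
        }
    }
}
```

The trade-off is that the stored data can no longer be searched or inspected with plain SQL, so this only makes sense if the pages are always read back through the application.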

Neil Coffey