large-data-volumes

STXXL equivalent in Java

I'm searching for a collection framework designed for huge datasets in Java that behaves transparently, like STXXL does for C++. It should transparently swap to disk, but in a much more efficient manner than plain OS-based VM swapping. A StringBuffer/String drop-in replacement would be a big plus. ...
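I'm not aware of a direct STXXL port; embedded stores such as Berkeley DB Java Edition or JDBM are the usual suggestions. As a minimal sketch of the underlying idea only, here is a hypothetical disk-backed list of fixed-width records built on RandomAccessFile (the class and method names are mine; a real version would add buffering and an in-memory cache):

import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: a disk-backed list of fixed-width records.
// Real libraries (e.g. Berkeley DB Java Edition, JDBM) add caching and robustness.
public class DiskBackedLongList implements AutoCloseable {
    private static final int RECORD_SIZE = 8; // one long per entry
    private final RandomAccessFile file;

    public DiskBackedLongList(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    public long get(long index) throws IOException {
        file.seek(index * RECORD_SIZE);
        return file.readLong();
    }

    public void set(long index, long value) throws IOException {
        file.seek(index * RECORD_SIZE);
        file.writeLong(value);
    }

    public long size() throws IOException {
        return file.length() / RECORD_SIZE;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}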

How would you handle making an array or list with more entries than the standard implementation allows you to access

I am trying to create an array or list that could, in theory, given adequate hardware and such, handle as many as 100^100 BigInteger entries. The problem with using an array or standard list is that they can only hold Integer.MAX_VALUE entries. How would you work around this limitation? A whole new class/interface? A wrapper...
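One common workaround is a wrapper indexed by long that delegates to an array of arrays. A hypothetical sketch (note this particular layout tops out around 2^51 entries, nowhere near 100^100, so in practice the ceiling is hardware, not the class):

import java.math.BigInteger;

// Hypothetical sketch: a long-indexed "big array" built from chunks,
// sidestepping the int-indexed limit of plain Java arrays.
public class BigArray {
    private static final int CHUNK_BITS = 20;              // 2^20 entries per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;
    private final BigInteger[][] chunks;

    public BigArray(long size) {
        int chunkCount = (int) ((size + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new BigInteger[chunkCount][];
    }

    public BigInteger get(long index) {
        BigInteger[] chunk = chunks[(int) (index >>> CHUNK_BITS)];
        return chunk == null ? null : chunk[(int) (index & CHUNK_MASK)];
    }

    public void set(long index, BigInteger value) {
        int c = (int) (index >>> CHUNK_BITS);
        if (chunks[c] == null) chunks[c] = new BigInteger[CHUNK_SIZE]; // lazy allocation
        chunks[c][(int) (index & CHUNK_MASK)] = value;
    }
}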

How to avoid an OOM (out of memory) error when retrieving all records from a huge table?

Hi all, I have been given a task to convert a huge table to a custom XML file. I will be using Java for this job. If I simply issue a "SELECT * FROM customer", it may return a huge amount of data, eventually causing an OOM. I wonder, is there a way I can process each record immediately once it becomes available, and remove the record from memory a...
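With JDBC this is usually done by streaming the result set instead of materializing it. A sketch, assuming MySQL Connector/J (which streams row-by-row only for a forward-only, read-only statement with fetch size Integer.MIN_VALUE; most other drivers honor an ordinary setFetchSize(n)); writeXmlRecord is a hypothetical placeholder:

import java.sql.*;

// Sketch: stream rows one at a time rather than loading the whole table.
public class CustomerExporter {
    public static void export(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            st.setFetchSize(Integer.MIN_VALUE);          // MySQL streaming hint
            try (ResultSet rs = st.executeQuery("SELECT * FROM customer")) {
                while (rs.next()) {
                    writeXmlRecord(rs);                  // handle one row, then let it go
                }
            }
        }
    }

    private static void writeXmlRecord(ResultSet rs) throws SQLException {
        // hypothetical: append one <customer> element to the output stream
    }
}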

Best way to store/retrieve millions of files when their meta-data is in a SQL Database

I have a process that's going to initially generate 3-4 million PDF files, and continue at the rate of 80K/day. They'll be pretty small (50K) each, but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details: I'll have some other steps to run once a file has been generated, and ther...
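A common pattern is to keep only the metadata in SQL and derive each file's directory from its database key, so no single directory grows unbounded. A hypothetical sketch of such a path scheme (the layout and naming are assumptions, not a standard):

import java.io.File;

// Sketch: derive a two-level directory from the database key so no single
// directory holds more than a few thousand files.
public class PdfStore {
    private final File root;

    public PdfStore(File root) { this.root = root; }

    public File fileFor(long documentId) {
        // e.g. id 12345678 -> <root>/678/345/012345678.pdf
        String name = String.format("%09d", documentId);
        File dir = new File(root, name.substring(6) + File.separator + name.substring(3, 6));
        dir.mkdirs();
        return new File(dir, name + ".pdf");
    }
}

Using the trailing digits spreads new files uniformly across buckets, so at 80K/day every bucket grows at roughly the same rate.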

Is it necessary to have mysqlcheck run when starting MySQL?

I have a large (about 10 GB with a 20 GB InnoDB buffer pool) database, and have noticed that for about the first half hour after I start it, the database will periodically lock and unlock all tables, making it quite unpleasant for users who attempt to access our site during that window. While I...

How to add more than 500 entries to the datastore with put() in Google App Engine?

I tried adding batches of data in a list with a couple of calls to db.put(). But it still times out occasionally. Anyone have some tips? ...
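Language aside, the usual workaround is the same: split the work into batches small enough to finish within the request deadline, with one datastore call per batch. A generic chunking helper, sketched in Java (the 500-entity limit comes from the question; the helper and its names are hypothetical):

import java.util.List;

// Sketch: apply an action to a list in fixed-size chunks.
public final class Batches {
    private Batches() {}

    public static <T> void inChunks(List<T> items, int chunkSize,
                                    java.util.function.Consumer<List<T>> action) {
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            action.accept(items.subList(from, to));  // e.g. one put() per chunk
        }
    }
}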

Bad idea to transfer large payload using web services?

I gather that there basically isn't a limit to the amount of data that can be sent when using REST via a POST or GET. While I haven't used REST or web services, it seems that most services involve transferring limited amounts of data. If you want to transfer 1-5 MB worth of data (in either direction), are web services considered a bad ide...
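Payloads of 1-5 MB are generally fine as long as neither side buffers the whole body in memory. A sketch of the client side using java.net.HttpURLConnection's chunked streaming mode (URL and content type are placeholders):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: POST a multi-megabyte payload without holding it all in memory.
public class LargePost {
    public static int post(URL url, InputStream body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setChunkedStreamingMode(64 * 1024);   // 64 KB chunks
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream out = conn.getOutputStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = body.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
        return conn.getResponseCode();
    }
}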

Oracle: Find previous record for a ranked list of forecasts

Hi, I am faced with a difficult problem: I have a table (Oracle 9i) of weather forecasts (many hundreds of millions of records in size) whose makeup looks like this: stationid forecastdate forecastinterval forecastcreated forecastvalue --------------------------------------------------------------------------------- varchar (...
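Oracle 9i's analytic functions fit this well: LAG() pairs each row with the previous one in its window, avoiding a self-join over hundreds of millions of rows. A JDBC sketch; the table name and the partitioning/ordering that define "previous" here are assumptions about the data:

import java.sql.*;

// Sketch: LAG() returns the forecast created immediately before each row.
public class PreviousForecast {
    public static void query(Connection conn) throws SQLException {
        String sql =
            "SELECT stationid, forecastdate, forecastinterval, forecastcreated, forecastvalue, " +
            "       LAG(forecastvalue) OVER (" +
            "           PARTITION BY stationid, forecastdate, forecastinterval" +
            "           ORDER BY forecastcreated) AS previousvalue " +
            "FROM forecasts";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // previousvalue is NULL for the first forecast in each group
            }
        }
    }
}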

Is there a way to maintain a 200MB immutable data structure in memory and access it from a script?

I have a list of 9 million IPs and, with a set of hash tables, I can make a constant-time function that returns whether a particular IP is in that list. Can I do it in PHP? If so, how? ...

How to handle large lists of data

We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase memory limits, we hesitate to do so, since it requires having a high allocation when most of the time it's not necessary. We are considering using a customized java.util.List implementatio...
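A sketch of the custom-List idea: extend java.util.AbstractList with a read-only facade that loads fixed-size pages on demand, so only the pages actually touched occupy memory. PagedList and loadPage are hypothetical names, and a production version would also evict cold pages:

import java.util.AbstractList;
import java.util.HashMap;
import java.util.Map;

// Sketch: a read-only List that pulls pages from a backing store on demand.
public abstract class PagedList<T> extends AbstractList<T> {
    private static final int PAGE_SIZE = 1024;
    private final Map<Integer, Object[]> pages = new HashMap<>();
    private final int size;

    protected PagedList(int size) { this.size = size; }

    /** Load one page of PAGE_SIZE elements from the backing store (file, DB, ...). */
    protected abstract Object[] loadPage(int pageIndex);

    @Override
    @SuppressWarnings("unchecked")
    public T get(int index) {
        Object[] page = pages.computeIfAbsent(index / PAGE_SIZE, this::loadPage);
        return (T) page[index % PAGE_SIZE];
    }

    @Override
    public int size() { return size; }
}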

MySQL: delete all rows except those whose ID is in a given list

So basically here's what I want to do: I have an account table, and I have a list of acct_id values: (3, 24, 515, 6326, 17). Assuming I have about 100,000 accounts in the table, what's the most effective way to delete all the other rows besides the ones with an acct_id in my given list? I came up with something like: delete from account where ...
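The NOT IN form is the straightforward answer for a keep-list this short. A JDBC sketch that builds one placeholder per kept id (assumes a non-empty list; for very long keep-lists, loading the ids into a temporary table and deleting via a join scales better):

import java.sql.*;
import java.util.Collections;
import java.util.List;

// Sketch: DELETE every account whose acct_id is not in the keep-list.
public class AccountCleanup {
    public static int deleteAllExcept(Connection conn, List<Integer> keepIds)
            throws SQLException {
        // keepIds must be non-empty: "NOT IN ()" is invalid SQL
        String placeholders = String.join(",", Collections.nCopies(keepIds.size(), "?"));
        String sql = "DELETE FROM account WHERE acct_id NOT IN (" + placeholders + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < keepIds.size(); i++) {
                ps.setInt(i + 1, keepIds.get(i));
            }
            return ps.executeUpdate();   // number of rows deleted
        }
    }
}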

Fastest way to search a 1 GB+ string of data for the first occurrence of a pattern in Python

There's a 1 gigabyte string of arbitrary data which you can assume to be equivalent to something like: 1_gb_string=os.urandom(1*gigabyte) We will be searching this string, 1_gb_string, for an infinite number of fixed-width, 1 kilobyte patterns, 1_kb_pattern. Every time we search, the pattern will be different. So caching opportunities ...
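Language aside, a rolling hash (Rabin-Karp) is one standard answer: it scans the gigabyte once per pattern, updating the window hash in O(1) per byte instead of re-hashing every 1 KB window from scratch. A sketch (a Java byte[] comfortably holds 1 GB, since arrays allow up to 2^31 - 1 elements):

// Sketch: Rabin-Karp search with a rolling hash over raw bytes.
public class RabinKarp {
    private static final long BASE = 256, MOD = (1L << 31) - 1;   // Mersenne prime

    /** Returns the first index of pattern in data, or -1. */
    public static int indexOf(byte[] data, byte[] pattern) {
        int m = pattern.length;
        if (m == 0 || m > data.length) return -1;
        long patHash = 0, winHash = 0, pow = 1;   // pow = BASE^(m-1) mod MOD
        for (int i = 0; i < m; i++) {
            patHash = (patHash * BASE + (pattern[i] & 0xFF)) % MOD;
            winHash = (winHash * BASE + (data[i] & 0xFF)) % MOD;
            if (i < m - 1) pow = (pow * BASE) % MOD;
        }
        for (int i = 0; ; i++) {
            // verify on hash match to rule out collisions
            if (winHash == patHash && matches(data, i, pattern)) return i;
            if (i + m >= data.length) return -1;
            // roll: drop data[i], append data[i + m]
            winHash = ((winHash - (data[i] & 0xFF) * pow % MOD + MOD) * BASE
                       + (data[i + m] & 0xFF)) % MOD;
        }
    }

    private static boolean matches(byte[] data, int off, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[off + j] != pattern[j]) return false;
        }
        return true;
    }
}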

Doing large updates against indexed view

We have an indexed view that runs across three large tables. Two of these tables (A & B) are constantly getting updated with user transactions and the other table (C) contains product info that needs to be updated once a week. This product table contains over 6 million records. We need this view across these three tables for ou...

SQL Server table structure for storing a large number of images

What's the best practice for storing a large amount of image data in SQL Server 2008? I'm expecting to store around 50,000 images using approx 5 gigs of storage space. Currently I'm doing this using a single table with the columns: ID: int/PK/identity Picture: Image Thumbnail: Image UploadDate: DateTime I'm concerned because at arou...

Fast conversion of numeric data into a fixed-width format file in Python

What is the fastest way of converting records holding only numeric data into fixed-width format strings and writing them to a file in Python? For example, suppose record is a huge list consisting of objects with attributes id, x, y, and wt, and we frequently need to flush them to an external file. The flushing can be done with the followi...

Select Count(*) over a large amount of data

Hello, I want to do this for a report, but I have 20,000,000 records in my table and it causes a timeout in my application. SELECT T.transactionStatusID, TS.shortName AS TransactionStatusDefShortName, count(*) AS qtyTransactions FROM Transactions T INNER JOIN TransactionTypesCurrencies TTC ON T.id_Ent = TTC.id_Ent ...
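The timeout itself can be raised on the client side while the query is tuned; the durable fix is usually an index covering the joined and grouped columns, or a pre-aggregated summary table the report reads instead. A JDBC sketch of the client-side part (the 300-second value is an arbitrary assumption):

import java.sql.*;

// Sketch: give a long-running aggregate query more time to finish.
public class TransactionReport {
    public static void run(Connection conn, String sql) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.setQueryTimeout(300);      // seconds; default is often far lower
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // render one report row
                }
            }
        }
    }
}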

SQL Server 2005: proper index to filter 30,000,000 records

Hello, I have a problem with a stored procedure over a transactional table; users have a web form to find transactions by several values. The process is taking too long and I don't know how to set a proper index. Here is my stored procedure: CREATE PROCEDURE dbo.cg_searchTransactions ( @id_Ent tinyint, @transactionTypeID int = ...

Dealing with large files in Haskell

I have a large file (4+ gigs) of, let's just say, 4-byte floats. I would like to treat it as a list, in the sense that I would like to be able to use map, filter, foldl, etc. However, instead of producing a new list with the output, I would like to write the output back into the file, and thus only have to load a small portion of the file i...

Read from one large file and write to many (tens, hundreds, or thousands) files in Java?

I have a large-ish file (4-5 GB compressed) of small messages that I wish to parse into approximately 6,000 files by message type. Messages are small; anywhere from 5 to 50 bytes depending on the type. Each message starts with a fixed-size type field (a 6-byte key). If I read a message of type '000001', I want to append its payloa...
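Since most operating systems cap open file descriptors well below 6,000, one workable pattern is a bounded cache of appending output streams, closing the least-recently-used one when the cache is full; reopening in append mode is cheap and safe. A sketch using LinkedHashMap's access-order eviction (file naming and the 512 limit are assumptions):

import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: route each message to a per-type appending stream, with LRU
// eviction so the open-file count stays under the OS descriptor limit.
public class MessageFanOut implements Closeable {
    private static final int MAX_OPEN = 512;   // assumption: below the fd limit
    private final File dir;
    private final LinkedHashMap<String, OutputStream> open =
        new LinkedHashMap<>(MAX_OPEN, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, OutputStream> e) {
                if (size() > MAX_OPEN) {
                    try { e.getValue().close(); } catch (IOException ignored) {}
                    return true;
                }
                return false;
            }
        };

    public MessageFanOut(File dir) { this.dir = dir; }

    public void write(String type, byte[] payload) throws IOException {
        OutputStream out = open.get(type);
        if (out == null) {
            out = new BufferedOutputStream(
                new FileOutputStream(new File(dir, type + ".dat"), true)); // append
            open.put(type, out);
        }
        out.write(payload);
    }

    @Override
    public void close() throws IOException {
        for (OutputStream out : open.values()) out.close();
        open.clear();
    }
}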

JavaScript to find available memory

Let me make it immediately clear: this is not a question about memory leaks! I have a page which allows the user to enter some data and JavaScript to handle this data and produce a result. The JavaScript produces incremental outputs on a DIV, something like this: (function() { var newdiv = document.createElement("div"); newdiv.inner...