large-data-volumes

Efficient File I/O and Conversion of Strings to Floats

I have some gigantic (several gigabyte) ASCII text files that I need to read in line-by-line, convert certain columns to floating point, and do a few simple operations on these numbers. It's pretty straightforward stuff, except that I'm thinking that there has to be a way to speed it up a whole bunch. The program never uses the equival...
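The excerpt doesn't name the language the program is written in, so the following is purely an illustration in Java of the usual levers: buffered reading, parsing only the columns that are actually needed, and avoiding per-line object churn. The column index and whitespace-separated layout are assumptions, not details from the question.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ColumnSums {
        public static void main(String[] args) throws IOException {
            double sum = 0.0;
            long rows = 0;
            // A large buffer reduces the number of underlying reads on multi-gigabyte files.
            try (BufferedReader reader = new BufferedReader(new FileReader(args[0]), 1 << 20)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumed layout: whitespace-separated columns, with the third column
                    // (index 2) being the one converted to floating point.
                    String[] cols = line.trim().split("\\s+");
                    if (cols.length > 2) {
                        sum += Double.parseDouble(cols[2]);
                        rows++;
                    }
                }
            }
            System.out.println("rows=" + rows + " sum=" + sum);
        }
    }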

Displaying large sorted lists in WPF QUICKLY?

I am developing a program that should be able to display long (up to 500 items) lists of data that need to be re-sorted when their contents change. Essentially, I have a viewmodel with an observable collection that contains classes with observable data bound to the GUI, which is displayed in a ListView. The data must be sorted, but the ...

Processing Apache logs quickly

Hi! I'm currently running an awk script to process a large (8.1 GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14 MB of the (1000 ± 500) MB I expect it to write, and I wonder if I can process it much faster somehow. Here is the awk script: #!/bin/bash awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ...

Command-line script or software tools to label a 3D point cloud dataset

How can I label a 3D point cloud dataset? Is there software that can load a text file containing x, y, z values and then visualize it, so that I can label it? ...

Handling large records in a J2EE application

There is a table phonenumbers with two columns: id and number. There are about half a million entries in the table. The database is MySQL. The requirement is to develop a simple J2EE application, connected to that database, that allows a user to download all number values in comma-separated form by following a specific URL. If we get all ...
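One common way to keep such a download from blowing up the heap (a sketch under assumptions, not necessarily what the asker ended up doing) is to stream the result set straight to the servlet response instead of building the full comma-separated string in memory first. The DataSource wiring below is a placeholder.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.sql.DataSource;

    public class NumbersDownloadServlet extends HttpServlet {
        private DataSource dataSource; // assumed to be injected or looked up via JNDI

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            resp.setContentType("text/plain");
            try (Connection con = dataSource.getConnection();
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT number FROM phonenumbers")) {
                PrintWriter out = resp.getWriter();
                boolean first = true;
                // Each value is written as it is fetched; nothing is accumulated in memory.
                while (rs.next()) {
                    if (!first) {
                        out.print(',');
                    }
                    out.print(rs.getString(1));
                    first = false;
                }
            } catch (SQLException e) {
                throw new ServletException(e);
            }
        }
    }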

JDBC Batch Insert OutOfMemoryError

Hi, I have written a method insert() in which I am trying to use a JDBC batch to insert half a million records into a MySQL database: public void insert(int nameListId, String[] names) { String sql = "INSERT INTO name_list_subscribers (name_list_id, name, date_added)"+ " VALUES (?, ?, NOW())"; Conn...
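The usual reason this pattern runs out of memory is that every addBatch() call keeps its parameters buffered until executeBatch() is sent, so batching all 500,000 rows at once grows without bound. A minimal sketch of the common fix, flushing the batch every few thousand rows; the SQL string is taken from the question, while the batch size and transaction handling are assumptions.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class SubscriberInserter {
        private static final int BATCH_SIZE = 1_000; // send the batch to the server every 1,000 rows

        public void insert(Connection conn, int nameListId, String[] names) throws SQLException {
            String sql = "INSERT INTO name_list_subscribers (name_list_id, name, date_added)"
                       + " VALUES (?, ?, NOW())";
            conn.setAutoCommit(false); // one transaction instead of one commit per row
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (String name : names) {
                    ps.setInt(1, nameListId);
                    ps.setString(2, name);
                    ps.addBatch();
                    // Executing periodically keeps the driver's buffered parameters bounded.
                    if (++count % BATCH_SIZE == 0) {
                        ps.executeBatch();
                    }
                }
                ps.executeBatch(); // flush whatever is left
                conn.commit();
            }
        }
    }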

Theoretical large volume issue, can't use collection to sort in .NET

Excuse the title of this post, but I can't really think of a more creative one. I am calling a third-party web service where the authors order transaction results from most recent first. The total transaction count is greater than 100,000. To make matters more interesting, the web service sends down complex objects representing each tr...

MySQL: Operate on Many Rows Using a Long List of Composite PKs

What's a good way to work with many rows in MySQL, given that I have a long list of keys in a client application that is connecting with ODBC? Note: my experience is largely with SQL Server, so I know a bit, just not MySQL specifically. The task is to delete some rows from 9 tables, but I might have upwards of 5,000 key pairs. I started ou...

Java: efficient de-duplication

Let's say you have a large text file. Each row contains an email id and some other information (say, some product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e., eliminate duplicates)? ...
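One straightforward approach, sketched below under the assumption that the set of email ids fits in the heap (pushing the work onto the database with a unique index and duplicate-ignoring inserts is the usual alternative when it does not): keep a HashSet of ids already seen while streaming the file, and only load rows whose id is new. The field layout is an assumption.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class DeDup {
        public static void main(String[] args) throws IOException {
            Set<String> seen = new HashSet<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumes the email id is the first comma-separated field on each row.
                    String emailId = line.split(",", 2)[0].trim().toLowerCase();
                    if (seen.add(emailId)) {
                        // First occurrence: insert this row into the database here.
                    }
                    // else: duplicate id, skip the row
                }
            }
        }
    }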

MySQL: How to handle regular high volume of inserts/updates?

Hi, I have a table where I track various statistics about site usage. Once a week, I plan to use these statistics and information from other tables to calculate various key indicators for use in multiple reports, so that the complex data does not have to be computed each time a report is accessed. These indicators will be stored in a sepa...

Avoid an "out of memory" error in Java (Eclipse) when using a large data structure?

OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory" error during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and ...
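Two levers that commonly help here, neither of which is specific to the asker's code: raising the heap ceiling (the -Xmx JVM option, set under "VM arguments" in an Eclipse run configuration), and storing bulk numeric data in primitive arrays instead of boxed collections, which typically shrinks the structure several-fold. A rough illustration of the second point:

    import java.util.ArrayList;
    import java.util.List;

    public class HeapFootprint {
        public static void main(String[] args) {
            int n = 2_000_000;

            // Primitive array: 8 bytes per value in one contiguous block (about 16 MB here).
            long[] compact = new long[n];
            for (int i = 0; i < n; i++) {
                compact[i] = i;
            }

            // Boxed list: every element is a separate Long object (header plus value)
            // reached through a reference, typically several times the memory of long[].
            List<Long> boxed = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                boxed.add((long) i);
            }

            System.out.println(compact.length + " / " + boxed.size());
        }
    }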

Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the applicat...

[perl] Issue with cloning and large structure processing

My Perl script has weird behaviour that I don't understand. I'm processing a large structure, stored as an array of hashes, which grows while it is being processed. The problem is that the structure is at most about 8 MB when I store it on disk, but while it is being processed it takes about 130 MB of RAM. Why is there such a big difference? The main flow of procce...

Large maintenance PHP script: how to print debug strings while the script is executing?

I have a very large PHP maintenance script (basically it recreates thumbnails for an internal archive); it takes 10 to 20 minutes to complete, and I noticed that PHP only displays "echo" output when the whole script has finished executing. Is there any way to show messages like: Phase 1 - Complete Phase 2 - Complete Phase n - Complete While th...

How to load 1 million records from a database fast?

We have a Firebird database with 1,000,000 records that must be processed after ALL of them are loaded into RAM. To get all of them, we currently extract the data using (select * first 1000 ...), which takes 8 hours. What is the solution for this? ...
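The excerpt doesn't say which client library is in use, so purely as an assumption: with a JDBC client (for example Jaybird, the Firebird JDBC driver), a single SELECT over the whole table with a modest fetch size usually loads far faster than paging with repeated select first 1000 queries, since each page restarts work on the server. The connection details and table name below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LoadAllRows {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:firebirdsql://localhost:3050/mydb", "sysdba", "masterkey");
                 Statement st = con.createStatement()) {
                st.setFetchSize(10_000); // fetch rows in chunks over one cursor
                try (ResultSet rs = st.executeQuery("SELECT * FROM mytable")) {
                    while (rs.next()) {
                        // build the in-memory representation row by row here
                    }
                }
            }
        }
    }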

How to pick a chunksize for python multiprocessing with large datasets

I am attempting to use Python to gain some performance on a task that can be highly parallelized using http://docs.python.org/library/multiprocessing. Looking at the library documentation, it says to use a chunk size for very long iterables. Now, my iterable is not long, but one of the dicts it contains is huge: ~100,000 entries, with tuples ...

Fastest way to insert a very large number of records into a table in SQL

The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code; it's not a move from another table, so INSERT/SELECT won't help. Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the proce...
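Two things that often dominate the timing here are committing per row and paying one network round trip per INSERT statement. A hedged sketch that addresses both by disabling auto-commit and packing many rows into a single multi-row VALUES statement; the table and column names are placeholders, and the leftover rows that don't fill a full statement are omitted for brevity.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class MultiRowInsert {
        private static final int ROWS_PER_STATEMENT = 500;

        public static void insert(Connection conn, List<String[]> records) throws SQLException {
            conn.setAutoCommit(false); // one transaction, one commit at the end
            // Builds: INSERT INTO my_table (col_a, col_b) VALUES (?, ?), (?, ?), ... repeated 500 times
            StringBuilder sql = new StringBuilder("INSERT INTO my_table (col_a, col_b) VALUES ");
            for (int i = 0; i < ROWS_PER_STATEMENT; i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }
            try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                for (int i = 0; i + ROWS_PER_STATEMENT <= records.size(); i += ROWS_PER_STATEMENT) {
                    for (int j = 0; j < ROWS_PER_STATEMENT; j++) {
                        String[] r = records.get(i + j);
                        ps.setString(2 * j + 1, r[0]);
                        ps.setString(2 * j + 2, r[1]);
                    }
                    ps.executeUpdate(); // 500 rows per round trip
                }
                conn.commit();
            }
        }
    }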

Performing Aggregate Functions on Multi-Million Row Tables

I'm having some serious performance issues with a multi-million-row table that I feel I should be able to get results from fairly quickly. Here's a rundown of what I have, how I'm querying it, and how long it's taking: I'm running SQL Server 2008 Standard, so partitioning isn't currently an option. I'm attempting to aggregate all views f...

Using Hibernate's ScrollableResults to slowly read 90 million records

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate: ScrollableResults results = session.createQuery("SELECT person FROM Person person") .setReadOnly(true).set...
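Two details often matter with this exact pattern (hedged, since only a fragment of the code is visible): MySQL Connector/J buffers the whole result set on the client unless the fetch size is set to Integer.MIN_VALUE, and the Hibernate Session keeps every loaded entity in its first-level cache unless it is cleared (or a StatelessSession is used instead). A minimal sketch combining both, with the query taken from the question and the file-writing step stubbed out:

    import java.io.PrintWriter;
    import org.hibernate.ScrollMode;
    import org.hibernate.ScrollableResults;
    import org.hibernate.Session;

    public class PersonExporter {
        public void export(Session session, PrintWriter out) {
            ScrollableResults results = session.createQuery("SELECT person FROM Person person")
                    .setReadOnly(true)
                    // With MySQL Connector/J, Integer.MIN_VALUE asks the driver to stream rows
                    // instead of materialising all 90 million of them client-side.
                    .setFetchSize(Integer.MIN_VALUE)
                    .scroll(ScrollMode.FORWARD_ONLY);
            int count = 0;
            while (results.next()) {
                Object person = results.get(0);
                out.println(person); // stand-in for the real file-writing logic
                // Detach processed entities so the first-level cache doesn't grow without bound.
                if (++count % 1_000 == 0) {
                    session.clear();
                }
            }
            results.close();
        }
    }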

Getting started with massive data

I'm a math guy and occasionally do some statistics/machine-learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I n...