
Large amount of data in many text files - how to process?

Hi, I have large amounts of data (a few terabytes) and accumulating... They are contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging + additional transformations) over observations/rows based on a series of predicate statements, and then saving th...
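One possible approach to the read-filter-aggregate pattern described above, as a minimal sketch rather than a definitive answer: stream each file row by row and keep only running totals in memory. The function name `aggregate_files`, the column names, and the predicate are all hypothetical; the sketch assumes each file has a header row and that the value column is numeric.

```python
import csv
from collections import defaultdict


def aggregate_files(paths, key_col, value_col, predicate):
    """Stream rows from tab-delimited files, keeping a running sum and count per key."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for path in paths:
        with open(path, newline="") as handle:
            # DictReader yields one row at a time, so a whole file is never held in memory.
            for row in csv.DictReader(handle, delimiter="\t"):
                if predicate(row):
                    entry = totals[row[key_col]]
                    entry[0] += float(row[value_col])
                    entry[1] += 1
    # Report (sum, mean) for every key that had at least one matching row.
    return {key: (total, total / count) for key, (total, count) in totals.items()}
```

Because only the per-key totals are retained, memory use depends on the number of distinct keys, not on the number of files or rows.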

What changes when your input is giga/terabyte sized?

I took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny. I write Python, so I've spent the last few hours reading about HDF5, NumPy, and PyTables, but I sti...

How to best transfer large payloads of data using wsHttp with WCF with message security

I have a case where I need to transfer large serialized object graphs (via NetDataContractSerializer) using WCF over wsHttp. I'm using message security and would like to continue to do so. With this setup I would like to transfer serialized object graphs that can sometimes approach 300MB or so, but when I try to do so ...

Export large amount of data from Oracle 10G to SQL Server 2005

Dear all, I need to export 100 million data rows (avg row length ~ 100 bytes) from Oracle 10G database table into SQL server (over WAN/VLAN with 6MBits/sec capacity) on a regular basis. So far, these are the options that I have tried and a quick summary. Has anyone tried this before? Are there other better options? Which option would be...

What's the best way to transfer a large dataset over an ASMX web service?

I've inherited a C# .NET application which talks to a web service, and the web service talks to an Oracle database. I need to add an export function to the UI, to produce an Excel spreadsheet of some of the data. I have created a web service function to run a database query, load the data into a DataTable and then return it, which work...

Need some help calculating percentile

We are given an RPC server that receives millions of requests a day. Each request i takes processing time Ti. We want to find the 65th percentile processing time (when processing times are sorted in increasing order) at any moment. We cannot store processing times of all the requests of the past as...
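One way to estimate a percentile without storing every observation is a fixed-width histogram. The sketch below is only an illustration under stated assumptions: processing times are non-negative, a rough upper bound `max_value` is known in advance, and an error of up to one bucket width is acceptable. The class name `HistogramPercentile` and its parameters are hypothetical.

```python
class HistogramPercentile:
    """Approximate percentile tracker using fixed-width buckets (bounded memory)."""

    def __init__(self, max_value, bucket_width):
        self.bucket_width = bucket_width
        # One extra bucket collects every value at or above max_value.
        self.counts = [0] * (int(max_value / bucket_width) + 1)
        self.total = 0

    def add(self, value):
        # Assumes value >= 0; larger values are clamped into the last bucket.
        index = min(int(value / self.bucket_width), len(self.counts) - 1)
        self.counts[index] += 1
        self.total += 1

    def percentile(self, p):
        """Return the upper edge of the bucket that contains the p-th percentile."""
        if self.total == 0:
            raise ValueError("no observations yet")
        target = p * self.total / 100
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                return (index + 1) * self.bucket_width
        return len(self.counts) * self.bucket_width
```

Memory use is fixed by the number of buckets, so the trade-off is between bucket width (accuracy) and the size of the `counts` list.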

What technology for large scale scraping/parsing?

We're designing a large-scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database. What language would you recommend for doing this on a large scale (tens of millions of pages)? We're using MongoDB for the database, so anythi...

Creating a large sitemap on Google App Engine?

I have a site with around 100,000 unique pages. (1) How do I create a Sitemap for all these links? Should I just list them flat in a large sitemap protocol compatible file? (2) Need to implement this on Google App Engine where there is a 1000 item query limit, and all my individual site URLs are stored as separate entries. How do I so...

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs to be stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding the coun...

Looking for an easy to use embedded key-value store for C++

I need to write a C++ application that reads and writes large amounts of data (more than the available RAM) but always in a sequential way. In order to keep the data in a future-proof and easy-to-document way I use Protocol Buffers. Protocol Buffers, however, does not handle large amounts of data. My previous solution consisted of creating...

Common Lisp: What is the downside to using this filter function on very large lists?

I want to filter out all elements of list 'a from list 'b and return the filtered 'b. This is my function: (defun filter (a b) "Filters out all items in a from b" (if (= 0 (length a)) b (filter (remove (first a) a) (remove (first a) b)))) I'm new to lisp and don't know how 'remove does its thing, what kind of time will thi...
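On the cost question: `remove` scans its whole list argument, so the recursive version above does a linear pass over both lists for each distinct element of a. For comparison only, here is a minimal Python sketch of the same filtering idea using a hash set, which does one pass over each list. The name `filter_out` is hypothetical, and the sketch assumes the items are hashable; note that Python's equality test is not identical to the `eql` test that `remove` uses by default.

```python
def filter_out(a, b):
    """Return the items of b that do not appear in a, preserving b's order."""
    excluded = set(a)  # assumes the items are hashable
    return [item for item in b if item not in excluded]
```

For example, `filter_out([1, 2], [3, 1, 4, 2, 1, 5])` keeps only the elements of the second list that are absent from the first.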

Using MySQL to search through large data sets?

I'm an advanced PHP developer and well versed in small-scale MySQL data sets. However, I'm now building a large infrastructure for a startup I've recently joined, and their servers push around 1 million rows of data every day using their massive server power and previous architecture. I need to know what is the best way to sea...

Alternatives to huge drop down lists (24,000+ items)

In my admin section, when I edit items, I have to attach each item to a parent item. I have a list of over 24,000 parent items, which are listed alphabetically in a drop down list (a list of music artists). The edit page that lists all these items in a drop down menu is 2MB, and it lags like crazy for people with old machines, especiall...

Displaying a large collection in a DataGrid

Number of items in the collection: ~100k. Number of fields displayed in columns: 4-10. The problem itself: the collection is taken from a database using EntityFramework. It takes about 10-12s on dev computers to load and materialize all the required data. Yet another thing that comes up is that the same collection can be bound to seve...

psycopg2 COPY using cursor.copy_from() freezes with large inputs

Hi, Consider the following code in Python, using psycopg2 cursor object (Some column names were changed or omitted for clarity): filename='data.csv' file_columns=('id', 'node_id', 'segment_id', 'elevated', 'approximation', 'the_geom', 'azimuth') self._cur.copy_from(file=open(filename), table=self.new_...

PHP cURL 'Fatal error: Allowed memory size' for large data sets

I know about the option to set the memory limit with ini_set("memory_limit","30M"); but I wanted to know if there is a better approach for querying data. I have a WHILE LOOP that checks to see if I need to query for another 1000 records. Using the offset as the starting record number and the limit as the number of returned records, I search for...

Fetch only N rows at a time (MySQL)

I'm looking for a way to fetch all data from a huge table in smaller chunks. Please advise. ...
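A common pattern for this is to execute the query once and pull rows in batches with the DB-API `fetchmany` call. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for a MySQL driver, and the helper name `iter_chunks` is hypothetical. With a real MySQL driver, the default cursor may still buffer the full result set on the client, so an unbuffered or server-side cursor may be needed to actually limit memory use.

```python
import sqlite3


def iter_chunks(connection, query, chunk_size=1000):
    """Yield lists of at most chunk_size rows until the result set is exhausted."""
    cursor = connection.cursor()
    cursor.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Usage with an in-memory SQLite table standing in for the huge table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE t (id INTEGER)")
connection.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(2500)])
chunk_sizes = [len(chunk) for chunk in iter_chunks(connection, "SELECT id FROM t ORDER BY id")]
```

Each iteration holds only one chunk in the caller's hands, so per-chunk processing can run without collecting the whole table in a list.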

How to select/query effectively 70,000 json data?

I'm developing a JavaScript library that contains more than 70,000 villages in Indonesia (accessible at ...), and I built a data browser widget. Everything is fine in Safari and Firefox. But when using Chrome, it always takes a long time when I select a district (which automatically loads villages). The code to retr...

How do I allow the user to easily choose how much memory to allocate in a Java Swing app?

We have a Swing app that processes relatively large amounts of data. For instance, we currently process csv files with millions of rows of data. For reasons of performance and simplicity we just keep all of the data in memory. However, different users will have different amounts of data they need to process as well as different amou...

jQuery grid recommendations for large data sets?

I was looking around for jQuery grid recommendations and came across this question/answers: In looking through the many jQuery grid solutions out there, it seems they all want to have the entire data set on the client. If I have a large data set (thousands/millions o...