Hi, I have a large amount of data (a few terabytes) and it is accumulating... It is contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging, plus additional transformations) over observations/rows based on a series of predicate statements, and then saving th...
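One common pattern for this kind of workload is to stream each file in chunks and keep running aggregates, so memory use stays flat regardless of total volume. A minimal sketch, assuming pandas and hypothetical column names ('quality' as the predicate column, 'value' as the aggregated one):

    import glob
    import pandas as pd

    totals = {"sum": 0.0, "count": 0}

    # Stream each tab-delimited file in chunks instead of loading
    # terabytes into memory at once.
    for path in glob.glob("data/*.txt"):
        for chunk in pd.read_csv(path, sep="\t", chunksize=100_000):
            # Hypothetical predicate: keep rows where 'quality' exceeds a cutoff.
            selected = chunk[chunk["quality"] > 0.9]
            totals["sum"] += selected["value"].sum()
            totals["count"] += len(selected)

    mean = totals["sum"] / totals["count"] if totals["count"] else float("nan")
    print("filtered mean:", mean)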
I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, NumPy, and PyTables, but I sti...
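For data shaped like this, a chunked, compressed on-disk array is the usual starting point, so slices can be read without pulling the whole matrix into RAM. A minimal PyTables sketch (assuming PyTables 3.x; the file name, dtype and compression settings are placeholders):

    import numpy as np
    import tables

    # Create a compressed, chunked on-disk array: 1600 rows x 48000 fields.
    with tables.open_file("haplotypes.h5", mode="w") as h5:
        atom = tables.Int8Atom()
        filters = tables.Filters(complevel=5, complib="zlib")
        arr = h5.create_carray(h5.root, "haplotypes", atom,
                               (1600, 48000), filters=filters)
        arr[0, :] = np.zeros(48000, dtype=np.int8)  # write one row

    # Later, read just a slice without loading the whole matrix.
    with tables.open_file("haplotypes.h5", mode="r") as h5:
        block = h5.root.haplotypes[:100, :1000]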
I have a case where I need to transfer large amounts of serialized object graphs (via NetDataContractSerializer) over WCF using wsHttp. I'm using message security and would like to continue to do so. With this setup I would like to transfer a serialized object graph which can sometimes approach around 300MB or so, but when I try to do so ...
Dear all,
I need to export 100 million data rows (avg row length ~100 bytes) from an Oracle 10g database table into SQL Server (over a WAN/VLAN with 6 Mbit/s capacity) on a regular basis. So far, these are the options that I have tried, with a quick summary of each. Has anyone tried this before? Are there other, better options? Which option would be...
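Whatever transport ends up winning, the Oracle side is usually read in large batches rather than row by row. A minimal sketch of a chunked extract to CSV, assuming cx_Oracle and placeholder connection details and table name (the resulting file can then be bulk-loaded into SQL Server with bcp or BULK INSERT):

    import csv
    import cx_Oracle  # assumes the Oracle client libraries are installed

    BATCH = 10_000

    conn = cx_Oracle.connect("user/password@oracle-host/service")
    cur = conn.cursor()
    cur.arraysize = BATCH  # fetch rows from the server in large batches
    cur.execute("SELECT * FROM big_table")  # placeholder table name

    with open("export.csv", "w", newline="") as f:
        writer = csv.writer(f)
        while True:
            rows = cur.fetchmany(BATCH)
            if not rows:
                break
            writer.writerows(rows)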
I've inherited a C# .NET application which talks to a web service, and the web service talks to an Oracle database. I need to add an export function to the UI, to produce an Excel spreadsheet of some of the data.
I have created a web service function to run a database query, load the data into a DataTable and then return it, which work...
An RPC server is given which receives millions of requests a day. Each request i takes processing time Ti to get processed. We want to find the 65th percentile processing time (when processing times are sorted according to their values in increasing order) at any moment. We cannot store the processing times of all the requests of the past as...
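Since storing every observation is ruled out, one standard compromise is a fixed-bucket histogram over the plausible range of processing times: each request increments one bucket counter, and any percentile is read off the cumulative counts. A sketch (bucket boundaries are an assumption; choose them to cover your realistic range, and the answer is approximate to within one bucket):

    import bisect

    class PercentileTracker:
        """Approximate running percentiles from fixed histogram buckets."""

        def __init__(self, boundaries):
            self.boundaries = boundaries           # sorted upper bucket edges
            self.counts = [0] * (len(boundaries) + 1)
            self.total = 0

        def add(self, t):
            # O(log buckets) per request, O(1) memory overall.
            self.counts[bisect.bisect_left(self.boundaries, t)] += 1
            self.total += 1

        def percentile(self, p):
            target = p / 100.0 * self.total
            running = 0
            for i, count in enumerate(self.counts):
                running += count
                if running >= target:
                    # Report the bucket's upper edge as the estimate.
                    return self.boundaries[min(i, len(self.boundaries) - 1)]
            return self.boundaries[-1]

    tracker = PercentileTracker([1, 2, 5, 10, 20, 50, 100, 200, 500, 1000])
    for t in (3, 7, 12, 4, 90, 15):   # processing times, e.g. in ms
        tracker.add(t)
    print(tracker.percentile(65))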
We're designing a large scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database.
What language would you recommend for doing this on a large scale (tens of millions of pages)?
We're using MongoDB for the database, so anythi...
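Whatever the language, the per-page work is small; a minimal single-threaded sketch in Python (requests, BeautifulSoup and pymongo, with placeholder tag, database and file names) shows the core loop, which at tens of millions of pages you would fan out across worker processes or an async event loop, with retries and rate limiting:

    import requests
    from bs4 import BeautifulSoup
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    pages = client.scraper.pages  # placeholder database/collection names

    def scrape(url):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        tag = soup.find("title")  # substitute the tag you actually need
        if tag is not None:
            pages.insert_one({"url": url, "content": tag.get_text(strip=True)})

    with open("urls.txt") as f:  # placeholder URL list
        for line in f:
            scrape(line.strip())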
I have a site with around 100,000 unique pages.
(1) How do I create a Sitemap for all these links? Should I just list them flat in a large sitemap protocol compatible file?
(2) I need to implement this on Google App Engine, where there is a 1000-item query limit and all my individual site URLs are stored as separate entries. How do I so...
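On (1): the sitemap protocol caps each file at 50,000 URLs, so 100,000 pages means splitting into multiple sitemap files tied together by a sitemap index. A minimal generator sketch (file names and host are placeholders); for (2), the urls iterable would be fed by fetching datastore entities in key-ordered batches or with query cursors to stay under the per-query limit:

    SITEMAP_URL_LIMIT = 50000  # per-file cap from the sitemap protocol

    def _write_file(name, urls):
        with open(name, "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for u in urls:
                f.write("  <url><loc>%s</loc></url>\n" % u)
            f.write("</urlset>\n")
        return name

    def write_sitemaps(urls, prefix="sitemap"):
        """Split a stream of URLs into protocol-sized files plus an index."""
        files, batch, n = [], [], 0
        for url in urls:
            batch.append(url)
            if len(batch) == SITEMAP_URL_LIMIT:
                files.append(_write_file("%s-%d.xml" % (prefix, n), batch))
                batch, n = [], n + 1
        if batch:
            files.append(_write_file("%s-%d.xml" % (prefix, n), batch))
        # Sitemap index referencing each generated file.
        with open(prefix + "-index.xml", "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for name in files:
                f.write("  <sitemap><loc>http://example.com/%s</loc></sitemap>\n" % name)
            f.write("</sitemapindex>\n")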
I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding the coun...
I need to write a C++ application that reads and writes large amounts of data (more than the available RAM) but always in a sequential way.
In order to keep the data future-proof and easy to document, I use Protocol Buffers. Protocol Buffers, however, does not handle large amounts of data.
My previous solution consisted of creating...
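Protocol Buffers has no built-in framing for a stream of messages, so the usual workaround is length-delimited records: write each serialized message prefixed with its size, then read them back one at a time with constant memory. A sketch in Python (Record stands in for a class generated from your .proto; the same framing idea carries over to C++ with protobuf's coded streams):

    import struct

    def write_messages(path, messages):
        """Write protobuf messages, each prefixed with a 4-byte little-endian length."""
        with open(path, "wb") as f:
            for msg in messages:
                payload = msg.SerializeToString()
                f.write(struct.pack("<I", len(payload)))
                f.write(payload)

    def read_messages(path, message_cls):
        """Stream messages back sequentially with O(1) memory."""
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if not header:
                    break
                (size,) = struct.unpack("<I", header)
                msg = message_cls()          # e.g. a generated Record class
                msg.ParseFromString(f.read(size))
                yield msg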
I want to filter out all elements of list 'a from list 'b and return the filtered 'b. This is my function:
(defun filter (a b)
  "Filters out all items in a from b"
  (if (= 0 (length a))
      b
      (filter (remove (first a) a) (remove (first a) b))))
I'm new to lisp and don't know how 'remove does its thing, what kind of time will thi...
Now, I'm a fairly advanced PHP developer and quite knowledgeable about small-scale MySQL setups; however, I'm now building a large infrastructure for a startup I've recently joined, and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know what is the best way to sea...
In my admin section, when I edit items, I have to attach each item to a parent item. I have a list of over 24,000 parent items, which are listed alphabetically in a drop down list (a list of music artists).
The edit page that lists all these items in a drop down menu is 2MB, and it lags like crazy for people with old machines, especiall...
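The usual fix is to drop the 24,000-option select element entirely and use an autocomplete input that asks the server for matches as the user types, so the page only ever carries a handful of options. A minimal server-side sketch (Flask and SQLite as stand-ins for whatever stack actually serves the admin section; table and route names are placeholders):

    from flask import Flask, jsonify, request
    import sqlite3  # stand-in for the real database

    app = Flask(__name__)

    @app.route("/artists")
    def artists():
        # Return at most 20 artists whose names start with the typed prefix.
        term = request.args.get("q", "")
        conn = sqlite3.connect("site.db")
        rows = conn.execute(
            "SELECT id, name FROM artists"
            " WHERE name LIKE ? ORDER BY name LIMIT 20",
            (term + "%",),
        ).fetchall()
        conn.close()
        return jsonify([{"id": r[0], "name": r[1]} for r in rows])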
The number of items in the collection: ~100k
The number of fields displayed in columns: 4-10
The problem itself: the collection is taken from a database using Entity Framework. It takes about 10-12s on dev computers to load and materialize all the required data. Yet another thing that comes up is that the same collection can be bound to seve...
Hi,
Consider the following code in Python, using psycopg2 cursor object (Some column names were changed or omitted for clarity):
filename = 'data.csv'
file_columns = ('id', 'node_id', 'segment_id', 'elevated',
                'approximation', 'the_geom', 'azimuth')
self._cur.copy_from(file=open(filename),
                    table=self.new_...
I know about the option to set the internal memory limit:
ini_set("memory_limit","30M");
But I wanted to know if there is a better approach for querying data?
I have a WHILE LOOP that checks to see if I need to query for another 1000 records.
Using the offset as the starting record number and the limit as the number of records returned, I search for...
I'm looking for a way to fetch all data from a huge table in smaller chunks.
Please advise.
...
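One reason OFFSET-based loops degrade on huge tables is that the server still has to walk past all the skipped rows on every query. Keyset (seek) pagination avoids that by remembering the last primary key seen. A sketch using the Python DB-API (MySQLdb-style %s placeholders; table and column names are placeholders):

    def fetch_in_chunks(conn, chunk_size=1000):
        """Yield all rows of a large table via keyset pagination on the primary key."""
        last_id = 0
        cur = conn.cursor()
        while True:
            cur.execute(
                "SELECT id, payload FROM big_table"
                " WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, chunk_size),
            )
            rows = cur.fetchall()
            if not rows:
                break
            for row in rows:
                yield row
            last_id = rows[-1][0]  # resume after the last key seen

The same idea translates directly to PHP: keep the last id in a variable and use WHERE id > ? instead of OFFSET.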
I'm now developing a JavaScript library which covers more than 70,000 villages in Indonesia (accessible at http://bisbak.com/regina/), and I have built a data browser widget. Everything is fine in Safari and Firefox, but when using Chrome it always takes a long time when I select a district (which automatically loads the villages). The code to retr...
We have a Swing app that processes relatively large amounts of data. For instance, we currently process CSV files with millions of rows of data. For reasons of performance and simplicity we just keep all of the data in memory. However, different users will have different amounts of data they need to process, as well as different amou...
I was looking around for jQuery grid recommendations and came across this question/answers:
http://stackoverflow.com/questions/159025/jquery-grid-recommendations
In looking through the many jQuery grid solutions out there, it seems they all want to have the entire data set on the client. If I have a large data set (thousands/millions o...
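Grids that support server-side paging only need an endpoint that returns one page of rows plus a total count; the client never holds more than a page. A minimal sketch of such an endpoint (Flask and SQLite as stand-ins; parameter names vary by grid plugin):

    from flask import Flask, jsonify, request
    import sqlite3  # stand-in for the real data store

    app = Flask(__name__)

    @app.route("/grid-data")
    def grid_data():
        page = int(request.args.get("page", 1))
        size = int(request.args.get("rows", 50))
        conn = sqlite3.connect("app.db")
        total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
        rows = conn.execute(
            "SELECT id, name, created FROM records"
            " ORDER BY id LIMIT ? OFFSET ?",
            (size, (page - 1) * size),
        ).fetchall()
        conn.close()
        # The grid reads the total to render its pager, plus the page of rows.
        return jsonify({
            "total": total,
            "page": page,
            "rows": [{"id": r[0], "name": r[1], "created": r[2]} for r in rows],
        })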