large-files

Need advice on efficiency: scanning 2 very large files' worth of information

Hi, I have a relatively strange question. I have a file that is 6 gigabytes long. What I need to do is scan the entire file, line by line, and determine all rows that match an id number of any other row in the file. Essentially, it's like analyzing a web log file where there are many session ids that are organized by the time of each...
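
One way to do this without holding the whole 6 GB in memory is a two-pass scan: first count how often each id appears, then re-read the file and emit only the rows whose id occurred more than once. A minimal Python sketch, assuming each row is a line whose id is the first tab-separated field (adjust the split to match the real log format):

    from collections import Counter

    def rows_with_shared_ids(path, id_column=0, sep="\t"):
        # Pass 1: count occurrences of each id (only the ids are kept in memory).
        counts = Counter()
        with open(path, "r", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split(sep)
                if len(fields) > id_column:
                    counts[fields[id_column]] += 1
        # Pass 2: yield every row whose id appears in at least one other row.
        with open(path, "r", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split(sep)
                if len(fields) > id_column and counts[fields[id_column]] > 1:
                    yield line

    # for row in rows_with_shared_ids("access.log"):
    #     print(row, end="")

If even the set of distinct ids is too large for memory, the same idea still works with an external sort: sort the file by id and stream through adjacent groups.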

How to make Java read very big files using Scanner?

I'm using the following basic function, which I copied from the net, to read a text file:

    public void read() {
        File file = new File("/Users/MAK/Desktop/data.txt");
        System.out.println("Start");
        try {
            //
            // Create a new Scanner object which will read the data from the
            // file passed in. To check i...

How to read a large XML file without loading it into memory, using XElement

I want to read a large XML file (100+ MB). Due to its size, I do not want to load it into memory using XElement. I am using LINQ to XML queries to parse and read it. What's the best way to do it? Any example of combining XPath or XmlReader with LINQ to XML/XElement? Please help. Thanks. ...
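
The usual streaming pattern, whatever the language, is to advance through the document record by record, materialize only the current element, and discard it once processed (in .NET this is commonly done by combining XmlReader with XNode.ReadFrom to build one XElement at a time). A rough Python sketch of the same pattern with xml.etree.ElementTree.iterparse, assuming the records of interest are <item> elements (a hypothetical tag name):

    import xml.etree.ElementTree as ET

    def stream_items(path, tag="item"):
        context = ET.iterparse(path, events=("start", "end"))
        _, root = next(context)            # first event is the root's start tag
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                yield {child.tag: child.text for child in elem}
                root.clear()               # drop finished children from the root

    # for record in stream_items("big.xml"):
    #     ...process one record at a time, never the whole tree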

hosting accounts for large uploads

Hi, I'm wondering if anyone knows of any hosting providers (UK preferably) that deal mostly with accepting large file uploads. Most hosts only let you push something like 1.5 MB (that's taking into account the connection and the max execution time). What I am looking for is a host specifically for storing files on. I was going to create a...

How to programmatically download a large file in C#

I need to programmatically download a large file before processing it. What's the best way to do that? As the file is large, I want to specify a time to wait so that I can forcefully exit. I know of WebClient.DownloadFile(). But there does not seem to be a way to specify an amount of time to wait so as to forcefully exit. try { WebClie...
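
The general shape of the solution is to stream the body in chunks and abort once a wall-clock budget is exceeded. Here is that idea sketched in Python with the third-party requests library, purely to illustrate the pattern (in C# the analogous route is HttpClient with a CancellationTokenSource deadline, or WebClient.DownloadFileAsync cancelled from a timer):

    import time
    import requests

    def download_with_deadline(url, dest, max_seconds=300, chunk_size=1 << 20):
        start = time.monotonic()
        with requests.get(url, stream=True, timeout=30) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as out:
                for chunk in resp.iter_content(chunk_size=chunk_size):
                    if time.monotonic() - start > max_seconds:
                        raise TimeoutError(f"gave up after {max_seconds}s")
                    out.write(chunk)

    # download_with_deadline("https://example.com/big.bin", "big.bin", max_seconds=600)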

How to quickly search through a .csv file in Python

I'm reading a 6-million-entry .csv file with Python, and I want to be able to search through this file for a particular entry. Are there any tricks to search the entire file? Should you read the whole thing into a dictionary, or should you perform a search every time? I tried loading it into a dictionary, but that took ages, so I'm current...
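
If the lookups repeat, it usually pays to build an index once and reuse it. A lightweight sketch that maps each key to the byte offset of its row, so only the matching line ever has to be re-read (it assumes the key is the first column and does a naive comma split, so quoted fields containing commas would need the csv module instead):

    def build_offset_index(path, key_column=0, delimiter=b","):
        """One pass over the file: key -> byte offset of the first row with that key."""
        index = {}
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                key = line.split(delimiter)[key_column]
                index.setdefault(key, offset)
                offset += len(line)
        return index

    def lookup(path, index, key):
        pos = index.get(key.encode())
        if pos is None:
            return None
        with open(path, "rb") as f:
            f.seek(pos)
            return f.readline().decode()

    # idx = build_offset_index("data.csv")
    # print(lookup("data.csv", idx, "some_id"))

If even the keys don't fit comfortably in memory, loading the file once into an sqlite3 table with an index on the key column is the usual next step.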

[PHP] Using cURL to download large XML files

I'm working with PHP and need to parse a number of fairly large XML files (50-75MB uncompressed). The issue, however, is that these XML files are stored remotely and will need to be downloaded before I can parse them. Having thought about the issue, I think using a system() call in PHP in order to initiate a cURL transfer is probably th...
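
For what it's worth, the shell-out idea is only a few lines; curl streams straight to disk, so a 50-75 MB payload never has to sit in the script's memory. Sketched in Python below (the PHP version with system()/exec(), or PHP's own cURL extension with CURLOPT_FILE pointing at a file handle, has the same shape):

    import subprocess

    def fetch_with_curl(url, dest, max_seconds=600):
        # -sS: quiet but still report errors; -L: follow redirects;
        # --max-time: hard cap on the whole transfer; -o: write straight to disk.
        subprocess.run(
            ["curl", "-sS", "-L", "--max-time", str(max_seconds), "-o", dest, url],
            check=True,
        )

    # fetch_with_curl("https://example.com/feed.xml", "/tmp/feed.xml")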

XML: Process large data

Hello! What XML parser do you recommend for the following purpose? The XML file (formatted, containing whitespace) is around 800 MB. It mostly contains three types of tags (let's call them n, w and r). They have an attribute called id which I'd have to search for, as fast as possible. Removing attributes I don't need could save around ...
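
At that size a streaming (SAX-style) parser is the usual answer: it visits each tag once and keeps only what you explicitly hold on to. A small Python sketch that collects the id attributes of the n, w and r tags, just to show how little state such a pass needs (the same handler structure exists in the SAX bindings of most languages):

    import xml.sax

    class IdCollector(xml.sax.ContentHandler):
        def __init__(self, wanted=("n", "w", "r")):
            super().__init__()
            self.wanted = set(wanted)
            self.ids = {}                      # id value -> tag name

        def startElement(self, name, attrs):
            if name in self.wanted:
                ident = attrs.get("id")
                if ident is not None:
                    self.ids[ident] = name

    handler = IdCollector()
    xml.sax.parse("big.xml", handler)
    print(len(handler.ids), "ids indexed")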

AS3 Working With Arbitrarily Large Files

I am trying to read a very large file in AS3 and am having problems with the runtime just crashing on me. I'm currently using a FileStream to open the file asynchronously. This does not work (it crashes without an exception) for files bigger than about 300 MB. _fileStream = new FileStream(); _fileStream.addEventListener(IOErrorEvent.IO_ERR...

Python: slicing a very large binary file

Say I have a binary file of 12 GB and I want to slice 8 GB out of the middle of it. I know the position indices I want to cut between. How do I do this? Obviously 12 GB won't fit into memory; that's fine, but 8 GB won't either... which I thought was fine, but binary data doesn't seem to like it if you do it in chunks! I was appending ...
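
The usual recipe is to seek to the start offset and copy fixed-size chunks until the end offset, never holding more than one chunk in memory; both files must be opened in binary mode ("rb"/"wb"), which is often the detail that bites when appending chunks. A minimal sketch:

    def slice_file(src, dst, start, end, chunk_size=64 * 1024 * 1024):
        """Copy bytes src[start:end] into dst without loading them all at once."""
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            fin.seek(start)
            remaining = end - start
            while remaining > 0:
                chunk = fin.read(min(chunk_size, remaining))
                if not chunk:              # hit EOF earlier than expected
                    break
                fout.write(chunk)
                remaining -= len(chunk)

    # slice_file("huge.bin", "middle.bin", start=2 * 1024**3, end=10 * 1024**3)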

Tools for viewing logs of unlimited size

It's no secret that application logs can go well beyond the limits of naive log viewers, and the desired viewer functionality (say, filtering the log based on a condition, or highlighting particular message types, or splitting it into sublogs based on a field value, or merging several logs based on a time axis, or bookmarking etc.) is be...

Is there a memory-efficient and fast way to load big JSON files in Python?

I have some JSON files of 500 MB. If I use the "trivial" json.load to load their content all at once, it consumes a lot of memory. Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogue to that. Any suggestions? Thanks ...
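
If the JSON's top level is one big array, the third-party ijson package can walk it incrementally, which is the closest analogue to iterating over lines; a sketch, assuming the records live in a top-level array:

    import ijson  # pip install ijson

    def iter_records(path):
        with open(path, "rb") as f:
            # The prefix "item" addresses each element of the top-level array.
            for record in ijson.items(f, "item"):
                yield record

    # for rec in iter_records("big.json"):
    #     handle(rec)

If you control the producer, switching to newline-delimited JSON (one object per line) removes the problem entirely.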

Editing a 9gb .sql file

Hi. I've got a "slightly" large SQL script saved as a text file. It weighs in at 8.92 GB, so it's a bit of a beast. I've got to do some search-and-replaces in this file (specifically, change all NOT NULL to NULL, so all fields are nullable) and then execute the darned thing. Does anyone have any suggestions for a text editor that would be ...
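
Rather than fight an editor, a one-pass streaming rewrite handles this size comfortably, since the substitution is local to each line; a sketch in Python (sed 's/NOT NULL/NULL/g' over the file does the same job):

    def rewrite(src, dst):
        with open(src, "r", encoding="utf-8", errors="replace") as fin, \
             open(dst, "w", encoding="utf-8") as fout:
            for line in fin:
                fout.write(line.replace("NOT NULL", "NULL"))

    # rewrite("dump.sql", "dump_nullable.sql")

The one wrinkle is that the literal string NOT NULL may also occur inside data values, so it is worth spot-checking the output before executing it.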

Upload 1GB files using chunking in PHP

I have a web application that accepts file uploads of up to 4 MB. The server-side script is PHP and the web server is NGINX. Many users have requested that this limit be increased drastically to allow uploads of video etc. However, there seems to be no easy solution for this problem with PHP. First, on the client side, I am looking for something tha...
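
On the wire, chunked upload comes down to: split the file into fixed-size pieces, POST each piece with enough metadata (file name, chunk index, total chunks) for the server to reassemble it, and keep each request well under the PHP/NGINX body limits. A client-side sketch in Python against a hypothetical /upload-chunk endpoint, only to show the protocol shape (in a browser the slicing is normally done with File.slice() in JavaScript, and the PHP side appends each piece to a temporary file):

    import os
    import requests

    def upload_in_chunks(path, url, chunk_size=8 * 1024 * 1024):
        total = (os.path.getsize(path) + chunk_size - 1) // chunk_size
        with open(path, "rb") as f:
            for index in range(total):
                resp = requests.post(
                    url,
                    data={"name": os.path.basename(path),
                          "index": index, "total": total},
                    files={"chunk": f.read(chunk_size)},
                )
                resp.raise_for_status()

    # upload_in_chunks("video.mp4", "https://example.com/upload-chunk")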

Compressing large text data before storing it in the db?

Hello, I have an application which retrieves many large log files from a system LAN. Currently I put all the log files in PostgreSQL; the table has a column of type TEXT, and I don't plan any search on this text column, because I use another external process which nightly retrieves all files and scans them for sensitive patterns. So the column value c...
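
Since the column is never searched, compressing before the INSERT is straightforward, and log text typically compresses very well. A sketch with Python's zlib, assuming the column is switched to a binary type (BYTEA in PostgreSQL) so the compressed bytes can be stored as-is; note that PostgreSQL's TOAST mechanism already compresses large TEXT values transparently, so it is worth measuring how much the explicit step actually buys:

    import zlib

    def pack(log_text: str) -> bytes:
        # Level 6 is zlib's default trade-off; plain log files often shrink 5-10x.
        return zlib.compress(log_text.encode("utf-8"), 6)

    def unpack(blob: bytes) -> str:
        return zlib.decompress(blob).decode("utf-8")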

Compositing a large image with ImageMagick

I'm trying to create a very large image (86400 x 43200) using several tiles that each make up a portion of this final image with ImageMagick (using the .NET bindings). The problem arises when I attempt to create my output image at the given size; ImageMagick just hangs on the Resize() call. When I say 'hangs' I mean the program become...

Random access gzip stream

I'd like to be able to do random access into a gzipped file. I can afford to do some preprocessing on it (say, build some kind of index), provided that the result of the preprocessing is much smaller than the file itself. Any advice? My thoughts were: Hack on an existing gzip implementation and serialize its decompressor state every,...
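
Besides checkpointing the decompressor state (which is essentially what tools built on zlib's zran.c example do), a simpler preprocessing option, if recompressing is acceptable, is to re-chunk the data into independently compressed blocks plus a small offset index; a random read then costs at most one block's worth of decompression. A Python sketch of that idea:

    import json
    import zlib

    BLOCK = 4 * 1024 * 1024        # uncompressed bytes per block

    def build(src, dst, index_path):
        """Recompress src into independently deflated blocks and save the offsets."""
        offsets = []               # offsets[i] = compressed offset of block i
        pos = 0
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            while True:
                raw = fin.read(BLOCK)
                if not raw:
                    break
                offsets.append(pos)
                blob = zlib.compress(raw)
                fout.write(blob)
                pos += len(blob)
        with open(index_path, "w") as f:
            json.dump(offsets, f)

    def read_range(dst, index_path, start, length):
        """Return `length` uncompressed bytes starting at uncompressed offset `start`."""
        with open(index_path) as f:
            offsets = json.load(f)
        out = bytearray()
        with open(dst, "rb") as f:
            i = start // BLOCK     # block i holds uncompressed bytes [i*BLOCK, (i+1)*BLOCK)
            while length > 0 and i < len(offsets):
                f.seek(offsets[i])
                end = offsets[i + 1] if i + 1 < len(offsets) else None
                blob = f.read() if end is None else f.read(end - offsets[i])
                raw = zlib.decompress(blob)
                skip = start - i * BLOCK
                piece = raw[skip:skip + length]
                out += piece
                start += len(piece)
                length -= len(piece)
                i += 1
        return bytes(out)

The index here is a few bytes per 4 MB block, so it stays far smaller than the data itself.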

dictionary interface for large data sets

I have a set of key/value pairs (all text) that is too large to load into memory at once. I would like to interact with this data via a Python dictionary-like interface. Does such a module already exist? Reading key values should be efficient, and values should be compressed on disk to save space. Edit: Ideally cross-platform, but only using Linux ...
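
The standard library's shelve/dbm modules give the dictionary-like interface but not the compression; a thin wrapper around sqlite3 covers both and stays cross-platform. A minimal sketch along those lines:

    import sqlite3
    import zlib

    class DiskDict:
        """Minimal dict-like store: text keys, zlib-compressed text values."""

        def __init__(self, path):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

        def __setitem__(self, key, value):
            blob = zlib.compress(value.encode("utf-8"))
            self.db.execute("REPLACE INTO kv (k, v) VALUES (?, ?)", (key, blob))
            self.db.commit()

        def __getitem__(self, key):
            row = self.db.execute(
                "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            return zlib.decompress(row[0]).decode("utf-8")

        def __contains__(self, key):
            return self.db.execute(
                "SELECT 1 FROM kv WHERE k = ?", (key,)).fetchone() is not None

    # d = DiskDict("store.sqlite"); d["alpha"] = "some long text"; print(d["alpha"])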

Is there a fast way to jump to an element using XMLReader?

I am using XMLReader to read a large XML file with about 1 million elements on the level I am reading from. However, I've calculated that it will take over 10 seconds to jump to, for instance, element 500,000 using XMLReader::next([ string $localname ]) or XMLReader::read(void). This is not very usable. Is there a faster way to...
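
XMLReader itself can only walk forward, so reaching element N means parsing the N-1 elements before it. One workaround is a one-time indexing pass that records the byte offset at which each element starts; after that, a lookup seeks straight to the offset and parses only that fragment. The sketch below shows the idea in Python against a hypothetical <item> tag (in PHP the same index can be built with fgets()/ftell() and consumed with fseek() plus XMLReader::XML() on the extracted fragment); it assumes the elements are not nested and that a start tag is never split across a line boundary:

    import re

    def build_offsets(path, tag=b"item"):
        """One pass: byte offset of every <item ...> start tag."""
        pattern = re.compile(b"<" + tag + b"[ >]")
        offsets, pos = [], 0
        with open(path, "rb") as f:
            for line in f:
                for m in pattern.finditer(line):
                    offsets.append(pos + m.start())
                pos += len(line)
        return offsets

    def read_element(path, offsets, n, tag=b"item"):
        """Return the raw bytes of the n-th element (0-based)."""
        end_tag = b"</" + tag + b">"
        with open(path, "rb") as f:
            f.seek(offsets[n])
            buf = b""
            while end_tag not in buf:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                buf += chunk
            idx = buf.find(end_tag)
            return buf if idx == -1 else buf[:idx + len(end_tag)]

    # offsets = build_offsets("big.xml")
    # print(read_element("big.xml", offsets, 500000))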

Given a trace of packets, how would you group them into flows?

I've tried these ways so far: 1) Make a hash with the source IP/port and destination IP/port as the key. Each position in the hash is a list of packets. The hash is then saved in a file, with each flow separated by some special characters/line. Problem: not enough memory for large traces. 2) Make a hash with the same key as above, but ...
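
A third option that sidesteps the memory limit entirely is to treat this as an external grouping problem: stream the trace once, write each packet as a line prefixed by its flow key (the 4- or 5-tuple), sort that file on disk, and then read consecutive runs of equal keys back as flows. A sketch of the post-sort pass in Python (the sort itself can be the operating system's sort(1), which spills to disk on its own); it assumes an earlier pass has already produced tab-separated lines of the form flow_key, then packet record:

    import itertools

    def flows(sorted_path):
        """Yield (flow_key, [packet_records]) from a file already sorted by flow key."""
        with open(sorted_path, "r") as f:
            keyed = ((line.split("\t", 1)[0], line.rstrip("\n")) for line in f)
            for key, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
                yield key, [record for _, record in group]

    # for flow_key, packets in flows("trace.sorted"):
    #     process(flow_key, packets)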