Hello

In my current company, I am doing a PoC on how to write a file downloader utility. We have to use socket programming (TCP/IP) to download the files. One of the client's requirements is that a large file should be transferred in chunks: for example, if we have a 5 MB file, then we can have 5 threads which transfer 1 MB each. I have written a small application which downloads a file. You can download the Eclipse project

from http://www.fileflyer.com/view/QM1JSC0

A brief explanation of my classes

  • FileSender.java : This class serves the bytes of the file. It has a method sendBytesOfFile(long start, long end, long sequenceNo) which returns the requested range of bytes.

    package com.filedownloader;

    import java.io.File;
    import java.io.IOException;

    import org.apache.commons.io.FileUtils;

    public class FileSender {

        private static final String FILE_NAME = "C:\\shared\\test.pdf";

        public ByteArrayWrapper sendBytesOfFile(long start, long end, long sequenceNo) {
            try {
                File file = new File(FILE_NAME);
                byte[] fileBytes = FileUtils.readFileToByteArray(file);
                System.out.println("Size of file is " + fileBytes.length);
                System.out.println("Start " + start + " end " + end);
                byte[] bytes = getByteArray(fileBytes, start, end);
                return new ByteArrayWrapper(bytes, sequenceNo);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        private byte[] getByteArray(byte[] bytes, long start, long end) {
            long arrayLength = end - start;
            byte[] arr = new byte[(int) arrayLength];
            for (int i = (int) start, j = 0; i < end; i++, j++) {
                arr[j] = bytes[i];
            }
            return arr;
        }

        public static long fileSize() {
            File file = new File(FILE_NAME);
            return file.length();
        }
    }

  • FileReceiver.java - This class receives the file.

What this class does:

  1. It asks the sender for the size of the file to be fetched.
  2. Based on that size it computes the start and end positions that each thread should read.
  3. It starts n threads, giving each one a start, an end, a sequence number, and a list that all the threads share.
  4. Each thread reads its range of bytes and creates a ByteArrayWrapper.
  5. The ByteArrayWrapper objects are added to the shared list.
  6. A while loop then waits until all the threads have finished their work.
  7. Finally, the list is sorted by sequence number.
  8. The byte arrays are joined into one complete array, which is written out as a file.
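
Side note: a more robust way than the while loop in step 6 to wait for the workers would be a CountDownLatch. A minimal, hypothetical sketch (not what my code above does), which also wraps the shared list with Collections.synchronizedList because a plain ArrayList is not safe to mutate from several threads:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LatchDemo {

    // runs n workers and blocks until all of them have finished,
    // instead of spinning on list.size()
    public static int runWorkers(int n) throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(n);
        // synchronized wrapper: a plain ArrayList is not thread-safe
        final List<Long> results = Collections.synchronizedList(new ArrayList<Long>());
        for (long seq = 0; seq < n; seq++) {
            final long s = seq;
            new Thread(new Runnable() {
                public void run() {
                    results.add(s);   // stand-in for downloading one chunk
                    done.countDown(); // signal that this worker is finished
                }
            }).start();
        }
        done.await(); // blocks here until every worker has counted down
        return results.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWorkers(10)); // prints 10
    }
}
```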

Code of File Receiver

package com.filedownloader;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.io.FileUtils;


public class FileReceiver {

    public static void main(String[] args) {
        FileReceiver receiver = new FileReceiver();
        receiver.receiveFile();

    }
    public void receiveFile(){
        long startTime = System.currentTimeMillis();
        long numberOfThreads = 10;
        long filesize = FileSender.fileSize();

        System.out.println("File size received "+filesize);
        long start = filesize/numberOfThreads;
        List<ByteArrayWrapper> list = new ArrayList<ByteArrayWrapper>();

        for(long threadCount =0; threadCount<numberOfThreads ;threadCount++){
            FileDownloaderTask task = new FileDownloaderTask(threadCount*start,(threadCount+1)*start,threadCount,list);
            new Thread(task).start();
        }

        while(list.size() != numberOfThreads){
            // busy-wait until every thread has added its result; this burns CPU,
            // but it makes sure all the threads finish before we process further
        }

        if(list.size() == numberOfThreads){
            System.out.println("All bytes received "+list);
            Collections.sort(list, new Comparator<ByteArrayWrapper>() {
                @Override
                public int compare(ByteArrayWrapper o1, ByteArrayWrapper o2) {

                    long sequence1 = o1.getSequence();
                    long sequence2 = o2.getSequence();
                    if(sequence1 < sequence2){
                        return -1;
                    }else if(sequence1 > sequence2){
                        return 1;
                    }
                    else{
                        return 0;
                    }
                }
            });


            byte[] totalBytes = list.get(0).getBytes();
            byte[] firstArr = null;
            byte[] secondArr = null;
            for(int i = 1;i<list.size();i++){
                firstArr = totalBytes;
                secondArr = list.get(i).getBytes();
                totalBytes = concat(firstArr, secondArr);

            }

            System.out.println(totalBytes.length);
            convertToFile(totalBytes,"c:\\tmp\\test.pdf");

            long endTime = System.currentTimeMillis();
            System.out.println("Total time taken with "+numberOfThreads +" threads is "+(endTime-startTime)+" ms" );

        }
    }

    private byte[] concat(byte[] A, byte[] B) {
        byte[] C = new byte[A.length + B.length];
        System.arraycopy(A, 0, C, 0, A.length);
        System.arraycopy(B, 0, C, A.length, B.length);
        return C;
    }

    private void convertToFile(byte[] totalBytes, String name) {
        try {
            FileUtils.writeByteArrayToFile(new File(name), totalBytes);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

Code of ByteArrayWrapper

package com.filedownloader;

import java.io.Serializable;

public class ByteArrayWrapper implements Serializable{

private static final long serialVersionUID = 3499562855188457886L;

private byte[] bytes;
private long sequence;

public ByteArrayWrapper(byte[] bytes, long sequenceNo) {
    this.bytes = bytes;
    this.sequence = sequenceNo;
}

public byte[] getBytes() {
    return bytes;
}

public long getSequence() {
    return sequence;
}

}

Code of FileDownloaderTask

package com.filedownloader;

import java.util.List;

public class FileDownloaderTask implements Runnable {

private List<ByteArrayWrapper> list;
private long start;
private long end;
private long sequenceNo;

public FileDownloaderTask(long start,long end,long sequenceNo,List<ByteArrayWrapper> list) {
    this.list = list;
    this.start = start;
    this.end = end;
    this.sequenceNo = sequenceNo;

}

@Override
public void run() {
        ByteArrayWrapper wrapper = new FileSender().sendBytesOfFile(start, end, sequenceNo);
        list.add(wrapper);
}
}

Questions related to this code

  1. Does file downloading become faster when multiple threads are used? In this code I am not able to see the benefit.

  2. How should I decide how many threads to create?

  3. Are there any open-source libraries which do this?

  4. The file the receiver produces is valid and not corrupted, but its checksum (I used FileUtils from commons-io) does not match the original's. What is the problem?

  5. This code runs out of memory when used with a large file (above 100 MB), because of the big byte arrays that are created. How can I avoid that?

I know this is very bad code, but I had to write it in one day :-). Please suggest any better way to do this.

Thanks

Shekhar

A: 

1. Does file downloading become faster when multiple threads are used? In this code I am not able to see the benefit.

No. I would be very surprised if that were the case. The CPU would never have a problem keeping up with feeding the network buffer.

2. How should I decide how many threads to create?

In my opinion, 0 extra threads.

4. The file the receiver produces is valid and not corrupted, but its checksum does not match. What is the problem?

Make sure you don't accidentally rely on strings and specific encodings.

5. This code runs out of memory when used with a large file (above 100 MB), because of the byte arrays that are created. How can I avoid that?

The obvious solution is to read smaller chunks of the file. Have a look at the read method of DataInputStream:

http://java.sun.com/j2se/1.4.2/docs/api/java/io/DataInputStream.html#read%28byte[],%20int,%20int%29
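
For illustration, a minimal sketch of reading just one range of a file this way, so the whole file never sits in memory (the file contents and offsets here are placeholders):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class RangeReader {

    // reads 'length' bytes starting at 'start' without loading the whole file
    public static byte[] readRange(File f, long start, int length) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)));
        try {
            long toSkip = start;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) throw new EOFException("could not skip to offset");
                toSkip -= skipped;
            }
            byte[] chunk = new byte[length];
            in.readFully(chunk, 0, length); // the read(byte[], int, int) family
            return chunk;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("range", ".bin");
        FileOutputStream out = new FileOutputStream(f);
        out.write(new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        out.close();
        System.out.println(readRange(f, 3, 4).length); // prints 4
    }
}
```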

And, finally, some general pointers in the matter: Instead of using multiple threads for this kind of thing, I strongly encourage you to have a look at the java.nio package, specifically java.nio.channels and the Selector class.

EDIT: If you're really keen on getting it super-efficient, and you have very large files, you could benefit from using UDP and handling packet ordering and acknowledgements yourself. TCP, for instance, guarantees that packets are received in the same order they were sent. This is not something you need to rely on here (since you could easily encode the "byte-offset" of each datagram yourself) and thus don't need to "pay" for.
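
To illustrate the byte-offset idea, a sketch of encoding and decoding a datagram payload (sending, retransmission and acknowledgements are deliberately left out):

```java
import java.nio.ByteBuffer;

public class OffsetPacket {

    // prefixes each payload with its 8-byte file offset so the receiver
    // can place the bytes correctly regardless of arrival order
    public static byte[] encode(long offset, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(8 + payload.length);
        buf.putLong(offset);
        buf.put(payload);
        return buf.array();
    }

    // reads the offset back out of a received datagram
    public static long decodeOffset(byte[] datagram) {
        return ByteBuffer.wrap(datagram).getLong();
    }

    public static void main(String[] args) {
        byte[] packet = encode(42L, new byte[] {1, 2, 3});
        System.out.println(decodeOffset(packet)); // prints 42
    }
}
```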

aioobe
Actually, I have taken the example from Download Accelerator: it makes 5 connections to get a file, and the file gets downloaded faster. Suppose I have a 5 GB file, should I let one thread do that?
Shekhar
I don't know the details of Download Accelerator, but I suspect that it possibly gains speed by "taking" speed from other downloading clients. If it downloads over HTTP it could gain some speed when downloading a large number of small files. It could then perform the handshaking while downloading another file. That is, it could eliminate the startup-latency for each individual file, but if you have a 5 gb file, I really doubt you'll get a speed boost by throwing more threads at the task.
aioobe
@aioobe see my answer; AIUI it's more to do with bypassing TCP fairness by adding more connections. (I guess that's what you were trying to say with the first sentence?) Avoiding connection startup costs for lots of small files can be done by HTTP pipelining, but I don't think download accelerators bother.
wds
+1  A: 

There's a bunch of questions here to answer. I'm not going to go through all of the code, but I can give you some tips.

First off, what some download accelerators do is indeed use the HTTP Range header to download parts of a file in parallel. Why does this work? TCP tries to allocate bandwidth fairly per connection, so if you're downloading a file from a server whose bandwidth is swamped, you can receive a bigger share of it by adding more connections. The same principle applies to servers that restrict outgoing bandwidth, which is usually also applied per connection (sometimes taking the IP into consideration).
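
To illustrate, a hypothetical Range request in plain Java (the URL and offsets are placeholders; a server that honours the header answers with 206 Partial Content):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeRequest {

    // builds the header value for bytes [start, end], inclusive on both ends
    public static String rangeHeader(long start, long end) {
        return "bytes=" + start + "-" + end;
    }

    // asks the server for just one part of the resource
    public static InputStream openPart(String url, long start, long end) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Range", rangeHeader(start, end));
        if (conn.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) {
            throw new IOException("server ignored the Range header");
        }
        return conn.getInputStream();
    }

    public static void main(String[] args) {
        // first 1 MB of the resource
        System.out.println(rangeHeader(0, 1048575)); // prints bytes=0-1048575
    }
}
```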

Obviously if everybody was doing this, we'd be left with a whole lot of TCP connections and their overhead, and not a lot of bandwidth to do the actual downloading, which is why even these download accelerators will only use 2-4 connections. Moreover, if you are the one writing the server, you really don't need to worry about this, as you will only be slowing yourself down (by adding more overhead).

Going out of memory: don't use a byte array; use a (buffered) InputStream (or, if you have some time, learn how to use java.nio and byte buffers) and read chunks as you are sending the file. The Java tutorials cover all the basics.

wds
@wds; would you agree that UDP could do better than TCP in a scenario like this?
aioobe
@aioobe: on a reliable network, you end up saturating your bandwidth either way. Perhaps you manage to do better than TCP, but not a lot (couple percent?), and at the cost of implementing a bunch of stuff yourself (i.e. you have to monitor link characteristics, know when to resend, provide your own checksums...). effort/benefit I don't think it's a good idea.
wds
+1  A: 

1) Another reason why multiple connections may be faster is related to TCP window size.

throughput <= window size / roundtrip time

See http://en.wikipedia.org/wiki/TCP_tuning#Window_size for details.

You won't see much difference if you run tests on a local network, because the roundtrip time is small enough.
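
For example, plugging numbers into the formula (the 64 KB window and 100 ms roundtrip are just illustrative values):

```java
public class ThroughputBound {

    // throughput <= windowSize / roundtripTime, per connection
    public static double maxBytesPerSecond(int windowBytes, double rttSeconds) {
        return windowBytes / rttSeconds;
    }

    public static void main(String[] args) {
        // a 64 KB window over a 100 ms roundtrip caps one connection at
        // about 640 KB/s, no matter how fat the pipe is
        System.out.println(maxBytesPerSecond(65536, 0.1));
    }
}
```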

2) The only way to know for sure is to try, and the right number of threads will depend on the environment. If you need to download really big files, it might be worth running a small calibration program first that tries downloading with different numbers of threads.

3) I haven't looked there for a long time, but Azureus (now called Vuze) has a pretty complete API to download anything from torrent files to FTP... and they probably have quite an efficient implementation.

Good luck !

Edit (clarification on window size) :

What you are trying to do is maximize throughput (download files faster). There is not much you can do about the roundtrip time; it depends on the network. What you can do is increase the window size. The window size is automagically adjusted (there is plenty of documentation on this, but I'm too lazy to google it) to best fit the current state of the network. Basically, a larger window means better throughput as long as there isn't congestion or packet loss.

In the best case, you will get a window size of 64 KB; at that point, unless you use some tricks (jumbo frames / window scaling) which are not supported by all routers on the internet, you are stuck at a maximum throughput of:

throughput <= 64 KB / roundtrip time

As you can't get a bigger window, you have to open multiple connections (each with its own window) to get around this limitation.

Notes :

  • As aioobe said, UDP isn't subject to the same limitations; this is one of the reasons why it is more efficient.
  • A very efficient and scalable protocol for distributing large files is BitTorrent. As long as you don't need authentication / authorization of the downloads, it might work for you. And if you do need authorization, you can always encrypt the files...
Guillaume
Would you care to elaborate on the implications of your equation?
aioobe
A: 

Don't read huge file chunks into memory. No wonder you're running out. Just seek to the required position in the file and start copying via a sensibly sized buffer:

// 'in' is the source stream (e.g. the socket input) and 'out' the destination
int count;
byte[] buffer = new byte[8192];
// or whatever takes your fancy, but sizes > the socket send buffer size are pointless
while ((count = in.read(buffer)) > 0)
    out.write(buffer, 0, count);
out.close();
in.close();

The same logic can be used at both ends: when writing the file at the receiver, use a RandomAccessFile and seek to the appropriate offset before starting this loop.
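
For instance, a sketch of the receiver side (names are illustrative): each thread seeks to its own offset and writes its chunk in place, so the chunks never have to be concatenated in memory:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkWriter {

    // writes one chunk directly at its final position in the output file
    public static void writeAt(RandomAccessFile file, long offset, byte[] chunk)
            throws IOException {
        file.seek(offset);
        file.write(chunk);
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("chunks", ".bin");
        RandomAccessFile out = new RandomAccessFile(f, "rw");
        writeAt(out, 3, new byte[] {4, 5});    // the second chunk arrives first
        writeAt(out, 0, new byte[] {1, 2, 3}); // then the first chunk
        out.close();
        System.out.println(f.length()); // prints 5
    }
}
```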

However as other respondents have noted, the client's requirement is really pretty pointless. It doesn't buy anything much except expense and risk. I would just stream the file via a single connection.

What you should do is set large socket send and receive buffers at both ends, e.g. 60k. The default is 8k on Windows, which is uselessly low.
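
For example (the 60k figure is the suggestion above; the OS may round the actual values, and the receive buffer should be set before connecting for it to affect the advertised TCP window):

```java
import java.io.IOException;
import java.net.Socket;

public class BufferTuning {

    // enlarges both socket buffers; call on the receiving socket
    // before connect() so the window can be negotiated accordingly
    public static void tune(Socket socket) throws IOException {
        socket.setReceiveBufferSize(60 * 1024);
        socket.setSendBufferSize(60 * 1024);
    }

    public static void main(String[] args) throws IOException {
        Socket s = new Socket(); // not yet connected
        tune(s);
        System.out.println(s.getReceiveBufferSize() > 0); // prints true
        s.close();
    }
}
```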

EJP