I'm using the Socket class for my web client. I can't use HttpWebRequest since it doesn't support SOCKS proxies, so I have to parse headers and handle chunked encoding myself. The most difficult thing for me is determining the length of the content, so I end up reading it byte by byte. First I use ReadByte() to find the end of the headers (the "\r\n\r\n" sequence), then check whether the response has a Transfer-Encoding header. If it does, I have to read each chunk's size, and so on:

public void ParseHeaders(Stream stream)
{
    while (true)
    {
        var lineBuffer = new List<byte>();
        while (true)
        {
            int b = stream.ReadByte();
            if (b == -1) return;
            if (b == 10) break;
            if (b != 13) lineBuffer.Add((byte)b);
        }
        string line = Encoding.ASCII.GetString(lineBuffer.ToArray());
        if (line.Length == 0) break;
        int pos = line.IndexOf(": ");
        if (pos == -1) throw new VkException("Incorrect header format");
        string key = line.Substring(0, pos);
        string value = line.Substring(pos + 2);
        Headers[key] = value;
    }
}

But this approach has very poor performance. Can you suggest a better solution? Maybe some open source examples or libraries that handle HTTP requests through sockets (not very big or complicated though, I'm a noob). The best would be a link to an example that reads the message body and correctly handles the cases where the content is chunked, is gzip- or deflate-encoded, or has no Content-Length header (the message ends when the connection is closed). Something like the source code of the HttpWebRequest class.

Upd: My new function looks like this:

int bytesRead = 0;
byte[] buffer = new byte[0x8000];
do
{
    try
    {
        bytesRead = this.socket.Receive(buffer);
        if (bytesRead <= 0) break;
        else
        {
            this.m_responseData.Write(buffer, 0, bytesRead);
            if (this.m_inHeaders == null) this.GetHeaders();
        }
    }
    catch (Exception exception)
    {
        throw new Exception("Read response failed", exception);
    }
}
while ((this.m_inHeaders == null) || !this.isResponseBodyComplete());

Where GetHeaders() and isResponseBodyComplete() use m_responseData (MemoryStream) with already received data.

A: 

In most (should be all) HTTP responses there should be a header called Content-Length that tells you how many bytes there are in the body. Then it is simply a matter of allocating that many bytes and reading them.

The Real Diel
Some transfer methods in HTTP 1.1 do not send you a valid Content-Length, because HTML is sometimes sent in chunks. It's not a reliable field for HTML content.
Aren
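For the chunked case mentioned in the comment above, here is a minimal C# sketch of a decoder. It assumes the stream is positioned right after the blank line that ends the headers and that the response uses "Transfer-Encoding: chunked". The names ChunkedBody, ReadLine and Decode are made up for illustration; chunk extensions and trailer headers are skipped rather than interpreted.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class ChunkedBody
{
    // Reads one line terminated by LF and returns it without CR/LF.
    static string ReadLine(Stream s)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = s.ReadByte()) != -1 && b != '\n')
        {
            if (b != '\r') sb.Append((char)b);
        }
        return sb.ToString();
    }

    // Decodes a chunked body; the stream must be positioned at the first chunk-size line.
    public static byte[] Decode(Stream s)
    {
        var body = new MemoryStream();
        while (true)
        {
            // The chunk size is hexadecimal; anything after ';' is a chunk extension (ignored here).
            string sizeLine = ReadLine(s);
            int semi = sizeLine.IndexOf(';');
            if (semi >= 0) sizeLine = sizeLine.Substring(0, semi);
            int size = int.Parse(sizeLine.Trim(), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            if (size == 0) break; // a zero-sized chunk marks the end of the body

            var chunk = new byte[size];
            int got = 0;
            while (got < size)
            {
                // Read may return fewer bytes than requested, so loop until the chunk is full.
                int n = s.Read(chunk, got, size - got);
                if (n <= 0) throw new EndOfStreamException("Connection closed mid-chunk");
                got += n;
            }
            body.Write(chunk, 0, size);
            ReadLine(s); // consume the CRLF that follows the chunk data
        }
        // Skip optional trailer headers up to the terminating blank line.
        while (ReadLine(s).Length > 0) { }
        return body.ToArray();
    }
}
```

Feeding it the bytes of "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" yields the body "Wikipedia". ReadByte is used only for the short size lines; wrapping the socket stream in a buffer keeps that cheap.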
Anyway, I have to read headers byte-by-byte to get to "Content-Length" header.
Poma
Instead of reading byte by byte, there should be a ReadLine method that lets you read one line at a time. The HTTP format is: <initial line>\r\n<optional header>\r\n<optional header>\r\n<...>\r\n\r\n<content>. So you read line by line until you find the Content-Length header, then split that line on ": " to get the header name and value (the length). Once you have the length, keep reading line by line until you reach the empty line, then read the number of bytes you got from the header. Can you format these comments??? lol
The Real Diel
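The procedure described in the comment above (read lines until the blank line, then read exactly Content-Length bytes) could look roughly like this in C#. This is a sketch under the assumption that the response carries a Content-Length header; FixedLengthBody, ReadLine and ReadResponseBody are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class FixedLengthBody
{
    // Reads one line terminated by LF and returns it without CR/LF.
    static string ReadLine(Stream s)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = s.ReadByte()) != -1 && b != '\n')
        {
            if (b != '\r') sb.Append((char)b);
        }
        return sb.ToString();
    }

    // Reads the status line and headers, then exactly Content-Length bytes of body.
    public static byte[] ReadResponseBody(Stream s)
    {
        ReadLine(s); // status line, e.g. "HTTP/1.1 200 OK" (ignored in this sketch)

        // Header names are case-insensitive, so use a case-insensitive dictionary.
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string line;
        while ((line = ReadLine(s)).Length > 0)
        {
            int colon = line.IndexOf(':');
            if (colon > 0)
                headers[line.Substring(0, colon).Trim()] = line.Substring(colon + 1).Trim();
        }

        string value;
        if (!headers.TryGetValue("Content-Length", out value))
            throw new InvalidDataException("No Content-Length: body is chunked or runs until close");

        int length = int.Parse(value);
        var body = new byte[length];
        int got = 0;
        while (got < length)
        {
            // A single Read may return less than requested; keep going until done.
            int n = s.Read(body, got, length - got);
            if (n <= 0) throw new EndOfStreamException("Connection closed before the body was complete");
            got += n;
        }
        return body;
    }
}
```

Splitting on ':' and trimming, rather than on ": ", also accepts headers that have no space after the colon.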
A: 

You may want to look at the TcpClient class in System.Net, it's a wrapper for a Socket that simplifies the basic operations.

From there you're going to have to read up on the HTTP protocol. Also be prepared to do some zip operations. HTTP 1.1 supports gzip compression of its content and partial blocks. You're going to have to learn quite a bit to parse them out by hand.

Basic HTTP 1.0 is simple and the protocol is well documented online; our friendly neighborhood Google can help you with that one.

Aren
I can use `GZipStream` and `DeflateStream` for that
Poma
A: 

You should consider using Privoxy. This proxy supports forwarding to SOCKS. Instead of using sockets, you could still use HttpWebRequest and make your calls to that proxy, which will forward them to SOCKS. Read the thread where they talk about Privoxy: C# using Tor as Proxy.

Good luck!

Pierre-Luc Champigny
Can't use it. I need to have many different proxies.
Poma
+7  A: 

I suggest that you don't implement this yourself - the HTTP 1.1 protocol is sufficiently complex to make this a project of several man-months.

The question is: is there an HTTP protocol parser for .NET? This has been asked on SO, and in the answers you'll see several suggestions, including source code for handling HTTP streams.

http://stackoverflow.com/questions/318506/converting-raw-http-request-into-httpwebrequest-object

EDIT: The Rotor code is reasonably complex, and difficult to read and navigate as web pages. Still, the implementation effort to add SOCKS support is much lower than implementing the entire HTTP protocol yourself. You will have something working within a few days at most that you can depend upon, based on a tried and tested implementation.

The request and response are read from and written to a NetworkStream, m_Transport, in the Connection class. It is used in these methods:

internal int Read(byte[] buffer, int offset, int size) 
//and
private static void ReadCallback(IAsyncResult asyncResult)

both in http://www.123aspx.com/Rotor/RotorSrc.aspx?rot=42903

The socket is created in

private void StartConnectionCallback(object state, bool wasSignalled)

So you could modify this method to create a Socket to your socks server, and do the necessary handshake to obtain the external connection. The rest of the code can remain the same.

I gathered this info in about 30 minutes of looking through the pages on the web. It should go much faster if you load these files into an IDE. It may seem like a burden to have to read through this code (after all, reading code is far harder than writing it), but you are making just small changes to an already established, working system.

To be sure the changes work in all cases, it would be wise to also test what happens when the connection is broken, to ensure that the client reconnects using the same method, and so re-establishes the SOCKS connection and sends the SOCKS request.

mdma
+1 for don't do this.
csharptest.net
I agree with you but source code of HttpWebRequest (Rotor) is way too complicated. I can't even find function that actually receives data from network.
Poma
I understand - it's not the easiest code to read, especially as webpages. I've added some pointers to my answer.
mdma
-1 for don't do this. It's great fun and relatively simple to get something up and running which serves web pages and static content. It's also a great way to learn HTTP. I've written several and every time I learn something new.
John Leidegren
@John Leidegren - The rotor code can be fairly simply extended, as I illustrate in my answer. What is your objection to doing that? If you look at the RFC for the HTTP 1.1 protocol, you'll see that it's involved, complex with many cases to consider. Not my idea of "fun". If I wrote the code for fun, I certainly wouldn't want to depend on it in a production application.
mdma
@mdma - Then I guess we disagree on the idea of fun. I would argue, though, that HTTP is scalable rather than complex, because you can still do a lot without investing a lot of time. A trivial HTTP server in under 1 hour is perfectly doable. But I've read enough RFCs to know that the things a typical production server implements go beyond that 10%, and I'm not looking to start a competition.
John Leidegren
@John Leidegren - The server can be simple, because it has the most control: you could implement HTTP/1.0. The question here is about implementing a client; the OP says he wants chunking, gzipping and, I would imagine, caching and persistent connections as well. That's not something that can be implemented trivially, and it's not going to be production ready after a few hours of hacking. I often get something working quickly, diving into a new area, hacking and achieving something real in a few hours. But this situation seems to require less seat-of-the-pants work and more dependability.
mdma
@mdma - I would agree that a HTTP client is not fundamentally as trivial but it's all the same to me. The only thing I would not implement myself would be the persistent connections stuff. That's daunting, to get right, to say the least.
John Leidegren
@John - now that we've talked, and we've agreed on some points, do you still feel a downvote is justified here? The persistent connections is quite important here since the SOCKS server introduces overhead for each connect/disconnect - the OP mentions that performance is a concern.
mdma
Well, I gave you a down vote because I don't like it when people start their answers by stating that you shouldn't try to do this yourself. In a way, it's the same as saying that this is too hard for you, and all too often it's not. It also completely disregards the learning aspect of trying something yourself. It feels like a cheap and pretty standard way of getting out of actually addressing the question head on. Sometimes it's easier to say "don't" than to explain how. The down vote is not because it's not a good answer; I just think it's the wrong kind of answer.
John Leidegren
I understand your point. Here, the OP asks if there is a library or codebase that can be used to achieve this, so I'm answering within the scope of the question.
mdma
Thanks, that helped. I decided to do it myself like John said. I've already written most of the stuff I needed except a few things (persistent connections etc). The good thing about doing this yourself is that you have great control over what is happening. I still don't understand how HttpWebRequest works internally (service points and other stuff). In mdma's link I found a reference to the Fiddler application and tried to decompile it using Reflector. It has good and simple code, so even with Reflector I found out almost everything I needed. Pity that its author didn't respond to my email though.
Poma
Good luck with the implementation.
mdma
A: 

While I would tend to agree with mdma about trying as hard as possible to avoid implementing your own HTTP stack, one trick you might consider is reading from the stream in moderate-sized chunks. If you do a read and give it a buffer that's larger than what's available, it should return the number of bytes it actually read. That should reduce the number of system calls and speed things up significantly. You'll still have to scan the buffers much like you do now, though.
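As a rough illustration of that idea, the C# sketch below reads 4 KB blocks and scans what has been received for the "\r\n\r\n" that ends the headers. HeaderReader and ReadThroughHeaders are hypothetical names, and the block size is an arbitrary choice.

```csharp
using System;
using System.IO;

public static class HeaderReader
{
    // Reads blocks from the stream until the blank line ending the headers has arrived.
    // Returns everything received so far; headerEnd is the index just past "\r\n\r\n",
    // so any bytes from headerEnd onward already belong to the body.
    public static byte[] ReadThroughHeaders(Stream s, out int headerEnd)
    {
        var received = new MemoryStream();
        var block = new byte[4096];
        int scanFrom = 0;
        while (true)
        {
            int n = s.Read(block, 0, block.Length); // may return fewer bytes than requested
            if (n <= 0) throw new EndOfStreamException("Connection closed before end of headers");
            received.Write(block, 0, n);

            byte[] data = received.GetBuffer();     // underlying buffer, valid up to received.Length
            int len = (int)received.Length;
            for (int i = scanFrom; i + 3 < len; i++)
            {
                if (data[i] == 13 && data[i + 1] == 10 && data[i + 2] == 13 && data[i + 3] == 10)
                {
                    headerEnd = i + 4;
                    return received.ToArray();
                }
            }
            // Resume three bytes before the end so a terminator split across two reads is still found.
            scanFrom = Math.Max(0, len - 3);
        }
    }
}
```

The header text is then the first headerEnd bytes of the returned array, and the remainder is the start of the body, which the caller has to keep rather than read again from the socket.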

Yuliy
A: 

Taking a look at another client's code is helpful (if not confusing): http://src.chromium.org/viewvc/chrome/trunk/src/net/http/

I'm currently doing something like this too. I find the best way to increase the efficiency of the client is to use the asynchronous socket functions provided. They're quite low-level and get rid of busy waiting and dealing with threads yourself. All of these have Begin and End in their method names. But first, I would try it using blocking calls, just so you get the semantics of HTTP out of the way. Then you can work on efficiency. Remember: premature optimization is evil, so get it working first, then optimize all of the stuff!

Also: Some of your efficiency might be tied up in your use of ToArray(). It's known to be a bit expensive computationally. A better solution might be to store your intermediate results in a byte[] buffer and append them to a StringBuilder with the correct encoding.

For gzipped or deflated data, read in all of the data (keep in mind that you might not get all of the data the first time you ask. Keep track of how much data you have read in, and keep on appending to the same buffer). Then you can decode the data using GZipStream(..., CompressionMode.Decompress).
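A minimal C# sketch of that last step, assuming the complete gzip-encoded body is already in a byte array. BodyDecoder and Gunzip are names chosen for this example; only the gzip case is shown.

```csharp
using System.IO;
using System.IO.Compression;

public static class BodyDecoder
{
    // Decompresses a complete gzip-encoded body that is already held in memory.
    public static byte[] Gunzip(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output); // reads until the end of the compressed data
            return output.ToArray();
        }
    }
}
```

If the body was also chunked, the chunks have to be joined first; the decompressor should only see the reassembled bytes.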

I would say that doing this is not as difficult as some might imply, you just have to be a bit adventurous!

Tim
A: 

I would create a SOCKS proxy that can tunnel HTTP and then have it accept the requests from HttpWebRequest and forward them. I think that would be far easier than recreating everything that HttpWebRequest does. You could start with Privoxy, or just roll your own. The protocol is simple and documented here:

http://en.wikipedia.org/wiki/SOCKS

And in the RFCs that it links to.

You mentioned that you have to have many different proxies -- you could set up a local port for each one.

Lou Franco
A: 

If the bottleneck is that ReadByte is too slow, I suggest you wrap your input stream in a BufferedStream. If your performance problem comes from many small reads, that will solve it for you.
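A short C# sketch of the wrapping, assuming the buffering wrapper meant here is System.IO.BufferedStream. Buffering and WithReadBuffer are illustrative names, and the 8 KB default is an arbitrary choice.

```csharp
using System.IO;

public static class Buffering
{
    // Wraps a stream so that ReadByte() is served from an in-memory buffer
    // instead of hitting the underlying stream once per byte.
    public static Stream WithReadBuffer(Stream raw, int bufferSize = 8192)
    {
        return new BufferedStream(raw, bufferSize);
    }
}
```

With a socket, the call would look something like Buffering.WithReadBuffer(new NetworkStream(socket)), and the existing ParseHeaders(Stream) can then be used unchanged on the returned stream.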

Also, you don't need this:

string line = Encoding.ASCII.GetString(lineBuffer.ToArray()); 

HTTP by design requires that the header is only made up of ASCII characters. You don't really want to -- or need to -- turn it into actual .NET strings (which are Unicode).

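If you want to find the end of the HTTP header, you can do this for good performance.

int k = 0;
while (k != 0x0d0a0d0a) 
{
    int ch = stream.ReadByte();
    // ReadByte returns -1 at end of stream; without this check the loop would never exit
    if (ch == -1) throw new EndOfStreamException("Connection closed before end of headers");
    k = (k << 8) | ch; // k holds the last four bytes read
}

When the sequence \r\n\r\n is encountered, k will equal 0x0d0a0d0a.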

John Leidegren
While this may help out with this specific problem, you are not giving the poster any indication of the size of the issues he will face if he continues to implement a HTTP client. Persistent connections are not trivial to implement, and not having them will kill performance.
mdma
I believe we went over this in the comments to your answer.
John Leidegren
A: 

All the answers here about extending Socket and/or TcpClient seem to miss something really obvious: HttpWebRequest is also a class and can therefore be extended.

You don't need to write your own HTTP/socket class. You simply need to extend HttpWebRequest with a custom connection method. After connecting all data is standard HTTP and can be handled as normal by the base class.

public class SocksHttpWebRequest : HttpWebRequest
{
   public static WebRequest Create( string url, string proxy_url ) {
   ... setup socks connection ...

   // call base HttpWebRequest class Create() with proxy url
   return WebRequest.Create(proxy_url);
   }
}

The SOCKS handshake is not particularly complex so if you have a basic understanding of programming sockets it shouldn't take very long to implement the connection. After that HttpWebRequest can do the HTTP heavy lifting.
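To give an idea of the size of that handshake, here is a C# sketch that builds the two client messages of a SOCKS5 CONNECT without authentication, following the message layout in RFC 1928. Socks5, Greeting and ConnectRequest are names invented for this example; sending the bytes and checking the proxy's replies are left out.

```csharp
using System;
using System.Text;

public static class Socks5
{
    // Greeting: version 5, one auth method offered, 0x00 = "no authentication required".
    public static byte[] Greeting()
    {
        return new byte[] { 0x05, 0x01, 0x00 };
    }

    // CONNECT request addressed by host name (ATYP 0x03), so the proxy does the DNS lookup.
    public static byte[] ConnectRequest(string host, int port)
    {
        byte[] name = Encoding.ASCII.GetBytes(host);
        if (name.Length == 0 || name.Length > 255)
            throw new ArgumentException("host must be 1..255 bytes", "host");
        if (port < 1 || port > 65535)
            throw new ArgumentOutOfRangeException("port");

        var req = new byte[7 + name.Length];
        req[0] = 0x05;                              // VER
        req[1] = 0x01;                              // CMD = CONNECT
        req[2] = 0x00;                              // RSV
        req[3] = 0x03;                              // ATYP = domain name
        req[4] = (byte)name.Length;                 // length prefix of the host name
        Array.Copy(name, 0, req, 5, name.Length);
        req[5 + name.Length] = (byte)(port >> 8);   // port, high byte first (network order)
        req[6 + name.Length] = (byte)(port & 0xFF);
        return req;
    }
}
```

The client sends Greeting(), expects the two bytes 0x05 0x00 back, sends ConnectRequest(host, port), and then reads a reply whose second byte is 0x00 on success. After that the socket carries plain HTTP to the target host.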

SpliFF
It would certainly be nice if it could be solved this simply. How does the base HttpWebRequest.Create get the same socket connection that was created to the SOCKS server in SocksHttpWebRequest.Create?
mdma
Theory is awesome but I don't think you can do that. Could you send a working code example? How do you give a TCP connection to HTTPRequest? AFAIK you can't do that.
dr. evil
A: 

Why don't you read until 2 newlines and then just grab from the string? Performance might be worse but it still should be reasonable:

Dim Headers As String = GetHeadersFromRawRequest(ResponseBinary)
If Headers.IndexOf("Content-Encoding: gzip", StringComparison.OrdinalIgnoreCase) >= 0 Then
    ' The body starts right after the blank line (CRLF CRLF) that ends the headers.
    Dim BodyOffset As Integer = Headers.Length + (vbCrLf & vbCrLf).Length
    ' Count is measured from BodyOffset, otherwise the range runs past the received data.
    Dim GzStream As New GZipStream(New MemoryStream(ResponseBinary, BodyOffset, ReadByteSize - BodyOffset), CompressionMode.Decompress)
    ClearTextHtml = New StreamReader(GzStream).ReadToEnd()
End If

Private Function GetHeadersFromRawRequest(ByVal request() As Byte) As String
    Dim Req As String = Text.Encoding.ASCII.GetString(request)
    Dim ContentPos As Integer = Req.IndexOf(vbCrLf & vbCrLf)

    If ContentPos = -1 Then Return String.Empty

    Return Req.Substring(0, ContentPos)
End Function
dr. evil