I am developing a website in ASP.NET that crawls the content of other websites. I am able to get the content correctly, but how can I identify which language that content is written in? For example: English, Hindi, Chinese, Japanese, etc.

I used the following code:

    // Requires: using System.IO; using System.Net;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(TextBox1.Text);
    request.UserAgent = "A .NET Web Crawler";

    string htmlText;
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        // Read the whole response body as text.
        htmlText = reader.ReadToEnd();
    }
A: 

If you are talking about "programming language", then you can't. You can find clues, but there is no way to know for sure whether a page was produced with ASP, PHP, or anything else.

If you are not talking about programming languages but about English/Spanish/French etc., then ignore my answer (but clarify your question).

EJB
Thanks, I want to identify whether it's English/Chinese/Japanese, like that.
Ajay
A: 

Well, some webpages contain a "lang" or "xml:lang" attribute in the html element. For example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
</head>
<body>

</body>
</html>

In this example the "lang" and "xml:lang" attributes are set to "en" (i.e. English). Additionally, some servers set a "Content-Language" header, and you could check its value. (Although, to be honest, I haven't actually seen a server that sets it.)

However, the value of these attributes or headers could be anything, and some servers and webpages won't state a language at all. Where a value is present, you'll probably want to match it against the common language codes defined by ISO 639 and the country codes defined by ISO 3166.

As for implementing this in C#, I'll admit I don't have much of a clue, but I think the WebResponse class has a property called Headers that you may want to look at.
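
As a rough sketch of the two checks described above (not a definitive implementation), the declared language could be pulled out of the markup with a regular expression and out of the response via WebResponse.Headers. The class and method names (DeclaredLanguage, FromHtml, FromHeaders) are made up for illustration, and the regex only looks at the opening html tag, so a real crawler would probably want a proper HTML parser instead.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DeclaredLanguage
{
    // Matches lang="en" or xml:lang='en-GB' on the opening <html> tag.
    private static readonly Regex HtmlLang = new Regex(
        @"<html\b[^>]*?\b(?:xml:)?lang\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the language code declared in the markup, or null if there is none.
    public static string FromHtml(string htmlText)
    {
        Match m = HtmlLang.Match(htmlText ?? string.Empty);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    // Returns the Content-Language header value, or null if the server sent none.
    public static string FromHeaders(WebHeaderCollection headers)
    {
        return headers == null ? null : headers["Content-Language"];
    }
}
```

With the question's variables this would be called as DeclaredLanguage.FromHtml(htmlText) and DeclaredLanguage.FromHeaders(response.Headers). Either call can return null, in which case one of the content-based checks is still needed.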

Oh, and languages like Hindi are, I'm pretty sure, written in a script with its own characters (Devanagari, in Hindi's case). In that case you could search your htmlText string for characters from that script.

Another simple method is to check your htmlText string for words common to a particular language. For example, if you wanted to know whether the page was French, you could search for the word "bonjour" etc.
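
A minimal sketch of that common-word idea, assuming the text has already had its HTML tags stripped. The class name CommonWordGuesser and the tiny word lists are invented for illustration; real lists would need to be far larger, and short pages will still be guessed wrongly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class CommonWordGuesser
{
    // Tiny illustrative word lists keyed by ISO 639-1 code.
    private static readonly Dictionary<string, HashSet<string>> Words =
        new Dictionary<string, HashSet<string>>
        {
            { "en", new HashSet<string> { "the", "and", "is", "of", "to" } },
            { "fr", new HashSet<string> { "le", "la", "les", "et", "est", "bonjour" } },
            { "es", new HashSet<string> { "el", "los", "y", "es", "que", "hola" } },
        };

    // Returns the code whose word list has the most hits, or null if nothing matched.
    public static string Guess(string text)
    {
        string[] tokens = (text ?? string.Empty).ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' },
                   StringSplitOptions.RemoveEmptyEntries);

        string best = null;
        int bestHits = 0;
        foreach (var entry in Words)
        {
            // Count how many tokens appear in this language's word list.
            int hits = tokens.Count(t => entry.Value.Contains(t));
            if (hits > bestHits) { best = entry.Key; bestHits = hits; }
        }
        return best;
    }
}
```

Counting hits across several common words, rather than searching for a single word such as "bonjour", makes the guess less sensitive to one borrowed word appearing on a page in another language.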

A: 

You might find something here: http://www.google.com/uds/samples/language/detect.html

Enno Shioji
A: 

Aside from hoping that the person who created the webpage added a language identifier to the html tag or specified it in a meta tag, your best option (once those two have failed) is to check the Unicode code points of some non-English text from the web page.

string text = "あの";
foreach (char c in text)
{
   Console.WriteLine("U+{0:x4}", (int)c);
}

Then check which language's Unicode range each code point falls into.

This won't be 100% accurate, however, as there is character overlap in a number of languages (Japanese can be determined fairly accurately due to hiragana/katakana use, but a random selection of Chinese characters MIGHT just be a section of Japanese kanji without hiragana or katakana).

The quickest way to do this programmatically is likely to check for as many language-specific character sets as you can first, and to fall back to a more robust search only when none of those checks gives an answer.
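
A rough sketch of that range check, assuming only three scripts matter: kana (Hiragana and Katakana, U+3040 to U+30FF), CJK Unified Ideographs (U+4E00 to U+9FFF), and Devanagari (U+0900 to U+097F). The class name ScriptGuesser and the decision rules are illustrative assumptions. In particular, text that contains only ideographs is reported as "zh" even though it might be Japanese kanji, and Devanagari is reported as "hi" even though other languages use that script too.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class ScriptGuesser
{
    // Guesses "ja", "zh", or "hi" from the Unicode blocks seen in the text.
    // Returns null when none of the blocks below occurs.
    public static string Guess(string text)
    {
        int kana = 0, han = 0, devanagari = 0;
        foreach (char c in text ?? string.Empty)
        {
            if (c >= '\u3040' && c <= '\u30FF') kana++;            // Hiragana + Katakana
            else if (c >= '\u4E00' && c <= '\u9FFF') han++;        // CJK Unified Ideographs
            else if (c >= '\u0900' && c <= '\u097F') devanagari++; // Devanagari
        }

        if (kana > 0) return "ja";          // any kana is treated as Japanese
        if (devanagari > han) return "hi";  // assumption: Devanagari means Hindi
        if (han > 0) return "zh";           // ideographs only: could still be Japanese
        return null;
    }
}
```

For example, ScriptGuesser.Guess("あの") returns "ja", while a string of plain ASCII returns null, which is the signal to fall back to another method.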

digiwombat