views:

639

answers:

7

Hi:)

I have a Hindi script file like this:

3.  भारत का इतिहास काफी समृद्ध एवं विस्तृत है।

I have to write a program which adds a position to every word in each sentence. The numbering should restart at 1 for each line, with each word's position given in parentheses. The output should be something like this:

3.  भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)

The meaning of the above sentence is:

3.  India has a long and rich history.

If you observe, the '।' (which is the full stop in Hindi, equivalent to '.' in English) also has a word position, and similarly other special symbols would too. I am working on English-Hindi word alignment (a part of Natural Language Processing (NLP)), so the full stop '.' in English should map to '।' in Hindi. The serial numbers remain untouched. I thought reading character by character could be a solution. Could you please help me with how to go about this in C++ if that is easy, or, if it would be easier, could you suggest a way in some other programming language like Python or Perl?

The thing is, I am able to get word positions for my English text in C++, since I could read it character by character using ASCII values, but I don't have a clue how to do the same for the Hindi text.

The final aim of all this is to see which word position of the English text maps to which position in the Hindi text. This way I can achieve bidirectional alignment.

Thank you for your time...:)

+2  A: 

Take a look at http://site.icu-project.org/, a C++ library for processing Unicode strings.
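
For a concrete picture of what that gives you, here is a minimal sketch (an untested illustration, not from the answer) that numbers the words of the question's sentence with ICU's word BreakIterator; UAX #29 word segmentation treats the danda as a separate token, which is exactly what was asked for. The "hi" locale and the UTF-8 source-file assumption are mine:

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // Assumes this source file is saved as UTF-8.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(
        "भारत का इतिहास काफी समृद्ध एवं विस्तृत है।");

    std::unique_ptr<icu::BreakIterator> words(
        icu::BreakIterator::createWordInstance(icu::Locale("hi"), status));
    if (U_FAILURE(status)) return 1;
    words->setText(text);

    int position = 0;
    std::string out;
    int32_t start = words->first();
    for (int32_t end = words->next(); end != icu::BreakIterator::DONE;
         start = end, end = words->next()) {
        icu::UnicodeString token;
        text.extractBetween(start, end, token);
        // Skip whitespace segments; the danda '।' survives as its own token.
        if (u_isUWhiteSpace(token.char32At(0)))
            continue;
        std::string utf8;
        token.toUTF8String(utf8);
        out += utf8 + "(" + std::to_string(++position) + ") ";
    }
    std::cout << out << '\n';   // भारत(1) का(2) ... है(8) ।(9)
}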

cc
+4  A: 

If you are working in C++ and decide that UTF-8 is a viable encoding for your application, you could look at utfcpp, a library that provides equivalents of many things found in the stdlib (such as streams and string-processing functions) but abstracts away the difficulties of dealing with a variable-length encoding like UTF-8.
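
As a rough sketch of how that looks in practice (untested; it assumes the utf8.h header from utfcpp and a source file saved as UTF-8), you can decode one code point at a time with utf8::next and split on the space and danda code points yourself:

#include "utf8.h"   // from the utfcpp library
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::string line = "भारत का इतिहास काफी समृद्ध एवं विस्तृत है।";
    std::string word;
    int position = 0;
    auto emit = [&](const std::string& token) {
        if (!token.empty())
            std::cout << token << "(" << ++position << ") ";
    };
    for (std::string::iterator it = line.begin(); it != line.end(); ) {
        std::string::iterator from = it;
        uint32_t cp = utf8::next(it, line.end());  // decode one code point, advance
        if (cp == 0x0020) {                // space ends the current word
            emit(word);
            word.clear();
        } else if (cp == 0x0964) {         // the danda gets its own position
            emit(word);
            word.clear();
            emit(std::string(from, it));
        } else {
            word.append(from, it);         // keep the raw UTF-8 bytes
        }
    }
    emit(word);
    std::cout << '\n';
}

The point of the library here is only the decoding step; the tokenizing logic stays byte-oriented, so the rest of a C-style program can remain unchanged.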

If on the other hand you are free to use any language, I would say that doing something like this in Python would be far easier: its Unicode support is very good, as are the bundled string-processing routines.

#!/usr/bin/env python
# encoding: utf-8

string = u"भारत का इतिहास काफी समृद्ध एवं विस्तृत है।"
parts = []
for part in string.split():
    parts.extend(part.split(u"।"))
print "No of Parts: %d" % len(parts)
print "Parts: %s" % parts

Outputs:

No of Parts: 9
Parts: [u'\u092d\u093e\u0930\u0924', u'\u0915\u093e', u'\u0907\u0924\u093f\u0939\u093e\u0938', u'\u0915\u093e\u092b\u0940', u'\u0938\u092e\u0943\u0926\u094d\u0927', u'\u090f\u0935\u0902', u'\u0935\u093f\u0938\u094d\u0924\u0943\u0924', u'\u0939\u0948', u'']

Also, since you are doing natural language processing, you may want to take a look at the NLTK library for Python which has a wealth of tools for just this kind of job.

jkp
+1  A: 

The easiest way to do the processing is to get your input into a std::wstring (which is logically an array of wchar_t). Now, you still won't have "characters", because that concept is a bit more complex in Hindi. You will, however, have substrings separated by L' ', and the L'।' will also be separate. E.g. you can call input.find_first_of(L" ।").
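
A minimal sketch of that approach (my illustration, assuming the line has already been decoded into a std::wstring; both the space and the danda U+0964 are in the BMP, so it works whether wchar_t is 16 or 32 bits):

#include <iostream>
#include <locale>
#include <string>

int main() {
    // Assumes a UTF-8 source file and a compiler that maps the literal correctly.
    std::wstring input = L"भारत का इतिहास काफी समृद्ध एवं विस्तृत है।";
    std::wstring out;
    int position = 0;
    std::wstring::size_type begin = 0;
    while (begin < input.size()) {
        // The next space or danda ends the current word.
        std::wstring::size_type end = input.find_first_of(L" ।", begin);
        std::wstring word = (end == std::wstring::npos)
                                ? input.substr(begin)
                                : input.substr(begin, end - begin);
        if (!word.empty())
            out += word + L"(" + std::to_wstring(++position) + L") ";
        if (end == std::wstring::npos)
            break;
        if (input[end] == L'।')   // the danda gets a position of its own
            out += L"।(" + std::to_wstring(++position) + L") ";
        begin = end + 1;
    }
    std::wcout.imbue(std::locale(""));  // assumes the user's locale can encode the output
    std::wcout << out << L'\n';         // भारत(1) का(2) ... है(8) ।(9)
}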

MSalters
+1  A: 

The first thing to do is determine whether or not your input is in UTF-16 (what Windows calls "Unicode"). Do this by attempting to read your input as UTF-16 and seeing whether the results are garbled.

FILE * fp = _wfopen( L"fname", L"r" );
wchar_t buf[1000];
while( fgetws( buf, 999, fp ) ) {
    fwprintf( stdout, L"%s", buf );   // fwprintf takes the FILE* as its first argument
}

If the output is OK, you have a UTF-16 file; if it is garbled, it is probably UTF-8.

If you have UTF-8, you will have to convert it to a wide (UTF-16) string to make the processing straightforward.

// convert UTF-8 to UTF-16 (Windows "Unicode")
#include <windows.h>
#include <string>

void String2WString( std::wstring& ws, const std::string& s )
{
    ws.clear();
    // First call computes the required length; the second does the conversion.
    // CP_UTF8 is the correct source code page for UTF-8 input (CP_ACP would
    // interpret the bytes in the system ANSI code page instead).
    int nLenOfWideCharStr = MultiByteToWideChar(CP_UTF8, 0,
        s.c_str(), s.length(), NULL, 0);
    PWSTR pWideCharStr = (PWSTR)HeapAlloc(GetProcessHeap(), 0,
        (nLenOfWideCharStr + 1) * sizeof(wchar_t));
    if (pWideCharStr == NULL)
        return;
    MultiByteToWideChar(CP_UTF8, 0,
        s.c_str(), s.length(),
        pWideCharStr, nLenOfWideCharStr);
    *(pWideCharStr + nLenOfWideCharStr) = L'\0';
    ws = pWideCharStr;
    HeapFree(GetProcessHeap(), 0, pWideCharStr);
}

// read UTF-8
FILE * fp = fopen( "fname", "r" );
char buf[1000];
std::string aline;
std::wstring wline;
std::vector< std::wstring > vline;
while( fgets( buf, 999, fp ) ) {
    aline = buf;
    String2WString( wline, aline );
    vline.push_back( wline );
}

The above assumes that you are on Windows. On Unix, the same idea applies and the code is quite similar. However, I do not find it quite so straightforward, so I will let a UNIX expert provide the details.
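
For what it's worth, the rough Unix shape would be something like the following untested sketch: with a UTF-8 locale selected through setlocale, the standard mbstowcs does the same multibyte-to-wide conversion that MultiByteToWideChar does above.

#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Same contract as String2WString above, using the C library instead of Win32.
void String2WStringUnix( std::wstring& ws, const std::string& s )
{
    ws.clear();
    // A wide string never has more characters than the UTF-8 string has bytes.
    std::vector<wchar_t> buf( s.size() + 1 );
    std::size_t n = std::mbstowcs( buf.data(), s.c_str(), buf.size() );
    if ( n == static_cast<std::size_t>(-1) )
        return;                      // invalid multibyte sequence
    ws.assign( buf.data(), n );
}

// Somewhere at program start-up:
//     std::setlocale( LC_ALL, "" );   // assumes the user's locale is UTF-8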

ravenspoint
Ugly variable names with Hungarian notation?
piotr
A: 

I would seriously suggest that you use Python for an application like this. It will lift the burden of decoding the strings (not to mention allocating memory for them and the like). You will be free to concentrate on your problem instead of on problems of the language.

For example, suppose the sentence above is contained in a UTF-8 file and you are using Python 2.x. With Python 3.x it would be even more readable, as you don't have to prefix the Unicode strings with u"..." as in this example (but you would be missing a lot of 3rd-party libraries):

separators = [u"।", u",", u"."]
text = open("indiantext.txt").read()
#This converts the encoded text to an internal unicode object, where
# all characters are properly recognized as an entity:
text = text.decode("utf-8")

#this breaks the text on the white spaces, yielding a list of words:
words = text.split()

counter = 1

output = ""
for word in words:
    #if the last char is a separator, and is joined to the word:
    if word[-1] in separators and len(word) > 1:
        #word up to the second to last char:
        output += word[:-1] + u"(%d) " % counter
        counter += 1
        #last char
        output += word[-1] +  u"(%d) " % counter
    else:
        output += word + u"(%d) " % counter
    counter += 1

print output

This is an "unfolded" example; as you get more used to Python there are shorter ways to express this. You can learn the basics of the language in just a couple of hours by following a tutorial (for example, the one at http://python.org itself).

jsbueno
+3  A: 

ICU (International Components for Unicode) is an IBM-supported C++ library that is becoming a standard for handling characters in all languages. I see more and more projects using it. It does the job really well. Here are the features (copied from the website):

  • Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and are the most complete available anywhere.

  • Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.

  • Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.

  • Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs are provided.

  • Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.

  • Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.

  • Bidi: support for handling text containing a mixture of left to right (English) and right to left (Arabic or Hebrew) data.

  • Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.

Milan Babuškov
Ahh. But have you ever tried to use it? Not easy.
+6  A: 

Wow, already 6 answers and not a single one actually does what mgj wanted. jkp comes close, but then drops the ball by deleting the daṇḍa.

Perl to the rescue. Less code, fewer bugs.

use utf8; use strict; use warnings;
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';   # so the wide characters print cleanly
my $index;
print join ' ', map { $index++; "$_($index)" } split /\s+|(?=।)/, decode 'UTF-8', <>;
# prints भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)

edit: changed to read from STDIN as per comment, added best practices pragmas

daxim
@daxim: Use <> instead of the embedded string to read from standard input or files on the command line. You'll need to binmode it and you can do without the 'use utf8'...
ijw
Hi..:) Thank you for your answer. May I know where I can supply the input file in your source code, so that it returns the required answer you gave in the comment? I am not able to figure it out.
mgj