Hi... :) This might look like a very long question, but trust me, it is not. I cannot figure out why, after processing, this text can no longer be read or edited. I used the ord() function in Python to check whether the text contains any non-ASCII characters, and I found quite a number of them. I have a strong feeling that the problem lies in the original text itself (the INPUT).

Input file: just copy and paste it into a file named "acle5v1.txt".

The objective of the code below is to find upper-case characters and convert them to lower case, and also to remove all punctuation, so that the words can be passed on for further processing (word alignment).

#include<iostream>
#include<fstream>
#include<ctype.h>
#include<cstring>

using namespace std;

ifstream fin2("acle5v1.txt");
ofstream fin3("acle5v1_op.txt");
ofstream fin4("chkcharadded.txt");
ofstream fin5("chkcharntadded.txt");
ofstream fin6("chkprintchar.txt");
ofstream fin7("chknonasci.txt");
ofstream fin8("nonprinchar.txt");

int main()
{
    char ch,ch1;
    fin2.seekg(0);
    fin3.seekp(0);
    int flag = 0;

    while(!fin2.eof())
    {
        ch1=ch;
        fin2.get(ch);

        if (isprint(ch))// if the character is printable
            flag = 1;

        if(flag)
        {
            fin6<<"Printable character:\t"<<ch<<"\t"<<(int)ch<<endl;
            flag = 0;
        }
        else
        {
            fin8<<"Non printable character caught:\t"<<ch<<"\t"<<int(ch)<<endl;
        }

        if( isalnum(ch) || ch == '@' || ch == ' ' )// checks for alpha numeric characters
        {
            fin4<<"char added: "<<ch<<"\tits ascii value: "<<int(ch)<<endl;
            if(isupper(ch))
            {
                //tolower(ch);
                fin3<<(char)tolower(ch);
            }
            else
            {
                fin3<<ch;
            }
        }
        else if( ( ch=='\t' || ch=='.' || ch==',' || ch=='#' || ch=='?' || ch=='!' || ch=='"' || ch != ';' || ch != ':') && ch1 != ' ' )
        {
            fin3<<' ';
        }
        else if( (ch=='\t' || ch=='.' || ch==',' || ch=='#' || ch=='?' || ch=='!' || ch=='"' || ch != ';' || ch != ':') && ch1 == ' ' )
        {
            //fin3<<" ';
        }
        else if( !(int(ch)>=0 && int(ch)<=127) )
        {
            fin5<<"Char of ascii within range not added: "<<ch<<"\tits ascii value: "<<int(ch)<<endl;
        }
        else
        {
            fin7<<"Non ascii character caught(could be a -ve value also)\t"<<ch<<int(ch)<<endl; 
        }   
    }
    return 0;
}

I have similar code written in Python, and its output is again neither readable nor editable.

The Python code looks like this:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import sys

input_file=sys.argv[1]
output_file=sys.argv[2]

list1=[]

f=open(input_file)
for line in f:
    line=line.strip()   
    #line=line.rstrip('.')   
    line=line.replace('.','')
    line=line.replace(',','')
    line=line.replace('#','')
    line=line.replace('?','')
    line=line.replace('!','')
    line=line.replace('"','')
    line=line.replace('।','')
    line=line.replace('|','')       
    line = line.lower() 
    list1.append(line)
    f.close()

    f1=open(output_file,'w')

    f1.write(' '.join(list1))

    f1.close()

The script takes the input and output file names at runtime, like this:

python punc_remover.py acle5v1.txt acle5v1_op.txt

The output of this script is written to "acle5v1_op.txt".

This output file is needed for further processing. This particular file, "acle5v1_op.txt", is the UNREADABLE and UNEDITABLE file that I am not able to use for further processing. I need it for word alignment in NLP. I tried reading this output with the following program:

#include<iostream>
#include<fstream>

using namespace std;

ifstream fin1("acle5v1_op.txt");
ofstream fout1("chckread_acle5v1_op.txt");
ofstream fout2("chcknotread_acle5v1_op.txt");

int main()
{
    char ch;
    int flag = 0;
    long int r = 0; long int nr = 0;

    while(!(fin1))
    {
        fin1.get(ch);

        if(ch)
        {
            flag = 1;
        }

        if(flag)
        {
            fout1<<ch;
            flag = 0;
            r++;
        }
        else
        {
            fout2<<"Char not been able to be read from source file\n";
            nr++;
        }
    }

    cout<<"Number of characters able to be read: "<<r;
    cout<<endl<<"Number of characters not been able to be read: "<<nr;

    return 0;
}

This program prints each character if it is readable and skips it if not. However, I observed that both output files are blank, so I concluded that the file "acle5v1_op.txt" is UNREADABLE AND UNEDITABLE. Could you please help me with how to deal with this problem?

To give you some statistics about the original input file "acle5v1.txt": it has around 3441 lines and around 3 million characters.

Given the number of characters in the file, your editor might or might not manage to open it. I was able to open the file in gedit on Fedora 10, which I am currently using. This is just to let you know that opening it with a particular editor was not actually an issue, at least in my case.

Can I use scripting languages like Python and Perl to deal with this problem? If yes, how? Could you please be specific, as I am a novice in Perl and Python? Or could you please tell me how to solve this problem using C++ itself? Thank you... :) I am really looking forward to some help or guidance on how to go about this problem.

+2  A: 

(Now I can reply, after taking some time to edit the post. When posting, please use the preview and read the help!)

There is no problem Python cannot tackle... and this problem can definitely be solved using Python.

After modifying your Python script a bit (the indentation is messed up!), I was able to process the content I could copy from your link. The output was fine (but still contained some punctuation such as ':' and '()').
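For illustration, here is a minimal sketch of what a script with repaired indentation could look like. It is not the exact script I ran: it assumes Python 3 and a UTF-8 encoded input file, and the names PUNCTUATION, remove_punctuation and process are made up for this example. The set of characters removed is the same as in the replace() calls of the script in the question.

```python
import sys

# Characters to strip; same set as the replace() calls in the question's script.
PUNCTUATION = '.,#?!"।|'


def remove_punctuation(line):
    """Strip surrounding whitespace, drop punctuation, and lower-case one line."""
    line = line.strip()
    for mark in PUNCTUATION:
        line = line.replace(mark, '')
    return line.lower()


def process(input_file, output_file, encoding='utf-8'):
    """Clean every line of input_file and write the result as a single line."""
    # Assumption: the input is UTF-8; pass another encoding if it is not.
    with open(input_file, encoding=encoding) as f:
        cleaned = [remove_punctuation(line) for line in f]
    # The files are opened and closed outside the per-line work, which is
    # the part the broken indentation got wrong.
    with open(output_file, 'w', encoding=encoding) as f1:
        f1.write(' '.join(cleaned))


if __name__ == '__main__' and len(sys.argv) == 3:
    process(sys.argv[1], sys.argv[2])
```

It would be run the same way as before: python punc_remover.py acle5v1.txt acle5v1_op.txt. Note that, like the original, it joins everything into one single line.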

You say that after the first processing step the output is unreadable and uneditable, but what is the content of the output file after processing? Did you try opening it in your editor to see what was in the file? If this first step is not working, then correct your code at the first step and focus your question on that step. Try using a debugger to see where your code fails.

Personally, I suspect an encoding problem: is your input file a pure ASCII file? Or could it be that it is encoded in Unicode?
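One way to check this is to look at the raw bytes of the file. The sketch below is only a rough test, assuming Python 3; the function name describe_encoding is made up for this example. It counts the bytes above 127 and then tries to decode the data as UTF-8. A file that is not valid UTF-8 is reported as 'unknown' rather than guessed at.

```python
def describe_encoding(data):
    """Classify a bytes object as 'ascii', 'utf-8' or 'unknown'.

    Returns a (label, non_ascii_byte_count) tuple.
    """
    # Iterating over bytes yields integers in the range 0..255.
    non_ascii = sum(1 for byte in data if byte > 127)
    if non_ascii == 0:
        return 'ascii', 0
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return 'unknown', non_ascii
    return 'utf-8', non_ascii


# Possible usage (file name taken from the question):
# with open('acle5v1.txt', 'rb') as f:
#     print(describe_encoding(f.read()))
```

If the result is not 'ascii', the file contains multi-byte characters, and code that handles it one byte at a time will see values outside the 0 to 127 range.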

Please note that a 3 MB file is not much. If it is posing problems for you, change your editor! (Try jEdit, Epsilon, Emacs, vi...)

Adrien Plisson
Hi Sir, thank you for your answer, and thanks a lot for editing the post for me as well. I shall follow the help from next time. Coming back to your answer: was the output you got after executing the Python script readable and editable? One part of my question is whether you can read from this output file "acle5v1_op.txt" and, say, print the same contents into another file such as "copy_acle5v1_op.txt". By editing, I mean: are you able to navigate through the whole file using the cursor? (I was not able to do so.)
mgj
The output I am expecting is simply that the entire text file should be readable and should allow cursor movement for navigation, and maybe for corrections where necessary. The problem I am facing is that I cannot move the cursor (hence it is UNEDITABLE) and I cannot read its contents (i.e. the output of "acle5v1.txt", which is "acle5v1_op.txt") into another file. Punctuation is not much of a problem; as you surely know better than me, I can use line.replace to deal with characters like '(', '.' and ')'. I am experiencing no problems with the editor.
mgj
Yes, I was definitely able to edit the file. But it consisted of one very long single line, and I know many editors that fail when editing this kind of file. Try replacing the spaces between words with newlines; it may improve the way your editor handles the file.
Adrien Plisson
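The suggestion about newlines could be tried with something like the following sketch. It assumes Python 3, and the function names one_word_per_line and wrap_text are made up for this example. The first function puts every word on its own line; the second is a gentler variant that wraps the text at a fixed width using the standard textwrap module.

```python
import textwrap


def one_word_per_line(text):
    """Replace every run of whitespace between words with one newline."""
    return '\n'.join(text.split())


def wrap_text(text, width=80):
    """Break one very long line into lines of at most `width` characters."""
    return '\n'.join(textwrap.wrap(text, width=width))


# Possible usage (file name taken from the question, UTF-8 assumed):
# with open('acle5v1_op.txt', encoding='utf-8') as f:
#     text = f.read()
# with open('acle5v1_op_wrapped.txt', 'w', encoding='utf-8') as f:
#     f.write(wrap_text(text))
```

Either form avoids the single multi-million-character line, which is the thing many editors struggle with.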
I am happy to know that you got the same output as I did. Could you please tell me whether you are able to read from this file and, say, copy its contents into another file? Also, could you please tell me what happens to the output line after the first 'xx' characters (I mean as many characters as fit in the first line of the editor)? Is the remainder continued on the same screen, below it, as part of the same line no. 1, or does it run off your screen as one continuous line, the way a long line would run outside this comment box?
mgj
The reason I am asking the second question is that I have another file with around 4 million characters (as output, from a different input file than the one posted) which shows the CONTINUATION of the same line on the line BELOW it. It is no doubt still one line, but it does not run as a continuous line OUTSIDE the monitor screen, and it is editable. The link to that input file is here: http://paste.pocoo.org/show/188415/ Please try executing the Python program on this input file and compare its output with the "acle5v1_op.txt" you got. They should differ in the way they appear on the screen.
mgj
Also, could you please tell me which editor you are using to view this file? I honestly feel that "acle5v1.txt" contains some characters that are not normally visible, and that this is the reason the output "acle5v1_op.txt" comes out as one continuous line running outside the screen, in an uneditable and unreadable form. (By unreadable I do not mean it cannot be seen, just that its contents cannot be, for example, read into another file.) I tried word wrapping for "acle5v1_op.txt" and it is not working. Please see what you can do. Thanks :)
mgj