Hello, I have a text file with the following content:

\u041f\u0435\u0440\u0432\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u0430\u043a\u0442\u0438\u0432\u043d\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442_\u043a\u0430\u043d\u0430\u043b

How can I read such a file to get a result like this: "Первый_интерактивный_интернет_канал"?

If I type this:

string str = _T("\u041f\u0435\u0440\u0432\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u0430\u043a\u0442\u0438\u0432\u043d\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442_\u043a\u0430\u043d\u0430\u043b");

then the result in 'str' is good, but if I read it from the file, then it is the same as in the file. I guess it is because '\u' becomes '\\u'. Is there a simple way to convert \uXXXX notation to the corresponding symbols in C++? Thanks.

+1  A: 

It's not very easy to do while you're reading in the file. It's easier to do a post-processing step afterwards. You can use Boost.Regex to look for the pattern "\\u[0-9A-Fa-f]{4}" (a literal backslash, 'u' and four hex digits), and replace each match with the corresponding single character.

MSalters
This is how I'd do it, and seems to be the only answer so far which is actually related to the question. However, if the file can contain non-BMP characters, the regex has to be modified a bit, e.g. `\\u[0-9A-Fa-f]{4,6}`.
Philipp
Depends on how those characters would be encoded. A \u escape isn't entirely standard. I've seen \U00XXXXXX as well.
MSalters
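For the non-BMP case raised in these comments, here is a minimal hand-written sketch (not taken from any answer in this thread). It assumes that characters outside the BMP appear either as a pair of \uXXXX surrogate escapes or as a single \UXXXXXXXX escape, and that everything else in the input is plain ASCII. The names parse_hex and decode_escapes are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Parses exactly `count` hex digits starting at s[pos]. Returns false if the
// input is too short or contains a non-hex character.
static bool parse_hex(const std::string& s, std::size_t pos, std::size_t count,
                      std::uint32_t& value) {
  if (pos > s.size() || s.size() - pos < count) return false;
  value = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const char c = s[pos + i];
    std::uint32_t digit;
    if (c >= '0' && c <= '9') digit = static_cast<std::uint32_t>(c - '0');
    else if (c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A' + 10);
    else return false;
    value = value * 16 + digit;
  }
  return true;
}

// Hypothetical helper: turns \uXXXX (including surrogate pairs) and
// \UXXXXXXXX escapes into code points. All other bytes are copied through
// unchanged, so the unescaped part of the input is assumed to be ASCII.
// Lone surrogates and out-of-range \U values are not rejected.
std::u32string decode_escapes(const std::string& s) {
  std::u32string out;
  std::size_t i = 0;
  while (i < s.size()) {
    std::uint32_t cp = 0;
    const bool escape = s[i] == '\\' && i + 1 < s.size();
    if (escape && s[i + 1] == 'u' && parse_hex(s, i + 2, 4, cp)) {
      i += 6;
      std::uint32_t low = 0;
      // A high surrogate followed by an escaped low surrogate is one code point.
      if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size() &&
          s[i] == '\\' && s[i + 1] == 'u' && parse_hex(s, i + 2, 4, low) &&
          low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i += 6;
      }
      out.push_back(static_cast<char32_t>(cp));
    } else if (escape && s[i + 1] == 'U' && parse_hex(s, i + 2, 8, cp)) {
      out.push_back(static_cast<char32_t>(cp));
      i += 10;
    } else {
      // Not a well-formed escape: keep the byte as it is.
      out.push_back(static_cast<char32_t>(static_cast<unsigned char>(s[i])));
      ++i;
    }
  }
  return out;
}
```

Calling decode_escapes on the six characters \u041f gives a one-element string holding U+041F. Whether a given file uses surrogate pairs, \U escapes, or something else depends on the tool that wrote it.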
A: 

When you read it from a file, you have to parse the input yourself and check whether you encounter a \uXXXX escape sequence. Read the file as text and do a check like in this pseudo-code:

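// str holds the whole file content; decodeAndAppendUtf8 is not shown here.
size_t pos = 0;
string result;
while (pos < str.size())
{
  // An escape is a backslash, 'u' and four hex digits: six characters in total.
  if (pos + 6 <= str.size() && str[pos] == '\\' && str[pos+1] == 'u')
  {
    string utfSymbol = str.substr(pos + 2, 4);  // the four hex digits
    pos += 6;
    decodeAndAppendUtf8(utfSymbol, result);
  }
  else
  {
    result += str[pos++];
  }
}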
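The decodeAndAppendUtf8 helper is not shown in the pseudo-code. A minimal sketch of what it might look like follows. It assumes that the first argument holds exactly the four hex digits following "\u", that the digits describe a BMP code point, and that the result should be UTF-8 encoded. It is an illustration, not the answerer's actual function.

```cpp
#include <cstdlib>
#include <string>

// Sketch of the missing helper. `hex` is assumed to hold the four hex digits
// that follow "\u". The UTF-8 encoding of that code point is appended to
// `result`. Surrogates (U+D800..U+DFFF) are not treated specially here.
void decodeAndAppendUtf8(const std::string& hex, std::string& result) {
  // Mask to 16 bits because only BMP code points are expected.
  const unsigned long cp = std::strtoul(hex.c_str(), nullptr, 16) & 0xFFFFul;
  if (cp < 0x80) {
    // One byte: 0xxxxxxx
    result += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx
    result += static_cast<char>(0xC0 | (cp >> 6));
    result += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    result += static_cast<char>(0xE0 | (cp >> 12));
    result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    result += static_cast<char>(0x80 | (cp & 0x3F));
  }
}
```

With this sketch, the digits 041f append the two bytes D0 9F, which is the UTF-8 form of "П". On Windows a UTF-16 wide string may be the more convenient target; in that case the parsed value can be appended to a std::wstring directly.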
Haspemulator
Thanks, maybe you can post the decodeAndAppendUtf8 function?
Velutis
It has nothing to do with UTF-8.
Philipp
A: 

Check this code :) The Windows SDK already has this for you; the MS folks thought of it too. You can find more details in this post: http://weblogs.asp.net/kennykerr/archive/2008/07/24/visual-c-in-short-converting-between-unicode-and-utf-8.aspx

#include <atlconv.h>
#include <atlstr.h>

#define ASSERT ATLASSERT

int main()
{
    const CStringW unicode1 = L"\u041f and \x03A9"; // Cyrillic 'Pe' and Greek 'Omega'

    const CStringA utf8 = CW2A(unicode1, CP_UTF8);

    ASSERT(utf8.GetLength() > unicode1.GetLength());

    const CStringW unicode2 = CA2W(utf8, CP_UTF8);

    ASSERT(unicode1 == unicode2);   

    return 0;
}

I have tested this code and it works fine.

garzanti
Thanks, but I tried it with no luck; I don't think it is an encoding problem. For example, I have CString a = "\\uXXXX". How do I make variable a hold the character \uXXXX using 'utfcpp'? Is that possible?
Velutis
@Velutis: Please check the edited answer.
garzanti
@garzanti: Why do you want to mess around with UTF-8 at all?
Philipp
The problem is I don't have this: const CStringW unicode1 = L"\u041f"; I have this: const CStringW unicode1 = L"\\u041f" (double escape sequence).
Velutis
@Philipp: How exactly am I messing with UTF-8? His string is UTF-16, and this is just a sample showing what it can do. From my point of view it's OK. I think you are just exaggerating. Please give a solution yourself first before criticizing, or at least make a fair comment saying what exactly is not OK, instead of sarcastic statements.
garzanti
@Velutis: There wasn't any double escape sequence in your initial question :) so... sincerely, I have no solution in this case. What you could do is replace the double backslashes with single ones, using regular expressions for example.
garzanti
@garzanti: Your code contains a fine example of how to convert from UTF-16 to UTF-8 and back using ATL functions. However, this is just not related to the question, which requires conversions from Unicode escape sequences to UTF-16.
Philipp
+1  A: 

Here is an example for MSalters's suggestion:

#include <iostream>
#include <string>
#include <fstream>
#include <algorithm>
#include <sstream>
#include <iomanip>
#include <locale>

#include <boost/scoped_array.hpp>
#include <boost/regex.hpp>
#include <boost/numeric/conversion/cast.hpp>

// Replaces every \uXXXX escape in source with the corresponding wide character.
// All other bytes are widened one-to-one, so they should be plain ASCII.
std::wstring convert_unicode_escape_sequences(const std::string& source) {
  const boost::regex regex("\\\\u([0-9A-Fa-f]{4})");  // NB: no support for non-BMP characters
  // Each six-char escape becomes one wchar_t, so the output never exceeds source.size().
  boost::scoped_array<wchar_t> buffer(new wchar_t[source.size()]);
  wchar_t* const output_begin = buffer.get();
  wchar_t* output_iter = output_begin;
  std::string::const_iterator last_match = source.begin();
  for (boost::sregex_iterator input_iter(source.begin(), source.end(), regex), input_end; input_iter != input_end; ++input_iter) {
    const boost::smatch& match = *input_iter;
    // Copy the unescaped text between the previous match and this one.
    output_iter = std::copy(match.prefix().first, match.prefix().second, output_iter);
    // Parse the four hex digits captured by the first group.
    std::istringstream stream(match[1].str());
    unsigned int value = 0;
    stream >> std::hex >> value;
    *output_iter++ = boost::numeric_cast<wchar_t>(value);
    last_match = match[0].second;
  }
  // Copy whatever follows the last escape.
  output_iter = std::copy(last_match, source.end(), output_iter);
  return std::wstring(output_begin, output_iter);
}

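int wmain() {
  std::locale::global(std::locale(""));
  const std::wstring filename = L"test.txt";
  // Opening a stream with a wide filename relies on a nonstandard Visual C++ constructor.
  std::ifstream stream(filename.c_str(), std::ios::in | std::ios::binary);
  if (!stream) {
    std::wcerr << L"Cannot open " << filename << std::endl;
    return 1;
  }
  // Determine the file size, then read the complete file into a buffer.
  stream.seekg(0, std::ios::end);
  const std::ifstream::streampos size = stream.tellg();
  stream.seekg(0);
  boost::scoped_array<char> buffer(new char[size]);
  stream.read(buffer.get(), size);
  const std::string source(buffer.get(), size);
  const std::wstring result = convert_unicode_escape_sequences(source);
  std::wcout << result << std::endl;
  return 0;
}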

I'm always surprised how complicated seemingly simple things like this are in C++.

Philipp
I think you've got a spurious w in `const std::wstring filename = "test.txt";`. Also, why the insanely complex file reading? It seems you can trivially write a `while(std::string line = getline(stream)) { }`, with an ordinary text stream. Converting the \uXXXX escapes can then be done line by line.
MSalters
The `wstring filename` is deliberate (I just forgot the `L` prefix) to allow Unicode filenames on Windows. AFAIK `ifstream` contains an appropriate, nonstandard constructor. Regarding the file reading, I think it's the most straightforward way to read a complete file, given the ridiculously limited iostream API. Reading line by line is a bit shorter, but still a workaround.
Philipp
Thanks, this is what I needed.
Velutis