ansaurus

Question

How to read a file which contains \uxxxx in VC++

Answer 1

+1 A:

It's not very easy to do while you're reading in the file. It's easier to do it as a post-processing step afterwards. You can use Boost.Regex to look for the pattern `\\u[0-9A-Fa-f]{4}` and replace each match with the corresponding single character.

MSalters 2010-06-30 11:14:51

This is how I'd do it, and it seems to be the only answer so far which is actually related to the question. However, if the file can contain non-BMP characters, the regex has to be modified a bit, e.g. `\\u[0-9A-Fa-f]{4,6}`.

Philipp 2010-06-30 13:25:22

Depends on how those characters would be encoded. A \u escape isn't entirely standard. I've seen \U00XXXXXX as well.

MSalters 2010-06-30 14:37:01
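
To illustrate the non-BMP point from these comments: once the hex digits of an escape such as \U00XXXXXX have been parsed into an integer, a code point above U+FFFF cannot be stored in a single 16-bit unit and has to be written as a UTF-16 surrogate pair. The following is only a minimal sketch. `append_utf16` is a hypothetical helper name, and it assumes a compiler that provides `std::u16string` (C++11); with Visual C++, where `wchar_t` is 16 bits wide, the same code units could be appended to a `std::wstring` instead.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-16 encoding of `code_point` to `out`.
// Returns false and appends nothing if the value is a lone surrogate
// (U+D800..U+DFFF) or lies above U+10FFFF.
bool append_utf16(unsigned long code_point, std::u16string& out) {
  if (code_point > 0x10FFFF || (code_point >= 0xD800 && code_point <= 0xDFFF)) {
    return false;
  }
  if (code_point < 0x10000) {
    // BMP character: one code unit is enough.
    out.push_back(static_cast<char16_t>(code_point));
    return true;
  }
  // Non-BMP character: split the 20-bit offset into two 10-bit halves.
  const unsigned long offset = code_point - 0x10000;
  out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));    // high surrogate
  out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));  // low surrogate
  return true;
}
```

A BMP value such as 0x041F yields one code unit, while 0x1F600 yields the pair 0xD83D, 0xDE00.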

Answer 2

A:

When you read it from the file, you have to parse the input and check whether you encounter a UTF-8 symbol. Read it as text and do a check like in this pseudo-code:

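#include <string>

// Decodes four hex digits and appends the resulting character to `result`
// (defined separately).
void decodeAndAppendUtf8(const std::string& hexDigits, std::string& result);

std::string replaceEscapes(const std::string& str)
{
  std::string result;
  std::string::size_type pos = 0;
  while (pos < str.size())
  {
    // A complete escape takes six characters: '\', 'u' and four hex digits.
    if (pos + 6 <= str.size() && str[pos] == '\\' && str[pos + 1] == 'u')
    {
      const std::string utfSymbol = str.substr(pos + 2, 4);  // the four hex digits
      pos += 6;
      decodeAndAppendUtf8(utfSymbol, result);
    }
    else
    {
      result += str[pos++];  // ordinary character: copy it unchanged
    }
  }
  return result;
}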
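
The pseudo-code leaves `decodeAndAppendUtf8` undefined. One possible reading, shown here only as a sketch, is that it receives the four hex digits following `\u`, interprets them as a BMP code point, and appends that code point to `result` encoded as UTF-8. The parameter names and the choice of UTF-8 as the target are assumptions; a program that needs UTF-16 (`wchar_t` on Windows) would append the code point to a wide string instead.

```cpp
#include <cstdlib>
#include <string>

// Sketch only. Assumes `hexDigits` holds exactly four valid hex digits, so the
// code point is at most U+FFFF. Surrogate values (D800..DFFF) are not checked.
void decodeAndAppendUtf8(const std::string& hexDigits, std::string& result)
{
  const unsigned long codePoint = std::strtoul(hexDigits.c_str(), 0, 16);
  if (codePoint < 0x80)
  {
    // One byte: 0xxxxxxx
    result += static_cast<char>(codePoint);
  }
  else if (codePoint < 0x800)
  {
    // Two bytes: 110xxxxx 10xxxxxx
    result += static_cast<char>(0xC0 | (codePoint >> 6));
    result += static_cast<char>(0x80 | (codePoint & 0x3F));
  }
  else
  {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    result += static_cast<char>(0xE0 | (codePoint >> 12));
    result += static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F));
    result += static_cast<char>(0x80 | (codePoint & 0x3F));
  }
}
```

For example, the digits "041f" append the two bytes 0xD0 0x9F, which is the UTF-8 encoding of U+041F.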

Haspemulator 2010-06-30 11:47:42

Thanks, maybe you can post the decodeAndAppendUtf8 function?

Velutis 2010-06-30 12:19:13

It has nothing to do with UTF-8.

Philipp 2010-06-30 13:20:01

Answer 3

A:

Check this code :) The Windows SDK already has this for you; the MS geeks thought of this too. You can find more details in this post: http://weblogs.asp.net/kennykerr/archive/2008/07/24/visual-c-in-short-converting-between-unicode-and-utf-8.aspx

#include <atlconv.h>
#include <atlstr.h>

#define ASSERT ATLASSERT

int main()
{
    const CStringW unicode1 = L"\u041f and \x03A9"; // Cyrillic 'Pe' (U+041F) and Greek 'Omega' (U+03A9)

    const CStringA utf8 = CW2A(unicode1, CP_UTF8);

    ASSERT(utf8.GetLength() > unicode1.GetLength());

    const CStringW unicode2 = CA2W(utf8, CP_UTF8);

    ASSERT(unicode1 == unicode2);   

    return 0;
}

I have tested this code and it works fine.

garzanti 2010-06-30 11:53:42

Thanks, but I had tried it with no luck; I don't think it is an encoding problem. OK, for example, I have CString a = "\\uXXXX". Now how do I make variable a hold the character \uXXXX using 'utfcpp'? Is it possible?

Velutis 2010-06-30 12:22:46

@Velutis: Please check the newly edited response.

garzanti 2010-06-30 13:11:46

@garzanti: Why do you want to mess around with UTF-8 at all?

Philipp 2010-06-30 13:21:47

The problem is that I don't have this: `const CStringW unicode1 = L"\u041f"`. I have this: `const CStringW unicode1 = L"\\u041f"` (a doubled escape sequence).

Velutis 2010-06-30 13:25:56

@Philipp: How exactly am I messing with UTF-8? His string is UTF-16, and this is just a sample showing what can be done. From my point of view it's OK. I think you are just exaggerating. Please give a solution yourself first before criticizing, or at least make a fair comment saying what exactly is not OK, instead of sarcastic statements.

garzanti 2010-06-30 13:37:00

@Velutis: In your initial question there wasn't any double escape sequence :) so... sincerely, I have no solution in this case. What you could do is replace the double backslashes with single ones, using regular expressions for example.

garzanti 2010-06-30 13:39:43

@garzanti: Your code contains a fine example of how to convert from UTF-16 to UTF-8 and back using ATL functions. However, this is just not related to the question, which requires conversions from Unicode escape sequences to UTF-16.

Philipp 2010-06-30 14:31:22

Answer 4

+1 A:

Here is an example for MSalters's suggestion:

#include <iostream>
#include <string>
#include <fstream>
#include <algorithm>
#include <sstream>
#include <iomanip>
#include <locale>

#include <boost/scoped_array.hpp>
#include <boost/regex.hpp>
#include <boost/numeric/conversion/cast.hpp>

// Replaces every \uXXXX escape sequence in `source` with the wide character it denotes.
std::wstring convert_unicode_escape_sequences(const std::string& source) {
  const boost::regex regex("\\\\u([0-9A-Fa-f]{4})");  // NB: no support for non-BMP characters
  // The output is never longer than the input, so source.size() elements suffice.
  boost::scoped_array<wchar_t> buffer(new wchar_t[source.size()]);
  wchar_t* const output_begin = buffer.get();
  wchar_t* output_iter = output_begin;
  std::string::const_iterator last_match = source.begin();
  for (boost::sregex_iterator input_iter(source.begin(), source.end(), regex), input_end; input_iter != input_end; ++input_iter) {
    const boost::smatch& match = *input_iter;
    // Copy the text between the previous match and this one unchanged.
    output_iter = std::copy(match.prefix().first, match.prefix().second, output_iter);
    // Parse the four captured hex digits into a number.
    std::stringstream stream;
    stream << std::hex << match[1].str() << std::ends;
    unsigned int value;
    stream >> value;
    *output_iter++ = boost::numeric_cast<wchar_t>(value);
    last_match = match[0].second;
  }
  // Copy whatever follows the last escape sequence.
  output_iter = std::copy(last_match, source.end(), output_iter);
  return std::wstring(output_begin, output_iter);
}

int wmain() {
  std::locale::global(std::locale(""));
  const std::wstring filename = L"test.txt";
  std::ifstream stream(filename.c_str(), std::ios::in | std::ios::binary);
  stream.seekg(0, std::ios::end);
  const std::ifstream::streampos size = stream.tellg();
  stream.seekg(0);
  boost::scoped_array<char> buffer(new char[size]);
  stream.read(buffer.get(), size);
  const std::string source(buffer.get(), size);
  const std::wstring result = convert_unicode_escape_sequences(source);
  std::wcout << result << std::endl;
}

I'm always surprised how complicated seemingly simple things like this are in C++.

Philipp 2010-06-30 14:28:18

I think you've got a spurious w in `const std::wstring filename = "test.txt";`. Also, why the insanely complex file reading? It seems you can trivially write `std::string line; while (std::getline(stream, line)) { }` with an ordinary text stream. Converting the \uXXXX escapes can then be done line by line.

MSalters 2010-06-30 14:53:04

The `wstring filename` is deliberate (I just forgot the `L` prefix) to allow Unicode filenames on Windows. AFAIK `ifstream` contains an appropriate, nonstandard constructor. Regarding the file reading, I think it's the most straightforward way to read a complete file, given the ridiculously limited iostream API. Reading line by line is a bit shorter, but still a workaround.

Philipp 2010-06-30 15:15:37
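
For comparison, the two reading strategies discussed in these comments can be sketched with the standard library alone. `read_all` and `read_lines` are hypothetical helper names, not part of the answer's code. The first slurps the rest of a stream by streaming its buffer into a string stream; the second uses `std::getline`, which assumes that no escape sequence is split across a line break.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads everything that remains in `input` into one string.
std::string read_all(std::istream& input) {
  std::ostringstream contents;
  contents << input.rdbuf();
  return contents.str();
}

// Reads `input` line by line; the '\n' terminator is dropped from each line.
std::vector<std::string> read_lines(std::istream& input) {
  std::vector<std::string> lines;
  std::string line;
  while (std::getline(input, line)) {
    lines.push_back(line);
  }
  return lines;
}
```

Either result can then be passed to `convert_unicode_escape_sequences`, as a whole or one line at a time. Note that if the file is opened in binary mode and uses Windows line endings, each line returned by `std::getline` keeps its trailing '\r'.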

Thanks, this is what I needed.

Velutis 2010-07-02 06:27:50
