views:

117

answers:

4

Hello. I understand the most basic datatypes of C well enough: short, int, long, float, and so on, all the numerical types. The compiler needs to know these types to perform the right operations on the right numbers, for example to use the FPU to add two float values.

But when it comes to characters I am a little bit lost. I know that the basic C datatype char is there for encoding ASCII characters. What I don't know is why you even need another datatype for characters. Why couldn't you just use a 1-byte integer value to store an ASCII character? When you call printf, you specify the datatype in the call, so you could tell printf that the integer represents an ASCII character. I don't know how cout resolves the datatype, but I guess you could just specify it somehow.

Another thing is, when you want to use Unicode, you must use the datatype wchar. But what if I would like to use some other encoding, for example an ISO or Windows encoding, instead of UTF? Because wchar encodes characters as UTF-16 or UTF-32 (I read it is compiler specific). And what if I wanted to use, for example, some imaginary new 8-byte text encoding? What datatype should I use for it? I am actually pretty confused by this, because I always expected that if I wanted to use UTF-32 instead of ASCII, I would just tell the compiler "take the UTF-32 value of the character I typed and save it into a 4-byte field." I thought that text encoding was to be dealt with at the end, by the print function for example, and that I would just need to specify the encoding for the compiler to use. Since Windows doesn't use ASCII in Win32 apps, I guess the C compiler must convert the char I typed to ASCII from whatever encoding my editor sends it in.

And the last thing is, what if I want to use, for example, a 25-byte integer for some heavy math operations? C has no define-it-yourself datatype. Yes, I know this would be difficult, since all the math operations would need to be changed, because the CPU cannot add 25-byte numbers together. But is there a way to do it? Or is there some math library for it? What if I want to compute Pi to 1000000000000000 digits? :)

I know my question is pretty long, but I just wanted to explain my thoughts the best I can in English; since it's not my native language, that is difficult. And I believe there is a simple answer to my question(s), something I missed that explains everything. I have read a lot about text encoding, and lots of C tutorials, but nothing about this. Thank you for your time.

A: 

There is (was) no "1-byte integer" type other than char (and the signed and unsigned variants thereof). And though Windows NT (i.e. not 9x or ME) does use Unicode internally, your program will only use Unicode if you write it that way -- you have to either use WCHAR and the W versions of the Win32 calls, or use TCHAR and #define UNICODE.

SamB
+2  A: 

Your question is very broad; I'll try to address some of the specific issues you raised, and hopefully that will get you a bit more sorted out.

  • The char type can be thought of as just another numerical type, just like int, short and long. It is totally ok to write char a=3;. The difference is that with chars the compiler gives you some added value: instead of just numbers you can also assign ASCII characters to a variable, like char a='U';, and the variable will get the ASCII value of that character. You can also initialize arrays of characters using string literals, like so: char *s="hello";.
    This doesn't change the fact that after all char is still a numeric type and a string is just an array of numbers. If you look at the memory of the string, you'll see the ASCII codes of its characters.

  • The choice of char being 1 byte is largely historical, and C keeps it that way for backward compatibility. More modern languages like C# and Java define char to be 2 bytes.

  • You don't need "another" type for characters. char is just the numeric type that holds a single signed/unsigned byte, the same way short is the numeric type that holds a signed 16-bit word. The fact that this data type is used for characters and strings is just syntactic sugar provided by the compiler. A 1-byte integer == char.

  • printf() only works with chars because this is the way C was designed. If it were designed today it would possibly work with shorts. Indeed, on Windows you have a version of printf() which works with wide (16-bit) characters; it is called wprintf().

  • The type wchar_t is, on Windows, just another name for unsigned short. Somewhere in the Windows header files there is a declaration like this: typedef unsigned short wchar_t;, which makes this happen. You can use the two interchangeably. The advantage of using wchar_t is that whoever reads your code knows you now want to work with characters rather than numbers. Another reason is that if there's a remote chance that sometime Microsoft decides to switch to UTF-32, all they would need to do is redefine the typedef above accordingly and that's it (in reality this would be quite a bit more complicated to achieve, so this change is unlikely in the foreseeable future).

  • If you want to use some 8-bit encoding that is not ASCII, for instance the encoding for Hebrew which is called "Windows-1255", you just use chars. There are many such encodings, but these days using Unicode is almost always preferable. There is actually a version of Unicode itself which fits in 8-bit strings: UTF-8. If you're dealing with UTF-8 strings, you should work with the char data type. Nothing limits char to ASCII; since it is just a number, it can mean anything.

  • Working with such long numbers is usually done with arbitrary-precision ("bignum") arithmetic or with decimal types such as binary-coded decimal (BCD). C has no built-in support for either, but the basic idea is to handle a number somewhat like a string. In BCD, every digit of the decimal representation is saved using 4 bits, so one byte can hold the range 0-99, a 3-byte array can hold 0-999999, and so on. This way you can store numbers of any size.
    The downside is that calculations on them are a lot slower than calculations on normal binary numbers.
    There are libraries which do this kind of thing in C; the GNU MP (GMP) library is a well-known example.

shoosh
`char` can be either signed or unsigned - it is up to the compiler; and `short` is *always* signed (not unsigned, as you say).
caf
+1  A: 

Actually, there are plenty of languages where the types of variables arent known at compile-time. This does tend to add some run-time overhead though.

To answer your first question, I think you're getting hung up on the name "char". The char type is a one byte integer in C (actually not quite true- it's an integral type large enough to hold any character from the basic character set, but its size is implementation dependent.) Note that you can have both signed chars and unsigned chars, something that doesn't make a lot of sense if you're talking about a data type that only holds characters. But the one byte integer is called "char" in C because that's the most common use for it (again see disclaimer above.)

The rest of your question covers a lot of ground -- it might have been better to break it up into a few questions. Like the char type, wchar_t's size is implementation dependent; the only requirement is that it be large enough to hold any wide character. It's important to understand that Unicode, and character encodings in general, are actually independent of the C language. It's also important to understand that character sets are not the same thing as character encodings.

Here's an article (by one of SO's founders, I believe) that gives a brief intro to character sets and encodings: http://www.joelonsoftware.com/articles/Unicode.html. Once you have a better understanding of how they work, you'll be in a better position to formulate some questions for yourself. Do note that a lot of character encodings (the Windows code pages, for instance) only require a single byte per character.

T Duncan Smith
Actually, the C standard defines "`char`" and "byte" such that a `char` is *always* one byte.
SamB
This is true, but the standard also defines byte to be "large enough to hold any character from the basic character set." The definition used by the C standard is a bit different from what is usually meant by the word byte. But perhaps I should have been more precise and said that the C char is not necessarily an octet.
T Duncan Smith
+1  A: 

In C, char is a 1-byte integer, and that is also what is used to store a character. A character is just a 1-byte integer in C.

And, what if I would want to use for example some imaginary new 8 byte text coding?

You would have to build it yourself, based on the types available through your compiler/hardware. One approach could be to define a struct holding an array of 8 chars, and build functions to manipulate that struct with all the operations you'd want on it.

becouse I always expected that if I want to use UTF-32 instead of ASCII, I just tell compiler "get UTF-32 value of the character I typed and save it into 4 char field.

You're limited to the types of your C compiler, which are heavily influenced by the hardware (and the C standard, plus a bit of history). C is a low-level language and does not provide much magic. That said, there are library functions that can translate between (some) character encodings, e.g. mbtowc() and mbstowcs(), which convert a multibyte string in the current locale's encoding (ISO 8859-1, say) into wide characters in a buffer you provide.

And the last thing is, what if I want to use for example 25 Byte integer for some high math operations? C has no specify-yourself datatype.

C lets you define your own data types as structs, and you can build an abstraction on top of those. People have built libraries like this, see e.g. here. Other languages might let you model such types even more naturally, like C++, which also allows you to overload operators like +, -, * etc. to work on your own data types.

nos