I have been looking at LLVM lately, and I find it to be quite an interesting architecture. However, looking through the tutorial and the reference material, I can't see any examples of how I might implement a string data type. There is a lot of documentation about integers, reals, and other number types, and even arrays, functions and structures, but AFAIK nothing about strings. Would I have to add a new data type to the backend? Is there a way to use built-in data types? Any insight would be appreciated.

p.s. Sorry if some of my terminology is wrong - I didn't quite know how to describe all the different parts of LLVM.

+3  A: 

What is a string? An array of characters.

What is a character? An integer.

So while I'm no LLVM expert by any means, I would guess that if, e.g., you wanted to represent some 8-bit character set, you'd use an array of i8 (8-bit integers), or a pointer to i8. And indeed, if we have a simple hello world C program:

#include <stdio.h>

int main() {
        puts("Hello, world!");
        return 0;
}

And we compile it using llvm-gcc and dump the generated LLVM assembly:

$ llvm-gcc -S -emit-llvm hello.c
$ cat hello.s
; ModuleID = 'hello.c'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
target triple = "x86_64-linux-gnu"
@.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

define i32 @main() {
entry:
        %retval = alloca i32            ; <i32*> [#uses=2]
        %tmp = alloca i32               ; <i32*> [#uses=2]
        %"alloca point" = bitcast i32 0 to i32          ; <i32> [#uses=0]
        %tmp1 = getelementptr [14 x i8]* @.str, i32 0, i64 0            ; <i8*> [#uses=1]
        %tmp2 = call i32 @puts( i8* %tmp1 ) nounwind            ; <i32> [#uses=0]
        store i32 0, i32* %tmp, align 4
        %tmp3 = load i32* %tmp, align 4         ; <i32> [#uses=1]
        store i32 %tmp3, i32* %retval, align 4
        br label %return

return:         ; preds = %entry
        %retval4 = load i32* %retval            ; <i32> [#uses=1]
        ret i32 %retval4
}

declare i32 @puts(i8*)

Notice the reference to the puts function declared at the end of the file. In C, puts is

int puts(const char *s)

In LLVM, it is

i32 @puts(i8*)

The correspondence should be clear.
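
To make the `[14 x i8]` in the dump concrete, here is a small C sketch of the same literal viewed as an array of 8-bit integers. The names `hello`, `hello_size` and `hello_ptr` are made up for illustration and do not appear in the compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* The same literal as in hello.c, stored as an explicit array of chars.
   The array size is deduced from the initializer. */
static const char hello[] = "Hello, world!";

/* Number of bytes the literal occupies, including the trailing '\0'.
   This is the 14 in [14 x i8]. */
size_t hello_size(void) {
    return sizeof hello;
}

/* Pointer to the first element of the array. This plays the role of the
   i8* that getelementptr produces before the call to puts. */
const char *hello_ptr(void) {
    assert(hello[sizeof hello - 1] == '\0');
    return &hello[0];
}
```

Thirteen visible characters plus the terminating zero byte give the fourteen elements, and passing `hello_ptr()` to `puts` would print the same line as the original program.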

As an aside, the generated LLVM is very verbose here because I compiled without optimizations. If you turn those on, the unnecessary instructions disappear:

$ llvm-gcc -O2 -S -emit-llvm hello.c
$ cat hello.s 
; ModuleID = 'hello.c'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
target triple = "x86_64-linux-gnu"
@.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

define i32 @main() nounwind  {
entry:
        %tmp2 = tail call i32 @puts( i8* getelementptr ([14 x i8]* @.str, i32 0, i64 0) ) nounwind              ; <i32> [#uses=0]
        ret i32 0
}

declare i32 @puts(i8*)
Jason Creighton
Hmm, okay - so if I want to use strings like many of the interpreted languages nowadays do (not just an array but including length, etc) I would have to declare that as some sort of structure which carries the extra baggage with it - would it have to be a whole new type in the backend?
a_m0d
Yeah, that's basically right, but it doesn't have to be a new type in the backend. You can just use an LLVM struct to store the data you need, and then define some functions that act on your string wrapper. Like Zifre says, it really is a low level virtual machine.
Jason Creighton
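
As a rough sketch of the "struct plus helper functions" idea, here it is written in C rather than LLVM IR. The type `lstring` and its functions are hypothetical names. On a 64-bit target the struct would probably correspond to an LLVM struct type along the lines of `{ i64, i8* }`, but that mapping is an assumption, not something taken from the LLVM docs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string: the length travels with the data,
   so no terminator is required and embedded zero bytes are allowed. */
typedef struct {
    size_t len;        /* number of bytes in the string */
    const char *data;  /* pointer to the first byte */
} lstring;

/* Wrap a NUL-terminated C string without copying it. */
lstring lstring_from_cstr(const char *s) {
    assert(s != NULL);
    lstring r = { strlen(s), s };
    return r;
}

/* Two strings are equal if they have the same length and the same bytes. */
int lstring_equal(lstring a, lstring b) {
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

A front end would emit the equivalent struct type once and then generate calls to functions like these wherever the source language uses a string.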
Okay, I have found out that you can create nice little arrays in LLVM, but I haven't found anywhere that shows how to re-allocate these arrays to a different size (which I will need to do if I want to make a string longer).
a_m0d
I believe that LLVM arrays are fixed-size, like C arrays generally are. How you resize dynamically depends on what your memory management model is. If you're allocating memory manually, you could just use the system's `malloc`, `realloc` and `free` calls to manage your memory. If you want to use garbage collection without a ton of work, it might be possible to bolt on the Boehm conservative GC. (I don't know, I haven't done it.)
Jason Creighton
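
For the manual-allocation route, a minimal sketch of a growable string built on `realloc` might look like this. The `strbuf` type, its function names, the starting capacity of 16 and the doubling policy are all illustrative assumptions. Generated LLVM code would call the same C library functions after declaring them, much like `puts` is declared in the dump above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string: current length, allocated capacity,
   and a heap buffer that is not NUL-terminated. */
typedef struct {
    size_t len;
    size_t cap;
    char *data;
} strbuf;

/* Start empty; the buffer is allocated on the first append. */
void strbuf_init(strbuf *b) {
    assert(b != NULL);
    b->len = 0;
    b->cap = 0;
    b->data = NULL;
}

/* Append n bytes, growing the buffer with realloc when needed.
   Returns 0 on success, -1 if memory could not be allocated. */
int strbuf_append(strbuf *b, const char *src, size_t n) {
    assert(b != NULL && src != NULL);
    if (b->len + n > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        while (new_cap < b->len + n)
            new_cap *= 2;
        char *p = realloc(b->data, new_cap);  /* realloc(NULL, n) acts like malloc */
        if (p == NULL)
            return -1;                        /* the old buffer is still valid */
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Release the buffer and reset to the empty state. */
void strbuf_free(strbuf *b) {
    assert(b != NULL);
    free(b->data);
    strbuf_init(b);
}
```

Doubling the capacity keeps the number of `realloc` calls logarithmic in the final length, at the cost of some unused space.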
+1  A: 

Think about how a string is represented in common languages:

  • C: a pointer to a character. You don't have to do anything special.
  • C++: string is a complex object with a constructor, destructor, and copy constructor. On the inside, it usually holds essentially a C string.
  • Java/C#/...: a string is a complex object holding an array of characters.

LLVM's name is very self-explanatory. It really is "low level". You have to implement strings however you want them to be. It would be silly for LLVM to force anyone into a specific implementation.

Zifre
+1  A: 

[To follow up on other answers which explain what strings are, here is some implementation help]

Using the C interface, the calls you'll want are something like:

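#include <llvm-c/Core.h>

// Assumed to be created elsewhere, e.g. with LLVMModuleCreateWithName().
extern LLVMModuleRef mod;

// Emits an internal, constant global [len x i8] initialized with `data`.
LLVMValueRef llvmGenLocalStringVar(const char* data, int len)
{
  LLVMValueRef glob = LLVMAddGlobal(mod, LLVMArrayType(LLVMInt8Type(), len), "string");

  // set as internal linkage and constant
  LLVMSetLinkage(glob, LLVMInternalLinkage);
  LLVMSetGlobalConstant(glob, 1);

  // Initialize with the string. The last argument is DontNullTerminate, so
  // exactly len bytes are emitted; include the '\0' in len if you need one.
  LLVMSetInitializer(glob, LLVMConstString(data, len, 1));

  return glob;
}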
Legooolas