Hello,

I am using Exuberant Ctags in its etags-compatible mode ("ctags -e"), and I am trying to understand the TAGS file format it generates. In particular, I want to understand line 2 of the TAGS file.

Wikipedia says that line #2 is described like this:

{src_file},{size_of_tag_definition_data_in_bytes}

In practice, though, line 2 of the TAGS file for "foo.c" looks like this:

foo.c,1683

My question is how exactly it arrives at this number, 1683.

I know it is the size of the "tag_definition", so what I want to know is: what counts as the "tag_definition"?

I have tried looking through the ctags source code, but perhaps someone better at C than me will have more success figuring this out.

Thanks!

EDIT #2:

^L^J
hello.c,79^J
float foo (float x) {^?foo^A3,20^J
float bar () {^?bar^A7,59^J
int main() {^?main^A11,91^J

Alright, so if I understand correctly, "79" is the number of bytes in the TAGS file starting just after "79^J" and running down to and including "91^J".

Makes perfect sense.

Now, Wikipedia says the numbers 20, 59, and 91 in this example are the {byte_offset}.

What is the {byte_offset} an offset from?

Thanks for all the help Ken!

+3  A: 

It's the number of bytes of tag data following the newline after the number.

Edit: It also doesn't include the ^L character between file tag data. Remember etags comes from a time long ago where reading a 500KB file was an expensive operation. ;)

Here's a complete tags file. I'm showing it two ways: first as text with control characters written as ^X so that nothing is invisible, then as a hex dump. The end-of-line characters implicit in your example are shown as ^J here:

^L^J
hello.cc,45^J
int main(^?5,41^J
int foo(^?9,92^J
int bar(^?13,121^J
^L^J
hello.h,15^J
#define X ^?2,1^J

Here's the same file displayed in hex (the offsets in the left column are octal):

0000000    0c  0a  68  65  6c  6c  6f  2e  63  63  2c  34  35  0a  69  6e
          ff  nl   h   e   l   l   o   .   c   c   ,   4   5  nl   i   n
0000020    74  20  6d  61  69  6e  28  7f  35  2c  34  31  0a  69  6e  74
           t  sp   m   a   i   n   ( del   5   ,   4   1  nl   i   n   t
0000040    20  66  6f  6f  28  7f  39  2c  39  32  0a  69  6e  74  20  62
          sp   f   o   o   ( del   9   ,   9   2  nl   i   n   t  sp   b
0000060    61  72  28  7f  31  33  2c  31  32  31  0a  0c  0a  68  65  6c
           a   r   ( del   1   3   ,   1   2   1  nl  ff  nl   h   e   l
0000100    6c  6f  2e  68  2c  31  35  0a  23  64  65  66  69  6e  65  20
           l   o   .   h   ,   1   5  nl   #   d   e   f   i   n   e  sp
0000120    58  20  7f  32  2c  31  0a                                    
           X  sp del   2   ,   1  nl

There are two sets of tag data in this example: 45 bytes of data for hello.cc and 15 bytes for hello.h.

The hello.cc data starts on the line following "hello.cc,45^J" and runs for 45 bytes, which here is exactly three complete lines. The size is given in bytes so that code reading the file can allocate room for a 45-byte string and read 45 bytes in one go. The "^L^J" line comes after the 45 bytes of tag data. A reader can use it as a marker that another file's tag data follows, and also to verify that the file is properly formatted.

The hello.h data starts on the line following "hello.h,15^J" and runs for 15 bytes.
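
To make the size field concrete, here is a minimal Python sketch of a reader that walks the example file above using only the "^L^J" separator and the byte count. The function name split_tags_sections is made up for illustration, and the sketch assumes every header line ends in a numeric size; it is not how etags or Emacs actually read the file.

```python
def split_tags_sections(data: bytes):
    """Split raw TAGS bytes into (filename, tag_data) pairs using the size field."""
    sections = []
    pos = 0
    while pos < len(data):
        # Each section starts with the two-byte separator: form feed, newline.
        if data[pos:pos + 2] != b"\x0c\n":
            raise ValueError(f"expected ^L^J at byte {pos}")
        pos += 2
        # Header line: "<filename>,<size>" terminated by a newline.
        eol = data.index(b"\n", pos)
        name, _, size = data[pos:eol].rpartition(b",")
        pos = eol + 1
        # The size counts exactly the bytes that follow the header's newline.
        n = int(size)
        sections.append((name.decode(), data[pos:pos + n]))
        pos += n
    return sections


# The same bytes as the hex dump above (^L = 0x0c, ^? = 0x7f, ^J = newline).
sample = (
    b"\x0c\nhello.cc,45\n"
    b"int main(\x7f5,41\n"
    b"int foo(\x7f9,92\n"
    b"int bar(\x7f13,121\n"
    b"\x0c\nhello.h,15\n"
    b"#define X \x7f2,1\n"
)

for name, body in split_tags_sections(sample):
    print(name, len(body))
```

Run on the sample, this prints "hello.cc 45" and "hello.h 15". After each jump of the stated size the reader lands exactly on the next "^L^J" (or on the end of the file), which is the consistency check described above.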

Ken Fox
Thanks! I have another question now, but I'll put it in the main post.
AlexCombas
Thanks a lot for the help, that makes sense now. But what is the {byte_offset}? I've updated the edit in the main post.
AlexCombas
+1  A: 

The {byte_offset} for a tag entry is the number of bytes from the start of the source file the function is defined in (here hello.c, not the TAGS file). The number before the byte offset is the line number. In your example:

hello.c,79^J
float foo (float x) {^?foo^A3,20^J

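the foo function is on line 3 and begins 20 bytes from the start of hello.c. You can verify that with a text editor that shows your cursor position in the file. You can also use the Unix tail command to display a file starting a given number of bytes in. Note that "tail -c +N" starts output at the Nth byte counting from 1, so to skip the first 20 bytes you ask for byte 21:

tail -c +21 hello.c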
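
The same check can be sketched in Python. This assumes the offset is a zero-based count of the bytes that come before the tagged line, which is consistent with the "2,1" entry for hello.h above (line 2 starts right after a one-byte first line). The helper names parse_tag_entry and line_at_offset are made up for illustration, and since the real hello.c was never posted, the source text below is a hypothetical stand-in laid out so that foo lands on line 3 at offset 20.

```python
import io


def parse_tag_entry(line: bytes):
    """Parse one entry of the form <text>^?[<name>^A]<line>,<byte_offset>."""
    text, _, rest = line.rstrip(b"\n").partition(b"\x7f")
    # The explicit tag name before ^A is optional (it is absent in the hello.cc example).
    name, sep, position = rest.rpartition(b"\x01")
    line_no, _, offset = position.partition(b",")
    return text, (name if sep else None), int(line_no), int(offset)


def line_at_offset(f, offset: int) -> bytes:
    """Seek to a byte offset in a binary file object and return the line found there."""
    f.seek(offset)
    return f.readline()


entry = b"float foo (float x) {\x7ffoo\x013,20\n"
text, name, line_no, offset = parse_tag_entry(entry)

# Hypothetical hello.c: a 19-byte first line plus a blank line put line 3 at byte 20.
source = io.BytesIO(
    b"#include <stdio.h>\n\nfloat foo (float x) {\n  return x;\n}\n"
)
found = line_at_offset(source, offset)
print(name.decode(), line_no, offset, found.startswith(text))
```

With the stand-in source this prints "foo 3 20 True": the line found at byte 20 starts with the text recorded in the tag entry. With a real file you would open it in binary mode (open("hello.c", "rb")) in place of the io.BytesIO object.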
Ken Fox
Thanks again Ken!
AlexCombas