The tokenizer itself is usually written using a large DFA table that describes all possible valid tokens (stuff like, a token can start with a letter followed by other letters/numbers followed by a non-letter, or with a number followed by other numbers and either a non-number/point or a point followed by at least 1 number and then a non-number, etc etc). The way i built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFA's and combine them.
Now to "remove comments", when you're parsing a token you can have a comment token (the regex to parse a comment, too long to describe in words), and when you finish parsing this comment you just parse a new token, thus ignoring it. Alternatively you can pass it to the compiler and let it deal with it (or ignore it as it will). Either aproach will preserve meta-data like line numbers and characters-into-the-line.
edit for DFA theory:
Every regular expression can be converted (and is converted) into a DFA for performance reasons. This removes any backtracking in parsing them. This link gives you an idea of how this is done. You first convert the regular expression into an NFA (a DFA with backtracking), then remove all the backtracking nodes by inflating your finite automata.
Another way you can build your DFA is by hand using some common sense. Take for example a finite automata that can parse either an identifier or a number. This of course isn't enough, since you most likely want to add comments too, but it'll give you an idea of the underlying structures.
A-Z space
->(Start)----->(I1)------->((Identifier))
| | ^
| +-+
| A-Z0-9
|
| space
+---->(N1)---+--->((Number)) <----------+
0-9 | ^ | |
| | | . 0-9 space |
+-+ +--->(N2)----->(N3)--------+
0-9 | ^
+-+
0-9
Some notes on the notation used, the DFA starts at the (Start) node and moves through the arrows as input is read from your file. At any one point it can match only ONE path. Any paths missing are defaulted to an "error" node. ((Number))
and ((Identifier))
are your ending, success nodes. Once in those nodes, you return your token.
So from the start, if your token starts with a letter, it HAS to continue with a bunch of letters or numbers and end with a "space" (spaces, new lines, tabs, etc). There is no backtracking, if this fails the tokenizing process fails and you can report an error. You should read a theory book on error recovery to continue parsing, its a really huge topic.
If however your token starts with a number, it has to be followed by either a bunch of numbers or one decimal point. If there's no decimal point, a "space" has to follow the numbers, otherwise a number has to follow followed by a bunch of numbers followed by a space. I didn't include the scientific notation but it's not hard to add.
Now for parsing speed, this gets transformed into a DFA table, with all nodes on both the vertical and horizontal lines. Something like this:
I1 Identifier N1 N2 N3 Number
start letter nothing number nothing nothing nothing
I1 letter+number space nothing nothing nothing nothing
Identifier nothing SUCCESS nothing nothing nothing nothing
N1 nothing nothing number dot nothing space
N2 nothing nothing nothing nothing number nothing
N3 nothing nothing nothing nothing number space
Number nothing nothing nothing nothing nothing SUCCESS
The way you'd run this is you store your starting state and move through the table as you read your input character by character. For example an input of "1.2" would parse as start->N1->N2->N3->Number->SUCCESS. If at any point you hit a "nothing" node, you have an error.
edit 2: the table should actually be node->character->node, not node->node->character, but it worked fine in this case regardless. It's been a while since i last written a compiler by hand.