Parsing TeX code is by no means the only way to get a list of all control sequences. Other possibilities include:
cause TeX to dump its data structures after loading the relevant package, then parse the dump file; when you run the latex command, what really happens is that the tex binary loads the latex.fmt dump file, which was generated by having the same binary parse all the built-in code of LaTeX and then dump its data structures;
modify the source code of TeX to output something every time a control sequence gets defined;
run TeX in a scriptable debugger, set a breakpoint where a control sequence is inserted into the hash table, and have a script output the name of the sequence.
None of these is likely to be a particularly easy solution, but each is probably easier than writing a TeX-equivalent parser yourself. To get started, look at TeX: The Program and your TeX system's source code.
If your goal is to provide "intellisense" in an editor, a mere list of control sequences is not going to be much help: when the user types \ref{ you should offer a list of labels defined in the document (bonus points if typing Chapter~\ref{ results in a list of chapter labels, not all labels); for \settowidth{ a list of length commands; for \begin{ a list of environments; etc.
You could see what AUCTeX (an Emacs mode) does; it has a limited, regexp-based parser that handles the common case, and a bunch of package-specific libraries that extend the functionality.
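
To make the context-dependent completion idea concrete, here is a minimal regexp-based sketch in Python. It is not how AUCTeX works internally; the names CONTEXTS, candidates and complete are made up for illustration, and the patterns only cover the common case (labels from \label, lengths from \newlength, environments from \begin and \newenvironment).

```python
import re

# Hypothetical table: a pattern for the text left of the cursor, and the
# kind of completion to offer when it matches. Group 1 is the typed prefix.
CONTEXTS = [
    (re.compile(r"\\(?:ref|pageref)\{([^{}]*)$"), "label"),
    (re.compile(r"\\settowidth\{([^{}]*)$"), "length"),
    (re.compile(r"\\begin\{([^{}]*)$"), "environment"),
]

# Patterns that harvest candidates from the document text. These are
# assumptions covering the common case only; they do not expand macros.
HARVESTERS = {
    "label": re.compile(r"\\label\{([^{}]+)\}"),
    "length": re.compile(r"\\newlength\{?(\\[A-Za-z@]+)\}?"),
    "environment": re.compile(r"\\(?:begin|newenvironment)\{([A-Za-z*]+)\}"),
}


def candidates(document, kind):
    """Return every name of the given kind found in the document text."""
    return HARVESTERS[kind].findall(document)


def complete(document, before_cursor):
    """Return (kind, sorted matches) for the text left of the cursor."""
    for pattern, kind in CONTEXTS:
        match = pattern.search(before_cursor)
        if match:
            prefix = match.group(1)
            names = {c for c in candidates(document, kind) if c.startswith(prefix)}
            return kind, sorted(names)
    # No known context: the caller could fall back to a plain command list.
    return None, []
```

Calling complete(document, text_before_cursor) with the cursor just after \ref{sec: would return the kind "label" and the labels starting with sec:. Package-specific behaviour could be added by extending the two tables, which is roughly the role the package-specific libraries play in the AUCTeX design described above.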