I'd like to learn the foundations of encodings, characters, and text. Understanding these is important for dealing with large bodies of text, whether they are log files or source text for building collective-intelligence algorithms. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."

I'm not saying I need to learn advanced topics right away, but I do need to know:

  • Bit- and byte-level knowledge of encodings.
  • Characters and alphabets not used in English.
  • Multi-byte encodings. (I understand some Chinese and Japanese, and parsing them is important.)
  • Regular expressions.
  • Algorithms for text processing.
  • Parsing natural languages.

I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time) requires processing, parsing, and analyzing large amounts of text.

I'm looking for some resources (maybe books?) to get me started with some of these bullets. (I've found many helpful discussions on regular expressions here on Stack Overflow, so you don't need to suggest resources on that topic.)

A: 

As is usual for most general "I want to learn about X topic" questions, Wikipedia is a good place to start:

http://en.wikipedia.org/wiki/Character_encoding

http://en.wikipedia.org/wiki/Natural_language_processing

Amber
+1  A: 
  • In addition to Wikipedia, Joel Spolsky's article on encoding is really good too.
  • This free character map is a nice resource for all Unicode characters.
  • This regular expression tutorial can be helpful.
  • Specifically on NLP and Japanese, you could take a look at this Japanese NLP project.
  • On text processing, this open-source project can be useful.
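
To make the bit- and byte-level side of the question concrete, here is a minimal Python sketch (not taken from any of the resources above) that shows how the same character maps to different byte sequences under different encodings. The function name `describe` and the default encoding list are just illustrative choices; it relies only on Python's built-in `str.encode`.

```python
def describe(text, encodings=("utf-8", "utf-16-le", "shift_jis")):
    """Return one dict per character with its code point and encoded bytes.

    'describe' is a hypothetical helper written for this example.
    A value of None means the encoding cannot represent that character.
    """
    rows = []
    for ch in text:
        row = {"char": ch, "code_point": f"U+{ord(ch):04X}"}
        for enc in encodings:
            try:
                raw = ch.encode(enc)
                # Show each byte as two hex digits, e.g. "e6 97 a5".
                row[enc] = " ".join(f"{b:02x}" for b in raw)
            except UnicodeEncodeError:
                row[enc] = None
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in describe("Aé日"):
        print(row)
```

Running it on a string such as "Aé日" shows that UTF-8 uses one, two, and three bytes for those characters respectively, which is a reasonable first thing to verify before reading further about multi-byte encodings.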
Lars Andren