This is in response to your original question, and your later answer/question.
I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.
I'm sure there are other ways to do this, either with your own custom code or with other search engines; Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want exactly the way you want it, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.
I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.
In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
11.2.8. stopwords
Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.
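To make the idea concrete, here is a rough sketch of what stopword filtering amounts to, written in Python purely for illustration (Sphinx does this internally from its stopwords file). The stopword set, the tokenizing regex, and the function name are all my own assumptions, not anything taken from Sphinx.

```python
import re

# Hypothetical stopword list; a real one would be built from the most
# frequent words in your own data.
STOPWORDS = {"the", "is", "of", "a", "an", "and"}


def index_terms(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(index_terms("The speed of Intel's CPUs is impressive"))
```

Note that "intel's" survives here because it isn't in the list; whether to drop it or normalize it to "intel" is a separate decision.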
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (i.e., "cpus" is different from "cpu")?
11.2.9. wordforms
Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (eg. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
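As a rough illustration of what a word-forms table does, here is a small Python sketch. The "source > destination" line format is how I remember the Sphinx wordforms file looking, so treat it as an assumption and check the documentation; the entries and function names are made up for the example.

```python
# Hypothetical word-forms table, one "source > destination" pair per line.
WORDFORMS_TEXT = """
walks > walk
walked > walk
walking > walk
cpus > cpu
leaves > leaf
"""


def parse_wordforms(text):
    """Build a {source: destination} mapping from the table text."""
    forms = {}
    for line in text.splitlines():
        if ">" not in line:
            continue  # skip blank or malformed lines
        source, target = (part.strip() for part in line.split(">", 1))
        forms[source] = target
    return forms


def normalize(tokens, forms):
    """Replace each token by its normal form; unknown tokens pass through."""
    return [forms.get(token, token) for token in tokens]


FORMS = parse_wordforms(WORDFORMS_TEXT)
print(normalize(["leaves", "cpus", "fish", "walking"], FORMS))
```

With a table like this, "cpu" and "cpus" end up as the same indexed term, so a search for either one matches both.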
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Sphinx supports the Porter Stemming Algorithm
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
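To give a feel for how the stemmer treats plurals, here is a Python sketch of just the first step of the Porter algorithm (step 1a, the plural rules). It is only one step of the full algorithm, and the function name is my own; Sphinx's built-in stemmer does all of the steps for you.

```python
def porter_step_1a(word):
    """Apply Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress stays caress
    if word.endswith("s"):
        return word[:-1]  # tests -> test
    return word


for word in ("tests", "fish", "leaves", "leaf"):
    print(word, "->", porter_step_1a(word))
```

Regular plurals like test/tests collapse to one stem and "fish" is left alone, but "leaf" and "leaves" do not end up with the same stem. Irregular plurals like that are the sort of exception you would handle with a word-forms entry.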
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
3.2. Attributes
A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to setup filtering, sorting, and grouping.
You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):
field search operator:
@vendor intel
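If you want to keep accepting "vendor:intel" from users, one option is to translate it into the field search operator before handing the query to Sphinx. Below is a minimal Python sketch of that translation; the field whitelist and the function name are hypothetical. It puts the unrestricted words first, on the assumption (as I understand the extended syntax) that a field operator applies to the keywords that follow it.

```python
# Hypothetical whitelist of full-text fields that exist in the index.
KNOWN_FIELDS = {"vendor", "title", "content"}


def to_extended_query(user_query, known_fields=KNOWN_FIELDS):
    """Rewrite 'field:term' tokens as '@field term', plain words first."""
    plain, fielded = [], []
    for token in user_query.split():
        field, sep, term = token.partition(":")
        if sep and term and field in known_fields:
            fielded.append(f"@{field} {term}")
        else:
            plain.append(token)  # not a known field prefix, keep as typed
    return " ".join(plain + fielded)


print(to_extended_query("vendor:intel cpu"))
```

Because the translation is plain string handling done before the search, it adds nothing to the load on the SQL server.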
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
8.6.1. Query
On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics.
The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:
"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).
"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.
"total_found":
Total amount of matching documents in index (that were found and processed on server).
"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").
"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.
"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
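To show how you might consume a result set shaped like the one described for Query(), here is a small Python sketch. The sample data is entirely made up, and the inner "weight" and "attrs" key names are from my memory of the PHP API, so treat them as assumptions. The point is that the search engine hands back document IDs, which you then use to fetch the actual rows from MySQL.

```python
# Made-up sample data shaped like the result set described for Query().
sample_result = {
    "matches": {
        101: {"weight": 2, "attrs": {"vendor_id": 7}},
        205: {"weight": 1, "attrs": {"vendor_id": 3}},
    },
    "total": 2,
    "total_found": 2,
    "words": {"intel": {"docs": 2, "hits": 5}},
    "error": "",
    "warning": "",
}


def ranked_ids(result):
    """Return matched document IDs, highest weight first."""
    if result["error"]:
        raise RuntimeError(result["error"])
    matches = result["matches"]
    return sorted(matches, key=lambda doc_id: matches[doc_id]["weight"], reverse=True)


print(ranked_ids(sample_result))
```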