views:

59

answers:

8

We have two tables. The first contains a name (varchar) field. The second contains a field that references the name field from the first table. This foreign key in the second table will be repeated for every row associated with that name. Is it generally discouraged to use a varchar/string field as a join between two tables? When is the best case where a string field can be used as a join field?

A: 

If you're concerned about performance, the best way to know is to create tables that implement your potential design choices, then load them up with massive amounts of data to see what happens.

In theory, very small strings should perform as well as a number in joins. In practice, it would definitely depend upon the database, indexing, and other implementation choices.

John Fisher
+5  A: 

Is it generally discouraged to use a varchar/string field as a join between two tables?

If there's a natural key to be used (extremely rare in real life, but state/province abbreviations are a good example), then VARCHAR fields are fine.

When is the best case where a string field can be used as a join field?

Depends on the database because of the bits allocated to the data type, but generally VARCHAR(4) or less takes around the same amount of space (less the less number of characters) as INT would.

OMG Ponies
If you have a 4-character field, you might want to go with `CHAR(4)` instead of VARCHAR - less overhead.
marc_s
A: 

In a relational database, you shouldn't use a string in one table that references the same string in another table. If the second table is a look-up, create an identity column for the table, and then reference the integer value in the first. When displaying the data, use a join to the second table. Just make sure in the second table you never actually delete records.

The only exception would be if you are creating an archive table where you want to store exactly what was chosen at a given time.

Steven Raines
A: 

It depends on the nature of your data. If the string is some user-entered and updated value then I would probably shy away from joining on it. You may run into consistency difficulties for storing the name in both the parent and the detail table.

Nothing has duplicate names?

I have used a string field as a join when using GUIDs or single char identifiers or when I know the string to be a natural key (though I almost always prefer a surrogate)

Matthew PK
+1  A: 

Generally speaking you shouldn't use anything that is editable by the end users as a FK as an edit would require not one update, but one update per table which references that key.

Everyone else has already mentioned the potenetial performance implications of a query, but the update cost is also worth noting. I strongly suggest the use of a generated key instead.

Donnie
A: 

Natural primary keys like a zip code, phone number, email address or user name are by definition strings. There are unique and relatively short.

If you put an index on such a column there is no problem with using them a join. Impact on performance will usually be minimal.

codymanix
Don't use a string that can change as a key, like a phone number or email address! You WILL be sorry.
thursdaysgeek
Those are not natural primary keys any of them. I would never use any of those as PKs in a table.
HLGEM
On many websites you must register with your email address or create an login name and cannot change them. Even if you could, "ON UPDATE CASCADE" will be your friend.
codymanix
+4  A: 

It's certainly possible to use a varchar as a key field (or simply something to join on). The main problems with it are based on what you normally store in a varchar field; mutable data. Strictly speaking, it's not advisable to have key fields change. A person's name, telephone number, even their SSN can all change. However, the employee with internal ID 3 will always be ID 3, even if there are two John Smiths.

Second, string comparison is dependent on a number of nit-picky details, such as culture, collation, whitespace translation, etc. that can break a join for no immediately-apparent reason. Say you use a tabspace character \t for a certain string you're joining on. Later, you change your software to replace \t with 3 spaces to reduce character escapes in your raw strings. You have now broken any functionality requiring a string with escaped tabs to be matched to an identical-looking, but differently-composed, string.

Lastly, even given two perfectly identical strings, there is a slight performance benefit to comparing two integer numbers than comparing two strings. Integer comparison is effectively constant-time. String comparison is linear at best, based on the length of the string.

KeithS
Too bad I can only give you one upvote.
HLGEM
A: 

Sometimes a join will happen on fields that are not "join fields", because that's just the nature of the query (e.g. most ways of identifying records that are duplicates in a particular column). If the query you want relates to those values, then that's what the join will be on, end of story.

If a field genuinely identifies a row, then it is possible to use it as a key. It's even possible to do so if it could change (it brings issues, but not insurmountable issues) as long as it remains a genuine identifier (it'll never change to a value that exists for another row).

The performance impact varies by common query and by database. By database the type of indexing strategies of some makes them better at using varchar and other textual keys than other databases (in particular, hash-indices are nice).

Common queries can be such that it becomes more performant to use varchar even without hash indices. A classic example is storing pieces of text for a multi-lingual website. Each such piece of text will have a particular languageID relating to the language it is in. However, obtaining other information about that language (it's name etc.) is rarely needed; what's much more often needed is to either filter by the RFC 5646 code, or to find out what that RFC 6546 code is. If we use a numeric id, then we will have to join for both types of query to obtain that code. If we use the code as the ID, then the most common queries concerned with the language won't need to look in the language table at all. Most queries that do care about the details of the language also won't need to do any join; pretty much the only time the key will be used as a foreign key is in maintaining referential integrity on update and insert of text or on deletion of languages. Hence while the join is less efficient when it is used the system as a whole will be more efficient by using fewer joins.

Jon Hanna