views:

118

answers:

2

Background

I am developing a social web app for poets and writers, allowing them to share their poetry, gather feedback, and communicate with other poets. I have very little formal training in database design, but I have been reading books, SO, and online DB design resources in an attempt to ensure performance and scalability without over-engineering.

The database is MySQL, and the application is written in PHP. I'm not sure yet whether we will be using an ORM library or writing SQL queries from scratch in the app. Other than the web application, Solr search server and maybe some messaging client will interact with the database.

Current Needs

The schema I have thrown together below represents the primary components of the first version of the website. Initially, users can register for the site and do any of the following:

  • Create and modify profile details and account settings
  • Post, tag and categorize their writing
  • Read, comment on and "favorite" other users' posts
  • "Follow" other users to get notifications of their activity
  • Search and browse content and get suggested posts/users (though we will be using the Solr search server to index DB data and run these type of queries)

Scalability

Obviously the whole team expects to explode into the scene, overcoming facebook within a few weeks of launching, but I don't want to go crazy with scalability until that time comes (a few weeks, remember). However, we have a second iteration of features that we will be adding shortly after initial launch, so the initial DB design needs to scale effortlessly to accommodate these changes. The most important of these features include:

  • Video, image and audio posts/albums/sharing
  • Private messaging with facebook-style reply threads
  • User-created groups with in-group discussion boards
  • Administration controls for various user roles, (admins, content editors, community managers)

Schema

Here is what I came up with on MySQL Workbench for the initial site. I'm still a little fuzzy on some relational databasey things, so go easy.

alt text

Questions

  1. In general, is there anything I'm doing wrong or can improve upon?
  2. Is there any reason why I shouldn't combine the ExternalAccounts table into the UserProfiles table?
  3. Is there any reason why I shouldn't combine the PostStats table into the Posts table?
  4. Should I expand the design to include the features we are doing in the second version just to ensure that the initial schema can support it?
  5. Is there anything I can do to optimize the DB design for Solr indexing/performance/whatever?
  6. Should I be using more natural primary keys, like Username instead of UserID, or zip/area code instead of a surrogate LocationID in the Locations table?

Thanks for the help!

+1  A: 

In general, is there anything I'm doing wrong or can improve upon?

Overall, I don't see any big flaws in your current setup or schema.

What I'm wonderng is your split into 3 User* tables. I get what you want your intendtion was (having different user-related things seperate) but I don't know if I would go with the exact same thing. If you plan on displaying only data from the User table on the site, this is fine, since the other info is not needed multiple times on the same page but if users need to use their real name and display their real name (like John Doe instead of doe55) than this will slow down things when the data gets bigger since you may require joins. Having the Preferences seperate seems like a personal choice. I have no argument in favor of nor against it.

Your many-to-many tables would not need an addtional PK (e.g PostFavoriteID). A combined primary of both PostID and UserID would be enough since PostFavoriteID is never used anywhere else. This goes for all join tables

Is there any reason why I shouldn't combine the ExternalAccounts table into the UserProfiles table?

As withe the prev. answer, I don't see a advatanage or disadvantage. I may put both in the same table since the NULL (or maybe better -1) values would not bother me.

Is there any reason why I shouldn't combine the PostStats table into the Posts table?

I would put them into the same table using a trigger to handle the increment of the ViewCount table

Should I expand the design to include the features we are doing in the second version just to ensure that the initial schema can support it?

You are using a normalsied schema so any additions can be done at any time.

Is there anything I can do to optimize the DB design for Solr indexing/performance/whatever?

Can't tell you, haven't done it yet but I know that Solr is very powerfull and flexible so I think you should be doing fine.

Should I be using more natural primary keys, like Username instead of UserID, or zip/area code instead of a surrogate LocationID in the Locations table?

There are many threads here on SO discussing this. Personally, I like a surrogate key better (or another unique number key if available) since it makes queries more easier and faster since an int is looked up easier. If you allow a change of username/email/whatever-your-PK-is than there are massive updates required. With the surrogate key, you don't need to bother.

What I would also do is to add things like created_at, last_accessed at (best done via triggers or procedures IMO) to have some stats already available. This can realy give you valuable stats

Further strategies to increate the performance would be things like memcache, counter cache, partitioned tables,... Such things can be discussed when you are really overrun by users because there may be things/technologies/techniques/... that are very specific to your problem.

DrColossos
+1  A: 

I'm not clear what's going on with your User* tables - they're set up as if they're 1:1 but the diagram reflects 1-to-many (the crow's foot symbol).

The ExternalAccounts and UserSettings could be normalised further (in which case they would then be 1-to-many!), which will give you a more maintainable design - you wouldn't need to add further columns to your schema for additional External Account or Notification Types (although this may be less scalable in terms of performance).

For example:

ExternalAccounts
    UserId int,
    AccountType varchar(45),  
    AccountIdentifier varchar(45)

will allow you to store LinkedIn, Google, etc. accounts in the same structure. Similarly, further Notification Types can be readily added using a structure like:

UserSettings
    UserId int,  
    NotificationType varchar(45),  
    NotificationFlag ENUM('on','off')

hth

richaux