views:

121

answers:

1

Hey guys,

I'm trying to use topic modeling with Mallet but have a question.

How do I know when do I need to rebuild the model? For instance I have this amount of documents I crawled from the web, using topic modeling provided by Mallet I might be able to create the models and infer documents with it. But overtime, with new data that I crawled, new subjects may appear. In that case, how do I know whether I should rebuild the model from start till current?

I was thinking of doing so for documents i crawled each month. Can someone please advise?

So, is topic modeling more suitable for text under a fixed amount of topics (the input parameter k, no. of topics). If not, how do I really determine what number to use?

thanks guys

+1  A: 

Hi.

The answers to your questions depend in large part on the kind of data you're working with and the size of the corpus.

Regarding frequency, I'm afraid you'll just have to estimate how often your data changes in a meaningful way and remodel at that rate. You could start with a week and see if the new data lead to a significantly different model. If not, try two weeks and so on.

The number of topics you select is determined by what you're looking for in the model. The higher the number, the more fine-grained the results. If you want a broad overview of what's in your corpus, you could select say 10 topics. For a closer look, you could use 200 or some other suitably high number.

I hope that helps.

jamesi