views:

1901

answers:

11

Background: I've got a bunch of strings that I'm getting from a database, and I want to return them. Traditionally, it would be something like this:

public List<string> GetStuff(string connectionString)
{
    List<string> categoryList = new List<string>();
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;

            sqlConnection.Open();
            SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
            while (sqlDataReader.Read())
            {
                categoryList.Add(sqlDataReader["myImportantColumn"].ToString());
            }
        }
    }
    return categoryList;
}

But then I figure the consumer is going to want to iterate through the items and doesn't care about much else, and I'd like to not box myself in to a List, per se, so if I return an IEnumerable everything is good/flexible. So I was thinking I could use a "yield return" type design to handle this...something like this:

public IEnumerable<string> GetStuff(string connectionString)
{
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;

            sqlConnection.Open();
            SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
            while (sqlDataReader.Read())
            {
                yield return sqlDataReader["myImportantColumn"].ToString();
            }
        }
    }
}

But now that I'm reading a bit more about yield (on sites like this...msdn didn't seem to mention this), it's apparently a lazy evaluator, that keeps the state of the populator around, in anticipation of someone asking for the next value, and then only running it until it returns the next value.

This seems fine in most cases, but with a DB call, this sounds a bit dicey. As a somewhat contrived example, if someone asks for an IEnumerable from that I'm populating from a DB call, gets through half of it, and then gets stuck in a loop...as far as I can see my DB connection is going to stay open forever.

Sounds like asking for trouble in some cases if the iterator doesn't finish...am I missing something?

+7  A: 

You are not missing anything. Your sample shows how NOT to use yield return. Add the items to a list, close the connection, and return the list. Your method signature can still return IEnumerable.

Edit: That said, Jon has a point (so surprised!): there are rare occasions where streaming is actually the best thing to do from a performance perspective. After all, if it's 100,000 (1,000,000? 10,000,000?) rows we're talking about here, you don't want to be loading that all into memory first.

Richard Szalay
Yep...I was only highlighting the IEnumerable aspect of it because that's what made me think of using yield in the first place. And thanks for the answer...glad to see that I'm not barking completely up the wrong tree.
Beska
No worries mate, glad to have helped. If this has answered your question, don't forget to mark it as the answer so it drops off the unanswered questions list.
Richard Szalay
Oh, I almost always mark my questions as answered...but I'd like to hold on to this one for a bit, since Jon has weighed in with a slightly different viewpoint, and I'd like to see how that plays out.
Beska
Good thinking! (padding)
Richard Szalay
Hmm...as in scamming for rep? I hadn't thought of that...I'd rather not have people think I'm just trying to eke out rep. Should I mark this as closed and just hope people keep looking at it? It doesn't really seem like a community wiki thing...
Beska
It won't be on the unanswered questions list anyway, as it's got answers with upvotes.
Jon Skeet
Nono, I wasn't being sarcastic!
Richard Szalay
Beska: I'd leave it open for a while. In particular, if Marc Gravell takes a look I'm sure he'll have an interesting opinion. I'll ping him about it...
Jon Skeet
Ah, I see why you thought that now. The "(padding)" was added to fill the 15 character minimum on comments :)
Richard Szalay
In most of the situations, the dispose code gets called. See my posting
taoufik
+1  A: 

No, you are on the right path... the yield will lock the reader... you can test it doing another database call while calling the IEnumerable

Jhonny D. Cano -Leftware-
You enable MARS in the connection string to allow multiple open SqlDataReaders at a performance hit. But still, this pattern has problems.
spoulson
A: 

Dont use yield here. your sample is fine.

Eh? What was wrong with this answer?
Richard Szalay
It isn't an answer - unknown (yahoo) hasn't understood the question.
reinierpost
+21  A: 

It's a balancing act: do you want to force all the data into memory immediately so you can free up the connection, or do you want to benefit from streaming the data, at the cost of tying up the connection for all that time?

The way I look at it, that decision should potentially be up to the caller, who knows more about what they want to do. If you write the code using an iterator block, the caller can very easily turned that streaming form into a fully-buffered form:

List<string> stuff = new List<string>(GetStuff(connectionString));

If, on the other hand, you do the buffering yourself, there's no way the caller can go back to a streaming model.

So I'd probably use the streaming model and say explicitly in the documentation what it does, and advise the caller to decide appropriately. You might even want to provide a helper method to basically call the streamed version and convert it into a list.

Of course, if you don't trust your callers to make the appropriate decision, and you have good reason to believe that they'll never really want to stream the data (e.g. it's never going to return much anyway) then go for the list approach. Either way, document it - it could very well affect how the return value is used.

Another option for dealing with large amounts of data is to use batches, of course - that's thinking somewhat away from the original question, but it's a different approach to consider in the situation where streaming would normally be attractive.

Jon Skeet
A: 

I've bumped into this wall a few times. SQL database queries are not easily streamable like files. Instead, query only as much as you think you'll need and return it as whatever container you want (IList<>, DataTable, etc.). IEnumerable won't help you here.

spoulson
A: 

What you can do is use a SqlDataAdapter instead and fill a DataTable. Something like this:

public IEnumerable<string> GetStuff(string connectionString)
{
    DataTable table = new DataTable();
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;
            SqlDataAdapter dataAdapter = new SqlDataAdapter(sqlCommand);
            dataAdapter.Fill(table);
        }

    }
    foreach(DataRow row in table.Rows)
    {
        yield return row["myImportantColumn"].ToString();
    }
}

This way, you're querying everything in one shot, and closing the connection immediately, yet you're still lazily iterating the result. Furthermore, the caller of this method can't cast the result to a List and do something they shouldn't be doing.

BFree
I don't understand what the point of "lazily iterating the result" is in this example.
mquander
I think the point is that the OP won't be tied down to List<> (which is why he went with the yield approach in the first place), but at the same time this won't keep the database connection open.
Andy
Well, with either approach I don't need to be tied to List<>; I could return IEnumerable<> either way. I was just thinking about moving towards something more generic than List<>, and that was what got me thinking about yield, and the potential consequences of it.
Beska
No it's just that others have suggested returning a List as IEnumerable. While it probably won't happen, the caller of the method CAN cast that to a list whereas if you use the yield, you can't.
BFree
I guess that's true, but that's not a reason not to do it. It's silly to use iterators everywhere for no real reason because you're worried about some strange behavior from the caller (behavior which wouldn't break anything anyway.)
mquander
Fair enough. You raise good points.
BFree
+8  A: 

You're not always unsafe with the IEnumerable. If you leave the framework call GetEnumerator (which is what most of the people will do), then you're safe. Basically, you're as safe as the carefullness of the code using your method:

class Program
{
    static void Main(string[] args)
    {
        // safe
        var firstOnly = GetList().First();

        // safe
        foreach (var item in GetList())
        {
            if(item == "2")
                break;
        }

        // safe
        using (var enumerator = GetList().GetEnumerator())
        {
            for (int i = 0; i < 2; i++)
            {
                enumerator.MoveNext();
            }
        }

        // unsafe
        var enumerator2 = GetList().GetEnumerator();

        for (int i = 0; i < 2; i++)
        {
            enumerator2.MoveNext();
        }
    }

    static IEnumerable<string> GetList()
    {
        using (new Test())
        {
            yield return "1";
            yield return "2";
            yield return "3";
        }
    }

}

class Test : IDisposable
{
    public void Dispose()
    {
        Console.WriteLine("dispose called");
    }
}

Whether you can affort to leave the database connection open or not depends on your architecture as well. If the caller participates in an transaction (and your connection is auto enlisted), then the connection will be kept open by the framework anyway.

Another advantage of yield is (when using a server-side cursor), your code doesn't have to read all data (example: 1,000 items) from the database, if your consumer wants to get out of the loop earlier (example: after the 10th item). This can speed up querying data. Especially in an Oracle environment, where server-side cursors are the common way to retrieve data.

taoufik
+1 for details about disposing, but I don't think that was the concern - I *believe* Beska is worried about some iterations of the caller's loop taking a very long time to process, leaving the database connection open when it doesn't really need to.
Jon Skeet
Thanks, updated with my vision about keeping the connection open.
taoufik
+1  A: 

The only way this would cause problems is if the caller abuses the protocol of IEnumerable<T>. The correct way to use it is to call Dispose on it when it is no longer needed.

The implementation generated by yield return takes the Dispose call as a signal to execute any open finally blocks, which in your example will call Dispose on the objects you've created in the using statements.

There are a number of language features (in particular foreach) which make it very easy to use IEnumerable<T> correctly.

Daniel Earwicker
If you could site some documentation about how Dispose is used by/within enumerators implemented via the yield return keywords it would be helpful.
jpierson
+4  A: 

As an aside - note that the IEnumerable<T> approach is essentially what the LINQ providers (LINQ-to-SQL, LINQ-to-Entities) do for a living. The approach has advantages, as Jon says. However, there are definite problems too - in particular (for me) in terms of (the combination of) separation | abstraction.

What I mean here is that:

  • in a MVC scenario (for example) you want your "get data" step to actually get data, so that you can test it works at the controller, not the view (without having to remember to call .ToList() etc)
  • you can't guarantee that another DAL implementation will be able to stream data (for example, a POX/WSE/SOAP call can't usually stream records); and you don't necessarily want to make the behaviour confusingly different (i.e. connection still open during iteration with one implementation, and closed for another)

This ties in a bit with my thoughts here: Pragmatic LINQ.

But I should stress - there are definitely times when the streaming is highly desirable. It isn't a simple "always vs never" thing...

Marc Gravell
A: 

You could always use a separate thread to buffer the data (perhaps to a queue) while also doing a yeild to return the data. When the user requests data (returned via a yeild), an item is removed from the queue. Data is also being continuously added to the queue via the separate thread. That way, if the user requests the data fast enough, the queue is never very full and you do not have to worry about memory issues. If they don't, then the queue will fill up, which may not be so bad. If there is some sort of limitation you would like to impose on memory, you could enforce a maximum queue size (at which point the other thread would wait for items to be removed before adding more to the queue). Naturally, you will want to make sure you handle resources (i.e., the queue) correctly between the two threads.

As an alternative, you could force the user to pass in a boolean to indicate whether or not the data should be buffered. If true, the data is buffered and the connection is closed as soon as possible. If false, the data is not buffered and the database connection stays open as long as the user needs it to be. Having a boolean parameter forces the user to make the choice, which ensures they know about the issue.

Nick
+1  A: 

Slightly more concise way to force evaluation of iterator:

using System.Linq;

//...

var stuff = GetStuff(connectionString).ToList();
Branko Dimitrijevic