But let's say I have an integer weight where, e.g., elements with weight 10 have a 10 times higher probability of being selected than elements with weight 1.

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderBy(e => /*????*/ * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();

How do I go about getting random records in LINQ when I need the result to be weighted?

I need something like http://stackoverflow.com/questions/58457/random-weighted-choice-in-t-sql, but in LINQ, and returning more than one record.

If I didn't have the weighted requirement, I'd use the NEWID approach; can I adapt it somehow?

partial class DataContext
{
    [Function(Name = "NEWID", IsComposable = true)]
    public Guid Random()
    {
        throw new NotImplementedException();
    }
}

...

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderBy(e => db.Random())
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
+1  A: 

Your suggested solution, as it seems from the question, is bound to LINQ/LINQ to SQL.

If I understand correctly, your main goal is to fetch at most X records from the database that have a weight greater than 0. If the database holds more than X records, you'd like to choose among them using each record's weight, and get a random result.

If all is correct so far, my solution is to clone each record by its weight: if a record's weight is 5, make sure you have it 5 times. This way the random choice takes into account the weight.

However, cloning the records creates, well, duplicates. So you can't just take X records; you should keep taking records until you have X distinct ones.

So far I've described a general solution, independent of any particular implementation.

I think it's harder to implement my solution using only LINQ to SQL. If the total record count in the DB is not huge, I suggest reading the entire table and doing the cloning and randomization outside SQL Server.

If the total count is huge, I suggest you take, say, 100,000 records (or fewer) chosen at random (via LINQ to SQL), and apply the implementation above. I believe it's random enough.
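
For illustration, a rough LINQ-to-Objects sketch of this idea, reusing the question's names (db.WorkTypes, SeoPriority, take) and assuming the candidate rows have already been pulled into memory:

// Read the candidates out of the database; everything after ToList() runs in memory.
var candidates = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .ToList();

var rand = new Random();
var picked = candidates
    .SelectMany(e => Enumerable.Repeat(e, e.SeoPriority)) // clone each record by its weight
    .OrderBy(_ => rand.Next())                            // shuffle the expanded list
    .Distinct()                                           // drop the duplicates the cloning introduced
    .Take(take)                                           // first X distinct records
    .ToArray();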

Ron Klein
+1  A: 

Try using the RAND() SQL function - it gives you a float between 0 and 1.

The downside is that I'm not sure whether it would cause a full table scan on the SQL Server side, i.e. whether the query execution would be optimized so that, once it has the top n records, it ignores the rest of the table.
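
If you want to drive that from LINQ to SQL, RAND could in principle be mapped onto the data-context the same way the question maps NEWID - a sketch only, untested; note the caveat in a later answer that RAND() is evaluated once per statement, so it may not vary per row:

partial class DataContext
{
    // Hypothetical mapping, mirroring the NEWID mapping from the question.
    [Function(Name = "RAND", IsComposable = true)]
    public double RandomDouble()
    {
        throw new NotImplementedException();
    }
}

...

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderByDescending(e => db.RandomDouble() * e.SeoPriority)
    .Take(take).ToArray();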

eglasius
+1  A: 

I'm assuming that the weight is an integer. Here's an approach which joins to a dummy table to increase the row count per the weight; first, let's prove it just in TSQL:

SET NOCOUNT ON
--DROP TABLE [index]
--DROP TABLE seo
CREATE TABLE [index] ([key] int not null) -- names for fun ;-p
CREATE TABLE seo (url varchar(10) not null, [weight] int not null)

INSERT [index] values(1) INSERT [index] values(2)
INSERT [index] values(3) INSERT [index] values(4)
INSERT [index] values(5) INSERT [index] values(6)
INSERT [index] values(7) INSERT [index] values(8)
INSERT [index] values(9) INSERT [index] values(10)

INSERT [seo] VALUES ('abc',1)  INSERT [seo] VALUES ('def',2)
INSERT [seo] VALUES ('ghi',1)  INSERT [seo] VALUES ('jkl',3)
INSERT [seo] VALUES ('mno',1)  INSERT [seo] VALUES ('mno',1)
INSERT [seo] VALUES ('pqr',2)

DECLARE @count int, @url varchar(10)
SET @count = 0
DECLARE @check_rand TABLE (url varchar(10) not null)

-- test it lots of times to check distribution roughly matches weights
WHILE @count < 11000
BEGIN
    SET @count = @count + 1

    SELECT TOP 1 @url = [seo].[url]
    FROM [seo]
    INNER JOIN [index] ON [index].[key] <= [seo].[weight]
    ORDER BY NEWID()

    -- this to check distribution
    INSERT @check_rand VALUES (@url)
END

SELECT ISNULL(url, '(total)') AS [url], COUNT(1) AS [hits]
FROM @check_rand
GROUP BY url WITH ROLLUP
ORDER BY url

This outputs something like:

url        hits
---------- -----------
(total)    11000
abc        1030
def        1970
ghi        1027
jkl        2972
mno        2014
pqr        1987

This shows that we have the correct overall distribution. Now let's bring that into LINQ-to-SQL; I've added the two tables to a data-context (you will need to create something like the [index] table to do this) - my DBML:

  <Table Name="dbo.[index]" Member="indexes">
    <Type Name="index">
      <Column Name="[key]" Member="key" Type="System.Int32" DbType="Int NOT NULL" CanBeNull="false" />
    </Type>
  </Table>
  <Table Name="dbo.seo" Member="seos">
    <Type Name="seo">
      <Column Name="url" Type="System.String" DbType="VarChar(10) NOT NULL" CanBeNull="false" />
      <Column Name="weight" Type="System.Int32" DbType="Int NOT NULL" CanBeNull="false" />
    </Type>
  </Table>

Now we'll consume this; in the partial class for the data-context, add a compiled-query (for performance) in addition to the Random method:

partial class MyDataContextDataContext
{
    [Function(Name = "NEWID", IsComposable = true)]
    public Guid Random()
    {
        throw new NotImplementedException();
    }
    public string GetRandomUrl()
    {
        return randomUrl(this);
    }
    static readonly Func<MyDataContextDataContext, string>
        randomUrl = CompiledQuery.Compile(
            (MyDataContextDataContext ctx) =>
                (from s in ctx.seos
                 from i in ctx.indexes
                 where i.key <= s.weight
                 orderby ctx.Random()
                 select s.url).First());
}

This LINQ-to-SQL query is very similar to the key part of the TSQL we wrote; let's test it:

using (var ctx = CreateContext()) {
    // show sample query
    ctx.Log = Console.Out;
    Console.WriteLine(ctx.GetRandomUrl());
    ctx.Log = null;

    // check distribution
    var counts = new Dictionary<string, int>();
    for (int i = 0; i < 11000; i++) // obviously a bit slower than inside db
    {
        if (i % 100 == 1) Console.WriteLine(i); // show progress
        string s = ctx.GetRandomUrl();
        int count;
        if (counts.TryGetValue(s, out count)) {
            counts[s] = count + 1;
        } else {
            counts[s] = 1;
        }
    }
    Console.WriteLine("(total)\t{0}", counts.Sum(p => p.Value));
    foreach (var pair in counts.OrderBy(p => p.Key)) {
        Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}

This runs the query once to show the TSQL is suitable, then (like before) 11k times to check the distribution; output (not including the progress updates):

SELECT TOP (1) [t0].[url]
FROM [dbo].[seo] AS [t0], [dbo].[index] AS [t1]
WHERE [t1].[key] <= [t0].[weight]
ORDER BY NEWID()
-- Context: SqlProvider(Sql2008) Model: AttributedMetaModel Build: 3.5.30729.4926

which doesn't look too bad at all - it has both tables and the range condition, and the TOP 1, so it is doing something very similar; data:

(total) 11000
abc     939
def     1893
ghi     1003
jkl     3104
mno     2048
pqr     2013

So again, we've got the right distribution, all from LINQ-to-SQL. Sorted?

Marc Gravell
+1  A: 

My first idea was the same as Ron Klein's - create a weighted list and select randomly from that.

Here's a LINQ extension method to create the weighted list from the normal list, given a lambda function that knows the weight property of the object.

Don't worry if you don't get all the generics stuff right away... The usage below should make it clearer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication1
{
    public class Item
    {
        public int Weight { get; set; }
        public string Name { get; set; }
    }

    public static class Extensions
    {
        public static IEnumerable<T> Weighted<T>(this IEnumerable<T> list, Func<T, int> weight)
        {
            foreach (T t in list)
                for (int i = 0; i < weight(t); i++)
                    yield return t;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Item> list = new List<Item>();
            list.Add(new Item { Name = "one", Weight = 5 });
            list.Add(new Item { Name = "two", Weight = 1 });

            Random rand = new Random(0);

            list = list.Weighted<Item>(x => x.Weight).ToList();

            for (int i = 0; i < 20; i++)
            {
                int index = rand.Next(list.Count());
                Console.WriteLine(list.ElementAt(index).Name);
            }

            Console.ReadLine();
        }
    }
}

As you can see from the output, the results are both random and weighted as you require.

ifatree
A LINQ-to-SQL answer that is (essentially) "bring the entire table in-process" is rarely a good idea...
Marc Gravell
I understand what you mean, but the advice is almost too specific to apply to my solution. You might want to mention it as a caveat of why the OP wants to do this through LINQ at all, but as to how to actually do it with LINQ, this is the right way. If your domain objects are big, you'd use Item as a proxy to hold the weight and the unique ID, then go back and get the "real object" from the ID when you need it (see the sketch after this thread). Also, your advice only applies to tables that can grow to unlimited size; for small collections that you want cached, bringing in the entire table is much better than repeated DB trips.
ifatree
Who said anything about repeated? You can do it in one hit to the db, fetching one row. Oh well...
Marc Gravell
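
For what it's worth, a sketch of the proxy idea from the comment above, reusing the Weighted extension from this answer (the WorkTypes/HumanId/SeoPriority names come from the question; the rest is illustrative):

// Pull only the key and the weight into memory, not the whole row.
var proxies = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .Select(e => new { e.HumanId, e.SeoPriority })
    .ToList();

// Pick from the weighted list, then go back for the "real object" by ID.
var rand = new Random();
var weighted = proxies.Weighted(p => p.SeoPriority).ToList();
var chosenId = weighted[rand.Next(weighted.Count)].HumanId;
var chosen = db.WorkTypes.Single(e => e.HumanId == chosenId);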
A: 

The reason the GUID (NEWID) function was being used in the SQL example you are looking at is simply that SQL Server's RAND function is only calculated once per statement, so it is useless for randomising a select.

But as you're using LINQ, a quick and dirty solution is to create a Random object and use it in your ordering clause.

Random rand = new Random(DateTime.Now.Millisecond);

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderByDescending(e => rand.Next(10) * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();

The rand.Next(10) assumes your SeoPriority scales from 0 to 10.

It's not 100% accurate, but it's close; adjusting the Next value can tweak it.

MrLink
+1  A: 
var rand = new Random();

var ws = db.WorkTypes
    .Where(e => e.HumanId != null && e.SeoPriority != 0)
    .OrderByDescending(e => rand.Next() * e.SeoPriority)
    .Select(e => new
    {
        DescriptionText = e.DescriptionText,
        HumanId = e.HumanId
    })
    .Take(take).ToArray();
BarrettJ
Not a solution that will work for LINQ to SQL, but it helped me nonetheless.
Hainesy