Why does the Contains() operator degrade Entity Framework's performance so dramatically?

小开

Editing my original answer - There is a possible workaround, depending on the complexity of your entities. If you know the sql that EF generates to populate your entities, you can execute it directly using DbContext.Database.SqlQuery. In EF 4, I think you could use ObjectContext.ExecuteStoreQuery, but I didn't try it.

For example, using the code from my original answer below to generate the sql statement using a StringBuilder, I was able to do the following

var rows = db.Database.SqlQuery<Main>(sql).ToArray();

and the total time went from approximately 26 seconds to 0.5 seconds.

I will be the first to say it's ugly, and hopefully a better solution presents itself.

update

After a bit more thought, I realized that if you use a join to filter your results, EF doesn't have to build that long list of ids. This could be complex depending on the number of concurrent queries, but I believe you could use user ids or session ids to isolate them.

To test this, I created a Target table with the same schema as Main. I then used a StringBuilder to create INSERT commands to populate the Target table in batches of 1,000 since that's the most SQL Server will accept in a single INSERT. Directly executing the sql statements was much faster than going through EF (approx 0.3 seconds vs. 2.5 seconds), and I believe would be ok since the table schema shouldn't change.

Finally, selecting using a join resulted in a much simpler query and executed in less than 0.5 seconds.

ExecuteStoreCommand("DELETE Target");


var ids = Main.Select(a => a.Id).ToArray();
var sb = new StringBuilder();


for (int i = 0; i < 10; i++)
{
sb.Append("INSERT INTO Target(Id) VALUES (");
for (int j = 1; j <= 1000; j++)
{
if (j > 1)
{
sb.Append(",(");
}
sb.Append(i * 1000 + j);
sb.Append(")");
}
ExecuteStoreCommand(sb.ToString());
sb.Clear();
}


var rows = (from m in Main
join t in Target on m.Id equals t.Id
select m).ToArray();


rows.Length.Dump();

And the sql generated by EF for the join:

SELECT
[Extent1].[Id] AS [Id]
FROM  [dbo].[Main] AS [Extent1]
INNER JOIN [dbo].[Target] AS [Extent2] ON [Extent1].[Id] = [Extent2].[Id]

(original answer)

This is not an answer, but I wanted to share some additional information and it is far too long to fit in a comment. I was able to reproduce your results, and have a few other things to add:

SQL Profiler shows the delay is between execution of the first query (Main.Select) and the second Main.Where query, so I suspected the problem was in generating and sending a query of that size (48,980 bytes).

However, building the same sql statement in T-SQL dynamically takes less than 1 second, and taking the ids from your Main.Select statement, building the same sql statement and executing it using a SqlCommand took 0.112 seconds, and that's including time to write the contents to the console.

At this point, I suspect that EF is doing some analysis/processing for each of the 10,000 ids as it builds the query. Wish I could provide a definitive answer and solution :(.

Here's the code I tried in SSMS and LINQPad (please don't critique too harshly, I'm in a rush trying to leave work):

declare @sql nvarchar(max)


set @sql = 'SELECT
[Extent1].[Id] AS [Id]
FROM [dbo].[Main] AS [Extent1]
WHERE [Extent1].[Id] IN ('


declare @count int = 0
while @count < 10000
begin
if @count > 0 set @sql = @sql + ','
set @count = @count + 1
set @sql = @sql + cast(@count as nvarchar)
end
set @sql = @sql + ')'


exec(@sql)

var ids = Mains.Select(a => a.Id).ToArray();


var sb = new StringBuilder();
sb.Append("SELECT [Extent1].[Id] AS [Id] FROM [dbo].[Main] AS [Extent1] WHERE [Extent1].[Id] IN (");
for(int i = 0; i < ids.Length; i++)
{
if (i > 0)
sb.Append(",");
sb.Append(ids[i].ToString());
}
sb.Append(")");


using (SqlConnection connection = new SqlConnection("server = localhost;database = Test;integrated security = true"))
using (SqlCommand command = connection.CreateCommand())
{
command.CommandText = sb.ToString();
connection.Open();
using(SqlDataReader reader = command.ExecuteReader())
{
while(reader.Read())
{
Console.WriteLine(reader.GetInt32(0));
}
}
}

小开

If you find a performance problem which is blocking for you don't try to spend ages on solving it because you will most probably don't success and you will have to communicate it with MS directly (if you have premium support) and it takes ages.

Use workaround and workaround in case of performance issue and EF means direct SQL. There is nothing bad about it. Global idea that using EF = not using SQL anymore is a lie. You have SQL Server 2008 R2 so:

Create stored procedure accepting table valued parameter to pass your ids
Let your stored procedure return multiple result sets to emulate Include logic in optimal way
If you need some complex query building use dynamic SQL inside stored procedure
Use SqlDataReader to get results and construct your entities
Attach them to context and work with them as if they were loaded from EF

If the performance is critical for you you will not find better solution. This procedure cannot be mapped and executed by EF because current version doesn't support either table valued parameters or multiple result sets.

小开

最佳答案

UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The approach described in this answer is no longer necessary.

You are right that most of the time is spent processing the translation of the query. EF's provider model doesn't currently include an expression that represents an IN clause, therefore ADO.NET providers can't support IN natively. Instead, the implementation of Enumerable.Contains translates it to a tree of OR expressions, i.e. for something that in C# looks like like this:

new []{1, 2, 3, 4}.Contains(i)

... we will generate a DbExpression tree that could be represented like this:

((1 = @i) OR (2 = @i)) OR ((3 = @i) OR (4 = @i))

(The expression trees have to be balanced because if we had all the ORs over a single long spine there would be more chances that the expression visitor would hit a stack overflow (yes, we actually did hit that in our testing))

We later send a tree like this to the ADO.NET provider, which can have the ability to recognize this pattern and reduce it to the IN clause during SQL generation.

When we added support for Enumerable.Contains in EF4, we thought it was desirable to do it without having to introduce support for IN expressions in the provider model, and honestly, 10,000 is much more than the number of elements we anticipated customers would pass to Enumerable.Contains. That said, I understand that this is an annoyance and that the manipulation of expressions trees makes things too expensive in your particular scenario.

I discussed this with one of our developers and we believe that in the future we could change the implementation by adding first-class support for IN. I will make sure this is added to our backlog, but I cannot promise when it will make it given there are many other improvements we would like to make.

To the workarounds already suggested in the thread I would add the following:

Consider creating a method that balances the number of database roundtrips with the number of elements you pass to Contains. For instance, in my own testing I observed that computing and executing against a local instance of SQL Server the query with 100 elements takes 1/60 of a second. If you can write your query in such a way that executing 100 queries with 100 different sets of ids would give you equivalent result to the query with 10,000 elements, then you can get the results in aproximately 1.67 seconds instead of 18 seconds.

Different chunk sizes should work better depending on the query and the latency of the database connection. For certain queries, i.e. if the sequence passed has duplicates or if Enumerable.Contains is used in a nested condition you may obtain duplicate elements in the results.

Here is a code snippet (sorry if the code used to slice the input into chunks looks a little too complex. There are simpler ways to achieve the same thing, but I was trying to come up with a pattern that preserves streaming for the sequence and I couldn't find anything like it in LINQ, so I probably overdid that part :) ):

Usage:

var list = context.GetMainItems(ids).ToList();

Method for context or repository:

public partial class ContainsTestEntities
{
public IEnumerable<Main> GetMainItems(IEnumerable<int> ids, int chunkSize = 100)
{
foreach (var chunk in ids.Chunk(chunkSize))
{
var q = this.MainItems.Where(a => chunk.Contains(a.Id));
foreach (var item in q)
{
yield return item;
}
}
}
}

Extension methods for slicing enumerable sequences:

public static class EnumerableSlicing
{


private class Status
{
public bool EndOfSequence;
}


private static IEnumerable<T> TakeOnEnumerator<T>(IEnumerator<T> enumerator, int count,
Status status)
{
while (--count > 0 && (enumerator.MoveNext() || !(status.EndOfSequence = true)))
{
yield return enumerator.Current;
}
}


public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> items, int chunkSize)
{
if (chunkSize < 1)
{
throw new ArgumentException("Chunks should not be smaller than 1 element");
}
var status = new Status { EndOfSequence = false };
using (var enumerator = items.GetEnumerator())
{
while (!status.EndOfSequence)
{
yield return TakeOnEnumerator(enumerator, chunkSize, status);
}
}
}
}

Hope this helps!

小开

We were able to solve the EF Contains problem by adding an intermediate table and joining on that table from LINQ query that needed to use Contains clause. We were able to get amazing results with this approach. We have a large EF model and as "Contains" is not allowed when pre-compiling EF queries we were getting very poor performance for queries that use "Contains" clause.

An overview:

Create a table in SQL Server - for example HelperForContainsOfIntType with HelperID of Guid data-type and ReferenceID of int data-type columns. Create different tables with ReferenceID of differing data-types as needed.
Create an Entity / EntitySet for HelperForContainsOfIntType and other such tables in EF model. Create different Entity / EntitySet for different data-types as needed.
Create a helper method in .NET code which takes the input of an IEnumerable<int> and returns an Guid. This method generates a new Guid and inserts the values from IEnumerable<int> into HelperForContainsOfIntType along with the generated Guid. Next, the method returns this newly generated Guid to the caller. For fast inserting into HelperForContainsOfIntType table, create a stored-procedure which takes input of an list of values and does the insertion. See Table-Valued Parameters in SQL Server 2008 (ADO.NET). Create different helpers for different data-types or create a generic helper method to handle different data-types.

Create a EF compiled query which is similar to something like below:

static Func<MyEntities, Guid, IEnumerable<Customer>> _selectCustomers =
CompiledQuery.Compile(
(MyEntities db, Guid containsHelperID) =>
from cust in db.Customers
join x in db.HelperForContainsOfIntType on cust.CustomerID equals x.ReferenceID where x.HelperID == containsHelperID
select cust
);

Call the helper method with values to be used in the Contains clause and get the Guid to use in the query. For example:

var containsHelperID = dbHelper.InsertIntoHelperForContainsOfIntType(new int[] { 1, 2, 3 });
var result = _selectCustomers(_dbContext, containsHelperID).ToList();

小开

A cacheable alternative to Contains?

This just bit me so I've added my two pence to the Entity Framework Feature Suggestions link.

The issue is definitely when generating the SQL. I have a client on who's data the query generation was 4 seconds but the execution was 0.1 seconds.

I noticed that when using dynamic LINQ and ORs the sql generation was taking just as long but it generated something that could be cached. So when executing it again it was down to 0.2 seconds.

Note that a SQL in was still generated.

Just something else to consider if you can stomach the initial hit, your array count does not change much, and run the query a lot. (Tested in LINQ Pad)

小开

It was fixed on Entity Framework 6 Alpha 2: http://entityframework.codeplex.com/SourceControl/changeset/a7b70f69e551

http://blogs.msdn.com/b/adonet/archive/2012/12/10/ef6-alpha-2-available-on-nuget.aspx

小开

I'm not familiar with Entity Framework but is the perf better if you do the following?

Instead of this:

var ids = Main.Select(a => a.Id).ToArray();
var rows = Main.Where (a => ids.Contains(a.Id)).ToArray();

how about this (assuming the ID is an int):

var ids = new HashSet<int>(Main.Select(a => a.Id));
var rows = Main.Where (a => ids.Contains(a.Id)).ToArray();

小开

The issue is with Entity Framework's SQL generation. It cannot cache the query if one of the parameters is a list.

To get EF to cache your query you can convert your list to a string and do a .Contains on the string.

So for example this code would run much faster since EF could cache the query:

var ids = Main.Select(a => a.Id).ToArray();
var idsString = "|" + String.Join("|", ids) + "|";
var rows = Main
.Where (a => idsString.Contains("|" + a.Id + "|"))
.ToArray();

When this query is generated it will likely be generated using a Like instead of an In so it will speed up your C# but it could potentially slow down your SQL. In my case I didn't notice any performance decrease in my SQL execution, and the C# ran significantly faster.