New Cardinality Estimator Part 3 – Single Table And One Predicate

In this article I will try to find out is there any change in cardinality estimation in a very simple case of querying a single table with a single predicate with the equal operator (i.e. SELECT * FROM T WHERE col = some_value). Since this is a very simple case I don’t expect significant changes. We will create a table in SQL Server 2014 CTP2 and query it with old and new CE and compare the results.

Let’s create a sample Orders table and populate it with 1M rows. Use the following code  to create and populate the test table:

IF OBJECT_ID('dbo.Orders', 'U') IS NOT NULL DROP TABLE dbo.Orders;
CREATE TABLE dbo.Orders(
id INT IDENTITY(1,1) NOT NULL,
custid INT NOT NULL,
orderdate DATETIME NOT NULL,
amount MONEY NOT NULL,
CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (id ASC)
)
GO
-- Fill the table with 1M rows
DECLARE @date_from DATETIME = '19980101';
DECLARE @date_to DATETIME = '20130630';
DECLARE @number_of_rows INT = 1000000;
INSERT INTO dbo.Orders(custid,orderdate,amount)
SELECT 1 + ABS(CHECKSUM(NEWID())) % 10000 AS custid,
(
SELECT(@date_from +(ABS(CAST(CAST( NEWID() AS BINARY(8)) AS INT))%CAST((@date_to - @date_from)AS INT)))) AS orderdate,
50 + ABS(CHECKSUM(NEWID())) % 1000 AS amount
FROM dbo.GetNums(1, @number_of_rows)
GO
--Create index on the orderdate column
CREATE INDEX IX1 ON dbo.Orders(orderdate);
GO

We want to query this table by filtering the orderdate column – therefore we have also created a non-clustered index on this column.  Now when we populated the table let’s write simple queries returning all orders for a given date. We will write two queries with a literal and one with a local variable. First series of queries instructs the optimizer to use the old CE:

SELECT * FROM dbo.Orders WHERE orderdate = '20130629' OPTION (QUERYTRACEON 9481);
SELECT * FROM dbo.Orders WHERE orderdate = '20130628' OPTION (QUERYTRACEON 9481);
DECLARE @d DATETIME ='20130629';
SELECT * FROM dbo.Orders WHERE orderdate = @d OPTION (QUERYTRACEON 9481);

We have chosen 29th June 2013 because this is the last date in the range and it is statistically significant value. Therefore in the statistics histogram for the orderdate column there is an entry for this date. This means that the estimation for this date would be perfect, since the optimizer can read from statistics exact number of rows in the table for this date.

Its neighbor 28th June has been chosen because an entry for it does not exist in the statistics histogram. Therefore the optimizer has to use an average value for range of rows between the previous statistically significant date and 29th June 2013.

And in third query with local variable the query optimizer cannot use statistic histogram and must refer to info from density vector and statistic header (All Density and Rows).

Next Figure illustrates above described estimations.

Figure2_2

Note: When you try the above code example your results could be slightly different since the table is populated by random generated numbers.

Ergo, the estimated number of rows for Q1 is 203 (value in the column EQ_ROWS for the entry 2013-06-29).

For Q2 we got 175.2778 (value in the column AVG_RANGE_ROWS for the entry 2013-06-29)

And finally the estimated number of rows for Q3 is calculated as product of All density for the column orderdate (0.0001767097) and the column Rows in the statistics header (1000000) = 176.71

This is how estimation works prior to SQL Server 2014. Let’s execute all these queries under the new CE (just remove the statement level option from all three queries).

What is the result? Absolutely the same estimations! So, in this simple case with a single table and one predicate and equals operator the estimation under old and new CE is identical.

Before we make conclusion that there is no difference between old and new CE in this case let’s play a little bit more with the sample table. We have executed all queries (with old and new CE) immediately after we populated the sample table and created an index on the orderdate column. At this point all table relevant statistics were up to date. Let’s see what happens when the statistic is out of date.

Use the following code to add additional 100K rows into the Orders table:

DECLARE @date_from DATETIME = '19980101';
DECLARE @date_to DATETIME = '20130630';
DECLARE @number_of_rows INT = 100000;
INSERT INTO dbo.Orders(custid,orderdate,amount)
SELECT 1 + ABS(CHECKSUM(NEWID())) % 10000 AS custid,
(
SELECT(@date_from +(ABS(CAST(CAST( NEWID() AS BINARY(8)) AS INT))%CAST((@date_to - @date_from)AS INT)))) AS orderdate,
50 + ABS(CHECKSUM(NEWID())) % 1000 AS amount
FROM dbo.GetNums(1, @number_of_rows)
GO

We have simply added 100K new rows, which is 10% of total rows. As you can see we used the same date interval for orderdate which means that newly added orders are not made after all orders in the main table – they are just added to it (for instance an import of all orders from one subsidiary). What would happen when we add 100K really new orders (with orderdate after 30.06. 2013) is also very interesting but will be covered in one of future articles.

Since we did not add 20% of total rows or more the statistics object for the column orderdate is not automatically updated. You can check this by issuing the following statement:

DBCC SHOW_STATISTICS('dbo.Orders', 'IX1');

We can see that the statistics histogram still shows 203 rows for 29th June. All newly added rows don’t exist for the statistics object.

Now we will execute all queries once again with old and new CE and compare estimations. We will start with a value from statistics histogram as literal.

SELECT * FROM dbo.Orders WHERE orderdate = '20130628' OPTION (QUERYTRACEON 9481);
GO
SELECT * FROM dbo.Orders WHERE orderdate = '20130628';

Figure2_3

We can see identical estimations again. Both optimizers come up with the same estimation, which is slightly different than the estimation before newly added rows. Although the statistics histogram still shows the same value for 29th June (203) the estimated number of rows is higher. The optimizer takes in account total number of rows and assumes that newly added rows have the same distribution as the rest of table. The following formula returns the estimated number of rows:

EQ_ROWS * Total Rows /Rows

where

EQ_ROWS is value in the column EQ_ROWS for the entry 2013-06-29

Total Rows = total number of rows in the table

Rows = total number of rows in the table at the time of last statistics update

In this example 203.0 * 1100000/1000000 = 223.3, exactly what we saw as Estimated Number of Rows in the execution plan.

Both CEs are aware of outdated statistics and scale the estimation according to total number of rows in the table. This assumption makes sense if newly added rows don’t change data distribution significantly.

OK. Let’s execute the second query:

SELECT * FROM dbo.Orders WHERE orderdate = '20130628' OPTION (QUERYTRACEON 9481);
GO
SELECT * FROM dbo.Orders WHERE orderdate = '20130628';

And observe the execution plans and Estimated Number of Rows:

Figure2_4

Oops, we found the first difference between old and new estimations! Old CE uses the same estimation as it was before newly added columns, i.e. refers to the statistics object only. New CE comes up with a little bit higher number. Maybe you guess, it is exactly 10% higher! It respects total number of rows and not number of rows at the time of the last statistics update. New CE maybe does not estimate always better, but at least it is consistent with the estimation for a date which is contained in the statistics histogram. Therefore I would say that this change is an improvement, a consistent implementation is better than inconsistent one. Which estimation better reflects the real life workloads depends on data distribution.

Now when we found one difference let’s try to find another one. Our third query uses a local variable instead of literal. Let’s execute it and see if we found discrepancy again:

DECLARE @d DATETIME ='20130629'; SELECT * FROM dbo.Orders WHERE orderdate = @d OPTION (QUERYTRACEON 9481);
GO
DECLARE @d DATETIME ='20130629'; SELECT * FROM dbo.Orders WHERE orderdate = @d;

Figure2_5

We can see the same situation again as we had for a literal which is not an entry in the statistics histogram. New CE has “corrected” the “All density” value from the statistics object and scaled it according to total number of rows.

Conclusion

In case of estimation for a simple filter predicate for a single table there is no difference between old and new CE if the statistics on filter column is up to date. Both estimations end up with the identical number for literal dates and local variables.

However, when the statistics objects is not up to date in two cases new CE estimates different. When a literal date is used in a statement and the value cannot ne find in the statistics histogram (i.e. when AVG_RANGE_ROWS is used for estimation) and when the table is filtered by a value from a local variable. In both cases new CE considers total number of rows in the table at the time of execution plan generation and not at the time when the statistics object was updated as it was in old CE.

The behavior of new CE is consistent and generally more accurate if data added or changed after the last statistics update does not significantly change overall data distribution. If newly added data have different distribution than we should update statistics manually.

Related Articles

2 CommentsLeave a comment

  • […] One our predicate (orderdate) has been affected with outdated statistic, the other predicate wasn’t affected. However, from the estimated plans you can see that the estimated number of rows is not the same for old and new CE. The old CE estimates 99.7 while the new CE comes with a little bit more 109.7. A “little bit” is 10% in this case. A detailed explanation about this change in the new CE you can find in my previous post. […]

  • Why in below formula you took 203 while values were spread assumed equally.
    _++++++±++++7+++++++++++++++++++++++
    EQ_ROWS * Total Rows /Rows

    where

    EQ_ROWS is value in the column EQ_ROWS for the entry 2013-06-29