New Cardinality Estimator Part 8 – Out of Histogram

Part8_2One month after the RTM release of SQL Server 2014 you can find several articles about new cardinality estimator.

In this article I will play with changes related to estimations in case of outdated statistics. We’ll see some easy reproducable (and pretty real-life) cases where the old and new estimator come up with values with the difference of five orders of magnitude! This can lead to different execution plans, especially for large tables and in this article we’ll see if this is a good thing or not.

As usuall we’ll create a sample table called Orders and populate it with 10 M rows. Use the following code  to create and populate the test table:

IF OBJECT_ID(‘dbo.Orders’, ‘U’) IS NOT NULL DROP TABLE dbo.Orders;
CREATE TABLE dbo.Orders(
id INT IDENTITY(1,1) NOT NULL,
custid INT NOT NULL,
orderdate DATETIME NOT NULL,
amount MONEY NOT NULL,
CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (id ASC)
)
GO
— Fill the table with 1M rows
DECLARE @date_from DATETIME = ‘19991101’;

DECLARE @date_to DATETIME = ‘20131231’;

DECLARE @number_of_rows INT = 10000000;
INSERT INTO dbo.Orders(custid,orderdate,amount)
SELECT 1 + ABS(CHECKSUM(NEWID())) % 10000 AS custid,
(SELECT(@date_from +(ABS(CAST(CAST( NewID() AS BINARY(8)) AS INT))%CAST((@date_to – @date_from)AS INT)))) AS orderdate,
50 + ABS(CHECKSUM(NEWID())) % 1000 AS amount
FROM dbo.GetNums(1, @number_of_rows)
GO
–Create index on the orderdate column
CREATE INDEX IX1 ON dbo.Orders(orderdate);
GO

We want to query this table by filtering the orderdate column – therefore we have also created a non-clustered index on this column.  At this point we have an index and the statistics on the orderdate column is created in the background and it is up-to-date. When statistics is up to date there is no significant discrepancy between estimations under the old and new CE. Interessting things happen when statistics is out of date, especially for large tables.

To demonstrate this let’s add additional rows into the Orders table, but not enough to trigger the statistics update.

Use the following code to add 1M rows into the Orders table:

DECLARE @date_from DATETIME = ‘20140101’;
DECLARE @date_to DATETIME = ‘20140531’;
DECLARE @number_of_rows INT = 1000000;
INSERT INTO dbo.Orders(custid,orderdate,amount)
SELECT 1 + ABS(CHECKSUM(NEWID())) % 10000 AS custid,
(SELECT(@date_from +(ABS(CAST(CAST( NewID() AS BINARY(8)) AS INT))%CAST((@date_to – @date_from)AS INT)))) AS orderdate,
50 + ABS(CHECKSUM(NEWID())) % 1000 AS amount
FROM dbo.GetNums(1, @number_of_rows)
GO

We have simply added 1M new rows, which is 10% of total rows. As you can see the newly added orders have orderdate in 2014 which menas that the statistics for the orderdate column is not aware of all these new rows.

OK, let’s play with some queries with dates from 2014 and see how both cardinality estimators proceed with these challenges.

Let’s start with a small one predicate query:

SELECT * FROM dbo.Orders WHERE orderdate = ‘20140430’ OPTION (QUERYTRACEON 9481);
GO
SELECT * FROM dbo.Orders WHERE orderdate = ‘20140430’;

newcepart8_1

We can see significant difference in estimations. The old optimizer does not know about new rows at all and estimates 1 row. The new optimizer comes up with about 2000x more rows, which is significant difference. Where does this number come from? This is an average number of orders per day and can be seen in the statistics header in the column All density. In my sample data this attribut has value of 0.0001932741 and the total number of rows in the statistic header is (still) 10M. These two values make the estimated number of rows. As we can see the new CE uses the information about the table cardinality (which is known to the old CE too) and whenever it see that the statistics is not up to date and that the parameter participating in the predicate cannot be found in the statistics histogram, it assumes that the distribution for data out of histogram boundaries is uniform and generates estimations based on All density info. This is similar behaviour as when we would wrap the literal in a local variable. However, it is not the same estimation: for a local variable the estimation is 2126, for an out of histogram literal 1932. In the first case the All density is multiplied with the table cardinality, in the second with the info about the total number of rows from the statistics.

So, in case of outdated statistics the new CE behaves significantly different than the old one.

This difference is not so important here – the execution plan is same, there is no operators which require memory where an underestimation can cause sort or hash warnings, but it can be and we’ll see this in the next example.

Part8_2