3

I have a very large table containing price history.

CREATE TABLE [dbo].[SupplierPurchasePrice](
    [SupplierPurchasePriceId] [int] IDENTITY(1,1) PRIMARY KEY,
    [ExternalSupplierPurchasePriceId] [varchar](20) NULL,
    [ProductId] [int] NOT NULL,
    [SupplierId] [int] NOT NULL,
    [Price] [money] NOT NULL,
    [PreviousPrice] [money] NULL,
    [SupplierPurchasePriceDate] [date] NOT NULL,
    [Created] [datetime] NULL,
    [Modified] [datetime] NULL,
)

Per Product(Id) and Supplier(Id) I have hundreds of price records. Now there is a need to remove the bulk of the data but still keep some historic data. For each Product(Id) and Supplier(Id) i want to keep, let's say, 14 records. But not the first or last 14. I want to keep the first and last record. And then keep 12 records evenly in between the first and last. That way I keep some history in tact.

I cannot figure out a way to do this directly with a stored procedure instead of through my c# ORM (which is way too slow).

2
  • 2
    can you add sql fiddle? Commented Jan 25, 2014 at 13:21
  • Although I like the question, I'm not sure that this is a particular robust approach to archiving the data. I think a better approach would be to snapshot the prices at fixed points in time historically, so you have reliable comparisons for both the suppliers and the products. Commented Jan 25, 2014 at 14:04

3 Answers 3

3

Here is a direct counting approach to solving the problem:

select spp.*
from (select spp.*,
             sum(12.5 / (cnt - 1)) over (partition by SupplierId, ProductId
                                         order by SupplierPurchasePriceId
                                        ) as cum
      from (select spp.*,
                   row_number() over (partition by SupplierId, ProductId
                                      order by SupplierPurchasePriceId
                                     ) as seqnum,
                  count(*) over (partition by SupplierId, ProductId) as cnt,
            from SupplierPurchasePrice spp
           ) spp
     ) spp
where seqnum = 1 or seqnum = cnt or cnt <= 14 or
      (floor(cumgap) <> floor(cumgap - 12.5/(cnt - 1)));

The challenge is deciding where the 12 records in between go. This calculates an average "gap" in the records, as 12.5/(cnt - 1). This is a constant that is then accumulated over the records. It will go from basically 0 to 12.5 in the largest record. The idea is to grab any record where this passes an integer value. So, if the cumulative goes form 2.1 to 2.3, then the record is not chosen. If it goes from 2.9 to 3.1 then the record is chosen.

The number 12.5 is not magic. Any number between 12 and 13 should do. Except for the issue with choosing the oldest and most recent values. I chose 12.5 to be extra sure that these don't count for the 12.

You can see the same logic working here at SQL Fiddle. The flag column shows which would be chosen and the totflag validates that exactly 14 are chosen.

1

I would try something like

select
  to_keep.SupplierPurchasePriceId
from
   (select
      foo.SupplierPurchasePriceId,
      row_number() over (partition by ProductId, SupplierId, tile_num order by Created) as takeme
    from
      (select
         SupplierPurchasePriceId,
         ProductId,
         SupplierId,
         Created,
         ntile(13) over(partition by ProductId, SupplierId order by Created) as tile_num
       from
         SupplierPurchasePrice
      ) foo
   ) to_keep
where
  to_keep.takeme = 1

union

select distinct
  last_value(SupplierPurchasePriceId) over (partition by ProductId, SupplierId order by Created range between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) as SupplierPurchasePriceId
from
  SupplierPurchasePrice

This should give primary keys of rows to be kept. Performance may vary. Not tested.

1

(Caveat - I read your question incorrectly. This retains the first, last and every 14th record in between, not necessarily a total of 14 records.)

After backing up your database, and trying a dry run this first to check that this will delete the required data (Assumed here is that the chronological order of records is set by SupplierPurchasePriceDate - change the ORDER BY as appropriate if not the case)

WITH CTE AS
(
    SELECT 
        [SupplierId], 
        [ProductId],
        [SupplierPurchasePriceId], 
        [SupplierPurchasePriceDate],
        ROW_NUMBER()  OVER (Partition BY [SupplierId], [ProductId] 
            ORDER BY [SupplierPurchasePriceDate]) -1 AS Rnk
    FROM [dbo].[SupplierPurchasePrice]
)
SELECT cteRank.*
    FROM 
        CTE cteRank
        JOIN 
        (SELECT ProductId, SupplierId, MAX(Rnk) as MaxRnk
            FROM CTE cteMax
            GROUP BY ProductId, SupplierID) X
        ON cteRank.SupplierId = X.SupplierId AND cteRank.ProductId = X.ProductId
    WHERE cteRank.Rnk % 12 != 0 AND cteRank.Rnk !=  X.MaxRnk;

If this works as expected, and since it seems you already have a surrogate key, then the delete step is simply deleting the matched surrogate keys:

DELETE FROM [dbo].[SupplierPurchasePrice]
WHERE [SupplierPurchasePriceId] IN (...)

It works by ranking the price data by date and then deleting records which aren't modulo 12 (Change as per your requirement). The -1 is because ROW_NUMBER() is 1 based. The first record is retained because as the first Modulo match. The MaxRnk step is to also preserve the latest price for each Supplier, Product pair.

Note that this will retain the first, every 12th thereafter, and the last. So there will likely be an uneven gap between the latest (Max) and second last record (the last Modulo = 0). But surely, close enough?

SqlFiddle here

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.