I have been trying to build a complex SQL query for hours, but I still haven't found a way to get the expected result.

Here is my table and my dataset:

create table Skills
(
ID varchar(10),
StartDate date,
EndDate date,
Skill varchar(10)
);

Insert into Skills values
('1','2021-01-01','2021-12-31','A'),
('1','2022-01-01','2022-12-31','B'),
('2','2021-01-01','2021-12-31','A'),
('2','2021-11-30','2022-12-31','B'),
('3','2021-01-01','2021-12-31','A'),
('3','2021-11-30','2022-12-31','B'),
('3','2022-11-30','2023-12-31','C'),
('4','2021-01-01','2021-12-31','A'),
('4','2022-01-01','2022-12-31','B'),
('4','2022-11-30','2023-12-31','C');

I would like to aggregate rows by ID, but only when their date ranges (StartDate, EndDate) overlap. Here is the expected result:

1, 2021-01-01, 2021-12-31, A
1, 2022-01-01, 2022-12-31, B
2, 2021-01-01, 2022-12-31, B
3, 2021-01-01, 2023-12-31, C
4, 2021-01-01, 2021-12-31, A
4, 2022-01-01, 2023-12-31, C

When rows with overlapping date ranges are aggregated, we need to keep the oldest StartDate, the newest EndDate, and the Skill associated with the newest EndDate.

I have tried many queries with partition by, lag, CTEs, etc.

Could you help me find the right solution, please?

Thanks, Regards

  • What is the precise MySQL version?
    – Akina
    Commented Jul 11, 2023 at 13:23
  • MySQL version is 8.0
    – Gosfly
    Commented Jul 11, 2023 at 13:45
  • "and the Skill associated to the newest EndDate": what if 2 rows have the same EndDate but a different Skill? What if the StartDate is the same too?
    – Akina
    Commented Jul 11, 2023 at 13:53
  • For a given ID, you can't have two rows with the same StartDate or the same EndDate (in my case, I mean)
    – Gosfly
    Commented Jul 11, 2023 at 14:13
  • Is this enforced by corresponding unique indexes? If there are no such indexes, then duplicates (including complete ones) MAY EXIST (for example, as a result of some programming error or failure), and nothing prevents this. You must take this into account.
    – Akina
    Commented Jul 12, 2023 at 4:25

1 Answer

This is a gaps-and-islands problem. To solve it, use lag() to fetch the previous row's end date and flag where each "island" starts, then use a cumulative sum() of those flags to number the islands:

Assuming EndDate is unique per ID:

select d.*, s.Skill
from (
  -- one row per island: earliest StartDate and latest EndDate
  select d.ID, min(d.StartDate) as StartDate, max(d.EndDate) as EndDate
  from (
      select d.*,
      -- running count of island starts: 0 when the row overlaps the previous one, else 1
      sum(case when DATEDIFF(prev_end_date, StartDate) > 0 then 0 else 1 end) over (partition by ID order by StartDate) as grp
      from (
            select d.*,
            lag(EndDate) over (partition by ID order by StartDate) as prev_end_date
            from Skills d
      ) d
  ) d
  group by d.ID, grp
) d
-- pick the Skill of the row that carries the island's latest EndDate
inner join Skills s on s.ID = d.ID and s.EndDate = d.EndDate
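
As a hedged variant of the same idea, the sketch below makes two changes that are my own assumptions rather than part of the answer above. It compares each StartDate with the running maximum of all earlier EndDate values instead of only the previous row's EndDate, so a short range nested inside a longer one does not split an island. It also picks the Skill with row_number() instead of a join, so a duplicated EndDate within an ID does not multiply rows. The names prev_max_end, rn, FirstStart and LastEnd are illustrative.

```sql
-- Sketch for MySQL 8.0, using the Skills table from the question.
with marked as (
  select s.*,
         -- latest EndDate among all earlier rows of the same ID (NULL for the first row)
         max(EndDate) over (partition by ID order by StartDate
                            rows between unbounded preceding and 1 preceding) as prev_max_end
  from Skills s
),
grouped as (
  select m.*,
         -- a row starts a new island unless it overlaps an earlier row
         sum(case when prev_max_end > StartDate then 0 else 1 end)
           over (partition by ID order by StartDate) as grp
  from marked m
),
ranked as (
  select g.*,
         -- rn = 1 marks the row with the latest EndDate in each island;
         -- ties are broken by the latest StartDate (an arbitrary choice)
         row_number() over (partition by ID, grp
                            order by EndDate desc, StartDate desc) as rn
  from grouped g
)
select ID,
       min(StartDate) as FirstStart,
       max(EndDate)   as LastEnd,
       max(case when rn = 1 then Skill end) as Skill
from ranked
group by ID, grp
order by ID, FirstStart;
```

On the sample data from the question, this should return the same six rows as the expected result. Like the original query, it treats ranges as overlapping only when the earlier EndDate is strictly later than the next StartDate.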


  • Man, you are so strong, thank you very much, this does it perfectly!
    – Gosfly
    Commented Jul 11, 2023 at 13:29
  • I just found a small issue with max(Skill): it picks the maximum in alphabetical order instead of the Skill from the row with the latest EndDate. In my example, if you replace ('4','2022-11-30','2023-12-31','C') with ('4','2022-11-30','2023-12-31','A'), the result will be 'B' instead of 'A'.
    – Gosfly
    Commented Jul 11, 2023 at 13:43
  • That is correct, let me check
    – SelVazi
    Commented Jul 11, 2023 at 13:50
  • 1
    That's correct, that's what I would have done too, thanks for your answer and your help
    – Gosfly
    Commented Jul 11, 2023 at 14:43
