TL;DR: see full code on bottom
You have to break down your task in simpler subtasks. Basically, you want to apply one or several calculations on your dataframe on successive rows, this grouped by company. This means you will have to use groupby
and apply
.
Let's start with generating an example dataframe. Here I used lowercase letters as words for the "sentences" column.
import numpy as np
import string
df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
'company': np.repeat(list('ABC'), 10)})
df
output:
date sentences company
0 2020 [z] A
1 2019 [s, f, g, a, d, a, h, o, c] A
2 2018 [b] A
…
26 2014 [q] C
27 2013 [i, w] C
28 2012 [o, p, i, d, f, w, k, d] C
29 2011 [l, f, h, p] C
Concatenate the "sentences" column of the next row (previous year):
pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
output:
date sentences company date_pre sentences_pre company_pre
0 2020 [z] A 2019.0 [s, f, g, a, d, a, h, o, c] A
1 2019 [s, f, g, a, d, a, h, o, c] A 2018.0 [b] A
2 2018 [b] A 2017.0 [x, n, r, a, s, d] A
3 2017 [x, n, r, a, s, d] A 2016.0 [u, n, g, u, k, s, v, s, o] A
4 2016 [u, n, g, u, k, s, v, s, o] A 2015.0 [v, g, d, i, b, z, y, k] A
5 2015 [v, g, d, i, b, z, y, k] A 2014.0 [q, o, p] A
6 2014 [q, o, p] A 2013.0 [j, s, s] A
7 2013 [j, s, s] A 2012.0 [g, u, l, g, n] A
8 2012 [g, u, l, g, n] A 2011.0 [v, p, y, a, s] A
9 2011 [v, p, y, a, s] A 2020.0 [a, h, c, w] B
…
Define a function to compute a number of distance metrics (here the two defined in the question). TypeError is caught to handle the case where there is no row to compare with (one occurrence per group).
def compare_lists(s):
l1 = s['sentences_pre']
l2 = s['sentences']
try:
return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
'yoy_diff_len': len(l2)-len(l1),
'yoy_nb_common': len(set(l1).intersection(set(l2))),
'company': s['company'],
})
except TypeError:
return
This works on a sub-dataframe that was filtered to match only one company:
df2 = df.query('company == "A"')
pd.concat([df2, df2.shift(-1).add_suffix('_pre')], axis=1).dropna().apply(compare_lists, axis=1
output:
years yoy_diff_len yoy_nb_common company
0 2020–2019 -4 0 A
1 2019–2018 6 1 A
2 2018–2017 1 0 A
3 2017–2016 1 0 A
4 2016–2015 -7 0 A
5 2015–2014 4 0 A
6 2014–2013 1 0 A
7 2013–2012 -1 0 A
8 2012–2011 -5 1 A
Now you can make a function to construct each dataframe per group and apply the computation:
def group_compare(df):
df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
return df2.apply(compare_lists, axis=1)
and use this function to apply on each group:
df.groupby('company').apply(group_compare)
Full code:
import numpy as np
import string
df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
'company': np.repeat(list('ABC'), 10)})
def compare_lists(s):
l1 = s['sentences_pre']
l2 = s['sentences']
try:
return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
'yoy_diff_len': len(l2)-len(l1),
'yoy_nb_common': len(set(l1).intersection(set(l2))),
'company': s['company'],
})
except TypeError:
return
def group_compare(df):
df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
return df2.apply(compare_lists, axis=1)
## uncomment below to remove "company" index
df.groupby('company').apply(group_compare) #.reset_index(level=0, drop=True)
output:
years yoy_diff_len yoy_nb_common company
company
A 0 2020–2019 -8 0 A
1 2019–2018 8 0 A
2 2018–2017 -5 0 A
3 2017–2016 -3 2 A
4 2016–2015 1 3 A
5 2015–2014 5 0 A
6 2014–2013 0 0 A
7 2013–2012 -2 0 A
8 2012–2011 0 0 A
B 10 2020–2019 3 0 B
11 2019–2018 -6 1 B
12 2018–2017 3 0 B
13 2017–2016 -5 1 B
14 2016–2015 2 2 B
15 2015–2014 4 1 B
16 2014–2013 3 0 B
17 2013–2012 -8 0 B
18 2012–2011 1 1 B
C 20 2020–2019 8 1 C
21 2019–2018 -7 0 C
22 2018–2017 0 1 C
23 2017–2016 7 0 C
24 2016–2015 -3 0 C
25 2015–2014 3 0 C
26 2014–2013 -1 0 C
27 2013–2012 -6 2 C
28 2012–2011 4 2 C
[list of strings]
is the same? How the comparison should be done (0->0, 1->1
or0->1, 0->2, 0->N, 1->0, 1->1, 1->N
)? Give an example at least for 2020-A, 2019-A, 2018-A, please.