1

I have a large data frame. Sample below

| year | sentences         | company |
|------|-------------------|---------|
| 2020 | [list of strings] | A       |
| 2019 | [list of strings] | A       |
| 2018 | [list of strings] | A       |
| ...  | ....              | ...     |
| 2020 | [list of strings] | Z       |
| 2019 | [list of strings] | Z       |
| 2018 | [list of strings] | Z       |

I want to compare the sentences column by company by year so as to get a year on year change.
Example: for company A, I would like to apply an operator such as sentence similarity or some distance metric for the [list of strings]2020 and [list of strings]2019, then [list of strings]2019 and [list of strings]2018.

Similarly for company B, C, ... Z.

How can this be achieved?

EDIT

length of [list of strings] is variable. So some simple quantifying operators could be

  1. Difference in number of elements --> length([list of strings]2020) - length([list of strings]2019)
  2. Count of common elements --> length(set([list of strings]2020, [list of strings]2019))

The comparisons should be:

| years     | Y-o-Y change (Some function) | company |
|-----------|------------------------------|---------|
| 2020-2019 | 15                           | A       |
| 2019-2018 | 3                            | A       |
| 2018-2017 | 55                           | A       |
| ...       | ....                         | ...     |
| 2020-2019 | 33                           | Z       |
| 2019-2018 | 32                           | Z       |
| 2018-2017 | 27                           | Z       |
3
  • It is not entirely clear to me from your question what you are trying to do. If you would like to apply a function to a column you can easily do that with df.apply. From there I can show you how to calculate year on year change of a numerical feature?
    – user10343540
    Commented Jul 4, 2021 at 0:57
  • Do the length of list of strings [list of strings] is the same? How the comparison should be done (0->0, 1->1 or 0->1, 0->2, 0->N, 1->0, 1->1, 1->N)? Give an example at least for 2020-A, 2019-A, 2018-A, please.
    – Corralien
    Commented Jul 4, 2021 at 1:18
  • I have added an edit to clarify.
    – sheth7
    Commented Jul 4, 2021 at 5:15

1 Answer 1

0

TL;DR: see full code on bottom

You have to break down your task in simpler subtasks. Basically, you want to apply one or several calculations on your dataframe on successive rows, this grouped by company. This means you will have to use groupby and apply.

Let's start with generating an example dataframe. Here I used lowercase letters as words for the "sentences" column.

import numpy as np
import string

df = pd.DataFrame({'date':      np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company':   np.repeat(list('ABC'), 10)})
df

output:

    date                    sentences company
0   2020                          [z]       A
1   2019  [s, f, g, a, d, a, h, o, c]       A
2   2018                          [b]       A
…
26  2014                          [q]       C
27  2013                       [i, w]       C
28  2012     [o, p, i, d, f, w, k, d]       C
29  2011                 [l, f, h, p]       C

Concatenate the "sentences" column of the next row (previous year):

pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)

output:

    date                    sentences company  date_pre                sentences_pre company_pre
0   2020                          [z]       A    2019.0  [s, f, g, a, d, a, h, o, c]           A
1   2019  [s, f, g, a, d, a, h, o, c]       A    2018.0                          [b]           A
2   2018                          [b]       A    2017.0           [x, n, r, a, s, d]           A
3   2017           [x, n, r, a, s, d]       A    2016.0  [u, n, g, u, k, s, v, s, o]           A
4   2016  [u, n, g, u, k, s, v, s, o]       A    2015.0     [v, g, d, i, b, z, y, k]           A
5   2015     [v, g, d, i, b, z, y, k]       A    2014.0                    [q, o, p]           A
6   2014                    [q, o, p]       A    2013.0                    [j, s, s]           A
7   2013                    [j, s, s]       A    2012.0              [g, u, l, g, n]           A
8   2012              [g, u, l, g, n]       A    2011.0              [v, p, y, a, s]           A
9   2011              [v, p, y, a, s]       A    2020.0                 [a, h, c, w]           B
…

Define a function to compute a number of distance metrics (here the two defined in the question). TypeError is caught to handle the case where there is no row to compare with (one occurrence per group).

def compare_lists(s):
    l1 = s['sentences_pre']
    l2 = s['sentences']
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2)-len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                         })
    except TypeError:
        return

This works on a sub-dataframe that was filtered to match only one company:

df2 = df.query('company == "A"')
pd.concat([df2, df2.shift(-1).add_suffix('_pre')], axis=1).dropna().apply(compare_lists, axis=1

output:

       years  yoy_diff_len  yoy_nb_common company
0  2020–2019            -4              0       A
1  2019–2018             6              1       A
2  2018–2017             1              0       A
3  2017–2016             1              0       A
4  2016–2015            -7              0       A
5  2015–2014             4              0       A
6  2014–2013             1              0       A
7  2013–2012            -1              0       A
8  2012–2011            -5              1       A

Now you can make a function to construct each dataframe per group and apply the computation:

def group_compare(df):
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
    return df2.apply(compare_lists, axis=1)

and use this function to apply on each group:

df.groupby('company').apply(group_compare)

Full code:

import numpy as np
import string

df = pd.DataFrame({'date':      np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company':   np.repeat(list('ABC'), 10)})

def compare_lists(s):
    l1 = s['sentences_pre']
    l2 = s['sentences']
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2)-len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                         })
    except TypeError:
        return
    
def group_compare(df):
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)
                                           ## uncomment below to remove "company" index
df.groupby('company').apply(group_compare) #.reset_index(level=0, drop=True)

output:

            years     yoy_diff_len     yoy_nb_common     company
company                     
A     0     2020–2019            -8            0            A
      1     2019–2018             8            0            A
      2     2018–2017            -5            0            A
      3     2017–2016            -3            2            A
      4     2016–2015             1            3            A
      5     2015–2014             5            0            A
      6     2014–2013             0            0            A
      7     2013–2012            -2            0            A
      8     2012–2011             0            0            A
B    10     2020–2019             3            0            B
     11     2019–2018            -6            1            B
     12     2018–2017             3            0            B
     13     2017–2016            -5            1            B
     14     2016–2015             2            2            B
     15     2015–2014             4            1            B
     16     2014–2013             3            0            B
     17     2013–2012            -8            0            B
     18     2012–2011             1            1            B
C    20     2020–2019             8            1            C
     21     2019–2018            -7            0            C
     22     2018–2017             0            1            C
     23     2017–2016             7            0            C
     24     2016–2015            -3            0            C
     25     2015–2014             3            0            C
     26     2014–2013            -1            0            C
     27     2013–2012            -6            2            C
     28     2012–2011             4            2            C

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.