Missing value imputation in Python

Question

I have two huge vectors item_clusters and beta. The element item_clusters [ i ] is the cluster id to which the item i belongs. The element beta [ i ] is a score given to the item i. Scores are {-1, 0, 1, 2, 3}.

Whenever the score of a particular item is 0, I have to impute that with the average non-zero score of other items belonging to the same cluster. What is the fastest possible way to to this?

This is what I have tried so far. I converted the item_clusters to a matrix clusters_to_items such that the element clusters_to_items [ i ][ j ] = 1 if the cluster i contains item j, else 0. After that I am running the following code.

# beta (1x1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1x1.3M) numpy.array
# clust_to_items (1000x1.3M) csr_matrix

alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
interact_score = alpha_z.tolist()[0]

# The interact_score is the required modified beta 
# This is used to do some work that is very fast

The problem is that this code has to run 150K times and it is very slow. It will take 12 days to run according to my estimate.

Edit: I believe, I need some very different idea in which I can directly use item_clusters, and do not need to iterate through each cluster separately.

Ok, I'm gonna be the unpopular kid here: Python's a great language, but if performance is as much as a concern as it is here, and especially if you're dealing with huge amounts of raw data, and this is your hot spot, where most of the time is spent, implement it in C. — Marcus Müller, Commented May 21, 2016 at 18:35
can you figure out which actual line of code takes the most time? — Marcus Müller, Commented May 21, 2016 at 18:37

DSM · Accepted Answer · 2016-05-21 19:24:22Z

I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:

def fast_impute(num_clusters, item_clusters, beta):

    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)

    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)

    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)

    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)

    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])

    return imputed_beta

which gives me

>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms

and

>>> imputed[:5]
array([ 1.27582017, -1.        , -1.        ,  1.        ,  3.        ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561

Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:

>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
   beta  cluster
0     0      506
1    -1      968
2    -1      873
3     1      179
4     3      269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
      beta  cluster
0  1.27582      506
1 -1.00000      968
2 -1.00000      873
3  1.00000      179
4  3.00000      269

Marcus Müller · Accepted Answer · 2016-05-21 18:40:04Z

0

My suspicion is that

alpha_beta = beta.multiply(alpha)

is a terrible idea, because you only need the first elements of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:

sum_row = alpha_beta.sum(1)[0, 0]

So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.

answered May 21, 2016 at 18:40

Marcus Müller

36.1k4 gold badges57 silver badges100 bronze badges

The idea is that both alpha and beta are vectors. I need to take the sum of elements of beta only if they belong to the cluster clust (i.e. alpha values are 1). That is why I am multiplying alpha and beta element-wise. alpha_beta sum is a matrix of one element. To get the actual float value, I have to explicitly mention [0][0].
– Sonu Mishra
Commented May 21, 2016 at 18:45
@SonuKMishra hint: sum(element-wise product) can be calculated in one step, namely the dot product
– Marcus Müller
Commented May 21, 2016 at 18:48
But I do not need only sum. I need average for which I need the number of such elements as well.
– Sonu Mishra
Commented May 21, 2016 at 18:51
Also, I am using one more multiplication, so getting rid of one will at max give me a 2x speed-up. I am in need of much more than that. Please look at the edit note that I have put in the question. I believe, if I can somehow get rid of the for loop, it will be awesome!
– Sonu Mishra
Commented May 21, 2016 at 19:00

Add a comment |

Collectives™ on Stack Overflow

Missing value imputation in Python

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged
python-3.x
numpy
scipy
missing-data
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python-3.xnumpyscipymissing-data or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
python-3.x
numpy
scipy
missing-data
or ask your own question.