I have two numpy arrays x and y of the same length, and I am trying to build a square matrix A such that the (i,j) entry contains a 1 if a certain relationship holds between x[i], x[j], y[i] and y[j], and a 0 otherwise.
My current method starts with A as a zero matrix (of dimensions len(x) by len(x)) and then uses a double for loop over all i and all j>i (the matrix is symmetric about the diagonal). However, this takes a very long time to run: the length of x is around 15,000, so the loop visits roughly 112 million pairs. The vast majority of entries in A will be 0, so this seems quite an inefficient way of doing it. I thought that masked arrays could potentially be used, but I haven't figured out how to use them in this situation. Any help would be greatly appreciated!
Here is the code I currently have. `df1` is a data frame in which one column contains surnames and another a date.
import numpy as np
import pandas as pd
import scipy.sparse
from datetime import datetime

date_format = '%Y-%m-%d %H:%M:%S'
df1 = pd.DataFrame([['Smith', '2024-12-16 12:00:00'], ['Smith', '2024-12-16 13:00:00'], ['Doe', '2024-12-16 12:01:00'], ['Doe', '2024-12-16 12:04:00']])
df1.columns = ['Surname', 'Date']
# Parse the date strings into datetimes, row by row
df1['Date'] = df1.apply(lambda r: datetime.strptime(r['Date'], date_format), axis=1)
x1 = df1['Surname'].to_numpy()
y1 = df1['Date'].to_numpy()
# Dense matrix of zeros, one row and one column per data frame row
A1 = scipy.sparse.lil_matrix((len(df1.index), len(df1.index))).todense()
# Fill the upper triangle: 1 if same surname and dates less than 5 minutes apart
for i in range(len(x1)):
    for j in range(i, len(x1)):
        A1[i, j] = int((x1[i] == x1[j]) & (abs(np.timedelta64(y1[i] - y1[j], 'm')) < np.timedelta64(5, 'm')))
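For comparison, here is a minimal sketch of one way the dense double loop might be avoided. It assumes the relationship is exactly the one in the loop above (same surname, dates less than 5 minutes apart) and that each surname group is small enough for a pairwise comparison inside the group. The function name `build_pair_matrix` and the `window` parameter are hypothetical names chosen for illustration, and the sketch compares the untruncated time difference rather than first converting it to whole minutes.

```python
import numpy as np
import pandas as pd
import scipy.sparse


def build_pair_matrix(surnames, dates, window=np.timedelta64(5, "m")):
    """Sparse upper-triangular 0/1 matrix (hypothetical helper).

    Entry (i, j), i <= j, is 1 when rows i and j share a surname and
    their dates differ by less than `window`.
    """
    surnames = np.asarray(surnames)
    dates = np.asarray(dates, dtype="datetime64[ns]")
    n = len(surnames)
    # Start with empty index arrays so concatenation also works for n == 0.
    rows = [np.empty(0, dtype=np.intp)]
    cols = [np.empty(0, dtype=np.intp)]
    # Map each surname to the integer positions of its rows, so that only
    # same-surname pairs are ever compared.
    groups = pd.Series(np.arange(n)).groupby(surnames).indices
    for idx in groups.values():
        d = dates[idx]
        # Pairwise time differences within one surname group (broadcasting).
        close = np.abs(d[:, None] - d[None, :]) < window
        gi, gj = np.nonzero(close)
        i, j = idx[gi], idx[gj]
        keep = i <= j  # mirror the loop, which only fills j >= i
        rows.append(i[keep])
        cols.append(j[keep])
    rows = np.concatenate(rows)
    cols = np.concatenate(cols)
    data = np.ones(len(rows), dtype=np.int8)
    return scipy.sparse.coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()


# Same four rows as the example data frame above.
x_demo = np.array(["Smith", "Smith", "Doe", "Doe"])
y_demo = np.array(
    ["2024-12-16T12:00:00", "2024-12-16T13:00:00",
     "2024-12-16T12:01:00", "2024-12-16T12:04:00"],
    dtype="datetime64[s]",
)
A_demo = build_pair_matrix(x_demo, y_demo)
```

The result stays sparse throughout, so memory scales with the number of matching pairs rather than with len(x) squared. If one surname covers a large share of the rows, the per-group comparison becomes the bottleneck again and a sort-based approach would probably be needed instead.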
You say `x` and `y` are numpy arrays but then tag the question with sparse-matrix. Which one is it? `numpy.array` or scipy sparse matrix?
`df`. It's also supposed to be minimal. Your dataframe has no meaning for the question and just complicates the picture. Please read the help section on how to ask a good question before asking.