Populating a “count matrix” with combinations of pandas DataFrame rows

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP



Populating a “count matrix” with combinations of pandas DataFrame rows



Let's say I have the following pandas DataFrame in Python3.x


import pandas as pd

dict1 = 'name':['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'], 'number':[42, 42, 42, 42, 42, 42, 42, 42, 42], 'count':[1, 2, 4, 5, 7, 1, 2, 5, 8]
df = pd.DataFrame(dict1)

print(df)
## name number count
## 0 dog 42 1
## 1 dog 42 2
## 2 cat 42 4
## 3 cat 42 5
## 4 cat 42 7
## 5 bird 42 1
## 6 bird 42 2
## 7 bird 42 5
## 8 bird 42 8



Column counts contains integers from 1 to 8. My goal is to populate an 8 by 8 zero matrix with the count of each combination "pair" given the unique category in column name.


counts


name



So, the combination pairs for dog, cat, and bird are:


dog


cat


bird


dog: (1, 2)
cat: (4, 5), (4, 7), (5, 7)
bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)



For each pair, I add +1 to the corresponding entry in the zero matrix.


+1



This matrix will be symmetric, i.e. (n, m) = (m, n). The matrix given df would be:


(n, m) = (m, n)


df


1 2 3 4 5 6 7 8
1: 0 2 0 0 1 0 0 1
2: 2 0 0 0 1 0 0 1
3: 0 0 0 0 0 0 0 0
4: 0 0 0 0 1 0 1 0
5: 1 1 0 1 0 0 1 1
6: 0 0 0 0 0 0 0 0
7: 0 0 0 1 1 0 0 0
8: 1 1 0 0 1 0 0 0



Note that (1,2)=(2,1) has a count 2, from the dog combination and the bird combination.


(1,2)=(2,1)


dog


bird



(1) In order to do this, I think it would be best to create a list of "combinations tuples" given the pandas DataFrame.



That is, something like


list_combos = [(1, 2), (2, 1), (4, 5), (4, 7), (5, 7), (5, 4), (7, 4), (7, 5),
(1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8), (2, 1), (5, 1),
(8, 1), (5, 2), (8, 2), (8, 5)]



Given that the matrix is symmetric, perhaps it would be better to use:


list_combos2 = [(1, 2), (4, 5), (4, 7), (5, 7), (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)]



How could one calculate the permutations of entires in a pandas DataFrame, given the categorical value in 'names'?



(2) What would be the most algorithmically efficient (i.e. RAM) to populate this matrix, given the list of tuples?



I should be able to feed a list of tuples into a numpy array, but how does one fill in the zeros?





Looks a lot like stackoverflow.com/questions/51811519/…?
– coldspeed
Aug 12 at 18:21





It doesn't matter to me. Was just wondering if they were related. Anyway I've already answered you.
– coldspeed
Aug 12 at 18:35





Nowhere did you explain why you need an 8x8 matrix instead of list-of-tuples (or possibly faster, an mx2 array). What do you actually use the 8x8 matrix for later on?
– smci
Aug 13 at 7:48




2 Answers
2



You can use groupby, iterate over combinations, and build your matrix like so:


import numpy as np
from itertools import combinations

mat = np.zeros((df['count'].max(), ) * 2)
idx =
for _, g in df.groupby('name'):
idx.extend(combinations(g['count'] - 1, r=2))

np.add.at(mat, list(zip(*idx)), 1)
mat += mat.T

array([[0., 2., 0., 0., 1., 0., 0., 1.],
[2., 0., 0., 0., 1., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 1., 0.],
[1., 1., 0., 1., 0., 0., 1., 1.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 0., 0., 0.],
[1., 1., 0., 0., 1., 0., 0., 0.]])



There may be a faster solution, but this is the cleanest one I can think of.





Ah, itertools.combinations is what I was looking for! I was trying something with itertools.product. Thanks!
– EB2127
Aug 12 at 18:36



itertools.combinations


itertools.product





beat me to it :) I would also suggest storing the full matrix is worthless as it is symmetric and has no diagonal. In contrast if memory / performance is important store only the top triangle?
– Alexander McFarlane
Aug 12 at 18:38






@AlexanderMcFarlane Yup, especially for large matrices, since like you mentioned, this is symmetrical. Good point!
– coldspeed
Aug 12 at 18:39






@AlexanderMcFarlane I agree storing the entire matrix is worthless. How would you revise this? idx is really all of the information
– EB2127
Aug 12 at 18:40


idx





you can keep a a flat 1D list (idx). Then use a bit of logic to re-index where necessary.
– Alexander McFarlane
Aug 12 at 18:41



idx



Using Numpy's bincount


bincount


from itertools import combinations, chain
from collections import defaultdict

d = defaultdict(list)
for tup in df.itertuples():
d[tup.name].append(tup.count)

i, j = zip(*chain(*(combinations(v, 2) for v in d.values())))
i, j = np.array(i + j) - 1, np.array(j + i) - 1

np.bincount(i * 8 + j, minlength=64).reshape(8, 8)

array([[0, 2, 0, 0, 1, 0, 0, 1],
[2, 0, 0, 0, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 1, 0],
[1, 1, 0, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[1, 1, 0, 0, 1, 0, 0, 0]])






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

Creating a leaderboard in HTML/JS