Populating a “count matrix” with combinations of pandas DataFrame rows
Clash Royale CLAN TAG#URR8PPP
Populating a “count matrix” with combinations of pandas DataFrame rows
Let's say I have the following pandas DataFrame in Python3.x
import pandas as pd
dict1 = 'name':['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'], 'number':[42, 42, 42, 42, 42, 42, 42, 42, 42], 'count':[1, 2, 4, 5, 7, 1, 2, 5, 8]
df = pd.DataFrame(dict1)
print(df)
## name number count
## 0 dog 42 1
## 1 dog 42 2
## 2 cat 42 4
## 3 cat 42 5
## 4 cat 42 7
## 5 bird 42 1
## 6 bird 42 2
## 7 bird 42 5
## 8 bird 42 8
Column counts
contains integers from 1 to 8. My goal is to populate an 8 by 8 zero matrix with the count of each combination "pair" given the unique category in column name
.
counts
name
So, the combination pairs for dog
, cat
, and bird
are:
dog
cat
bird
dog: (1, 2)
cat: (4, 5), (4, 7), (5, 7)
bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)
For each pair, I add +1
to the corresponding entry in the zero matrix.
+1
This matrix will be symmetric, i.e. (n, m) = (m, n)
. The matrix given df
would be:
(n, m) = (m, n)
df
1 2 3 4 5 6 7 8
1: 0 2 0 0 1 0 0 1
2: 2 0 0 0 1 0 0 1
3: 0 0 0 0 0 0 0 0
4: 0 0 0 0 1 0 1 0
5: 1 1 0 1 0 0 1 1
6: 0 0 0 0 0 0 0 0
7: 0 0 0 1 1 0 0 0
8: 1 1 0 0 1 0 0 0
Note that (1,2)=(2,1)
has a count 2, from the dog
combination and the bird
combination.
(1,2)=(2,1)
dog
bird
(1) In order to do this, I think it would be best to create a list of "combinations tuples" given the pandas DataFrame.
That is, something like
list_combos = [(1, 2), (2, 1), (4, 5), (4, 7), (5, 7), (5, 4), (7, 4), (7, 5),
(1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8), (2, 1), (5, 1),
(8, 1), (5, 2), (8, 2), (8, 5)]
Given that the matrix is symmetric, perhaps it would be better to use:
list_combos2 = [(1, 2), (4, 5), (4, 7), (5, 7), (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)]
How could one calculate the permutations of entires in a pandas DataFrame, given the categorical value in 'names'?
(2) What would be the most algorithmically efficient (i.e. RAM) to populate this matrix, given the list of tuples?
I should be able to feed a list of tuples into a numpy array, but how does one fill in the zeros?
It doesn't matter to me. Was just wondering if they were related. Anyway I've already answered you.
– coldspeed
Aug 12 at 18:35
Nowhere did you explain why you need an 8x8 matrix instead of list-of-tuples (or possibly faster, an mx2 array). What do you actually use the 8x8 matrix for later on?
– smci
Aug 13 at 7:48
2 Answers
2
You can use groupby, iterate over combinations, and build your matrix like so:
import numpy as np
from itertools import combinations
mat = np.zeros((df['count'].max(), ) * 2)
idx =
for _, g in df.groupby('name'):
idx.extend(combinations(g['count'] - 1, r=2))
np.add.at(mat, list(zip(*idx)), 1)
mat += mat.T
array([[0., 2., 0., 0., 1., 0., 0., 1.],
[2., 0., 0., 0., 1., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 1., 0.],
[1., 1., 0., 1., 0., 0., 1., 1.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 0., 0., 0.],
[1., 1., 0., 0., 1., 0., 0., 0.]])
There may be a faster solution, but this is the cleanest one I can think of.
Ah,
itertools.combinations
is what I was looking for! I was trying something with itertools.product
. Thanks!– EB2127
Aug 12 at 18:36
itertools.combinations
itertools.product
beat me to it :) I would also suggest storing the full matrix is worthless as it is symmetric and has no diagonal. In contrast if memory / performance is important store only the top triangle?
– Alexander McFarlane
Aug 12 at 18:38
@AlexanderMcFarlane Yup, especially for large matrices, since like you mentioned, this is symmetrical. Good point!
– coldspeed
Aug 12 at 18:39
@AlexanderMcFarlane I agree storing the entire matrix is worthless. How would you revise this?
idx
is really all of the information– EB2127
Aug 12 at 18:40
idx
you can keep a a flat 1D list (
idx
). Then use a bit of logic to re-index where necessary.– Alexander McFarlane
Aug 12 at 18:41
idx
Using Numpy's bincount
bincount
from itertools import combinations, chain
from collections import defaultdict
d = defaultdict(list)
for tup in df.itertuples():
d[tup.name].append(tup.count)
i, j = zip(*chain(*(combinations(v, 2) for v in d.values())))
i, j = np.array(i + j) - 1, np.array(j + i) - 1
np.bincount(i * 8 + j, minlength=64).reshape(8, 8)
array([[0, 2, 0, 0, 1, 0, 0, 1],
[2, 0, 0, 0, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 1, 0],
[1, 1, 0, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0],
[1, 1, 0, 0, 1, 0, 0, 0]])
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Looks a lot like stackoverflow.com/questions/51811519/…?
– coldspeed
Aug 12 at 18:21