Populating a “count matrix” with combinations of pandas DataFrame rows

Populating a “count matrix” with combinations of pandas DataFrame rows

Let's say I have the following pandas DataFrame in Python3.x

import pandas as pd dict1 = 'name':['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'], 'number':[42, 42, 42, 42, 42, 42, 42, 42, 42], 'count':[1, 2, 4, 5, 7, 1, 2, 5, 8] df = pd.DataFrame(dict1) print(df) ## name number count ## 0 dog 42 1 ## 1 dog 42 2 ## 2 cat 42 4 ## 3 cat 42 5 ## 4 cat 42 7 ## 5 bird 42 1 ## 6 bird 42 2 ## 7 bird 42 5 ## 8 bird 42 8

Column counts contains integers from 1 to 8. My goal is to populate an 8 by 8 zero matrix with the count of each combination "pair" given the unique category in column name.

counts

name

So, the combination pairs for dog, cat, and bird are:

dog

cat

bird

dog: (1, 2) cat: (4, 5), (4, 7), (5, 7) bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)

For each pair, I add +1 to the corresponding entry in the zero matrix.

+1

This matrix will be symmetric, i.e. (n, m) = (m, n). The matrix given df would be:

(n, m) = (m, n)

df

1 2 3 4 5 6 7 8 1: 0 2 0 0 1 0 0 1 2: 2 0 0 0 1 0 0 1 3: 0 0 0 0 0 0 0 0 4: 0 0 0 0 1 0 1 0 5: 1 1 0 1 0 0 1 1 6: 0 0 0 0 0 0 0 0 7: 0 0 0 1 1 0 0 0 8: 1 1 0 0 1 0 0 0

Note that (1,2)=(2,1) has a count 2, from the dog combination and the bird combination.

(1,2)=(2,1)

dog

bird

(1) In order to do this, I think it would be best to create a list of "combinations tuples" given the pandas DataFrame.

That is, something like

list_combos = [(1, 2), (2, 1), (4, 5), (4, 7), (5, 7), (5, 4), (7, 4), (7, 5), (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8), (2, 1), (5, 1), (8, 1), (5, 2), (8, 2), (8, 5)]

Given that the matrix is symmetric, perhaps it would be better to use:

list_combos2 = [(1, 2), (4, 5), (4, 7), (5, 7), (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)]

How could one calculate the permutations of entires in a pandas DataFrame, given the categorical value in 'names'?

(2) What would be the most algorithmically efficient (i.e. RAM) to populate this matrix, given the list of tuples?

I should be able to feed a list of tuples into a numpy array, but how does one fill in the zeros?

Looks a lot like stackoverflow.com/questions/51811519/…?
– coldspeed
Aug 12 at 18:21

It doesn't matter to me. Was just wondering if they were related. Anyway I've already answered you.
– coldspeed
Aug 12 at 18:35

Nowhere did you explain why you need an 8x8 matrix instead of list-of-tuples (or possibly faster, an mx2 array). What do you actually use the 8x8 matrix for later on?
– smci
Aug 13 at 7:48

2 Answers
2

You can use groupby, iterate over combinations, and build your matrix like so:

import numpy as np from itertools import combinations mat = np.zeros((df['count'].max(), ) * 2) idx = for _, g in df.groupby('name'): idx.extend(combinations(g['count'] - 1, r=2)) np.add.at(mat, list(zip(*idx)), 1) mat += mat.T array([[0., 2., 0., 0., 1., 0., 0., 1.], [2., 0., 0., 0., 1., 0., 0., 1.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 1., 0., 1., 0.], [1., 1., 0., 1., 0., 0., 1., 1.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 1., 1., 0., 0., 0.], [1., 1., 0., 0., 1., 0., 0., 0.]])

There may be a faster solution, but this is the cleanest one I can think of.

Ah, itertools.combinations is what I was looking for! I was trying something with itertools.product. Thanks!
– EB2127
Aug 12 at 18:36

itertools.combinations

itertools.product

beat me to it :) I would also suggest storing the full matrix is worthless as it is symmetric and has no diagonal. In contrast if memory / performance is important store only the top triangle?
– Alexander McFarlane
Aug 12 at 18:38

@AlexanderMcFarlane Yup, especially for large matrices, since like you mentioned, this is symmetrical. Good point!
– coldspeed
Aug 12 at 18:39

@AlexanderMcFarlane I agree storing the entire matrix is worthless. How would you revise this? idx is really all of the information
– EB2127
Aug 12 at 18:40

idx

you can keep a a flat 1D list (idx). Then use a bit of logic to re-index where necessary.
– Alexander McFarlane
Aug 12 at 18:41

idx

Using Numpy's bincount

bincount

from itertools import combinations, chain from collections import defaultdict d = defaultdict(list) for tup in df.itertuples(): d[tup.name].append(tup.count) i, j = zip(*chain(*(combinations(v, 2) for v in d.values()))) i, j = np.array(i + j) - 1, np.array(j + i) - 1 np.bincount(i * 8 + j, minlength=64).reshape(8, 8) array([[0, 2, 0, 0, 1, 0, 0, 1], [2, 0, 0, 0, 1, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 1, 0], [1, 1, 0, 1, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0, 0, 0], [1, 1, 0, 0, 1, 0, 0, 0]])

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

LmnKinZW,qPYgjPH94SA1Mg,C,vsq,6oKqyhKp,WD

搜尋此網誌

Sfyjdyy