Programing

numpy 벡터에서 가장 빈번한 숫자 찾기

lottogame 2020. 8. 18. 08:07
반응형

numpy 벡터에서 가장 빈번한 숫자 찾기


파이썬에 다음 목록이 있다고 가정합니다.

a = [1,2,3,1,2,1,1,1,3,2,2,1]

이 목록에서 가장 빈번한 번호를 깔끔하게 찾는 방법은 무엇입니까?


목록에 음수가 아닌 정수가 모두 포함 된 경우 numpy.bincounts를 살펴 봐야합니다.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

그리고 아마도 np.argmax를 사용하십시오.

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print np.argmax(counts)

더 복잡한 목록 (음수 또는 정수가 아닌 값 포함)의 np.histogram경우 유사한 방식으로 사용할 수 있습니다 . 또는 numpy를 사용하지 않고 파이썬으로 작업하고 싶다면 collections.Counter이런 종류의 데이터를 처리하는 좋은 방법입니다.

from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print b.most_common(1)

당신은 사용할 수 있습니다

(values,counts) = np.unique(a,return_counts=True)
ind=np.argmax(counts)
print values[ind]  # prints the most frequent element

일부 요소가 다른 요소만큼 자주 발생하는 경우이 코드는 첫 번째 요소 만 반환합니다.


SciPy 를 사용하려는 경우 :

>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0

여기에서 찾을 수있는 일부 솔루션의 성능 (iPython 사용)

>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>> 
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>> 
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>> 
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>> 
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in a:
...         d[i] += 1
...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
... 
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>> 
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>> 

Best is 'max' with 'set'


Also if you want to get most frequent value(positive or negative) without loading any modules you can use the following code:

lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print max(map(lambda val: (lVals.count(val), val), set(lVals)))

While most of the answers above are useful, in case you: 1) need it to support non-positive-integer values (e.g. floats or negative integers ;-)), and 2) aren't on Python 2.7 (which collections.Counter requires), and 3) prefer not to add the dependency of scipy (or even numpy) to your code, then a purely python 2.6 solution that is O(nlogn) (i.e., efficient) is just this:

from collections import defaultdict

a = [1,2,3,1,2,1,1,1,3,2,2,1]

d = defaultdict(int)
for i in a:
  d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]

I like the solution by JoshAdel.

But there is just one catch.

The np.bincount() solution only works on numbers.

If you have strings, collections.Counter solution will work for you.


Expanding on this method, applied to finding the mode of the data where you may need the index of the actual array to see how far away the value is from the center of the distribution.

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

Remember to discard the mode when len(np.argmax(counts)) > 1


In Python 3 the following should work:

max(set(a), key=lambda x: a.count(x))

Here is a general solution that may be applied along an axis, regardless of values, using purely numpy. I've also found that this is much faster than scipy.stats.mode if there are a lot of unique values.

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is > 1.9 numpy.unique will suffice
    if all([ndim == 1,
            int(numpy.__version__.split('.')[0]) >= 1,
            int(numpy.__version__.split('.')[1]) >= 9]):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[slices] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = numpy.ogrid[slices]
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[index], counts[index]

I'm recently doing a project and using collections.Counter.(Which tortured me).

The Counter in collections have a very very bad performance in my opinion. It's just a class wrapping dict().

What's worse, If you use cProfile to profile its method, you should see a lot of '__missing__' and '__instancecheck__' stuff wasting the whole time.

Be careful using its most_common(), because everytime it would invoke a sort which makes it extremely slow. and if you use most_common(x), it will invoke a heap sort, which is also slow.

Btw, numpy's bincount also have a problem: if you use np.bincount([1,2,4000000]), you will get an array with 4000000 elements.

참고URL : https://stackoverflow.com/questions/6252280/find-the-most-frequent-number-in-a-numpy-vector

반응형