python-序列计数统计

今天在开发是遇见一个很有意思的问题，如何在一个列表中统计出现元素的次数。

此类场景经常出现在数据统计，对语料处理也经常用到

问题：如何统计序列中元素的次数

words = ['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex', 'Complex', 'is', 'better', 'than', 'complicated', 'Flat', 'is', 'better', 'than', 'nested', 'Sparse', 'is', 'better', 'than', 'dense']

通常处理这种问题的思路一般是新建一个字典作为计数操作

# 计数子弹
count_dict = {}
words = ['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is',
         'better', 'than', 'complex', 'Complex', 'is', 'better', 'than', 'complicated', 'Flat', 'is', 'better', 'than',
         'nested', 'Sparse', 'is', 'better', 'than', 'dense']
# 循环单词序列，如果不存在则创建初始值，如果存在则增加计数
for word in words:
    if word in count_dict.keys():
        count_dict[word] += 1
    else:
        count_dict[word] = 0

这样的处理可以解决我们提出的问题，有更好的更符合python的处理方法么？

Collections模块中的Counter类正是为了处理这种问题设计的

# 引入counter
>>> from collections import Counter
# 一行代码即可搞定
>>> word_counts = Counter(words)
>>> word_counts
Counter({'is': 6, 'better': 6, 'than': 6, 'Beautiful': 1, 'ugly': 1, 'Explicit': 1, 'implicit': 1, 'Simple': 1, 'complex': 1, 'Complex': 1, 'complicated': 1, 'Flat': 1, 'nested': 1, 'Sparse': 1, 'dense': 1})

获取结果之后我们可以很轻松的分析结果，比如我们想要获取排名最高的三个单词
1
2
>>> word_counts.most_common(3) [('is', 6), ('better', 6), ('than', 6)]
获取任意单词的计数值 (底层中Counter也是个字典映射)
1
2
>>> word_counts['ugly'] 1

手动增加计数也十分方便

>>> word_counts['ugly'] += 1
>>> word_counts['ugly']
2
>>> word_counts['ugly'] = 8
>>> word_counts['ugly']
8

增加元素也可以使用update语法

# 新增单词
>>> more_words = ['Readability', 'counts', 'Special', 'cases', "aren't", 'special', 'enough', 'to', 'break', 'the', 'rules', 'Although', 'practicality', 'beats', 'purity', 'Errors', 'should', 'never', 'pass', 'silently', 'Unless', 'explicitly', 'silenced']
# 直接使用update
>>> word_counts.update(more_words)
    
    
>>> word_counts
Counter({'ugly': 8, 'is': 6, 'better': 6, 'than': 6, 'Beautiful': 1, 'Explicit': 1, 'implicit': 1, 'Simple': 1, 'complex': 1, 'Complex': 1, 'complicated': 1, 'Flat': 1, 'nested': 1, 'Sparse': 1, 'dense': 1, 'Readability': 1, 'counts': 1, 'Special': 1, 'cases': 1, "aren't": 1, 'special': 1, 'enough': 1, 'to': 1, 'break': 1, 'the': 1, 'rules': 1, 'Although': 1, 'practicality': 1, 'beats': 1, 'purity': 1, 'Errors': 1, 'should': 1, 'never': 1, 'pass': 1, 'silently': 1, 'Unless': 1, 'explicitly': 1, 'silenced': 1})
    

Counter其他操作

>>> a=Counter(words)
>>> b=Counter(more_words)
>>> a + b
Counter({'is': 6, 'better': 6, 'than': 6, 'Beautiful': 1, 'ugly': 1, 'Explicit': 1, 'implicit': 1, 'Simple': 1, 'complex': 1, 'Complex': 1, 'complicated': 1, 'Flat': 1, 'nested': 1, 'Sparse': 1, 'dense': 1, 'Readability': 1, 'counts': 1, 'Special': 1, 'cases': 1, "aren't": 1, 'special': 1, 'enough': 1, 'to': 1, 'break': 1, 'the': 1, 'rules': 1, 'Although': 1, 'practicality': 1, 'beats': 1, 'purity': 1, 'Errors': 1, 'should': 1, 'never': 1, 'pass': 1, 'silently': 1, 'Unless': 1, 'explicitly': 1, 'silenced': 1})
  
>>> a - b
Counter({'is': 6, 'better': 6, 'than': 6, 'Beautiful': 1, 'ugly': 1, 'Explicit': 1, 'implicit': 1, 'Simple': 1, 'complex': 1, 'Complex': 1, 'complicated': 1, 'Flat': 1, 'nested': 1, 'Sparse': 1, 'dense': 1})

总结

1. 当面对任何数据制表或者计数是使用counter会比手写字典计数算法更有效率

“世界就像是个巨大的马戏团，它让你兴奋，却让我惶恐，因为我知道散场后永远是有限温存，无限心酸。”——Charlie Chaplin