Preparing Text Data for Machine Learning with scikit-learn

blairchen

data-science

Published: Dec 8, 2019

scikit-learn

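scikit-learn offers many feature extraction methods, which this post groups into image feature extraction and text feature extraction.

Text feature extraction lives in the sklearn.feature_extraction.text module, so the rest of this post walks through the classes it provides.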

CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 (the default) keeps every term that appears in at least one document
vectorizer = CountVectorizer(min_df=1)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Learn the vocabulary and build the document-term count matrix (sparse)
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()

print(feature_names.tolist())
print(X.toarray())

The program prints:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
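
To see where these numbers come from, the count matrix can be rebuilt by hand with only the standard library. This is a minimal sketch, not how scikit-learn is implemented: it assumes the default CountVectorizer behavior of lowercasing the text and keeping tokens of two or more word characters, and the names tokenize, vocabulary, and matrix are illustrative only.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Assumption: mimic the default token pattern (two or more word characters)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all distinct tokens; each term's
# position in this list is its column index in the matrix.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each row counts how often every vocabulary term occurs in one document.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary term in alphabetical order, which is why the 2 in the second row sits in the 'second' column: that word appears twice in the second document.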