What is Scikit-learn
Introduction
Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language.[3] It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Bag of Words
The most intuitive way to turn text documents into numerical feature vectors is to use a bag-of-words representation:
Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.
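The two steps above can be sketched in plain Python. This is only an illustration: the toy corpus, the whitespace tokenization, and the names docs, vocab and X are assumptions made for the example, not part of scikit-learn.

```python
# Hypothetical toy corpus; the documents and variable names are illustrative only.
docs = ["the cat sat", "the cat saw the dog"]

# Step 1: assign a fixed integer id to each word occurring in the training set.
# Splitting on whitespace is a simplifying assumption, not a real tokenizer.
vocab = {}
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Step 2: X[i][j] holds the number of occurrences in document i
# of the word whose dictionary index is j.
X = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocab[word]] += 1

print(vocab)
print(X)
```

Most entries of X are zero once the vocabulary grows, which is why a dense list of lists like this one does not scale to real corpora.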
CountVectorizer
CountVectorizer converts a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
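A minimal usage sketch, assuming scikit-learn is installed and using a made-up two-document corpus, might look like the following. It relies on the default settings of CountVectorizer, which lowercase the text and keep only tokens of two or more characters.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat", "the cat saw the dog"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and returns the document-term count matrix.
X = vectorizer.fit_transform(docs)

# The result is a SciPy sparse matrix, not a dense array.
print(sparse.issparse(X))

# Mapping from each token to its column index in X.
print(vectorizer.vocabulary_)

# Convert to a dense array only for small examples like this one.
print(X.toarray())
```

Keeping the matrix sparse matters in practice: a real vocabulary can contain tens of thousands of words, while any single document uses only a small fraction of them.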