Python Docs - Scikit-learn


What is Scikit-learn

Introduction

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language.[3] It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Bag of Words

To train a model on text, the documents first have to be turned into numerical feature vectors. The most intuitive way to do so is to use a bag-of-words representation:

  1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).

  2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.
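The two steps above can be sketched in plain Python. This is only an illustration: the two documents are made up, and splitting on whitespace is an assumed tokenization rule.

```python
# Toy training set; the documents are made up for illustration.
docs = ["the cat sat", "the dog sat on the cat"]

# Step 1: assign a fixed integer id to each word seen in the training set.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: X[i][j] counts how often word j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocabulary[word]] += 1

print(vocabulary)
print(X)
```

Most entries of X are zero once the vocabulary grows, which is why a sparse matrix is the usual storage choice.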

CountVectorizer

CountVectorizer converts a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
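A minimal sketch of CountVectorizer on a made-up two-document corpus, using the default settings. The corpus is an assumption for illustration only.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
corpus = ["the cat sat", "the dog sat on the cat"]

vectorizer = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix in one call.
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(sparse.issparse(X))   # the result is a SciPy sparse matrix
print(feature_names)
print(X.toarray())          # dense view, only sensible for small examples
```

Columns are ordered alphabetically by token, so column indices differ from the first-seen order of a hand-built dictionary.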

Code

Import libraries

Preparation on POJO

Loading data

Preparing data

Bag of words vectorization
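A minimal sketch of this step, assuming the data has already been split into training and test texts. The texts below are hypothetical. The vocabulary is learned from the training texts only and then reused for the test texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical train/test texts, made up for illustration.
train_texts = ["great movie", "terrible movie"]
test_texts = ["great unseen words"]

vectorizer = CountVectorizer()
# Learn the vocabulary from the training texts only.
X_train = vectorizer.fit_transform(train_texts)
# Reuse that vocabulary for the test texts; words not seen in training are ignored.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Both matrices have the same number of columns, so a classifier trained on X_train can be applied to X_test directly.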

Classification

Linear SVM
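A minimal sketch of a linear SVM text classifier, assuming scikit-learn's LinearSVC on top of CountVectorizer. The four labelled sentences are made up and stand in for the real training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled texts, made up for illustration.
train_texts = [
    "great movie, loved it",
    "wonderful acting and great plot",
    "terrible movie, hated it",
    "boring plot and awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Chain vectorization and classification so raw strings can be passed in.
svm_clf = make_pipeline(CountVectorizer(), LinearSVC())
svm_clf.fit(train_texts, train_labels)

svm_pred = svm_clf.predict(["loved the wonderful plot"])
print(svm_pred)
```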

Decision Tree
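A minimal sketch using DecisionTreeClassifier on bag-of-words counts. The data and the step names "vect" and "clf" are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled texts, made up for illustration.
train_texts = [
    "great movie, loved it",
    "wonderful acting and great plot",
    "terrible movie, hated it",
    "boring plot and awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

tree_clf = Pipeline([
    ("vect", CountVectorizer()),
    # random_state fixes how ties between equally good splits are broken.
    ("clf", DecisionTreeClassifier(random_state=0)),
])
tree_clf.fit(train_texts, train_labels)

# Each split tests the count of a single word.
tree_depth = tree_clf.named_steps["clf"].get_depth()
print(tree_depth)
print(tree_clf.predict(["loved the wonderful plot"]))
```

An unrestricted tree will usually fit the training set perfectly, so training accuracy says little about how well it generalizes.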

Naive Bayes
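A minimal sketch using MultinomialNB, the Naive Bayes variant designed for count features. The data is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled texts, made up for illustration.
train_texts = [
    "great movie, loved it",
    "wonderful acting and great plot",
    "terrible movie, hated it",
    "boring plot and awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

nb_clf = make_pipeline(CountVectorizer(), MultinomialNB())
nb_clf.fit(train_texts, train_labels)

# predict_proba returns one column per class, in the order of nb_clf.classes_.
nb_proba = nb_clf.predict_proba(["loved the wonderful plot"])
print(nb_clf.classes_)
print(nb_proba)
```

Unlike LinearSVC, this classifier exposes class probabilities, which is handy when a confidence threshold is needed.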

Logistic Regression
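A minimal sketch using LogisticRegression, including a look at the learned per-word weights. The data, the step names and max_iter=1000 are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical labelled texts, made up for illustration.
train_texts = [
    "great movie, loved it",
    "wonderful acting and great plot",
    "terrible movie, hated it",
    "boring plot and awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

lr_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
lr_clf.fit(train_texts, train_labels)

vect = lr_clf.named_steps["vect"]
lr = lr_clf.named_steps["clf"]
# For a binary problem coef_[0] holds one weight per word;
# positive weights push the prediction towards lr.classes_[1].
weights = {word: lr.coef_[0][idx] for word, idx in vect.vocabulary_.items()}

lr_pred = lr_clf.predict(["loved the wonderful plot"])
print(lr_pred)
print(sorted(weights, key=weights.get)[-3:])  # three most "positive" words
```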

Evaluation
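A minimal sketch of evaluation with sklearn.metrics. The true and predicted labels below are made up; in practice y_pred would come from calling predict on held-out test data.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels for five test documents.
y_true = ["pos", "neg", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

# Fraction of documents whose predicted label matches the true label.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)

# Per-class precision, recall and F1 as a formatted text table.
report = classification_report(y_true, y_pred)
print(report)
```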

Update the code 

Prepare data

Update Vectorizer

Tuning the model with Grid Search
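A minimal sketch of a grid search with GridSearchCV over a vectorizer-plus-classifier pipeline. The data, the parameter grid and cv=2 are assumptions chosen so the example runs on a tiny toy set; real data would use more folds and a grid suited to the problem.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled texts, made up for illustration.
texts = [
    "great movie, loved it",
    "wonderful acting and great plot",
    "loved the wonderful cast",
    "great fun, loved every minute",
    "terrible movie, hated it",
    "boring plot and awful acting",
    "hated the awful cast",
    "terrible and boring, hated every minute",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipeline = Pipeline([("vect", CountVectorizer()), ("clf", LinearSVC())])

# Keys follow the "<step name>__<parameter>" convention of pipelines.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every combination is scored with 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search is refit on all the data with the best parameters by default, so it can be used directly for prediction.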

IO for Model 

Save Model

Load Model
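Saving and loading can be sketched together as a round trip with joblib. The model, the data and the file name text_clf.joblib are hypothetical. Persisting the whole pipeline keeps the fitted vocabulary together with the classifier.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled texts, made up for illustration.
texts = ["great movie, loved it", "terrible movie, hated it"]
labels = ["pos", "neg"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "text_clf.joblib")
    joblib.dump(model, model_path)        # save the fitted pipeline
    restored = joblib.load(model_path)    # load it back

# The restored pipeline should give the same predictions as the original.
same_predictions = list(model.predict(texts)) == list(restored.predict(texts))
print(same_predictions)
```

joblib files are pickle-based, so only load model files from sources you trust.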

Additionally: Confusion Matrix
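A minimal sketch of a confusion matrix with sklearn.metrics.confusion_matrix. The labels below are made up; in practice y_pred would come from a fitted classifier.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for five test documents.
y_true = ["pos", "neg", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])
print(cm)
```

Off-diagonal entries are the errors: here one "pos" document was predicted as "neg".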