Natural Language Processing on iOS with Turi Create

Original article: www.raywenderlich.com

Natural Language Processing, or NLP, is the discipline of taking unstructured text and discerning some characteristics about it. To help you do this, Apple’s operating systems provide functions to understand text using the following techniques:

  • Language identification
  • Lemmatization (identification of the root form of a word)
  • Named entity recognition (proper names of people, places and organizations)
  • Parts of speech identification
  • Tokenization

In this tutorial, you’ll use tokenization and a custom machine learning model, or ML model, to identify the author of a given poem, or at least the poet the poem most closely emulates.

Note: This tutorial assumes you’re already familiar with the basics of iOS development and Swift. If you’re new to either of these topics, check out our iOS Development and Swift Language tutorials.

Getting Started

You may have already run into NLP in apps where sections of text are automatically turned into links or tags, or when text is automatically analyzed for emotional charge (called “sentiment” in the biz). NLP is widely used by Apple for data detectors, keyboard auto-suggest and Siri suggestions. Apps with search capabilities often use NLP to find related information and to efficiently transform input text into a canonical and, therefore indexable, form.

The app you’ll be building for this tutorial will take a poem and compare its text against a Core ML model that’s trained with the words from poems of famous, i.e. public-domain, authors. To build the Core ML model, you’ll be using Turi Create, an open-source Python project from Apple that creates and trains ML models.

Turi Create won’t make you a machine learning expert because it hides almost all the internal workings and mathematics involved. On the plus side, it means you don’t have to be a machine learning expert to use machine learning algorithms in your app!

App Overview

The first thing you need to do is download the starter app. You can find the Download Materials link at the top or bottom of this tutorial.

Inside the downloaded materials, you’ll find two project folders, a JSON file, and a Core ML model. Don’t worry about the JSON and ML model files; you’ll use those a bit later. Open the KeatsOrYeats-starter folder and fire up the KeatsOrYeats.xcodeproj inside.

Once running, copy and paste your favorite Yeats poem. Here is an example, “On Being Asked for a War Poem,” by William Butler Yeats:

I think it better that in times like these
A poet's mouth be silent, for in truth
We have no gift to set a statesman right;
He has had enough of meddling who can please
A young girl in the indolence of her youth,
Or an old man upon a winter’s night.

Press Return to run the analysis. At the top, you’ll see the app’s results, indicating “P. Laureate” wrote the poem! The prediction comes with 50% confidence, along with a 10% confidence that the poem matches the works of “P. Inglorious”.

This is obviously not correct, but that’s because the results are hard-coded and there’s no actual analysis.

Download Every Known Poem

Sometimes it’s useful to start developing an app using a simple, brute-force approach. The first-order solution to author identification is to get a copy of every known poem, or at least known poems by a set list of poets. That way, the app can do a simple string compare and see if the poem matches any of the authors. As Robert Burns once said, “Easy peasy.”

Nice try, but there are two major problems. First, poems (especially older ones) don’t always have canonical formatting (line breaks, spacing and punctuation), so it’s hard to do a blind string compare. Second, your full-featured app should identify which author the entered poem most resembles, even if the poem isn’t known to the app or isn’t actually a work by that author.
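
The formatting problem is easy to see in a quick sketch: even a trivial difference in layout defeats an exact comparison, and whitespace normalization only papers over one of many variations. (This is an illustration only; none of these names appear in the app.)

```python
# Two renderings of the same Yeats lines, differing only in layout.
known = "I think it better that in times like these\nA poet's mouth be silent"
entered = "I think it better that in times like these A poet's mouth be silent"

# An exact comparison fails on the line break alone.
print(known == entered)  # False

# Collapsing all whitespace rescues this case...
normalized = " ".join(known.split()) == " ".join(entered.split())
print(normalized)  # True
```

...but punctuation, spelling and edition differences would still defeat it, and no amount of normalization can match a poem that isn't in the collection at all.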

There’s got to be a better way… And there is! Machine learning lets you create models of text you can then use to classify never-before-seen text into a known category.

Intro to Machine Learning: Text-Style

There are many different algorithms covered under the umbrella of machine learning. This CGPGrey video gives an excellent layman’s introduction.

The main takeaway is that the resulting model is a mathematical black box that takes input text, transforms it, and produces a decision or, in this case, a probability that the text matches a given author. Inside that box is a series of weighted values that compute that probability. These weights are “discovered” (refined) over a series of epochs in which they are adjusted to reduce the overall error.

The simplest model is a linear regression, which fits a line to a series of points. You may be familiar with the old equation y = mx + b. In this case, you have a series of known x & y’s, and the training of the model is to figure out the “m” (the weights) and “b”.

In a standard training scenario, there will be a guess for m & b, an error computed and, then over successive epochs, those get nudged closer and closer to find a value that minimizes the error. When presented with a never-before-seen “x”, the model can predict what the “y” value will be. Here is an in-depth article on how it works with Turi Create.
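
The nudging described above can be sketched in a few lines. This is a toy gradient-descent loop, not Turi Create's actual implementation; it fits m and b to points generated from y = 2x + 1:

```python
# Points sampled from the line y = 2x + 1.
points = [(x, 2 * x + 1) for x in range(10)]

m, b = 0.0, 0.0          # initial guesses for the weights
learning_rate = 0.01

# Each epoch nudges m and b against the gradient of the squared error.
for epoch in range(2000):
    grad_m = sum(2 * (m * x + b - y) * x for x, y in points) / len(points)
    grad_b = sum(2 * (m * x + b - y) for x, y in points) / len(points)
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(round(m, 2), round(b, 2))  # converges toward 2.0 and 1.0

# A never-before-seen "x" can now be used to predict "y":
print(round(m * 11 + b, 1))      # roughly 23.0
```

The error shrinks a little every epoch, which is exactly the "nudging" the paragraph above describes, just with two weights instead of the tens of thousands a text model carries.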

Of course, real-world models are far more complicated and take into account many different input variables.

Bag of Words

Machine learning inspects and analyzes an input’s features. Features in this context are the important or salient values about the input or, mathematically speaking, the independent variables in the computation. From the download materials, go ahead and open corpus.json, which will be the input file for training the model. Inside, you’ll see an array of JSON objects. Take a look at the first item:

{
    "title": "When You Are Old",
    "author": "William Butler Yeats",
    "text": "When you are old and grey and full of sleep,\nAnd nodding by the fire, take down this book,\nAnd slowly read, and dream of the soft look\nYour eyes had once, and of their shadows deep;\nHow many loved your moments of glad grace,\nAnd loved your beauty with love false or true,\nBut one man loved the pilgrim Soul in you,\nAnd loved the sorrows of your changing face;\nAnd bending down beside the glowing bars,\nMurmur, a little sadly, how Love fled\nAnd paced upon the mountains overhead\nAnd hid his face amid a crowd of stars."
}

In this case, a single “input” has three columns: title, author and text. The text column will be the only feature for the model, and title is not taken into account. The author is the class the model is tasked with computing, which is sometimes called the label or dependent variable.

If the whole text is used as the input, then the model basically becomes the naïve straight-up comparison discussed above. Instead, specific aspects of the text have to be fed into the model. The default way of handling text is as a bag of words, or BOW. Imagine breaking up all the text into its individual words and throwing them into a bag so they lose their context, ordering and sentence structure. This way, the only dimension that’s retained is the frequency of the collection of words.

In other words, the BOW is a map of words to word counts.
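
A minimal Python sketch of that transformation, using only the standard library (the regex here is a stand-in for the smarter tokenization Turi Create and NSLinguisticTagger perform):

```python
from collections import Counter
import re

def bag_of_words(text):
    # Lowercase and grab runs of letters/apostrophes; real tokenizers are
    # smarter about hyphens, contractions and non-Roman scripts.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

line = "When you are old and grey and full of sleep"
print(bag_of_words(line))
# Counter({'and': 2, 'when': 1, 'you': 1, 'are': 1, 'old': 1,
#          'grey': 1, 'full': 1, 'of': 1, 'sleep': 1})
```

Notice that ordering is gone: "and" appearing twice is the only structure that survives.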

For this tutorial, each poem gets transformed into a BOW, with the assumption that one author will use similar words across different poems, and that other authors will tend toward different word choices.

Each word then becomes a dimension for optimizing the model. In this paltry example of 518 poems, there are 24,939 different words used.

The Logistic Classifier

Turi Create will make a logistic classifier for this type of analysis, which actually works a little differently than a linear regression.

To oversimplify a bit: instead of interpolating a single value, a logistic classifier computes a probability (from 0 to 1) for each class by multiplying each word’s contribution to that class by the number of times that word appears, then summing across all the words.

Take the first line of the first Yeats poem: “When you are old and grey and full of sleep”. Then take the first line of the first Keats poem: “Happy is England! I could be content”.

If these two lines were the total input, each of these words would contribute wholly to its author, because there are no overlapping words. If the Keats line were, instead, “Happy are England”, then the word “are” would contribute 50/50 to each author.

Word    Keats Yeats
-------------------
And       0     1
Are       0     1
Be        1     0
Could     1     0
Content   1     0
England   1     0
Grey      0     1
Happy     1     0
I         1     0
Is        1     0
Full      0     1
Of        0     1
Old       0     1
Sleep     0     1
When      0     1
You       0     1

Now, if you take the poem you saw earlier, “On Being Asked for a War Poem”, as the input, only a few of its words appear in the training list: I and be from the Keats column, and of and old from the Yeats column. With such scant overlap, the prediction hinges on a handful of incidental words rather than on anything distinctive about either poet’s style.

Hopefully this illustrates why a large data set is required to accurately train models!
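
The tallying in the table above can be sketched directly. These per-word contributions are hypothetical, mirroring a subset of the table, and the real logistic classifier learns fractional weights and normalizes differently, but the spirit is the same:

```python
# Hypothetical per-word contributions, mirroring a slice of the table above.
contributions = {
    'happy':   {'Keats': 1, 'Yeats': 0},
    'are':     {'Keats': 0, 'Yeats': 1},
    'england': {'Keats': 1, 'Yeats': 0},
    'old':     {'Keats': 0, 'Yeats': 1},
    'grey':    {'Keats': 0, 'Yeats': 1},
}

def score(bag):
    # Multiply each word's contribution by its count, then sum per class.
    totals = {'Keats': 0.0, 'Yeats': 0.0}
    for word, count in bag.items():
        for author, weight in contributions.get(word, {}).items():
            totals[author] += weight * count
    # Normalize the raw tallies into probabilities.
    total = sum(totals.values()) or 1.0
    return {author: t / total for author, t in totals.items()}

print(score({'happy': 1, 'are': 1, 'england': 1}))
# Keats gets 2/3 and Yeats 1/3 for the altered "Happy are England" line
```

Words the model has never seen simply contribute nothing, which is why tiny training sets produce such lopsided, brittle predictions.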

Using Turi Create

Core ML is iOS’s machine learning engine, supporting multiple types of models built with different machine learning SDKs such as scikit-learn and Keras. Apple’s open-source library, Turi Create, reduces the overhead of learning how to use these libraries, and handles choosing the best type of model for a given task. It does this either by having a pre-chosen model type for the activity or by running several models against each other to see which performs best.

Turi Create is app-specific, rather than model-specific. This means you specify the type of problem you want to solve, rather than choosing the type of model you want to use. This way, it can choose the right model for the job.

Like most machine learning tools, those compatible with Core ML are written in Python. Very little understanding of Python is necessary to get started. Having said that, knowing Python is useful if you want to expand how you train models or customize the input data, or if you run into trouble.

Setting Up Python

The following instructions assume you already have Python installed, which is likely if you have a Mac running the latest Xcode.

Run the following command in Terminal to check if you have Python installed already:

python -V

If Python is installed, you’ll see its version number. If it isn’t, you’ll need to follow the instructions at wiki.python.org/moin/Beginn… to download and install it.

You’ll also need pip installed on your machine, which comes with the Python installation. Run the following command to make sure it’s installed:

which pip

If the result isn’t a path ending in /bin/pip, you’ll need to install it by following pip.pypa.io/en/stable/i….

Finally, it’s suggested you use virtualenv to install Turi Create. This isn’t generally part of the default Mac setup, but you can install it from Terminal by using:

pip install virtualenv

If you get any permission errors, preface the command with sudo:

sudo pip install virtualenv

If you get any SSL errors, you’ll need to add the --trusted-host command line option.

pip install --trusted-host pypi.python.org virtualenv

Virtualenv is a tool for creating virtual Python environments. This means you can install a series of tools and libraries in isolation in a named environment. With virtual environments, you can build and run an app with a known set of dependencies, and then go and create a separate environment for a new app that has a different set of tools, possibly with versions that would otherwise conflict with the first environment.

From an iOS perspective, think of it as being able to have an environment with Xcode 8.2, Cocoapods 1.0 and Fastlane 2.4 to build one app, and then be able to launch another environment with Xcode 9.1, Cocoapods 1.2 and Fastlane 2.7 to build another app, without those two conflicting. This is just one more reminder of the sophistication of open-source developer tools with large communities.

Installing Turi Create

With Python in hand, for the first step, you’ll create a new virtual environment in which to install Turi Create.

Open a Terminal window, and cd into the directory where you downloaded this tutorial’s materials. For reference, corpus.json should be in the current folder before continuing.

From there, enter the following command:

virtualenv venv

This creates a new virtual environment named venv in your project directory.

When you have completed that, activate the environment:

source venv/bin/activate

When there is an active environment, you’ll see a (venv) prepended to the terminal prompt. If you need to get out of the virtual environment, run the deactivate command.

Finally, make sure the environment is still activated and install Turi Create:

pip install -U turicreate

If you have any issues with installation, you can run a more explicit install command:

python2.7 -m pip install turicreate 

This installs the latest version of the Turi Create library, along with all its dependencies. Now it’s time to actually start using Python!

Using Turi Create to train a model

First, open a new Terminal window with the virtual environment active, and launch Python in the same directory as your corpus.json file:

python

You can also use a more interactive environment like iPython, which provides better history and tab-completion features, but that’s outside the scope of this tutorial.

Next, run the following command:

import turicreate as tc

This will import the Turi Create module and make it accessible from the symbol tc.

Next, load the JSON data:

data = tc.SFrame.read_json('corpus.json', orient='records')

This loads the data from the JSON file into an SFrame, the data container for Turi Create. Its data is organized in columns, like a spreadsheet, and it has powerful functions for manipulation. This is important for massaging data to get the best input for training a model. It’s also optimized for loading from disk storage, which matters for large data sets that can easily overwhelm RAM.
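
If you want a feel for the shape of the data outside of Turi Create, you can poke at the same kind of JSON with nothing but the standard library. This sketch uses a tiny inline two-poem sample rather than the real corpus.json:

```python
import json
from collections import Counter

# A tiny inline stand-in for corpus.json (an array of records).
sample = '''[
  {"title": "When You Are Old", "author": "William Butler Yeats",
   "text": "When you are old and grey and full of sleep"},
  {"title": "Happy Is England", "author": "John Keats",
   "text": "Happy is England! I could be content"}
]'''

records = json.loads(sample)
poems_per_author = Counter(record['author'] for record in records)
print(poems_per_author)
```

Run against the real file (json.load(open('corpus.json'))), the same count would reveal the 518 rows and the per-author imbalance that comes up later in the tutorial.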

Type in data to see what you pulled out. The generated output shows the size and data types contained within, as well as the first few rows of data.

Columns:
    author  str
    text    str
    title   str
Rows: 518
Data:
+----------------------+-------------------------------+
|        author        |              text             |
+----------------------+-------------------------------+
| William Butler Yeats | When you are old and grey ... |
| William Butler Yeats | Had I the heavens' embroid... |
| William Butler Yeats | Were you but lying cold an... |
| William Butler Yeats | Wine comes in at the mouth... |
| William Butler Yeats | That crazed girl improvisi... |
| William Butler Yeats | Turning and turning in the... |
| William Butler Yeats | I made my song a coat\nCov... |
| William Butler Yeats | I will arise and go now, a... |
| William Butler Yeats | I think it better that in ... |
|      John Keats      | Happy is England! I could ... |
+----------------------+-------------------------------+
+-------------------------------+
|             title             |
+-------------------------------+
|        When You Are Old       |
| He Wishes For The Cloths O... |
| He Wishes His Beloved Were... |
|        A Drinking Song        |
|         A Crazed Girl         |
|       The Second Coming       |
|             A coat            |
|   The Lake Isle Of Innisfree  |
| On being asked for a War Poem |
| Happy Is England! I Could ... |
+-------------------------------+
[518 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Note: If you get an error loading the data, make sure you launched Python in the same directory as the JSON file, or specify a full path to it.

Now that you have the data, for the next step, you’ll create a model by running:

model = tc.sentence_classifier.create(data, 'author', features=['text'])

This creates a sentence classifier given the loaded data, specifying the author to be the class labels, and the text column to be the input variable. To build a more accurate classifier, you can compute and then provide additional features such as meter, line length and rhyme scheme.

This command creates the model and trains it on data, reserving about 5% of the rows as a validation set. This means 95% of the data is used for training, and the remaining rows are used to test the accuracy of the trained model.
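
The holdout split Turi Create performs under the hood is conceptually simple. Here is a stdlib sketch of reserving roughly 5% of 518 rows for validation (the seed and row stand-ins are purely illustrative):

```python
import random

random.seed(42)          # reproducible shuffle, for illustration only
rows = list(range(518))  # stand-ins for the 518 poems

random.shuffle(rows)
cutoff = int(len(rows) * 0.95)
train, validation = rows[:cutoff], rows[cutoff:]

print(len(train), len(validation))  # 492 26
```

The model never sees the validation rows during training, so its accuracy on them approximates how it will behave on brand-new poems.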

Due to the poor quality of the training data (that is, a large number of words for only a handful of examples per author), if the training fails or gets terminated before the maximum 10 iterations are complete, just re-run the command. The training is not deterministic, so trying again might lead to a different result, depending on the starting values for the coefficients.

Finally, run this command to export the model in the Core ML format:

model.export_coreml('Poets.mlmodel')

Voilà! With four lines of Python, you’ve built and trained an ML model ready to use from an iOS app.

Using Core ML

Now that you have a Core ML model, for the next step, you’ll use it in the app.

Import the Model

Core ML lets you use a pre-trained model in your app to make predictions or perform classifications on user input. To use the model, drag the generated Poets.mlmodel into the project navigator. If you skipped the model-generation section of this tutorial, or had trouble creating the model, you can use the one included at the root of the project zip (Download Materials link at top or bottom of the tutorial).

Xcode automatically parses the model file and shows you the important information in the editor panel.

The first section, Machine Learning Model, tells you about the model’s metadata, which Turi Create automatically created for you when generating the model.

The most important line here is the Type. This tells you what kind of model it is. In this case it’s a Pipeline Classifier. A classifier means that it takes the input and tries to assign a label to it. In this case, that is an “author best match”. The pipeline part means that the model is a series of mathematical transforms used on the input data to calculate the class probabilities.

The next section, Model Class, shows the generated Swift class to be used inside the app. This class is the code wrapper for the model, and it’s covered in the next step of the tutorial.

The third section, Model Evaluation Parameters, describes the inputs and outputs of the model.

Here, there is one input, text, which is a dictionary of string keys (individual words) to double values (the number of times that word appears in the input poem).

There are also two outputs. The first, author, is the most likely match for the poem’s author. The other output, authorProbability, is the percent confidence of a match for each known author.

You’ll see that, for some inputs, even though there is only one “best match”, that match itself might have a very small probability, or there might be two or three matches that are all reasonably close.
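
The relationship between the two outputs mirrors a common pattern: author is just the highest-probability entry in the authorProbability dictionary. A small sketch with hypothetical numbers:

```python
# Hypothetical output probabilities from the model.
author_probability = {
    'William Butler Yeats': 0.48,
    'John Keats': 0.42,
    'Emily Dickinson': 0.10,
}

# The "best match" is simply the key with the largest probability.
best_author = max(author_probability, key=author_probability.get)
print(best_author)                      # William Butler Yeats
print(author_probability[best_author])  # 0.48
```

This is also why a UI should surface the runner-up probabilities: a best match at 48% against a runner-up at 42% is a much weaker verdict than the label alone suggests.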

Now, click on the arrow next to Poets in the Model Class section. This will open Poets.swift, an automatically generated Swift file. This contains a series of classes that form a convenience wrapper for accessing the model. In particular, it has a simple initializer, a prediction(text:) function that does the actual evaluation by the model, and two classes that wrap the input and output so that you can use standard Swift values in the calling code, instead of worrying about the Core ML data types.

NSLinguisticTagger

Before you can use the model, you need the input text, which comes from a free-form text box; you’ll need to convert it into something compatible with PoetsInput. Even though Turi Create handles creating the BOW (bag of words) from the SFrame training input, Core ML does not yet have that capability built in. That means you need to transform the text into a dictionary of word counts manually.

You could write a function that takes the input text, splits it at the spaces, trims punctuation and then counts the remainder. Or, even better, use a context-aware text processing API: NSLinguisticTagger.
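
To see why the hand-rolled approach is fragile, here is a naive Python version of "split at spaces and trim punctuation", along with the kind of edge case that trips it up (an illustration of the problem, not of NSLinguisticTagger itself):

```python
import string

def naive_words(text):
    # Split on spaces, then trim punctuation from the ends of each chunk.
    return [chunk.strip(string.punctuation) for chunk in text.split()]

print(naive_words("A poet's mouth be silent"))
# ['A', "poet's", 'mouth', 'be', 'silent'] ... interior apostrophe survives

print(naive_words("mouth be silent,for in truth"))
# ['mouth', 'be', 'silent,for', 'in', 'truth'] ... a missing space
# after the comma fuses two words into one bogus token
```

Every special case you patch (curly quotes, dashes, contractions, non-Roman scripts) adds another; the context-aware tagger makes that entire class of bug someone else's problem.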

NSLinguisticTagger is the Cocoa SDK for processing natural language. As of iOS 11, its functionality is backed by its own Core ML model, which is much more complicated than the one shown here.

It’s hard making sure a character-parsing algorithm is smart enough to work around all the edge cases in a language — apostrophe and hyphen punctuation, for example. Even though this app just covers poets from America and the United Kingdom writing in English, there’s no reason the model couldn’t also have poems written in other languages. Introducing parsing for multiple languages, especially non-Roman character languages, can get very difficult very quickly. Fortunately, you can leverage NSLinguisticTagger to simplify this.

In PoemViewController.swift add the following helper function to the private extension:

func wordCounts(text: String) -> [String: Double] {
  // 1
  var bagOfWords: [String: Double] = [:]
  // 2
  let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
  // 3
  let range = NSRange(text.startIndex..., in: text)
  // 4
  let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]


  // 5
  tagger.string = text
  // 6
  tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { _, tokenRange, _ in
    let word = (text as NSString).substring(with: tokenRange)
    bagOfWords[word, default: 0] += 1
  }

  return bagOfWords
}

The output of the function is a count of each word as it appears in the input string, but let’s break down each step:

  1. Initializes your bag of words dictionary.
  2. Creates an NSLinguisticTagger set up to tag all the tokens (words, punctuation, whitespace) in a string.
  3. The tagger operates over an NSRange, so you create a range for the whole string.
  4. Set the options to skip punctuation and whitespace when tagging the string.
  5. Set the tagger string to the text parameter.
  6. Applies the closure to every word tag found in the string. For each word, it increments the corresponding dictionary value, using the word itself as the dictionary key.

Using the model

With the word counts in hand, they can now be fed into the model. Replace the contents of analyze(text:) with the following:

func analyze(text: String) {
  // 1
  let counts = wordCounts(text: text)
  // 2
  let model = Poets()
  
  // 3
  do {
    // 4
    let prediction = try model.prediction(text: counts)
    updateWithPrediction(poet: prediction.author,
                         probabilities: prediction.authorProbability)
  } catch {
    // 5
    print(error)
  }
}

This function:

  1. Initializes a variable to hold the output of wordCounts(text:).
  2. Creates an instance of the Core ML model.
  3. Wraps the prediction logic in a do/catch block because it can throw an error.
  4. Passes the parsed text to the prediction(text:) function that runs the model.
  5. Logs an error if one exists.

Build and run, then enter a poem and let the model do its magic!

The result is great, but you can chalk that one up to good training! Another poem may not produce the desired results; a Joyce Kilmer classic, for example, does not.

In this case, the model leans heavily towards Emily Dickinson since there are far more of her poems in the training set than any other author. This is the downside to machine learning — the results are only as good as the data used to train the models.

Where To Go From Here?

You can get the KeatsOrYeats-final project from the Download Materials link at the top or bottom of this tutorial.

If you are feeling adventurous and want to take things further, you could easily build on this tutorial by designing your own text classifier. If you have a large data set with known labels, such as reviews and ratings, genres or filters, it would make a good fit. You can also build more accurate models by feeding them more data or providing multiple columns in the features input to classifier.create(). Good candidates would be a poem’s title or style.

Another way to get more accurate predictions is to clean up the input data. Unfortunately, there aren’t many options available in the sentence_classifier, but you can use the logistic classifier directly. That way, you can provide a massaged input that eliminates common words or uses n-grams (pairs of words rather than single words) for a more accurate analysis. Turi Create also provides a number of helper functions for this purpose.
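
The n-gram idea is easy to sketch with the standard library. A bigram keeps pairs of adjacent words, preserving a little of the ordering a plain bag of words throws away (a conceptual sketch, not Turi Create's own helper):

```python
from collections import Counter

def bigram_counts(text):
    # Pair each word with its successor, then count the pairs.
    words = text.lower().split()
    pairs = zip(words, words[1:])
    return Counter(' '.join(pair) for pair in pairs)

line = "when you are old and grey and full of sleep"
print(bigram_counts(line).most_common(3))
# [('when you', 1), ('you are', 1), ('are old', 1)]
```

Because "and grey" and "and full" are now distinct features, the model can pick up on characteristic word pairings, not just word choice.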

You can also learn more about Core ML and machine learning with these other tutorials: Beginning Machine Learning with scikit-learn and Beginning Machine Learning with Keras & Core ML.

Hopefully, your interest in all things NLP and machine learning has been piqued! If you’re looking to connect with other like-minded developers, or just want to share something cool, feel free to join the discussion in the forum below!

Download Materials