How to implement article word extraction and analysis? (01)


Now let us think through the process:

flowchart

begin[Begin] --> a[user requests a book's page content] --> b[server analyzes it and gets page tokens] --> c[server gets user's unknown words] --> d[compare tokens and unknown words] --> f[return this page's unknown words to the user] --> END[End]

So we must know which words the user doesn't know. We need a tool to analyze the text, pick out the user's unknown words, and return them so the frontend can render those words in a different color.

1. Generate our own words database.

I chose NLTK.

First, follow the documentation to download the NLTK data.

import nltk
nltk.download()


The downloader may report an error; we can ignore it and change the Server Index to https://www.nltk.org/nltk_data/. We can also download the NLTK data directly from that website, then move the downloaded files into one of the folders NLTK searches.
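If you download the data manually, you can print nltk.data.path to see the folders NLTK will search:

import nltk

# the directories NLTK searches for its data;
# unzip a manually downloaded package into any of them
print(nltk.data.path)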


The Brown corpus has 1,161,192 words. That is huge!


For testing purposes we only save the humor category's words into our database.

Of course we need a Word model, and never forget to run migrations.

from django.db import models
from nltk.corpus import brown


class Word(models.Model):
    word = models.CharField(max_length=50)


def insert_into_db(category: str):
    # load every token of the chosen Brown category and bulk-insert it
    words = brown.words(categories=category)
    word_objects = [Word(word=word) for word in words]
    Word.objects.bulk_create(word_objects)


insert_into_db('humor')


But this table has a lot of repeated records and some punctuation marks, so we need to remove them.


Let's have a quick test of stripping the punctuation.

Different categories will also share words, so we add unique=True to the word field on the model.
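The updated model is just one extra argument:

class Word(models.Model):
    # unique=True lets the database reject duplicate words across categories
    word = models.CharField(max_length=50, unique=True)

Remember that changing the field needs another migration.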

Redo the insert:

import string

from nltk.corpus import brown


def insert_category_words_into_db(category=None):
    words = set(brown.words(categories=category))
    filtered_set_words = set()
    for word in words:
        # remove surrounding punctuation and spaces
        word = word.strip(string.punctuation + ' ')
        if len(word) > 0:
            filtered_set_words.add(word)

    word_objects = [Word(word=word) for word in filtered_set_words]
    # ignore conflicts so already-inserted words are skipped
    Word.objects.bulk_create(word_objects, ignore_conflicts=True)

Now we have about 50,000 words in the database.


2. Create a UserWord database table.

Suppose the server's Word table stores a total of 100,000 English words (let's ignore phrases for now), and each record has only an id and a word (varchar 50) field: 100,000 × (50 + 4) bytes ≈ 5.15 MB.

We also need a UserWord table to store the words that users don't know and are learning. This table has three fields: id (4 bytes), user_id (4 bytes), word_id (4 bytes). A brand-new user doesn't know any of the words, so storing them all costs (4 × 3) × 100,000 bytes ≈ 1.14 MB.

If the server has 10,000 users like this, it will cost about 11.4 GB of disk space.
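A quick back-of-the-envelope check of those numbers (ignoring row overhead and indexes):

# rough storage estimate; a real database adds per-row overhead and indexes
WORD_ROW = 4 + 50                      # id + varchar(50)
USERWORD_ROW = 4 + 4 + 4               # id + user_id + word_id

word_table = 100_000 * WORD_ROW        # ~5.15 MiB
one_new_user = 100_000 * USERWORD_ROW  # ~1.14 MiB if we store every unknown word
all_users = 10_000 * one_new_user      # roughly 11 GiB

print(word_table / 2**20, one_new_user / 2**20, all_users / 2**30)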

But saving a user's unknown words doesn't have much value.

Another design is to save only the words a user understands; that saves a lot of disk space. 😀

Here is our db design:

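In short, Word keeps the global vocabulary and UserWord links a user to the words they understand. A minimal sketch of the UserWord model, based on the fields used later in this post (user, word, state); the unique constraint is my assumption:

from django.conf import settings
from django.db import models


class UserWord(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    word = models.ForeignKey(Word, on_delete=models.CASCADE)
    # state is used later in the tests; its exact meaning (e.g. learning vs. mastered) is up to you
    state = models.SmallIntegerField(default=0)

    class Meta:
        # assumption: one record per (user, word) pair
        unique_together = ('user', 'word')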

Then add some endpoints for CRUD.
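I won't list them all here; a minimal sketch with Django REST Framework's ModelViewSet (the class names are illustrative) could be:

from rest_framework import serializers, viewsets

from .models import UserWord  # adjust to your app's module path


class UserWordSerializer(serializers.ModelSerializer):
    class Meta:
        model = UserWord
        fields = ['id', 'user', 'word', 'state']


class UserWordViewSet(viewsets.ModelViewSet):
    # gives list / create / retrieve / update / destroy endpoints for UserWord
    queryset = UserWord.objects.all()
    serializer_class = UserWordSerializer

Register it with a router in urls.py and you get the CRUD routes for free.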

3. Analyze an article's words, then compare them with the UserWord table

import nltk

sentence = """
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.
NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”
Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The online version of the book has been been updated for Python 3 and NLTK 3. (The original Python 2 version is still available at https://www.nltk.org/book_1ed.)
"""
tokens = nltk.word_tokenize(sentence)


This raises an error telling us we need to install a tokenizer. After downloading it, don't forget to unzip it; the package contains pickled models for many languages.
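Assuming the missing resource is the punkt tokenizer models that word_tokenize relies on, it can also be installed programmatically:

import nltk

# downloads and unzips the punkt tokenizer models into nltk_data
nltk.download('punkt')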


After that, tokenization works and word_tokenize returns the token list.


Now let's have the user with id 1 learn some words:

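For example, something like this shell session (the chosen words are arbitrary, and state=1 mirrors what the test code below uses):

from django.contrib.auth import get_user_model

# mark a few words as known for user 1
user = get_user_model().objects.get(pk=1)
for w in ('at', 'the', 'has'):
    UserWord.objects.create(user=user, word=Word.objects.get(word=w), state=1)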

Then let's see which words in the sentence the user knows and which they don't:

import nltk
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .models import UserWord, Word  # adjust to your app's module path


@api_view(['GET'])
def query_current_user_words_in_article(request):
    sentence = """
        NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
    Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.
    NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”
    Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The online version of the book has been been updated for Python 3 and NLTK 3. (The original Python 2 version is still available at https://www.nltk.org/book_1ed.)
        """
    # filter_words (sketched below) strips punctuation and deduplicates the tokens
    tokens = filter_words(nltk.word_tokenize(sentence))

    # the words this user already knows (or is learning)
    current_user_words_query_set = UserWord.objects.filter(user=request.user.id)
    user_unknown_words = set()
    user_known_words = set()
    for token in tokens:
        try:
            word = Word.objects.get(word=token)
            try:
                # does the current user have a record for this word?
                current_user_words_query_set.get(word=word.id)
                # reaching here means the user already knows this token
                user_known_words.add(word.word)
            except UserWord.DoesNotExist:
                user_unknown_words.add(word.word)
        except Word.DoesNotExist:
            # the token is not in our Word table at all
            continue
    data = {
        'known': list(user_known_words),
        'unknown': list(user_unknown_words)
    }
    return Response(data=data)
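filter_words is a small helper that is not shown above; a plausible sketch that reuses the same cleanup idea as insert_category_words_into_db:

import string


def filter_words(tokens):
    # strip surrounding punctuation/whitespace and drop empty or duplicate tokens
    filtered = set()
    for token in tokens:
        token = token.strip(string.punctuation + ' ')
        if token:
            filtered.add(token)
    return filtered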

It responds with the lists of known and unknown words.

Not bad, but 242 queries are far too many (the loop issues up to two queries per token). Don't worry; we'll write a test case first and then optimize it.

Test Case

from rest_framework import status
from rest_framework.authtoken.models import Token
from rest_framework.test import APIClient, APITestCase


# the surrounding test class and its setUp (which creates self.defaultUser and
# loads the Word table) are not shown here; this is the relevant test method
class UserWordTests(APITestCase):

    def test_detail_without_authenticated(self):
        # an anonymous user knows nothing
        response = self.client.get('/test/')
        self.assertEqual(response.status_code, status.HTTP_200_OK)
        self.assertEqual(len(response.data['known']), 0)
        self.assertEqual(len(response.data['unknown']), 109)

        # let him learn some words
        for word in ('at', 'the', 'has'):
            UserWord.objects.create(word=Word.objects.get(word=word), user=self.defaultUser, state=1)

        token, created = Token.objects.get_or_create(user=self.defaultUser)
        client = APIClient()
        client.credentials(HTTP_AUTHORIZATION='Token ' + token.key)

        response = client.get('/test/')
        self.assertEqual(response.status_code, status.HTTP_200_OK)
        self.assertEqual(len(response.data['known']), 3)
        self.assertEqual(len(response.data['unknown']), 106)

We will finish this topic in the next post. Thanks for reading!