Pset5: Speller

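Honestly, with this pset, just reading the "need to know" material at the start left me dizzy and unsure where to begin.

Key points: how to build a hash table, what a hash function and a hash value mean, and how linked lists are applied. Review: reading files and freeing memory.

Update: I added a Python version at the end, and the whole approach instantly became crystal clear!

Topics to review: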

Hash Table

Hash table: array of linked lists.

Hash function: assigns a number to every input. In this problem, the hash function specifically:
takes a word as input;
outputs which "bucket" the word belongs to;

Linked Lists

Definition: in a linked list, every node holds a value and a pointer to the next node.
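To make the definition concrete, here is a minimal sketch of a node and of adding one to the front of a list. The value LENGTH = 45 and the helper name prepend are my own assumptions for illustration; they are not taken from the distribution code.

```c
#include <stdlib.h>
#include <string.h>

#define LENGTH 45 // assumed maximum word length, for illustration only

// One element of a linked list: a value plus a pointer to the next node.
typedef struct node
{
    char word[LENGTH + 1];
    struct node *next;
} node;

// Hypothetical helper: put a copy of `word` at the front of the list
// and return the new head. Returns NULL if malloc fails, in which
// case the old list is left untouched.
node *prepend(node *head, const char *word)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return NULL;
    }
    // Copy at most LENGTH characters and always terminate the string
    strncpy(n->word, word, LENGTH);
    n->word[LENGTH] = '\0';
    // The new node points at the old head, so it becomes the first node
    n->next = head;
    return n;
}
```

Inserting at the head takes the same time no matter how long the list already is, which is why load below uses the same trick.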

Load Function: approach and steps

Load the dictionary file into a hash table. The steps:

First, open the dictionary file.
- Use fopen
- Check whether the return value is NULL
Read the file's contents: read strings from the file one at a time
- fscanf(file, "%s", word)
- Loop until fscanf returns EOF, which means the end of the file has been reached
Create a node for each word (holding a value and a pointer to the next node)
- malloc
- Check whether the return value is NULL
- Copy the word into the node with strcpy
Hash the word to obtain a hash value
- Use the hash function (see the review notes above)
- The function takes a string and returns an index
Insert the node into the hash table at that location
- That is, add the node to the linked list stored in that bucket
- Repeat this process for every word

The final result:

bool load(const char *dictionary)
{
    FILE *file = fopen(dictionary, "r");
    if (file == NULL)
    {
        return false;
    }
    // Buffer for one word; each node's word field was defined earlier with size [LENGTH + 1]
    char word[LENGTH + 1];

    while (fscanf(file, "%s", word) != EOF)
    {
        node *n = malloc(sizeof(node));
        if (n == NULL)
        {
            // Close the file before giving up
            fclose(file);
            return false;
        }
        // strcpy copies the word just read into node n's word field
        strcpy(n->word, word);
        // Use the hash function to map the word to a hash value
        unsigned int hash_value = hash(word);
        // Insert the new node at the head of the linked list
        n->next = table[hash_value];
        // Make the bucket point at the new head
        table[hash_value] = n;
        word_count++;
    }
    fclose(file);
    return true;
}

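Hash Function: implementation approach

The hash function needs to handle:

Input: a word (which may contain apostrophes, i.e. the ' character)
Output: a numerical index
Deterministic: same input, same output

The distribution code defines a constant N (set to N = 1). You can choose a larger N to give the hash table more buckets, but keep the valid range in mind: hash must always return an index smaller than N.

There are many hash functions, but they all share the same idea: map a string onto a number within a fixed range. Unfortunately I couldn't follow how these hash algorithms are written at all 😅, so I simply copied the djb2 approach.

A summary of common hash algorithms on CSDN: blog.csdn.net/yanshu2012/…

The final result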

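// Hashes word to a number (djb2, case-insensitive)
unsigned int hash(const char *word)
{
    unsigned long hash = 5381;
    int c;
    // Upper-case each character so that "Cat" and "cat" land in the same bucket
    while ((c = toupper((unsigned char) *word++)) != 0)
    {
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    }
    return hash % N;
}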
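As a quick sanity check of the two properties listed earlier (deterministic and case-insensitive), here is a standalone sketch of the same djb2 idea. The name djb2_bucket and the buckets parameter are my own; in the pset the bucket count is the constant N.

```c
#include <ctype.h>

// Hypothetical standalone variant of the hash above: djb2 over the
// upper-cased characters, reduced to an index in [0, buckets).
// `buckets` must be greater than 0.
unsigned int djb2_bucket(const char *word, unsigned int buckets)
{
    unsigned long h = 5381;
    for (const char *p = word; *p != '\0'; p++)
    {
        // h * 33 + c, written with a shift as in the original djb2
        h = ((h << 5) + h) + (unsigned long) toupper((unsigned char) *p);
    }
    return (unsigned int) (h % buckets);
}
```

For example, djb2_bucket("Cat", 1000) and djb2_bucket("cAT", 1000) return the same index, because both words are upper-cased before hashing. The parentheses around h << 5 matter: in C, + binds tighter than <<, so h << 5 + h would shift by 5 + h instead.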

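Size Function: implementation approach

Aside: wow, it was only at this point in the walkthrough that I realized that load, hash, size... everything I have been writing is what builds the dictionary.

The load function already maintains word_count;
in this step all we need to do is return that value. The implementation: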

unsigned int size(void)
{
    // word_count is incremented in load and stays 0 if nothing was loaded
    return word_count;
}

Check Function

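The check function needs to take a word and check whether it is in the dictionary we just built.

Input: a char *word
Output: bool (true or false)
Case insensitive: when checking spelling, differences in capitalization should be ignored;

The steps:

Hash the word to get a hash value;
Go to the hash table and search the linked list stored at that hash value;
Use strcasecmp to compare the two strings case-insensitively;

Key point: traversing a linked list

Create a node pointer named cursor;
Set cursor to the first node in the linked list;
Keep moving it forward until cursor reaches NULL (i.e. the end);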
The final result:

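bool check(const char *word)
{
    // Find the bucket this word would live in
    unsigned int hash_value = hash(word);
    node *cursor = table[hash_value];

    // Walk the linked list in that bucket until a match or the end
    while (cursor != NULL)
    {
        if (strcasecmp(word, cursor->word) == 0)
        {
            return true;
        }
        cursor = cursor->next;
    }
    return false;
}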

Unload Function

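This step is all about freeing memory. Doing so requires a temporary node pointer named tmp.

Create a node pointer named cursor that points to the first node in the linked list;
Set tmp to point to the same node;
Once cursor has moved on to the next node, free tmp;
Keep repeating the cycle: set tmp → move cursor → free tmp. The implementation: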

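bool unload(void)
{
    // Visit every bucket, not just the first one
    for (unsigned int i = 0; i < N; i++)
    {
        node *cursor = table[i];
        while (cursor != NULL)
        {
            // Remember the current node, step past it, then free it
            node *tmp = cursor;
            cursor = cursor->next;
            free(tmp);
        }
    }
    return true;
}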
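To see why tmp is needed, here is a minimal sketch that frees a single list and counts the freed nodes. The name free_list and the return value are my own additions for illustration.

```c
#include <stdlib.h>

typedef struct node
{
    char word[46]; // assumed size, for illustration only
    struct node *next;
} node;

// Hypothetical helper: free every node in one list and return how
// many nodes were freed.
size_t free_list(node *head)
{
    size_t freed = 0;
    node *cursor = head;
    while (cursor != NULL)
    {
        // Save the current node, advance first, then free the saved one.
        // Freeing cursor directly would make cursor->next invalid.
        node *tmp = cursor;
        cursor = cursor->next;
        free(tmp);
        freed++;
    }
    return freed;
}
```

Calling this once per bucket frees the whole table, which is the shape of unload.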

Final outcome:

My results:

WORDS MISSPELLED:     955
WORDS IN DICTIONARY:  143091
WORDS IN TEXT:        17756
TIME IN load:         0.02
TIME IN check:        1.52
TIME IN size:         0.00
TIME IN unload:       0.00
TIME IN TOTAL:        1.54

The instructor's results:

WORDS MISSPELLED:     955
WORDS IN DICTIONARY:  143091
WORDS IN TEXT:        17756
TIME IN load:         0.02
TIME IN check:        0.01
TIME IN size:         0.00
TIME IN unload:       0.02
TIME IN TOTAL:        0.05

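Is the hash setup to blame? Probably: almost all of the extra time is in check, and with too few buckets (a small N) every lookup has to walk a very long linked list.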
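One way to test that suspicion is to measure how long the fullest bucket gets for a given bucket count. The sketch below is only an illustration under my own assumptions (the names bucket_of and longest_chain are hypothetical, and it counts words instead of building real lists); it is not part of my submitted solution.

```c
#include <ctype.h>
#include <stdlib.h>

// Same djb2 idea as the hash function above, with the bucket count
// passed in so that different values can be compared.
static unsigned int bucket_of(const char *word, unsigned int buckets)
{
    unsigned long h = 5381;
    for (const char *p = word; *p != '\0'; p++)
    {
        h = ((h << 5) + h) + (unsigned long) toupper((unsigned char) *p);
    }
    return (unsigned int) (h % buckets);
}

// Hypothetical helper: how many words land in the fullest bucket.
// `buckets` must be greater than 0. Returns 0 if memory runs out.
size_t longest_chain(const char *words[], size_t count, unsigned int buckets)
{
    size_t *sizes = calloc(buckets, sizeof(size_t));
    if (sizes == NULL)
    {
        return 0;
    }
    size_t longest = 0;
    for (size_t i = 0; i < count; i++)
    {
        size_t filled = ++sizes[bucket_of(words[i], buckets)];
        if (filled > longest)
        {
            longest = filled;
        }
    }
    free(sizes);
    return longest;
}
```

With a single bucket the longest chain equals the number of words, so check degrades into a linear search through the entire dictionary. With more buckets the chains get shorter, and so does each lookup.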

The big move: writing it in Python

After writing it once in Python, my muddled brain finally cleared up again.


words = set()

## Create a set named words; a set is a collection we can add values to


def check(word):
    if word.lower() in words:
        return True
    else:
        return False

## Define a function; `.lower()` gives the lowercase version of the word directly
## Combined with the words set that load fills in below, this checks directly
## whether the given word is in words


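def load(dictionary):
    file = open(dictionary, "r")
    for line in file:
        word = line.rstrip()
        words.add(word)
    file.close()
    return True

## `rstrip` removes trailing whitespace (including the newline) from a string
## I haven't looked up the details of `line` yet (as far as I can tell, looping over a file gives one line per iteration)
## The overall idea: open the dictionary file, then read its contents into words line by line.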


def size():
    return len(words)


def unload():
    return True

[CS50] Pset5: Speller (spell checker)