ECE 2524 - Word Frequency Count

ECE 2524

Introduction to Unix for Engineers

Word Frequency Count

The Task

Using the library provided, write a program that solves the classic word count problem:

Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency. – Jon Bentley, Programming Pearls

For the first part of this project, make the following simplifying assumptions:

  1. k can be set to 10, but we will generalize the program later for arbitrary k, so store this as a variable initialized to ‘10’.
  2. Input will be read from standard input, so you will not have to open any files. A FILE* stream named stdin corresponding to standard input is defined in <stdio.h>.
  3. All input text will be lowercase.

The API

I have created a library named analytics that provides a number of functions relating to processing words from text. The full Doxygen-generated API documentation is available. For this project the following functions will be useful:

Note that you don’t need to know the implementation details of ‘word_list_t’ or ‘word_map_t’ to use them (but if you want to, use the source!). You only need to know that one is a list of words and the other is a list of word,number pairs. The functions defined in analytics.h either accept these types as parameters or return them.

Callback functions

The function word_map_nforeach takes a parameter that is a function pointer. A function pointer works the same way as a data pointer, but instead of holding the address of a data value, it holds the address of a function, which can then be called from within the function that receives it. This lets you work with the items contained in a word_map_t without knowing the implementation details of that data structure (other than the type).

A word map item contains a word (const char *) and a count (int). The parameter fnct accepted by word_map_nforeach has the type

void(*fnct)(const char*, int)

which is a pointer to a function that returns void and takes two parameters, the first of type const char* and the second of type int. As an example:

void some_callback(const char* word, int count) {
    // do something with 'word' and 'count'
}

int main() {
   int nlimit = 10;
   word_map_t wm;
   
   /* other variable declarations */

   /* calls to library functions (at least one must initialize wm) */

   /* if you pass an uninitialized 'wm' to this function your program
      will crash due to a Segmentation fault */
   word_map_nforeach(wm, some_callback, nlimit);

   return 0;
}

Since we want to print the sorted word count, the code in your callback should print the value of count followed by the string word. This can be done in a single line with printf.

Initialize a git repository

I will assume all the source files for this project (there will probably be just one!) will live in ~/ece2524/wordfreq (for “word frequency”). Adjust the commands accordingly if you use a different path.

$ git init ~/ece2524/wordfreq
$ cd ~/ece2524/wordfreq

Write a main

The main function for this part will be very short because all of the complexity is hidden away in the library I provided.

#include <stdio.h>

#include <analytics.h>

int main() {
    //variable declarations

    //calls to library functions

    return 0;
}

Compile an object file

First, compile the object file(s) for your own code. Assuming you have a source file named main.c, run

$ clang -c -o main.o main.c

The -c flag tells clang to compile the source file main.c into an object file main.o without completing the final linking step.

Linking to a shared library

To create a working program you need to link the object file so that the program knows where to find the code for the library calls.

$ clang -o wordfreq main.o -lanalytics

Note: The analytics library is installed on the ece2524 VM, so the above command will work if you are compiling from your shell account. If you would like to compile and link locally you will have to install the library on your own machine. The source is available on GitHub.

Test Before Push

Always test your program before submitting: it is much easier to debug output inconsistencies when you can run the code and make adjustments quickly. Errors you get locally will generally be easier to understand than those that come back from the automatic testing. Create a couple of input files, starting with the two that are used in the tests (see below), and run your program with each of them. Does the output match what you expect?

$ cat >numbers << EOF
> four two four one
> two four three three
> three four two
> EOF
$ ./wordfreq <numbers
4 four
3 three
3 two
1 one
$

The tests in the Testing section use regular expressions to match lines with arbitrary whitespace. The fancy syntax (\s* and \s+) just tells the test to match regardless of extra whitespace around the numbers.

Add and commit your changes

The only file you should add to your repository is main.c. Do not add any of the generated files.

$ git add main.c
$ git commit

You can use a commit message of “initial commit” for the first commit, but after that use a message that is descriptive of what you changed since the last commit.

Add a remote

Add a remote named origin:

$ git remote add origin git@ece2524.ece.vt.edu:cvl_username/wordfreq.git

Push your changes

$ git push -u origin master

On subsequent pushes you can just run

$ git push

Remember, git only pushes what is in your repository, which is only what you explicitly add and commit. If you make any changes to your source file you will have to add and commit those changes to the repository if you want them to get pushed to the server.

Submission

The source files should exist in their own git repository. If you change to the directory containing your source files and run ls -a, you should see a directory named .git. If not, run git init to initialize a git repository in the current directory. You should only run git init once for each new project.

Push your git repository to the remote at git@ece2524.ece.vt.edu:USER/wordfreq.git where USER is your git user name.

If you have initialized a new repo but have not added a remote yet:

$ git remote add origin git@ece2524.ece.vt.edu:USER/wordfreq.git

where USER is your git user name.

If you have already added a remote named origin, but the URL is incorrect, replace add with set-url in the above command. You can always check the remotes you have added by running git remote -v.

Remember, if this is the first time pushing to a new remote you need to specify a destination branch (usually `master`). Using the `-u` option will save this default destination for future pushes.

$ git push -u origin master

Testing

Feature repo path: features/wordfreq

The following features will be tested using cucumber:

@compile
Feature: Compile

  Background:
    Given I am working from a clean git clone to "wordfreq"
    And I cd to "wordfreq"
    
  Scenario: Clean Repo
    Then a file named "wordfreq" should not exist

  Scenario: Compile
    When I successfully run `clang -c -o main.o main.c`
    Then a file named "main.o" should exist
    When I successfully run `clang -o wordfreq -lanalytics main.o`
    Then a file named "wordfreq" should exist

@part1 @no-clobber
Feature: Word Frequency Utility
  
  Background:
    Given I cd to "wordfreq"
    And a file named "fox.txt" with:
    """
    the quick brown fox jumped over the lazy cow.
    but the cow jumped over the moon!
    what does the fox say?
    
    """
    And a file named "numbers" with:
    """
    four two four one
    two four three three
    three four
    
    """
    
  Scenario: 
    When I run the shell command "./wordfreq < fox.txt"
    Then its stdout should contain exactly 10 lines
    And its stdout lines should match:
    | ^\s*5\s+the$    |
    | ^\s*2\s+cow$    |
    | ^\s*2\s+fox$    |
    | ^\s*2\s+jumped$ |
    | ^\s*2\s+over$   |
    | ^\s*1\s+brown$  |
    | ^\s*1\s+but$    |
    | ^\s*1\s+does$   |
    | ^\s*1\s+lazy$   |
    | ^\s*1\s+moon$   |

  Scenario: 
    When I run the shell command "./wordfreq < numbers"
    Then its stdout should contain exactly 4 lines
    And its stdout lines should match:
    | ^\s*4\s+four$  |
    | ^\s*3\s+three$ |
    | ^\s*2\s+two$   |
    | ^\s*1\s+one$   |

You can run the tests manually with

$ cucumber /usr/share/features/wordfreq

when logged in to your shell account. This command assumes your current working directory is your project directory.