Word Frequency Count
The Task
Using the library provided, write a program that solves the classic word count problem:
Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency. – Jon Bentley, Programming Pearls
For the first part of this project, make the following simplifying assumptions:
- k can be set to 10, but we will generalize the program later for arbitrary k, so store this as a variable initialized to ‘10’.
- Input will be read from standard input, so you will not have to open any files. A FILE* stream named stdin corresponding to standard input is defined in <stdio.h>.
- All input text will be lowercase.
The API
I have created a library named analytics that provides a number of functions relating to processing words from text. The full Doxygen-generated API is available. For this project the following functions will be useful:
- Each of the three functions defined in analytics.h
- The word_map_nforeach function defined in word_map.h
Note that you don’t need to know the implementation details of ‘word_list_t’ or ‘word_map_t’ to use them (but if you want to, use the source!). You only need to know that one is a list of words and the other is a list of (word, number) pairs. The functions defined in analytics.h either accept these types as parameters or return them.
Callback functions
The function word_map_nforeach takes a parameter that is a function pointer. This works the same way as a data pointer, but instead of containing a memory address that points to a data value, a function pointer’s value points to a function, which can then be called from within the function that receives the pointer. This is used to allow you to work with the items contained in the word_map_t type without knowing the implementation details of that data structure (other than the type).
A word map item contains a word (const char *) and a count (int). If you look at the parameter fnct accepted by word_map_nforeach, it has a type of void(*fnct)(const char*, int), which is a pointer to a function that returns void and takes 2 parameters, the first of type const char* and the second of type int. As an example:
void some_callback(const char* word, int count) {
    // do something with 'word' and 'count'
}
int main() {
    int nlimit = 10;
    word_map_t wm;
    /* other variable declarations */
    /* calls to library functions (at least one must initialize wm) */
    /* if you pass an uninitialized 'wm' to this function your program
       will crash due to a Segmentation fault */
    word_map_nforeach(wm, some_callback, nlimit);
    return 0;
}
Since we want to print the sorted word count, the code in your callback should print out the value of count and the string word. This can be done in a single line with printf.
Initialize a git repository
I will assume all the source files for this project (there will probably be just one!) will live in ~/ece2524/wordfreq (for “word frequency”). Adjust the commands accordingly if you use a different path.
$ git init ~/ece2524/wordfreq
$ cd ~/ece2524/wordfreq
Write a main
The main function for this part will be very short; all of the complexity is hidden away in the library I provided.
#include <stdio.h>
#include <analytics.h>
int main() {
    //variable declarations
    //calls to library functions
    return 0;
}
Compile an object file
First, compile the object file(s) for your own code. Assuming you have a source file named main.c, run
$ clang -c -o main.o main.c
The -c flag tells clang to just compile the source file main.c into an object file main.o without completing the final linking step.
Linking to a shared library
To create a working program you need to link the object file so that the program knows where to find the code for the library calls.
$ clang -o wordfreq main.o -lanalytics
Note: The analytics library is installed on the ece2524 VM, so the above command will work if you are compiling from your shell account. If you would like to compile and link locally you will have to install the library on your own machine. The source is available on GitHub.
Test Before Push
Always test your program before submitting. It is much easier to debug output inconsistencies when you can run the code and make adjustments quickly, and errors you get locally will generally be easier to understand than those that come back from the automatic testing. Create a couple of input files, starting with the two that are used in the tests (see below), and run your program with each of them. Does the output match what you expect?
$ cat >numbers << EOF
four two four one
two four three three
three four two
EOF
$ ./wordfreq <numbers
4 four
3 three
3 two
1 one
$
The tests in the Testing section use regular expressions to match each output line. The fancy syntax (\s* and \s+) just tells the test to match regardless of extra whitespace before the number and between the number and the word.
Add and commit your changes
The only file you should add to your repository is main.c. Do not add any of the generated files.
$ git add main.c
$ git commit
You can use a commit message of “initial commit” for the first commit, but after that use a message that is descriptive of what you changed since the last commit.
Add a remote
Add a remote named origin
$ git remote add origin git@ece2524.ece.vt.edu:cvl_username/wordfreq.git
Push your changes
$ git push -u origin master
On subsequent pushes you can just run
$ git push
Remember, git only pushes what is in your repository, which is only what you explicitly add and commit. If you make any changes to your source file you will have to add and commit those changes to the repository if you want them to get pushed to the server.
Submission
The source files should exist in their own git repository. If you change to the directory containing your source files and run ls -a, you should see a directory named .git. If not, run git init to initialize a git repository in the current directory. You should only run git init once for each new project.
Push your git repository to the remote at git@ece2524.ece.vt.edu:USER/wordfreq.git, where USER is your git user name.
If you have initialized a new repo but have not added a remote yet:
$ git remote add origin git@ece2524.ece.vt.edu:USER/wordfreq.git
where USER is your git user name.
If you have already added a remote named origin, but the URL is incorrect, replace add with set-url in the above command. You can always check the remotes you have added by running git remote -v.
Remember, if this is the first time pushing to a new remote you need to specify a destination branch (usually `master`). Using the `-u` option will save this default destination for future pushes.
$ git push -u origin master
Testing
Feature repo path: features/wordfreq
The following features will be tested using cucumber:
@compile
Feature: Compile
  Background:
    Given I am working from a clean git clone to "wordfreq"
    And I cd to "wordfreq"
  Scenario: Clean Repo
    Then a file named "wordfreq" should not exist
  Scenario: Compile
    When I successfully run `clang -c -o main.o main.c`
    Then a file named "main.o" should exist
    When I successfully run `clang -o wordfreq -lanalytics main.o`
    Then a file named "wordfreq" should exist
@part1 @no-clobber
Feature: Word Frequency Utility
  Background:
    Given I cd to "wordfreq"
    And a file named "fox.txt" with:
      """
      the quick brown fox jumped over the lazy cow.
      but the cow jumped over the moon!
      what does the fox say?
      """
    And a file named "numbers" with:
      """
      four two four one
      two four three three
      three four
      """
  Scenario:
    When I run the shell command "./wordfreq < fox.txt"
    Then its stdout should contain exactly 10 lines
    And its stdout lines should match:
      | ^\s*5\s+the$ |
      | ^\s*2\s+cow$ |
      | ^\s*2\s+fox$ |
      | ^\s*2\s+jumped$ |
      | ^\s*2\s+over$ |
      | ^\s*1\s+brown$ |
      | ^\s*1\s+but$ |
      | ^\s*1\s+does$ |
      | ^\s*1\s+lazy$ |
      | ^\s*1\s+moon$ |
  Scenario:
    When I run the shell command "./wordfreq < numbers"
    Then its stdout should contain exactly 4 lines
    And its stdout lines should match:
      | ^\s*4\s+four$ |
      | ^\s*3\s+three$ |
      | ^\s*2\s+two$ |
      | ^\s*1\s+one$ |
You can run the tests manually with
$ cucumber /usr/share/features/wordfreq
when logged in to your shell account. This command assumes your current working directory is your project directory.