1. Run randomsample_by_sessions.py to obtain a random sample of user requests in an MR table
 Result: (mr) pzuev/random_sample_seglearn

2. Download the table, run extract_urls_from_sample.py to extract URLs
 Result: sample_urls.txt

3. Run url_sampler to produce a manageable selection of popular hosts with desired per-host coverage.
 Result: sample_urls_to_download.txt (stdout)
 Result (unused): hoststats.txt (stderr)

4. Run 'kwworm -Q read_sample.query > sample_docs.protobin' to download the sampled docs. Steps 3 and 4 are invoked by the fetch_sample_docs.sh script.
 Result: sample_docs_large.protobin (60 000 docs, about 6-7 GB with the default sample size)

5. Run the learning algorithm ('learn').
 Result: learned_freqs.txt

6. Unpack all the docs to a cache directory (for assessor web UI): unpack.sh
 Result: .cache/
 Result: url_groups.txt

7. Start up the sentence weighting backend: 'serve learned_freqs.txt'. It will read HTML files from the .cache directory.
 Result: listening on port 45011 for /html_hash?url=original_url&encoding=encoding_name

8. Start the Python frontend:
 Result: listening on port 45040
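
Example: building a request URL for the backend from step 7. This is only a sketch for checking that the backend responds. It assumes the backend runs on localhost and that the path is literally /html_hash with the url and encoding query parameters shown in step 7; the helper name build_backend_url and the sample arguments are made up for illustration.

```python
from urllib.parse import urlencode

# Port taken from step 7; the host is an assumption (backend running locally).
BACKEND_PORT = 45011


def build_backend_url(original_url, encoding_name, host="localhost"):
    """Return a request URL for the sentence weighting backend.

    Hypothetical helper: percent-encodes the original URL so that its own
    '?', '&' and '=' characters do not break the query string.
    """
    query = urlencode({"url": original_url, "encoding": encoding_name})
    return "http://{}:{}/html_hash?{}".format(host, BACKEND_PORT, query)


if __name__ == "__main__":
    # Sample values only; substitute a URL that is present in .cache/.
    print(build_backend_url("http://example.com/page?id=1", "utf-8"))
```

The printed URL can be opened in a browser or passed to curl once the backend is up. Which encoding names the backend accepts is not covered here.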
