# pyreplay

A log replay tool for PostgreSQL.

## Overview

`pyreplay` takes PostgreSQL logs and replays them against a server, issuing the same commands that
appear in the logs. It works only with CSV-formatted logs; see the section on producing the logs below for how to generate logs in the right format.

There are two commands: `preprocess.py` takes the raw logs as input and creates a preprocessed log file
that `replay.py` then replays to a specified server.
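
As a rough illustration of the two-step workflow, the following sketch builds the two command lines from Python, using only the options documented in the sections below. It assumes both scripts sit in the current directory, are run through the Python interpreter, and accept `--option value` pairs ahead of the positional log files; the helper names (`preprocess_command`, `replay_command`, `run_pipeline`) are hypothetical and not part of pyreplay.

```python
import subprocess
import sys


def preprocess_command(log_files, control_file="pyreplay_control.json",
                       processed_logs="processed_logs.csv.gz", acceleration=None):
    """Build the argument list for preprocess.py (log_files in chronological order)."""
    cmd = [sys.executable, "preprocess.py",
           "--control-file", control_file,
           "--processed-logs", processed_logs]
    if acceleration is not None:
        cmd += ["--acceleration", str(acceleration)]
    # Assumption: options may precede the positional log file arguments.
    return cmd + list(log_files)


def replay_command(connect_string, control_file="pyreplay_control.json", preroll=10):
    """Build the argument list for replay.py."""
    return [sys.executable, "replay.py",
            "--control-file", control_file,
            "--connect-string", connect_string,
            "--preroll", str(preroll)]


def run_pipeline(log_files, connect_string):
    """Preprocess the logs, then replay them; stops if either step fails."""
    subprocess.run(preprocess_command(log_files), check=True)
    subprocess.run(replay_command(connect_string), check=True)
```

Calling `run_pipeline(["monday.csv", "tuesday.csv.gz"], "dbname=test user=test")` would then run both steps with the default file names (the log file names and connection string here are placeholders).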

## Producing the logs

To produce logs suitable for use with this package, you need the following settings in your `postgresql.conf`:

    logging_collector = on # warning: requires a server restart to change!
    log_min_duration_statement = 0 # warning: can generate a large log volume
    log_connections = on
    log_disconnections = on

In addition, `csvlog` must appear somewhere in `log_destination`, e.g.

    log_destination = 'csvlog'

or

    log_destination = 'stderr, csvlog'
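
A quick way to confirm these settings before collecting logs is to scan the configuration text for them. The following is a minimal sketch, not part of pyreplay: `parse_conf` and `check_logging` are hypothetical helpers, and the parser is deliberately simplified. It compares values literally as written above (so `true` or `0ms` would be reported as mismatches) and does not follow `include` directives.

```python
import re

# Settings listed above; values are compared as lowercase strings.
REQUIRED = {
    "logging_collector": "on",
    "log_min_duration_statement": "0",
    "log_connections": "on",
    "log_disconnections": "on",
}

# Matches "name = value" or "name value"; anything after the value (such as a comment) is ignored.
_LINE = re.compile(r"^\s*([A-Za-z_][\w.]*)(?:\s*=\s*|\s+)('(?:[^']|'')*'|[^#\s]+)")


def parse_conf(text):
    """Return a dict holding the last value assigned to each setting in the text."""
    settings = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            name, value = match.groups()
            settings[name.lower()] = value.strip("'").lower()
    return settings


def check_logging(text):
    """Return a list of problems; an empty list means the settings look right."""
    settings = parse_conf(text)
    problems = []
    for name, wanted in REQUIRED.items():
        found = settings.get(name)
        if found != wanted:
            problems.append(f"{name} should be {wanted}, found {found}")
    destinations = [d.strip() for d in settings.get("log_destination", "").split(",")]
    if "csvlog" not in destinations:
        problems.append("log_destination must include csvlog")
    return problems
```

For example, `check_logging(open("postgresql.conf").read())` returns an empty list when all four settings and the `csvlog` destination are present.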

## preprocess.py

`preprocess.py` takes CSV logs as input, and produces a single (potentially very large) preprocessed log
file, along with a 'control' file that contains some digested information about the logs.

The command-line arguments and options to `preprocess.py` are:

* Arguments -- The log files to be preprocessed. They must be given in chronological order. Either .csv or .csv.gz
files are acceptable; `preprocess.py` will automatically unzip .csv.gz files. If there are no arguments, stdin
is assumed; stdin input must not be gzipped.

* `--control-file` -- The output file for the control file. If omitted, `pyreplay_control.json` in the current
directory is assumed.

* `--processed-logs` -- The output file for the processed logs. This is written as a single gzipped file; it
can be quite large. If omitted, `processed_logs.csv.gz` in the current directory is assumed.

* `--acceleration` -- If given, `preprocess.py` will "accelerate" replay by this multiplier. Normally, pyreplay
attempts to replay the log entries with the same timing as the original logs. With a value of 2.5, for example, pyreplay
will attempt to replay everything 2.5x faster.
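
The arithmetic behind `--acceleration` can be sketched as follows. This illustrates the idea only and is not taken from the `preprocess.py` source: it assumes each entry's replay time is its offset from the first log entry divided by the multiplier, and `replay_offsets` is a hypothetical name.

```python
from datetime import datetime, timedelta


def replay_offsets(timestamps, acceleration=1.0):
    """Return seconds-from-start for each log timestamp, scaled by the multiplier."""
    if acceleration <= 0:
        raise ValueError("acceleration must be positive")
    if not timestamps:
        return []
    first = timestamps[0]
    # Dividing by the multiplier compresses the gaps between entries.
    return [(ts - first).total_seconds() / acceleration for ts in timestamps]


# Three entries logged 5 seconds apart (made-up timestamps for illustration).
start = datetime(2024, 1, 1, 12, 0, 0)
logged = [start, start + timedelta(seconds=5), start + timedelta(seconds=10)]
print(replay_offsets(logged, acceleration=2.5))  # [0.0, 2.0, 4.0]
```

With a multiplier of 2.5, entries that were 5 seconds apart in the original logs end up 2 seconds apart in the replay.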

## replay.py

`replay.py` takes the files created by `preprocess.py` and replays them against a designated server.

The options to `replay.py` are:

* `--control-file` -- The control file created by `preprocess.py`. The full path to the processed logs file is
stored in the control file, so it does not have to be specified separately. If omitted, `pyreplay_control.json` in the current
directory is assumed.

* `--connect-string` -- The connection string for the target database against which to replay the logs. If omitted,
`dbname=postgres user=postgres` is assumed, which is probably wrong.

* `--preroll` -- An integer value in seconds. When `replay.py` starts, it delays by this number of seconds before
the first entry is presented to the server. This time is used to create the child processes used to manage the
connections. The default is 10 seconds, which should be enough under most conditions.
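
To make the effect of `--preroll` concrete, here is a minimal sketch of how a preroll delay can shift a replay schedule. It is an assumption about the general approach rather than the actual `replay.py` logic: every entry's deadline is the start time plus the preroll plus the entry's offset from the first entry. `due_times` and `wait_until` are hypothetical names.

```python
import time


def due_times(offsets, preroll=10, now=None):
    """Map each replay offset (seconds from the first entry) to a monotonic-clock deadline."""
    # The first entry is due `preroll` seconds after startup, leaving time for setup.
    start = (time.monotonic() if now is None else now) + preroll
    return [start + offset for offset in offsets]


def wait_until(deadline):
    """Sleep until the monotonic clock reaches the deadline; return at once if it has passed."""
    remaining = deadline - time.monotonic()
    if remaining > 0:
        time.sleep(remaining)
```

With offsets of 0, 2, and 4 seconds and the default preroll, the entries fall due 10, 12, and 14 seconds after startup. If worker processes take longer than the preroll to start, the earliest entries would be issued late, which is the case a larger `--preroll` value is meant to cover.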
