# Quirks

Achievements is an old project that has built up a lot of functionality over the years. This has resulted in a number of design choices that may be unexpected or hacky, which are documented here. Some of these are also documented inline in the code, which may go into more detail on the actual implementation.

## Single Instance Quests Application
#### What
The quests application calculates quest progress in batch jobs at a fixed interval, as defined in the [cron file](https://git.xarth.tv/cb/achievements/blob/master/internal/quests/cron.yaml). However, to ensure that only a single Redshift query runs at a time, quests runs as a single instance and uses an [atomic bool](https://git.xarth.tv/cb/achievements/blob/master/internal/quests/quest_progress.go#L44) to signal whether a job is currently running.
#### Why
Redshift query time for quests varies between batch jobs. Redshift usually cannot handle overlapping queries without issues, up to and including grinding the entire cluster to a halt. Because we cannot predict how long a query will take, we instead use a flag in the code to determine when a new query can be started.
#### Side Effects
The quests Beanstalk app must never run more than one instance. The flag lives in the memory of a single process, so a second instance would not see it and multiple queries would run at once.

The cron schedule doesn't guarantee how quickly jobs will be run in this case. For example, if one job takes 20 minutes, the next 4 scheduled cron jobs will not do anything because they will detect that a job is currently running by checking the flag.
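
The skip-if-running behavior can be sketched as follows. This is a minimal illustration, not the code in `quest_progress.go`: the `jobGuard` type and `TryRun` method are hypothetical names, and `atomic.Bool` assumes Go 1.19 or later.

```go
package main

import (
	"sync/atomic"
)

// jobGuard (hypothetical name) allows at most one batch job to run at a time
// within a single process. It offers no protection across multiple instances.
type jobGuard struct {
	running atomic.Bool
}

// TryRun runs job only if no other job is in flight.
// It reports whether job was actually run.
func (g *jobGuard) TryRun(job func()) bool {
	// CompareAndSwap flips false -> true atomically, so only one caller wins.
	if !g.running.CompareAndSwap(false, true) {
		return false // a job is already running; this trigger is skipped
	}
	// Clear the flag when the job finishes, even if it panics.
	defer g.running.Store(false)
	job()
	return true
}
```

Each cron trigger would call something like `guard.TryRun(runQuestQuery)`, where `runQuestQuery` is a placeholder for the batch job. Triggers that arrive while a query is still running return `false` immediately and are dropped rather than queued, which is why a long job causes later scheduled runs to do nothing.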

## TwitchCon Achievement URL
#### What
To complete the TwitchCon ticket achievement, we check a URL provided by marketing that contains a list of everyone who has purchased a ticket so far. This is called [in the sourcer](https://git.xarth.tv/cb/achievements/blob/master/internal/sourcer/twitchcon.go), and implemented in the TwitchCon [client](https://git.xarth.tv/cb/achievements/blob/master/internal/clients/twitchcon/client.go). The catch is that the URL for ticket purchasers changes with each TwitchCon, so we have to coordinate with marketing to get the correct URL every time a new TwitchCon goes on sale. The URL can be changed in the [config](https://git.xarth.tv/cb/achievements/blob/master/config/production.yaml#L66).
#### Why
The TwitchCon achievement was never fully automated, and was only intended to run one time. The initial name of the achievement (which still exists) was "single_twitchcon_2017". However, the achievement's copy on the frontend never indicated this, so users expected it to be unlocked for every TwitchCon. It was difficult to fully integrate with the purchasing flow, and marketing suggested this as a workaround. This at least allows the achievement to be completed automatically by a scheduled job, rather than from manual entries.
#### Side Effects
The URL must be updated for every new TwitchCon.

## Affiliate Invite Errors
#### What
In order for users to get paid by Twitch, they must complete a payout onboarding step, which saves information for tax and accounting purposes. Users can only have one active onboarding flow at a time, so a user with an incomplete onboarding can get stuck: they fail to be invited to another onboarding flow because of the existing one. When automated affiliate invites fail, we [store the error in a table](https://git.xarth.tv/cb/achievements/blob/master/internal/affiliateinviter/affiliate_inviter.go#L146) and wait for an [onboarding complete SNS message](https://git.xarth.tv/cb/achievements/blob/master/internal/affiliateworker/process.go#L39) to come in. When we find a message matching the failed invite, we send a re-invite and remove the error from the table.
#### Why
When affiliate invites fail during quest batch jobs, we can't retry during the job because they fail due to some existing state that requires user action. To avoid sending manual re-invites, we must store them and process them asynchronously. Payments will send an SNS message whenever a user completes an existing onboarding workflow, which we must use to check if that channel has an existing failed invite.
#### Side Effects
This system only works for invites that fail with an "existing onboarding workflow" error. If the invite system is down, the failures are still logged in the table, but no SNS message will arrive to trigger a re-send, so manual intervention is required.
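
The store-then-retry flow described above can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation: the real system uses a database table and an SNS consumer, and every name here (`failedInvites`, `inviteFunc`, `errExistingOnboarding`) is hypothetical.

```go
package main

import (
	"errors"
	"sync"
)

// errExistingOnboarding stands in for the "existing onboarding workflow" failure.
var errExistingOnboarding = errors.New("existing onboarding workflow")

// inviteFunc sends an affiliate invite to a channel.
type inviteFunc func(channelID string) error

// failedInvites is an in-memory stand-in for the table of failed invites.
type failedInvites struct {
	mu        sync.Mutex
	byChannel map[string]error
}

func newFailedInvites() *failedInvites {
	return &failedInvites{byChannel: make(map[string]error)}
}

// inviteOrRecord attempts an invite and records the error if it fails.
func (f *failedInvites) inviteOrRecord(invite inviteFunc, channelID string) error {
	err := invite(channelID)
	if err != nil {
		f.mu.Lock()
		f.byChannel[channelID] = err
		f.mu.Unlock()
	}
	return err
}

// onOnboardingComplete handles an "onboarding complete" message for a channel.
// It reports whether a re-invite was sent successfully.
func (f *failedInvites) onOnboardingComplete(invite inviteFunc, channelID string) (bool, error) {
	f.mu.Lock()
	_, pending := f.byChannel[channelID]
	f.mu.Unlock()
	if !pending {
		return false, nil // no failed invite recorded for this channel
	}
	if err := invite(channelID); err != nil {
		return false, err // keep the record so a later message can retry
	}
	f.mu.Lock()
	delete(f.byChannel, channelID)
	f.mu.Unlock()
	return true, nil
}
```

Note that the retry is driven entirely by incoming messages. A failure recorded for any other reason simply sits in the store, which matches the limitation described in the side effects.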

## Multiple Redshift Clusters
#### What
There are two Redshift clusters in use by Achievements. One contains lifetime data for ccu and minute_broadcast. The other is a standard tahoe replica using spade.
#### Why
It's not possible to query tahoe_recent without date ranges, because the date is the sort key of every table and is required to make any query perform well. When Achievements was first implemented, tahoe_recent did not exist and different [schemas](https://git.xarth.tv/cb/achievements/tree/master/_migrations/redshift) were used for Redshift tables. Those tables let us query over ranges that tahoe_recent does not allow, so we could not migrate those queries over once we needed to use a tahoe replica.
#### Side Effects
Lifetime data does not scale infinitely, and the Redshift cluster will eventually run out of space. At some point, we need either a smarter way to get lifetime data or a different way to query it.

## RDS Parameter Groups
#### What
The Achievements RDS cluster uses a [custom parameter group](https://git.xarth.tv/cb/achievements/blob/master/terraform/modules/achievements/postgres.tf#L32) for configuring replication settings. These have been carefully tuned to match current batch loads in order to keep the cluster healthy.
#### Why
Large batch jobs seem to break Postgres replication, because it cannot replicate data as fast as it's being written. Many scheduled jobs in Achievements write a lot of data very quickly, and broken replication can take down the cluster. We attempt to store larger WALs (write-ahead logs) for longer in this cluster to try to avoid that.
#### Side Effects
This is a best-effort change, and is not guaranteed to continue to work. It does not scale to larger jobs, so new achievements should be added with caution and tested on staging. Spikes in write IOPS can still cause replication issues, even with the parameter group changes.
