Statistical methods used in this archive

Methodology

Document digitization pipeline

How scanned election protocols become structured data. The vote counts for each party were already available on the Central Election Commission (CEC) site and were scraped directly; they did not pass through this pipeline.

Pipeline stages shown in the diagram:

SCAN: scraped image

DESKEW · ALIGN: straighten and orient

CUT ROWS: split into rows

CUT BOXES: isolate cells

PRINTED CNN: typed numerals

HANDWRIT CNN: handwritten digits

DIGITS → EXCEL: structured output

Sobyanin–Sukhovolsky regression

This method examines the linear relationship between turnout rates and party vote shares across polling stations. It was developed by Alexander Sobyanin and Vladislav Sukhovolsky to identify statistical anomalies inconsistent with organic electoral behavior. The core intuition is that in a genuine election, a party's vote share should not grow faster than voter turnout itself.

$$Y_{ij} = A_j \times T_i + B_j$$

Where:

$Y_{ij}$: vote share for party $j$ at precinct $i$ (votes received ÷ registered voters)

$T_i$: turnout rate at precinct $i$ (votes cast ÷ registered voters)

$A_j$: regression slope, that is, how party $j$'s vote share changes as turnout increases

$B_j$: intercept, the estimated vote share when turnout approaches zero

A slope coefficient significantly above 1.0 means that a party's vote share grows faster than turnout itself, a pattern inconsistent with organic voter behavior. A negative coefficient means that a party loses support as turnout rises. A high R² value (above 0.7) combined with a slope outside the range from 0 to 1 is a key statistical indicator of manipulation.
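The regression and the flagging rule above can be illustrated with a minimal sketch. It assumes an ordinary least squares fit, which the text does not name explicitly, and it applies the 0.7 threshold and the slope range as plain cut-offs without testing statistical significance. The names `Fit`, `fit_line`, and `is_suspicious`, and the precinct figures, are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    slope: float      # A in the equation above
    intercept: float  # B in the equation above
    r_squared: float  # share of the variance in vote share explained by the line


def fit_line(turnout, share):
    """Ordinary least squares fit of share = slope * turnout + intercept.

    Assumes at least two precincts with differing turnout values.
    """
    n = len(turnout)
    mean_t = sum(turnout) / n
    mean_y = sum(share) / n
    s_tt = sum((t - mean_t) ** 2 for t in turnout)
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(turnout, share))
    s_yy = sum((y - mean_y) ** 2 for y in share)
    slope = s_ty / s_tt
    intercept = mean_y - slope * mean_t
    r_squared = (s_ty ** 2) / (s_tt * s_yy) if s_yy > 0 else 0.0
    return Fit(slope, intercept, r_squared)


def is_suspicious(fit, r2_threshold=0.7):
    """Flag a high R^2 combined with a slope above 1 or below 0."""
    out_of_range = fit.slope > 1.0 or fit.slope < 0.0
    return fit.r_squared > r2_threshold and out_of_range


# Hypothetical precincts: (registered voters, votes cast, votes for the party)
precincts = [(1000, 400, 150), (1000, 500, 270), (1000, 600, 390), (1000, 700, 510)]
turnout = [cast / reg for reg, cast, _ in precincts]
share = [votes / reg for reg, _, votes in precincts]

fit = fit_line(turnout, share)
print(fit, is_suspicious(fit))
```

In this made-up data every additional 100 votes cast brings the party 120 additional votes, so the fitted slope exceeds 1 and the rule flags the series. Real data would call for a significance test on the slope before drawing any conclusion.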

Kiesling–Spilkin distribution analysis

This method examines the statistical distribution of turnout and vote shares across polling stations. In fair elections, turnout rates across precincts should approximate a normal distribution. Ballot stuffing and related manipulation techniques systematically distort these distributions, producing right-skewed patterns or bimodal distributions, where a cluster of manipulated precincts appears at artificially high turnout levels.

The analysis uses histogram visualization to identify clustering around round-number turnout values and to assess bimodality or right skew. Statistical tests include normality tests, symmetry assessments, and calculations of skewness and kurtosis coefficients.
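As an illustration of the skewness and kurtosis coefficients mentioned above, the following sketch computes both from population moments. The choice of population (rather than bias-corrected sample) formulas is an assumption, the function name `moments` is hypothetical, and the turnout values are invented for the example.

```python
def moments(values):
    """Return (skewness, excess kurtosis) from population central moments.

    Assumes the values are not all identical, so the variance is non-zero.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Invented turnout rates: a symmetric set, and one with a high-turnout cluster
symmetric = [0.40, 0.45, 0.50, 0.55, 0.60]
with_tail = [0.45, 0.50, 0.50, 0.55, 0.50, 0.95, 0.97]

sym_skew, sym_kurt = moments(symmetric)
tail_skew, tail_kurt = moments(with_tail)
print(sym_skew, tail_skew)
```

The symmetric set yields a skewness of essentially zero, while the set with a cluster at very high turnout yields a positive value, which is the right-skew signature described above.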

To mitigate artefactual peaks arising from integer division (since turnout is the ratio of two integers), uniform random noise in the range [−0.5, +0.5] is added to the vote count for each party at each precinct. Results are then binned into 1% turnout groups. This procedure is repeated 10 times and the average is taken, eliminating histogram peaks that arise from rounding rather than real patterns in the data.
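The smoothing procedure can be sketched as follows. Several details are assumptions not stated in the text: turnout is recomputed as the sum of the jittered party counts divided by registered voters (invalid ballots are ignored), the histogram counts precincts per bin, and out-of-range values are clamped to the first or last bin. The function name `smoothed_turnout_histogram` and the sample precincts are hypothetical.

```python
import random


def smoothed_turnout_histogram(precincts, repeats=10, seed=0):
    """Average the precinct count per 1% turnout bin over several jittered passes.

    precincts: list of (registered_voters, [votes for each party]).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Bin k covers turnout in [k%, (k+1)%); values at or above 100% go in
    # bin 100 and values below 0% go in bin 0.
    bins = [0.0] * 101
    for _ in range(repeats):
        for registered, party_votes in precincts:
            # Add uniform noise in [-0.5, +0.5] to each party's vote count
            jittered = [v + rng.uniform(-0.5, 0.5) for v in party_votes]
            turnout = sum(jittered) / registered
            k = min(max(int(turnout * 100), 0), 100)
            bins[k] += 1.0 / repeats  # accumulate the average directly
    return bins


# Invented data: 50 tiny precincts, each with 10 registered voters and 5 votes.
# Without jitter, all of them would pile up in the 50% bin.
precincts = [(10, [3, 2]) for _ in range(50)]
bins = smoothed_turnout_histogram(precincts)
print(bins[50], sum(bins))
```

The total across bins stays equal to the number of precincts, but the artificial spike at exactly 50% is spread over the neighbouring bins.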

Data sources

All underlying data is sourced from summary protocols of precinct election commissions published by the Central Election Commission of Georgia. The dataset includes: number of registered voters, voters included in special lists, total votes cast, invalid ballots, and votes received by each participating party. Geographic indicators distinguish between urban and rural settlements and between specific regions of the country.
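One possible in-memory representation of a precinct record is sketched below. The class name `PrecinctRecord`, its field names, and the sample values are hypothetical; only the list of fields comes from the description above. Whether voters on special lists are already counted among registered voters is not stated, so the sketch leaves the denominator as registered voters only, matching the turnout and vote share definitions used in the regression section.

```python
from dataclasses import dataclass, field


@dataclass
class PrecinctRecord:
    registered_voters: int
    special_list_voters: int
    total_votes_cast: int
    invalid_ballots: int
    party_votes: dict = field(default_factory=dict)  # party name -> votes received
    settlement_type: str = "urban"  # "urban" or "rural"
    region: str = ""

    def turnout(self):
        """Votes cast divided by registered voters."""
        return self.total_votes_cast / self.registered_voters

    def share(self, party):
        """Votes for one party divided by registered voters."""
        return self.party_votes[party] / self.registered_voters


# Invented example record
record = PrecinctRecord(
    registered_voters=1000,
    special_list_voters=20,
    total_votes_cast=600,
    invalid_ballots=12,
    party_votes={"Party A": 300, "Party B": 288},
    settlement_type="rural",
    region="Example Region",
)
print(record.turnout(), record.share("Party A"))
```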

Protocols were digitized from scanned documents using a custom convolutional neural network and subsequently verified manually, including reconciliation of handwritten corrections and explanatory notes made by election commission members on original protocols.