## Statistical analysis of the Fisher-Yates shuffle

The Fisher-Yates shuffle is a procedure for producing a random permutation of a sequence. This procedure is also known as the Knuth shuffle. Here I provide a statistical analysis of an implementation of the Fisher-Yates shuffle. A central idea is that any permutation of a sequence should equally likely be an output of the Fisher-Yates shuffle. That is, in a large enough number of shuffles of a fixed sequence, the observed probability of each permutation produced by a Fisher-Yates shuffle implementation should cluster around or converge to the theoretical probability for that permutation. As the number of shuffles increases, the observed probability for each possible permutation should converge to the theoretical probability. Otherwise there is something wrong with the implementation. I used my implementation of the Fisher-Yates shuffle to produce random permutations of various simple sequences of digits. The resulting output of the shuffles were used to perform frequency analyses of the behaviour of the implementation. Following are details on the particular sequences and the number of iterations for each sequence. Iteration here counts the number of times that I shuffled the given sequence. An experiment on a sequence is then the totality of all shuffles performed on it.

- Sequence: 123. Iterations: 1,000,000
- Sequence: 1234. Iterations: 1,000,000
- Sequence: 12345. Iterations: 1,000,000
- Sequence: 123456. Iterations: 10,000,000
- Sequence: 1234567. Iterations: 100,000,000
- Sequence: 12345678. Iterations: 100,000,000
- Sequence: 123456789. Iterations: 100,000,000

Each sequence was initialized as follows. Let be a vector where holds the digit . So holds the digit , holds , and so on. Two versions of the experiment was performed on each sequence. In the first version of the experiment, called version A, at the start of each iteration, I initialized to be as described above. Then I randomly permuted the vector. In version B of the experiment, I first initialized to be as above. Then I proceeded to repeatedly randomly permute . Thus if is the permutation obtained from iteration , then during iteration I applied the Fisher-Yates shuffle on to obtain . These two different versions of each experiment on a sequence were performed to see whether if they would produce qualitatively identical results. The experimental results suggest so: the two different versions of each experiment produced qualitatively similar results.

Source code of the experiments are provided here. Note that in order to compile the C files, you need to check out igraph trunk from Launchpad, apply the patch on this ticket, and then compile and install the resulting patched igraph version on your system. The C files containing the code for the experiments output the result of each shuffle to a file. For small sequences with say 3 to 4 digits, the resulting output files are a few MB in size. But for longer sequences, such as with 5 or more digits, the output files can be from tens of MB to hundreds of MB in size. The experimental data are easily generated from the above C files, so I do not provide the data. The data for each experiment were analyzed using the Python script fisherstat.py. If you intend to replicate the experiments, you need to adjust this script for each data file of each experiment. Given a data file for each experiment, the Python file is loaded from within the Sage command line interface; everything from then on is automated, from reading the experimental data to computing the frequency distribution. All experiments were run on the Sage cluster, in particular the sage.math compute node, whose purchase is supported by US National Science Foundation Grant No. DMS-0821725. Data analysis was performed using the Sage mathematics software system version 4.6.2.

Also note that the Bitbucket.org project link also points to PDF files. These files plot the normalized frequency distributions of the experimental data. The horizontal axis of each plot is for the permutation IDs. Each permutation of a fixed sequence is assigned a unique ID starting from 0. For example, for the sequence “123” here are all the possible permutations together with their corresponding IDs:

123 -> 0 132 -> 1 213 -> 2 231 -> 3 312 -> 4 321 -> 5

The vertical axis contains the corresponding normalized frequency of each permutation. Each frequency count was normalized by the number of iterations for the corresponding experiment. See the script fisherstat.py for further details. The normalized frequency for a permutation can be thought of as the empirical probability of that permutation showing up as a result of a Fisher-Yates shuffle.

And now comes the fun bit: plots of the experimental data. As I said above, both versions of each experiment produced qualitatively similar results. For this reason, below I only show some plots for version A of each experiment. To see all plots including plots for version B, refer to the Bitbucket.org project page. For kicks, each PDF file containing a plot was typeset using LaTeX and pgfplots.

As is evident from the above plots, for each sequence considered the empirical probabilities resulting from the experiments cluster around the theoretical probabilities. For a sequence of 3 or 4 digits, the empirical probabilities converge to the theoretical probability after a million or so experimental iterations. For example, the sequence “123” has six possible permutations so each permutation has a theoretical probability of of occurring as a result of the Fisher-Yates shuffle. In the above plot for the sequence “123”, it can be seen that the empirical probabilities converge to the theoretical probability after one million iterations. But as the number of digits in a sequence increases, the number of experimental iterations needs to increase as well in order to observe a convergence of the empirical probabilities to the theoretical probability for that sequence. For example, after iterations for the sequence “123456789”, observe that the empirical probabilities still cluster around the theoretical probability of . As the iteration number increases, the range of empirical probabilities should converge to the theoretical probability.

## Converting from binary to integer

The following is an updated and edited version of my posts to this sage-support thread.

**Problem**

You have a bitstring as output by

sage.crypto.stream.blum_blum_shub

and you want to convert that bitstring to an integer. Or in general, you want to convert a bit vector to its integer representation.

**Solution**

Here are two ways, assuming that you want the bits in little-endian order, i.e. you read the bits from right to left in increasing order of powers of 2.

sage: version() 'Sage Version 4.5.3, Release Date: 2010-09-04' sage: from sage.crypto.stream import blum_blum_shub sage: b = blum_blum_shub(length=6, lbound=10**4, ubound=10**5); b 100110 sage: type(b) <class 'sage.monoids.string_monoid_element.StringMonoidElement'> sage: # read in little-endian order sage: # conversion using Python's built-in int() sage: int(str(b), base=2) 38 sage: # conversion using Sage's built-in Integer() sage: Integer(str(b), base=2) 38

Now assume you read the bitstring as output by blum_blum_shub in big-endian order, i.e. from left to right in increasing order of powers of 2. You simply convert the bitstring to a string, reverse that string, and apply any of the above two methods.

sage: # reversing a string sage: str(b) '100110' sage: str(b)[::-1] '011001' sage: # read in big-endian order sage: int(str(b)[::-1], base=2) 25 sage: Integer(str(b)[::-1], base=2) 25

Or you can do as follows:

sage: b = "100110" sage: sum(Integer(i) * (2^Integer(e)) for e, i in enumerate(b)) 25 sage: sum(Integer(i) * (2^Integer(e)) for e, i in enumerate(b[::-1])) 38

Another way is to use Horner’s method. Here’s a Sage function that computes the integer representation of a bit vector read using big-endian order. A usage example is also shown.

sage: def horner(A, x0): ... # Evaluate the polynomial P(x) at x = x_0. ... # ... # INPUT ... # ... # - A -- list of coefficients of P where A[i] is the coefficient of ... # x_i. ... # - x0 -- the value x_0 at which to evaluate P(x). ... # ... # OUTPUT ... # ... # An evaluation of P(x) using Horner's method. ... i = len(A) - 1 ... b = A[i] ... i -= 1 ... while i >= 0: ... b = b*x0 + A[i] ... i -= 1 ... return b sage: A = [1, 0, 0, 1, 1, 0] sage: horner(A, 2) 25

As an exercise, modify the function horner to output the integer representation of a bit vector that is read using little-endian order.

## Optimized parity testing

To test the parity of an integer is to determine whether it is even or odd. Letting n = 132469, we can test the parity of n by computing its value modulo 2. This can be done either by using the Sage built-in mod function, the Python modulo operator %, or by using the Python bitwise operator &. The operator & is bitwise conjunction, i.e. it corresponds to multiplication over the Galois field of two elements. The integer n = 132469 is odd, hence the result of parity testing via mod, %, and & should each return 1.

sage: n = 132469 sage: mod(n, 2).lift() 1 sage: mod(n, 2) 1 sage: n % 2 1 sage: n & 1 1

However, the test using the bitwise operator & is the fastest of all:

sage: %timeit mod(n, 2).lift() 625 loops, best of 3: 42.2 micro second per loop sage: %timeit mod(n, 2) 625 loops, best of 3: 42.2 micro second per loop sage: %timeit n % 2 625 loops, best of 3: 1.22 micro second per loop sage: %timeit n & 1 625 loops, best of 3: 1.02 micro second per loop

The program below demonstrates how to do parity testing using C.

#include <stdio.h> /* Parity testing using bitwise AND. */ static void parity_and(int n) { if (n & 1) printf("%i is odd\n", n); else printf("%i is even\n", n); } /* Parity testing using the modulus operator. */ static void parity_mod(int n) { if (n % 2) printf("%i is odd\n", n); else printf("%i is even\n", n); } int main(void) { int n = 132469; int m = 132470; printf("Parity testing using modulus operator\n"); parity_mod(n); parity_mod(m); printf("Parity testing using bitwise AND\n"); parity_and(n); parity_and(m); return 0; }

The output of the program is:

Parity testing using modulus operator 132469 is odd 132470 is even Parity testing using bitwise AND 132469 is odd 132470 is even

## List within a list

The problem below was asked on the AskSage forum. Below I restate the problem and three possible solutions.

**Problem**

I have a short list of primes

sage: L = [2, 5, 23]

and a longer list of primes

sage: P = primes_first_n(20); P [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

How do I check that all members of L are contained within P?

**Solution**

You can use a brute-force search by defining your own custom function. This option doesn’t assume that elements in your lists are unique. Your lists can contain duplicate elements if you want.

sage: def is_sublist(shortlist, longlist): ....: for e in shortlist: ....: if not (e in longlist): ....: return False ....: return True ....: sage: L = [2, 5, 23] sage: P = primes_first_n(20); P [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71] sage: is_sublist(L, P) True sage: L + [23] [2, 5, 23, 23] sage: is_sublist(L + [23], P) True sage: L.append(next_prime(P[-1])); L [2, 5, 23, 73] sage: is_sublist(L, P) False sage: is_sublist(L + [23], P) False

Alternatively, you can use the built-in functions itertools.imap and all. The function itertools.imap is efficient when your lists are large, e.g. having hundreds or even hundreds of thousands of elements. This second option doesn’t care if your lists have duplicate elements.

sage: import itertools sage: L = [2, 5, 23] sage: P = primes_first_n(20); P [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71] sage: L + [23] [2, 5, 23, 23] sage: all(itertools.imap(lambda x: x in P, L)) True sage: all(itertools.imap(lambda x: x in P, L + [23])) True sage: L.append(next_prime(P[-1])); L [2, 5, 23, 73] sage: all(itertools.imap(lambda x: x in P, L)) False sage: all(itertools.imap(lambda x: x in P, L + [23])) False

Or, as Mitesh Patel said, you could use set. This third approach assumes that the elements in each list are unique, i.e. each list doesn’t contain duplicate elements.

sage: L = [2, 5, 23] sage: P = set(primes_first_n(20)) sage: set(L) set([2, 5, 23]) sage: set(L).issubset(P) True sage: set(L + [23]) set([2, 5, 23]) sage: set(L + [23]).issubset(P) True sage: L.append(111); L [2, 5, 23, 111] sage: set(L) set([2, 111, 5, 23]) sage: set(L + [111]) set([2, 111, 5, 23]) sage: set(L + [111]).issubset(P) False sage: set(L).issubset(P) False

## Thematic tutorials

The release of Sage 4.5.2 exposes a new category of documentation called “thematic tutorials”. Like any tutorial, a thematic tutorial discusses how to use Sage but with a focus on a particular topic. The thematic tutorial page currently lists two tutorials:

More are currently in preparation. See, for example, tickets 8469, 8442, 8467, 3624, and 8466. Any help is appreciated in reviewing or working on these tickets. Any suggestions for thematic tutorial topics? Or are there any existing tutorials that should go under the rubric of thematic tutorials?

## Pushing towards 90% doctest coverage for Sage 5.0

This is an edited version of my post to sage-devel. One of the main goals of the upcoming Sage 5.0 release is to get doctest coverage of the Sage library up to at least 90%. As of Sage 4.4.4.alpha0, the overall weighted coverage is 82.7%. To get a sense of which modules in the Sage library need work on their coverage scores, you could use the coverage script as follows:

$ ./sage -coverage /path/to/module.py[x]

Or you could do the following to get the coverage scores of all modules, including a coverage summary:

$ ./sage -coverageall

You might be interested in knowing which modules have a certain coverage percentage, in which case you could save the output of -coverageall to a text file and then grep that file for certain coverage scores. At this repository is a script to generate various types of coverage analysis reports. You can also find the script here. The script currently supports the following reports

- The coverage summary of all modules.
- Modules with 100% coverage.
- Modules with zero coverage.
- Modules with between 1% and 9% coverage.
- Modules with between 10% and 19% coverage.
- Modules with between 20% and 29% coverage.
- Modules with between 30% and 39% coverage.
- Modules with between 40% and 49% coverage.
- Modules with between 50% and 59% coverage.
- Modules with between 60% and 69% coverage.
- Modules with between 70% and 79% coverage.
- Modules with between 80% and 89% coverage.
- Modules with between 90% and 99% coverage.

Each report has links to detailed reports for individual modules. To run the script, copy it to the SAGE_ROOT of a Sage source or binary installation and do

[mvngu@sage sage-4.4.4.alpha0]$ ./coverage-status.py Coverage report of all modules... Summary of doctest coverage... Modules with 0% coverage... Modules with 100% coverage... Coverage reports within certain ranges... Detailed coverage report for all modules... Format the detailed coverage reports... Format the summary reports... Generate index.html...

And you’re done. Here is a report generated by the script. The idea is to provide an overview of which modules need work. I’d be interested to know what other types of doctest coverage reports people would like to see. Comments, suggestions, critiques, etc. are welcome.