### Archive

Archive for the ‘open source software’ Category

Problem: Implement a primality test in Haskell. Below is the code. The first implementation of the function divides uses a homemade function for finding the remainder when an integer is divided by another integer.

```-- The remainder when a is divided by b.
remainder :: Integer -> Integer -> Integer
remainder a b | a < b = a
| a == b = 0
| otherwise = remainder (a - b) b

-- Whether d divides n.
divides :: Integer -> Integer -> Bool
divides d n = remainder n d == 0
```

The second implementation of divides uses the built-in function rem to find the remainder upon division by an integer.

```-- Whether d divides n.  A more efficient version that uses the built-in
-- function rem.
divides :: Integer -> Integer -> Bool
divides d n = rem n d == 0
```

The full primality test follows:

```-- Whether d divides n.  A more efficient version that uses the built-in
-- function rem.
divides :: Integer -> Integer -> Bool
divides d n = rem n d == 0

-- The least divisor of n that is at least k.
ldf :: Integer -> Integer -> Integer
ldf k n | divides k n = k
| k^2 > n = n
| otherwise = ldf (k + 1) n

-- The least divisor of n.
ld :: Integer -> Integer
ld n = ldf 2 n

-- Primality test.
prime :: Integer -> Bool
prime n | n < 1 = error "must be a positive integer"
| n == 1 = False
| otherwise = ld n == n
```

## Basic implementation of Dijkstra’s algorithm

Dijkstra’s algorithm as presented in Algorithm 2.5, page 75 of the book Algorithmic Graph Theory is meant to be a general template. Lots of details have been left out, one in particular is how to implement line 6 of the algorithm. This one line of Dijkstra’s algorithm has been the subject of numerous research papers on how to efficiently implement a search technique for Dijkstra’s algorithm. A simple search technique is linear search, where you search some array or list from start to finish. A more efficient search technique for line 6 is a binary heap. To implement infinity as stated on line 1 of Algorithm 2.5, you would simply let a very large number represent infinity. This number should ideally be larger than any weight in the graph you are searching.

Below is a basic implementation of Dijkstra’s algorithm following the general template of Algorithm 2.5. This implementation is meant to be for searching in simple, unweighted, undirected, connected graphs. Because the graph to search is assumed to be unweighted, I simply let each edge have unit weight and represent infinity as the integer $10^9$. The implementation below should provide a basis on which to implement Dijkstra’s algorithm for, say, weighted graphs and other types of graphs. To use the implementation below, save it to a Python file, load the file into a Sage session using load(), and call the function dijkstra().

```def dijkstra(G, s):
"""
Shortest paths in a graph using Dijkstra's algorithm.

INPUT:

- G -- a simple, unweighted, undirected, connected graph. Thus each edge
has unit weight.
- s -- a vertex in G from which to start the search.

OUTPUT:

A list D of distances such that D[v] is the distance of a shortest path
from s to v.  A dictionary P of vertex parents such that P[v] is the
parent of v.

EXAMPLES:

sage: G = graphs.PetersenGraph()
sage: dijkstra(G, 0)
([0, 1, 2, 2, 1, 1, 2, 2, 2, 2], {1: 0, 2: 1, 3: 4, 4: 0, 5: 0, 6: 1, 7: 5, 8: 5, 9: 4})

sage: G = Graph({0:{1:1, 3:1}, 1:{2:1, 3:1, 4:1}, 2:{4:1}})
sage: dijkstra(G, 0)
([0, 1, 2, 1, 2], {1: 0, 2: 1, 3: 0, 4: 1})
"""
n = G.order()  # how many vertices
m = G.size()   # how many edges
D = [1000000000 for _ in range(n)]  # largest weights; represent +infinity
D[s] = 0       # distance from vertex to itself is zero
P = {}         # a dictionary for fast look-up
Q = set(G.vertices())
while len(Q) > 0:
v = mindist(D, Q)
Q.remove(v)
if D[u] > D[v] + 1:     # each edge has unit weight, so add 1
D[u] = D[v] + 1
P.setdefault(u, v)  # the parent of u is v
return D, P

def mindist(D, Q):
"""
Choose a vertex in Q such that it has minimal distance.

INPUT:

- D -- a list of vertices with corresponding distances.  Each distance
D[v] corresponding to a vertex v means that v is that much further away
from a source vertex.
- Q -- all vertices to consider.

OUTPUT:

A vertex with minimum distance.
"""
v = None          # start the search here
low = 1000000000  # the running minimum distance; represent +infinity
for u in Q:
if D[u] < low:
v = u
low = D[v]
return v
```
Categories: graph theory, mathematics, Sage

Just some snippet of Haskell code today to return the last but one element of a list. The first implementation uses the built-in functions `last` and `take`.

```-- Return the last but one element of a list.  This implementation uses
-- last and take.
penlast :: [a] -> a
penlast xs = if null xs || length xs <= 2
else last (take ((length xs) - 1) xs)
```

The second implementation below uses the built-in functions `head` and `drop`.

```-- Return the penultimate element of a list.  This implementation uses
penhead xs = if null xs || length xs <= 2
else head (drop ((length xs) - 2) xs)
```

## 2011 Spies Prize: Robert Bradshaw

This year, Robert Bradshaw is the winner of the Annual Spies Sage Development Prize. Congratulations, Robert! Here is the prize citation:

Robert Bradshaw has been an extremely active and productive Sage developer for over five years. Additionally, he has been a leader, both in maintaining the community and in important design decisions.

He is probably best known for his work on Cython, which is critical for the performance of many key parts of Sage, and his work designing and implementing the coercion model, which makes many powerful mathematical constructions possible. However, his interests and significant contributions are wide-ranging, including: exact linear algebra, arithmetic of elliptic curves, L-functions, 3-D plotting and parallel building. A recent project is the patchbot tool, which automates testing contributions posted on trac. Moreover, he is an important contributor to trouble-shooting and design discussions in the sage-devel forum and is also the third most numerous poster of all time in the sage-support forum.

For his many important technical contributions, and his long-time and continuing involvement in the Sage community, Robert Bradshaw is awarded the 2011 Spies Sage Development Prize. This award carries a prize of \$500 from the Sage Foundation (thanks to Jaap Spies).

## Documenta Mathematica now mirrored at sagemath.org

I’m delighted to announce that the journal Documenta Mathematica is now mirrored on the Sage website at http://www.sagemath.org/documenta/. Documenta Mathematica is an open access mathematics journal. It is open to all fields of mathematics. Articles are refereed in the traditional anonymous peer review model, which is what any respectable journal does.

Categories: mathematics, Sage

## Statistical analysis of the Fisher-Yates shuffle

The Fisher-Yates shuffle is a procedure for producing a random permutation of a sequence. This procedure is also known as the Knuth shuffle. Here I provide a statistical analysis of an implementation of the Fisher-Yates shuffle. A central idea is that any permutation of a sequence should equally likely be an output of the Fisher-Yates shuffle. That is, in a large enough number of shuffles of a fixed sequence, the observed probability of each permutation produced by a Fisher-Yates shuffle implementation should cluster around or converge to the theoretical probability for that permutation. As the number of shuffles increases, the observed probability for each possible permutation should converge to the theoretical probability. Otherwise there is something wrong with the implementation. I used my implementation of the Fisher-Yates shuffle to produce random permutations of various simple sequences of digits. The resulting output of the shuffles were used to perform frequency analyses of the behaviour of the implementation. Following are details on the particular sequences and the number of iterations for each sequence. Iteration here counts the number of times that I shuffled the given sequence. An experiment on a sequence is then the totality of all shuffles performed on it.

• Sequence: 123. Iterations: 1,000,000
• Sequence: 1234. Iterations: 1,000,000
• Sequence: 12345. Iterations: 1,000,000
• Sequence: 123456. Iterations: 10,000,000
• Sequence: 1234567. Iterations: 100,000,000
• Sequence: 12345678. Iterations: 100,000,000
• Sequence: 123456789. Iterations: 100,000,000

Each sequence was initialized as follows. Let $V$ be a vector where $V[i]$ holds the digit $i + 1$. So $V[0]$ holds the digit $1$, $V[1]$ holds $2$, and so on. Two versions of the experiment was performed on each sequence. In the first version of the experiment, called version A, at the start of each iteration, I initialized $V$ to be as described above. Then I randomly permuted the vector. In version B of the experiment, I first initialized $V$ to be as above. Then I proceeded to repeatedly randomly permute $V$. Thus if $V_i$ is the permutation obtained from iteration $i$, then during iteration $i + 1$ I applied the Fisher-Yates shuffle on $V_i$ to obtain $V_{i+1}$. These two different versions of each experiment on a sequence were performed to see whether if they would produce qualitatively identical results. The experimental results suggest so: the two different versions of each experiment produced qualitatively similar results.

Source code of the experiments are provided here. Note that in order to compile the C files, you need to check out igraph trunk from Launchpad, apply the patch on this ticket, and then compile and install the resulting patched igraph version on your system. The C files containing the code for the experiments output the result of each shuffle to a file. For small sequences with say 3 to 4 digits, the resulting output files are a few MB in size. But for longer sequences, such as with 5 or more digits, the output files can be from tens of MB to hundreds of MB in size. The experimental data are easily generated from the above C files, so I do not provide the data. The data for each experiment were analyzed using the Python script fisherstat.py. If you intend to replicate the experiments, you need to adjust this script for each data file of each experiment. Given a data file for each experiment, the Python file is loaded from within the Sage command line interface; everything from then on is automated, from reading the experimental data to computing the frequency distribution. All experiments were run on the Sage cluster, in particular the sage.math compute node, whose purchase is supported by US National Science Foundation Grant No. DMS-0821725. Data analysis was performed using the Sage mathematics software system version 4.6.2.

Also note that the Bitbucket.org project link also points to PDF files. These files plot the normalized frequency distributions of the experimental data. The horizontal axis of each plot is for the permutation IDs. Each permutation of a fixed sequence is assigned a unique ID starting from 0. For example, for the sequence “123” here are all the possible permutations together with their corresponding IDs:

```123 -> 0
132 -> 1
213 -> 2
231 -> 3
312 -> 4
321 -> 5
```

The vertical axis contains the corresponding normalized frequency of each permutation. Each frequency count was normalized by the number of iterations for the corresponding experiment. See the script fisherstat.py for further details. The normalized frequency for a permutation can be thought of as the empirical probability of that permutation showing up as a result of a Fisher-Yates shuffle.

And now comes the fun bit: plots of the experimental data. As I said above, both versions of each experiment produced qualitatively similar results. For this reason, below I only show some plots for version A of each experiment. To see all plots including plots for version B, refer to the Bitbucket.org project page. For kicks, each PDF file containing a plot was typeset using LaTeX and pgfplots.

As is evident from the above plots, for each sequence considered the empirical probabilities resulting from the experiments cluster around the theoretical probabilities. For a sequence of 3 or 4 digits, the empirical probabilities converge to the theoretical probability after a million or so experimental iterations. For example, the sequence “123” has six possible permutations so each permutation has a theoretical probability of $1/6$ of occurring as a result of the Fisher-Yates shuffle. In the above plot for the sequence “123”, it can be seen that the empirical probabilities converge to the theoretical probability after one million iterations. But as the number of digits in a sequence increases, the number of experimental iterations needs to increase as well in order to observe a convergence of the empirical probabilities to the theoretical probability for that sequence. For example, after $10^8$ iterations for the sequence “123456789”, observe that the empirical probabilities still cluster around the theoretical probability of $1 / 362880$. As the iteration number increases, the range of empirical probabilities should converge to the theoretical probability.

## Installing ESS (Emacs Speaks Statistics)

Problem

You have a non-standard directory under which you install software. You have installed Emacs under this directory. Now you want to install ESS under this directory as well so that Emacs could interact with R.

Solution

```/scratch/username/usr
```

```/scratch/username/usr/share/emacs/site-lisp
```

```;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; ESS: Emacs Speaks Statistics, mainly for R
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

```M-x R