Secret computer code threatens science

18 April 2012 Leave a comment

An article in Scientific American discusses the vital role of software in science. Here’s the abstract:

Missing source code can allow bad science to slip through the cracks and means extra headaches for scientists who want to closely follow up on new studies or check for errors

Categories: open science

Primality test in Haskell

14 March 2012 Leave a comment

Problem: implement a primality test in Haskell. Below is the code. The first implementation of the function divides uses a homemade function to find the remainder when one integer is divided by another.

-- The remainder when a is divided by b.                                        
remainder :: Integer -> Integer -> Integer
remainder a b | a < b = a
              | a == b = 0
              | otherwise = remainder (a - b) b

-- Whether d divides n.                                                         
divides :: Integer -> Integer -> Bool                                        
divides d n = remainder n d == 0

The second implementation of divides uses the built-in function rem to find the remainder upon division by an integer.

-- Whether d divides n.  A more efficient version that uses the built-in        
-- function rem.                                                                
divides :: Integer -> Integer -> Bool
divides d n = rem n d == 0

The full primality test follows:

-- Whether d divides n.  A more efficient version that uses the built-in        
-- function rem.                                                                
divides :: Integer -> Integer -> Bool
divides d n = rem n d == 0

-- The least divisor of n that is at least k.                                   
ldf :: Integer -> Integer -> Integer
ldf k n | divides k n = k
        | k^2 > n = n
        | otherwise = ldf (k + 1) n

-- The least divisor of n.                                                      
ld :: Integer -> Integer
ld n = ldf 2 n

-- Primality test.                                                              
prime :: Integer -> Bool
prime n | n < 1 = error "must be a positive integer"
        | n == 1 = False
        | otherwise = ld n == n
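
As a quick sanity check (my own GHCi session, not from the original post), load the definitions above into GHCi and test a few values:

ghci> prime 1
False
ghci> prime 9
False
ghci> prime 13
True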

Basic implementation of Dijkstra’s algorithm

25 February 2012 1 comment

Dijkstra’s algorithm as presented in Algorithm 2.5, page 75, of the book Algorithmic Graph Theory is meant to be a general template. Many details have been left out; one in particular is how to implement line 6 of the algorithm. This one line has been the subject of numerous research papers on how to efficiently select the minimum-distance vertex in Dijkstra’s algorithm. A simple technique is linear search, where you scan an array or list from start to finish. A more efficient technique for line 6 is a binary heap; a heap-based sketch follows the listing below. To implement infinity as stated on line 1 of Algorithm 2.5, you can simply let a very large number represent infinity. This number should ideally be larger than any possible path weight in the graph you are searching.

Below is a basic implementation of Dijkstra’s algorithm following the general template of Algorithm 2.5. This implementation is meant for searching simple, unweighted, undirected, connected graphs. Because the graph is assumed to be unweighted, I simply let each edge have unit weight and represent infinity as the integer 10^9. The implementation should provide a basis on which to implement Dijkstra’s algorithm for, say, weighted graphs and other types of graphs. To use it, save the code to a Python file, load the file into a Sage session using load(), and call the function dijkstra().

def dijkstra(G, s):
    """
    Shortest paths in a graph using Dijkstra's algorithm.

    INPUT:

    - G -- a simple, unweighted, undirected, connected graph. Thus each edge
      has unit weight.
    - s -- a vertex in G from which to start the search.

    OUTPUT:

    A list D of distances such that D[v] is the distance of a shortest path
    from s to v.  A dictionary P of vertex parents such that P[v] is the
    parent of v.

    EXAMPLES:

    sage: G = graphs.PetersenGraph()
    sage: dijkstra(G, 0)
    ([0, 1, 2, 2, 1, 1, 2, 2, 2, 2], {1: 0, 2: 1, 3: 4, 4: 0, 5: 0, 6: 1, 7: 5, 8: 5, 9: 4})

    sage: G = Graph({0:{1:1, 3:1}, 1:{2:1, 3:1, 4:1}, 2:{4:1}})
    sage: dijkstra(G, 0)
    ([0, 1, 2, 1, 2], {1: 0, 2: 1, 3: 0, 4: 1})
    """
    n = G.order()  # how many vertices
    D = [10**9 for _ in range(n)]  # a large weight standing in for +infinity
    D[s] = 0       # the distance from a vertex to itself is zero
    P = {}         # a dictionary of vertex parents for fast look-up
    Q = set(G.vertices())
    while len(Q) > 0:
        v = mindist(D, Q)
        Q.remove(v)
        Adj = set(G.neighbors(v))
        for u in Adj.intersection(Q):
            if D[u] > D[v] + 1:     # each edge has unit weight, so add 1
                D[u] = D[v] + 1
                P[u] = v            # record v as the parent of u
    return D, P

def mindist(D, Q):
    """
    Choose a vertex in Q such that it has minimal distance.

    INPUT:

    - D -- a list of distances indexed by vertex.  The entry D[v] is the
      current distance of vertex v from a source vertex.
    - Q -- all vertices to consider.

    OUTPUT:

    A vertex with minimum distance.
    """
    v = None     # no vertex chosen yet
    low = 10**9  # the running minimum distance; stands in for +infinity
    for u in Q:
        if D[u] < low:
            v = u
            low = D[v]
    return v
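
For the curious, below is a minimal sketch of the heap-based selection mentioned at the start of this post, using Python’s built-in heapq module. The function name dijkstra_heap and the lazy-deletion strategy are my own choices, not part of Algorithm 2.5; the sketch keeps the same simple, unweighted, undirected, connected assumptions as dijkstra() above.

import heapq

def dijkstra_heap(G, s):
    """
    Same input and output conventions as dijkstra() above, but the
    minimum-distance vertex is selected with a binary heap instead of
    a linear scan.
    """
    n = G.order()                  # how many vertices
    D = [10**9 for _ in range(n)]  # a large weight standing in for +infinity
    D[s] = 0
    P = {}
    H = [(0, s)]                   # a heap of (distance, vertex) pairs
    visited = set()
    while H:
        d, v = heapq.heappop(H)
        if v in visited:
            continue               # lazy deletion: skip stale heap entries
        visited.add(v)
        for u in G.neighbors(v):
            if u not in visited and D[u] > d + 1:  # each edge has unit weight
                D[u] = d + 1
                P[u] = v
                heapq.heappush(H, (D[u], u))
    return D, P

Selecting the next vertex now costs O(log n) per heap operation instead of the O(n) scan performed by mindist(), which is where the speed-up comes from on large graphs.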
Categories: graph theory, mathematics, Sage

The case for open computer programs

22 February 2012 1 comment

An article [1] in the science journal Nature makes the case for releasing the software used to produce the results of scientific papers. For people who don’t have access to Nature, the abstract of the article should sum up its main thesis:

Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail.

[1] D. C. Ince, L. Hatton, and J. Graham-Cumming. The case for open computer programs. Nature, 482:485–488, 2012. DOI: 10.1038/nature10836

Categories: open science

Haskell snippet

5 November 2011 Leave a comment

Just a snippet of Haskell code today: return the last but one element of a list. The first implementation uses the built-in functions last and take.

-- Return the last but one element of a list.  This implementation uses
-- last and take.  For a list of one or two elements, the head is returned.
penlast :: [a] -> a
penlast xs = if length xs <= 2
             then head xs
             else last (take (length xs - 1) xs)

The second implementation below uses the built-in functions head and drop.

-- Return the penultimate element of a list.  This implementation uses
-- head and drop.
penhead :: [a] -> a
penhead xs = if length xs <= 2
             then head xs
             else head (drop (length xs - 2) xs)
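
Both versions above call length, which traverses the whole list. As a single-pass alternative (a sketch of my own; the name penult is not from the original post), direct pattern matching also works:

-- Return the last but one element by direct pattern matching.  This
-- version walks the list once instead of computing its length.
penult :: [a] -> a
penult []       = error "penult: empty list"
penult [x]      = x
penult [x, _]   = x
penult (_ : xs) = penult xs

Like the versions above, it returns the head for lists of one or two elements, but it gives a clearer error on the empty list.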
Categories: Haskell, programming

Originality in research

22 July 2011 3 comments

We never think entirely alone: we think in company, in a vast collaboration; we work with the workers of the past and of the present. [In] the whole intellectual world … each one finds in those about him [or her] the initiation, help, verification, information, encouragement, that he [or she] needs.

— A. D. Sertillanges [11, p.145]

In a post called It’s academic, I used the phrase “original contribution to scholarship” to describe one of many important outcomes of PhD research. The phrase is usually synonymous with publications in the peer-reviewed professional literature. Yes, having our research published in a research journal is an important outcome. It is one of many tangible outcomes of our candidature, besides a hard-bound thesis thick enough to be used as a paperweight and chock-full of jargon that frightens away everyone except the brave souls of our professional society. But what about intangible outcomes? In this post, I start exploring some common intangible outcomes of a PhD education, drawing on the phrase “original contribution to scholarship” as a springboard for discussion. Think of this post as the first installment in a series of SOC posts: scholarship, originality, contribution, though not necessarily discussed in that order.

Prior art

First, I deal with the issue of originality. Originality can be described as conceiving of or doing something that no one has conceived of or done before. With a vast body of research literature out there, how are we to know that what we’re doing is original research? How do we know we’re not duplicating someone else’s published work? Short answer: it’s almost impossible to know for certain. An apocryphal anecdote has John von Neumann coming up with solutions to two problems, only to learn on both occasions that someone had already solved them. Another example is Gauss’ work. Gauss didn’t publish much of his mathematical research, and in many cases later works were found to be rediscoveries of his unpublished results. Or consider the calculus war [1], a priority dispute over whether Leibniz or Newton was the original inventor of calculus. Our best guards against duplicating someone else’s work are to be aware of the general research trends in our chosen area and to have a clear statement of our research questions. These two topics will be discussed in turn.

To know what’s happening in our area, we should start off by reading some research papers that we think are directly relevant to our project. A good starting point is expository papers, Wikipedia articles, or other publications providing high-level introductions to certain topics. The objective is to get a lay person’s overview of our research area before delving into research papers. Survey papers generally provide technical overviews of particular research areas, with detailed analyses and discussions of salient theories, trends, topics, and techniques. A survey paper’s purpose is to be a guide to the vast body of research publications on a particular topic, often citing hundreds of relevant publications spread over dozens of journals and books. More often than not, a survey paper should be read as a practitioner’s guide to recent research: don’t expect every concept or technique to be clearly explained, or explained at all. A good strategy for reading survey papers is to have ready access to relevant textbooks and reference materials. In fact, we should find out whether there are any textbooks or reference books relevant to our research topic and read them as well. In short, reading survey papers, expository publications, and textbooks should equip us with a bird’s-eye view of previous work in our area.

Once we have a general (or vague) idea of what has been published in our research area, we should start writing a literature review. This is similar to writing a survey of previously published work in our area, with detailed discussions of relevant theories, trends, topics, and techniques. It is by writing a literature review that we acquaint ourselves with relevant technical details, and with the successes and limitations of published techniques. Think of the literature review as a survey paper based on a sample of the publications most relevant to our research questions. As we write our literature review, we should also maintain a list of research questions together with high-level strategies for investigating them. There’s never enough time to write a thorough and comprehensive literature review; the best we can do is add to it incrementally as we progress through our research project. When the time comes to write our research proposal, we will have our literature review and a list of research ideas from which to draw inspiration and material.

Originality in conduct

Thus far, I have discussed how to start doing original research by first being aware of prior research. I now turn to originality in terms of how we conduct or carry out our project. Cryer [2, pp.193–196] identified a number of broad ways in which a project can be seen as original. (The author also identified originality in terms of a project’s outcomes, but I won’t discuss that point here.) A project can be original in terms of: (1) tools, techniques, procedures, and methods; (2) exploring an unknown; (3) exploring something unanticipated; or (4) using data in a novel way. These will be discussed in turn.

First, our project could be original in developing new tools, techniques, procedures, or methods, or in applying these to an existing problem or to a context where they have not been tried before. Regardless of the success or failure of such investigations, we will more often than not discover cases where a new approach yields new or insightful results, and situations where the approach sheds no light at all. Often, it is not enough to develop a computer program implementing a new algorithm or technique; we need to show how that program can be used to shed new understanding on a problem.

Second, we could venture into an unexplored area and in the process discover a new field of research. Examples of such investigations into the unknown include Hardin’s “tragedy of the commons” [9], Diffie and Hellman’s development of public-key cryptography [3], Haar’s introduction of wavelets [8], and Euler’s introduction of graph theory [5]. It’s rare to develop a new field from scratch, and even if we do so, it might take years or decades or even longer before anyone realizes the new field’s full potential. We should also be aware that scientific research is a competitive enterprise, with the attendant occasional priority disputes over who did what before whom.

Third, while investigating a problem we might serendipitously come across an unexpected result, phenomenon, or research direction. Examples of fortuitous discoveries include Louis Pasteur’s discovery of the cholera vaccine, Oskar Minkowski’s discovery that diabetes results from a pancreatic disorder, and Alexander Fleming’s discovery of the enzyme lysozyme. In cases such as these, it is worth investigating the unexpected, but in other cases we should be cautious about pursuing something that might lead down a dead end. Exploring the unanticipated can also be seen in research that lends a new perspective to an existing field, such as applying the notion of the tragedy of the commons to information and communications technology [7]. Much fruitful research under the rubric of exploring the unanticipated also consists in applying existing tools, techniques, or methods to problems that have already enjoyed much attention. An example is the application of graph theory to analyzing social, technological, biological, and other types of networks [4].

Fourth, an existing dataset could be explored in a novel way. Among other things, this includes interpreting the dataset differently from how it has been interpreted in the existing research literature, or applying new techniques to process it. The task of collating a new dataset is itself an original research project, often involving a team of researchers. At every step of the project, we need to adhere to established data-collection protocols, and once the dataset is assembled, there are protocols specifying how to cleanse it, how to code the data, and so on. Examples of original research whose primary goals included the collation of a new dataset are the Human Genome Project, the Facebook dataset project [10], the C. elegans nervous system project [12], and the modENCODE Project [6].

References

[1] J. Bardi. The Calculus War: Newton, Leibniz and the Greatest Mathematical Clash of All Time. High Stakes Publishing, 2007.

[2] P. Cryer. The Research Student’s Guide to Success. Open University Press, 3rd edition, 2006.

[3] W. Diffie and M. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644–654, 1976.

[4] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, 2010.

[5] L. Euler. Solutio problematis ad geometriam situs pertinentis. Commentarii Academiae Scientiarum Imperialis Petropolitanae, 8:128–140, 1736.

[6] M. B. Gerstein, Z. J. Lu, E. L. V. Nostrand, C. Cheng, B. I. Arshinoff, T. Liu, K. Y. Yip, R. Robilotto, A. Rechtsteiner, K. Ikegami, P. Alves, A. Chateigner, M. Perry, M. Morris, R. K. Auerbach, X. Feng, J. Leng, A. Vielle, W. Niu, K. Rhrissorrakrai, A. Agarwal, R. P. Alexander, G. Barber, C. M. Brdlik, J. Brennan, J. J. Brouillet, A. Carr, M.-S. Cheung, H. Clawson, S. Contrino, L. O. Dannenberg, A. F. Dernburg, A. Desai, L. Dick, A. C. Dose, J. Du, T. Egelhofer, S. Ercan, G. Euskirchen, B. Ewing, E. A. Feingold, R. Gassmann, P. J. Good, P. Green, F. Gullier, M. Gutwein, M. S. Guyer, L. Habegger, T. Han, J. G. Henikoff, S. R. Henz, A. Hinrichs, H. Holster, T. Hyman, A. L. Iniguez, J. Janette, M. Jensen, M. Kato, W. J. Kent, E. Kephart, V. Khivansara, E. Khurana, J. K. Kim, P. Kolasinska-Zwierz, E. C. Lai, I. Latorre, A. Leahey, S. Lewis, P. Lloyd, L. Lochovsky, R. F. Lowdon, Y. Lubling, R. Lyne, M. MacCoss, S. D. Mackowiak, M. Mangone, S. McKay, D. Mecenas, G. Merrihew, D. M. Miller III, A. Muroyama, J. I. Murray, S.-L. Ooi, H. Pham, T. Phippen, E. A. Preston, N. Rajewsky, G. Ratsch, H. Rosenbaum, J. Rozowsky, K. Rutherford, P. Ruzanov, M. Sarov, R. Sasidharan, A. Sboner, P. Scheid, E. Segal, H. Shin, C. Shou, F. J. Slack, C. Slightam, R. Smith, W. C. Spencer, E. O. Stinson, S. Taing, T. Takasaki, D. Vafeados, K. Voronina, G. Wang, N. L. Washington, C. M. Whittle, B. Wu, K.-K. Yan, G. Zeller, Z. Zha, M. Zhong, X. Zhou, modENCODE Consortium, J. Ahringer, S. Strome, K. C. Gunsalus, G. Micklem, X. S. Liu, V. Reinke, S. K. Kim, L. W. Hillier, S. Henikoff, F. Piano, M. Snyder, L. Stein, J. D. Lieb, and R. H. Waterston. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE Project. Science, 330(6012):1775–1787, 2010.

[7] G. M. Greco and L. Floridi. The tragedy of the digital commons. Ethics and Information Technology, 6(2):73–81, 2004.

[8] A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 71(1):38–53, 1911.

[9] G. Hardin. The tragedy of the commons. Science, 162(3859):1243–1248, 1968.

[10] K. Lewis, J. Kaufman, M. Gonzalez, A. Wimmer, and N. Christakis. Tastes, ties, and time: A new social network dataset using Facebook.com. Social Networks, 30(4):330–342, 2008.

[11] A. D. Sertillanges. The Intellectual Life: Its Spirits, Conditions and Methods. Mercier Press, 1978. Translated by Mary Ryan.

[12] J. G. White, E. Southgate, J. N. Thompson, and S. Brenner. The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society B, 314(1165):1–340, 1986.

2011 Spies Prize: Robert Bradshaw

18 July 2011 Leave a comment

This year, Robert Bradshaw is the winner of the Annual Spies Sage Development Prize. Congratulations, Robert! Here is the prize citation:

Robert Bradshaw has been an extremely active and productive Sage developer for over five years. Additionally, he has been a leader, both in maintaining the community and in important design decisions.

He is probably best known for his work on Cython, which is critical for the performance of many key parts of Sage, and his work designing and implementing the coercion model, which makes many powerful mathematical constructions possible. However, his interests and significant contributions are wide-ranging, including: exact linear algebra, arithmetic of elliptic curves, L-functions, 3-D plotting and parallel building. A recent project is the patchbot tool, which automates testing contributions posted on trac. Moreover, he is an important contributor to trouble-shooting and design discussions in the sage-devel forum and is also the third most numerous poster of all time in the sage-support forum.

For his many important technical contributions, and his long-time and continuing involvement in the Sage community, Robert Bradshaw is awarded the 2011 Spies Sage Development Prize. This award carries a prize of $500 from the Sage Foundation (thanks to Jaap Spies).