I need to classify some data with (I hope) a nearest-neighbour algorithm. I’ve googled this problem and found a lot of libraries (including PyML, mlPy and Orange), but I’m unsure of where to start here.

How should I go about implementing k-NN using Python?

Particularly given the technique (k-Nearest Neighbors) that you mentioned in your Q, i would strongly recommend scikits.learn (since renamed scikit-learn and imported as sklearn, which is why the code further down uses that name). [Note: after this Answer was posted, the lead developer of this Project informed me of a new homepage for this Project.]

A few features that i believe distinguish this library from the others (at least the other Python ML libraries that i have used, which is most of them):

  • an extensive diagnostics & testing library (including plotting
    modules, via Matplotlib)–includes feature-selection algorithms,
    confusion matrix, ROC, precision-recall, etc.;

  • a nice selection of ‘batteries-included’ data sets (including
    handwriting digits, facial images, etc.) particularly suited for ML techniques;

  • extensive documentation (a nice surprise given that this Project is
    only about two years old) including tutorials and step-by-step
    example code (which uses the supplied data sets).
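To make the first two points concrete, here is a minimal sketch that pairs one of the bundled data sets (the handwritten digits) with one of the diagnostics (a confusion matrix). It assumes a current scikit-learn install, so the module and class names are today's and not necessarily those of the scikits.learn release discussed here; the split size, random seed and k=5 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled handwritten-digits data set: 8x8 images flattened to 64 features
digits = load_digits()

# Hold out a quarter of the samples for testing (arbitrary split, fixed seed)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Train a k-NN classifier (k=5 is an illustrative choice) and predict
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are the true digits, columns are the predicted digits
cm = confusion_matrix(y_test, y_pred, labels=digits.target_names)
accuracy = cm.trace() / cm.sum()  # correct predictions sit on the diagonal

print(cm)
print("accuracy:", accuracy)
```

The off-diagonal cells show which digits get mistaken for which, which is usually more informative than the single accuracy figure.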

Without exception (at least that i can think of at this moment), the Python ML libraries are superb. (See the PyMVPA homepage for a list of the dozen or so most popular Python ML libraries.)

In the past 12 months, for instance, i have used ffnet (for MLP), neurolab (also for MLP), PyBrain (Q-Learning), and PyMVPA (SVM), all available from the Python Package Index; these vary significantly from each other w/r/t maturity, scope, and supplied infrastructure, but i found them all to be of very high quality.

Still, the best of these might be scikits.learn; for instance, i am not aware of any Python ML library, other than scikits.learn, that includes any of the three features i mentioned above (though a few have solid example code and/or tutorials, none that i know of integrates these with a library of research-grade data sets and diagnostic algorithms).

Second, given the technique you intend to use (k-nearest neighbor), scikits.learn is a particularly good choice. Scikits.learn includes kNN algorithms for both regression (returns a score) and classification (returns a class label), as well as detailed sample code for each.
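As a rough illustration of the difference between the two, the sketch below fits a classifier and a regressor on the same toy points. The data are invented for the example, and the class names (KNeighborsClassifier, KNeighborsRegressor) are those of current scikit-learn, which may differ from the release described in this answer.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy 1-D data, invented for illustration: two well-separated clusters
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])                    # class label per point
values = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])     # numeric target per point

clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, values)

query = np.array([[1.5]])

# Classification: majority vote among the 3 nearest neighbours
predicted_label = clf.predict(query)[0]

# Regression: mean of the 3 nearest neighbours' target values
predicted_value = reg.predict(query)[0]

print(predicted_label, predicted_value)
```

The three nearest neighbours of 1.5 are the points at 0, 1 and 2, so the classifier votes for class 0 and the regressor returns their mean target, 1.0.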

Using the scikits.learn k-nearest neighbor module (literally) couldn’t be any easier:

>>> # import NumPy and the relevant scikits.learn module
>>> import numpy as NP
>>> from sklearn import neighbors as kNN

>>> # load one of the sklearn-supplied data sets
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> # the call to load_iris() loaded both the data and the class labels, so
>>> # bind each to its own variable
>>> data = iris.data
>>> class_labels = iris.target

>>> # construct a classifier-builder by instantiating the kNN module's primary class
>>> # (called NeighborsClassifier in early scikits.learn releases, KNeighborsClassifier since)
>>> kNN1 = kNN.KNeighborsClassifier()

>>> # now construct ('train') the classifier by passing the data and class labels
>>> # to the classifier-builder
>>> kNN1.fit(data, class_labels)
      KNeighborsClassifier()
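Once the classifier is fitted, classifying a new point is one more call. The following is a self-contained sketch under the assumption of a current scikit-learn install; the sample measurement is made up for illustration.

```python
from sklearn import datasets, neighbors

# Same setup as above: iris data and a default (k=5) k-NN classifier
iris = datasets.load_iris()
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(iris.data, iris.target)

# A made-up flower: sepal length, sepal width, petal length, petal width (cm)
sample = [[5.0, 3.4, 1.5, 0.2]]

# predict() returns the integer class label; map it back to a species name
predicted = knn.predict(sample)
species = iris.target_names[predicted[0]]

# kneighbors() exposes the neighbours the vote was based on
distances, indices = knn.kneighbors(sample)

print(species)
print(indices)
```

Inspecting `indices` and `distances` is a quick sanity check that the neighbours being consulted are the ones you would expect.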

What’s more, unlike nearly all other ML techniques, the crux of k-nearest neighbors is not coding a working classifier builder. Rather, the difficult step in building a production-grade k-nearest neighbor classifier/regressor is the persistence layer, i.e., storage and fast retrieval of the data points from which the nearest neighbors are selected. For the kNN data storage layer, scikits.learn includes an algorithm for a ball tree (which i know almost nothing about, other than that it is apparently superior to the kd-tree, the traditional data structure for k-NN, because its performance holds up better in higher-dimensional feature spaces).
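For the curious, both tree structures can also be used directly. This is only a sketch, assuming the BallTree and KDTree classes of current scikit-learn; the random data, the 20 dimensions and the leaf size are arbitrary, and nothing here measures which tree is faster.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

# Random points in a 20-dimensional space (invented data, fixed seed)
rng = np.random.RandomState(0)
points = rng.rand(1000, 20)

# Build both index structures over the same points
ball = BallTree(points, leaf_size=30)
kd = KDTree(points, leaf_size=30)

# Ask each tree for the 3 nearest neighbours of the first point
query = points[:1]
ball_dist, ball_idx = ball.query(query, k=3)
kd_dist, kd_idx = kd.query(query, k=3)

# Both searches are exact, so they should agree; only the speed differs
print(ball_idx, kd_idx)
```

In everyday use you rarely touch these classes: passing `algorithm='ball_tree'` or `algorithm='kd_tree'` to the classifier selects the structure, and the default `'auto'` lets the library choose.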

Additionally, k-nearest neighbors requires an appropriate similarity metric (Euclidean distance is the usual choice, though not always the best one). Scikits.learn includes a stand-alone module comprising various distance metrics, as well as testing algorithms for selecting the appropriate one.
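One simple way to compare metrics is to cross-validate the same classifier under each of them. The sketch below does that on the iris data; using cross-validation for this is an assumption on my part and not necessarily the selection tooling referred to above, and the `metric` parameter and metric names are those of current scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Mean 5-fold cross-validated accuracy for each candidate distance metric
scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores[metric] = cross_val_score(clf, iris.data, iris.target, cv=5).mean()

# Report the metrics from best to worst
for metric, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(metric, round(score, 3))
```

On a small, clean data set like iris the differences tend to be slight; the choice usually matters more when features have very different scales or many dimensions.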

Finally, there are a few libraries that i have not mentioned, either because they are out of scope (PyML, Bayesian), because they are not primarily ‘libraries’ for developers but rather applications for end users (e.g., Orange), or because they have unusual or difficult-to-install dependencies (e.g., mlpy, which requires the GSL, which in turn must be built from source), at least on my OS, which is Mac OS X.

(Note: i am not a developer/committer for scikits.learn.)