Is there a way to predict how long it will take to run a classifier from scikit-learn based on the parameters and dataset? I know, pretty meta, right?

Some classifier/parameter combinations are quite fast, and some take so long that I eventually just kill the process. I’d like a way to estimate in advance how long a run will take.

Alternatively, I’d accept some pointers on how to set common parameters to reduce the run time.

Certain classifiers and regressors directly report the remaining time or the progress of the algorithm (number of iterations, etc.). In most cases you can turn this on by passing verbose=2 (or any number greater than 1) to the constructor of the individual model. Note: this behavior is as of sklearn 0.14. Earlier versions have slightly different verbose output (still useful, though).

The best examples of this are ensemble.RandomForestClassifier and ensemble.GradientBoostingClassifier, which print the number of trees built so far and, in the case of gradient boosting, the remaining time.

clf = ensemble.GradientBoostingClassifier(verbose=3)
clf.fit(X, y)
   Iter       Train Loss   Remaining Time
     1           0.0769            0.10s


clf = ensemble.RandomForestClassifier(verbose=3)
clf.fit(X, y)
  building tree 1 of 100

This progress information is fairly useful for estimating the total time.

Then there are other models like SVMs that print the number of optimization iterations completed, but do not directly report the remaining time.

clf = svm.SVC(verbose=2)
clf.fit(X, y)
    optimization finished, #iter = 1
    obj = -1.802585, rho = 0.000000
    nSV = 2, nBSV = 2

Models like linear models don’t provide such diagnostic information as far as I know.
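
That said, some linear estimators do accept a verbose argument. A minimal sketch, assuming a recent scikit-learn and a synthetic dataset from make_classification (both are illustrative choices, not from the setup above): SGDClassifier with verbose=1 prints one progress block per epoch, including the total training time so far, but no remaining-time estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# tol=None disables the convergence check, so exactly max_iter epochs run.
# verbose=1 prints per-epoch diagnostics to stdout.
clf = SGDClassifier(max_iter=5, tol=None, verbose=1, random_state=0)
clf.fit(X, y)

print("epochs run:", clf.n_iter_)
```

Since max_iter is known up front, the per-epoch timing is enough to guess the total roughly.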

See this thread for more on what the verbosity levels mean: scikit-learn fit remaining time

If you are using IPython, consider using the built-in magic commands %time and %timeit:

%time – Time execution of a Python statement or expression. The CPU and wall clock times are printed, and the value of the expression (if any) is returned. Note that under Win32, system time is always reported as 0, since it can not be measured.

%timeit – Time execution of a Python statement or expression using the timeit module.


In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
1 loops, best of 3: 1.7 s per loop
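
If the full fit is too slow to time directly, one option is to time fits on growing subsamples and extrapolate. A minimal sketch, assuming fit time follows roughly a power law in the number of rows (an assumption, not something scikit-learn guarantees); time_fit and extrapolate_fit_time are hypothetical helper names, not part of any library, and the SVC on synthetic data is only an example:

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC


def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


def extrapolate_fit_time(make_model, X, y, sizes, target_size):
    """Time fits on the first n rows for each n in sizes, then extrapolate.

    Assumes t ~ a * n**b (a power law), which is only a rough guide.
    Returns (estimated seconds at target_size, fitted exponent b).
    """
    times = [time_fit(make_model(), X[:n], y[:n]) for n in sizes]
    # Straight-line fit in log-log space: log t = log a + b * log n
    b, log_a = np.polyfit(np.log(sizes), np.log(times), 1)
    return float(np.exp(log_a) * target_size ** b), float(b)


# Synthetic data, used only for illustration (rows are shuffled by default).
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)

estimate, exponent = extrapolate_fit_time(
    lambda: SVC(kernel="rbf"), X, y, sizes=[500, 1000, 2000], target_size=4000
)
print(f"estimated fit time for 4000 rows: {estimate:.2f}s (exponent {exponent:.2f})")
```

Timings this short are noisy, so treat the result as an order-of-magnitude guess and use larger subsamples when you can afford them.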


We’re actually working on a package that gives runtime estimates of scikit-learn fits.

You would basically run it right before calling .fit(X, y) to get a runtime estimate.

Here’s a simple use case:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scitime import Estimator

estimator = Estimator()
rf = RandomForestRegressor()
X, y = np.random.rand(100000, 10), np.random.rand(100000, 1)
# Run the estimation
estimation, lower_bound, upper_bound = estimator.time(rf, X, y)

Feel free to take a look!