What is the difference between Pipeline and make_pipeline in scikit-learn?

I got this from the sklearn webpage:

  • Pipeline: Pipeline of transforms with a final estimator

  • make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.

But I still do not understand when I have to use each one. Can anyone give me an example?


The only difference is that make_pipeline generates names for steps automatically.

Step names are needed, for example, when you use a pipeline with model selection utilities such as GridSearchCV. With grid search you specify parameters for individual steps of the pipeline as <step name>__<parameter name>:

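from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]  # 'clf' is the name chosen above
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X: raw text documents, y: their labels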

Compare it with make_pipeline:

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]  # auto-generated step name
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
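
If you are not sure which keys are valid in param_grid for a make_pipeline pipeline, you can ask the pipeline itself. A minimal sketch (the variable names step_names and c_keys are just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), LogisticRegression())

# named_steps maps each step name to its estimator.
step_names = list(pipe.named_steps)
print(step_names)

# get_params() lists every tunable parameter as "<step name>__<parameter>";
# here we keep only the keys that refer to a parameter called C.
c_keys = [key for key in pipe.get_params() if key.endswith("__C")]
print(c_keys)
```

Any key returned by get_params() can be used in the grid passed to GridSearchCV.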

So, with Pipeline:

  • names are explicit, you don't have to figure them out if you need them;
  • the name doesn't change if you swap the estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C, because the step is still called clf and both estimators have a C parameter.
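
To illustrate the second point, here is a rough sketch that reuses one param_grid for two different classifiers. The toy documents and labels are made up for this example, and cv=2 is chosen only because the dataset is tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for this illustration only.
docs = ["good movie", "great film", "nice plot", "loved it",
        "bad movie", "awful film", "boring plot", "hated it"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The same grid works for both classifiers because the step is named 'clf'
# in both pipelines and both estimators expose a C parameter.
param_grid = [{"clf__C": [1, 10, 100, 1000]}]

best = {}
for clf in (LogisticRegression(), LinearSVC()):
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", clf)])
    gs = GridSearchCV(pipe, param_grid, cv=2)
    gs.fit(docs, labels)
    best[type(clf).__name__] = gs.best_params_

print(best)
```

With make_pipeline the key would change from logisticregression__C to linearsvc__C when you swap the classifier, so the grid would have to be edited as well.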

make_pipeline:

  • shorter and arguably more readable notation;
  • names are auto-generated using a straightforward rule: the lowercased class name of each estimator, with a numeric suffix (-1, -2, ...) added when the same class appears more than once.
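
A quick way to see the naming rule in action, including what happens when the same class is used twice. This is only a sketch; the duplicated StandardScaler makes no practical sense and is there purely to show the suffixes:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two steps of the same class, followed by a classifier.
pipe = make_pipeline(StandardScaler(), StandardScaler(), LogisticRegression())

# Each entry in pipe.steps is a (name, estimator) pair.
names = [name for name, _ in pipe.steps]
print(names)  # ['standardscaler-1', 'standardscaler-2', 'logisticregression']
```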

When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.