Mixing categorical and continuous data in a Naive Bayes classifier using scikit-learn

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst other approaches, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications", etc.) and continuous data (e.g. "Age", "Length of membership", etc.). I haven't used scikit-learn much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!


You have at least two options:

  • Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on that categorical representation of your data (see the first sketch after this list).

  • Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features (see the second sketch after this list).
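
A minimal sketch of the first option, with made-up data and a hypothetical bin_by_percentiles helper (assuming the categorical columns are already integer-coded):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Made-up data: columns 0-1 are categorical codes, columns 2-3 are
    # continuous (e.g. "Age", "Length of membership").
    X = np.array([[0, 1, 23.0, 1.5],
                  [1, 0, 54.0, 7.2],
                  [0, 0, 31.0, 0.4],
                  [1, 1, 46.0, 3.8]])
    y = np.array([0, 1, 0, 1])

    def bin_by_percentiles(col, n_bins=5):
        """Map a continuous column to integer bins of roughly equal population."""
        edges = np.percentile(col, np.linspace(0, 100, n_bins + 1)[1:-1])
        return np.digitize(col, edges)

    X_binned = X.copy()
    for j in (2, 3):                      # the continuous columns
        X_binned[:, j] = bin_by_percentiles(X[:, j])

    # Fit a single multinomial NB on the fully categorical representation.
    clf = MultinomialNB()
    clf.fit(X_binned, y)
    print(clf.predict(X_binned))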
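
And a sketch of the second option, again with made-up data; in practice you would want to generate the predict_proba features with cross-validation rather than on the training set itself, to limit overfitting in the second stage:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB, MultinomialNB

    # Made-up data: X_cat holds the integer-coded categorical columns,
    # X_cont the continuous ones.
    X_cat = np.array([[0, 1], [1, 0], [0, 0], [1, 1]])
    X_cont = np.array([[23.0, 1.5], [54.0, 7.2], [31.0, 0.4], [46.0, 3.8]])
    y = np.array([0, 1, 0, 1])

    cat_nb = MultinomialNB().fit(X_cat, y)
    cont_nb = GaussianNB().fit(X_cont, y)

    # Class assignment probabilities from each model become the new features.
    X_new = np.hstack((cat_nb.predict_proba(X_cat),
                       cont_nb.predict_proba(X_cont)))

    # Refit a new model on the stacked probabilities.
    stacked = GaussianNB().fit(X_new, y)
    print(stacked.predict(X_new))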

The simple answer: multiply the results! It's the same.

Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features, meaning you calculate the Bayes probability conditional on a specific feature without regard to the others. This means that the algorithm multiplies the probability from one feature with the probability from the next feature (and we totally ignore the denominator, since it is just a normalizer).

So the right answer is:

  1. Calculate the probability from the categorical variables.
  2. Calculate the probability from the continuous variables.
  3. Multiply 1. and 2. (see the sketch below).
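
A minimal sketch of those three steps with scikit-learn, on made-up data; note that the products are unnormalised scores, so take the argmax over the classes rather than reading them off as probabilities:

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, GaussianNB

    # Made-up data: two binary categorical features and two continuous ones.
    X_cat = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
    X_cont = np.array([[25.0, 2.0], [48.0, 6.5], [33.0, 1.1], [51.0, 8.0]])
    y = np.array([0, 1, 0, 1])

    cat_nb = BernoulliNB().fit(X_cat, y)    # 1. P(class | categorical part)
    cont_nb = GaussianNB().fit(X_cont, y)   # 2. P(class | continuous part)

    # 3. Multiply the per-class probabilities and pick the largest product.
    scores = cat_nb.predict_proba(X_cat) * cont_nb.predict_proba(X_cont)
    print(cat_nb.classes_[np.argmax(scores, axis=1)])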

Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.

https://github.com/remykarem/mixed-naive-bayes

The library is written such that the APIs are similar to scikit-learn's.

In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. When constructing the classifier, just specify categorical_features=[0,1], indicating that columns 0 and 1 are to follow a categorical distribution.

    from mixed_naive_bayes import MixedNB

    X = [[0, 0, 180.9, 75.0],
         [1, 1, 165.2, 61.5],
         [2, 1, 166.3, 60.3],
         [1, 1, 173.0, 68.2],
         [0, 2, 178.4, 71.0]]
    y = [0, 0, 1, 1, 0]

    clf = MixedNB(categorical_features=[0,1])
    clf.fit(X, y)
    clf.predict(X)

Pip installable via pip install mixed-naive-bayes. More information on usage can be found in the README.md file. Pull requests are greatly appreciated :)

@Yaron's approach needs an extra step (4. below):

  1. Calculate the probability from the categorical variables.
  2. Calculate the probability from the continuous variables.
  3. Multiply 1. and 2. AND
  4. Divide 3. by the sum of the product of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence sum to 1.

Step 4. is the normalization step. Take a look at @remykarem's mixed-naive-bayes as an example (lines 268-278):

    if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
        finals = t * p * self.priors
    elif self.gaussian_features.size != 0:
        finals = t * self.priors
    elif self.categorical_features.size != 0:
        finals = p * self.priors

    normalised = finals.T/(np.sum(finals, axis=1) + 1e-6)
    normalised = np.moveaxis(normalised, [0, 1], [1, 0])

    return normalised

The probabilities from the Gaussian and categorical models (t and p respectively) are multiplied together on line 269 (line 2 in the extract above) and then normalized as in step 4. on line 275 (fourth line from the bottom in the extract above).

For hybrid features, you can check this implementation.

The author has presented a mathematical justification in his Quora answer, which you might want to check.

You will need the following steps:

  1. Calculate the probability from the categorical variables (using predict_proba method from BernoulliNB)
  2. Calculate the probability from the continuous variables (using predict_proba method from GaussianNB)
  3. Multiply 1. and 2. AND
  4. Divide 3. by the prior (either from BernoulliNB or from GaussianNB, since they are the same) AND THEN
  5. Divide 4. by the sum (over the classes) of 4. This is the normalisation step (see the sketch below).
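
A sketch of those five steps, assuming the categorical features are binary so that BernoulliNB applies (the data is made up for illustration):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, GaussianNB

    # Made-up data: binary categorical features plus continuous features.
    X_cat = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
    X_cont = np.array([[25.0, 2.0], [48.0, 6.5], [33.0, 1.1], [51.0, 8.0]])
    y = np.array([0, 1, 0, 1])

    cat_nb = BernoulliNB().fit(X_cat, y)
    cont_nb = GaussianNB().fit(X_cont, y)

    p_cat = cat_nb.predict_proba(X_cat)      # 1. from the categorical variables
    p_cont = cont_nb.predict_proba(X_cont)   # 2. from the continuous variables
    prior = np.exp(cat_nb.class_log_prior_)  # same prior as cont_nb.class_prior_

    posterior = p_cat * p_cont               # 3. multiply
    posterior = posterior / prior            # 4. the prior was counted twice
    posterior = posterior / posterior.sum(axis=1, keepdims=True)  # 5. normalise

    print(cat_nb.classes_[np.argmax(posterior, axis=1)])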

It should be easy enough to see how you can add your own prior instead of using those learned from the data.