Scalable or online out-of-core multi-label classifiers
I have been struggling with this problem for the past 2-3 weeks.
I have a multi-label (not multi-class) problem where each sample can
belong to several of the labels.
I have around 4.5 million text documents as training data and around 1
million as test data. There are around 35K labels.
I am using scikit-learn. For feature extraction I was previously using
TfidfVectorizer, which didn't scale at all. Now I am using
HashingVectorizer, which is better but still not scalable enough for the
number of documents that I have.
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                         stop_words='english', n_features=(2 ** 10))
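To get a feel for what n_features=(2 ** 10) means, the sketch below hashes a synthetic vocabulary into 1,024 columns and counts how many columns actually receive a token. The synthetic tokens and the comparison width of 2 ** 18 are assumptions chosen for illustration only; they are not tuned for the data described here.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Synthetic vocabulary of 5,000 distinct tokens (illustrative only).
tokens = ["tok%d" % i for i in range(5000)]
# Group the tokens into 100 short "documents" of 50 tokens each.
docs = [" ".join(tokens[i:i + 50]) for i in range(0, len(tokens), 50)]


def used_columns(n_features):
    """Return (matrix width, number of columns holding a non-zero value)."""
    vect = HashingVectorizer(analyzer='word', n_features=n_features)
    X = vect.transform(docs)
    return X.shape[1], len(set(X.nonzero()[1]))


small_width, small_used = used_columns(2 ** 10)
large_width, large_used = used_columns(2 ** 18)

print("2**10 columns:", small_width, "used:", small_used)
print("2**18 columns:", large_width, "used:", large_used)
```

With only 1,024 columns, 5,000 distinct tokens cannot avoid sharing columns, so many unrelated words end up in the same feature. A wider hash space reduces such collisions at the cost of a wider (but still sparse) matrix.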
scikit-learn provides a OneVsRestClassifier into which I can feed any
estimator. For multi-label problems I found that only LinearSVC and
SGDClassifier work correctly. According to my benchmarks, SGD outperforms
LinearSVC in both memory and time. So I have something like this:
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2',
                                        n_jobs=-1), n_jobs=-1)
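For reference, a minimal end-to-end sketch of this setup on toy data. The documents, the tags, and the use of MultiLabelBinarizer to build the label matrix are assumptions for illustration. The sketch uses SGDClassifier's default hinge loss rather than the logistic loss, because the name of the logistic loss option differs between scikit-learn versions.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus: each document may carry several tags (multi-label).
docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "java spring dependency injection",
    "java hibernate sql query",
    "sql join query index",
    "python sql sqlalchemy query",
]
tags = [["python"], ["python"], ["java"],
        ["java", "sql"], ["sql"], ["python", "sql"]]

# Stateless hashing: no fit step is needed before transform.
vect = HashingVectorizer(analyzer='word', n_features=(2 ** 10))
X = vect.transform(docs)

# Turn the tag lists into a binary indicator matrix (n_samples x n_labels).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# One binary SGD classifier is trained per label.
clf = OneVsRestClassifier(SGDClassifier(penalty='l2', random_state=0))
clf.fit(X, Y)

pred = clf.predict(vect.transform(["python sql query"]))
print(list(mlb.classes_), pred)
```

Each row of the prediction is an indicator vector with one column per label, in the order given by mlb.classes_.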
But this suffers from some serious issues:
1. OneVsRest does not have a partial_fit method, which makes out-of-core
learning impossible. Are there any alternatives for that?
2. HashingVectorizer and TfidfVectorizer both work on a single core and
don't have an n_jobs parameter, so hashing the documents takes too much
time. Any alternatives or suggestions? Also, is the value of n_features
correct?
3. I tested on 1 million documents. The hashing takes 15 minutes, and when
it comes to clf.fit(X, y) I receive a MemoryError, because OvR internally
uses LabelBinarizer and tries to allocate a matrix of dimensions (samples x
classes), which is practically impossible to allocate. What should I do?
4. Are there any other libraries out there that have reliable and scalable
multi-label algorithms? I know of gensim and Mahout, but neither of them
seems to have anything for multi-label situations.
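One possible workaround for the missing partial_fit on OneVsRest is to run the one-vs-rest loop by hand: keep one SGDClassifier per label and call its own partial_fit on every batch. The sketch below assumes the labels are available as a binary indicator matrix and uses a toy in-memory list in place of a real document stream; the helper name iter_batches is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy stand-in for a stream of documents and their label indicator rows.
docs = [
    "python pandas dataframe", "python numpy array",
    "sql join query", "sql index query",
    "python sql sqlalchemy", "python list comprehension",
    "sql group by", "python sql cursor",
]
# Columns: [python, sql]; a row may have several 1s (multi-label).
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1],
              [1, 1], [1, 0], [0, 1], [1, 1]])


def iter_batches(docs, Y, batch_size):
    """Yield consecutive (documents, label rows) mini-batches."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], Y[start:start + batch_size]


vect = HashingVectorizer(analyzer='word', n_features=(2 ** 10))
n_labels = Y.shape[1]
# One independent binary model per label (manual one-vs-rest).
models = [SGDClassifier(penalty='l2', random_state=0)
          for _ in range(n_labels)]
both_classes = np.array([0, 1])

for batch_docs, batch_Y in iter_batches(docs, Y, batch_size=4):
    # Hashing is stateless, so each batch can be transformed on its own.
    X_batch = vect.transform(batch_docs)
    for j, model in enumerate(models):
        # classes must be passed so a batch with only 0s or only 1s works.
        model.partial_fit(X_batch, batch_Y[:, j], classes=both_classes)

X_new = vect.transform(["python sql query"])
pred = np.column_stack([m.predict(X_new) for m in models])
print(pred)
```

Only one batch of features is held in memory at a time. With tens of thousands of labels the inner loop becomes the dominant cost, so this is a starting point rather than a tuned solution.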
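Because HashingVectorizer keeps no fitted state, the single-core hashing step can in principle be split across processes: hash chunks of documents independently and stack the results. The sketch below assumes joblib is installed as a standalone package; the function name hash_in_parallel and the chunk size are hypothetical.

```python
from joblib import Parallel, delayed
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer


def hash_in_parallel(docs, n_features=2 ** 10, chunk_size=1000, n_jobs=2):
    """Hash chunks of documents in worker processes and stack the rows."""
    vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                             stop_words='english', n_features=n_features)
    chunks = [docs[i:i + chunk_size]
              for i in range(0, len(docs), chunk_size)]
    # The vectorizer is stateless, so every worker produces rows in the
    # same feature space and the chunks can be stacked in order.
    parts = Parallel(n_jobs=n_jobs)(
        delayed(vect.transform)(chunk) for chunk in chunks)
    return vstack(parts).tocsr()


docs = ["document number %d about hashing text" % i for i in range(10)]
```

Calling hash_in_parallel(docs, chunk_size=3, n_jobs=2) should give the same matrix as a single vect.transform(docs) call. Whether it is actually faster depends on how large the chunks are relative to the cost of shipping them to the workers.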
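The MemoryError is easier to understand with some rough arithmetic on the label matrix. The sketch below assumes a dense matrix with 8-byte or 1-byte cells, and, for the sparse comparison, a hypothetical average of 3 labels per document stored in CSR layout with 1-byte values and 4-byte indices.

```python
n_samples = 1000000   # documents in the test run
n_labels = 35000      # approximate number of labels


def dense_gib(rows, cols, bytes_per_cell):
    """Size of a dense rows x cols matrix in GiB."""
    return rows * cols * bytes_per_cell / float(2 ** 30)


dense_8byte = dense_gib(n_samples, n_labels, 8)  # e.g. int64 or float64
dense_1byte = dense_gib(n_samples, n_labels, 1)  # e.g. int8

# Sparse CSR estimate: one value and one column index per non-zero entry,
# plus one row pointer per row (and one extra).
avg_labels_per_doc = 3  # hypothetical, not taken from the data
nnz = n_samples * avg_labels_per_doc
sparse_mib = (nnz * (1 + 4) + (n_samples + 1) * 4) / float(2 ** 20)

print("dense, 8-byte cells: %.1f GiB" % dense_8byte)
print("dense, 1-byte cells: %.1f GiB" % dense_1byte)
print("sparse CSR estimate: %.1f MiB" % sparse_mib)
```

Even with 1-byte cells the dense matrix runs to tens of GiB, while a sparse indicator matrix of the same shape would be several orders of magnitude smaller under these assumptions.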