Photo by Romain Vignes on Unsplash

This post is part 2 of solving CareerVillage’s Kaggle challenge; however, it also serves as a general-purpose tutorial for the following three things:

  • Finding topics and keywords in texts using LDA
  • Using spaCy’s semantic similarity features to find similarities between texts
  • Using scikit-learn’s DBSCAN clustering algorithm for topic and keyword clustering

Problem Description

This section serves as a short reminder of what we are trying to do. CareerVillage, in essence, is like Stack Overflow or Quora but for career questions. …


A guide to preparing data for model training in Kaggle’s Histopathologic Cancer Detection challenge.

Photo by National Cancer Institute on Unsplash

Kaggle is a wonderful host for data science and machine learning challenges. One of them is the Histopathologic Cancer Detection challenge. In this challenge, we are given a dataset of images and asked to create an algorithm that detects metastatic cancer in them. (The challenge says “algorithm,” not explicitly a machine learning model, so if you are a genius with an alternative way to detect metastatic cancer in images, go for it!)

This article is a guide to preparing Kaggle’s dataset, and it covers the following four things:

  • How to download the dataset…

Abdul Qadir

Data for good. Senior @ Minerva Schools at KGI.
