A few weeks ago a data scientist friend suggested I look at Kaggle.com and tackle some of their machine learning challenges.

Like all first-time users, I began with the Titanic dataset, trying to understand which known facts about the passengers led to a higher chance of survival. You are provided with 891 training cases, for which you are told whether each passenger survived, and have to make predictions for the remaining 418 test cases. Kaggle has its own tutorials on how to get started, and I began by reworking these as Jupyter notebooks and posting them on my GitHub.

I begin by expanding the Kaggle tutorial on how to use the NumPy and csv packages in “1. Titanic - NumPy and CSV packages.ipynb”; this builds a model from “train.csv” and applies it to “test.csv”. I output two sets of predictions: “1a. ModelBasedonGenderAlone.csv”, and then “1b. ModelBasedonGenderClass.csv”.
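The gender-only model boils down to a single rule: predict survival for female passengers and death for male passengers. A minimal sketch of the idea using only the csv package (the rows here are made up for illustration; the real notebook reads “train.csv” and “test.csv”):

```python
import csv
import io

# A tiny in-memory sample standing in for test.csv. The column names
# follow the Kaggle Titanic schema; the rows themselves are invented.
sample = """PassengerId,Pclass,Sex
892,3,male
893,3,female
894,2,male
"""

def predict_gender_model(csv_text):
    """Predict survived (1) for female passengers, did not survive (0) otherwise."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["PassengerId"], 1 if row["Sex"] == "female" else 0)
            for row in reader]

predictions = predict_gender_model(sample)
```

The gender-and-class variant extends the same idea by looking up survival rates per (sex, class) group in the training data rather than using sex alone.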

I then streamline this process with a Pandas DataFrame, using Pandas’s powerful built-in tools, in “2. Titanic - Pandas .ipynb”. I repeat the ideas of my NumPy model to produce “2. genderclassmodel-pandas.csv”.
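In Pandas, the gender/class model collapses into a groupby. A sketch of the approach, with a tiny invented DataFrame standing in for “train.csv” (the real notebook would use `pd.read_csv`):

```python
import pandas as pd

# Made-up rows following the Kaggle Titanic column names.
train = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 3, 3, 1],
    "Survived": [1, 0, 1, 0],
})

# Observed survival rate for each (Sex, Pclass) group.
rates = train.groupby(["Sex", "Pclass"])["Survived"].mean()

# Predict survival for a test passenger when their group's
# observed survival rate exceeds one half.
test = pd.DataFrame({"Sex": ["female", "male"], "Pclass": [3, 1]})
test["Survived"] = [
    int(rates[(s, p)] > 0.5) for s, p in zip(test["Sex"], test["Pclass"])
]
```

The groupby/mean pair replaces the explicit loops over rows needed in the plain-csv version, which is the main thing Pandas buys here.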

I then analyse the data with machine learning in “3. Titanic - Random Forests.ipynb”, using Random Forests from the scikit-learn package to produce “3. FamSizeAgeClassForest.csv”.
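The scikit-learn workflow is the same regardless of the dataset: fit a `RandomForestClassifier` on a feature matrix, then call `predict` on the test features. A sketch with invented data in place of the engineered Titanic columns (family size, age, passenger class), and a toy survival rule so the example is self-contained:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented training data standing in for the engineered Titanic features:
# columns are [family size, age, passenger class].
rng = np.random.default_rng(0)
fam = rng.integers(0, 6, size=100)
age = rng.uniform(1, 70, size=100)
pclass = rng.integers(1, 4, size=100)
X_train = np.column_stack([fam, age, pclass])

# Toy label rule for illustration only: first-class passengers survive.
y_train = (pclass == 1).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# Two hypothetical test passengers: [family size, age, passenger class].
X_test = np.array([[1.0, 30.0, 1.0], [4.0, 22.0, 3.0]])
predictions = forest.predict(X_test)
```

The forest averages many decision trees, each fit on a bootstrap sample of the training rows, which is what makes it more robust than the hand-written lookup-table models above.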

I look forward to going deeper into machine learning using these examples over the coming weeks.