Readme file for: Comparing K-Means, Ridge Regression, Decision Trees and Random Forests for Datasets with Outliers and Missing Features

Quargs Greene, Daniel Gergeus, Shivam Goyal, Dhiraj Simhadri {qgreene, dgergeus, sgoyal15, dhirajs}@bu.edu

Links to project code and original datasets:

Functions implemented:

Python library and package functions (see corresponding documentation for more information):

Code for cleaning and preprocessing (see corresponding documentation for any Python library code):

preprocess.py: reads the CSV files and reshapes them; recreates train-test splits; synthetically generates pseudorandom Gaussian data with induced outliers, 0s, and NaN values; creates cleaned control datasets trimmed of outliers; saves the portions and outliers removed from each dataset; and prints the datasets and their first-order statistics to the command line
divide_data: an alternative to sklearn.model_selection.train_test_split for splitting training and testing data
pandas.DataFrame.fillna
pandas.DataFrame.to_numpy
pandas.DataFrame.iloc
numpy.random.seed
sklearn.preprocessing.MinMaxScaler.fit
sklearn.model_selection.train_test_split
sklearn.preprocessing.MinMaxScaler
numpy.random.normal
pandas.DataFrame.transform
numpy.random.randint
numpy.percentile
pandas.DataFrame.mask
pandas.read_csv
pandas.DataFrame.drop
numpy.random.choice
sklearn.preprocessing.MinMaxScaler.fit_transform
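
The preprocessing steps listed above (synthetic Gaussian data with induced outliers, 0s, and NaN values, percentile-based trimming, and a manual train-test split) can be sketched roughly as follows. This is a minimal illustration and not the code in preprocess.py: the function names make_corrupted_gaussian, trim_outliers, and split_rows, the column names, the corruption fraction, and the percentile bounds are all assumptions chosen for the example, and it uses numpy.random.default_rng rather than the numpy.random.seed style listed above.

```python
import numpy as np
import pandas as pd


def make_corrupted_gaussian(n_rows=200, n_cols=4, frac=0.05, seed=0):
    """Hypothetical helper: Gaussian data with induced outliers, 0s, and NaNs."""
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=0.0, scale=1.0, size=(n_rows, n_cols))
    n_bad = int(frac * n_rows)
    # Pick three disjoint sets of rows to corrupt (assumed scheme).
    rows = rng.choice(n_rows, size=3 * n_bad, replace=False)
    out_rows, zero_rows, nan_rows = np.split(rows, 3)
    data[out_rows, 0] += 20.0   # push values far into the tail: outliers
    data[zero_rows, 1] = 0.0    # induced zeros
    data[nan_rows, 2] = np.nan  # induced missing values
    cols = [f"x{i}" for i in range(n_cols)]
    return pd.DataFrame(data, columns=cols)


def trim_outliers(df, column, lower=1.0, upper=95.0):
    """Return (kept, removed) rows using a percentile band on one column."""
    lo, hi = np.percentile(df[column].dropna(), [lower, upper])
    inside = df[column].between(lo, hi)  # NaN values count as outside
    return df[inside], df[~inside]


def split_rows(df, test_frac=0.25, seed=0):
    """Hypothetical manual train/test split by shuffling row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_test = int(round(test_frac * len(df)))
    return df.iloc[order[n_test:]], df.iloc[order[:n_test]]


df = make_corrupted_gaussian()
kept, removed = trim_outliers(df, "x0")
train, test = split_rows(kept)
print(df.describe())  # first-order statistics of the corrupted data
```

Returning the removed rows alongside the kept ones mirrors the README's note that the trimmed portions are saved rather than discarded, so the cleaned control dataset and the outliers can be inspected separately.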

For dependencies, see requirements.txt and import statements at the top of each file and their corresponding documentation.
