Predictive modelling (here, classification) can be seen as a branch of machine learning concerned with assigning unknown objects to specific categories. At its core, predictive modelling relies on a reference, or training, dataset that informs predictions on future test inputs. Before running any algorithm, the dataset of interest needs to be prepared: the data is a) normalized across its variables (to ensure consistency of scale) and b) split into a training dataset (which serves as the reference) and a test dataset. Using the training dataset, we "teach" the system to recognize categorization patterns and predict the classification of future input data. The k-Nearest Neighbours (kNN) algorithm predicts the classification of a test input from the classes of the nearest surrounding objects in the training data.
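The two preparation steps above, normalization and a train/test split, can be sketched as follows. This is a minimal illustration using scikit-learn on a toy feature matrix; the array names, the min-max scaling choice, and the 70/30 split ratio are assumptions for the example, not taken from the original analysis.

```python
# Sketch of data preparation: a) min-max normalization, b) train/test split.
# All names and parameter choices here are illustrative assumptions.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))          # toy feature matrix (150 objects, 4 variables)
y = rng.integers(0, 3, size=150)       # toy class labels (3 categories)

# a) normalize each variable to the [0, 1] range so no variable dominates distances
X_norm = MinMaxScaler().fit_transform(X)

# b) split into a training set (the reference) and a held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.3, random_state=42)
```

Normalization matters for kNN in particular, because the algorithm is driven entirely by distances between observations: a variable measured on a large scale would otherwise dominate the neighbour calculation.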
Using the Iris dataset, I developed the training and test datasets and applied the kNN algorithm. The training data serves as the reference point, allowing the system to recognize patterns and use them to predict outcomes in the test dataset. With kNN, we predict the plant species of each observation in the test dataset (our output variable) based on its nearest neighbours in the training dataset. See below for algorithm design and assumptions:
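A sketch of this workflow on the Iris data, assuming scikit-learn's bundled copy of the dataset, k = 5 neighbours, and a 70/30 stratified split; these specific parameter choices are illustrative, not taken from the original analysis.

```python
# Fit kNN on the Iris training set and predict species for the test set.
# k = 5, the 70/30 split, and min-max scaling are assumed example choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X = MinMaxScaler().fit_transform(iris.data)   # normalize the four measurements

X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=1, stratify=iris.target)

knn = KNeighborsClassifier(n_neighbors=5)     # classify by majority vote of 5 neighbours
knn.fit(X_train, y_train)                     # "teach" the system from the training set
predictions = knn.predict(X_test)             # predicted species for each test observation
```

Each prediction is simply the majority species among the five training observations closest (in Euclidean distance) to the test observation.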
Subsequent steps involve testing the accuracy of the algorithm, i.e. comparing the predicted species against the known species labels of the test set.
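This accuracy check can be sketched as below. The block is self-contained, so it repeats the fitting step; the split, scaling, and k = 5 choices are the same assumed example values as above, and the confusion matrix is included as one common way to break accuracy down per species.

```python
# Evaluate the kNN predictions against the known test labels.
# Split ratio, scaling, and k = 5 are assumed example choices.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X = MinMaxScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=1, stratify=iris.target)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predictions = knn.predict(X_test)

accuracy = accuracy_score(y_test, predictions)   # fraction of correct predictions
cm = confusion_matrix(y_test, predictions)       # rows: true species, cols: predicted
```

The confusion matrix is often more informative than the single accuracy number, since on Iris the errors (if any) typically concentrate in the versicolor/virginica pair.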