Data Science Fundamentals: Week 4

This week was an introduction to Python and its many quantitative functions. Having had the chance to explore both the Python and R programming languages, I have to say that I found Python to be a language that focuses on code readability and complex functions, and one that would be highly useful for developers. R tends to lean towards statisticians, with a focus on graphical analysis and data visualization. I think each was developed with a specific working group in mind; between the two, I believe Python is best suited for those on the development side.

I tested out the k-nearest neighbor algorithm using Python and learned quite a bit about two widely used distance metrics: Euclidean and Manhattan. To start off, k-nearest neighbor (kNN) is an algorithm that stores known cases/objects and classifies new objects based on similarity measures (e.g., distances). Let’s take an example: we have a space containing a variety of shapes (triangles, squares, circles) and one unknown shape (?), and we are interested in figuring out the class of that shape.


Now, we know the unknown shape can be a triangle, a circle or a square. To decide, we look at the k in kNN: k is the number of nearest neighbors of the unknown object that we will use (choosing k is an important factor in this problem). Let’s say we set k=4; in this case, we look at the 4 shapes nearest to the unknown shape.


Suppose the 4 nearest neighbors to the unknown shape are 3 triangles and a circle. Since triangles form the majority, the unknown shape would be classified as a triangle. We can use this concept of nearest neighbors, together with the distances between points, to help identify unknown objects.
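The voting step in this example can be sketched in a few lines of Python. This is only an illustration: the function name majority_vote is my own, and the list of neighbor labels is typed in by hand rather than found by measuring distances.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors."""
    label, _count = Counter(neighbor_labels).most_common(1)[0]
    return label

# The k=4 nearest neighbors from the example: 3 triangles and 1 circle
neighbors = ["triangle", "triangle", "circle", "triangle"]
predicted = majority_vote(neighbors)
print(predicted)  # triangle
```

With an even k such as 4, a 2-2 tie is possible, which is one reason the choice of k matters.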

To work with this algorithm, I calculated the distance between points using two methods:

  1. Euclidean - the straight-line distance between two points
  2. Manhattan - the sum of the absolute differences between the coordinates of two points
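Written out as formulas, the two metrics look like this. The notation is my own assumption: x and y are two points with n coordinates each, written x_i and y_i.

```latex
d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert
```

In two dimensions the Euclidean formula is the Pythagorean theorem applied to the horizontal and vertical gaps between the points.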


The Euclidean distance relies on the Pythagorean theorem (c^2 = a^2 + b^2); as a result, the sqrt function is imported from Python’s math module. Following this, the points whose distance is to be calculated are identified. In my code, I defined the Euclidean distance function with two arguments (x, y). The points “x” and “y” are “zipped” so that their coordinates are paired up as “a” and “b”, and the function returns the square root of the sum of the squared differences between each “a” and “b”.

Calculating manually, I arrive at the same result:

[Image: Blog9 (2)]
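For readers who cannot see the screenshots, the Euclidean calculation can be sketched roughly as follows. This is a minimal version under my own assumptions, not necessarily the exact code from the images: the function name euclidean_distance and the example points are made up for illustration.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance between two points with the same number of coordinates."""
    # zip pairs up the coordinates: (x[0], y[0]), (x[1], y[1]), ...
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical example points (not the ones used in the post)
p = (0, 0)
q = (3, 4)
print(euclidean_distance(p, q))  # 5.0
```

Checking by hand: the differences are 3 and 4, so the distance is sqrt(3^2 + 4^2) = sqrt(25) = 5.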

The same code format was followed for the Manhattan distance; the main difference was the equation used to calculate the final distance (i.e., summing absolute differences):

[Image: Blog9 (4).jpg]

The equation is the sum of the absolute differences between the coordinates of the “x” and “y” points. Calculating manually also yields the same result:

[Image: Blog9 (5).jpg]
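The Manhattan calculation can be sketched in the same way. Again, this is an illustrative version under my own assumptions (the function name manhattan_distance and the example points are hypothetical), not a copy of the code in the screenshots.

```python
def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences between two points."""
    # zip pairs up the coordinates, abs() drops the sign of each difference
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical example points (not the ones used in the post)
p = (0, 0)
q = (3, 4)
print(manhattan_distance(p, q))  # 7
```

Checking by hand: |0 - 3| + |0 - 4| = 3 + 4 = 7, which is longer than the straight-line distance of 5 between the same two points.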

These equations help us identify an unknown object, be it by distance or by type. We can use them to determine the distance between points and, on a more complex level, to determine the unknown location of an object. There are different ways to frame the code when calculating distances; however, I found this particular code easy to understand and applicable to different points.
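To tie the pieces together, here is a rough sketch of how either distance could be plugged into a small kNN classifier. Everything in it is an assumption made for illustration: the function names, the made-up coordinates for the shapes, and the choice of k=4 to mirror the earlier example.

```python
from collections import Counter
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def knn_classify(query, training, k, distance):
    """Classify query by majority vote among its k nearest training points.

    training is a list of (point, label) pairs; distance is a function
    that takes two points and returns a number.
    """
    # Sort the known points by their distance to the query and keep the k closest
    nearest = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _point, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up coordinates for known shapes
shapes = [
    ((1, 1), "triangle"),
    ((2, 1), "triangle"),
    ((1, 2), "triangle"),
    ((3, 3), "circle"),
    ((8, 8), "square"),
    ((9, 8), "square"),
]
unknown = (2, 2)
print(knn_classify(unknown, shapes, 4, euclidean))  # triangle
print(knn_classify(unknown, shapes, 4, manhattan))  # triangle
```

Here both metrics pick the same 4 neighbors (3 triangles and 1 circle), so the answer matches; with other data the two metrics can rank neighbors differently and give different classes.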
