Saturday, May 16, 2020

Linear Regression From Scratch (1/5)


 
Before we hop into the derivation of simple linear regression, it’s important to build a strong intuition for what we are going to do and, especially, why we are going to do it. With that being said, let’s dive in!

To see when we need Simple Linear Regression, why don’t we start with a story about some friends?

Once upon a time, four friends named SpongeBob, Patrick, Squidward and Gary lived in Bikini Bottom. One day Squidward went to SpongeBob and they had this conversation. Let’s check it out.

Squidward : “Hey SpongeBob I’ve heard you’re so smart!”

SpongeBob : “Yes sir! There is no doubt in that.”

Squidward : “Is that so?”

SpongeBob : “Umm..Yes…”

Squidward : “So here’s the thing. I want to sell my house, as I’m moving to my lavish new house downtown. But I can’t figure out what price to ask! If I set the price too high, no one will buy it, and if I set it too low, I might take a tremendous financial loss! So you have to help me find the best price for my house. But keep in mind, you have only one day!”

(SpongeBob is stressed as always, but he’s optimistic about finding a solution. To discuss the problem, he goes to the house of his shrewd friend Patrick.)

(Patrick is in his living room watching TV with a big bowl of popcorn in his hands.)

(SpongeBob described the whole situation to Patrick.)

Patrick : “That’s a piece of cake my friend! Follow me!”

(They decided to go to Squidward’s neighborhood, where his two neighbors recently sold their houses. After some time they were able to find out the square footage and selling price of their houses.)

Here’s the data they dug up!

HOUSE_A :

Area : 1500 sq ft

Price : $150,000

HOUSE_B :

Area : 2500 sq ft

Price : $300,000

(From the collected data Patrick was able to draw the following graph.)


After some careful thought, Patrick was able to predict the selling price for Squidward’s house. Here’s the explanation he gave.

Given two points with different x-coordinates in a coordinate plane, we can always find the equation of the line passing through them. Here’s the formula for that :

Y = mX + b

Where,

m : slope of the line

b : y-intercept of the line

Y : y-coordinate of any point the line passes through

X : x-coordinate of that same point

*******

Basics of coordinate geometry :

Some conventions for naming points in the coordinate plane, and a few facts about slope :

(1) We read the coordinate plane from left to right when naming points.

(2) Reading left to right, the first point we reach is named (x1,y1) and the second is (x2,y2).

(3) Horizontal lines have a slope of 0.

(4) Vertical lines have an undefined slope (often loosely called “infinite”), because x2 - x1 = 0.

(5) If the Y-coordinate of the second point is greater than that of the first point, the line has a positive (+) slope; if it is smaller, the line has a negative (-) slope.

(6) Points at the same vertical distance from the X-axis, on the same side of it, have the same Y-coordinate.

(7) Points at the same horizontal distance from the Y-axis, on the same side of it, have the same X-coordinate.
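
These conventions can be sketched in a few lines of Python. This is only an illustration: the function name describe_slope is made up for this post, and it reports a vertical line’s slope as undefined rather than as a number.

```python
def describe_slope(p, q):
    """Classify the slope of the line through two distinct points (x, y)."""
    # Rules (1) and (2): name the points from left to right.
    (x1, y1), (x2, y2) = sorted([p, q])
    if x1 == x2:
        # Rule (4): a vertical line has no finite slope.
        return "undefined (vertical)"
    if y2 > y1:
        return "positive"          # rule (5)
    if y2 < y1:
        return "negative"          # rule (5)
    return "zero (horizontal)"     # rule (3)


# The two houses from the story, deliberately given in the "wrong" order.
print(describe_slope((2500, 300000), (1500, 150000)))  # positive
```

Because the points are sorted first, it does not matter in which order they are passed in.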



*******

So, from the above mentioned rules we can say that in our graph :

(x1,y1) = (1500,150000)

(x2,y2) = (2500,300000)

Now we can easily find the slope :

m = (y2 - y1) / (x2 - x1) = (300000 - 150000) / (2500 - 1500) = 150




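Since we know the equation of a straight line, Y = mX + b, we can easily find the value of b (the intercept) by plugging in the first point :

b = y1 - m * x1 = 150000 - (150 * 1500) = -75000

Now we have the final equation of our line :

Y = 150X - 75000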
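
Patrick’s two-point calculation can also be checked in code. Here is a minimal Python sketch; the function name line_through is an assumption made for this illustration, and the only data used are the two houses from the story.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points (x, y)."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: the slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # rise over run
    b = y1 - m * x1             # solve Y = mX + b for b at the first point
    return m, b


house_a = (1500, 150000)   # (area in sq ft, price in $)
house_b = (2500, 300000)

m, b = line_through(house_a, house_b)
print(f"Y = {m:g}X + ({b:g})")   # Y = 150X + (-75000)
```

The slope of 150 can be read as “each extra square foot adds $150 to the price”, at least according to these two houses.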

(Now Patrick and SpongeBob went to Squidward to find out the square footage of his house.)

SpongeBob : “Hey Squidward, can you tell me the total area of your house in square feet?”

Squidward : “Whyyy?”

SpongeBob : “Do you want to sell your house or not?”

Squidward : “Okay..Okay..It’s 1800 square feet.”

(SpongeBob and Patrick left the store.)

Now, in our equation :

Y : Price of the house

X : Area in square feet

Y = 150 * 1800 - 75000 = 195000

So, as per our prediction, the selling price of the house should be : $195,000
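
The same prediction as a tiny Python sketch. The names SLOPE, INTERCEPT and predict_price are assumptions chosen for this example; the numbers come from the line through the two neighbors’ houses.

```python
SLOPE = 150          # dollars per square foot
INTERCEPT = -75000   # dollars


def predict_price(area_sqft):
    """Predicted selling price from the two-house line Y = 150X - 75000."""
    return SLOPE * area_sqft + INTERCEPT


print(predict_price(1800))   # 195000
```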

(Now SpongeBob and Patrick decided to visit another ingenious friend of theirs, Gary the snail, to confirm the number they had predicted.)

SpongeBob : “Hey Gary, can you confirm that the price we predicted for Squidward’s house is correct?”

Gary : (Meditating for a minute) “I think you are underpricing the house.”

SpongeBob : “Butttt….!”

Patrick : “Can you help us?”

(Gary hands him a paper with the price and area of 100 houses in the city.)

(SpongeBob and Patrick went home and plotted the data on the coordinate plane.)

Here we can see that drawing a line through just two points won’t work in this situation; we must find a line that “best fits” all 100 data points. When thinking about the best-fit line, think of the one line that is, overall, closest to all the points. Instead of relying on trial and error, we can determine this best fit by minimizing a quantity called the sum of squared errors.

*******

Sum of Squared Errors :

To begin, we need to find the equation of a line that minimizes the distance between the line and all the data points we have plotted.

When asked to explain the Sum of Squared Errors, Patrick said the following :

One way to measure distance between the scattered points and the line is to find the distance between their Y values.

Let’s say we use our line from earlier, Y = 150X - 75000, and want to see how accurate it is for an 1800 square foot house that actually sold for $220,000. If we input 1800 square feet into our equation, it says the house should have sold for $195,000, but in reality it sold for $220,000: a difference of $25,000. Point one for Gary!

In brief,

Actual selling price : $220,000

Predicted selling price : $195,000

Error in prediction : 220,000 - 195,000 = $25,000

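This difference, or error, in price is exactly what we need to compute for each of the other 99 data points. Once we have done this for every point, we add the errors together to measure our accuracy. More formally stated…

Total error = Σ (actual price_i - predicted price_i), summed over all houses i = 1, ..., n

and since positive and negative errors would otherwise cancel each other out, we square each error before adding :

SSE = Σ (actual price_i - predicted price_i)^2, summed over all houses i = 1, ..., n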

Now what we have to do is minimize this error so that we can predict the selling price more accurately. Once we are equipped with that power, we’ll be able to predict the price of almost every house in our neighborhood.
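
To make the sum of squared errors concrete, here is a minimal Python sketch. Gary’s 100-house sheet is not reproduced in this post, so the sketch uses only the three houses we actually know: the two neighbors’ houses and the 1800 sq ft house that sold for $220,000. The function names and the second intercept (-65000) are assumptions picked purely for illustration.

```python
def predict_price(area_sqft, slope, intercept):
    """Price predicted by the line Y = slope * X + intercept."""
    return slope * area_sqft + intercept


def sum_of_squared_errors(houses, slope, intercept):
    """Add up (actual - predicted)^2 over all (area, price) pairs."""
    total = 0
    for area, actual_price in houses:
        error = actual_price - predict_price(area, slope, intercept)
        total += error ** 2   # squaring keeps negative errors from cancelling
    return total


# (area in sq ft, actual selling price in $)
houses = [(1500, 150000), (2500, 300000), (1800, 220000)]

# Patrick's two-point line: perfect on two houses, off by $25,000 on the third.
print(sum_of_squared_errors(houses, 150, -75000))   # 625000000

# An arbitrary alternative line with the same slope, shifted up by $10,000.
print(sum_of_squared_errors(houses, 150, -65000))   # 425000000
```

The shifted line misses every house a little, yet its total squared error is smaller. That is the idea behind a “best-fit” line: it does not have to pass through any particular point, it just has to keep the overall error as low as possible.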

*******

Guidelines for using a regression line :

(1) Use a regression line to predict values only when there is a significant correlation between the variables.

(2) Stay within the range of the data. Do not extrapolate! For example, if the data ranges from 10 to 60, don’t try to predict the value for 500.

(3) Don’t make predictions for one population based on another population’s regression line.
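
Guideline (2) can be enforced in code. The following Python sketch is just one possible approach: the function name predict_in_range is hypothetical, and the slope and intercept are the ones from the two-house line earlier in this post.

```python
def predict_in_range(area_sqft, observed_areas, slope=150, intercept=-75000):
    """Predict a price, but refuse to extrapolate beyond the observed areas."""
    lowest, highest = min(observed_areas), max(observed_areas)
    if not lowest <= area_sqft <= highest:
        raise ValueError(
            f"{area_sqft} sq ft is outside the observed range "
            f"{lowest} to {highest} sq ft"
        )
    return slope * area_sqft + intercept


observed = [1500, 2500]   # areas of the two neighbors' houses

print(predict_in_range(1800, observed))   # 195000

try:
    predict_in_range(5000, observed)
except ValueError as exc:
    print("Refused:", exc)
```

Squidward’s 1800 sq ft house sits between the two observed areas, so the prediction goes through; a 5000 sq ft mansion would be rejected.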



*******

Use cases of linear regression :

(1) Height and weight.

(2) Alcohol consumed and blood alcohol content.

(3) Vital lung capacity and pack-years of smoking.

(4) Driving speed and gas mileage.

*******

Moving forward,

In the next part we’ll use a brute-force approach to find the values of the slope and intercept for our “best-fit” line.


*******

Special thanks to : Patrick, SpongeBob, Gary, Squidward! :)

You can download the code and some handwritten notes on the derivation from here : https://drive.google.com/open?id=1_stSoY4JaKjiSZqDdVyW8VupATdcVr67

If you have any additional questions, feel free to contact me : shuklapratik22@gmail.com
