HOUSE_A :
Area : 1500 sq ft
Price : $150,000
HOUSE_B :
Area : 2500 sq ft
Price : $300,000
(From the collected data Patrick was able to draw the following graph.)
The equation of a straight line is : Y = mX + b
Where,
m : slope of the line
b : y-intercept of the line
Y : y-coordinate of a point that the line passes through
X : x-coordinate of a point that the line passes through
*******
Basics of coordinate geometry :
Some rules for choosing the points in the coordinate plane:
(1) We always read the coordinate plane from left to right to name the points.
(2) Reading from left to right, the first point we reach is named (x1,y1) and the second point is (x2,y2).
(3) Horizontal lines have a slope of 0.
(4) Vertical lines have an undefined slope (often loosely called “infinite”).
(5) If the Y-coordinate of the second point is greater than that of the first point, the line has a positive (+) slope; if it is smaller, the line has a negative (-) slope.
(6) Points at the same vertical distance from the X-axis, on the same side of it, have the same Y-coordinate.
(7) Points at the same horizontal distance from the Y-axis, on the same side of it, have the same X-coordinate.
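The rules above can be sketched in a few lines of Python. This is only an illustration and not part of the original notes: the function name `slope` is made up for this example, and it returns `None` for a vertical line because dividing by a zero run has no defined result.

```python
def slope(p1, p2):
    """Return the slope of the line through p1 and p2, or None if vertical."""
    # Rules (1) and (2): name the points from left to right, so sort by x first.
    (x1, y1), (x2, y2) = sorted([p1, p2])
    if x1 == x2:
        return None  # vertical line: the run is zero, so the slope is undefined
    return (y2 - y1) / (x2 - x1)  # rise over run


print(slope((0, 2), (5, 2)))  # horizontal line -> 0.0
print(slope((3, 1), (3, 9)))  # vertical line -> None
print(slope((2, 4), (1, 1)))  # points given right to left -> 3.0
```

Sorting first means the caller does not have to worry about which point is (x1,y1); the sign of the result then follows rule (5).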
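(x1,y1) = (1500,150000)
(x2,y2) = (2500,300000)
Now we can easily find the slope :
m = (y2 - y1) / (x2 - x1) = (300000 - 150000) / (2500 - 1500) = 150
Substituting (x1,y1) into Y = mX + b gives the intercept : b = 150000 - (150)(1500) = -75000
Now we have our finalized equation of the line :
Y = 150X - 75000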
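As a quick check, the slope and intercept of the line through the two houses can be computed in Python. This is a minimal sketch; the helper name `line_through` is hypothetical and simply applies rise over run, then solves Y = mX + b for b.

```python
def line_through(p1, p2):
    """Return (m, b) for the line Y = m*X + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # slope: rise over run
    b = y1 - m * x1            # intercept: solve Y = m*X + b at the first point
    return m, b


house_a = (1500, 150000)  # (area in sq ft, price in $)
house_b = (2500, 300000)

m, b = line_through(house_a, house_b)
print(m, b)  # 150.0 -75000.0
```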
(Now Patrick and SpongeBob went to Squidward to find out the square footage of his house.)
SpongeBob : “Hey Squidward, can you tell me the total square footage of your house?”
Squidward : “Whyyy?”
SpongeBob : “Do you want to sell your house or not?”
Squidward : “Okay..Okay..It’s 1800 square feet.”
(SpongeBob and Patrick left the store.)
Now, in our equation :
Y : Price of the house
X : Area in square feet
Y = 150(1800) - 75000 = 195000
So, as per our prediction, the selling price of the house should be : $195,000
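The same prediction can be reproduced in Python. This is a small sketch with a made-up function name, `predict_price`, using the slope 150 and intercept -75000 of the line through the two houses as defaults.

```python
def predict_price(area_sqft, m=150, b=-75000):
    """Predict a price in dollars from an area in square feet using Y = m*X + b."""
    return m * area_sqft + b


print(predict_price(1800))  # 195000
```

Plugging in the two original houses (1500 and 2500 square feet) gives back their known prices, which is a handy sanity check on the line.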
(Now SpongeBob and Patrick decided to visit another ingenious friend, Gary the snail, to confirm the number they had predicted.)
SpongeBob : “Hey Gary, can you confirm that the price we predicted for Squidward’s house is correct?”
Gary : (Meditating for a minute) “I think you are underpricing the house.”
SpongeBob : “Butttt….!”
Patrick : “Can you help us?”
(Gary hands him a paper with the price and area of 100 houses in the city.)
(SpongeBob and Patrick went home and plotted the data on the coordinate plane.)
Now, here we can see that using two points to plot a line won’t work in this situation; we must find a line that “best fits” the 100 data points. When thinking about the best-fit line, think of the one line that is closest to all the points. Instead of trial and error, we can determine this best fit by minimizing a quantity called the sum of squared errors.
*******
Sum of Squared Errors :
To begin, we need to find the equation of a line that minimizes the distance between the line and all the data points we have plotted.
When asked to explain the Sum of Squared Errors, Patrick explained the following :
One way to measure distance between the scattered points and the line is to find the distance between their Y values.
Let’s say we use our line from earlier, Y = 150X - 75000, and want to see how accurate it is for an 1800 square foot house that actually sold for $220,000. If we input 1800 square feet into our equation, it says the house should have sold for $195,000, but in reality it sold for $220,000: a difference of $25,000. Point one for Gary!
In brief,
Actual selling price : $220,000
Predicted selling price : $195,000
Error in prediction : 220,000 - 195,000 = $25,000
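This difference, or error, in price is exactly what we need to compute for the remaining 99 data points. Once we do this for each point, we then add the errors together to measure our accuracy. More formally stated :
Total error = Σ (actual Y_i - predicted Y_i), summed over all 100 data points i
Because positive and negative errors would cancel each other out, we square each error before adding :
Sum of squared errors = Σ (actual Y_i - predicted Y_i)^2 = Σ (Y_i - (mX_i + b))^2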
Now what we have to do is minimize this error to predict the output price more accurately. Once we are equipped with that power, we’ll be able to predict the price of almost every house in our neighborhood.
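To make the idea concrete, here is a rough Python sketch of the sum of squared errors. The function name `sum_squared_error` is made up for illustration, and the data is only the three houses mentioned in this post (not Gary’s list of 100 houses, which isn’t reproduced here). The second intercept, -60000, is an arbitrary value picked just to show that a different line can score lower.

```python
def sum_squared_error(points, m, b):
    """Sum of (actual - predicted)^2 over all (area, price) points."""
    total = 0
    for area, actual_price in points:
        predicted_price = m * area + b
        total += (actual_price - predicted_price) ** 2
    return total


# Only the three houses mentioned in this post, as (area in sq ft, price in $).
houses = [(1500, 150000), (2500, 300000), (1800, 220000)]

# Our two-point line: it is exact on the first two houses and off by $25,000 on the third.
print(sum_squared_error(houses, m=150, b=-75000))  # 625000000

# A hypothetical alternative line with an arbitrarily chosen intercept.
print(sum_squared_error(houses, m=150, b=-60000))  # 550000000
```

On these three points the second line has the smaller sum of squared errors, even though it no longer passes exactly through the first two houses. That is the trade-off a best-fit line makes.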
*******
Guidelines for regression line :
(1) Use a regression line to predict values only when there is significant correlation.
(2) Stay within the range of the data. Do not extrapolate! For example, if the data ranges from 10 to 60, don’t try to predict a value for 500.
(3) Don’t make predictions for one population based on another population’s regression line.
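Guideline (2) is easy to enforce in code. The sketch below is one possible way to do it, not a standard recipe: the function name `predict_within_range` and its parameters are hypothetical, and it simply refuses to predict outside the range of areas the line was built from.

```python
def predict_within_range(x, m, b, x_min, x_max):
    """Predict Y = m*X + b, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"{x} is outside the data range [{x_min}, {x_max}]")
    return m * x + b


# Our line was built from houses between 1500 and 2500 square feet.
print(predict_within_range(1800, m=150, b=-75000, x_min=1500, x_max=2500))  # 195000

try:
    predict_within_range(500, m=150, b=-75000, x_min=1500, x_max=2500)
except ValueError as err:
    print("Refused:", err)
```

For a 500 square foot house the line itself would give 150(500) - 75000 = 0 dollars, which shows why extrapolating is risky.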
Use cases of linear regression :
(1) Height and weight
(2) Alcohol consumed and blood alcohol content.
(3) Vital lung capacity and pack-years of smoking.
(4) Driving speed and gas mileage.
*******
Moving forward,
In the next part we’ll look at a brute-force approach to finding the slope and intercept of our “best-fit” line.
*******
Special thanks to : Patrick, SpongeBob, Gary, Squidward! :)
You can download the code and some handwritten notes on the derivation from here : https://drive.google.com/open?id=1_stSoY4JaKjiSZqDdVyW8VupATdcVr67
If you have any additional questions, feel free to contact me : shuklapratik22@gmail.com