House Price Prediction Project - Beginner

16-Day Challenge – Personal Project

Project

Information

Purpose/Goal

To collect data, create a House Prices Dataset by cleaning the data and preprocessing it, and test my regression skills by designing an algorithm to accurately predict house prices in Gaborone, Botswana.

Findings

To be determined…

Introduction

I have started a 16-Day House Price Prediction Personal Project Challenge. I’m taking part in it to encourage myself to start and finish the project, and I’m posting it so that viewers like you can hold me accountable if I don’t live up to the challenge. In this challenge, I will collect data, create a House Prices Dataset, and test my regression skills by designing an algorithm to accurately predict house prices in Gaborone, Botswana.

Accurately predicting house prices can be a daunting task. Buyers are not concerned only with the size (square feet) of the house; various other factors play a key role in deciding the price of a house or property. It can be extremely difficult to figure out the right set of attributes that help explain buyer behavior. The dataset that I create will therefore consist of data from various property aggregators across Gaborone, Botswana.

In this challenge, my role as a data scientist is to predict prices as accurately as possible. After the challenge there will still be a lot of room for improvement: I plan to learn more about feature engineering and to master advanced regression techniques such as Random Forest, Deep Neural Nets, and various other ensembling techniques.

Project

Implementation

Day #1: Research

Research on the data required for the project, possible data source(s) & Data Collection Method(s)

Main Research Questions:

  • What are the various factors that play a key role in deciding the price of a house/property?
    • Location: According to Joe Gomez, the value of a house is assessed based on its location, looking at three primary indicators:
      • The quality of local schools
      • Employment opportunities
      • Proximity to shopping, entertainment, and recreational centers.

These factors can influence why some neighborhoods command steep prices and others that are a few miles away don’t. In addition, a location’s proximity to highways, utility lines, and public transit can all impact a home’s overall value. When it comes to calculating a home’s value, location can be more important than even the size and condition of the house.

    • Age & Condition: Typically, homes that are newer appraise at a higher value. The fact that critical parts of the house, like plumbing, electrical, the roof, and appliances are newer and therefore less likely to break down, can generate savings for a buyer. For example, if a roof has a 20-year warranty, that’s money an owner will save over the next two decades, compared to an older home that may need a roof replaced in just a few years.
    • The Local Market: Even if the house is in excellent condition, in the best location, with premium upgrades, the number of other properties for sale in its area and the number of buyers in the market can impact the house’s value.
    • Home Size & Usable Space: When estimating your home’s market value, size is an important element to consider, since a bigger home can positively impact its valuation. The value of a home is roughly estimated at price per square foot, that is, the sales price divided by the square footage of the home. In addition to square footage, a home’s usable space matters when determining its value. Garages, attics, and unfinished basements are generally not counted in usable square footage. So if you have a 2,000-square-foot home with a 600-square-foot garage, that’s only 1,400 square feet of liveable space. Liveable space is what is most important to buyers and appraisers. Bedrooms and bathrooms are most highly valued, so the more beds and baths your home offers, the more your home is generally worth. However, these trends are very locally specific.
  • What attributes or features can I find or attain from these factors, and what would they look like?
  • What might the full dataset look like, given that it will combine data from various data sources and property aggregators across Gaborone, Botswana?
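
The liveable-space and price-per-square-foot arithmetic from the Home Size & Usable Space point can be sketched in a few lines of Python. This is only an illustration: the function names and the sale price are hypothetical, and only the 2,000 and 600 square foot figures come from the text above.

```python
def liveable_area(total_sqft: float, excluded_sqft: float) -> float:
    """Floor area left after removing space such as a garage or attic."""
    return total_sqft - excluded_sqft


def price_per_sqft(sale_price: float, sqft: float) -> float:
    """Sale price divided by the square footage it is measured against."""
    return sale_price / sqft


# Figures from the text: a 2,000 sq ft home with a 600 sq ft garage.
usable = liveable_area(2000, 600)

# Hypothetical sale price, chosen only to make the arithmetic easy to follow.
example_price = 1_400_000

print(usable)
print(price_per_sqft(example_price, usable))
```

Features such as liveable area or price per square foot are the kind of derived attributes the second research question is asking about.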

Days No. 2 to 9: Data Collection, Data Preprocessing, Feature Selection & Data Description

Data Collection Phase: Execute the chosen Data Collection Method

Scraping Features (Attributes) of Houses Being Sold in Gaborone

Method:

WEB SCRAPING

I had documented a few research questions to guide me as I collected the data. Those questions looked at which factors play a key role in deciding the price of a house, which features (attributes) would be ideal for the dataset and what they would look like, and which data sources I should look into to collect all the attributes I need.

Challenges & Roadblocks

Like every other data science project, this one did not go exactly as planned and had its own unique challenges and roadblocks.

The first challenge was that the sites (data sources) had limited information and not all attributes were available. For example, the Age & Condition of the property was not included in the listings. Not all property listings on each site had the floor area attribute, and the same applies to the Erf Size attribute: a listing would have one without the other. The same was true of other attributes that were not collected. Another challenge was the limited number of data sources to collect from, specifically real estate websites that list properties for sale in Gaborone, Botswana.

As for the roadblocks, the first one I encountered was the small number of houses actually for sale in Gaborone, Botswana, which really limits the size of the House Price Dataset that I want to create. Another roadblock was the lack of third-party House Price Datasets for Gaborone, Botswana, which I could have used to increase the size of the dataset.

Data Collection Process

At this point, I will walk you through the steps I took to scrape the data on houses up for sale from websites like Property24 Botswana. I will use three Python libraries to do this: Pandas, BeautifulSoup, and Urllib.

Code snippet of the Packages used

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
I will provide all the code I used. You can use it as starter code for any web scraping project of your own in the future.
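
As a rough idea of how BeautifulSoup and pandas fit together, here is a minimal sketch that turns listing HTML into a DataFrame. It is not the actual script from this project: the HTML snippet, the `listing`, `title`, and `price` class names, and the `parse_listings` function are all made up for illustration, and a real site will use different markup. In practice the HTML string would come from `urlopen(url).read()`.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_listings(html: str) -> pd.DataFrame:
    """Extract one row per listing card from a page of search results."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "listing", "title" and "price" are hypothetical class names.
    for card in soup.find_all("div", class_="listing"):
        title = card.find("span", class_="title")
        price = card.find("span", class_="price")
        rows.append({
            # Listings often lack some attributes, so missing tags become None.
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return pd.DataFrame(rows, columns=["title", "price"])


# Made-up sample standing in for a downloaded page.
SAMPLE_HTML = """
<div class="listing"><span class="title">3 Bedroom House</span><span class="price">P 1 500 000</span></div>
<div class="listing"><span class="title">2 Bedroom House</span></div>
"""

listings = parse_listings(SAMPLE_HTML)
print(listings)
```

Checking whether each tag exists before reading it matters here, because, as noted above, many listings have one attribute without the other.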
  • Property 24 Botswana: The Property 24 Botswana website offers more attributes than the ones I managed to collect. I was limited by my skill set, using only three Python packages: urllib, BeautifulSoup, and pandas. I have a bit of experience with Selenium WebDriver, which could be used to collect more attributes than I managed to, but my skill level with Selenium is also limited. NB: The data in the CSV file is untouched and very dirty (unclean).
  • Seeff: The Seeff Botswana website offers more attributes than the ones I managed to collect. To collect them, I used a few advanced element locators and the same three Python packages listed above. After investigating the site, I came to the conclusion that the site is static, not dynamic, and that I would not be able to get enough data from just the search results of the site. The strategy I decided to use was to scrape the URL of each property listing, compile the URLs into a list, then scrape the data I need from each URL in the list. NB: The data in the CSV file is untouched and very dirty (unclean).
  • ReMax Properties: As of this month and year (June 2022), Apex Properties only has about 15 houses for sale in Gaborone, whereas ReMax has about 80, which is why I scraped ReMax instead of Apex Properties. NB: I have nothing against Apex Properties; this decision was purely based on the number of properties up for sale at the time. The ReMax website offers more attributes (features) than the few I managed to collect. To collect them, I used a few advanced element locators as well as the three Python packages used previously. After investigating the site, I came to the conclusion that the site is static, not dynamic, and that I would not be able to get enough data from just the search results of the site. The strategy I decided to use was to scrape the URL of each property listing, compile the URLs into a list, then scrape the data I need from each URL in the list.
  Click here for the Python scripts: https://github.com/SandileDesmondMfazi/GaboroneHousePricePredicitions
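
The two-step strategy used for Seeff and ReMax (collect the listing URLs first, then scrape each listing page) could look roughly like the sketch below. The real scripts are in the repository linked above; this version is a simplified stand-in in which the `listing-link` class, the `price` id, the example.com URLs, and all function names are assumptions. The `fetch` argument is there so the flow can be tried offline with canned pages.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_listing_urls(search_html, base_url):
    """Step 1: gather the unique listing URLs found on a search results page."""
    soup = BeautifulSoup(search_html, "html.parser")
    urls = []
    # "listing-link" is a hypothetical class name.
    for link in soup.find_all("a", class_="listing-link", href=True):
        full_url = urljoin(base_url, link["href"])
        if full_url not in urls:
            urls.append(full_url)
    return urls


def scrape_listing(html):
    """Step 2: pull the attributes of interest from a single listing page."""
    soup = BeautifulSoup(html, "html.parser")
    price = soup.find(id="price")  # hypothetical element id
    return {"price": price.get_text(strip=True) if price else None}


def scrape_all(search_html, base_url, fetch=fetch_html):
    """Visit every listing URL and return one record per listing."""
    records = []
    for url in collect_listing_urls(search_html, base_url):
        record = scrape_listing(fetch(url))
        record["url"] = url
        records.append(record)
    return records


# Offline demo with made-up pages instead of a live website.
BASE = "https://example.com"
SEARCH = (
    '<a class="listing-link" href="/house/1">A</a>'
    '<a class="listing-link" href="/house/2">B</a>'
    '<a class="listing-link" href="/house/1">A again</a>'
)
PAGES = {
    "https://example.com/house/1": '<span id="price">P 900 000</span>',
    "https://example.com/house/2": "<p>No price listed</p>",
}
records = scrape_all(SEARCH, BASE, fetch=PAGES.__getitem__)
print(records)
```

Calling `scrape_all(search_html, base_url)` without the `fetch` argument would download each listing page with `urlopen` instead.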

Data Preprocessing Phase: Preparing & cleaning the collected data

Integrating the different dataset(s), dealing with missing values and duplicates, correcting data types, etc.
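
A minimal pandas sketch of these preprocessing steps is shown below. The two small tables, the column names, and the choice to drop rows without a usable price are assumptions made for illustration only; they are not the real scraped data or the final cleaning rules for this project.

```python
import re

import pandas as pd

# Made-up raw rows standing in for two scraped sources.
source_a = pd.DataFrame({
    "price": ["P 1 500 000", "P 900 000", "P 900 000"],
    "bedrooms": ["3", "2", "2"],
    "floor_size": ["250 m²", None, None],
})
source_b = pd.DataFrame({
    "price": ["P 2 100 000", "POA"],
    "bedrooms": ["4", "3"],
    "floor_size": ["320 m²", "180 m²"],
})


def parse_number(value):
    """Strip currency symbols, units and spaces; return NaN if nothing is left."""
    if value is None or pd.isna(value):
        return float("nan")
    digits = re.sub(r"[^0-9.]", "", str(value))
    return float(digits) if digits else float("nan")


# Integrate the sources into one table.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Remove exact duplicate listings.
combined = combined.drop_duplicates().reset_index(drop=True)

# Correct the data types: text such as "P 1 500 000" becomes a number.
for column in ["price", "bedrooms", "floor_size"]:
    combined[column] = combined[column].map(parse_number)

# Assumption: a row without a usable price cannot be used for training.
combined = combined.dropna(subset=["price"]).reset_index(drop=True)
combined["bedrooms"] = combined["bedrooms"].astype(int)

print(combined)
print(combined["floor_size"].isna().sum(), "listing(s) still missing floor size")
```

How to handle the remaining missing floor sizes (dropping, filling, or using Erf Size instead) is left open here, since that decision depends on how many listings are affected.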

 

Days No. 9 to 12: Model Training, Evaluation & Model Selection

Not yet reached…

Days No. 13 to 16: Hyperparameter Tuning

Not yet reached…

Project

Results

Not yet reached…