
CS 229 Project Report: Predicting Used Car Prices

Kshitij Kumbar, Pranav Gadre and Varun Nayak

Abstract

Determining whether the listed price of a used car is fair is a challenging task, due to the many factors that drive a used vehicle's price on the market. The aim of this project is to develop models that can accurately predict the price of a used car based on its features, in order to help buyers make informed purchases. We implement and evaluate various learning methods on a dataset consisting of the sale prices of used cars across the United States. K-Means clustering with linear regression yields the best results, but is compute heavy. Conventional linear regression also yielded satisfactory results, with the advantage of a significantly lower training time in comparison to the aforementioned methods.




Motivation

Deciding whether a used car is worth the posted price when you see listings online can be difficult. Several factors, including mileage, make, model, and year, influence the actual worth of a car. From the perspective of a seller, it is also a dilemma to price a used car appropriately [2-3]. Based on existing data, the aim is to use machine learning algorithms to develop models for predicting used car prices.

Dataset

For this project, we are using the dataset on used car sales from all over the United States, available on Kaggle [1]. The features available in this dataset are Mileage, VIN, Make, Model, Year, State and City.

Pre-processing

In order to get a better understanding of the data, we examined the distribution of prices and found that the dataset had many outliers. Typically, models that are from the latest year and have low mileage sell for a premium; however, there were many data points that did not follow this trend. To clean up the data, we pruned our dataset to three standard deviations around the mean in order to remove outliers.
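As an illustration, the pruning step above might look like the following in pandas. The file name (true_car_listings.csv) and the Price column name are assumptions for this sketch, not confirmed by the report.

```python
import pandas as pd

# Hypothetical file/column names for the Kaggle used-car dataset.
df = pd.read_csv("true_car_listings.csv")

# Keep only listings whose price falls within three standard
# deviations of the mean price, as described above.
mean, std = df["Price"].mean(), df["Price"].std()
df = df[df["Price"].between(mean - 3 * std, mean + 3 * std)]
```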

We converted the Make, Model and State features into categorical variables. For the City feature in the dataset, we replaced the string representing the city with a boolean which was set if the population of the city was above a certain threshold, marking it as a major city.

Fig 1: (Top) Histogram for raw data. (Bottom) Histogram after pruning.

Certain features such as VIN numbers were dropped during training, as these were unique to each vehicle and thus carried no predictive value. While implementing certain machine learning frameworks, we realised that certain categorical features had common values, which causes problems when using frameworks such as XGBoost that require unique feature names (across all features).
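A minimal sketch of these transformations, assuming hypothetical column names and a toy population lookup; the report does not state the actual population threshold or its source.

```python
# Toy population lookup; a real one would cover every city in the data.
city_population = {"San Francisco": 870_000, "Palo Alto": 65_000}
MAJOR_CITY_THRESHOLD = 500_000  # assumed value for illustration

# Boolean "major city" feature in place of the raw city string.
df["MajorCity"] = df["City"].map(city_population).fillna(0) > MAJOR_CITY_THRESHOLD
df = df.drop(columns=["City", "Vin"])  # VIN is unique per vehicle: no predictive value

# One-hot encode the remaining categorical features for frameworks
# (e.g. XGBoost) that expect purely numeric inputs.
df = pd.get_dummies(df, columns=["Make", "Model", "State"])
```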

For example, the string "genesis" appeared as a value in more than one categorical column. Such names were hence filtered out and renamed to work with these frameworks.

Analysing Linearity in Dataset

To analyze the degree to which our features are linearly related to price, we plotted the price against two features.

Fig 2: (Left) Price vs Year scatter plot. (Right) Price vs Mileage scatter plot.

Methodology

We utilized several classic and state-of-the-art methods, including ensemble learning techniques, with a 90%-10% train-test split. For the ensemble methods, we used 500 estimators. For Random Forest and Gradient Boost, the open-source Scikit-Learn package [7] was used.
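The 90%-10% split can be reproduced with Scikit-Learn; the random seed here is arbitrary.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Price"])
y = df["Price"]

# 90% training / 10% test, as used in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)
```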

1. Linear Regression

Linear regression was chosen as the first model due to its simplicity and comparatively small training time. The raw features, without any feature mapping, were used directly as the feature vectors. No regularization was used, since the results clearly showed low variance.

2. Random Forest

Random forest is an ensemble learning method which, as the name suggests, combines multiple decision trees to generate the ensemble model. The trees are trained in parallel and are relatively uncorrelated, thus producing good results, as each tree is not prone to the same individual errors. Bootstrap Aggregation, or bagging, provides the randomness required to produce robust and uncorrelated trees. This model also lets us compare a bagging technique with the following gradient boosting methods.
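Both baselines map directly onto Scikit-Learn estimators. The estimator count of 500 follows the methodology section above; the remaining hyperparameters are left at their defaults as an assumption.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Unregularized linear regression on the raw feature vectors.
lin_reg = LinearRegression().fit(X_train, y_train)
print("LinReg R2:", lin_reg.score(X_test, y_test))

# Bagged ensemble of 500 trees, built in parallel across all cores.
rf = RandomForestRegressor(n_estimators=500, n_jobs=-1)
rf.fit(X_train, y_train)
print("RF R2:", rf.score(X_test, y_test))
```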

3. Gradient Boost

Gradient Boosting is another decision-tree-based method that is generally described as "a method of transforming weak learners into strong learners". This means that, as in a typical boosting method, observations are assigned different weights, and based on certain metrics the weights of difficult-to-predict observations are increased; here the metric is the gradient of the loss function. This model was chosen to account for non-linear relationships between the features and predicted price, by splitting the data into 100 regions.

4. XGBoost

Extreme Gradient Boosting, or XGBoost [4], is one of the most popular machine learning models in current times. XGBoost is quite similar at its core to gradient boosting, but features many additions that significantly improve its performance, such as built-in support for regularization and parallel processing, as well as additional hyperparameters to tune, such as tree pruning. The algorithm was run on all cores in parallel.
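A sketch of both boosted models, assuming the same 500-estimator budget as the other ensembles; n_jobs=-1 mirrors the report's "all cores in parallel" note, and the remaining hyperparameters are illustrative defaults.

```python
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb

# Scikit-Learn gradient boosting; estimator count assumed.
gbr = GradientBoostingRegressor(n_estimators=500)
gbr.fit(X_train, y_train)

# XGBoost with built-in L2 regularization, run on all cores.
xgb_reg = xgb.XGBRegressor(n_estimators=500, reg_lambda=1.0, n_jobs=-1)
xgb_reg.fit(X_train, y_train)
print("XGBoost R2:", xgb_reg.score(X_test, y_test))
```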

5. LightGBM

LightGBM [5] is another gradient-boosting-based framework which is gaining popularity due to its higher speed. Unlike XGBoost, LightGBM uses leaf-wise tree growth instead of a level-wise approach, resulting in faster convergence. It also handles categorical features natively [6], thus eliminating the need to one-hot vectorize them; in our case these were Make, Model and State. This algorithm was also run on all cores in parallel.

6. KMeans + Linear Regression

In order to capitalize on the linear regression results and the apparent categorical linearity in the dataset, we built an ensemble method which used KMeans clustering of the features followed by linear regression. A three-cluster model was chosen; the dataset was classified into these three clusters and passed through a linear regressor trained on each of the three training sets.
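The KMeans + Linear Regression ensemble can be sketched as follows: cluster the training features with K=3, fit one linear regressor per cluster, and route each test sample to the regressor of its assigned cluster. Details such as the random seed are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

K = 3  # three clusters, per the report
kmeans = KMeans(n_clusters=K, random_state=0).fit(X_train)

# Fit one linear regressor on each cluster's slice of the training set.
regressors = []
for k in range(K):
    mask = kmeans.labels_ == k
    regressors.append(LinearRegression().fit(X_train[mask], y_train[mask]))

# Route each test sample to the regressor of its assigned cluster.
labels = kmeans.predict(X_test)
preds = np.empty(len(X_test))
for k in range(K):
    mask = labels == k
    if mask.any():
        preds[mask] = regressors[k].predict(X_test[mask])
```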

7. Neural Network (MLP Regressor)

To introduce more complexity in the model, the MLP regressor [7], which uses a deep neural network based multi-layer perceptron regressor model, was used (a configuration sketch follows the results table below). The batch size was set at 200, and ReLU was used as the activation function.

Results

R2 is a statistical measure of how close the data are to the fitted regression line. The R2 scores and training times of our models are summarized below (the score values did not survive extraction in this copy):

Learning Algorithm   | R2 Score on Test Data | R2 Score on Training Data | Training Time
Linear Regression    | -                     | -                         | 15 minutes
Gradient Boost       | -                     | -                         | 130 minutes
Random Forest        | -                     | -                         | 75 minutes
Light GBM            | -                     | -                         | 104 seconds
XGBoost              | -                     | -                         | 180 minutes
KMeans + LinReg      | -                     | -                         | 70 minutes
Deep Neural Network  | -                     | -                         | 10 hours

Compared to Linear Regression, most decision-tree-based methods did not perform comparably well.
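The MLP configuration described above corresponds to Scikit-Learn's MLPRegressor. The batch size and activation come from the report, while the hidden-layer sizes and iteration budget are placeholders (the learning rate value was lost in this copy, so the sklearn default is used).

```python
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(100, 100),  # assumed architecture; not given in this copy
    activation="relu",              # ReLU, per the report
    batch_size=200,                 # batch size 200, per the report
    max_iter=500,                   # assumed iteration budget
)
mlp.fit(X_train, y_train)
print("MLP R2:", mlp.score(X_test, y_test))
```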

We experimented with limiting the depth of trees to different values, and it was observed that increasing the depth limit beyond 36 yielded no further improvement. LightGBM performed marginally better than XGBoost while also having a significantly faster training time. Building up from the relatively good performance of Linear Regression, the KMeans + Linear Regression ensemble learning method (with K=3) produced the best R2 score on test data without high variance. The deep neural network, in contrast, required very long training, particularly with small batch sizes.

