Microeconometric Analysis of Household Consumption Expenditures: LAD-LASSO Method
Kadriye Hilal Topal, Ebru Çağlayan Akay
This study examined how supervised machine learning methods help select the relevant variables of a Household Budget Survey Consumption Expenditures dataset with outliers, in order to improve the prediction and forecasting performance of the Household Consumption Expenditures Model. To this end, the Household Budget Survey Consumption Expenditures dataset of Turkey for 2018 was examined using the classical regression method together with the Least Absolute Deviation (LAD), Least Absolute Shrinkage and Selection Operator (LASSO), and LAD-LASSO methods, and the prediction and forecasting performances of the methods were compared. According to the results of the analysis, it was concluded that the LAD-LASSO machine learning method, which enables the selection of variables while obtaining robust predictors in the presence of long-tailed errors, was the most successful method in prediction performance and forecasting accuracy. Additionally, several fundamental variables such as income, saving, and household size increase household consumption expenditures in all models. In addition to these variables, other variables including the structure of a room, the kitchen, bathroom floors, heating, air conditioning preferences, energy sources used, detached house, apartment, cottage, vineyard ownership, investment preferences, credit card usage, and internet shopping habits were selected as determinants of household consumption expenditures in the LAD-LASSO model. From the results of the study, it was found that machine learning algorithms can be used to select the most appropriate variables when constructing microeconometric models.
Hanehalkı Tüketim Harcamalarının Mikroekonometrik Analizi: LAD-LASSO Yöntemi
Kadriye Hilal Topal, Ebru Çağlayan Akay
The aim of this study is to examine how supervised machine learning methods help us select the relevant variables of the Household Budget Survey household dataset, which contains outliers and long-tailed errors, and to identify the model with the best prediction and forecasting performance for estimating Turkey's Household Consumption Expenditures. To this end, the 2018 Household Budget Survey household dataset of Turkey was examined using the classical regression method as well as the Least Absolute Deviation (LAD), Least Absolute Shrinkage and Selection Operator (LASSO), and LAD-LASSO methods, and the prediction and forecasting performances of the methods were compared. According to the results of the analysis, the LAD-LASSO machine learning method, which allows variable selection while yielding robust estimators in the presence of long-tailed errors, was found to be the most successful method in terms of prediction performance and forecasting accuracy. In addition, some fundamental variables such as income, saving, and household size increase household consumption expenditures in all models. Beyond these, various variables such as the structure of the room, kitchen and bathroom floors, heating and air conditioning preferences, energy sources used, ownership of a detached house, apartment, summer house, or vineyard, investment preferences, credit card usage, and internet shopping habits were selected as determinants of household consumption expenditures in the LAD-LASSO model. The results of the study provide evidence that machine learning algorithms can be used to select the necessary variables when constructing microeconometric models. This study is derived from a doctoral thesis.
Household consumption expenditures play an important role both in indicating the economic development levels of countries and, once their socioeconomic determinants are identified, in shaping rational production policies. The literature contains many studies on consumption expenditures. Although these studies aimed to select the variables that determine consumption and to obtain the most appropriate statistical and econometric model, they used different sets of variables.
Least Squares (LS) regression is one of the most widely used estimation methods, but LS estimators give unrealistic predictions in the presence of long-tailed errors, so LAD estimators are often used instead. However, since the number of variables in large data sets is high and the number of candidate models increases exponentially, selecting the best model becomes computationally infeasible. For this reason, Wang, Li, and Jiang (2007) developed the LAD-LASSO method, which enables best model selection using a LASSO-type penalty while obtaining robust estimators in the presence of outliers and long-tailed errors. The Household Budget Survey Consumption Expenditures dataset of Turkey contains both a great number of observations and many variables. Since the income distribution is not homogeneous in Turkey, household consumption expenditure does not show a homogeneous structure either. Therefore, the LAD-LASSO, a penalty-based machine learning method that performs dimension reduction, was used in the analysis of the Household Budget Survey household data set in this study.
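To make the idea of combining an absolute-deviation loss with a LASSO-type penalty concrete, the following is a minimal sketch and not the authors' implementation. It assumes the criterion is written as sum_i |y_i - b0 - x_i'b| + n * sum_j lam_j * |b_j|, with an unpenalized intercept, and solves it as a linear program with SciPy. The function name `lad_lasso` and the parameter `lam` are hypothetical names chosen for illustration; the study does not state which software or tuning values were used.

```python
import numpy as np
from scipy.optimize import linprog


def lad_lasso(X, y, lam):
    """Illustrative LAD-LASSO fit (assumed criterion, not the paper's code).

    Minimises  sum_i |y_i - b0 - x_i'b| + n * sum_j lam_j * |b_j|
    by rewriting it as a linear program. `lam` may be a scalar or a
    length-p vector of coefficient-specific penalties.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (p,))

    # Non-negative LP variables, in this order:
    #   b0+, b0-      intercept split into positive and negative parts
    #   b+ (p), b- (p) slopes split the same way, so |b_j| = b_j+ + b_j-
    #   u (n), v (n)   residual split, so |residual_i| = u_i + v_i
    ones = np.ones((n, 1))
    A_eq = np.hstack([ones, -ones, X, -X, np.eye(n), -np.eye(n)])

    # Cost: zero on the intercept, n*lam_j on each slope part, 1 on residual parts.
    cost = np.concatenate([[0.0, 0.0], n * lam, n * lam, np.ones(2 * n)])

    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    z = res.x
    intercept = z[0] - z[1]
    coef = z[2:2 + p] - z[2 + p:2 + 2 * p]
    return intercept, coef
```

Calling `lad_lasso(X, y, 0.0)` reduces to plain LAD regression, while increasing `lam` pushes more slope coefficients to zero, which is the variable selection behaviour described above. In practice the penalty would be chosen by a data-driven criterion; that choice is left open here.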
This study examined how supervised machine learning methods help select the relevant variables of the Household Budget Survey Consumption Expenditures dataset with outliers, in order to improve the prediction and forecasting performance of the Household Consumption Expenditures Model. Since the main purpose of a penalty-based variable selection method is estimation only, and causal and statistical inferences cannot be made from the resulting models, the results of the LAD-LASSO regression were evaluated in terms of variable selection and modeling.
In the study, the Household Consumption Expenditure model was first estimated with the LS method, and diagnostic tests were applied to investigate deviations from the assumptions and the presence of outliers. To detect outliers, diagnostic tests were applied to the standardized and studentized residuals, and outliers were detected in 410 observations. In addition to the LASSO regression, the LAD and LAD-LASSO methods, which yield robust estimators in the presence of outliers and long-tailed errors, were estimated. The results were compared and interpreted. The prediction performance of the LS and LASSO models was compared using the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-Squared (R²) criteria, which gave very similar results.
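The three comparison criteria named above can be sketched as follows, using their standard textbook definitions. This is an illustrative helper, not the authors' code; the function name `prediction_metrics` and the example numbers are hypothetical.

```python
import math


def prediction_metrics(y_true, y_pred):
    """Return RMSE, MAE and R-squared for paired observed/predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares and total sum of squares around the mean.
    ss_res = sum(e * e for e in errors)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)

    return {
        "rmse": math.sqrt(ss_res / n),          # penalises large errors quadratically
        "mae": sum(abs(e) for e in errors) / n,  # penalises errors linearly
        "r2": 1.0 - ss_res / ss_tot,             # share of variance explained
    }


# Made-up numbers: one badly missed observation out of four.
metrics = prediction_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
```

In this toy case a single large miss doubles the RMSE relative to the MAE, which illustrates why squared-error criteria are sensitive to outlying observations.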
According to the results of the analysis, it was concluded that the LAD-LASSO machine learning method, which enables the selection of variables while obtaining robust predictors in the presence of long-tailed errors, is the most successful in prediction performance and forecasting accuracy. Several fundamental variables such as income, saving, and household size increased household consumption expenditures in all models. In addition to these variables, other variables including the structure of a room, the kitchen, bathroom floors, heating, air conditioning preferences, energy sources used, detached house, apartment, cottage, vineyard ownership, investment preferences, credit card usage, and internet shopping habits were selected as determinants of household consumption expenditures in the LAD-LASSO model. From the results of the study, it was found that machine learning algorithms can be used to select the most appropriate variables when constructing microeconometric models. Although penalty-based machine learning methods are successful in determining the model in data sets with a large number of variables, they should be used carefully because they make predictions based on correlation rather than causality.