README Part 2: Guidance for Bottom-Up Tax Gap Toolkit

Part 2: Machine Learning

1 Overview

This toolkit, developed by UNU-WIDER/ATI, provides a comprehensive framework for conducting bottom-up tax gap analyses. It uses machine learning algorithms to estimate Value Added Tax (VAT), Personal Income Tax (PIT), and Corporate Income Tax (CIT) gaps, and is designed specifically for revenue authorities.

The toolkit consists of two code files. The first, Data Management, cleans the tax return and audit data and merges them so they are ready for the machine learning analysis carried out by the second code file.

Given the steps and recommendations in the data management file, this document assumes the data structure advised there: a separate database for each tax type (VAT, CIT, or PIT). If you cannot differentiate between tax types, the process is still valid, but the variable to predict becomes the overall misreporting amount, and some assumptions are needed to obtain a disaggregated estimate for each tax type. Where applicable, we indicate the procedure to follow if you only have an aggregate measure of misreporting.

Version:

Date: 28/06/2024

Prerequisites:

To use this toolkit, you must have Stata installed on your computer. Additionally, the data requirements depend on the tax gap you want to analyze: whether it is CIT, VAT, or PIT, the data should be cleaned and merged with the audit data following Data Management Part 1 of the toolkit.


2 First Stage: Stata Environment Setup

2.1 Installing Random Forest Package

The procedure uses Random Forest algorithms; the code follows Schonlau and Zou (2020).

Line 48: installs the rforest package from the SSC (Statistical Software Components) archive, essential for executing the Random Forest model in Stata.

ssc install rforest

2.2 Preparing Stata Environment

Lines 51 to 57 of the code configure the environment to ensure smooth and efficient script execution by disabling output pauses, clearing matrices, setting wider output lines and larger matrix sizes, and safely closing any open log files.

set more off
clear all
clear matrix
set linesize 200
set matsize 800
cap log close
version 14  /* Use Stata Version 14 */

Note 1:

The command ‘version 14’ makes Stata interpret and run the code according to the syntax and features available in Stata version 14, regardless of the version actually running. The data files are also saved in Stata 14 format so that anyone with a lower software version can access the saved files and results.

2.3 Setting Up Global Directory Paths

To enhance the code’s flexibility and ease of updating, we employ a “global” command. This command acts as a macro, allowing for the storage of file paths that can be reused throughout the Stata session.

Here we assume that you will use the same working directory as in the first part of the toolkit.

Note 2:

In the dofile, at line 61, insert your working folder path between the quotation marks.


global dir "<YOUR PATH>"

global data     "$dir\Data"
global graphs   "$dir\Graphs"
global xlsout   "$dir\Excel"
global dopath   "$dir\Dofile"
global results  "$dir\Results"
global log      "$dir\Log"

2.4 Importing Data

The following code line loads your database; replace ‘ZZ’ with the name of the database. This is the CIT, PIT, or VAT data created with Data Management Part 1 of the toolkit.

use "$data/ZZ"   

3 Second Stage: Choosing the Machine Learning Model

3.1 Setting Up Parameters

Two critical parameters need to be chosen before running the machine learning (ML) model:

  1. The number of iterations in each forest branch.
  2. The number of independent variables randomly selected at each split.

In line 84, insert “region” as the variable identifying the area where audits were conducted. YY and TT stand for the “values” of the region variable. For example, if a country has 4 regions, CC, BB, NN, and MM, but we only have audit data for the first three, the command must be: generate reg_audit = (region == CC | region == BB | region == NN). Note that if the region variable is composed of names (strings), you must enclose the values in quotation marks, which gives: generate reg_audit = (region == "CC" | region == "BB" | region == "NN").

generate reg_audit = (region ==YY | region == TT)

Note 3:

This variable should specifically identify the geographic zones where audit data is available. If your country comprises multiple zones but you only have audit data for a subset, the variable must reflect those audited areas. This variable serves as an indicator, taking the value 1 if a geographic zone has audit data, and 0 if it does not. It’s important to remember that the definition of the area can vary by country—whether it’s a region, a tax office jurisdiction, or another type of geographic area—what matters is that it accurately represents where audits were conducted.
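For intuition, the indicator logic can be sketched in a few lines of Python (the region names are hypothetical, not from the toolkit):

```python
# Hypothetical region codes; the first three have audit data, MM does not
audited_regions = {"CC", "BB", "NN"}
regions = ["CC", "MM", "BB", "NN", "MM"]

# reg_audit is 1 where the region has audit data, 0 otherwise,
# mirroring: generate reg_audit = (region == "CC" | region == "BB" | region == "NN")
reg_audit = [1 if r in audited_regions else 0 for r in regions]
```

Whatever the geographic unit (region, tax office jurisdiction, or other), the indicator takes value 1 only where audit data exist.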

In line 88, insert “XX” as the variable that captures the tax that taxpayers did not pay, which is used to estimate the gap (VAT, CIT, or PIT). Generally, this amount comes from audits. This variable cannot have missing values.

generate evaded = XX

Note 4:

This variable can contain negative values when taxpayers declare more than they owe. In that case, the method will learn which observations correspond to negative values and use this information in the prediction. To obtain a more accurate prediction, you must keep the negative values. A similar logic applies to outliers: given the flexibility of the estimation procedure, the methodology accommodates potential outliers, so you do not need to remove unusual values from the data. However, it is always recommended to examine the data, identify unusual values, determine their direction (positive or negative), and assess whether they stem from reporting mistakes.
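As a complement to that manual inspection, here is a minimal Python sketch (standard library only, made-up numbers) that flags unusual values with the interquartile-range rule while leaving ordinary negative values alone:

```python
import statistics

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; moderate negatives survive."""
    qs = statistics.quantiles(values, n=4)          # Q1, median, Q3
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lo or v > hi]

# Toy evaded amounts: legitimate negatives plus one extreme value
evaded = [120, -35, 90, 15000, 80, 60, -40, 110]
unusual = flag_outliers(evaded)
```

Flagged observations should be inspected for reporting mistakes, not automatically dropped.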

Next, we retain only the audited periods in audited tax regions; this restriction of the sample is necessary to determine the two critical parameters.

keep if evaded !=. & reg_audit==1 

Note 5:

We keep only the observations with audit data, i.e. the firms audited in a given year. Later, we will use the entire sample to estimate the model and make predictions, but for tuning the model we need only the audited observations.

3.2 Splitting the Data

The sample is divided into two:

  1. Training sample: This is the part of the sample where we run the model. In this sense, we are training the model on this subsample.

  2. Testing sample: In this part, we will compare the predictions from the model with the actual value of the variable. In this sense, we will test our predictions on this subsample.

To create these samples, we use the command “splitsample.” This code generates a variable named “u,” which takes on values of 1 and 0 to distinguish between the two samples. Additionally, the code ensures that both samples maintain a balance in the evasion variable, so they have similar statistics (for example, the mean) for this variable.

Note 6:

For more explanation about the Random Forest algorithm, see the “Random Forest” appendix in the technical note.

Line 92: We set a seed before splitting the sample, so that every time you run this code the random sampling yields the same results, ensuring a consistent division into training and testing groups.

set seed 201807     

Then, code line 93 splits the sample in a balanced way, ensuring the evaded variable has similar statistics in both groups, and creates an indicator variable u to differentiate them.

splitsample, gen(u) split(1 1) balance(evaded) rround

Line 94: We sort the dataset by the newly created variable u (training/testing indicator).

sort u, stable  

Based on the random variable, the testing sample is indicated by the variable u when u==0, and the training sample when u==1.
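Conceptually, a balanced split pairs observations with similar values of the balance variable and sends one member of each pair to each half, so both halves end up with similar means. The Python sketch below illustrates this idea with a fixed seed for reproducibility; it is an illustration of the concept, not the actual splitsample algorithm:

```python
import random

def balanced_split(records, key, seed=201807):
    """Split records 50/50 so both halves have similar statistics for `key`.
    Sort on the balance variable, then randomly assign within adjacent pairs."""
    rng = random.Random(seed)                        # fixed seed: same split every run
    ordered = sorted(records, key=key)
    groups = {0: [], 1: []}                          # 0 = testing, 1 = training
    for i in range(0, len(ordered), 2):
        pair = ordered[i:i + 2]
        rng.shuffle(pair)                            # randomize which half gets which member
        for u, rec in zip((0, 1), pair):
            groups[u].append(rec)
    return groups

data = [{"evaded": float(v)} for v in range(100)]    # toy evasion amounts
halves = balanced_split(data, key=lambda r: r["evaded"])
```

Because each pair straddles similar values, the two halves differ in mean evasion by at most the within-pair spread.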

Sample Split Recommendation

Our recommendation is to split the sample into 50% for testing and 50% for training.

Note 7:

We will tune a predictive model to estimate non-collected taxes, specifically VAT, CIT, or PIT. In this context, the variable we want to predict is the tax amount, and we will use other relevant variables for this purpose. These variables can include any information related to the individual or entity in question (e.g., a firm or a person).

The methodology will assess the relevance of each variable, identifying the most important ones for accurate predictions. For example, if you have access to the complete forms filled out by agents regarding tax, industry, business, and geographical information, you can leverage all these variables to enhance your prediction’s accuracy. The algorithm will determine the “weight” of each variable automatically. Some variables may turn out to be irrelevant; in such cases, their weight will be zero.

The model’s effectiveness in predicting tax misreporting will guide these weight assignments. If you already know that a particular variable is non-informative, you can choose to omit it from your analysis. However, it is advisable to include as much relevant information as possible that you believe could assist in predicting the uncollected tax. Do keep in mind that incorporating a large number of variables may slow down the estimation process, as each variable must be evaluated.

In the code, we use the notation /VARIABLES/ to refer to the set of variables that help us to make a more accurate prediction.

Attention: Please remember to document the optimal parameters obtained in Sections 3.3 and 3.4. These numbers are essential for running the model and achieving more accurate predictions later.

3.3 Tuning the First Parameter: Number of Iterations

In this part of the code, from line 121 to line 151, we use the rforest command to estimate the number of iterations. The estimation runs the variable “evaded” against the relevant variables to find the combination that most accurately predicts the unpaid tax.

Note 8:

Remember that the method selects the relevant variables itself, so you can include as many as you want, keeping in mind that this affects the estimation time.

Line 121: Create an auxiliary variable to capture the error in the training sample.

generate out_of_bag_error1 = .  

Line 122: Create an auxiliary variable to capture the error in the testing sample.

generate validation_error = .

Line 123: Create an auxiliary variable to store the iteration number of each run, used to choose the critical parameters.

generate iter1 = .      

Line 124: Define a local macro j as a counter and initialize it to 0.

local j = 0 

Line 126: Store the current time in a scalar variable t1. This is done to measure the time in this step.

scalar t1 = c(current_time)

Lines 128 to 136: Run a loop iterating from 10 to 500 with a step size of 5. In each iteration, fit a random forest model using the rforest command and store the number of iterations (iter1), the out-of-bag error (out_of_bag_error1), and the validation root mean squared error (validation_error), in order to compare model performance and determine the optimal number of iterations.

Note 9:

In line 130, replace /VARIABLES/ with relevant tax declaration variables. For example, in the VAT case: exempted purchases and sales, zero-rated purchases and sales, taxable purchases and sales, etc. All the variables that firms must declare in the VAT declaration. For other tax types, it is the same: fill in all the variables firms declare in the corresponding form. Be aware that increasing the number of variables has a computational cost, which is the time to obtain the optimal parameter. If you have a solid argument for discarding some variables from the tax declaration, you can do that. The critical element here is to include all the variables that can explain the misreport. Additionally, you must include geographical, time, and firm data. For instance, year, firm size, the location where it operates, industry-level variables, etc.

The required variables must be numeric, not labels (strings in Stata terminology). That is, each value of the variable must be a number. Sometimes variables describe values with names (letters), for instance an industry variable containing “Agriculture”. To solve this, use this command:

encode variable, generate(new_variable)

In the command, specify the variable that has labels and give the new variable a distinct name. Our advice is to reuse the original name with a suffix; for instance, if the variable is industry:

encode industry, generate(industry_1)
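For intuition, encode maps the sorted distinct labels to integer codes starting at 1. A small Python sketch of the same mapping, using a hypothetical industry variable:

```python
def encode(values):
    """Map each distinct label to an integer code, like Stata's encode
    (codes follow alphabetical order and start at 1)."""
    labels = sorted(set(values))
    codes = {lab: i + 1 for i, lab in enumerate(labels)}
    return [codes[v] for v in values]

industry = ["Agriculture", "Mining", "Agriculture", "Services"]
industry_1 = encode(industry)                        # numeric version of the label variable
```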

forvalues i = 10(5)500{
local j = `j' + 1
rforest evaded /VARIABLES/ if u==1, type(reg) iter(`i') numvars(1)
quietly replace iter1 = `i' in `j'
quietly replace out_of_bag_error1 = `e(OOB_Error)' in `j'
predict pred if u==0
quietly replace validation_error = `e(RMSE)' in `j'
drop pred
}

Line 138: Store the current time in a scalar variable t2. This is done to measure the time in this step.

scalar t2 = c(current_time)

Line 140: Calculate and display the time spent in the first estimation in seconds.

display (clock(t2, "hms") - clock(t1, "hms")) / 1000 " seconds" 

Lines 142 to 144: Label the variables for clarity in the outputs.

label variable out_of_bag_error1 "Out-of-bag error"
label variable iter1 "Iterations"
label variable validation_error "Validation RMSE"

In the following part of the code, specifically in lines 149 to 151, we plot the out-of-bag error and the validation error. The number of iterations is chosen depending on when both variables stabilize.

For example, if after 300 iterations, both variables oscillate around some number, you can select 300 as the critical number of iterations, or any number greater than 300.

graph twoway (scatter out_of_bag_error1 iter1, mcolor(blue) msize(tiny)) ///
(scatter validation_error iter1, mcolor(red) msize(tiny))
graph save Graph "$graphs/ml_iter.gph", replace
graph export "$graphs/ml_iter.pdf", as(pdf) replace
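The “choose the point where both curves stabilize” rule can also be checked numerically. The sketch below (Python, with an invented error curve and an arbitrary tolerance, purely illustrative) returns the first iteration after which the validation error stays within a small relative band:

```python
def first_stable_iteration(iters, errors, window=5, tol=0.01):
    """Return the first iteration after which the error stays within a small
    relative band (tol) over the next `window` points, i.e. the curve flattened."""
    for i in range(len(errors) - window):
        chunk = errors[i:i + window + 1]
        if (max(chunk) - min(chunk)) / abs(chunk[0]) < tol:
            return iters[i]
    return iters[-1]                                 # never stabilized: take the last point

# Invented validation-error curve: decays smoothly toward a plateau near 1.0
iters = list(range(10, 501, 5))
errors = [1.0 + 10.0 * 0.5 ** (t / 30) for t in iters]
chosen = first_stable_iteration(iters, errors)
```

In practice, a visual reading of the plot is usually enough; any iteration count at or beyond the stabilization point is acceptable.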

The following Figure 1 shows the model performance over iterations for the first parameter.

Figure 1: Out-of-bag error and Validation RMSE over Iterations for the first parameter


3.4 Tuning the Second Parameter: Number of Variables to Use in Each Split

In this section of the code, we estimate the second crucial parameter: the number of variables to use for each split. During each iteration, the model selects a certain number of variables to predict evasion. We need to optimize this process by choosing the amount that minimizes prediction error.

Note 10:

  • We run the same model but iterate over the number of variables; suppose the total number of candidate variables is VV. You also need to insert the optimal number of iterations: we assume it is 300 below, but use the number you obtained in the previous step.

Line 168: Create an auxiliary variable to capture the error in the training sample.

generate oob_error = .

Line 169: Create an auxiliary variable for the number of variables that are chosen in each split.

generate nvars = .

Line 170: Create an auxiliary variable to capture the error in the testing sample.

generate val_error = .

Line 171: Define a local macro j and set its initial value to 0. This macro will be used as a counter in subsequent operations.

local j = 0

Line 173: Drop the frame mydata2 if it exists, without stopping execution if it does not.

capture frame drop mydata2

The loop in the code, from lines 176 to 184, iterates over a range of values for the number of variables to consider in each split of a random forest model. During each iteration, the model is trained, and pertinent information such as the number of variables (nvars), out-of-bag error (oob_error), and validation root mean squared error (val_error) is stored. This facilitates the comparison of model performance across various numbers of variables and assists in selecting the optimal number of variables for the model.

Note 11:

In line 178, “/VARIABLES/” designates the variables to be incorporated into the model. You have to list the variables you intend to utilize within the model here. This list of variables is the same as the one you used in section 3.3.

forvalues i = 1(1)VV{
local j = `j' + 1
rforest evaded /VARIABLES/ if u==1, type(reg) iter(300) numvars(`i')
quietly replace nvars = `i' in `j'
quietly replace oob_error = `e(OOB_Error)' in `j'
predict pre if u==0
quietly replace val_error = `e(RMSE)' in `j'
drop pre
}

The code lines 186 and 188 are used to measure and display the time spent in the estimation, as previously explained.

Lines 191-193: Assign labels to the variables oob_error, nvars, and val_error for clarity in the output.

label variable oob_error "Out-of-bag error"
label variable nvars "Number of variables"
label variable val_error "Validation RMSE"

The following part of the code generates the message: “Minimum Error: XX; Corresponding Number of Variables: YY.” This provides a direct reading of the second critical parameter, the corresponding number of variables (YY). Please take note of this number for use when estimating the model: for example, if the minimum error is obtained with 20 variables, choose 20 as the second critical parameter. Before plotting, a short piece of code displays the minimum error value and the number of variables that attains it.

frame put val_error nvars, into(mydata2)

frame mydata2{
    sort val_error, stable
    local min_val_err = val_error[1]
    local min_nvars = nvars[1]
}

frame drop mydata2
display "Minimum Error: `min_val_err'; Corresponding number of variables `min_nvars'"
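The frame steps above amount to taking an argmin: sort by validation error and read the first row. The same operation in a short Python sketch with hypothetical error values:

```python
val_error = [5.2, 4.8, 4.5, 4.41, 4.43, 4.6, 4.7]    # hypothetical validation RMSE for nvars = 1..7
nvars = [1, 2, 3, 4, 5, 6, 7]

# min() over (error, nvars) pairs mirrors sorting the frame and taking the top row
min_val_err, min_nvars = min(zip(val_error, nvars))
message = f"Minimum Error: {min_val_err}; Corresponding number of variables {min_nvars}"
```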

The following Stata screenshot shows the minimum validation error and the corresponding number of variables.

Figure 2: Stata screenshot showing the minimum validation error and the corresponding number of variables


Lines 212-214: Create the plot showing the out-of-bag error and validation error against the number of variables, then save the graph and export it as a PDF. The second critical number is chosen as the number of variables that minimizes the error in the testing sample.

graph twoway (scatter oob_error nvars, mcolor(blue) msize(tiny)) (scatter val_error nvars, mcolor(red) msize(tiny))
graph save Graph "$graphs/ml_var.gph", replace
graph export "$graphs/ml_var.pdf", as(pdf) replace

The following Figure 3 shows the model performance over iterations for the second parameter.

Figure 3: Out-of-bag error and Validation RMSE over Iterations for the second parameter


Note 12:

You have already determined the key parameters for the machine learning model. In Section 3.3, you estimated the optimal number of iterations, and in Section 3.4, you identified the optimal number of variables for each split. From this point on, use these parameters in your model, as they yield the most accurate predictions.


4 Third Stage: Testing the Model

The following part of the code is used for testing the ML model, specifically a random forest model, and comparing its performance with an OLS (Ordinary Least Squares) model using the same variables, with RMSE (Root Mean Squared Error) as the focus metric.

Note 13:

Here, we assume that the critical parameters are 300 and 20, but you need to replace them with the optimal numbers you obtained.

Lines 226-227: Try to drop the frame mydata2 if it exists, without stopping execution if it is not found, and store the current time in a scalar variable t5 to later calculate the duration of the model training.

capture frame drop mydata2
scalar t5 = c(current_time)

Line 229: Fits the random forest model (rforest) with the dependent variable “evaded” and the specified independent variables (/VARIABLES/). The model is trained on the training sample. The number of iterations (iter) is set to 300 and the number of variables considered at each split (numvars) to 20; replace both numbers with the optimal values obtained in the previous stage (Sections 3.3 and 3.4).

rforest evaded /VARIABLES/ if u==1, type(reg) iter(300) numvars(20)

In lines 231-233: The current time is captured again in t6 after the model runs. The difference between t6 and t5 is calculated and displayed in seconds, showing how long the model took to train.

scalar t6 = c(current_time)

display (clock(t6, "hms") - clock(t5, "hms")) / 1000 " seconds"

Lines 236-237: Display the out-of-bag error rate stored in e(OOB_Error) and save it in a scalar variable named oob_error.

display e(OOB_Error)
scalar oob_error = `e(OOB_Error)'

4.1 Prediction and R-square Calculation

Lines 240-244: Predictions are made for the entire sample using the trained model. The correlation coefficient (rho) between the actual values (evaded) and the predicted values (tot_pred) is calculated on the training subset and squared to obtain the coefficient of determination (R²) for the training data.

predict tot_pred    

corr evaded tot_pred if u==1 
scalar r_ml = `r(rho)'
di r(rho)^2 
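The quantity di r(rho)^2 reports is simply the squared Pearson correlation between actual and predicted values. A self-contained Python sketch with toy numbers:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, the analogue of r(rho) after Stata's corr."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

actual = [10.0, 20.0, 30.0, 40.0, 50.0]              # toy evaded values
predicted = [12.0, 18.0, 33.0, 41.0, 47.0]           # toy model predictions

r_squared = pearson(actual, predicted) ** 2          # same quantity as `di r(rho)^2`
```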

4.2 Testing Data Prediction and Evaluation

Lines 247-254: Predictions are then made for the testing subset (u==0). The RMSE of these predictions is extracted and stored. The correlation coefficient is again calculated for this subset and squared to determine the R² for the testing data.

predict pre if u==0 
display e(RMSE)
scalar rmse_ml = `e(RMSE)'


corr evaded pre if u==0 
scalar r_ml2 = `r(rho)'
di r(rho)^2     

4.3 Plotting the Importance Scores of Variables Used in a Model

In lines 259-276, the code extracts the variable importance scores from the random forest model and stores them in a matrix named importance2. This matrix is then converted into a Stata dataset for further analysis. Afterward, the code generates a horizontal bar plot using the graph hbar command to visualize the mean importance score for each variable. To keep the plot readable, it only includes variables with an importance score above a threshold set by the local min_importance, denoted ii in the code; set this value between 0 and 1. Finally, the code saves the resulting graph both in Stata’s graph file format and as a PDF.

Note 14:

The plot shows the relevance of each variable, with the most relevant variables having an index close to 1. You can adjust the minimum relevance threshold for inclusion in the plot by changing the value of ‘min_importance’ (ii) to a number between 0 and 1 in code line 263. In the example in Figure 4, we set this value to 0.1%, which includes only variables with an importance greater than 0.1%.

matrix importance2 = e(importance)
svmat importance2
generate importid2=""

local min_importance = ii

local mynames: rownames importance2
local k: word count `mynames'

if `k'>_N {
 set obs `k'
}

forvalues i = 1(1)`k' {
    local aword: word `i' of `mynames'
    local alabel: variable label `aword'
    if ("`alabel'"!="") qui replace importid2= "`alabel'" in `i'
    else quietly replace importid2= "`aword'" in `i'
}

graph hbar (mean) importance2 if importance2 > `min_importance',///
over(importid2, sort(1) label(labsize(2))) ytitle(Importance) ///
note("Only variables with an importance ///
score larger than  `min_importance' % are reported.") 
graph save Graph "$graphs/ml_var_rel.gph", replace
graph export "$graphs/ml_var_rel.pdf", as(pdf) replace

With the following commands in lines 286-287, we save the importance index for each variable. With this information, you can analyze how relevant each variable is for predicting the unpaid tax.

putexcel set "$xlsout/ml_var_rel_tot.xlsx", replace
putexcel A3 = matrix(importance2), rownames nformat(number_d2)

The following Figure 4 shows the importance scores of variables used in the random forest model. The plot highlights the variables with an importance score larger than a specified threshold.

Figure 4: Importance Scores of Variables in the Random Forest Model


This figure displays the importance scores of the variables used in the random forest model, indicating which variables contribute most to the model’s predictions. Variables with higher importance scores are more influential in determining the model’s output. Only variables with an importance score larger than 0.1% are reported.

4.4 Comparison with OLS Regression

The following part of the code compares the performance of an ordinary least squares (OLS) regression model with a machine learning (ML) model in terms of R-squared (R²) and root mean square error (RMSE) metrics.

Line 296: We run an OLS regression with evaded as the dependent variable and the other variables (/VARIABLES/) on the training observations.

quietly regress evaded /VARIABLES/ if u==1

Lines 297-299: Retrieve the R-squared value from the regression to compare it with the machine learning model on the training data, and store it in a scalar named r_ols. Additionally, store the root mean square error (RMSE) from the regression in another scalar named rmse_ols.

display e(r2)
scalar r_ols = `e(r2)'
scalar rmse_ols = `e(rmse)'

Lines 301-302: Generate predicted values of the dependent variable, evaded, for the testing observations using the regression model, storing the predictions in a new variable named ev. Then display the root mean square error (RMSE) to assess the model’s performance.

predict ev if u==0
display e(rmse)

Lines 305-307: Calculate the correlation coefficient (r(rho)) between the actual evaded values and the predicted ev values for the testing subset, store it in a scalar named r_ols2, and display its square, the R-squared (R²) for the testing data. This R² is used to compare the regression model with the machine learning model on the testing data.

corr evaded ev if u==0
scalar r_ols2 = `r(rho)'
di r(rho)^2 

Lines 310-312: Create a new variable diff_sqr2 storing the squared differences between the actual values (evaded) and the predicted values (ev), compute descriptive statistics for diff_sqr2 with the summarize command, and store the mean of the squared differences in a scalar named rmse_ols2. Note that this scalar holds the mean squared error; its square root is taken later when the comparison table is built.

generate diff_sqr2= (evaded - ev)^2
summarize diff_sqr2
scalar rmse_ols2 = `r(mean)'
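To make the naming explicit: the mean of diff_sqr2 is a mean squared error (MSE), and the RMSE used in the comparison table is its square root, applied there as sqrt(rmse_ols2). A minimal Python illustration with toy numbers:

```python
import math

evaded = [100.0, 250.0, 80.0, 300.0]                 # toy actual values
predicted = [110.0, 230.0, 95.0, 280.0]              # toy OLS predictions

diff_sqr = [(a - p) ** 2 for a, p in zip(evaded, predicted)]
mse = sum(diff_sqr) / len(diff_sqr)                  # what the scalar rmse_ols2 actually stores
rmse = math.sqrt(mse)                                # the RMSE reported in the table
```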

Lines 315-321: Calculate and store summary statistics for the evaded variable in both the training and testing samples. First summarize evaded for the training subset, storing the mean in a scalar named tra_mean and the standard deviation in sd; then summarize evaded for the testing subset, storing the mean in tes_mean and the standard deviation in sd2.

summarize evaded if u==1
scalar tra_mean = `r(mean)'
scalar sd = `r(sd)'

summarize evaded if u==0
scalar tes_mean = `r(mean)'
scalar sd2 = `r(sd)'

The above calculation compares the benefits of following the ML approach instead of a regression (OLS). This code creates a table comparing R-squared and RMSE from the training and testing data.

Comparison Table

Lines 325-378: Use the collect command to organize and format a table that compares the OLS and ML models based on the R-square and RMSE metrics.

collect clear

collect get row1 = oob_error, tags(col1[ML])

collect get row1 = rmse_ols, tags(col1[OLS])

collect get row2 = oob_error/tra_mean, tags(col1[ML])

collect get row2 = rmse_ols/tra_mean, tags(col1[OLS])

collect get row3 = oob_error/sd, tags(col1[ML])

collect get row3 = rmse_ols/sd, tags(col1[OLS])

collect get row4 = r_ml^2, tags(col1[ML])

collect get row4 = r_ols, tags(col1[OLS])

collect get row5 = tra_mean, tags(col1[ML])

collect get row5 = tra_mean, tags(col1[OLS])

collect get row6 = sd, tags(col1[ML])

collect get row6 = sd, tags(col1[OLS])

collect layout (result) (col1)


collect get row1 = rmse_ml, tags(col2[ML])

collect get row1 = sqrt(rmse_ols2), tags(col2[OLS])

collect get row2 = rmse_ml/tes_mean, tags(col2[ML])

collect get row2 = sqrt(rmse_ols2)/tes_mean, tags(col2[OLS])

collect get row3 = rmse_ml/sd2, tags(col2[ML])

collect get row3 = sqrt(rmse_ols2)/sd2, tags(col2[OLS])

collect get row4 = r_ml2^2, tags(col2[ML])

collect get row4 = r_ols2^2, tags(col2[OLS])

collect get row5 = tes_mean, tags(col2[ML])

collect get row5 = tes_mean, tags(col2[OLS])

collect get row6 = sd2, tags(col2[ML])

collect get row6 = sd2, tags(col2[OLS])

collect layout (result) (col2)

Lines 381-407: Add appropriate labels, style the table, and export it to an Excel file for easy access.

collect label levels result row1 "RMSE" row2 "RMSE/Mean (evasion)" row3 "RMSE/Std (evasion)" row4 "R-square" row5 "Mean (evasion)" row6 "Std (evasion)"
 
collect label dim col1 "Training"

collect label dim col2 "Test"

collect style header col1, title(label)

collect style header col2, title(label)

collect style column, dups(center)

collect style cell cell_type[corner], border(right, pattern(nil))

collect style cell col1#cell_type[column-header], border(bottom, pattern(single))

collect style cell col2#cell_type[column-header], border(bottom, pattern(single))

collect style cell result, border(right, pattern(nil))

collect title "Table. Summary Statistics"

collect style title, smcl(input)
 
collect layout (result) (col1 col2)

collect export "$xlsout/stats_comparison.xlsx"

Table 1 below is an example of the visualization. It summarizes the performance metrics for the machine learning (ML) and ordinary least squares (OLS) regression models, using R-square and RMSE (root mean squared error) on both the training and testing datasets.

Table 1: Summary Statistics

Metric                 Training ML   Training OLS   Test ML   Test OLS
RMSE                   482.95        509.11         409.18    502.15
RMSE/Mean (evasion)    0.450         0.474          0.386     0.474
R-square               0.964         0.772          0.852     0.777
Mean (evasion)         1073.44       1073.44        1059.05   1059.05

We can notice the following from this table:

  • The Machine Learning model consistently outperforms the OLS model on both training and testing data.

  • The RMSE values for the ML model are lower, indicating better predictive accuracy.

  • Higher R-square values for the ML model suggest a better fit to the data.

  • These findings highlight the advantages of using Machine Learning models for prediction tasks over traditional OLS regression models.
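The normalized rows of Table 1 follow directly from its raw statistics. A quick Python check using the Test-column figures from the example table:

```python
rmse_ml_test, rmse_ols_test = 409.18, 502.15         # Test-column RMSEs from Table 1
mean_test = 1059.05                                  # Mean (evasion) in the test sample

# RMSE/Mean rescales the error by the average evasion level,
# giving a scale-free measure for comparing the two models
ratio_ml = round(rmse_ml_test / mean_test, 3)
ratio_ols = round(rmse_ols_test / mean_test, 3)
```

A smaller ratio means the typical prediction error is a smaller fraction of average evasion; in the example, 0.386 for ML versus 0.474 for OLS.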

4.5 Graphs to See the Accuracy of the Prediction

The following part of the code generates visual representations that compare predicted results against actual values in the testing sample.

Histogram to Compare Prediction and Actual Values

Lines 419-421: Plot a histogram comparing the distributions of tot_pred (predicted values) and evaded (actual values) in the testing sample (u==0), and save the graphs in the corresponding folder.

Note 15:

Please update the variable names to match those in your dataset. Additionally, make sure the monetary values in line 419 are expressed in your local currency: in xtitle("Monetary Value"), replace "Monetary Value" with the appropriate currency name, such as "Euro" or "Dollar".

graph twoway (histogram tot_pred if u==0, percent bin(100)) (histogram evaded if u==0, percent bin(100) fcolor(purple%30) lcolor(purple%30)), xtitle("Monetary Value") legend(order(1 "Prediction" 2 "Recovered VAT"))
graph save Graph "$graphs/histogram_prediction1.gph", replace
graph export "$graphs/histogram_prediction1.pdf", as(pdf) replace

The following Figure 5 shows the histogram of predicted and actual VAT values.

Figure 5: Histogram of Predicted and Actual VAT Values


We can notice the following from Figure 5:

  • The histogram shows the distribution of predicted values (in blue) and actual recovered VAT values (in purple).

  • The closer the distributions of the predicted and actual values, the more accurate the model.

Scatterplot to Compare Prediction and Actual Values

Lines 426-428: Creates and saves a scatterplot that compares the predicted tax values (tot_pred) and the actual recovered tax values (evaded) against the declared tax (tax_vle).

Note 16:

Please update the variable names to match those in your dataset. Additionally, make sure that the monetary values in line 426 are expressed in your local currency. In the code option xtitle("Monetary Value"), replace "Monetary Value" with the appropriate currency name, such as "Euro" or "Dollar". These adjustments apply to both of the plots in this section.

graph twoway (scatter tot_pred tax_vle, sort mcolor(%50)) ///
(scatter evaded tax_vle, sort mcolor(%50) msymbol(lgx)), ///
legend(rows(1) position(6)) xtitle("Monetary Value") ///
legend(order(1 "Prediction" 2 "Recovered VAT"))
graph save Graph "$graphs/histogram_prediction2.gph", replace
graph export "$graphs/histogram_prediction2.pdf", as(pdf) replace

The following Figure 6 shows the scatterplot of predicted and actual VAT values against the declared tax amounts.

Figure 6: Scatterplot of Predicted and Actual VAT Values Against Declared Tax Amounts


We can notice the following from Figure 6:

  • The scatterplot displays the predicted values (in blue) and actual recovered VAT values (in red) against the declared tax amounts.

  • The clustering of points indicates the relationship between the predicted and actual values, showing how well the model predicts VAT recovery.

  • A strong alignment of the blue and red points along a diagonal line would indicate good predictive performance.


5 Fourth Stage: Estimating the Tax Gap

This section of the code is designed to estimate the tax gap using a trained Random Forest model.

At the beginning, since we worked with a sample of the original dataset, we need to clear memory and reload the full dataset (with all the observations), as in code lines 442-444. In this dataset, we will predict the unpaid tax and obtain the tax gap. Please remember to save the optimal parameters obtained from the last two steps of the second stage, in Sections 3.3 and 3.4.

clear

use "$data/ZZ"  // Insert the database to use

In the following code, lines 446-448, replace ‘YY’ with the appropriate indicator variable for the regions (geographical or tax administrative) where audits were conducted, and replace ‘XX’ with the variable representing the amount of taxes evaded, including VAT, CIT, or PIT, ensuring there are no missing values.

generate reg_audit = YY 

generate evaded = XX    

We create an indicator variable that flags records where the evaded variable is not missing and reg_audit is equal to 1 (indicating audited regions).

generate indicator = (evaded !=. & reg_audit==1)

Then we create a new variable mimic_tax as a copy of the evaded variable, and replace the missing values of evaded with zero in mimic_tax.

generate mimic_tax = evaded
replace mimic_tax = 0 if evaded == .

After that, we run the Random Forest model to predict evasion (mimic_tax) using the specified variables (/VARIABLES/) for records where indicator is 1. The number of iterations (II) and the number of variables per split (VV) should be set based on the prior tuning steps.

rforest mimic_tax /VARIABLES/ if indicator==1, type(reg) iter(II) numvars(VV)

Note 17:

The parameters II and VV are derived from the tuning process outlined in steps one and two. Please replace II with the value found in Section 3.3 and VV with the value from Section 3.4. Additionally, ensure that you use the same set of predictor variables as in the previous sections to maintain the accuracy of the prediction model.

The following code lines 459-460 predict the amount of undeclared tax (assessment) for unaudited records within regions where audits were conducted. Then, it ensures that for audited records within the same regions, the assessment variable is set to the actual evaded amounts (evaded), ensuring the predicted values reflect true evaded amounts where audits have already determined these values.

predict assessment if indicator==0 & reg_audit==1
replace assessment = evaded if indicator==1 & reg_audit==1

The next command summarizes the distribution of the predictions and checks for negative values. If your evaded variable legitimately takes negative values (for example, if misreporting can appear as credits owed back to the taxpayer), you can omit this check.

sum assessment, d

If your variable assessment contains negative values, you can replace them with zero in a new variable assessment_2, as negative values are considered invalid and should be treated as zero by design. For instance, negative values in the misreporting variable indicate that firms are receiving credits, which is not part of the tax gap definition. Although the predictions with negative values might be useful for tax analysis, it’s preferable to create a new variable that truncates the distribution by imputing zero for all negative values.

generate assessment_2 = assessment
replace assessment_2 = 0 if assessment_2<0 

Note 18:

If there are no negative values, you can skip the previous step.

Also, since the tax collection for the period measures only the positive tax and does not account for refunds, we need to consider both aspects. In the following code, we create two variables. First, the “positive potential revenue” reflects the possibility that, despite a high assessment of misreporting, the actual tax owed could still be negative due to refunds or credits. Second, the “revenue losses” capture the assessments related to positive tax declarations, as well as the positive potential revenue for those who report a non-positive tax.

This code can still be used even if there are negative values in the evaded measure.

This code calculates potential tax revenue and revenue losses by handling both positive taxes and refunds/credits, ensuring accurate estimation of the tax gap.

We begin by creating “potential tax” by adding assessed evaded tax to the declared tax for each observation.

generate pot_tax = assessment + tax

Then, we set “potential tax” to zero where the calculation is negative, indicating refunds or credits.

replace pot_tax = 0 if pot_tax<0

Next, we initialize “revenue loss” with the assessed evaded tax where the declared tax is non-negative.

generate rev_loss = assessment if tax>=0

After that we add to “revenue loss” the potential tax where declared tax is negative but potential tax is positive.

replace rev_loss = pot_tax if tax<0 & pot_tax>0

Finally, we assign zero to “revenue loss” where it remains missing, representing cases with only refunds or credits.

replace rev_loss = 0 if rev_loss==.
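To see how these rules interact, here is a hypothetical worked example (in Python, purely illustrative and outside the Stata toolkit) applying the same pot_tax and rev_loss rules to three invented cases: a positive declaration, a small refund offset by evasion, and a pure refund:

```python
# Each tuple is (assessment, declared tax); negative tax means refunds/credits.
cases = [(50.0, 200.0), (30.0, -10.0), (5.0, -100.0)]

results = []
for assessment, tax in cases:
    pot_tax = max(assessment + tax, 0.0)  # potential tax, floored at zero
    if tax >= 0:
        rev_loss = assessment             # positive declarations: loss is the assessment
    elif pot_tax > 0:
        rev_loss = pot_tax                # refunds partially offset by evaded tax
    else:
        rev_loss = 0.0                    # pure refund/credit cases: no loss
    results.append((pot_tax, rev_loss))
print(results)
```

Note how the middle case yields a revenue loss of only the positive potential tax, not the full assessment, because part of the evasion merely offsets a refund.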

The following command saves your data. Replace NAME with the name you want to give to the new dataset, which now includes the prediction of misreported tax.

save "$data/NAME.dta", replace

5.1 Plot the Tax Gap

The following code lines sum the variables by specific parameters (year, economic activity, geographical region, etc.) to obtain the tax gap. You can choose the parameters to sum on, depending on the focus of the analysis. You need to choose the proper tax to compare.

In the following part, the first block of code obtains the overall tax gap by year. Since this is the first estimation, the command that stores the results in a table ends with the replace option. In subsequent estimations, this ending must be changed to append. This way, the software appends each result to the same document, ending up with a single Excel file named results containing all your estimations.

Similarly, we recommend changing the name of the graphs in the lines "$graphs/vat_gap.gph" and "$graphs/vat_gap.pdf" when performing new estimations. Otherwise, you will replace the graphs generated from previous runs. Our suggestion is to add a suffix, such as "_number" or "_analysis", to differentiate the files. For instance, use "vat_gap_1.gph", "vat_gap_2.gph", and so on (similarly for "vat_gap.pdf"), or use "vat_gap_industry.gph", "vat_gap_region.gph", etc. (again, same for "vat_gap.pdf").

Important: In case your evasion variable contains negative values, please go to section 5.2. Otherwise, follow this.

preserve
collapse (sum) assessment tax pot_tax rev_loss if reg_audit==1, by(year)

The preserve command in Stata is used to save the current state of the dataset in memory. This allows you to make changes or manipulations to the dataset temporarily and then restore it to its original state using the restore command.

Note 19:

You need to replace the variable tax with the tax you are estimating (VAT, CIT, PAYE). Also, replace var_analysis with the variable for the group you are analyzing. Remember that if you include year, activity, and region, you will have one value for each combination. This means that if you only want to study activity across years, you need to include only year and activity.

The first three code lines below generate the necessary variables for the tax gap calculation. The first generates the variable for the summation of assessments and absolute tax values, the second calculates the percentage tax gap, and the third calculates the tax revenue gap.

vat_gap01: This variable calculates the percentage of the assessed tax (assessment) relative to the combined tax base (v01).

vat_gap01 = ( assessment / (assessment + |tax|) ) × 100

This percentage indicates the proportion of the assessed tax in relation to the total of assessed and actual tax values, effectively measuring the tax gap or potential tax evasion.

vat_gap02: This variable calculates the percentage of revenue losses (rev_loss) relative to the potential tax revenue (pot_tax).

vat_gap02 = ( rev_loss / pot_tax ) × 100

This percentage measures the proportion of revenue losses in relation to the potential tax revenue, providing an indication of the tax revenue gap, which can be interpreted as the extent of tax leakage or inefficiencies in tax collection.
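As a quick numeric check of the two formulas, consider the following hypothetical aggregates (a Python sketch, outside the Stata workflow; all figures are invented):

```python
# Hypothetical aggregates for one year
assessment = 300.0   # predicted/observed evaded tax
tax = 700.0          # declared tax (absolute value enters the base)
pot_tax = 950.0      # potential tax after flooring negatives at zero
rev_loss = 280.0     # revenue losses

v01 = assessment + abs(tax)           # combined tax base
vat_gap01 = assessment / v01 * 100    # share of evaded tax in the base
vat_gap02 = rev_loss / pot_tax * 100  # share of losses in potential revenue
print(round(vat_gap01, 1), round(vat_gap02, 1))
```

The two percentages will generally differ, since vat_gap02 nets out refunds and credits while vat_gap01 does not.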

Then, the code generates a bar graph displaying the average vat_gap01 and vat_gap02 for each year. It then saves the graph in Stata’s .gph format and exports it as a PDF. Additionally, the code exports these statistics to a CSV file, providing both visual and numerical summaries of the annual tax gap analysis.

generate v01 = assessment+abs(tax)

generate vat_gap01 = (assessment/v01)*100
generate vat_gap02 = (rev_loss/pot_tax)*100

graph bar (mean) vat_gap01 vat_gap02, over(year) blabel(bar, size(vsmall) format(%9.2f)) ytitle(% of Evasion) title("Percentage of tax evaded") subtitle("by audited tax regions") note("Aggregate values.") legend(order(1 "VAT gap with refunds" 2 "VAT gap w/o refunds"))
graph save Graph "$graphs/vat_gap_year.gph", replace
graph export "$graphs/vat_gap_year.pdf", as(pdf) replace

estpost tabstat vat_gap01 vat_gap02, by(year) stat(max) col(stat)
esttab . using "$xlsout/results.csv", cells("max(lab(Max) fmt(%15.2fc))") noobs nonumber eqlabels(`e(labels)') varwidth(20) replace


restore

Note 20:

When you include a variable or a set of variables in the collapse, you define the base over which the tax gap is estimated. For instance, if you want to estimate the tax gap by industry over the years, you must include the year and industry variables. Hence, you must repeat this code every time you want to change the unit of analysis of the tax gap. If you need to estimate the tax gap for both regions and industries, you must run the code twice: first including the year and region variables, and later, in a separate run, year and industry.

Decide the variable to plot and obtain the Excel summary. For example, if you want to have the plot by region or industry, you need to replace var_analysis with these variables.
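The effect of choosing the grouping variables can be illustrated outside Stata. The sketch below (hypothetical Python, with invented records) mimics collapse (sum) by summing first by year only, and then by each year-industry combination:

```python
from collections import defaultdict

# Hypothetical (year, industry, assessment) records from audited regions
rows = [(2020, "retail", 10.0), (2020, "mining", 5.0), (2021, "retail", 7.0)]

# Summing by year only: one value per year
by_year = defaultdict(float)
for year, industry, assessment in rows:
    by_year[year] += assessment
print(dict(by_year))

# Summing by (year, industry): one value per combination
by_year_ind = defaultdict(float)
for year, industry, assessment in rows:
    by_year_ind[(year, industry)] += assessment
print(dict(by_year_ind))
```

Adding a grouping variable multiplies the number of cells in the output, which is why each new unit of analysis requires rerunning the collapse block.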


preserve

collapse (sum) assessment tax pot_tax rev_loss if reg_audit==1, by(var_analysis)

generate v01 = assessment+abs(tax)

generate vat_gap01 = (assessment/v01)*100
generate vat_gap02 = (rev_loss/pot_tax)*100


graph bar (mean) vat_gap01 vat_gap02, over(var_analysis) blabel(bar, size(vsmall) format(%9.2f)) ytitle(% of Evasion) title("Percentage of tax evaded") subtitle("by audited tax regions") note("Aggregate values.") legend(order(1 "VAT gap with refunds" 2 "VAT gap w/o refunds"))
graph save Graph "$graphs/vat_gap.gph", replace
graph export "$graphs/vat_gap.pdf", as(pdf) replace

Generate summary statistics and export to CSV.

estpost tabstat vat_gap01 vat_gap02, by(var_analysis) stat(max) col(stat)
esttab . using "$xlsout/results.csv", cells("max(lab(Max) fmt(%15.2fc))") noobs nonumber eqlabels(`e(labels)') varwidth(20) append

Then, restore the original dataset. The restore command reverts the dataset in memory to the state it was in when preserve was called, effectively undoing any changes made after preserve.

restore

Note 21:

The parameters and variables used in the code should be replaced with actual values specific to your dataset and analysis requirements.

5.2 Tax Gap with Negative Values in Misreporting Variable

If your variable evaded contains negative values, you should follow the code in lines 556 to 573.

preserve

collapse (sum) assessment_2 tax pot_tax rev_loss if reg_audit==1, by(var_analysis)

generate v01 = assessment_2+abs(tax)

generate vat_gap01 = (assessment_2/v01)*100
generate vat_gap02 = (rev_loss/pot_tax)*100

graph bar (mean) vat_gap01 vat_gap02, over(var_analysis) blabel(bar, size(vsmall) format(%9.2f)) ytitle(% of Evasion) title("Percentage of tax evaded") subtitle("by audited tax regions") note("Aggregate values.") legend(order(1 "VAT gap with refunds" 2 "VAT gap w/o refunds"))
graph save Graph "$graphs/vat_gap.gph", replace
graph export "$graphs/vat_gap.pdf", as(pdf) replace

estpost tabstat vat_gap01 vat_gap02, by(var_analysis) stat(max) col(stat)
esttab . using "$xlsout/results.csv", cells("max(lab(Max) fmt(%15.2fc))") noobs nonumber eqlabels(`e(labels)') varwidth(20) append


restore

Note 22:

Ensure that you have handled negative values appropriately before proceeding with the tax gap estimation. Negative values in the evaded variable may indicate over-reporting or credits, which are not part of the tax gap definition. By creating assessment_2 and replacing negative values with zero, you ensure the integrity of the tax gap calculation.

5.3 Estimating Nominal Values for Potential Revenue

Now, if you want to obtain the nominal values for potential revenue, total tax collected, and potential revenue lost, you can use the following commands.

Note 23:

You must replace “Monetary Unit” with your actual monetary unit, for example, “US dollars”. As before, we first obtain the overall results and later by sector/industry/regions, etc. Please follow the same advice as before concerning the naming of the graphs saved.

preserve

collapse (sum) assessment tax pot_tax rev_loss if reg_audit==1, by(year)

generate potential_revenue = assessment+abs(tax)


graph bar (mean) potential_revenue assessment tax pot_tax rev_loss, over(year) blabel(bar, size(vsmall) format(%9.2f)) legend( label(1 "Potential Revenue") label(2 "Revenue Lost (predicted)") label(3 "Revenue Collected") label(4 "Potential Revenue w/o Refunds") label(5 "Revenue Lost (predicted) w/o Refunds")) ytitle(Monetary Unit) title("Revenue Consequences") subtitle("by audited tax regions") note("Aggregate values.")
graph save Graph "$graphs/revenue_gap.gph", replace
graph export "$graphs/revenue_gap.pdf", as(pdf) replace

estpost tabstat potential_revenue assessment tax pot_tax rev_loss, by(year) stat(max) col(stat)
esttab . using "$xlsout/results_nominal.csv", cells("max(lab(Max) fmt(%15.2fc))") noobs nonumber eqlabels(`e(labels)') varwidth(20) replace


restore

Here, as before, decide which variable to collapse and plot, and obtain the Excel summary. For example, if you want to create the plot by sector, industry, or region, you need to replace var_analysis with these variables.

preserve

collapse (sum) assessment tax pot_tax rev_loss if reg_audit==1, by(var_analysis)

generate potential_revenue = assessment+abs(tax)


graph bar (mean) potential_revenue assessment tax pot_tax rev_loss, over(var_analysis) blabel(bar, size(vsmall) format(%9.2f)) legend( label(1 "Potential Revenue") label(2 "Revenue Lost (predicted)") label(3 "Revenue Collected") label(4 "Potential Revenue w/o Refunds") label(5 "Revenue Lost (predicted) w/o Refunds")) ytitle(Monetary Unit) title("Revenue Consequences") subtitle("by audited tax regions") note("Aggregate values.")
graph save Graph "$graphs/revenue_gap.gph", replace
graph export "$graphs/revenue_gap.pdf", as(pdf) replace

estpost tabstat potential_revenue assessment tax pot_tax rev_loss, by(var_analysis) stat(max) col(stat)
esttab . using "$xlsout/results_nominal.csv", cells("max(lab(Max) fmt(%15.2fc))") noobs nonumber eqlabels(`e(labels)') varwidth(20) append


restore