README Part 1: Guidance for Bottom-Up Tax Gap Toolkit
Part 1: Data Management
- 1 Overview
- 2 Usage Guidelines
- 3 First Stage
- 4 Second Stage: Data Management Process for VAT, CIT, and PIT
- 5 Third Stage: Data Management Process for Audit Return Data
- 5.1 Importing Audit Data
- 5.2 Appending Audit Datasets
- 5.3 Inspecting the date variables and ensuring consistent formats
- 5.4 Adjusting Total Assessments for Annual Audit Periods
- 5.5 Reshaping Audit Data
- 5.6 Dealing with Duplicates
- 5.7 Improvement and Checks
- 6 Fourth Stage: Merging CIT, PIT, and VAT with the Audit Data
Mostafa Bahbah, Kwabena Adu-Ababio, Sebastian Castillo, and Amina Ebrahim
1 Overview
This toolkit, developed by UNU-WIDER/ATI, provides a comprehensive framework for conducting bottom-up tax gap analyses. It utilizes machine learning algorithms to estimate Value Added Tax (VAT), Personal Income Tax (PIT), and Corporate Income Tax (CIT) gaps, specifically designed for Revenue Authorities.
The toolkit consists of two code files. The first covers data management: it cleans the tax return and audit data and merges them, so the data are ready for the machine learning analysis performed by the second code file.
Date: 22/11/24
To use this toolkit, you must have Stata installed on your computer. Additionally, the data requirements depend on the tax gap you want to analyze, such as CIT, PIT, and/or VAT return data, in addition to audit data.
2 Usage Guidelines
This part of the code consists of four main stages:
First Stage: This stage involves setting up the working environment by installing and updating Stata packages, configuring directory paths, and importing CIT, PIT, and VAT data. It also includes renaming and formatting variables to ensure consistency.
Second Stage: This stage focuses on managing CIT, PIT, and VAT data. The processes include removing duplicates, appending data if it comes in multiple files, sorting, reordering variables, and handling duplicates for appended data.
Third Stage: Similar to the second stage, this stage involves data management for audit data. The processes include removing duplicates and appending data if it comes in multiple files, in addition to correcting any inconsistent formats in the audit date variables.
Fourth Stage: In this stage, CIT, PIT, and VAT data are merged with the audit data.
Throughout these stages, it is crucial to handle tax return and audit data variables carefully. Some of these variables should be present in your data, although they might have different names. Therefore, it is advisable to rename and format them according to the instructions provided in these stages. Additionally, some variables are created temporarily during the data management and cleaning process. Moreover, certain variables that do not exist in the return and audit tax data are created as they are necessary for the next part of the code, which involves using a machine learning model to estimate tax gaps.
Table 1:
Variable Name | Type/Category | Definition |
---|---|---|
tpin | VAT, CIT, PIT, Audit / O, Ness | A Taxpayer Identification Number (TPIN) is a unique identifier assigned to each taxpayer by tax authorities for identification and administrative purposes, used to track and manage tax obligations. TPIN ensures accurate processing of tax returns, payments, and related activities. Different countries may refer to TPINs using various terms. |
rtn_year | VAT, CIT, PIT, Audit / O, Ness | Return year, the specific year in which the tax return was filed by the taxpayer. |
month | VAT, PIT / O, Ness | The month during which the VAT/PIT return was filed, important for tracking monthly returns. |
isic_code | VAT, CIT, PIT / O, Ness | International Standard Industrial Classification (ISIC) Code: A numeric code, used to classify the economic activities of the taxpayer. This code may also consist of 2 or 4 digits if available. |
broad_sector | VAT, CIT, PIT / O, Unn | The broad sector classification of the taxpayer’s economic activities is indicated by a single-digit code, providing a general overview of the sector. |
regdate | VAT, CIT, PIT / O, Unn | Registration date, the date when the taxpayer registered for VAT/CIT/PIT. |
town_name | VAT, CIT, PIT, Audit / O, Ness | Name of the town, city, or locality where the taxpayer is located. This is used for geographical analysis and audit planning, taking into account the administrative structure of the specific country or region |
province | VAT, CIT, PIT, Audit / O, Ness | Name of the administrative division where the taxpayer is located. This could refer to a province, state, region, district, or any other relevant administrative unit, depending on the country. |
returnid | VAT, CIT, PIT / O, Unn | Return ID, a unique identifier for each tax return. |
returnstatus | VAT, CIT, PIT / O, Unn | Status of the tax return, indicating whether the return was filed, processed, or pending. |
returntype | VAT, CIT, PIT / O, Unn | Type of tax return filed, indicating whether it is a regular return, amendment, etc. |
invat | VAT / O, Ness | Input VAT, the amount of VAT the taxpayer paid on their purchases, expressed in local currency. |
outvat | VAT / O, Ness | Output VAT, the amount of VAT the taxpayer collected on their sales, expressed in local currency. |
vat_due | VAT / O, Ness | VAT due, the net amount of VAT the taxpayer needs to pay (output VAT minus input VAT), expressed in local currency. |
vat_vars | VAT / O, Ness | Other relevant VAT variables required for analysis. |
totalfixedassets | CIT / O, Ness | Total fixed assets, the value of the taxpayer’s fixed assets used for business operations, expressed in local currency. |
accumulated-depreciation | CIT / O, Ness | Accumulated depreciation, the total amount of depreciation claimed on the taxpayer’s fixed assets, expressed in local currency. |
gsalary | PIT / O, Ness | Gross salary refers to the total compensation paid to employees before any deductions, including benefits, allowances, and other forms of remuneration, expressed in local currency. |
pit_due | PIT / O, Ness | PIT due, the amount of personal income tax the taxpayer needs to pay, expressed in local currency. |
pit_vars | PIT / O, Ness | Other relevant PIT variables required for analysis. |
TotalTax | Audit / O, Ness | The total tax amount determined by the audit, representing the taxpayer’s total tax liability as assessed by the audit, is expressed in local currency. This variable was renamed totalassessments. |
totalassessments_annual | Audit / N, Ness | Adjusted annual assessment calculated by dividing total assessments by the number of years covered in the audit period. |
total_audit_period | Audit / N, Ness | Number of years covered by the audit period. |
assessment_date | Audit / O, Ness | The date on which the tax assessment was conducted. It is crucial for tracking the timeline of the audit process. |
audit_started_date | Audit / O, Ness | The date when the audit process began. Important for determining the duration and timeline of the audit. |
audit_completion_date | Audit / O, Ness | The date when the audit process was completed. Essential for calculating the total duration of the audit. |
audit_period | Audit / O, Ness | Represents a range or multiple periods covered by the audit. It may include multiple dates or periods relevant to the audit. |
auditperiodfrom | Audit / O, Ness | The starting year of the audit period. This variable captures the first year that the audit covers, helping to determine the range of the audit’s coverage. |
auditperiodto | Audit / O, Ness | The ending year of the audit period. This variable indicates the final year covered by the audit, important for determining the total audit period length. |
totalassessments | Audit / O, Ness | The total tax amount assessed by the audit for a given taxpayer (renamed from TotalTax, see above), used as the basis for calculating totalassessments_annual. |
initial_period | Audit / N, Ness | First year covered by the audit. Important for understanding the range of periods the audit covers. |
final_period | Audit / N, Ness | Final year covered by the audit. Important for understanding the range of periods the audit covers. |
auditcode | Audit / T, Ness | Encoded variable representing different types of audits, used to classify and analyze the nature of audits conducted. |
audittype | Audit / N, Ness | Reclassified variable for audit types based on auditcode, providing a more detailed classification of audit types such as comprehensive audit, issue audit, credibility check, etc. |
started_date | Audit / N, Ness | A reformatted variable that captures only the month and year when the audit started. Used to standardize date formats for analysis. |
audit_completion | Audit / N, Ness | A reformatted variable that captures only the month and year when the audit was completed. Used to standardize date formats for analysis. |
tax_head | Audit / O, Ness | Tax type or head, indicating the specific tax category being addressed (e.g., CIT, PIT, VAT). |
latest0 | Temporary / T, Ness | Generated variable used to identify the latest return year for a given taxpayer and period (year or month) in the initial dataset. |
latest1 | Temporary / T, Ness | Generated variable used to identify the latest return year for a given taxpayer and period (year or month) after appending datasets or additional processing steps. |
dup0 | Temporary / T, Ness | Temporary variable to tag duplicates for further inspection, helping identify and manage duplicate records in the dataset. |
dup1 | Temporary / T, Ness | Temporary variable to tag duplicates for further inspection, helping identify and manage duplicate records in the dataset. |
date_problem | Audit / T, Ness | Flag indicating if there are problems with the dates (e.g., audit start date is after the completion date). |
s | Audit / T, Ness | Temporary variable for swapping dates (start date). Used to correct cases where start date is after the completion date. |
e | Audit / T, Ness | Temporary variable for swapping dates (end date). Used to correct cases where start date is after the completion date. |
p | Audit / T, Ness | Temporary variable used for splitting audit_period into individual years. |
p1 | Audit / T, Ness | First segment of the split audit_period. |
p2 | Audit / T, Ness | Second segment of the split audit_period. |
p3 | Audit / T, Ness | Third segment of the split audit_period. |
p4 | Audit / T, Ness | Fourth segment of the split audit_period. |
p22 | Audit / T, Ness | Intermediate variable used to store the final year if third segment is missing. |
p33 | Audit / T, Ness | Intermediate variable used to store the first intermediate year if third segment is present. |
sm | Audit / T, Ness | Month of audit_started_date. |
sy | Audit / T, Ness | Year of audit_started_date. |
em | Audit / T, Ness | Month of audit_completion_date. |
ey | Audit / T, Ness | Year of audit_completion_date. |
vataudstat | Merge / T, Ness | Variable indicating the audit status of VAT returns, showing whether the return was matched and audited. |
_merge | Merge / T, Ness | A temporary variable created by the merge command to indicate the match status of the merged datasets. |
nr_audits | Merge / T, Ness | Temporary variable to count the number of audits for a taxpayer, used to create ever audited variables. |
vataudited | Merge / N, Ness | Dummy variable indicating if the VAT return was audited. |
ever_vataudited | Merge / N, Ness | Dummy variable indicating if the taxpayer has ever been audited for VAT. |
citaudstat | Merge / T, Ness | Variable indicating the audit status of CIT returns, showing whether the return was matched and audited. |
citaudited | Merge / N, Ness | Dummy variable indicating if the CIT return was audited. |
ever_citaudited | Merge / N, Ness | Dummy variable indicating if the taxpayer has ever been audited for CIT. |
pitaudstat | Merge / T, Ness | Variable indicating the audit status of PIT returns, showing whether the return was matched and audited. |
pitaudited | Merge / N, Ness | Dummy variable indicating if the PIT return was audited. |
ever_pitaudited | Merge / N, Ness | Dummy variable indicating if the taxpayer has ever been audited for PIT. |
Note:
1. O: Original variable, part of the original dataset or source.
2. T: Temporary variable, used for intermediate steps in data cleaning and processing.
3. N: Newly created variable during data processing.
4. Ness: Necessary variable for the analysis of the tax gaps.
5. Unn: Unnecessary variable for the analysis of the tax gaps; might be excluded.
3 First Stage
This stage aims to ensure all necessary packages are installed and updated, prepare the Stata environment, set up directory paths, and import and check the data files, as illustrated in Figure 1.
3.1 Install and Update Necessary Packages
Before starting this stage, ensure that the Stata packages required for this stage are installed by running code lines 49 and 52.
Remove the “*” before these two lines in the do-file. Other packages can also be added in lines 49 and 52 if needed.
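For illustration, those lines follow the usual `ssc install` pattern; the package names below are placeholders, not necessarily the ones the do-file installs:

```stata
* Uncomment (remove the leading "*") to install/update the required packages.
* Package names here are examples only — substitute those listed in the do-file.
ssc install estout, replace
ssc install winsor2, replace
```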
3.2 Preparing Stata Environment
Lines 60 to 64 of the code configure the environment to ensure smooth and efficient script execution by disabling output pauses, clearing matrices, setting wider output lines and larger matrix sizes, and safely closing any open log files.
The command ‘version 14’ makes Stata interpret and run the code according to the syntax and features available in Stata version 14, regardless of the version actually running. The data files are also saved in Stata 14 format, so that the saved files and results can be opened with older versions of the software.
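Lines 60 to 64 typically look something like the following sketch; the exact values (line size, matrix size) may differ in the do-file:

```stata
version 14          // interpret the code using Stata 14 syntax
set more off        // disable output pauses
clear matrix        // clear matrices from memory
set linesize 255    // wider output lines
set matsize 800     // larger matrix size
capture log close   // safely close any open log file
```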
3.3 Setting Up Global Directory Paths
To enhance the code’s flexibility and ease of updating, we employ a “global” command. This command acts as a macro, allowing for the storage of file paths that can be reused throughout the Stata session.
Before executing the code from lines 67 to 75, you need to select a location on your computer. In this chosen location, create a folder named “Cleaning.” This folder will serve as a repository for the outputs from this stage. Within the “Cleaning” folder, you should create six separate subfolders as follows:
- Excel: This folder is for storing your raw data.
- Data: Here, the cleaned data will be placed.
- Graphs: This folder is to save the graphs.
- Dofile: This folder is for saving dofiles.
- Results: This folder is for storing results.
- Log: This folder is used to store log files.
In the dofile, at line 67, insert the path of your ‘Cleaning’ folder between the quotation marks.
global dir "<YOUR PATH>"
global data "$dir\Data"
global graphs "$dir\Graphs"
global xlsout "$dir\Excel"
global dopath "$dir\Dofile"
global results "$dir\Results"
global log "$dir\Log"
The following code line starts recording the session’s output to a log file for documentation and review purposes.
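A sketch of that command, with a hypothetical log-file name:

```stata
* The log filename below is a placeholder — use whatever name the do-file specifies.
log using "$log\data_management.log", replace
```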
Attention: The following steps in this stage aim to import the CIT, PIT, or VAT return data files and rename their variables. You can utilize the code based on the type of tax data you have. In case you are dealing with two or three of these tax types, you should repeat these steps for each type of tax separately. If your tax data for any of these tax types comes only in one file, please ignore the last step in this stage.
3.4 Importing Data, Renaming Variables, and Saving the Data
Line 81 of code is utilized to import the raw data located in a directory specified by the local macro ‘xlsout’. You need to replace “<Your_CIT,PIT,or VAT_returns_Data_1st_File>” with your CIT, PIT, or VAT data filename.
1- ‘import delimited’ is employed to import delimited text files (such as CSV, TSV, or DSV). If your file uses a different delimited text format, adjust the extension after the filename accordingly.
2- If the data file is in xlsx format, we use the ‘import excel’ command like the following example:
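A sketch of the ‘import excel’ variant, assuming the variable names sit in the first row of the sheet:

```stata
* firstrow treats the first spreadsheet row as variable names (an assumption
* about your file's layout — drop the option if the sheet has no header row).
import excel "$xlsout\<Your_CIT,PIT,or VAT_returns_Data_1st_File>.xlsx", firstrow clear
```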
Line 83 formats the pseudo tax ID variable, named ‘tpin’ in this example, to make it visible with a 15-character wide field, with no decimal places.
Replace ‘tpin’ with the name of the pseudo tax ID in your data. In some countries, the pseudo tax ID is a string combination of numbers and letters. In this case, the following code can be followed before formatting, as in our case.
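If the ID is an alphanumeric string, one possible approach (an illustration, not necessarily the do-file’s exact code) is to encode it to a numeric identifier before formatting:

```stata
* Sketch: convert an alphanumeric tpin to a numeric ID.
* encode assigns arbitrary but consistent codes within this file.
encode tpin, gen(tpin_num)
drop tpin
rename tpin_num tpin
format tpin %15.0g
```

Note that ‘encode’ assigns codes alphabetically within a file; if you later merge several files on tpin, apply a consistent approach (for example, encode after appending all files).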
Lines 87 to 90 are utilized for renaming existing variables in your dataset to new names if applicable. For example,
rename old_var_name new_var_name // Example
rename actvty_isic_code isic_code
rename provincename province
rename rtn_id returnid
rename form_name returntype
Additional checks must be made on the formats of filing and return dates, fiscal year, province, and industry codes. Here we assume that the tax return dates are in the correct format; however, you should verify these variables. If there is any inconsistency in the dates, you can correct them by following the instructions for dealing with date variables in the audit data in section 5.3.
Line 95 saves the data in the “Data” folder in STATA format.
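A sketch of that save command, using the placeholder filename from above:

```stata
save "$data\<Your_CIT,PIT,or VAT_returns_Data_1st_File>.dta", replace
```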
3.5 Importing, Renaming, and Saving Additional Data Files if the Data Comes in Multiple Files
Lines 98 to 107 repeat the preceding steps of importing, formatting, renaming, and saving the data. This applies only if the data comes in multiple files.
import delimited "$xlsout\<Your_CIT,PIT,or VAT_returns_Data_2nd_File>.dsv", clear
format tpin %15.0g
rename actvty_isic_code isic_code
rename provincename province
rename rtn_id returnid
rename form_name returntype
Please see Note 5.
4 Second Stage: Data Management Process for VAT, CIT, and PIT
Attention:
The following steps of this stage are general steps applicable for VAT, PIT, and/or CIT, depending on which type of tax return data you are dealing with. If you have more than one of these types, ensure that you apply the same steps for each type.
This stage in the do-file is organized to deal with each of these tax types separately. You only need to go to the part that deals with the type of tax you have and load your data file following the instructions below.
The provided instructions use the variable names from the VAT section in the do-file. You should be consistent and use the variables that reflect the same variables used in each example in the do-file from your data.
If the VAT, CIT, or PIT return data comes in more than one file, you should follow the second and third steps in sections 4.2 and 4.3. Otherwise, you can ignore these steps and go directly to the fourth step in section 4.4, as illustrated in Figure 2.
4.1 Dealing with Duplicates
The following code line loads the Stata data file of VAT, PIT, or CIT. Here, you have to insert your Stata filename instead of ‘<Your_CIT, PIT, or VAT_returns_Data_1st_File>’. The clear option clears any existing data in memory before loading this file:
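For example, using the placeholder filename from the text:

```stata
use "$data\<Your_CIT, PIT, or VAT_returns_Data_1st_File>.dta", clear
```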
If there are duplicates, the following command tags duplicate observations. It then creates a new variable “dup0” which marks duplicates. If “dup0” is greater than 0, the observation has duplicates; if it’s 0, it’s unique.
The following code line displays a frequency table of the dup0 variable, showing how many duplicates and unique records are in the dataset.
Then we use the “browse” command to open a data browser to inspect the records flagged as duplicates. This displays the (rtn_year, tpin, month, broad_sector, outvat, dup0) variables for manual verification.
Since our objective is to identify and retain the most recent tax return (self-reported by the taxpayer) and discard older versions or reviewed/assessed returns, the following part of the code identifies the most recent return year (rtn_year) for each key identifier. In your case, the key identifiers could be the variables that represent, for example, taxpayer ID, fiscal year, or filing month. In our case, we use TPIN (Taxpayer Identification Number) and month.
The following code line identifies the most recent VAT return year for each taxpayer (TPIN) and month among duplicates. Then, the code creates a new variable latest0 to store this value and uses the function max(rtn_year) to indicate the most recent return year within each group of TPIN and month.
Then we use the following code line to drop records that are not the latest return for each tpin and month group where duplicates exist.
The following command line ensures that the dataset is cleaned of any duplicate records for the same taxpayer and period, keeping only one unique entry for each combination of return year, taxpayer ID, and filing month.
Then the following command re-tags the duplicate observations using the same criteria and generates the same variable dup0. Since dup0 already exists from the first command, Stata will overwrite it with the new tags.
In the following code line, Stata checks the condition for all observations. If every observation satisfies the condition (i.e., dup0 is 0 for all rows), the command completes successfully without any output. If any observation has dup0 different from 0 (indicating a duplicate), Stata throws an error and stops executing the script. This ensures that the dataset is indeed free of duplicates before proceeding.
At the end of dealing with the duplicates process, we drop all temporary created variables by executing the following line.
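The full sequence described above can be sketched as follows, mirroring the appended-data version shown in section 4.3 but with the dup0/latest0 variables (the key variables are those of the VAT example — substitute your own identifiers):

```stata
duplicates tag rtn_year tpin month, gen(dup0)   // tag duplicates
tabulate dup0                                   // frequency table of duplicates
browse rtn_year tpin month broad_sector outvat dup0 if dup0 > 0
bys tpin month: egen latest0 = max(rtn_year) if dup0 != 0   // most recent return year
drop if rtn_year != latest0 & dup0 != 0         // keep only the latest return
duplicates drop rtn_year tpin month, force      // remove remaining exact duplicates
duplicates tag rtn_year tpin month, gen(dup0)   // re-tag to verify
assert dup0 == 0                                // stop if any duplicates remain
drop dup0 latest0                               // drop temporary variables
```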
4.2 Appending Datasets
If data comes in multiple files, execute following code to append the data from the file ‘<Your_CIT, PIT, or VAT_returns_Data_2nd_File>.dta’ to the currently loaded dataset in STATA ‘<Your_CIT, PIT, or VAT_returns_Data_1st_File>.dta’.
After appending, you might want to check whether there are empty columns in the combined dataset. For that, the ‘browse’ command opens a window to view the dataset.
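A sketch of those two commands, using the placeholder filename from the text:

```stata
append using "$data\<Your_CIT, PIT, or VAT_returns_Data_2nd_File>.dta"
browse
```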
4.3 Dealing with Duplicates if Data Comes in Multiple Files
The following code does the same process of dealing with duplicates in section 4.1, but now for the appended tax return.
duplicates tag rtn_year tpin month, gen(dup1)
tabulate dup1
browse rtn_year tpin month broad_sector outvat dup1 if dup1 == 1
bys tpin month: egen latest1 = max(rtn_year) if dup1 != 0
drop if rtn_year != latest1 & dup1 != 0
duplicates drop rtn_year tpin month, force
duplicates tag rtn_year tpin month, gen(dup1)
assert dup1 == 0
drop dup1 latest1
4.4 Sorting, Reordering, Collapsing, and Saving Data
The following code line sorts the dataset first by “rtn_year” and then by “tpin” variables.
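That is:

```stata
sort rtn_year tpin
```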
The following code line changes the order of the variables in the dataset.
order rtn_year tpin isic_code broad_sector regdate town_name ///
province returnid returnstatus returntype
Example 1: Collapse VAT to annual level and save as Stata data
Example 2: Collapse PIT to annual level and save as Stata data
Example 3: Saving CIT as Stata data
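As an illustrative sketch of Example 1, assuming the monthly VAT amounts should be summed within taxpayer and year (the variable list and the output filename here are assumptions, not the do-file’s exact code):

```stata
* Collapse monthly VAT returns to the annual level per taxpayer.
collapse (sum) invat outvat vat_due, by(tpin rtn_year)
save "$data\<Your_VAT_Annual_File>.dta", replace   // hypothetical filename
```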
5 Third Stage: Data Management Process for Audit Return Data
The aim of the Third Stage is to manage audit return data by importing, renaming, formatting, and handling duplicates, ensuring consistent date formats, and improving the dataset. The workflow for this stage is illustrated in Figure 3.
5.1 Importing Audit Data
The command on line 296 imports the audit data from the first audit Excel file.
The code lines 299 to 301 are executed to ensure that the variable names in the audit data match those in CIT, PIT, and VAT.
Line 303 formats the pseudo tax ID variable, named ‘tpin’ in this example, to make it visible with a 15-character wide field, with no decimal places.
Additional checks must be made on formats for audit and return dates, fiscal year, province and industry codes. Verify these variables and check the tax assessment variable.
Line 308 saves the Audit return data in the “Data” folder in Stata format.
The next part of the code, from lines 311 to 319, repeats the same process of importing, formatting, and saving audit data, but for “Your_Audits_Data_2nd_File”.
import excel "$xlsout\<Your_Audits_Data_2nd_File>.xlsx", firstrow clear
rename TPIN tpin
rename TOTALTAX totalassessments
format tpin %15.0g
save "$data\<Your_Audits_Data_2nd_File>.dta", replace
The last part of this step is needed only if the audit data comes in more than one file (e.g., two files). In this case, please see Notes 8 and 9.
5.2 Appending Audit Datasets
The following step is only needed if the data comes in multiple files.
The code line 324 appends the data from the file ‘<Your_Audits_Data_2nd_File>.dta’ to the currently loaded dataset.
After appending, you can use the ‘browse’ command in line 325 to check whether there are empty columns in the combined dataset.
The code line 328 saves all audit data into one file.
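Lines 324 to 328 might therefore look like the following; the combined filename is a placeholder of our own, not one specified by the do-file:

```stata
append using "$data\<Your_Audits_Data_2nd_File>.dta"
browse
save "$data\<Your_Audits_Data_Combined>.dta", replace   // hypothetical combined filename
```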
5.3 Inspecting the date variables and ensuring consistent formats
In audit data, dates often appear in inconsistent formats, which can cause significant issues if they are not converted to STATA formats. Examples of these dates include variables such as assessment_date, audit_started_date, audit_completion_date, and audit_period. To ensure the data is accurate and usable, it is essential to check these date variables and make the necessary adjustments to achieve consistent date formats.
To begin, imagine that Table 2 below shows some observations of your date variables. An initial look reveals that they have inconsistent formats. How do you convert them to Stata date formats to ease your analysis?
Table 2:
assessment_date | audit_started_date | audit_completion_date | audit_period |
---|---|---|---|
1-Nov | 1.7.20147 | 13/8/20120 | 2015-16 |
05/012/2018 | 10-Apr | 13/JU//21 | 2015/16-2016/17 |
16/21207 | 17-Nov | 25/001/2017 | 2015-208 |
3.1.12018 | 3.1.12018 | 31/3/218 | 2015/16-2020/21 |
Feb-21 | Apr-21 | DEC-21 | 2011/2013/2014/2015 |
01/05/2019 | 01/07/2017 | 13/08/2020 | 2016 TO 2020 |
01/06/2018 | 01/04/2019 | 13/06/2020 | 2016 to 2022 |
01/06/2017 | 17/11/2017 | 25/01/2017 | 2016/2017-2019/2020 |
Attention:
The entire Section 5.3 is designed to explain the code in the do-file by providing additional examples, as shown in Table 2, to illustrate how users can utilize the code. Users may not encounter any issues with their audit data dates, making some of the following substeps unnecessary. Conversely, users might face unique problems with date variables that require tailored solutions. The provided code demonstrates various methods to identify, address, and correct these inconsistencies, ultimately standardizing date formats for accurate analysis. Following this step ensures a unified and clean format for date variables, as illustrated in Tables 4 and 5.

Section 5.3 provides four options for identifying the audit period; the chosen option depends on the variables available in the audit data. The first-best option is to use the variables auditperiodfrom and auditperiodto, or audit_period, to identify the period covered by the audit. If the audit data includes these variables, we recommend using them in subsection 5.3.6. The second-best option is to use the variables audit_started_date and audit_completion_date to construct the period covered by the audit. The third option is to use an audited-year (= rtn_year) variable to identify the year covered by the audit. The final option is to use the audit_completion_date variable, which assumes that the audit outcome applies to the returns of the year in which the audit was finalized. In all cases, users must complete steps 5.3.1 to 5.3.5 before moving to subsection 5.3.6, and may need to adjust the code in subsection 5.3.6 to match the available variables.

5.3.1 Inspect the Date Variables
First, we can display a sample of the data to get an initial look at the date values in the dataset by executing code line 349 to list the first 20 observations of the variables.
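Line 349 might look like this, listing the four date variables named above:

```stata
list assessment_date audit_started_date audit_completion_date audit_period in 1/20
```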
Then, we use the tabulate command in code lines 352 to 354 to create frequency tables for the assessment_date, audit_started_date, and audit_completion_date variables. The missing option includes missing values in the tabulation. This helps in identifying the different formats and potential issues with the date values.
tabulate assessment_date, missing
tabulate audit_started_date, missing
tabulate audit_completion_date, missing
Although the tabulated entries look like dates, at this point Stata does not yet recognize them as dates. After performing the steps needed to clean these entries, we demonstrate how to tell Stata to format these variables so they are recognized as dates in section 5.3.3.
After that, the following section of the code from lines 357 to 364 checks for unique formats and issues in assessment_date, audit_started_date, and audit_completion_date. It first sorts the dataset by each date variable to order the entries. Then, it attempts to convert the dates to the “DMY” (day-month-year) format. If the conversion fails, it further checks if the dates can be converted to “MDY” (month-day-year), “YMD” (year-month-day), or “MYD” (month-year-day) formats. If all these conversions fail, the browse command displays these problematic entries in the Data Browser for manual inspection and correction. This process is repeated for each date variable to ensure consistency and accuracy in date formatting.
sort assessment_date
browse if missing(date(assessment_date, "DMY")) & ///
missing(date(assessment_date, "MDY")) & ///
missing(date(assessment_date, "YMD")) & ///
missing(date(assessment_date, "MYD"))
sort audit_started_date
browse if missing(date(audit_started_date, "DMY")) & ///
missing(date(audit_started_date, "MDY")) & ///
missing(date(audit_started_date, "YMD")) & ///
missing(date(audit_started_date, "MYD"))
sort audit_completion_date
browse if missing(date(audit_completion_date, "DMY")) & ///
missing(date(audit_completion_date, "MDY")) & ///
missing(date(audit_completion_date, "YMD")) & ///
missing(date(audit_completion_date, "MYD"))
Table 3 below shows observations that need manual correction in the next step after implementing the previous code.
Table 3
assessment_date | audit_started_date | audit_completion_date |
---|---|---|
05/012/2018 | 10-Apr | 13/JU//21 |
1-Nov | 1.7.20147 | 13/8/20120 |
16/21207 | 17-Nov | 25/001/2017 |
3.1.12018 | 3.1.12018 | 31/3/218 |
Feb-21 | Apr-21 | DEC-21 |
Audit_period is handled differently as it represents a range or multiple periods.
5.3.2 Manual Corrections
This step is needed only if the browse commands in the previous step indicate cases that need manual corrections. Here, we replace these incorrect date formats with the correct ones. You need to amend the following example to fit into your case.
The following code section corrects specific date format issues identified in the previous steps for assessment_date, audit_started_date, and audit_completion_date. Each replace command places the correct “DMY” (day-month-year) date before the if condition, and the incorrect value identified previously after the if condition. For example, replace assessment_date = "01/11/2019" if assessment_date == "1-Nov" changes assessment_date values from "1-Nov" to "01/11/2019". This process is repeated for each incorrect date entry, ensuring all dates are standardized to the correct format. Additional corrections can be added as necessary to handle all identified date format issues.
Example:
* Corrections for assessment_date
replace assessment_date = "01/11/2019" if assessment_date == "1-Nov"
replace assessment_date = "05/12/2018" if assessment_date == "05/012/2018"
replace assessment_date = "16/12/2007" if assessment_date == "16/21207"
replace assessment_date = "01/02/2021" if assessment_date == "Feb-21"
replace assessment_date = "03/01/2018" if assessment_date == "3.1.12018"
* Add more corrections as needed
* Corrections for audit_started_date
replace audit_started_date = "01/07/2017" if audit_started_date == "1.7.20147"
replace audit_started_date = "10/04/2019" if audit_started_date == "10-Apr"
replace audit_started_date = "17/11/2017" if audit_started_date == "17-Nov"
replace audit_started_date = "01/04/2021" if audit_started_date == "Apr-21"
replace audit_started_date = "03/01/2018" if audit_started_date == "3.1.12018"
* Add more corrections as needed
* Corrections for audit_completion_date
replace audit_completion_date = "13/06/2021" if audit_completion_date == "13/JU//21"
replace audit_completion_date = "13/08/2020" if audit_completion_date == "13/8/20120"
replace audit_completion_date = "25/01/2017" if audit_completion_date == "25/001/2017"
replace audit_completion_date = "31/03/2018" if audit_completion_date == "31/3/218"
replace audit_completion_date = "01/12/2021" if audit_completion_date == "DEC-21"
* Add more corrections as needed
Once you have addressed all the incorrect formats, as in the example above, you can double-check by re-running code lines 357 to 364; if the browse window shows no observations, you have addressed all the inconsistent date formats.
5.3.3 Identifying and Correcting Remaining Incorrect Dates to Stata Format
This section of the code from lines 389 to 407 is designed to identify and correct any remaining incorrect dates in the dataset, converting them to Stata’s date format. The local date_vars command creates a local macro containing the date variables assessment_date, audit_started_date, and audit_completion_date. The foreach loop iterates over each variable in the date_vars list to perform the corrections. Within the loop, a new variable is generated for each date variable with the suffix _new, converting the date to the “DMY” (day-month-year) format. The format command ensures that the new date variable is displayed in the proper date format.
Subsequent replace commands attempt to correct any remaining missing values by trying different date formats: “DM20Y”, “MDY”, and “Y”. If there are still missing values, a new temporary variable with the suffix _numeric is generated to handle non-date entries, converting them to numeric format using the destring command. The replace command then fills in any remaining missing values in the new date variable with these numeric values. Finally, the temporary _numeric variable is dropped to clean up the dataset.
local date_vars assessment_date audit_started_date audit_completion_date
* Update the corrected dates after manual corrections
foreach var of local date_vars {
    * Generate the corrected date variable
    generate `var'_new = date(`var', "DMY")
    format `var'_new %td
    replace `var'_new = date(`var', "DM20Y") if missing(`var'_new)
    replace `var'_new = date(`var', "MDY") if missing(`var'_new)
    replace `var'_new = date(`var', "Y") if missing(`var'_new)
    * Handle remaining cases where date conversion failed
    generate `var'_numeric = `var' if missing(`var'_new)
    destring `var'_numeric, replace force
    replace `var'_new = `var'_numeric if missing(`var'_new)
    * Clean up temporary variables
    drop `var'_numeric
}
The following command line 410 removes the original date variables (assessment_date, audit_started_date, and audit_completion_date) from the dataset using the drop command. After correcting and standardizing these dates, dropping the old variables ensures only the properly formatted date variables remain, keeping the dataset clean and consistent.
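The drop described here is simply:

```stata
* Remove the original, inconsistently formatted date variables
drop assessment_date audit_started_date audit_completion_date
```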
In the do-file, the code section from lines 420 to 422 renames the new corrected date variables to their original names. The rename command changes assessment_date_new back to assessment_date, audit_started_date_new back to audit_started_date, and audit_completion_date_new back to audit_completion_date.
rename assessment_date_new assessment_date
rename audit_started_date_new audit_started_date
rename audit_completion_date_new audit_completion_date
After these steps, you now have these date variables in a unified format to work with, as shown in Table 4.
Table 4
assessment_date | audit_started_date | audit_completion_date |
---|---|---|
01jun2018 | 01apr2019 | 13jun2020 |
01may2019 | 01jul2017 | 13aug2020 |
01nov2019 | 01jul2017 | 13aug2020 |
05dec2018 | 10apr2019 | 13jun2021 |
16dec2007 | 17nov2017 | 25jan2017 |
01jun2017 | 17nov2017 | 25jan2017 |
03jan2018 | 03jan2018 | 31mar2018 |
01feb2021 | 01apr2021 | 01dec2021 |
5.3.4 Checking for problems
The aim of this step is to identify, flag, and correct instances where the audit_started_date is later than the audit_completion_date.
The code line 420 creates a new variable called date_problem. It assigns a value of 1 to date_problem if audit_started_date is greater than audit_completion_date, indicating a potential issue where the start date is after the completion date.
The code line 421 generates a frequency table for the date_problem variable, showing how many records have a start date that is after the completion date and how many do not. This helps in assessing the extent of the issue.
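Based on the description above, the flag and its frequency table might be sketched as follows (the exact form of the condition is an assumption):

```stata
* Flag cases where the audit start date is after the completion date
generate date_problem = (audit_started_date > audit_completion_date)

* Frequency table to assess the extent of the issue
tabulate date_problem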
Applying this to the example we discussed earlier, we can notice that the table indicates two observations that have date issues, as shown in the following screenshot in Figure 4.
The code line 422 shows a window of these two observations that have problems, which is shown in Table 5.
Table 5
audit_started_date | audit_completion_date |
---|---|
17nov2017 | 25jan2017 |
17nov2017 | 25jan2017 |
The code line 424 sets the date_problem variable to missing (.) for records where either audit_started_date or audit_completion_date is missing. This ensures that incomplete records are not flagged incorrectly.
The code section from lines 434 to 439 corrects records in which the audit_started_date is later than the audit_completion_date. It first creates temporary variables, s and e, to store the original audit_started_date and audit_completion_date, respectively. The replace commands then swap these dates for records flagged with date problems (date_problem == 1), assigning the start date to the completion date and vice versa. Finally, the temporary variables s, e, and date_problem are dropped to clean up the dataset, ensuring that only corrected dates remain.
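The swap described above might be sketched as:

```stata
* Preserve the original dates in temporary variables
generate s = audit_started_date
generate e = audit_completion_date

* Swap start and completion dates for flagged records
replace audit_started_date = e if date_problem == 1
replace audit_completion_date = s if date_problem == 1

* Clean up temporary variables
drop s e date_problem
```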
5.3.5 Amending the dates to keep only the month and year
The code section from lines 442 to 453 generates new date variables in a monthly format. For audit_started_date, it creates sm and sy to extract the month and year, respectively, then combines them into started_date using the ym function, formats it as a monthly date (%tm), and labels it “Date when audit started.” Similarly, for audit_completion_date, it creates em and ey for the month and year, combines them into audit_completion, formats it, and labels it “Date when audit ended.” After generating these new date variables, the temporary variables sy, sm, ey, and em are dropped. This ensures that these date variables are consistently formatted and properly labeled.
generate sm=month(audit_started_date)
generate sy=year(audit_started_date)
generate started_date=ym(sy,sm)
format started_date %tm
label var started_date "Date when audit started"
generate em=month(audit_completion_date)
generate ey=year(audit_completion_date)
generate audit_completion=ym(ey,em)
format audit_completion %tm
label var audit_completion "Date when audit ended"
drop sy sm ey em
5.3.6 Dealing with the period covered by the audit
The following section of the code is designed to inspect the audit_period variables for incorrect date formats. The audit period indicates the years covered by an audit and is crucial for accurate analysis.
5.3.6.1 Inspecting the Audit Period Variables
Firstly, the code displays a sample of the audit_period variable to provide an initial look at the data:
Some revenue authorities may provide the audit period in two separate variables, such as auditperiodfrom and auditperiodto. If this is the case in your dataset, you should inspect these variables as well:
This initial inspection helps you understand the structure and format of your audit period data.
Secondly, to identify the different incorrect date formats that need correction, the code generates a summary of unique values using the tabulate command. This helps spot inconsistencies and errors in the date formats of your audit period variables:
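A minimal inspection sketch, assuming the variable names audit_period, auditperiodfrom, and auditperiodto used in this section:

```stata
* Display a sample of the audit period variable
list audit_period in 1/20

* If the period comes in two variables, inspect both
list auditperiodfrom auditperiodto in 1/20

* Summarize unique values to spot inconsistent formats
tabulate audit_period, missing
```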
5.3.6.2 Manual Corrections
This step is necessary only if the inspection in the previous step indicates cases that require manual corrections. Here, you replace incorrect date formats with the correct ones. Amend the following examples to fit your specific case.
Example 1: Correcting audit_period Variable
If your audit_period variable contains incorrect date formats, the code below corrects the problems shown in the example given earlier in Table 1.
replace audit_period = "2015 2016" if audit_period == "2015-16"
replace audit_period = "2015 2017" if audit_period == "2015/16-2016/17"
replace audit_period = "2015 2018" if audit_period == "2015-208"
replace audit_period = "2015 2021" if audit_period == "2015/16-2020/21"
replace audit_period = "2016 2018" if audit_period == "2016 TO 18"
// Add more corrections as needed based on your data inspection
Example 2: Creating audit_period from Separate Variables
If your data provides the audit period in two separate variables (auditperiodfrom and auditperiodto), you can create a unified audit_period variable:
tostring auditperiodfrom auditperiodto, replace
generate audityearfrom = substr(auditperiodfrom, 1, 4)
generate audityearto = substr(auditperiodto, 1, 4)
generate audit_period = audityearfrom + " " + audityearto
This code converts the variables to string format, extracts the year components, and combines them into a single audit_period variable. Ensure that the year extraction (substr) is appropriate for your data format.
Then, the next section of the code cleans up the audit_period variable by replacing any character that is not a digit (0-9) or a space with a space. The forvalues i = 1/10 { ... } loop runs the replacement 10 times to ensure all non-digit characters are replaced. The regexr(audit_period, "[^0-9 ]", " ") function uses a regular expression to identify any character that is not a digit or a space and replaces it with a space. For example, if audit_period contains "2017-2018" or "2017 to 2018", these entries will be converted to "2017 2018", standardizing the format across the dataset.
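The cleaning loop described above can be written as:

```stata
* Replace any character that is not a digit or a space with a space;
* running the loop 10 times catches all stray characters
forvalues i = 1/10 {
    replace audit_period = regexr(audit_period, "[^0-9 ]", " ")
}
```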
The following section of the code extracts and labels the individual components of the audit_period variable. It begins by duplicating audit_period into a new variable p to preserve the original data.
The split p, destring command then splits p into separate numeric components p1, p2, p3, and p4 based on spaces.
Next, it generates initial_period from p1 and labels it “First year covered by auditing”.
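The steps described so far can be sketched as:

```stata
* Preserve the original variable and split it into numeric parts
generate p = audit_period
split p, destring

* First year covered by the audit
generate initial_period = p1
label var initial_period "First year covered by auditing"
```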
For the final period, it creates p22 from p2 if p3 is missing, and replaces p22 with p4 if necessary, then generates final_period from p22 and labels it “Final year covered by auditing”.
gen p22 = p2 if p3==.
replace p22 = p4 if p22==.
gen final_period=p22
label var final_period "Final year covered by auditing"
Finally, it drops the temporary variables p, p1, p2, p3, p4, p22 and the original audit_period to clean up the dataset. This process ensures that all periods covered by auditing are clearly extracted and labeled for further analysis.
This process applied to audit_period will result in variables resembling the following:
Table 6
initial_period | final_period |
---|---|
2016 | 2022 |
2016 | 2020 |
2015 | 2016 |
2015 | 2017 |
2015 | 2018 |
2016 | 2020 |
2015 | 2021 |
2011 | 2015 |
This table shows the initial and final years covered by each audit.
5.4 Adjusting Total Assessments for Annual Audit Periods
In this section, we focus on adjusting the totalassessments variable to ensure that it accurately reflects annual audit periods. The totalassessments variable represents the audit outcome, which can be either positive or negative. However, for the purpose of analyzing tax evasion, we are only interested in positive or zero values. Negative values are set to zero as in the following code.
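Based on the description above, the adjustment might be:

```stata
* Negative audit outcomes are not informative for evasion analysis,
* so they are set to zero
replace totalassessments = 0 if totalassessments < 0
```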
Since audits may cover multiple years, we allocate the totalassessments value across the audit period to reflect annual assessments.
The following command calculates the total number of years covered by the audit period.
To avoid division by zero, set any zero total_audit_period to 1:
Adjust Total Assessments:
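Taken together, the three adjustment steps above might be sketched as follows, with the variable names following those used in the text:

```stata
* Number of years covered by the audit
generate total_audit_period = 1 + (final_period - initial_period)

* Avoid division by zero
replace total_audit_period = 1 if total_audit_period == 0

* Spread the assessment evenly across the audited years
replace totalassessments = totalassessments / total_audit_period
```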
5.5 Reshaping Audit Data
The following code first initializes a new variable, audited_year, with missing values. This variable will later store the specific audited years for each taxpayer (tpin). Then, it expands the dataset so that each tpin has multiple observations—one for each year in their audit period. The expression 1 + (final_period - initial_period) calculates the number of years in the audit period, ensuring that the data now has a separate row for each audited year.
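The initialization and expansion described here might be written as:

```stata
* Placeholder for the specific audited year of each observation
generate audited_year = .

* One row per year covered by the audit period
expand 1 + (final_period - initial_period)
```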
After that, the code assigns specific audited years to each expanded observation by creating a sequence number (temp) for each tpin, calculating audited_year from initial_period, and ensuring data integrity by dropping any rows exceeding final_period. Finally, it cleans up by removing the temporary temp variable.
bysort tpin: gen temp = _n
replace audited_year = initial_period + temp - 1
drop if audited_year > final_period
drop temp
Finally, the code renames the variable audited_year to rtn_year to align with the corresponding variable in your tax return data. This standardization is crucial for accurately merging the audit data with the return data on both tpin and rtn_year.
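The rename is a single command:

```stata
* Align the audit data with the return data's year variable
rename audited_year rtn_year
```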
5.6 Dealing with Duplicates
In the context of audit data, a duplicate refers to multiple entries that have the same combination of key identifiers. These identifiers can vary depending on the dataset but typically include variables such as the return year, taxpayer identification number, and sometimes additional identifiers like month or specific audit-related fields. A duplicate occurs when more than one record has the same combination of these identifiers, indicating that the same audit entry has been recorded multiple times.
Our objective here is to identify and keep the most recent post-audit return (reported by the audit department) and discard older versions or reviewed tax assessments.
In this section, we use two approaches to handle duplicates: one based on assessment dates and the other without relying on assessment dates. Either approach can be applied if the required variables are available. However, we highly recommend using the assessment date-based approach whenever the variable is available.
5.6.1 Dealing with Duplicates with Assessment Date
The audit datasets often include key variables such as assessment_date, which indicates the final assessment made by the Revenue Authority on a firm’s tax filings for a specific tax category. This date is crucial for distinguishing between multiple records for the same tax entity within the same return year.
Since our objective is to identify and keep the most recent post-audit return (reported by the audit department) and discard older versions or reviewed tax assessments, this approach uses the assessment_date to retain only the latest entry for each return year and tax category. This ensures that our dataset reflects the most recent and relevant information, eliminating outdated or duplicated records.
5.6.1.1 Sort and Drop Duplicates Based on Assessment Date
First, the dataset is sorted by tpin, rtn_year, tax_head, initial_period, final_period, and assessment_date in descending order (line 550). This ensures that the latest assessment_date is prioritized.
Next, duplicates are dropped based on tpin, rtn_year, tax_head, initial_period, and final_period, keeping the latest assessment_date due to the previous sorting step.
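A sketch of these two steps, using gsort for the descending sort:

```stata
* Latest assessment first within each group
gsort tpin rtn_year tax_head initial_period final_period -assessment_date

* Keep only the first (latest) record in each group
duplicates drop tpin rtn_year tax_head initial_period final_period, force
```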
5.6.1.2 Sort and Organize Variables
The dataset is first sorted in ascending order by key identifiers (rtn_year, tpin, initial_period, and final_period) to ensure logical organization and facilitate analysis.
Next, the columns are reordered to place key identifiers (rtn_year and tpin) at the beginning, followed by industry and geographic details (isic_code, broad_sector, town_name, and province). This restructuring improves readability and simplifies merging with other datasets.
Lastly, the taxpayer identification number (tpin) is formatted to display consistently as a 15-character number without decimal places, ensuring clarity and uniformity in the data.
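The three organizing steps above might be sketched as:

```stata
* Logical ordering by key identifiers
sort rtn_year tpin initial_period final_period

* Key identifiers first, then industry and geographic details
order rtn_year tpin isic_code broad_sector town_name province

* Display tpin as a 15-digit number without decimal places
format tpin %15.0f
```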
5.6.1.3 Check for Duplicates
The first step in this part involves checking for duplicate entries. This is accomplished by examining the unique combinations of the variables rtn_year, tpin, initial_period, and final_period. By doing so, we ensure that no repeated records exist within the dataset. The command used for this operation is:
Following this, a new variable named dup1 is generated to tag duplicate entries. This variable is created using a command that identifies records with identical combinations of rtn_year, tpin, initial_period, and final_period. A value of 0 in dup1 indicates a unique record, while 1 or higher signifies duplicates:
Next, a summary table is created to display the distribution of unique and duplicate records. This table is critical for evaluating the extent of duplication in the dataset, as it provides an overview of how many records fall into each category:
Finally, for a more detailed review, a browsing window is opened to manually inspect the records flagged as duplicates. This step allows for a thorough visual verification of the flagged entries to ensure accuracy before any potential removal:
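A sketch of the four checks described above:

```stata
* Examine unique combinations of the key identifiers
duplicates report rtn_year tpin initial_period final_period

* Tag duplicates: 0 = unique, 1 or higher = duplicated
duplicates tag rtn_year tpin initial_period final_period, gen(dup1)

* Distribution of unique and duplicate records
tabulate dup1

* Manually inspect flagged records
browse if dup1 > 0
```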
5.6.1.4 Dropping Clear Duplicates
To address duplicate records, all entries identified as duplicates are removed, retaining only one record for each unique combination of rtn_year, tpin, initial_period, and final_period. This ensures that only unique records remain in the dataset:
After the removal of duplicates, the dataset is rechecked to confirm that no duplicates remain. Additionally, the auxiliary variable dup1, created during the duplication check process, is removed to maintain a clean dataset:
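These two steps might be written as:

```stata
* Keep one record per combination of the key identifiers
duplicates drop rtn_year tpin initial_period final_period, force

* Recheck that no duplicates remain, then remove the auxiliary flag
duplicates report rtn_year tpin initial_period final_period
drop dup1
```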
5.6.2 Dealing with Duplicates without Assessment Date
This approach to dealing with duplicates follows similar steps as before to clear duplicates in the audit data in case the assessment date does not exist.
5.7 Improvement and Checks
The command line 578 converts the text descriptions of audit types into numerical codes, creating a new variable auditcode. This transformation is typically used to make categorical data more manageable and efficient for statistical analysis.
The command line 579 adjusts the numeric codes in auditcode to new values, consolidating several codes into simpler categories, and generates a new variable audittype. For instance, codes 2 through 6 and 13 are grouped into a new code 1, and so on. This helps simplify the categories of audits into more generalized groups.
The command line 580 removes the intermediate variable auditcode used for recoding, as it is no longer needed after creating audittype.
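A sketch of the recoding steps, assuming the raw text variable is named audit_type; note that only the first group of the mapping (codes 2 through 6 and 13 into code 1) is given in the text:

```stata
* Convert audit type descriptions into numeric codes
encode audit_type, gen(auditcode)

* Consolidate detailed codes into broader groups
* (extend the mapping to cover all codes in your data)
recode auditcode (2/6 13 = 1), gen(audittype)

* The intermediate variable is no longer needed
drop auditcode
```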
The command lines 581 and 582 assign descriptive labels to each of the numeric codes in audittype, making the data more interpretable. Labels like “Comprehensive audit” and “Issue audit” are assigned to corresponding codes to clarify the nature of each audit category.
lab define type_lbl 1 "1.Comprehensive audit" ///
2 "2.Issue audit" ///
3 "3.Credibility check" ///
4 "4.CIT Audit" ///
5 "5.PIT Audit" ///
6 "6.VAT audit" ///
7 "7.Desk audit"
label val audittype type_lbl
Omit or add audit types where necessary and applicable.
The following is a list of examples of audit types, which is not exhaustive:
- Registration Check: A quick check on businesses to establish that they are correctly registered.
- Advisory Visit: A visit to newly established businesses advising obligations in terms of tax types, filing of returns, payment of amounts due, and records to be maintained.
- Compliance Advisor (“Soft Audit”): A message to the taxpayer pointing out action points that the taxpayer should check and explain. The taxpayer has a certain time to respond to the message.
- Record Keeping Audit (or Compliance Visit): The visit points out the obligations of the taxpayer regarding the keeping of records and includes checking the records, stocks, etc. The taxpayer needs to respond to any questions at the same time. This visit may be followed up with penalties if the taxpayer continues to disregard record-keeping requirements.
- Desk Audit: Basic checks conducted at the tax office when the auditor is confident that all necessary information can be ascertained through an in-office examination.
- Single (or Specific) Issue Audit: Focusing on a single tax type, a single period, or a single issue.
- Refund Examination: Verifying the taxpayer’s entitlement to a refund prior to processing the refund.
- Refund Audit: Similar to a refund examination, but includes a visit to the taxpayer’s premises.
- Audit Project: Audits organized as a separate project for specific groups of taxpayers. This could involve an industry (e.g., construction), a line of business (e.g., retail), or certain items from the declaration or profit and loss account (e.g., depreciation). Such audits are conducted to address a particular risk or to establish the degree of non-compliance in a particular sector.
- Comprehensive (or Full) Audit: A comprehensive audit covers all tax obligations over a number of tax periods. As these audits are time-consuming, they are usually based on a risk profiling of taxpayers. In some cases, they may be randomized.
- Fraud Investigation: This involves the most serious cases of non-compliance that have criminal implications, such as fraud, evasion, and criminal activity. Fraud investigations require special skills in investigation and evidentiary requirements and are therefore usually conducted by a specialized unit.
The command line 583 displays a frequency table of the different audit types using the newly labeled audittype variable. This helps verify the distribution and prevalence of each audit type in the dataset.
The command line 584 rearranges the order of variables in the dataset to place key audit-related variables at the forefront.
The command line 587 saves the current dataset as a STATA-format file named audit.dta in the Data folder.
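Assuming the Data folder path used elsewhere in the toolkit, the save command might be:

```stata
save "Data/audit.dta", replace
```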
6 Fourth Stage: Merging CIT, PIT, and VAT with the Audit Data
In this stage, we merge CIT, PIT, and VAT data with audit data. These three types are merged annually on a one-to-one basis. The necessary annual tax return data were created in the second stage. Depending on your tax return type, you can follow the respective section in the do-file that deals with this tax return type. Then, we create variables to indicate the audited firms (see the workflow for this stage in Figure 5). The instructions below for VAT data also apply to PIT and CIT.
6.1 Upload and Merge Your Audit and Tax Data
The first step in merging is to load the dataset audit.dta, as in line 601.
Next, the merge command (line 606) performs a one-to-one merge of the main dataset with the annual VAT dataset located in the Data folder, using the variables tpin and rtn_year. This merge helps in identifying firms with annual VAT returns that have been audited.
Merging audit and tax data by the return year the audit was carried out. This ensures the audit assessment is assigned to the correct return year for accurate analysis. It’s important to note that the name “vat_ann” refers to the collapsed VAT (annual) data, as described in Note 7, and this also applies to the case of PIT. Since audits are conducted on an annual basis, the final merge can only be performed using these annualized data versions rather than the monthly versions.
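Assuming the annualized VAT file is saved as vat_ann.dta in the Data folder (the exact path is an assumption), the merge might look like:

```stata
* One-to-one merge of audit data with annual VAT returns
merge 1:1 tpin rtn_year using "Data/vat_ann.dta"
```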
After merging, a special variable, _merge, is created to indicate the result of the merge operation. The recode command is used to change the values in _merge and create a new variable, vataudstat. This new variable categorizes firms based on their audit status:
- vataudstat = 1 for firms that have been audited and have matching
VAT return data (original _merge = 3).
- vataudstat = 0 for firms that have not been audited (original _merge
= 2).
- vataudstat = 2 for firms that have been audited but do not have
matching VAT return data (original _merge = 1).
The goal is to flag firms that are audited (either matched or unmatched in VAT data) and those that are not audited. Ideally, all audited firms should have matching VAT return data, so the occurrence of vataudstat = 2 should be minimal or non-existent.
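The recoding described above can be sketched as:

```stata
* Classify firms by audit status based on the merge result:
* 3 (matched) -> 1, 2 (not audited) -> 0, 1 (audited, no match) -> 2
recode _merge (3 = 1) (2 = 0) (1 = 2), gen(vataudstat)
```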
After that, in line 609, the _merge variable is then dropped as it is no longer needed.
This section of code in lines 601 and 602 defines labels for the different values of vataudstat and assigns these labels to the variable for clearer interpretation.
label define vatstat_lbl 1 "1.Match & Audited" ///
0 "0.No Match not Audited" ///
2 "2.No Match & Audited"
label val vataudstat vatstat_lbl
Then, the (tabulate) command is used in line 612 to view the distribution of vataudstat, and the order command arranges the variables in the dataset for easier viewing, as in line 613.
The code lines 607 and 608 create a dummy variable ‘vataudited’ to identify firms that have been audited, which is necessary for the bottom-up approach.
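A minimal sketch; the exact condition defining an audited firm is an assumption:

```stata
* Dummy for audited firms with matching VAT return data
generate vataudited = (vataudstat == 1)
```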
6.2 Tracking Audit History
The following part of the code in lines 620 and 621 creates two new variables based on the existing “audit” variable, which indicates if a firm has been audited.
Line 620 calculates the total number of times each firm (identified by TPIN, which stands for Taxpayer Identification Number) has been audited, while line 621 creates a dummy variable that indicates whether a firm has ever been audited for VAT.
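A sketch of these two commands, assuming the audit indicator is vataudited and using a hypothetical name n_vataudits for the count:

```stata
* Total number of VAT audits per taxpayer
bysort tpin: egen n_vataudits = total(vataudited)

* Ever audited for VAT (1 if audited in any year)
bysort tpin: egen ever_vataudited = max(vataudited)
```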
The command line 623 displays the frequency distribution of the variable ‘vataudited’.
Lines 624 and 625 label the variables ‘rtn_year’ and ‘vataudited’ for easier interpretation. ‘label var’ assigns labels to variables. After executing these commands, ‘rtn_year’ will be labeled as “Return Year” and ‘vataudited’ as “VAT Audit Status” in the dataset.
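The two labeling commands are:

```stata
label var rtn_year "Return Year"
label var vataudited "VAT Audit Status"
```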
The command in line 626 links the value labels defined for vataudited to the vataudited variable. The next two commands, of lines 627 and 628, define the specific value labels for this variable. Specifically, they assign the label “Unaudited” to the value 0 and the label “VAT Audited” to the value 1 within the vataudited label set. The replace option ensures any existing label for value 0 is updated, and the add option incorporates the new label for value 1 into the existing set. This labeling helps in clearly identifying whether a firm has been audited for VAT when analyzing the dataset.
label values vataudited vataudited
label define vataudited 0 "Unaudited", replace
label define vataudited 1 "VAT Audited", add
The command line 629 defines a label for the ever_vataudited variable, where a value of 0 is labeled as “never VAT Audited”, and line 630 adds a label for the ever_vataudited variable, where a value of 1 is labeled as “ever VAT Audited”.
label define ever_vataudited 0 "never VAT Audited", replace
label define ever_vataudited 1 "ever VAT Audited", add
The command on line 635 saves the merged version of VAT with Audit data as a STATA-format file named vataudit.dta in the Data folder.
Finally, we use the following command to close the currently open log file, which records all the commands you run and their corresponding output, allowing you to keep a record of your session for later reference or documentation.
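The command is:

```stata
* Close the currently open log file
log close
```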
The following section of the code, from line 638 to 708, repeats the same process of merging VAT with audit data, but for CIT from lines 638 to 671 and for PIT from lines 674 to 708.