README Part 1: Guidance for Bottom-Up Tax Gap Toolkit

Part 1: Data Management

1 Overview

This toolkit, developed by UNU-WIDER/ATI, provides a comprehensive framework for conducting bottom-up tax gap analyses. It utilizes machine learning algorithms to estimate Value Added Tax (VAT), Personal Income Tax (PIT), and Corporate Income Tax (CIT) gaps, specifically designed for Revenue Authorities.

The toolkit consists of two code files: the first part is data management, which is used to clean the tax return and audit data and merge them, so it is ready for machine learning analysis using the second code file.

Version:

Date: 22/11/24

Prerequisites:

To use this toolkit, we must have Stata software installed on your computer. Additionally, the data requirements depend on the tax gap you want to analyze, such as CIT, PIT, and/or VAT return data, in addition to audit data.

2 Usage Guidelines

This part of the code consists of four main stages:

First Stage: This stage involves setting up the working environment by installing and updating Stata packages, configuring directory paths, and importing CIT, PIT, and VAT data. It also includes renaming and formatting variables to ensure consistency.

Second Stage: This stage focuses on managing CIT, PIT, and VAT data. The processes include removing duplicates, appending data if it comes in multiple files, sorting, reordering variables, and handling duplicates for appended data.

Third Stage: Similar to the second stage, this stage involves data management for audit data. The processes include removing duplicates and appending data if it comes in multiple files. In addition to correcting any inconsistent formats in audit date variables.

Fourth Stage: In this stage, CIT, PIT, and VAT data are merged with the audit data.

Throughout these stages, it is crucial to handle tax return and audit data variables carefully. Some of these variables should be present in your data, although they might have different names. Therefore, it is advisable to rename and format them according to the instructions provided in these stages. Additionally, some variables are created temporarily during the data management and cleaning process. Moreover, certain variables that do not exist in the return and audit tax data are created as they are necessary for the next part of the code, which involves using a machine learning model to estimate tax gaps.

Table 1:

Variable Name	Type/Category	Definition
tpin	VAT, CIT, PIT, Audit / O, Ness	A Taxpayer Identification Number (TPIN) is a unique identifier assigned to each taxpayer by tax authorities for identification and administrative purposes, used to track and manage tax obligations. TPIN ensures accurate processing of tax returns, payments, and related activities. Different countries may refer to TPINs using various terms.
rtn_year	VAT, CIT, PIT, Audit / O, Ness	Return year, the specific year in which the tax return was filed by the taxpayer.
month	VAT, PIT / O, Ness	The month during which the VAT/PIT return was filed, important for tracking monthly returns.
isic_code	VAT, CIT, PIT / O, Ness	International Standard Industrial Classification (ISIC) Code: A numeric code, used to classify the economic activities of the taxpayer. This code may also consist of 2 or 4 digits if available.
broad_sector	VAT, CIT, PIT / O, Unn	The broad sector classification of the taxpayer’s economic activities is indicated by a single-digit code, providing a general overview of the sector.
regdate	VAT, CIT, PIT / O, Unn	Registration date, the date when the taxpayer registered for VAT/CIT/PIT.
town_name	VAT, CIT, PIT, Audit / O, Ness	Name of the town, city, or locality where the taxpayer is located. This is used for geographical analysis and audit planning, taking into account the administrative structure of the specific country or region
province	VAT, CIT, PIT, Audit / O, Ness	Name of the administrative division where the taxpayer is located. This could refer to a province, state, region, district, or any other relevant administrative unit, depending on the country.
returnid	VAT, CIT, PIT / O, Unn	Return ID, a unique identifier for each tax return.
returnstatus	VAT, CIT, PIT / O, Unn	Status of the tax return, indicating whether the return was filed, processed, or pending.
returntype	VAT, CIT, PIT / O, Unn	Type of tax return filed, indicating whether it is a regular return, amendment, etc.
invat	VAT / O, Ness	Input VAT, the amount of VAT the taxpayer paid on their purchases, expressed in local currency.
outvat	VAT / O, Ness	Output VAT, the amount of VAT the taxpayer collected on their sales, expressed in local currency.
vat_due	VAT / O, Ness	VAT due, the net amount of VAT the taxpayer needs to pay (output VAT minus input VAT), expressed in local currency.
vat_vars	VAT / O, Ness	Other relevant VAT variables required for analysis.
totalfixedassets	CIT / O, Ness	Total fixed assets, the value of the taxpayer’s fixed assets used for business operations, expressed in local currency.
accumulated-depreciation	CIT / O, Ness	Accumulated depreciation, the total amount of depreciation claimed on the taxpayer’s fixed assets, expressed in local currency.
gsalary	PIT / O, Ness	Gross salary refers to the total compensation paid to employees before any deductions, including benefits, allowances, and other forms of remuneration, expressed in local currency.
pit_due	PIT / O, Ness	PIT due, the amount of personal income tax the taxpayer needs to pay, expressed in local currency.
pit_vars	PIT / O, Ness	Other relevant PIT variables required for analysis.
TotalTax	Audit / O, Ness	The total tax amount determined by the audit, representing the taxpayer’s total tax liability as assessed by the audit, is expressed in local currency. This variable was renamed totalassessments.
totalassessments_annual	Audit / N, Ness	Adjusted annual assessment calculated by dividing total assessments by the number of years covered in the audit period.
total_audit_period	Audit / N, Ness	Number of years covered by the audit period.
assessment_date	Audit / O, Ness	The date on which the tax assessment was conducted. It is crucial for tracking the timeline of the audit process.
audit_started_date	Audit / O, Ness	The date when the audit process began. Important for determining the duration and timeline of the audit.
audit_completion_date	Audit / O, Ness	The date when the audit process was completed. Essential for calculating the total duration of the audit.
audit_period	Audit / O, Ness	Represents a range or multiple periods covered by the audit. It may include multiple dates or periods relevant to the audit.
auditperiodfrom	Audit / O, Ness	The starting year of the audit period. This variable captures the first year that the audit covers, helping to determine the range of the audit’s coverage.
auditperiodto	Audit / O, Ness	The ending year of the audit period. This variable indicates the final year covered by the audit, important for determining the total audit period length.
totalassessments	Audit / O, Ness	The total number of assessments conducted for a given taxpayer, used to quantify the audit activity and identify the frequency of audits.
initial_period	Audit / N, Ness	First year covered by the audit. Important for understanding the range of periods the audit covers.
final_period	Audit / N, Ness	Final year covered by the audit. Important for understanding the range of periods the audit covers.
auditcode	Audit / T, Ness	Encoded variable representing different types of audits, used to classify and analyze the nature of audits conducted.
audittype	Audit / N, Ness	Reclassified variable for audit types based on auditcode, providing a more detailed classification of audit types such as comprehensive audit, issue audit, credibility check, etc.
started_date	Audit / N, Ness	A reformatted variable that captures only the month and year when the audit started. Used to standardize date formats for analysis.
audit_completion	Audit / N, Ness	A reformatted variable that captures only the month and year when the audit was completed. Used to standardize date formats for analysis.
tax_head	Audit / O, Ness	Tax type or head, indicating the specific tax category being addressed (e.g., CIT, PIT, VAT).
latest0	Temporary / T, Ness	Generated variable used to identify the latest return year for a given taxpayer and period (year or month) in the initial dataset.
latest1	Temporary / T, Ness	Generated variable used to identify the latest return year for a given taxpayer and period (year or month) after appending datasets or additional processing steps.
dup0	Temporary / T, Ness	Temporary variable to tag duplicates for further inspection, helping identify and manage duplicate records in the dataset.
dup1	Temporary / T, Ness	Temporary variable to tag duplicates for further inspection, helping identify and manage duplicate records in the dataset.
date_problem	Audit / T, Ness	Flag indicating if there are problems with the dates (e.g., audit start date is after the completion date).
s	Audit / T, Ness	Temporary variable for swapping dates (start date). Used to correct cases where start date is after the completion date.
e	Audit / T, Ness	Temporary variable for swapping dates (end date). Used to correct cases where start date is after the completion date.
p	Audit / T, Ness	Temporary variable used for splitting audit_period into individual years.
p1	Audit / T, Ness	First segment of the split audit_period.
p2	Audit / T, Ness	Second segment of the split audit_period.
p3	Audit / T, Ness	Third segment of the split audit_period.
p4	Audit / T, Ness	Fourth segment of the split audit_period.
p22	Audit / T, Ness	Intermediate variable used to store the final year if third segment is missing.
p33	Audit / T, Ness	Intermediate variable used to store the first intermediate year if third segment is present.
sm	Audit / T, Ness	Month of audit_started_date.
sy	Audit / T, Ness	Year of audit_started_date.
em	Audit / T, Ness	Month of audit_completion_date.
ey	Audit / T, Ness	Year of audit_completion_date.
vataudstat	Merge / T, Ness	Variable indicating the audit status of VAT returns, showing whether the return was matched and audited.
_merge	Merge / T, Ness	A temporary variable created by the merge command to indicate the match status of the merged datasets.
nr_audits	Merge / T, Ness	Temporary variable to count the number of audits for a taxpayer, used to create ever audited variables.
vataudited	Merge / N, Ness	Dummy variable indicating if the VAT return was audited.
ever_vataudited	Merge / N, Ness	Dummy variable indicating if the taxpayer has ever been audited for VAT.
citaudstat	Merge / T, Ness	Variable indicating the audit status of CIT returns, showing whether the return was matched and audited.
citaudited	Merge / N, Ness	Dummy variable indicating if the CIT return was audited.
ever_citaudited	Merge / N, Ness	Dummy variable indicating if the taxpayer has ever been audited for CIT.
pitaudstat	Merge / T, Ness	Variable indicating the audit status of PIT returns, showing whether the return was matched and audited.
pitaudited	Merge / N, Ness	Dummy variable indicating if the PIT return was audited.
ever_pitaudited	Merge / N, Ness	Dummy variable indicating if the taxpayer has ever been audited for PIT.

Note:
1. O: Original variable, part of the original dataset or source.
2. T: Temporary variable, used for intermediate steps in data cleaning processing.
3. N: Newly created variable during data processing.
4. Ness: Necessary variable for the analysis of the tax gaps.
5. Unn: Unnecessary variable for the analysis of the tax gaps, might be excluded.

3 First Stage

This stage aims to ensure all necessary packages are installed and updated, prepare the STATA environment, set up directory paths, and import and check the data files, as illustrated in Figure 1.

Figure 1: Workflow for the First Stage

3.1 Install and Update Necessary Packages

Before starting this stage, we need to ensure that the STATA packages required for this stage are installed. This can be done by running the following code lines 49 and 52:

capture ssc install usepackage
usepackage unique mdesc savesome xls2dta rmfiles, near

Note 1:

Remove the “*” before these two lines in the dofile. Other packages can also be added if needed in lines 49 and 52.

3.2 Preparing Stata Environment

Lines 60 to 64 of the code configure the environment to ensure smooth and efficient script execution by disabling output pauses, clearing matrices, setting wider output lines and larger matrix sizes, and safely closing any open log files.

version 14  /* Use STATA Version 14 */

set more off
clear matrix
set linesize 200
set matsize 800
cap log close

The command ‘version 14’ makes STATA interpret and run the code according to the syntax and features available in STATA version 14, regardless of the actual version currently running. The data files will also be saved in STATA 14 so anyone can access the saved files and results with a lower software version.

3.3 Setting Up Global Directory Paths

To enhance the code’s flexibility and ease of updating, we employ a “global” command. This command acts as a macro, allowing for the storage of file paths that can be reused throughout the STATA session.

Before executing the code from lines 67 to 75, you need to select a location on your computer. In this chosen location, create a folder named “Cleaning.” This folder will serve as a repository for the outputs from this stage. Within the “Cleaning” folder, you should create six separate subfolders as follows:

Excel: This folder is for storing your raw data.
Data: Here, the cleaned data will be placed.
Graphs: This folder is to save the graphs.
Dofile: This folder is for saving dofiles.
Results: This folder is for storing results.
Log: This folder is used to store log files.

Note 2:

In the dofile, at line 67, insert the path of your ‘Cleaning’ folder between the quotation marks.


global dir "<YOUR PATH>"

global data     "$dir\Data"
global graphs   "$dir\Graphs"
global xlsout   "$dir\Excel"
global dopath   "$dir\Dofile"
global results  "$dir\Results"
global log      "$dir\Log"

The following code line aim is to start recording the session’s output to a log file for documentation or review purposes.

log using "$log\Tax Gaps data prep.log", replace

Attention: The following steps in this stage aim to import the CIT, PIT, or VAT return data files and rename their variables. You can utilize the code based on the type of tax data you have. In case you are dealing with two or three of these tax types, you should repeat these steps for each type of tax separately. If your tax data for any of these tax types comes only in one file, please ignore the last step in this stage.

3.4 Importing Data, Renaming Variables, and Saving the Data

Line 81 of code is utilized to import the raw data located in a directory specified by the local macro ‘xlsout’. You need to replace “<Your_CIT,PIT,or VAT_returns_Data_1st_File>” with your CIT, PIT, or VAT data filename.

import delimited "$xlsout\<Your_CIT,PIT,or VAT_returns_Data_1st_File>.dsv",clear

Note 3:

1- ‘import delimited’ is employed to import a delimited text file format (such as CSV, TSV, DSV). If the file is in a different delimited text format (such as CSV), ensure to adjust the extension after the filename accordingly.

2- If the data file is in xlsx format, we use the ‘import excel’ command like the following example:

import excel "$xlsout\<Your_CIT,PIT,or VAT_returns_Data_2nd_File>.xlsx", clear

Line 83 formats the pseudo tax ID variable, named ‘tpin’ in this example, to make it visible with a 15-character wide field, with no decimal places.

format tpin %15.0g

Note 4 :

Replace ‘tpin’ with the name of the pseudo tax ID in your data. In some countries, the pseudo tax ID is a string combination of numbers and letters. In this case, the following code can be followed before formatting, as in our case.

* Remove all letters from the string
forvalues i = 1/40 {
     replace tipn = regexr(tipn, "[A-Za-z]", "") 
}

* Convert the resulting string to a numeric variable
destring tipn, replace

Lines 87 to 90 are utilized for renaming existing variables in your dataset to new names if applicable. For example,

rename old_var_name new_var_name // Example

rename actvty_isic_code isic_code
rename provincename province
rename rtn_id returnid
rename form_name returntype

Note 5:

Additional checks must be made on formats for filing and return dates, fiscal year, province, and industry codes. Here we assume that the dates for the tax return take the corrected format; however, you should verify these variables. If there is any inconsistency in the dates, you can correct them by following the instructions provided for dealing with date variables related to audit data in section 5.3.

Line 95 saves the data in the “Data” folder in STATA format.

Save "$data\<Your_CIT_returns_Data_1st_File>.dta", replace

3.5 Importing, Renaming, and Saving Additional Data Files if the Data Comes in Multiple Files

Lines 98 to 107 repeat the preceding steps of importing, formatting, renaming, and saving the data. This applies only if the data comes in multiple files.

import delimited "$xlsout\<Your_CIT,PIT,or VAT_returns_Data_2nd_File>.dsv", clear
format tpin %15.0g

rename actvty_isic_code isic_code
rename provincename province
rename rtn_id returnid
rename form_name returntype

Please see Note 5.

save "$data\<Your_CIT,PIT,or VAT_returns_Data_2nd_File>.dta", replace

Note 6:

1- When renaming your variables, ensure that the names of your variables are the same for both data files in case your data comes in more than one file.

2- Revise the instructions in Note 3 about importing the data.

4 Second Stage: Data Management Process for VAT, CIT, and PIT

Attention:

The following steps of this stage are general steps applicable for VAT, PIT, and/or CIT, depending on which type of tax return data you are dealing with. If you have more than one of these types, ensure that you apply the same steps for each type.
This stage in the do-file is organized to deal with each of these tax types separately. You only need to go to the part that deals with the type of tax you have and load your data file following the instructions below.
The provided instructions use the variable names from the VAT section in the do-file. You should be consistent and use the variables that reflect the same variables used in each example in the do-file from your data.
If the VIT, CIT, and PIT return data comes in more than one file, you should follow the second and third steps in sections 4.2 and 4.3. Otherwise, you can ignore these steps and go directly to the fourth step in section 4.4, as illustrated in Figure 2.

Figure 2: Workflow for the Second Stage

4.1 Dealing with Duplicates

What is a Duplicate for Tax Return Data?

In this context, a duplicate refers to multiple entries that share the same combination of key identifiers. These identifiers typically include variables such as the return year, taxpayer identification number (TPIN), and sometimes additional identifiers like the type of tax or filing period.

Tax authorities sometimes make multiple reviews or assessments, resulting in multiple records for the same period (e.g., month, year) and taxpayer. In this scenario, a duplicate refers to these multiple records that pertain to the same tax return.

Our objective is to identify and keep the most recent tax return (self-reported by the taxpayer) and discard older versions or reviewed/assessed returns.

The following code line loads the STATA data file of VAT, PIT, or CIT. Here, you have to place your STATA filename instead of ‘<Your_CIT, PIT, or VAT_returns_Data_1st_File>’. The clear option clears any existing data in memory before loading this file:

use "$data\<Your_CIT, PIT, or VAT_returns_Data_1st_File>.dta", clear

If there are duplicates, the following command tags duplicate observations. It then creates a new variable “dup0” which marks duplicates. If “dup0” is greater than 0, the observation has duplicates; if it’s 0, it’s unique.

duplicates tag rtn_year tpin month, gen(dup0)

The following code line displays a frequency table of the dup0 variable, showing how many duplicates and unique records are in the dataset.

tabulate dup0

Then we use the “browse” command to open a data browser to inspect the records flagged as duplicates. This displays the (rtn_year, tpin, month, broad_sector, outvat, dup0) variables for manual verification.

browse rtn_year tpin month broad_sector outvat dup0 if dup0 == 1

Since our objective is to identify and retain the most recent tax return (self-reported by the taxpayer) and discard older versions or reviewed/assessed returns. The following part of the code identifies the most recent return year (rtn_year) for each key identifier. In your case, the key identifiers could be the variables that represent, for example, taxpayer ID, fiscal year, or filing month. In our case, we use TPIN (Taxpayer Identification Number) and month.

The following code line identifies the most recent VAT return year for each taxpayer (TPIN) and month among duplicates. Then, the code creates a new variable latest0 to store this value and uses the function max(rtn_year) to indicate the most recent return year within each group of TPIN and month.

bys tpin month: egen latest0 = max(rtn_year) if dup0 != 0

Then we use the following code line to drop records that are not the latest return for each tpin and month group where duplicates exist.

drop if rtn_year != latest0 & dup0 != 0

The following commend line ensures that the dataset is cleaned of any duplicate records for the same taxpayer and period, keeping only one unique entry for each combination of return year, taxpayer ID, and filing month.

duplicates drop rtn_year tpin month, force

Then the following command re-tags the duplicate observations using the same criteria and generates the same variable dup0. Since dup0 already exists from the first command, STATA will overwrite it with the new tags.

duplicates tag rtn_year tpin month, gen(dup0)

In the following code line, STATA checks the condition for all observations. If every observation satisfies the condition (i.e., dup0 is 0 for all rows), the command completes successfully without any output. If there are any observations where dup0 is not 0 (indicating a duplicate), STATA will throw an error and stop executing the script. This helps ensure that the dataset is indeed free of duplicates before proceeding.

assert dup0 == 0

At the end of dealing with the duplicates process, we drop all temporary created variables by executing the following line.

drop dup0 latest0

4.2 Appending Datasets

If data comes in multiple files, execute following code to append the data from the file ‘<Your_CIT, PIT, or VAT_returns_Data_2nd_File>.dta’ to the currently loaded dataset in STATA ‘<Your_CIT, PIT, or VAT_returns_Data_1st_File>.dta’.

append using "$data\<Your_CIT, PIT, or VAT_returns_Data_2nd_File>.dta"

After appending, you might want to check if there are empty columns in the combined dataset. For that ‘browse’ command opens a window to view the dataset.

browse

4.3 Dealing with Duplicates if Data Comes in Multiple Files

The following code does the same process of dealing with duplicates in section 4.1, but now for the appended tax return.

duplicates tag rtn_year tpin month, gen(dup1)  
tabulate dup1 
browse rtn_year tpin month broad_sector outvat dup1 if dup1 == 1  


bys tpin month: egen latest1 = max(rtn_year) if dup1 != 0  
drop if rtn_year != latest1 & dup1 != 0  

duplicates drop rtn_year tpin month, force 
duplicates tag rtn_year tpin month, gen(dup1)  
assert dup1 == 0  
drop dup1 latest1

4.4 Sorting, Reordering, Collapsing, and Saving Data

The following code line sorts the dataset first by “rtn_year” and then by “tpin” variables.

sort rtn_year tpin

The following code line changes the order of the variables in the dataset.

order rtn_year tpin isic_code broad_sector regdate town_name ///
province returnid returnstatus returntype

Note 7:

For VAT and PIT, we aggregate their variables by summing them up for each unique combination of TPIN (Taxpayer Identification Number) and return year (rtn_year). This process reduces the dataset to one observation per taxpayer per year. Unlike VAT and PIT, CIT is already reported annually, so no further aggregation is necessary. Additionally, PAYE data requires careful aggregation to an annual level due to the inconsistency in filing patterns, as some firms do not file throughout the year.

See Examples 1 and 2 below for how to collapse the VAT and PIT data, and then save the new annual, cleaned VAT and PIT data as vat_ann and pit_ann, respectively, to distinguish them from the monthly data.

Examples

Example 1: Collapse VAT to annual level and save as STATA data

collapse (sum) invat outvat vat_due vat_vars, by(tpin rtn_year)
save "$data\vat_ann.dta", replace

Example 2: Collapse PIT to annual level and save as STATA data

collapse (sum) pit_due pit_vars, by(tpin rtn_year) 
save "$data\pit_ann.dta", replace

Example 3: Saving CIT as STATA data

save "$data\cit.dta", replace

5 Third Stage: Data Management Process for Audit Return Data

The aim of the Third Stage is to manage audit return data by importing, renaming, formatting, and handling duplicates, ensuring consistent date formats, and improving the dataset.The workflow for this stage is illustrated in Figure 3.

Figure 3:Workflow for the Third Stage

5.1 Importing Audit Data

The command on line 296 imports the audit data from an Excel file named ‘.xlsx’ located in the directory folder Excel. It takes the data from ‘Sheet1’ of that Excel file. The ‘firstrow’ option indicates that the first row of the Excel sheet contains column headers.

import excel "$xlsout\<Your_Audits_Data_1st_File>.xlsx", sheet("Sheet1") firstrow clear

The code lines 299 to 301 are executed to ensure that the variable names in the audit data match those in CIT, PIT, and VAT.

rename TPIN tpin
 
rename TOTALTAX totalassessments

Note 8:

In code line 301,“TOTALTAX” in our case is the outcome after an audit. This variable name may change depending on what the Revenue Authority calls it. We rename it to total assessments as it denotes the final value after reassessment due to an audit. Entries can be negative, zero or positive. In Section 5.4, we discuss its treatment in the toolkit.

Line 303 formats the pseudo tax ID variable, named ‘tpin’ in this example, to make it visible with a 15-character wide field, with no decimal places.

format tpin %15.0g

Note 9:

Additional checks must be made on formats for audit and return dates, fiscal year, province and industry codes. Verify these variables and check the tax assessment variable.

Line 308 saves the Audit return data in the “Data” folder in Stata format.

save "$data\<Your_Audits_Data_1st_File>.dta", replace

The next part of the code, from lines 311 to 319, repeats the same process of importing, formatting, and saving audit data, but for “Your_Audits_Data_2nd_File”.

import delimited "$xlsout\<Your_Audits_Data_2nd_File>.xlsx",clear

rename TPIN tpin
 
rename TOTALTAX totalassessments

format tpin %15.0g  
save "$data\<Your_Audits_Data_2nd_File>.dta", replace

Note 10:

The last part of this step is needed if audit data comes in more than one file (e.g., two files). In this case, please see Note 8 and 9.

5.2 Appending Audit Datasets

Note 11:

The following step is only needed if the data comes in Multiple Files.

The code line 324 appends the data from the file ‘’ to the currently loaded dataset in STATA ‘.dta’.

append using "$data\<Your_Audits_Data_2nd_File>.dta"

After appending, you can use the ‘browse’ command in line 325 to check whether there are empty columns in the combined dataset.

The code line 328 saves all audit data into one file.

save "$data\<combined_audit_data>.dta", replace

5.3 Inspecting the date variables and ensuring consistent formats

In audit data, dates often appear in inconsistent formats, which can cause significant issues if they are not converted to STATA formats. Examples of these dates include variables such as assessment_date, audit_started_date, audit_completion_date, and audit_period. To ensure the data is accurate and usable, it is essential to check these date variables and make the necessary adjustments to achieve consistent date formats.

At the beginning, imagine that Table 2 has some of the observations of your date variables below. The initial look at these variables reveals that they have some inconsistent formats. How do you convert them to STATA formats to ease your analysis?

Table 2:

assessment_date	audit_started_date	audit_completion_date	audit_period
1-Nov	1.7.20147	13/8/20120	2015-16
05/012/2018	10-Apr	13/JU//21	2015/16-2016/17
16/21207	17-Nov	25/001/2017	2015-208
3.1.12018	3.1.12018	31/3/218	2015/16-2020/21
Feb-21	Apr-21	DEC-21	2011/2013/2014/2015
01/05/2019	01/07/2017	13/08/2020	2016 TO 2020
01/06/2018	01/04/2019	13/06/2020	2016 to 2022
01/06/2017	17/11/2017	25/01/2017	2016/2017-2019/2020

Attention:

The entire Section 5.3 is designed to explain the code in the do-file by providing additional examples, as shown in Table 2, to illustrate how users can utilize the code.

Users may not encounter any issues with their audit data dates, making some of the following substeps unnecessary. Conversely, users might face unique problems with date variables that require tailored solutions.

The provided code aims to demonstrate various methods to identify, address, and correct these inconsistencies, ultimately standardizing date formats for accurate analysis. Following this step ensures a unified and clean format for date variables, as illustrated in Tables 4 and 5.

Section 5.3 provides four options for identifying the audit period. The chosen option depends on the variables available in the audit data. The first best option is to use variables auditperiodfrom and auditperiodto or audit_period to identify the period covered by the audit. If the audit data includes these variables, we recommend using them in subsection 5.3.6. The second best option is to use variables audit_started_date and audit_completion_date to create the period covered by the audit. The third option is to use an audited year (=rtn_year) variable, this can be used to identify the year covered by the audit. The final option is to use an audit_completion_date variable which assumes that the audit outcome will be applied to the returns in that specific year when the audit was finalized. In all cases, users must complete all the steps from 5.3.1 to 5.3.5 before moving to subsection 5.3.6. Users may need to adjust the codes in subsection 5.3.6 to match the available variables.

5.3.1 Inspect the Date Variables

First, we can display a sample of the data to get an initial look at the date values in the dataset by executing code line 349 to list the first 20 observations of the variables.

list assessment_date audit_started_date audit_completion_date  in 1/20

Then, we use the tabulate command in code lines 352 to 354 to create frequency tables for the assessment_date, audit_started_date, and audit_completion_date variables. The missing option includes missing values in the tabulation. This helps in identifying the different formats and potential issues with the date values.

tabulate assessment_date, missing
tabulate audit_started_date, missing
tabulate audit_completion_date, missing

Note 12 :

Although the tabulated date variable entries look like dates, in the above steps, the entries/cells are not recognized as dates by STATA. After performing the steps needed to clean these entries, we demonstrate how to tell STATA to format these variables to be recognized as dates in section 5.3.3.

After that, the following section of the code from lines 357 to 364 checks for unique formats and issues in assessment_date, audit_started_date, and audit_completion_date. It first sorts the dataset by each date variable to order the entries. Then, it attempts to convert the dates to the “DMY” (day-month-year) format. If the conversion fails, it further checks if the dates can be converted to “MDY” (month-day-year), “YMD” (year-month-day), or “MYD” (month-year-day) formats. If all these conversions fail, the browse command displays these problematic entries in the Data Browser for manual inspection and correction. This process is repeated for each date variable to ensure consistency and accuracy in date formatting.

sort assessment_date
browse if missing(date(assessment_date, "DMY")) & ///
    missing(date(assessment_date, "MDY")) & ///
    missing(date(assessment_date, "YMD")) & ///
    missing(date(assessment_date, "MYD"))

sort audit_started_date
browse if missing(date(audit_started_date, "DMY")) & ///
    missing(date(audit_started_date, "MDY")) & ///
    missing(date(audit_started_date, "YMD")) & ///
    missing(date(audit_started_date, "MYD"))

sort audit_completion_date
browse if missing(date(audit_completion_date, "DMY")) & ///
    missing(date(audit_completion_date, "MDY")) & ///
    missing(date(audit_completion_date, "YMD")) & ///
    missing(date(audit_completion_date, "MYD"))

The Table 3 below shows observations that need manual correction in the next step after implementing the previous code.

Table 3

assessment_date	audit_started_date	audit_completion_date
05/012/2018	10-Apr	13/JU//21
1-Nov	1.7.20147	13/8/20120
16/21207	17-Nov	25/001/2017
3.1.12018	3.1.12018	31/3/218
Feb-21	Apr-21	DEC-21

Note 13:

Audit_period is handled differently as it represents a range or multiple periods.

5.3.2 Manual Corrections

Note 14:

This step is needed only if the browse commands in the previous step indicate cases that need manual corrections. Here, we replace these incorrect date formats with the correct ones. You need to amend the following example to fit into your case.

The following code section corrects specific date format issues identified in the previous steps for assessment_date, audit_started_date, and audit_completion_date. Each replace command sets the date to the correct “DMY” (day-month-year) format before the if condition, and the incorrect date format identified previously is placed after the if condition. For example, replace assessment_date = “01/05/2019” if assessment_date == ” 1-May” changes assessment_date values from ” 1-May” to “01/05/2019”. This process is repeated for each incorrect date entry identified, ensuring all dates are standardized to the correct format. Additional corrections can be added as necessary to handle all identified date format issues.

Example:

* Corrections for assessment_date
replace assessment_date = "01/11/2019" if assessment_date == "1-Nov"
replace assessment_date = "05/12/2018" if assessment_date == "05/012/2018"
replace assessment_date = "16/12/2007" if assessment_date == "16/21207"
replace assessment_date = "01/02/2021" if assessment_date == "Feb-21"
replace assessment_date = "03/01/2018" if assessment_date == "3.1.12018" 
* Add more corrections as needed

* Corrections for audit_started_date
replace audit_started_date = "01/07/2017" if audit_started_date == "1.7.20147"
replace audit_started_date = "10/04/2019" if audit_started_date == "10-Apr"
replace audit_started_date = "17/11/2017" if audit_started_date == "17-Nov"
replace audit_started_date = "01/04/2021" if audit_started_date == "Apr-21"
replace audit_started_date = "03/01/2018" if audit_started_date == "3.1.12018" 
* Add more corrections as needed

* Corrections for audit_started_date
replace audit_started_date = "01/07/2017" if audit_started_date == "1.7.20147"
replace audit_started_date = "10/04/2019" if audit_started_date == "10-Apr"
replace audit_started_date = "17/11/2017" if audit_started_date == "17-Nov"
replace audit_started_date = "01/04/2021" if audit_started_date == "Apr-21"
replace audit_started_date = "03/01/2018" if audit_started_date == "3.1.12018" 
* Add more corrections as needed

Once you address all the incorrect formats just like the example of the previous code, you can double check by running code lines 357 to 364 again, and if the browser window does not show any observation you have already addressed all the inconsistent date formats.

5.3.3 Identify and Correct Remaining Incorrect Dates to Stata format

This section of the code from lines 389 to 407 is designed to identify and correct any remaining incorrect dates in the dataset, converting them to Stata’s date format. The local date_vars command creates a local macro containing the date variables assessment_date, audit_started_date, and audit_completion_date. The foreach loop iterates over each variable in the date_vars list to perform the corrections. Within the loop, a new variable is generated for each date variable with the suffix _new, converting the date to the “DMY” (day-month-year) format. The format command ensures that the new date variable is displayed in the proper date format.

Subsequent replace commands attempt to correct any remaining missing values by trying different date formats: “DM20Y”, “MDY”, and “Y”. If there are still missing values, a new temporary variable with the suffix _numeric is generated to handle non-date entries, converting them to numeric format using the destring command. The replace command then fills in any remaining missing values in the new date variable with these numeric values. Finally, the temporary _numeric variable is dropped to clean up the dataset.

local date_vars assessment_date audit_started_date audit_completion_date

* Update the corrected dates after manual corrections
foreach var of local date_vars {
    * Generate the corrected date variable
    generate `var'_new = date(`var', "DMY")
    format `var'_new %td
    replace `var'_new = date(`var', "DM20Y") if missing(`var'_new)
    replace `var'_new = date(`var', "MDY") if missing(`var'_new)
    replace `var'_new = date(`var', "Y") if missing(`var'_new)

    * Handle remaining cases where date conversion failed
    generate `var'_numeric = `var' if missing(`var'_new)
    destring `var'_numeric, replace force
    replace `var'_new= `var'_numeric if missing(`var'_new)
    
    * Clean up temporary variables
    drop `var'_numeric  
}

The following command line 410 removes the original date variables (assessment_date, audit_started_date, and audit_completion_date) from the dataset using the drop command. After correcting and standardizing these dates, dropping the old variables ensures only the properly formatted date variables remain, keeping the dataset clean and consistent.

drop assessment_date audit_started_date audit_completion_date

In the do-file, the code section from lines 420 to 422 renames the new corrected date variables to their original names. The rename command changes assessment_date_new back to assessment_date, audit_started_date_new back to audit_started_date, and audit_completion_date_new back to audit_completion_date.

rename assessment_date_new  assessment_date
rename audit_started_date_new  audit_started_date
rename audit_completion_date_new  audit_completion_date

After these steps, you now have these date variables in a unified format to work with, as shown in Table 4.

Table 4

assessment_date	audit_started_date	audit_completion_date
01jun2018	01apr2019	13jun2020
01may2019	01jul2017	13aug2020
01nov2019	01jul2017	13aug2020
05dec2018	10apr2019	13jun2021
16dec2007	17nov2017	25jan2017
01jun2017	17nov2017	25jan2017
03jan2018	03jan2018	31mar2018
01feb2021	01apr2021	01dec2021

5.3.4 Checking problems

The aim of this step is to identify, flag, and correct instances where the audit_started_date is later than the audit_completion_date.

The code line 420 creates a new variable called date_problem. It assigns a value of 1 to date_problem if audit_started_date is greater than audit_completion_date, indicating a potential issue where the start date is after the completion date

generate date_problem=(audit_started_date > audit_completion_date)

The code line 421 generates a frequency table for the date_problem variable, showing how many records have a start date that is after the completion date and how many do not. This helps in assessing the extent of the issue.

tabulate date_problem

Applying this to the example we discussed earlier, we can notice that the table indicates two observations that have date issues, as shown in the following screenshot in Figure 4.

Figure 4:Frequency of Date Problems Detected in Audit Data

The code line 422 shows a window of these two observations that have problems, which is shown in Table 5.

browse audit_started_date audit_completion_date if date_problem==1

Table 5

audit_started_date	audit_completion_date
17nov2017	25jan2017
17nov2017	25jan2017

The code line 424 sets the date_problem variable to missing (.) for records where either audit_started_date or audit_completion_date is missing. This ensures that incomplete records are not flagged incorrectly.

replace date_problem=. if audit_started_date==. | audit_completion_date==.

The code section from lines 434 to 439 corrects records in which the audit_started_date is later than the audit_completion_date. It first creates temporary variables, s and e, to store the original audit_started_date and audit_completion_date, respectively. The replace commands then swap these dates for records flagged with date problems (date_problem == 1), assigning the start date to the completion date and vice versa. Finally, the temporary variables s, e, and date_problem are dropped to clean up the dataset, ensuring that only corrected dates remain.

generate s = audit_started_date
generate e = audit_completion_date
replace audit_completion_date = s if date_problem==1
replace audit_started_date = e if date_problem==1

drop s e date_problem

5.3.5 Amending the dates to keep only the month and year

The code section from lines 442 to 453 generates new date variables in a monthly format. For audit_started_date, it creates sm and sy to extract the month and year, respectively, then combines them into started_date using the ym function, formats it as a monthly date (%tm), and labels it “Date when audit started.” Similarly, for audit_completion_date, it creates em and ey for the month and year, combines them into audit_completion, formats it, and labels it “Date when audit ended.” After generating these new date variables, the temporary variables sy, sm, ey, and em are dropped.This ensures that these date variables are consistently formatted and properly labeled.

generate sm=month(audit_started_date)
generate sy=year(audit_started_date)
generate started_date=ym(sy,sm)
format started_date %tm
label var started_date "Date when audit started"

generate em=month(audit_completion_date)
generate ey=year(audit_completion_date)
generate audit_completion=ym(ey,em)
format audit_completion %tm
label var audit_completion "Date when audit ended"
drop sy sm ey em

5.3.6 Dealing with the period coverd by audit

The following section of the code is designed to inspect the audit_period variables for incorrect date formats. The audit period indicates the years covered by an audit and is crucial for accurate analysis.

5.3.6.1 Inspecting the Audit Period Variables

Firstly, the code displays a sample of the audit_period variable to provide an initial look at the data:

list audit_period in 1/20

Some revenue authorities may provide the audit period in two separate variables, such as auditperiodfrom and auditperiodto. If this is the case in your dataset, you should inspect these variables as well:

list auditperiodfrom auditperiodto in 1/20

This initial inspection helps you understand the structure and format of your audit period data.

Secondly,we will identify different and incorrect date formats that need to be corrected, the code generates a summary of unique values using the tabulate command. This helps spot inconsistencies and errors in the date formats of your audit period variables:

tabulate audit_period, missing

// If using separate variables for audit period
tabulate auditperiodfrom, missing
tabulate auditperiodto, missing

5.3.6.2 Manual Corrections

Note 15:

This step is necessary only if the inspection in the previous step indicates cases that require manual corrections. Here, you replace incorrect date formats with the correct ones. Amend the following examples to fit your specific case.

Example 1: Correcting audit_period Variable

If your audit_period variable contains incorrect date formats. The below code tries to correct the audit_period variable problems provided in the example that we gave earlier in Table 1.

replace audit_period = "2015 2016" if audit_period == "2015-16"
replace audit_period = "2015 2017" if audit_period == "2015/16-2016/17"
replace audit_period = "2015 2018" if audit_period == "2015-208"
replace audit_period = "2015 2021" if audit_period == "2015/16-2020/21"
replace audit_period = "2016 2018" if audit_period == "2016 TO 18"
// Add more corrections as needed based on your data inspection

Example 2: Creating audit_period from Separate Variables

If your data provides the audit period in two separate variables (auditperiodfrom and auditperiodto), you can create a unified audit_period variable:

tostring auditperiodfrom auditperiodto, replace 
generate audityearfrom = substr(auditperiodfrom, 1, 4)
generate audityearto   = substr(auditperiodto, 1, 4)
generate audit_period  = audityearfrom + " " + audityearto

This code converts the variables to string format, extracts the year components, and combines them into a single audit_period variable. Ensure that the year extraction (substr) is appropriate for your data format.

Then, the next section of the code cleans up the audit_period variable by replacing any character that is not a digit (0-9) or a space with a space. The forvalues i = 1/10 { … } loop runs the replacement 10 times to ensure all non-digit characters are replaced. The regexr(audit_period, “[^0-9 ]”, ” “) function uses a regular expression to identify any character that is not a digit or space and replaces it with a space. For example, if audit_period contains”2017-2018” or “2017 to 2018”, these entries will be converted to “2017 2018”, standardizing the format across the dataset.

forvalues i = 1/10 {
    replace audit_period = regexr(audit_period, "[^0-9 ]", " ")
}

The following section of the code extracts and labels the individual components of the audit_period variable. It begins by duplicating audit_period into a new variable p to preserve the original data.

generate p= audit_period

The split p, destring command then splits p into separate numeric components p1, p2, p3, and p4 based on spaces.

split p,destring

Next, it generates initial_period from p1 and labels it “First year covered by auditing”.

gen initial_period=p1
label var initial_period "First year covered by auditing"

For the final period, it creates p22 from p2 if p3 is missing, and replaces p22 with p4 if necessary, then generates final_period from p22 and labels it “Final year covered by auditing”.

gen p22 = p2 if p3==.
replace p22 = p4 if p22==.
gen final_period=p22
label var final_period "Final year covered by auditing"

Finally, it drops the temporary variables p, p1, p2, p3, p4, p22 and the original audit_period to clean up the dataset. This process ensures that all periods covered by auditing are clearly extracted and labeled for further analysis.

drop p p1 p2 p3 p4 p22 p33 audit_period

This process applied to audit_period will result in variables resembling the following:

Table 6

initial_period	final_period
2016	2022
2016	2020
2015	2016
2015	2017
2015	2018
2016	2020
2015	2021
2011	2015

This table shows the initial and final years covered by each audit.

5.4 Adjusting Total Assessments for Annual Audit Periods

In this section, we focus on adjusting the totalassessments variable to ensure that it accurately reflects annual audit periods. The totalassessments variable represents the audit outcome, which can be either positive or negative. However, for the purpose of analyzing tax evasion, we are only interested in positive or zero values. Negative values are set to zero as in the following code.

replace totalassessments = 0 if totalassessments < 0

Since audits may cover multiple years, we allocate the totalassessments value across the audit period to reflect annual assessments.

The following commend, calculates number of total audit period.

generate total_audit_period= 1 + (final_period - initial_period)

To avoid division by zero, set any zero total_audit_period to 1:


replace total_audit_period = 1 if total_audit_period == 0

Adjust Total Assessments:

generate totalassessments_annual = totalassessments / total_audit_period

5.5 Reshaping Audit Data

The following code first initializes a new variable, audited_year, with missing values. This variable will later store the specific audited years for each taxpayer (tpin). Then, it expands the dataset so that each tpin has multiple observations—one for each year in their audit period. The expression 1 + (final_period - initial_period) calculates the number of years in the audit period, ensuring that the data now has a separate row for each audited year.

gen audited_year = .   

expand 1 + (final_period - initial_period)

After that, the code assigns specific audited years to each expanded observation by creating a sequence number (temp) for each tpin, calculating audited_year from initial_period, and ensuring data integrity by dropping any rows exceeding final_period. Finally, it cleans up by removing the temporary temp variable.

bysort tpin: gen temp = _n
replace audited_year = initial_period + temp - 1
drop if audited_year > final_period
drop temp

Finally, the code renames the variable audited_year to rtn_year to align with the corresponding variable in your tax return data. This standardization is crucial for accurately merging the audit data with the return data on both tpin and rtn_year.

rename audited_year rtn_year

5.6 Dealing with Duplicates

What is a Duplicate in Audit Data?

In the context of audit data, a duplicate refers to multiple entries that have the same combination of key identifiers. These identifiers can vary depending on the dataset but typically include variables such as the return year, taxpayer identification number, and sometimes additional identifiers like month or specific audit-related fields. A duplicate occurs when more than one record has the same combination of these identifiers, indicating that the same audit entry has been recorded multiple times.

Our objective here is to identify and keep the most recent post-audit return (reported by the audit department) and discard older versions or reviewed tax assessments.

In this section, we use two approaches to handle duplicates: one based on assessment dates and the other without relying on assessment dates. Either approach can be applied if the required variables are available. However, we highly recommend using the assessment date-based approach whenever the variable is available.

5.6.1 Dealing with Duplicates with Assessment Date

The audit datasets often include key variables such as assessment_date, which indicates the final assessment made by the Revenue Authority on a firm’s tax filings for a specific tax category. This date is crucial for distinguishing between multiple records for the same tax entity within the same return year.

Since our objective is to identify and keep the most recent post-audit return (reported by the audit department) and discard older versions or reviewed tax assessments.This approach uses the assessment_date to retain only the latest entry for each return year and tax category. This ensures that our dataset reflects the most recent and relevant information, eliminating outdated or duplicated records.

5.6.1.1 Sort and Drop Duplicates Based on Assessment Date

First, the dataset is sorted by tpin, rtn_year, tax_head, initial_period, final_period, and assessment_date in descending order (line 550). This ensures that the latest assessment_date is prioritized.

gsort +tpin +rtn_year +tax_head +initial_period +final_period ///
      -assessment_date

Next, duplicates are dropped based on tpin, rtn_year, tax_head, initial_period, and final_period, keeping the latest assessment_date due to the previous sorting step.

duplicates drop tpin rtn_year tax_head initial_period final_period, force

5.6.1.2 Sort and Organize Variables

The dataset is first sorted in ascending order by key identifiers (rtn_year, tpin, initial_period, and final_period) to ensure logical organization and facilitate analysis.

sort rtn_year tpin initial_period final_period

Next, the columns are reordered to place key identifiers (rtn_year and tpin) at the beginning, followed by industry and geographic details (isic_code, broad_sector, town_name, and province). This restructuring improves readability and simplifies merging with other datasets.

order rtn_year tpin initial_period final_period  isic_code ///
    broad_sector town_name province

Lastly, the taxpayer identification number (tpin) is formatted to display consistently as a 15-character number without decimal places, ensuring clarity and uniformity in the data.

format tpin %15.0g

5.6.1.3 Check for Duplicates

The first step in this part involves checking for duplicate entries. This is accomplished by examining the unique combinations of the variables rtn_year, tpin, initial_period, and final_period. By doing so, we ensure that no repeated records exist within the dataset. The command used for this operation is:

unique rtn_year tpin initial_period final_period

Following this, a new variable named dup1 is generated to tag duplicate entries. This variable is created using a command that identifies records with identical combinations of rtn_year, tpin, initial_period, and final_period. A value of 0 in dup1 indicates a unique record, while 1 or higher signifies duplicates:

duplicates tag rtn_year tpin initial_period final_period, gen(dup1)

Next, a summary table is created to display the distribution of unique and duplicate records. This table is critical for evaluating the extent of duplication in the dataset, as it provides an overview of how many records fall into each category:

tabulate dup1

Finally, for a more detailed review, a browsing window is opened to manually inspect the records flagged as duplicates. This step allows for a thorough visual verification of the flagged entries to ensure accuracy before any potential removal:

browse rtn_year tpin totalassessments dup1 if dup1==1

5.6.1.4 Dropping Clear Duplicates

To address duplicate records, all entries identified as duplicates are removed, retaining only one record for each unique combination of rtn_year, tpin, initial_period, and final_period. This ensures that only unique records remain in the dataset:

duplicates drop rtn_year tpin initial_period final_period , force

After the removal of duplicates, the dataset is rechecked to confirm that no duplicates remain. Additionally, the auxiliary variable dup1, created during the duplication check process, is removed to maintain a clean dataset:

unique rtn_year tpin initial_period final_period   
drop  dup1

5.6.2 Dealing with Duplicates without Assessment Date

This approach to dealing with duplicates follows similar steps as before to clear duplicates in the audit data in case the assessment date does not exist.

unique rtn_year initial_period  final_period tpin  

duplicates tag rtn_year initial_period  final_period tpin , gen(dup0)
duplicates drop rtn_year initial_period final_period tpin , force 
drop dup0

5.7 Improvement and Checks

The command line 578 converts the text descriptions of audit types into numerical codes, creating a new variable auditcode. This transformation is typically used to make categorical data more manageable and efficient for statistical analysis.

encode AuditType, gen(auditcode)

The command line 579 adjusts the numeric codes in auditcode to new values, consolidating several codes into simpler categories, and generates a new variable audittype. For instance, codes 2 through 6 and 13 are grouped into a new code 1, and so on. This helps simplify the categories of audits into more generalized groups.

recode auditcode (2/6 13=1) (8/12 14=2) (7=3), gen(audittype)

The command line 580 removes the intermediate variable auditcode used for recoding, as it is no longer needed after creating audittype.

drop auditcode

The command lines 581 and 582 assigns descriptive labels to each of the numeric codes in audittype, making the data more interpretable. Labels like “Comprehensive audit” and “Issue audit” are assigned to corresponding codes to clarify the nature of each audit category.

lab define type_lbl 1 "1.Comprehensive audit" ///
    2 "2.Issue audit" ///
    3 "3.Credibility check" ///
    4 "4.CIT Audit" ///
    5 "5.PIT Audit" ///
    6 "6.VAT audit" ///
    7 "7.Desk audit"
label val audittype type_lbl

Note 16:

Omit or add audit type where necessary and applicable.

The following is a list of examples of audit types, which is not exhaustive:

Registration Check: A quick check on businesses to establish that they are correctly registered.
Advisory Visit: A visit to newly established businesses advising obligations in terms of tax types, filing of returns, payment of amounts due, and records to be maintained.
Compliance Advisor (“Soft Audit”): A message to the taxpayer pointing out action points that the taxpayer should check and explain. The taxpayer has a certain time to respond to the message.
Record Keeping Audit (or Compliance Visit): The visit points out the obligations of the taxpayer regarding the keeping of records and includes checking the records, stocks, etc. The taxpayer needs to respond to any questions at the same time. This visit may be followed up with penalties if the taxpayer continues to disregard record-keeping requirements.
Desk Audit: Basic checks conducted at the tax office when the auditor is confident that all necessary information can be ascertained through an in-office examination.
Single (or Specific) Issue Audit: Focusing on a single tax type, a single period, or a single issue.
Refund Examination: Verifying the taxpayer’s entitlement to a refund prior to processing the refund.
Refund Audit: Similar to a refund examination, but includes a visit to the taxpayer’s premises.
Audit Project: Audits organized as a separate project for specific groups of taxpayers. This could involve an industry (e.g., construction), a line of business (e.g., retail), or certain items from the declaration or profit and loss account (e.g., depreciation). Such audits are conducted to address a particular risk or to establish the degree of non-compliance in a particular sector.
Comprehensive (or Full) Audit: A comprehensive audit covers all tax obligations over a number of tax periods. As these audits are time-consuming, they are usually based on a risk profiling of taxpayers. In some cases, they may be randomized.
Fraud Investigation: This involves the most serious cases of non-compliance that have criminal implications, such as fraud, evasion, and criminal activity. Fraud investigations require special skills in investigation and evidentiary requirements and are therefore usually conducted by a specialized unit.

The command line 583 displays a frequency table of the different audit types using the newly labeled audittype variable. This helps verify the distribution and prevalence of each audit type in the dataset.

tabulate audittype

The command line 584 rearranges the order of variables in the dataset to place key audit-related variables at the forefront.

order tpin initial_period final_period audittype started_date
audit_completion totalassessments

The command line 587 saves the current dataset as a STATA-format file named audit.dta in the Data folder.

save "$data\audit.dta", replace

6 Fourth Stage: Merging CIT, PIT, and VAT with the Audit Data

In this stage, we merge CIT, PIT, and VAT data with audit data. These three types are merged annually on a one-to-one basis. The necessary annual tax return data were created in the second stage. Depending on your tax return type, you can follow the respective section in the do-file that deals with this tax return type. Then, we create variables to indicate the audited firms (see the workflow for this stage in Figure 5). The instructions below for VAT data also apply to PIT and CIT.

Figure 5:Workflow for the Fourth Stage

6.1 Upload and Merge Your Audit and Tax Data

The first step in merging is to load the dataset audit.dta, as in line 601.

Use "$data\audit.dta", clear

Next, the merge command (line 606) performs a one-to-one merge of the main dataset with the annaual VAT dataset located at Data folder using the variables tpin and rtn_year. This merge helps in identifying firms with annual VAT returns that have been audited.

merge 1:1 tpin rtn_year using "$data\vat_ann.dta"

Note 17:

Merging audit and tax data by the return year the audit was carried out. This ensures the audit assessment is assigned to the correct return year for accurate analysis. It’s important to note that the name “vat_ann” refers to the collapsed VAT (annual) data, as described in Note 7, and this also applies to the case of PIT. Since audits are conducted on an annual basis, the final merge can only be performed using these annualized data versions rather than the monthly versions.

After merging, a special variable, _merge, is created to indicate the result of the merge operation. The recode command is used to change the values in _merge and create a new variable, vataudstat. This new variable categorizes firms based on their audit status:

vataudstat = 1 for firms that have been audited and have matching VAT return data (original _merge = 3).
vataudstat = 0 for firms that have not been audited (original _merge = 2).
vataudstat = 2 for firms that have been audited but do not have matching VAT return data (original _merge = 1).

The goal is to flag firms that are audited (either matched or unmatched in VAT data) and those that are not audited. Ideally, all audited firms should have matching VAT return data, so the occurrence of vataudstat = 2 should be minimal or non-existent.

Note 18:

This depends only on the data structure, and it is only possible if audit data is split by tax head (CIT, VAT, PIT).

recode _merge (3=1) (2 =0) (1 =2), gen(vataudstat)

After that, in line 609, the _merge variable is then dropped as it is no longer needed.

drop _merge

This section of code in lines 601 and 602 defines labels for the different values of vataudstat and assigns these labels to the variable for clearer interpretation.

label define vatstat_lbl 1 "1.Match & Audited" ///
    0 "0.No Match not Audited" ///
    2 "2.No Match & Audited"
label   val vataudstat vatstat_lbl

Then, the (tabulate) command is used in line 612 to view the distribution of vataudstat, and the order command arranges the variables in the dataset for easier viewing, as in line 613.

tabulate vataudstat
order   tpin rtn_year vataudstat

The code lines 607 and 608 create a dummy variable ‘vataudited’ to identify firms that have been audited that is nessesery for Bottom-up approach.

generate vataudited=1 if vataudstat==1|vataudstat==2 
replace vataudited=0 if vataudstat==0|vataudstat==.

6.2 Tracking Audit History

The following part of the code in lines 620 and 630 creates two new variables based on the existing “audit” variable, which indicates if a firm has been audited.

Line 620 calculates the total number of times each firm (identified by TPIN, which stands for Taxpayer Identification Number) has been audited, while line 621 creates a dummy variable that indicates whether a firm has ever been audited for VAT.

egen nr_audits=total(audit), by(tpin)
generate ever_vataudited = (nr_audits>0)

The command line 623 displays the frequency distribution of the variable ‘vataudited’.

tabulate  vataudited

Lines 624 and 625 label the variables ‘rtn_year’ and ‘vataudited’ for easier interpretation. ‘label var’ assigns labels to variables. After executing these commands, ‘rtn_year’ will be labeled as “Return Year” and ‘vataudited’ as “VAT Audit Status” in the dataset.

label var rtn_year "Return Year"
label var vataudited "VAT Audit Status"

The command in line 626 links the value labels defined for vataudited to the vataudited variable. The next two commands, of lines 627 and 628, define the specific value labels for this variable. Specifically, they assign the label “Unaudited” to the value 0 and the label “VAT Audited” to the value 1 within the vataudited label set. The replace option ensures any existing label for value 0 is updated, and the add option incorporates the new label for value 1 into the existing set. This labeling helps in clearly identifying whether a firm has been audited for VAT when analyzing the dataset.

label   values vataudited       vataudited
label   define  vataudited  0   "Unaudited", replace
label   define  vataudited  1   "VAT Audited", add

The command line 629 defines a label for the ever_vataudited variable, where a value of 0 is labeled as “never VAT Audited”, and line 630 adds a label for the ever_vataudited variable, where a value of 1 is labeled as “ever VAT Audited”.

label   define  ever_vataudited 0   "never VAT Audited", replace
label   define  ever_vataudited 1   "ever VAT Audited", add

The command on line 635 saves the merged version of VAT with Audit data as a STATA-format file named vataudit.dta in the Data folder.

save "$data\vataudit.dta", replace

Finally, we use the following command to close the currently open log file, which records all the commands you run and their corresponding output, allowing you to keep a record of your session for later reference or documentation.

log close

The following section of the code, from line 638 to 708, repeats the same process of merging VAT with audit data, but for CIT from lines 638 to 671 and for PIT from lines 674 to 708