The European Commission uses the concept of the VAT gap to estimate the amount of non-compliance with VAT legislation. The VAT gap measures the difference between the amount of VAT that should be paid and the VAT actually paid by taxpayers. VAT undercollection is a problem that all European Union member states have to face and solve.

The abundant data from filed tax returns and other sources can be exploited by machine learning to assess whether a taxpayer is compliant.

Semi-supervised learning is used for classification when only a fraction of the observations have corresponding class labels. This situation arises in many real-life classification problems, such as image search (Fergus et al., 2009), genomics (Shi and Zhang, 2011), natural language parsing (Liang, 2005), and speech analysis (Liu and Kirchhoff, 2013). Similarly, tax departments have abundant unlabeled data on taxpayers, but obtaining audit results (class labels) is expensive and cannot be done for all taxpayers.
To the best of the author's knowledge, deep learning based on the generative semi-supervised learning paradigm has not previously been used for taxpayer audit selection.

Can a tax department use data on unaudited taxpayers to predict, with high accuracy, the tax yield of a tax audit? The answer lies in the development of probabilistic models for inductive and transductive semi-supervised learning that utilize an explicit model of the data density, following recent advances in deep generative models and scalable variational inference (Kingma and Welling, 2014; Rezende et al., 2014).

The basic algorithm for semi-supervised learning is the self-training scheme (Rosenberg et al., 2005), in which a classifier is retrained on labelled data acquired from its own predictions.
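
The self-training scheme can be illustrated with a minimal sketch. This is not the method of Rosenberg et al. (2005) or of the author; the nearest-centroid classifier, the distance-based confidence score, the threshold, and the two toy features per taxpayer are all assumptions chosen only to keep the example short and dependency-free.

```python
import math


def centroids(points, labels):
    """Return the mean point of each class (a nearest-centroid 'model')."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        s = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(model, p):
    """Return (label, confidence). Assumes at least two classes.

    Confidence is an illustrative heuristic: 1 - d1 / (d1 + d2), where d1 and
    d2 are the distances to the nearest and second-nearest centroid.
    """
    dists = sorted((math.dist(p, c), y) for y, c in model.items())
    (d1, y1), (d2, _) = dists[0], dists[1]
    conf = 1.0 - d1 / (d1 + d2) if d1 + d2 > 0 else 0.5
    return y1, conf


def self_train(labeled, unlabeled, threshold=0.7, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled points and refit."""
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(points, labels)
        confident, rest = [], []
        for p in pool:
            y, conf = predict(model, p)
            (confident if conf >= threshold else rest).append((p, y))
        if not confident:  # stopping rule: nothing new to add
            break
        for p, y in confident:  # treat the model's own predictions as labels
            points.append(p)
            labels.append(y)
        pool = [p for p, _ in rest]
    return centroids(points, labels), pool


# Hypothetical toy data: two audited taxpayers, five unaudited ones.
labeled = [((0.0, 0.0), "compliant"), ((4.0, 4.0), "non_compliant")]
unlabeled = [(0.5, 0.2), (3.8, 4.1), (1.0, 1.2), (3.0, 3.1), (2.0, 2.0)]
model, leftover = self_train(labeled, unlabeled, threshold=0.7)
```

In this toy run the point halfway between the two classes never reaches the confidence threshold and stays in `leftover`. Lowering the threshold would pseudo-label it anyway, which is how a self-training loop can end up reinforcing its own poor predictions.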
A number of repetitions is performed until a preset goal is achieved. Because the scheme is a heuristic, poor predictions might be reinforced. Transductive SVMs (TSVM) (Joachims, 1999) extend SVMs with the aim of max-margin classification while ensuring that as few unlabeled observations as possible lie close to the margin. Optimizing these approaches and applying them to large datasets of unlabeled data is difficult.

Graph-based methods are popular. They create a graph connecting similar observations, and label information is propagated between labelled and unlabeled nodes by finding the minimum energy (MAP) configuration (Blum et al., 2004; Zhu et al., 2003). For graph-based approaches the graph structure is crucial and eigen-analysis of the graph Laplacian is required, which limits the scale to which these