The European commission uses the concept of VAT GAP to estimate the amount of
non-compliance with the VAT legislation. The VAT GAP measures the difference
between the amount of VAT that should be paid and the actual VAT paid by the taxpayers.
VAT undercollection is a problem all European Union member states have
to face and solve.
The abundant data from filed tax returns and other sources can be exploited by machine
learning in order to assess whether a taxpayer is complaint.
Semi-supervised learning is used for classification when a fraction of the observations
have corresponding class labels. In many real life classification problems, like
image search (Fergus et al., 2009), genomics (Shi and Zhang, 2011), natural language
parsing (Liang, 2005), and speech analysis (Liu and Kirchhoff, 2013). Similarly tax
departments have abundant unlabeled data for taxpayers, but obtaining audit results
(class labels) is expensive and impossible to be performed on all taxpayers.
To the author best knowledge deep learning, based on generative semi-supervised
learning paradigm has never been used until now for taxpayer audit selection.
“Can a Tax Department use data of unaudited taxpayers to predict with high accuracy
the tax yield in case of a tax audit?
The answer is the development of probabilistic models for inductive and transductive
semi-supervised learning by utilizing an explicit model of the data density, following
the recent advances in deep generative models and scalable variational inference
(Kingma andWelling, 2014; Rezende et al., 2014).
The basic algorithm for semi-supervised learning is the self-training scheme (Rosenberg
et al., 2005) where labelled data acquired from its own predictions. A number
of repetitions is performed until a preset goal is achieved. Poor predictions might
be reinforced because these are based on heuristics. Transductive SVMs (TSVM)
(Joachims, 1999) extend SVMs with the aim of max-margin classification while ensuring
that few predictions close to the margin are utilized. Optimization and utilization
of these approaches to large datasets of unlabeled data is difficult.
Graph-based methods are popular, and create a graph connecting similar observations,
when the minimum energy (MAP) configuration is found, the label information
is propagated between labelled and unlabeled nodes(Blum et al., 2004, Zhu et
al., 2003). For Graph-based approaches the graph structure is crucial and eigenanalysis
of the graph Laplacian is required, which limits the scale to which these

x

Hi!
I'm Kyle!

Would you like to get a custom essay? How about receiving a customized one?

Check it out