The model provides an embedding or feature representation of the data of all taxpayers. The features are then used to train a separate classifier. The information acquired allows for the clustering of related features in a hidden space.

\newline

\newline

A deep generative model of both audited and unaudited taxpayer data provides a more robust set of hidden (latent) features. The generative model used is:

\newline

\newline

\begin{math}
p(\textbf{z}) = \mathcal{N}(\textbf{z}|\textbf{0},\textbf{I}); \quad p_\theta(\textbf{x}|\textbf{z}) = f(\textbf{x};\textbf{z},\boldsymbol\theta), \quad (1)
\end{math}

\newline

\newline

where \begin{math} f(\textbf{x};\textbf{z},\boldsymbol\theta) \end{math} is a Gaussian distribution whose probabilities are formed by non-linear functions (deep neural networks), with parameters \begin{math} \boldsymbol\theta \end{math}, of a set of hidden (latent) variables \textbf{z}.

\newline

\newline

Approximate samples from the posterior distribution over the hidden (latent) variables \ensuremath{p(\textbf{z}|\textbf{x})} (the probability distribution that represents the updated beliefs about the hidden variables after the model has seen the data) are used as features to train a classifier, such as a Support Vector Machine (SVM), that predicts whether a material audit yield will result if a taxpayer is audited (y). This approach enables the classification to be performed in a lower-dimensional space, since we typically use hidden (latent) variables whose dimensionality is much less than that of the observations.

These low dimensional embeddings should now also be more easily separable since we make use of independent hidden (latent) Gaussian posteriors whose parameters are formed by a sequence of non-linear transformations of the data.

\newline

\newline

\textbf{Generative semi-supervised model (Model 2): }

A probabilistic model describes the data as being generated by a hidden (latent) class variable y in addition to a continuous hidden (latent) variable \textbf{z}. The model used is:

\newline

\newline

\begin{math}
p(y) = Cat(y|\boldsymbol\pi); \quad p(\textbf{z}) = \mathcal{N}(\textbf{z}|\textbf{0},\textbf{I}); \quad p_\theta(\textbf{x}|y,\textbf{z}) = f(\textbf{x}; y, \textbf{z}, \boldsymbol\theta), \quad (2)
\end{math}

\newline

\newline

where \ensuremath{Cat(y|\boldsymbol\pi)} is the multinomial distribution. The class labels y are treated as hidden (latent) variables if no class label is available, and \textbf{z} are additional hidden (latent) variables. These hidden (latent) variables are marginally independent.

As in Model 1, \begin{math} f(\textbf{x}; y, \textbf{z}, \boldsymbol\theta) \end{math} is a Gaussian distribution, parameterized by a non-linear function (a deep neural network) of the hidden (latent) variables.

\newline

\newline

Since most labels y are unobserved, we integrate over the class of any unlabeled data during the inference process, thus performing classification as inference (that is, the label is obtained by computing a posterior distribution over the classes). The inferred posterior distribution over y is used to predict any missing labels.
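The idea of classification as inference can be illustrated with a minimal sketch. It assumes that a score \ensuremath{\log p(\textbf{x}, y)} is already available for every class of an unlabelled record; the function name \texttt{class\_posterior} and the numbers are hypothetical and are not taken from the model described here. Summing over the classes gives the marginal \ensuremath{p(\textbf{x})}, and Bayes' rule gives the posterior over the label.

```python
import numpy as np

def class_posterior(log_joint):
    """Turn per-class scores log p(x, y) into p(y | x) and log p(x)."""
    log_joint = np.asarray(log_joint, dtype=float)
    # log p(x) = log sum_y p(x, y), computed stably with the log-sum-exp trick.
    shift = log_joint.max(axis=-1, keepdims=True)
    log_marginal = shift.squeeze(-1) + np.log(np.exp(log_joint - shift).sum(axis=-1))
    # Bayes' rule: p(y | x) = p(x, y) / p(x).
    posterior = np.exp(log_joint - log_marginal[..., None])
    return posterior, log_marginal

# Hypothetical scores for three unlabelled taxpayers and two classes
# (column 0 = bad, column 1 = good); the values are made up for illustration.
log_joint = np.array([[-4.0, -2.0],
                      [-1.0, -3.5],
                      [-2.0, -2.0]])
posterior, log_marginal = class_posterior(log_joint)
predicted = posterior.argmax(axis=1)  # most probable label per taxpayer
```

Each row of \texttt{posterior} sums to one, so a missing label can be filled in either with the most probable class or with the full probability vector.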

\\

\\

\textbf{Stacked generative semi-supervised model: }

The two models can be stacked together: \textbf{Model 1} first learns a new hidden (latent) representation \textbf{z$_1$} using the generative model, and the generative semi-supervised \textbf{Model 2} is then trained using \textbf{z$_1$} instead of the raw data (\textbf{x}).

The outcome is a deep generative model with two layers:

\newline

\newline

\begin{math} p_\theta(\textbf{x}, y, \textbf{z}_1, \textbf{z}_2) = p(y)\,p(\textbf{z}_2)\,p_\theta(\textbf{z}_1|y, \textbf{z}_2)\,p_\theta(\textbf{x}|\textbf{z}_1) \end{math}

\newline

\newline

where the priors \ensuremath{p(y)} and \ensuremath{p(\textbf{z}_2)} equal those of y and \textbf{z} above, and both \ensuremath{p_\theta(\textbf{z}_1|y, \textbf{z}_2)} and \ensuremath{p_\theta(\textbf{x}|\textbf{z}_1)} are parameterized as deep neural networks.

The exact posterior distribution is intractable to compute because of the nonlinear, non-conjugate dependencies between the random variables. To allow tractable and scalable inference and parameter learning, recent advances in variational inference (Kingma and Welling, 2014; Rezende et al., 2014) are utilized. A fixed-form distribution \ensuremath{q_\phi(\textbf{z}|\textbf{x})} with parameters \ensuremath{\phi} is introduced that approximates the true posterior distribution \begin{math} p(\textbf{z}|\textbf{x}) \end{math}.

\newline

\newline

The variational principle is used to derive a lower bound on the marginal likelihood of the model. This bound serves as the objective function: maximizing it also ensures that the approximate posterior is as close as possible to the true posterior. The approximate posterior distribution \begin{math} q_\phi(\cdot) \end{math} is constructed as an inference or recognition model (Dayan, 2000; Kingma and Welling, 2014; Rezende et al., 2014; Stuhlmuller et al., 2013).

\newline

\newline

With the use of an inference network, a single set of global variational parameters \begin{math} \phi \end{math} is learned instead of separate variational parameters for each data point. This allows for fast inference at both training and testing time, because the cost of inference is shared across the posterior estimates for all hidden (latent) variables through the parameters of the inference network. An inference network is introduced for each of the hidden (latent) variables, and these networks are parameterized as deep neural networks. Their outputs form the parameters of the distribution \ensuremath{q_\phi(\cdot)}.

\newline

\newline

For the latent-feature discriminative model (Model 1), we use a Gaussian inference network \begin{math} q_\phi(\textbf{z}|\textbf{x}) \end{math} for the hidden (latent) variable \textbf{z}. For the generative semi-supervised model (Model 2), an inference model is introduced for the hidden (latent) variables \textbf{z} and y, which is assumed to have the factorized form \begin{math} q_\phi(\textbf{z}, y|\textbf{x}) = q_\phi(\textbf{z}|\textbf{x})\,q_\phi(y|\textbf{x}) \end{math}, specified as Gaussian and multinomial distributions, respectively.

\newline

\newline

\textbf{Model 1:} \ensuremath{q_\phi(\textbf{z}|\textbf{x}) = \mathcal{N}(\textbf{z}|\boldsymbol\mu_\phi(\textbf{x}), diag(\boldsymbol\sigma^2_\phi(\textbf{x})))}, \quad (3)

\newline

\textbf{Model 2:} \ensuremath{q_\phi(\textbf{z}|y,\textbf{x}) = \mathcal{N}(\textbf{z}|\boldsymbol\mu_\phi(y,\textbf{x}), diag(\boldsymbol\sigma^2_\phi(\textbf{x}))); \quad q_\phi(y|\textbf{x}) = Cat(y|\boldsymbol\pi_\phi(\textbf{x}))}, \quad (4)

\newline

\newline

\newline

\newline

where:

\ensuremath{\boldsymbol\sigma_\phi(\textbf{x})} is a vector of standard deviations,

\ensuremath{\boldsymbol\pi_\phi(\textbf{x})} is a probability vector,

and the functions \ensuremath{\boldsymbol\mu_\phi(\textbf{x})}, \ensuremath{\boldsymbol\sigma_\phi(\textbf{x})} and \ensuremath{\boldsymbol\pi_\phi(\textbf{x})} are represented as \textbf{MLPs}.
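The Gaussian inference network of Model 1 can be sketched as follows. This is an illustration under stated assumptions, not the network used in this work: the layer sizes, the random untrained weights, the \texttt{tanh} activation, the log-variance output and all function names are hypothetical. A sample of \textbf{z} is drawn as mean plus standard deviation times standard normal noise, the sampling scheme commonly used with this family of models.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_encoder(n_features, n_hidden, n_latent):
    """Random weights for a one-hidden-layer MLP encoder (untrained, illustrative)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W_mu": rng.normal(0.0, 0.1, (n_hidden, n_latent)),
        "b_mu": np.zeros(n_latent),
        "W_logvar": rng.normal(0.0, 0.1, (n_hidden, n_latent)),
        "b_logvar": np.zeros(n_latent),
    }

def encode(params, x):
    """Return mu_phi(x) and sigma_phi(x) of the diagonal Gaussian q_phi(z | x)."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    mu = h @ params["W_mu"] + params["b_mu"]
    log_var = h @ params["W_logvar"] + params["b_logvar"]
    sigma = np.exp(0.5 * log_var)  # predicting the log-variance keeps sigma positive
    return mu, sigma

def sample_latent(mu, sigma):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

x = rng.normal(size=(5, 20))        # 5 synthetic records with 20 features each
params = init_encoder(n_features=20, n_hidden=16, n_latent=3)
mu, sigma = encode(params, x)
z = sample_latent(mu, sigma)        # latent features, shape (5, 3)
```

In Model 1, vectors such as \texttt{z} (or the means \texttt{mu}) would serve as the low-dimensional features passed to the separate classifier.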

\newline

\newline

\textbf{Generative Semi-supervised Model Objective}

When the label corresponding to a data point is observed, the variational bound is:

\newline

\newline

\begin{math}
\log p_\theta(\textbf{x},y) \geq \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x},y)} \left[ \log p_\theta(\textbf{x}|y,\textbf{z}) + \log p_\theta(y) + \log p(\textbf{z}) - \log q_\phi(\textbf{z}|\textbf{x},y) \right] = -\mathcal{L}(\textbf{x},y), \quad (5)
\end{math}
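
For a data point whose label is missing, the text above states only that the class is integrated out. In the standard formulation of this model class, which is assumed here rather than taken from this document, the label is treated as a latent variable and the bound becomes a sum over its possible values:

\begin{math}
\log p_\theta(\textbf{x}) \geq \sum_{y} q_\phi(y|\textbf{x}) \left( -\mathcal{L}(\textbf{x},y) \right) + \mathcal{H}\left( q_\phi(y|\textbf{x}) \right) = -\mathcal{U}(\textbf{x}),
\end{math}

where \ensuremath{\mathcal{L}(\textbf{x},y)} is the quantity defined in (5), \ensuremath{\mathcal{H}} denotes the entropy of the distribution over labels, and \ensuremath{\mathcal{U}(\textbf{x})} is a symbol introduced only for this illustration.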

\newline

\newline

The objective function is minimized using AdaGrad, a gradient-descent-based optimization algorithm. It automatically tunes a per-parameter learning rate based on the history of observed gradients, which reflects the geometry of the data. AdaGrad is designed to perform well with datasets that have infrequently occurring features.
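
The AdaGrad update rule can be sketched in a few lines. The toy quadratic objective, the step size and the function name \texttt{adagrad\_minimize} are assumptions made for illustration only; they are not the objective or the settings used in this work. Each parameter's step is divided by the square root of its accumulated squared gradients, so parameters that rarely receive large gradients keep a comparatively large effective learning rate.

```python
import numpy as np

def adagrad_minimize(grad_fn, theta0, lr=0.5, eps=1e-8, n_steps=500):
    """Minimize a function with AdaGrad, given a function returning its gradient."""
    theta = np.asarray(theta0, dtype=float).copy()
    accum = np.zeros_like(theta)                   # running sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(theta)
        accum += g ** 2
        theta -= lr * g / (np.sqrt(accum) + eps)   # per-parameter scaled step
    return theta

# Toy objective: f(theta) = sum(scale * (theta - target)^2), with very
# different curvature along the two coordinates.
target = np.array([1.0, -2.0])
scale = np.array([10.0, 0.1])

def grad(theta):
    return 2.0 * scale * (theta - target)

theta_hat = adagrad_minimize(grad, np.zeros(2))
```

Despite the hundredfold difference in curvature, both coordinates approach \texttt{target} with the same step size, because the accumulated gradient history rescales each coordinate separately.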

\chapter{Evaluation}

The model was used to analyze taxpayer data from the Cyprus Tax Department database in order to identify taxpayers likely to yield material additional tax if a VAT audit is performed.

The deep generative model for semi-supervised learning enables increased efficiency in the audit selection process. Its input includes both audited (supervised) and unaudited (unsupervised) taxpayer data. Its output is a collection of labels, one per taxpayer, each taking one of two possible values: good (1) or bad (0). A taxpayer who is expected to yield material additional tax after an audit is classified as good (1).

\newline

\newline

Nearly all the VAT returns of the last few years were processed in order to generate the features used by the model. The features were selected based on the advice of experienced field auditors and data analysts, and on rules from rule-based models. Some of the selected fields were further processed to generate extra fields. The selected features broadly relate to business characteristics, such as the location and type of the business, and to features from its tax returns.

For data preparation, the data was cleaned; for example, we removed taxpayers with little or no tax history, mainly new businesses.

\newline

\newline

The details of the criteria used to select the features, the feature processing, the newly generated features, the number of features and the cleansing process cannot be disclosed due to the confidential nature of audit selection. Publication of the features could also compromise future audit selection and would be unlawful.

For modelling, taxpayer data from the tax department registry (such as economic activity) and from the tax returns \ensuremath{(X)} and actual audit results \ensuremath{(Y)} appear as pairs.

\newline

\newline

\ensuremath{(X, Y) = \{(x_1, y_1), \ldots, (x_N, y_N)\}}

\newline

\newline

with the \ensuremath{i}th observation \ensuremath{x_i} and the corresponding class label \ensuremath{y_i \in \{1, \ldots, L\}} for the taxpayers audited.

\newline

\newline

For each observation we infer corresponding hidden (latent) variables, denoted by z. In semi-supervised classification, where both audited and unaudited taxpayers are utilized, only a subset of the taxpayers have corresponding class labels (audit results). The empirical distributions over the labelled (audited) and unlabelled (unaudited) subsets are denoted \ensuremath{p(x, y)} and \ensuremath{p(x)}, respectively.

\newline

\newline

The model was built with TensorFlow, an open-source software library for high-performance numerical computation, running on top of the Python programming language. The hardware used is a custom-built machine of the Cyprus Tax Department with an NVIDIA 10 series Graphics Processing Unit. Performance was measured using k-fold cross-validation on the training data.
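
The k-fold procedure can be sketched as follows. The value \texttt{k=5}, the synthetic labels, the helper names and the majority-class stand-in for the model are all assumptions made for illustration; the number of folds actually used is not stated here. The data is split into k parts, each part is held out once for validation, and the accuracies are averaged.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_validate(fit_predict, X, y, k=5):
    """Average validation accuracy of fit_predict over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        scores.append(np.mean(y_pred == y[val_idx]))
    return float(np.mean(scores))

def majority_baseline(X_train, y_train, X_val):
    """Stand-in for the real model: always predict the majority training class."""
    majority = int(np.mean(y_train) >= 0.5)
    return np.full(len(X_val), majority)

X = np.zeros((10, 2))                           # placeholder features
y = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])    # synthetic audit outcomes
mean_accuracy = cross_validate(majority_baseline, X, y, k=5)
```

Any function with the same \texttt{fit\_predict} signature, including one that wraps the trained model, could be passed in place of the baseline.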

\newline

\newline

The model was trained on actual tax audit data collected from prior years (supervised) and on actual data of unaudited taxpayers (unsupervised). The amount above which an audit yield is classified as material was set following internal guidelines. The same model was used for large, medium and small taxpayers, irrespective of the economic activity classification (NACE code). The predictions made by the model were compared to the actual audit findings, giving an accuracy of 78.4\%. This compares favorably to peer results using Data Mining Based Tax Audit Selection, with a reported accuracy of 51\% (Kuo-Wei Hsu et al., 2015).

\section{The confusion matrix}

The confusion matrix in Table 1 represents the classification of the model on the training data set. Columns are for predictions and rows are for actual outcomes. The top-left element indicates correctly classified good cases, and the top-right element indicates the tax audits lost (i.e. cases predicted as bad turning out to be good). The bottom-left element indicates tax audits incorrectly predicted as good, and the bottom-right element indicates correctly predicted bad tax audits.

The confusion matrix indicates that the model is balanced. The actual numbers are not disclosed for confidentiality reasons; instead, they are presented as percentages.
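
A confusion matrix expressed in percentages can be computed as in the sketch below. It assumes the layout implied by the element descriptions above (rows for actual outcomes, columns for predictions, good before bad); the labels are synthetic and the function name \texttt{confusion\_percentages} is hypothetical.

```python
import numpy as np

def confusion_percentages(y_true, y_pred):
    """2x2 confusion matrix as percentages of all cases.

    Rows: actual good (1), actual bad (0).
    Columns: predicted good (1), predicted bad (0).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = np.array([
        [np.sum((y_true == 1) & (y_pred == 1)), np.sum((y_true == 1) & (y_pred == 0))],
        [np.sum((y_true == 0) & (y_pred == 1)), np.sum((y_true == 0) & (y_pred == 0))],
    ])
    return 100.0 * counts / counts.sum()

# Synthetic audit outcomes and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
table = confusion_percentages(y_true, y_pred)
accuracy = table[0, 0] + table[1, 1]   # share of correctly classified cases
```

The diagonal entries sum to the accuracy, while the top-right entry is the share of audits lost and the bottom-left entry is the share of audits incorrectly predicted as good.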