Although Chapter 2 outlined many good reasons for using administrative sources, there are also a number of problems associated with their use. Some of these problems are specific to a particular source or use, but many of them will be more generic in nature. This chapter outlines some of the more common problems and proposes methods to solve them, or at least to minimise their impact. The specific problems of getting access to administrative sources in the first place, and of data linkage, are treated separately in Chapters 3 and 6 respectively.
4.2 Public Opinion
Chapter 2 considered how public opinion might favour the sharing of data in some countries. In other countries, however, there may be public unease at the thought of data being shared around government. Such concerns are very difficult to reduce, but possible approaches include publishing clear limits and rules regarding the use of data, and ensuring that people and businesses understand that sensitive data used or collected for statistical purposes will not be fed back to other parts of government (particularly tax and benefits agencies).
This is in line with the United Nations “Fundamental Principles of Official Statistics”, where Principle 5 (“Data for statistical purposes may be drawn from all types of sources, be they statistical surveys or administrative records. Statistical agencies are to choose the source with regard to quality, timeliness, costs and the burden on respondents”) encourages the use of administrative data. Taken together with Principle 6 (“Individual data collected by statistical agencies for statistical compilation, whether they refer to natural or legal persons, are to be strictly confidential and used exclusively for statistical purposes”), this establishes the principle of the one-way flow of data.
Other ways to help overcome hostile public opinion include the publication of analyses of the costs and benefits, both to government and to respondents, of the use of different sources. It may also be possible to claim that micro-data are more secure when administrative sources are used. No questionnaires are sent by post, data are not held on paper or electronically by interviewers, and fewer clerical staff are needed for the statistical production process, thus fewer people have access to sensitive data.
4.3 Public Profile
Direct contact with the public via surveys helps to raise the profile of the statistical organisation. The use of administrative data can reduce that contact and hence also reduce public awareness of the work of the statistical organisation. If this becomes an issue, the most obvious solution is to improve the ‘marketing’ of the statistical organisation and its data outputs. This may require a small proportion of the savings from using administrative sources to be transferred to the marketing budget.
Perhaps the most effective way of promoting the activities and outputs of a national statistical organisation, particularly in the medium to long term, is to ensure greater involvement with educational institutions, business groups, and other target customers. User groups are also particularly important in this respect, and should be actively encouraged.
4.4 Management of Change
Public sector administrative sources are generally set up for the purposes of collecting taxes or monitoring government policies. This means that they are susceptible to political changes. If a policy changes, administrative sources may be affected in terms of coverage, definitions, thresholds etc., or possibly even abolished completely. Changes to the computer systems used to store and process administrative data may also have an impact on the supply of data for statistical purposes. Even private sector sources are not immune from this sort of change, though in this case, change is more likely to be driven by changing market factors.
Such changes may happen suddenly, with little warning. Particularly high-risk times tend to be immediately after a change of government, a change of minister, or a change in legislation. An example was reported some years ago from Slovenia, where the supply of administrative data on employment was halted for a while following a change of minister, leaving the statistical office with serious problems for the production of employment statistics. Procedures, backed up by legislation, have since been implemented to minimise the likelihood and impact of this sort of change.
Reliance on a particular source will always, therefore, carry a certain degree of risk. These risks can be managed to some extent by legal or contractual provisions. The best way in practice to avoid such problems tends to be through regular contact with those responsible for the administrative source, to ensure they are aware of the statistical requirements, and to try to influence and get early warning of any possible changes. Where there is a strong reliance on a particular source, it is also worth preparing contingency plans setting out what could be done if that source became unavailable. It is clearly better to be proactive beforehand than to have to react after the event!
4.5 Definitions of Units
One major problem often encountered when using administrative sources is that the units used in those sources do not correspond directly to the definition of the required statistical units. The process of converting from administrative units (legal units, tax units, claimants etc.) to statistical units (enterprises, people, households etc.) can be quite difficult conceptually, and often involves some form of modelling.
In business statistics, this process is known as profiling, and typically is a function of statistical business registers. Eurostat has published guidelines for this process in Chapter 19 of their Business Register Recommendations Manual, where they define profiling as “a method to analyse the legal, operational and accounting structure of an enterprise group at national and world level, in order to establish the statistical units within that group, their links, and the most efficient structures for the collection of statistical data.”
Figure 4.1 shows how the structure of a set of linked business units can look very different from the legal / administrative point of view, compared to the statistical point of view. Profiling, as defined above, can be seen as the process of creating the statistical structure and mapping it to the legal / administrative structure.
Figure 4.1 – Different views of a group of business units
Although profiling gives a better understanding of complex unit structures, it is expensive and time consuming, and needs trained staff. It is therefore totally impractical to attempt detailed clerical profiling for all business units in an economy; it is necessary to focus on those cases that give the most benefit. Profiling can be seen as a trade-off or compromise between three factors:
- Quantity of business structures profiled;
- Quality or depth of the profiling activity; and
- Available resources (determined both by cost and suitability of staff).
Box 4.1 gives four examples of business structures that were profiled separately in three different countries (Denmark, The Netherlands and the United Kingdom) as part of a study into the consistency of application of the statistical definition of the enterprise within the European Union. This shows clearly that profiling is to some extent an art, and there is not always a “right” answer. However, this particular exercise resulted in considerable methodological work to harmonise the rules for profiling, which is partially documented in Chapter 19 of the Eurostat Business Register Recommendations Manual cited above.
Although clerical profiling is not practical for all units in a large population, some form of automated, rules-based profiling might be. Standard rules based on attributes or the nature of links between units can help to overcome differences between administrative and statistical units in many areas of statistics. For example, statistical households can be derived based on relationships between the individuals living in a building. This approach is used successfully within the register-based population census methodology applied in Nordic countries.
An alternative to profiling that may be feasible in some cases is to consider correcting for differences in the definitions of units by making statistical “adjustments”. A crude example of this approach could be where the statistical unit is persons and the administrative unit is jobs. Assuming that it is known from a survey that working people have, on average, 1.15 jobs, this adjustment factor can be used to estimate persons in employment from the number of jobs.
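Such a crude adjustment is easy to sketch in code. The average of 1.15 jobs per working person comes from the example above; the function name and the job count used in the illustration are invented:

```python
# Illustrative adjustment from an administrative count of "jobs" to a
# statistical estimate of "persons in employment", using the survey-based
# average of 1.15 jobs per working person quoted in the text.
JOBS_PER_PERSON = 1.15

def estimate_persons(jobs: int, jobs_per_person: float = JOBS_PER_PERSON) -> int:
    """Estimate persons in employment from a count of jobs."""
    return round(jobs / jobs_per_person)

print(estimate_persons(230_000))  # 230,000 jobs -> 200000 persons
```

The same pattern applies to any systematic unit difference for which a stable survey-derived ratio is available, though the factor should of course be re-estimated periodically.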
Box 4.1 – An Exercise in Profiling: How Many Enterprises?
The following examples are taken from the study “The Impact of Diverging Interpretations of the Enterprise Concept”, prepared for Eurostat by Statistics Netherlands with input from Denmark and UK. Each example is followed by the answer given by each of the three participating countries, along with a summary of their reasoning. The examples are based on the following definition of an enterprise:
“the smallest combination of legal units that is an organisational unit producing goods or services, which benefits from a certain degree of autonomy in decision-making .... An enterprise may be a sole legal unit.”
Source: EU Regulation 696/93 on statistical units
Example 1 - Two legal units in an enterprise group have different 4 digit NACE codes; both are selling mainly to third parties outside the group. They share buildings, management, purchases and employees.
Example 2 - Four Legal Units: A and B have different activities, no combined purchases, but share buildings. C and D share buildings, employees, and purchases. All four present themselves as one firm.
Example 3 - Three legal units: All produce mainly for external customers, they share management and purchases, and represent themselves as one firm. A and B share a building. B and C have the same activity, share employees and capital goods and can not supply separate data.
Example 4 - Twelve legal units form an enterprise group. Only one is active, the others have no employees.
4.6 Definitions of Variables
As well as differences in the definitions of units, there are also likely to be differences in the definitions of variables between administrative and statistical systems. The data in administrative sources have generally been collected for a specific administrative purpose, and the needs and priorities relating to that purpose are likely to be different to those of the statistical system. For example, turnover for value added tax (VAT) purposes may not include turnover related to the sales of VAT exempt goods and services, whereas the statistical system is likely to require total turnover.
Another common example is the definition of unemployment. The standard statistical definition is:
“The "unemployed" comprise all persons above a specified age who during the reference period were:
(a) "without work", i.e. were not in paid employment or self-employment
(b) "currently available for work", i.e. were available for paid employment or self-employment during the reference period; and
(c) "seeking work", i.e. had taken specific steps in a specified recent period to seek paid employment or self-employment.”
However, definitions of unemployment in administrative sources are more often based on the number of people claiming unemployment benefits, or registered as looking for work. Some people who are out of work may not register as unemployed, if they expect to find work quickly, and in some cultures there may be a social stigma attached to claiming unemployment benefits. On the other hand, some people claiming unemployment benefit may not be available for work or actively seeking work, so should not be counted as statistically unemployed.
The first step towards solving the problem of different definitions is to try to understand the differences and quantify their impact. Some differences may have no real impact in practice, so could safely be ignored; others may be systematic, so could be resolved through adjustments to the data. Sometimes it might be possible to derive or estimate the impact of the difference by combining variables from different sources, particularly for financial accounting variables such as the turnover example above. In some cases, it might even be possible to influence the administrative definition.
4.7 Classification Systems
As is the case for variables, the classification systems used within administrative sources may be different to those used in the statistical world. Even if they are the same, they may be applied differently depending on the primary purpose of the administrative source, perhaps focusing on specific attributes of the unit. For example, an administrative source concerned with licensing, health and safety or environmental protection may be more interested in the economic activities of a business that are of most concern to that source, rather than the main economic activity of a business, which is required for statistical purposes.
In other cases, classifications in administrative sources may not be applied at the level of detail required for statistical purposes, or the classification may simply not be a priority variable for the administrative source, resulting in quality deficiencies.
Where classification systems or versions are different, the usual solution is to construct conversion matrices to map the codes in the administrative classification to those in the statistical classification. Such mappings may be one to one, many to one, one to many or many to many. In the latter two cases, some sort of probabilistic allocation may be required.
Box 4.2 – Using a Simple Conversion Matrix
This extract from a conversion matrix illustrates the main problems found when converting from one classification system to another. In this case, the codes used in the administrative source (Code 1) are mapped onto those used in the statistical system (Code 2), in a probabilistic way based on weights.
The first issue is therefore how to determine the weights. These can be estimated, but a preferable method, where possible, is to derive them from an analysis of units that have been classified according to both systems, looking at the proportions of units with certain combinations of codes. It may be necessary to constrain these analyses to cover only combinations of codes that are considered valid or plausible to reduce the impact of coding errors.
The first line above shows a one to one correlation, reflected by a weight of 100%. This means that all units with an administrative code of 0100 should be allocated the statistical code 01300. The next five lines show a one to many correlation. If a unit has the administrative code 0101, there are five possible statistical codes. For each of these statistical codes, the likelihood of it being the correct code for the unit is reflected by the weight, thus there is a 26% chance that 01210 is the correct statistical code.
In this case, the probability that a unit with the administrative code 0101 will be given the correct statistical code can be calculated by summing the squares of the probabilities for each combination, e.g.
0.26² + 0.14² + 0.29² + 0.11² + 0.20² = 0.2234
This means that there is a 77.66% chance that a unit with the administrative code 0101 will be given the wrong statistical code. Whilst this likelihood might seem unacceptably high, it should be remembered that even though codes may be wrong at the unit level, providing the weights are accurate, the distribution of units between codes should be correct at the aggregate level. As long as there are no systematic biases in the application of the conversion matrix, there should be no resulting biases in statistical data for units coded in this way.
It should also be noted that conversion matrices such as the example above are uni-directional. A separate matrix, with different weights, would be required to convert from the statistical codes to the administrative codes. For example, a one to one correlation in one direction may become a one to many correlation in the other direction. This is illustrated in the table above, where there is a one to one correlation between codes 0100 and 01300, but when converting from statistical to administrative codes, 01300 could map to 0100 or 0103.
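The mechanics described in Box 4.2 can be sketched in code: weights are derived from units that carry both codes, a statistical code is allocated probabilistically, and the chance of a correct allocation is the sum of the squared weights. Only codes 0100, 0101, 01210 and 01300 appear in the text; the other statistical codes, and the counts of dual-coded units, are invented so that the weights match those in the box:

```python
import random
from collections import Counter

# Derive conversion weights from units classified under both systems: the
# weight for an (administrative, statistical) pair is the share of units with
# that administrative code carrying that statistical code.
dual_coded = ([("0100", "01300")] * 50
              + [("0101", "01210")] * 26 + [("0101", "01220")] * 14
              + [("0101", "01230")] * 29 + [("0101", "01240")] * 11
              + [("0101", "01250")] * 20)

totals = Counter(a for a, _ in dual_coded)
matrix = {}  # administrative code -> list of (statistical code, weight)
for (a, s), n in sorted(Counter(dual_coded).items()):
    matrix.setdefault(a, []).append((s, n / totals[a]))

def allocate(admin_code, rng):
    """Probabilistically allocate a statistical code using the weights."""
    codes, weights = zip(*matrix[admin_code])
    return rng.choices(codes, weights=weights, k=1)[0]

def p_correct(admin_code):
    """Chance a randomly allocated code is right, assuming the weights
    reflect the true distribution: the sum of the squared weights."""
    return sum(w * w for _, w in matrix[admin_code])

print(p_correct("0100"))            # 1.0 - a one to one mapping
print(round(p_correct("0101"), 4))  # 0.2234, as in Box 4.2
```

As the box notes, unit-level allocations made this way will often be wrong individually, but the aggregate distribution of units across statistical codes should be unbiased if the weights are accurate.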
Where accuracy is required at the micro-data level, it is clear from Box 4.2 that the conversion matrix approach has severe limitations. Various other methods may be possible depending on resources and data availability, but a useful first step is always to gain a detailed understanding of how the classification data are collected and processed by the administrative source, and the nature of the administrative functions they are used for.
In some cases, other variables may be available within the administrative source, which could be used to improve the likelihood of selecting the correct statistical code. One such variable could be the text description from which the administrative code was derived. If this is available, it is potentially of more use to the statistician than the administrative code itself, because the statistician could apply manual or automatic techniques to derive the correct statistical code directly from the description. This method can be used in conjunction with the conversion matrix approach, such that text descriptions are only coded in cases where there is not a one to one correlation between administrative and statistical codes, though there is a risk of potential bias if the quality of coding is different between the administrative and statistical systems.
One approach that has been used successfully in several countries is to develop an automatic coding tool for use in both statistical and administrative systems. This ensures a high degree of consistency of coding, and strongly encourages (but does not necessarily enforce) the use of a common classification system.
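As an illustration of the idea, rather than of any particular country's tool, an automatic coding routine can be as simple as a keyword dictionary applied to free-text activity descriptions; the keywords and codes below are invented:

```python
# Toy sketch of a shared automatic coding tool: map free-text activity
# descriptions to classification codes via keyword rules. Real tools use far
# richer dictionaries and matching logic; these keywords and codes are invented.
RULES = [
    ("bakery", "1071"),
    ("bread", "1071"),
    ("taxi", "4932"),
    ("software", "6201"),
]

def auto_code(description):
    """Return the first matching code, or None to send for clerical coding."""
    text = description.lower()
    for keyword, code in RULES:
        if keyword in text:
            return code
    return None

print(auto_code("Wholesale bakery and bread shop"))  # 1071
```

Sharing one such tool (and one rule base) between the statistical and administrative systems is what delivers the consistency referred to above, regardless of how sophisticated the matching logic is.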
In addition to the use of common coding tools, the provision of coding expertise and training to administrative data suppliers can help to improve coding consistency. At the same time, it is always helpful for the statistician to stress the advantages of using a common classification system. It also helps to give early notice of any revisions to the classification system, and to provide as much help as possible to administrative data suppliers during the implementation of the resulting changes.
4.8 Timeliness
There are three separate issues relating to timeliness that affect the usefulness of administrative data for statistical purposes:
- Administrative data may not be available in time to meet statistical needs;
- Administrative data may relate to a period that does not coincide with the statistical reference period; and
- Administrative data may be measured over a period, whilst the statistical requirement is for a specific point in time (or vice-versa).
Considering the first issue, there will generally be some sort of lag between an event happening in the real world and it being recorded by an administrative source; this is then followed by a further lag before the data are made available to the national statistical organisation. Figure 4.2 below shows the total lag in days between businesses commencing activities and being recorded on the statistical business register in the United Kingdom. Lags relating to births and deaths of enterprises are a major source of business register coverage errors. If these lags are measured, allowance can be made for them in any statistics based on register data.
By analysing lags in this way, it is possible to produce summary statistics to estimate their impact. For example, in the case above, two-thirds of businesses appear on the statistical business register within 2 months of starting activity. The mean lag is around 120 days, but this figure is not particularly useful as it is affected by outliers in the very long tail of the distribution (truncated in Figure 4.2, as the most extreme cases involved lags of up to ten years). Perhaps a more useful measure of average in this case is the median, which is around 40 days. Another interesting feature of this analysis is the small number of negative lags, which can happen when a business completes registration formalities well before commencing trading.
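A lag analysis of this kind reduces to simple summary statistics. The lag values below are invented to mimic the shape of the UK distribution (mostly short lags, a few negative ones, and a long right tail):

```python
import statistics

# Invented registration lags in days, mimicking the shape of Figure 4.2:
# mostly short lags, some negative ones (registration before trading starts),
# and a long right tail of extreme outliers.
lags = [-10, -3, 5, 12, 20, 25, 33, 40, 41, 55, 60, 70, 90, 150, 400, 3600]

mean_lag = statistics.mean(lags)      # dragged upwards by the long tail
median_lag = statistics.median(lags)  # a more robust "typical" lag

print(f"mean: {mean_lag:.1f} days, median: {median_lag:.1f} days")
```

On these invented data the mean is pulled far above the median by a single extreme value, exactly the pattern the UK analysis found (a mean of around 120 days against a median of around 40).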
This sort of analysis is clearly important to help the statistician understand the nature and impact of the lags in the sources used to compile statistics. It also gives information that can be used to inform adjustments to improve the quality of the statistical outputs.
The existence and length of lags can make the use of administrative sources difficult for short-period statistics, e.g. a six-month lag would probably be unacceptable for a key monthly economic data series, but would be less of a problem for annual statistics.
The first step to resolving the problem of lags is to understand their impact by preparing analyses such as the one above. Once this has been done, it may be possible to develop models to adjust for their impact. It might also be the case in some relatively stable data series that opposing lags may cancel each other out, for example the business registration lags in Figure 4.2 may be cancelled out by de-registration lags for the purposes of producing data on the business population. It can be dangerous, however, to assume that this is the case, without empirical evidence.
Figure 4.2 Business Registration Lags in the United Kingdom
When the nature and impact of lags have been determined, it is useful to try to understand what causes them. In some cases, it might be possible to propose changes in the administrative source that would reduce lags. This may be beneficial to both the statistician and the administrative source.
The second issue related to timeliness is that of differing periods, for example data from annual tax returns are often only available several months after the end of the tax year, so are probably not suitable for monthly or quarterly statistics. In some cases, however, annual administrative data can be used for shorter-period statistics, particularly if they are collected on a rolling annual basis. This can happen if there is a requirement to spread the workload of collecting and processing these data by the administrative source throughout the year. As long as the distribution of the units for which data are collected during the year is sufficiently random, it may be possible to derive meaningful monthly or quarterly statistical trend data from such sources.
Figure 4.3 shows a case where administrative data are based on a financial year running from 1 April to 31 March, whereas the statistical requirement is for calendar year data. The simplest way to convert these data is to add 25% of the value from the first financial year to 75% of the value for the second. This method should give a reasonable approximation if the long-term trend in the data is reasonably stable, though for more volatile series, other, more complex estimation methods may be required.
Figure 4.3 Dealing with Different Time Periods
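The pro-rata conversion illustrated in Figure 4.3 can be sketched directly; the function name and the example values are invented:

```python
# Simple pro-rata conversion from financial years running 1 April - 31 March
# to a calendar-year estimate, as described in the text: 25% of the financial
# year ending in March of the calendar year, plus 75% of the financial year
# starting in April of the calendar year.
def calendar_year_estimate(fy_ending_march, fy_starting_april):
    return 0.25 * fy_ending_march + 0.75 * fy_starting_april

# e.g. value 400 for the financial year ending March, 480 for the one
# starting April of the same calendar year
print(calendar_year_estimate(400, 480))  # 0.25*400 + 0.75*480 = 460.0
```

As noted above, this simple weighting is only reasonable when the underlying trend is fairly stable; for volatile series a proper temporal disaggregation method would be needed.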
The third issue concerns the difference between data referring to a specific point of time and data relating to a period (e.g. an annual or monthly average). For example, there may be a statistical requirement for employment data on a specific reference date, whereas administrative data may only give monthly averages.
As in previous examples, the first step is to analyse the impact of the difference, and determine whether it is significant enough to require further action. One possible solution is a model-based mathematical adjustment, e.g. if the statistical reference date is near the start of the month, a model that takes into account the average figure for the previous period may be appropriate. An alternative approach may be to use the results of a relatively small survey to adjust the administrative data.
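One possible form of such a model-based adjustment, purely as an illustration of the idea, is to weight the current and previous months' averages by how far the reference date falls through the month; the function, values and 30-day month are all assumptions:

```python
# Hypothetical adjustment from monthly-average data to a point-in-time
# estimate: interpolate between the previous and current months' averages
# according to the position of the reference date within the month.
def point_estimate(prev_month_avg, this_month_avg, day_of_month, days_in_month=30):
    w = day_of_month / days_in_month
    return (1 - w) * prev_month_avg + w * this_month_avg

# Reference date near the start of the month: the estimate stays close to
# the previous month's average, as suggested in the text.
print(point_estimate(1000, 1060, 3))
```

In practice such a model would be calibrated, and ideally validated, against a small survey measured on the true reference date, as suggested above.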
4.9 Inconsistency between Sources
A specific problem where multiple sources are used concerns inconsistencies between those sources. Data from one source may appear to contradict those from another. This may be due to different definitions or classifications, differences in timing, or simply to an error in one source. This can happen when comparing administrative data with statistical data, or when comparing two administrative (or two statistical) sources.
To resolve such conflicts it is necessary to establish priority rules, by deciding which source is most reliable for a particular variable. Once a priority order of sources has been determined for a variable, it should then be possible to ensure that data from a high priority source are not overwritten by a lower priority source. This process is made much easier if source codes are stored alongside variables for which several sources are available. The use and storage of dates can also be helpful, as even when one source is thought to be more reliable than another, data from that source that are ten years old may not be of higher quality than data for the most recent period from the less reliable source. A simpler method that may be appropriate in some cases is to load data in reverse priority order, allowing data of higher quality to overwrite those of lower quality.
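The "reverse priority order" loading described above is straightforward to sketch: sources are applied from least to most reliable, so later (higher-priority) values overwrite earlier ones, and a source code is stored alongside each value. The source names and variables below are invented:

```python
# Sources listed from lowest to highest priority; each supplies some subset
# of the variables of interest. Names and values are illustrative.
sources_low_to_high = [
    ("commercial_directory", {"activity": "47.11", "employment": 12}),
    ("tax_register",         {"activity": "47.19"}),
    ("statistical_survey",   {"employment": 15}),
]

record = {}
for source, values in sources_low_to_high:
    for variable, value in values.items():
        record[variable] = (value, source)  # store the source code with the value

# activity now comes from the tax register, employment from the survey
print(record)
```

Storing the source (and, as noted above, a date) with each value makes it easy to apply per-variable priority rules later, rather than treating whole records as coming from a single "best" source.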
In most cases, there will be several variables of interest, and it is likely that the priorities will differ from variable to variable. For example, an administrative source concerning employment of workers is likely to give reasonable estimates of (legal) employment, as that variable is closely related to the core function of the source. It may not, however, be so good for determining the economic activity of the employer, as this may be only of secondary importance for the purpose of the source. Thus if multiple sources are used to derive employment data, it would be necessary to consider the relative quality of each variable in each source in order to derive the optimal statistical data set.
The more data sources that are used, the more complex this comparison process becomes, but having multiple sources often helps to validate data quality. In some cases, certain sources may not be used directly for statistical production, but purely for benchmarking purposes as part of a quality assurance process. The resulting knowledge about the quality of various sources can also be fed back (usually at aggregate rather than unit level, to protect statistical confidentiality) to the source, and can provide a basis for discussions about improving the quality of that source.
Box 4.3 – Data From Different Sources
Source 1: Education Register (address recorded as “5 St Peter’s St”)
Source 2: Population Register (address recorded as “5 Saint Peters Street”)
Other fields shown in the two records include date of birth and “Office for National Statistics”.
This example shows two records containing fictional data about the author (education and population registers do not yet exist in the UK). It is designed to illustrate several common issues encountered when trying to reconcile data from different sources, such as differences in the way names and addresses are recorded.
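One such issue, differing address formats, can often be reduced by normalising values before comparing them. A minimal sketch, using the fictional addresses from Box 4.3, is shown below; the normalisation rules (lowercase, strip punctuation, collapse common variants to one token) are illustrative, not a standard:

```python
import re

# Collapse common address variants to a single token before matching.
# Note that "St"/"Saint"/"Street" all map to "st": crude, but effective
# as a blocking key for this kind of comparison.
VARIANTS = {"saint": "st", "street": "st", "st": "st", "road": "rd"}

def normalise(address):
    tokens = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(VARIANTS.get(t, t) for t in tokens)

a = normalise("5 St Peter's St")        # -> "5 st peters st"
b = normalise("5 Saint Peters Street")  # -> "5 st peters st"
print(a == b)  # True
```

Real record linkage systems go much further (phonetic encodings, edit distances, probabilistic matching), but almost all of them start with normalisation steps of this kind.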
4.10 Missing Data
The problem of missing data is not unique to administrative sources. It can also be due to full or partial non-response to statistical surveys, or even to the removal of data values during the editing process. However, with administrative sources, the issues can sometimes be different, particularly as the problem of missing data can often be more systematic.
The main reasons for this are that a particular variable may not be collected at all by the administrative source, or it may only be collected for certain categories of units where there is a specific administrative requirement. The variable may also simply be a low priority for administrative purposes, so the owners of that source do not see missing data as a problem.
Some of the standard solutions for dealing with non-response in statistical surveys can also be used to solve the problem of missing data in administrative sources. Various imputation methods, such as deductive, ‘hot-deck’ or ‘cold-deck’ imputation are often suitable where the problem only affects some of the units. In cases where most or all of the units are affected, a modelling approach may be more appropriate.
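A 'hot-deck' imputation, for instance, fills a missing value with the value from a randomly chosen 'donor' record in the same imputation class. The sketch below uses the current data set as the donor pool; the records and field names are invented:

```python
import random

# Illustrative hot-deck imputation: a missing employment value is filled in
# from a randomly chosen donor with the same activity code. Invented data.
records = [
    {"activity": "A", "employment": 10},
    {"activity": "A", "employment": 12},
    {"activity": "B", "employment": 50},
    {"activity": "A", "employment": None},  # missing - to be imputed
]

rng = random.Random(1)
for rec in records:
    if rec["employment"] is None:
        donors = [r["employment"] for r in records
                  if r["activity"] == rec["activity"] and r["employment"] is not None]
        rec["employment"] = rng.choice(donors)

print(records[-1]["employment"])  # imputed from a donor with activity "A"
```

A 'cold-deck' variant would draw donors from a previous period's data instead; deductive imputation would derive the value logically from other variables on the record.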
Box 4.4 – Case Study: Dealing With Missing Administrative Data – Turnover per Head Ratios
The two variables most commonly available to measure the size of a business are the number of employees and the total sales (turnover). However, it is common for one or both of these variables to be missing or unreliable for new businesses, particularly smaller ones.
To help resolve this problem, turnover per head ratios can be used to estimate the missing variables. These ratios are constructed by taking information for similar businesses for which both variables are present and considered reliable, then calculating average turnover per head ratios for different categories based on economic activity and institutional sector.
For example, the following are dummy turnover per head (TPH) values calculated for different classes of the International Standard Industrial Classification (ISIC):
If a business has ISIC class 45.12, and its turnover is 200, but employment is missing, the imputed employment value is:
200 / 68 = 2.94 (rounded to 3)
When calculating turnover per head values, problems with outliers are often encountered, so methods such as trimming (removing the top and/or bottom x% of values) and calculating the mean of the inter-quartile range are often used.
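A trimmed mean of this kind is easy to compute; the turnover per head values below are invented, with one extreme outlier to show the effect:

```python
import statistics

# Outlier-robust turnover-per-head (TPH) ratio: drop the top and bottom
# 10% of values, then take the mean of what remains.
def trimmed_mean(values, trim=0.10):
    v = sorted(values)
    k = int(len(v) * trim)  # number of values dropped at each end
    return statistics.mean(v[k:len(v) - k])

tph = [40, 55, 60, 62, 65, 68, 70, 72, 75, 900]  # one extreme outlier

print(statistics.mean(tph))  # badly distorted by the outlier
print(trimmed_mean(tph))     # much closer to the typical ratio
```

The inter-quartile variant mentioned above works the same way with trim set to 0.25, i.e. averaging only the middle half of the distribution.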
Ratios of this type can also be of more general use for validating updates, matching records from different sources, and detecting errors. For example, by graphing and studying the distribution of turnover per head values, it is also often possible to get useful information about the population of units in question. The following charts are examples of what has been observed from such an exercise:
1. Normal distribution
In this case the turnover per head values are distributed evenly around the mean, indicating a relative degree of homogeneity amongst the population of units and a very limited impact of outliers.
2. Skewed distribution
In this case there is a clear grouping of units around a relatively low value, but the outliers in the long right-hand tail would clearly affect the mean of the distribution. This is a relatively common distribution for turnover per head data, and highlights the need to take measures to reduce the impact of these outliers.
3. Bi-modal distribution
This case illustrates that the population in question is rather heterogeneous, and that it might be worth splitting it into two sub-populations to get more meaningful turnover per head ratios.
4.11 Resistance to Change
One of the main barriers to the more effective use of administrative sources in official statistics, and one of the least recognised, can come from within the statistical organisation itself. Statisticians may resist the use of administrative data because they do not trust data that they have not collected themselves. They often focus on the negative quality aspects of administrative data, and hold an over-optimistic view of the quality of survey data, based on the largely untested assumption that survey responses actually comply with statistical norms.
The solution is clearly through better education of statisticians regarding the possibilities offered by administrative sources, encouraging them to take a wider view of all the dimensions of quality, and focus on the impact on data suppliers and users. In this context it is important to determine the real relative quality of survey and administrative data. For example, it is often assumed that data from administrative sources do not meet the requirements of statistical definitions, whereas those from official surveys do. However, there may not be any real difference in practice, particularly if respondents to statistical surveys simply copy values from recent administrative returns, without reading the often lengthy notes about how a particular variable should be defined for statistical purposes.
A further way to help break down the barriers of internal resistance is to show that cost savings from using administrative data do not necessarily mean staff reductions. The resources saved can, at least partly, be used to improve quality or increase the range or frequency of outputs.
This chapter clearly shows that there are many problems to overcome when using administrative sources. It also aims to show that others have also faced these problems, and that in most cases it is possible to find full or partial solutions. It cannot cover all potential problems the reader may face, particularly those that are source-specific, but the intention is to give ideas that can be adapted to meet specific circumstances.
Overall, it is true to say that most problems encountered in the use of administrative data for statistical purposes, in common with many other areas of statistics, can be overcome, or at least reduced, by effective planning and management, a good knowledge of data sources, creative thinking, and the willingness to exchange experiences and learn from others.
Administrative data often require different processing than statistical sources. Simply substituting administrative data for statistical data without changing the statistical production process will rarely work in practice.
The final thing to remember is that despite all of the problems, the benefits of using administrative data are still often much greater than the costs.
 Available from Eurostat at: http://epp.eurostat.ec.europa.eu/cache/ITY_OFFPUB/KS-BG-03-001/EN/KS-BG-03-001-EN.PDF
 “The Impact of Diverging Interpretations of the Enterprise Concept” - a study prepared for Eurostat by Statistics Netherlands with input from Denmark and UK.
 See the Resolution concerning statistics of the economically active population, employment, unemployment and underemployment, adopted by the Thirteenth International Conference of Labour Statisticians (October 1982) http://www.ilo.org/global/statistics-and-databases/standards-and-guidelines/resolutions-adopted-by-international-conferences-of-labour-statisticians/WCMS_087481/lang--en/index.htm
 For an example relevant to Figure 4.2, see Annex B of Business start-ups and closures: VAT registrations and de-registrations in 2005 - Guidance and Methodology http://stats.berr.gov.uk/smes/vat/VATGuidance2005.pdf
 Source: Model Quality Report in Business Statistics, Volume III, Eurostat http://epp.eurostat.ec.europa.eu/portal/page/portal/quality/documents/MODEL%20QUALITY%20REPORT%20VOL%203.pdf
 An example of benchmarking, using maps to compare the coverage of a statistical business register with that of a commercial telephone directory can be found in the paper “The Development of Small-area Business Statistics in the United Kingdom” at http://live.unece.org/fileadmin/DAM/stats/documents/ces/sem.53/wp.7.e.pdf