Applying data synthesis for longitudinal business data across three countries

Statistics in Transition New Series

Polish Statistical Association

Central Statistical Office of Poland

Subject: Economics, Statistics & Probability

ISSN: 1234-7655
eISSN: 2450-0291

Volume 21, Issue 4 (August 2020)

Special Issue

M. Jahangir Alam / Benoit Dostie / Jörg Drechsler / Lars Vilhuber

Keywords: business data, confidentiality, LBD, LEAP, BHP, synthetic

Citation Information: Statistics in Transition New Series. Volume 21, Issue 4, Pages 212-236, DOI: https://doi.org/10.21307/stattrans-2020-039

License: (CC BY-NC-ND 4.0)

Received: 31-January-2020 / Accepted: 30-June-2020 / Published Online: 15-September-2020

ABSTRACT

Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about a business than about any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to give researchers wider access to such data sets. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two other countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We assess utility and protection, and evaluate the feasibility of extending such an approach in a cost-effective way to other data.
