Big Data: Big Threat or Big Opportunity for Official Statistics?
by Dr. Jose Ramon G. Albert 1
(Posted 18 October 2013)
Official Statistics and Public Policy
Governments across the world, including our own, recognize the need for information to manage their economies more effectively, particularly to accelerate progress in meeting national development plans, and global commitments to reduce poverty and related goals exemplified in the Millennium Development Goals (MDGs), and the Post-2015 MDG Agenda. When nations committed to carrying out the MDGs in 2000 during the Millennium Summit, a number of statistical indicators were identified by experts from the development community a year later in order to set specific time-bound targets by 2015, and to monitor the progress in meeting the MDGs. Official statisticians especially from developing countries, however, were not involved in the selection of these MDG indicators, and ironically, even as a number of developing countries do not have a sufficient number of these MDG indicators, even more statistical indicators were added to diagnose progress in the MDGs or the lack of it. The neglect of inputs from official statisticians is also seen from the lack of participation from the PH Statistical System during the last United Nations General Assembly meeting held in the last week of September this year.
The UN Development Programme (UNDP) in Manila seems to have inadvertently neglected to include a representative from the PH Statistical System (PSS), despite the fact that compilation of the MDG indicators is lodged at the National Statistical Coordination Board (NSCB), and many of these indicators are sourced from the PSS. Despite this big oversight by UNDP, I still managed to get to New York, not for the GA sessions themselves but rather to help organize a well-attended side-event on “Engineering a Development Data Revolution” hosted by the PH government through the NSCB and the National Economic and Development Authority, in cooperation with the Partnership in Statistics for Development in the 21st Century (PARIS21).
In the PSS, there are currently in operation a number of major statistical agencies involved in the production of official statistics through primary data collection and/or compilation of data. Among these agencies are the National Statistical Coordination Board (NSCB) and the National Statistics Office. The NSCB, a policy making and coordinating body, releases the national income accounts and official poverty statistics; while the NSO is the main producer of many general purpose statistics, since it conducts censuses, and several major household surveys as well as establishment inquiries.2
Many PSS stakeholders expect the NSCB Technical Staff to produce preliminary estimates of the Gross Domestic Product (GDP) much earlier than the schedules in the NSCB Advance Release Calendar (which are typically 55 to 60 days after the reference quarter). Data users of sample surveys of NSO would similarly expect this producer of data to release survey results much quicker than their regular schedule, especially since in this age of information and communications technology, we have various tools and gadgets to share and process information with increasing velocity.
However, while the PSS has infrastructure and a lot of expertise for examining data quality, integrity and comparability, the data harvested in official statistics through various data sources such as surveys, censuses and administrative reporting systems are still dependent on what potential data suppliers provide to the PSS. Despite the advancements in the pace of sharing of information, respondents to NSO establishment surveys, for instance, still do not always provide information in a timely manner. To ensure the robustness of survey results, all the information supplied by respondents still has to undergo consistency checks before it can be aggregated for purposes of estimating production in the national income accounts. In addition, like many statistical systems across the world, the PSS has also not been provided the requisite financial and human resources to meet the ever-growing needs of stakeholders, thus further constraining plans to improve timeliness in the delivery of statistical products and services to the public.
The Code of Practice in the PSS for integrity, independence and professionalism
The NSCB Technical Staff have often pointed out that the PSS tries to live up to a code of practice summarized by the United Nations Fundamental Principles of Official Statistics (UN FPOS). The first two principles of the FPOS summarize the relevance of official statistics, the impartiality required in the production process, as well as the professional standards and ethics for ensuring the credibility of official statistics. Implicit in the first principle is a definition of official statistics, i.e., “data about the economic, demographic, social and environmental situation”, and a mandate of national statistical systems to be auditors of a country’s socio-economic performance. Consequently, it is vital for any National Statistical System (NSS), such as the PSS, to live up to the UN FPOS in order to preserve its integrity, maintain its independence, and adhere to professional conduct in the production and release of official statistics.
While the quality of statistics involves various criteria, there has undoubtedly been more focus in the production of official statistics on managing precision and accuracy over timeliness and other quality issues. Official statistics are currently almost exclusively based on surveys and censuses, as well as administrative data reporting systems from government programs, often resulting from legislative mandates provided to an NSS. While databases from these sources have a big “volume,” the resulting data can hardly be called “big data” unless the data collection gains “velocity”, i.e., unless data are collected more frequently (hourly, daily or weekly) instead of the usual monthly, semi-annual or annual schedule.
Data sources in official statistics have been “tried and tested” mechanisms for ensuring the credibility of official statistics. National income accounts, data on prices, among others typically follow a Statistical Quality Assessment Framework to ensure integrity and credibility of resulting figures. Fellegi (1996) suggests that credibility is fundamental in official statistics: "Credibility plays a basic role in determining the value to users of the special commodity called statistical information. Indeed, few users can validate directly the data released by statistical offices. They must rely on the reputation of the provider of the information. Since information that is not believed is useless, it follows that the intrinsic value and usability of information depends directly on the credibility of the statistical system. That credibility could be challenged at any time on two primary grounds; because the statistics are based on inappropriate methodology, or because the office is suspected of political biases."
While National Statistical Systems, including our very own PSS, have been producing a lot of official statistics, there are those who point out that the current stream of official statistics is not sufficient to help us identify what needs to be done to totally eradicate poverty in the world, to improve the lives of every person, and to sustain progress in societies. A High Level Panel (HLP) of Eminent Persons came out with a report that suggested what could be the beginning of a Post-2015 MDG agenda. The HLP called for a “data revolution” in order to reach Zero (Poverty), and this has been echoed by an Open Working Group (OWG) on Sustainable Development Goals.
Understandably, there is demand for official statistics to be not only more disaggregated but also more frequent and timely, in an age when quick, voluminous data are also being produced as a by-product of the use of electronic devices (mobile phones, smart phones, tablets, laptops), social media, “Google” searches, sensors, and tracking devices (GPS). In 2012 alone, 2.5 quintillion (2.5 x 10^18) bytes of data were being created per day. There are more and more Internet subscribers. In the PH alone, Internet penetration reached 36% in 2012 from 2% in 2000 (Figure 1). Similarly, we find more and more mobile subscribers in the world. In the Philippines, as of 2012, there were 102 mobile subscriptions per 100 persons (Figure 2). The latter statistic may initially seem strange, but it can be readily explained given that some people (myself included) have more than one mobile subscription (I have Globe, Smart and Sun cellular lines!!!). This age of gadgets, social media and sensors has increased the public need and expectation for “Knowing (information) in (Real) Time”.
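Figure 1. Internet penetration in the Philippines. Source: International Telecommunication Union
Figure 2. Mobile subscriptions per 100 persons in the Philippines. Source: International Telecommunication Union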
Big Data is Here!
The world’s capacity to collect data has reportedly doubled every 40 months since the 1980s, with about 2.5 quintillion (2.5 x 10^18) bytes of data being created per day in 2012. With a tsunami of data being shared and transmitted on the web and through various electronic means, including tracking devices (such as mobile phones and GPS), at an exponential rate (and in a variety of formats), the public’s hunger for information is likewise accelerating.
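To get a feel for what “doubling every 40 months” implies, the short sketch below converts the doubling time into an equivalent annual growth rate and a ten-year growth factor. It is illustrative arithmetic only; the function names are made up for this example, and the 40-month figure is simply the one quoted above.

```python
# Illustrative arithmetic only: what "doubling every 40 months" implies.
DOUBLING_MONTHS = 40  # doubling time quoted in the text


def growth_factor(years: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


def annual_growth_rate(doubling_months: float = DOUBLING_MONTHS) -> float:
    """Equivalent compound annual growth rate, expressed as a fraction."""
    return 2 ** (12 / doubling_months) - 1


if __name__ == "__main__":
    print(f"Equivalent annual growth: {annual_growth_rate():.1%}")
    print(f"Growth over 10 years: {growth_factor(10):.0f}x")
```

Under this assumption, capacity grows by roughly 23% a year, or eightfold in a decade (three doublings in 120 months).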
What is now being referred to as Big Data3, which is typically characterized by the 3 V’s (volume, velocity and variety), is certainly creating a number of business opportunities, especially in rich countries. In addition, there are anecdotes about the success of using Big Data for practical problems.
In 2008, Google established a real-time flu tracker called “Google Flu Trends” by watching where people searched for terms relating to illness and comparing that data with figures from the US Centers for Disease Control and Prevention (CDC). There have been some indications of success in using these health-related online data (see, e.g., J. Ginsberg et al., Nature, 2009) to detect disease outbreaks. Currently, Google is engaged in an experiment to apply the same idea to dengue through Dengue Trends (see Figure 3), and there are also indications of success, at least for Brazil. Google has also made attempts to examine web searches and correlate them with actual sales (of cars), among other statistics (Choi & Varian, April 2009).
The UN Global Pulse also reports on studies that tracked tweets from Twitter accounts in Jakarta about the high price of rice and correlated them with the actual price of rice (Letouzé, 2012), and that compared mobile phone usage in Jakarta with traffic in the city. These case studies show the vast potential of Big Data, but there are also issues regarding the realization of this potential, especially in developing countries.
The ICT Data and Statistics Division of the International Telecommunication Union reports that this year, there are practically as many mobile-cellular subscriptions (6.8 billion) in the world as there are people (7.1 billion) with growth and penetration higher in the developing world. In addition, nearly two in five people in the world (39%) are Internet users, but while Internet penetration is 77% in the developed world, the figures are much lower (31%) for the developing world, and one has to add that Internet speed is certainly not the same in the developed and developing worlds. Disparities in technological and analytical capacities may seriously yield a big divide in knowledge of using big data to inform decision making between advanced economies and the developing world.
Figure 3. Screenshot of Google Dengue Trends
Data Revolution in the PH
Governments want fast-paced and, if possible, real-time data that will help in making better and quicker decisions. For instance, in the Philippines, there is growing recognition that climate disasters (including storms and floods) that batter the Philippines every year, with their increasing intensity and their movements across the areas affected, are becoming a very serious threat to the country’s growth and development. As reported by Thomas et al. (2013), disaster data from the Centre for Research on the Epidemiology of Disasters suggest that within Asia and the Pacific, the Philippines experienced the fourth highest frequency (98) of intense hydrological disasters during 1971–2010, topped only by Indonesia (124), India (167), and the PRC (172), all of which have much larger land areas. The Philippines also experienced the highest frequency (218) of intense meteorological disasters in the region during the same four decades.
The Philippine Atmospheric, Geophysical and Astronomical Services Administration (PAGASA) suggests that from 1951 to 2010, the annual average frequency of tropical cyclones affecting the country has remained unchanged at around 19 to 20 cyclones per year. However, an examination of the typical paths of tropical cyclones per decade indicates that cyclones have been shifting southward toward central and southern Philippines. In addition, there is evidence that the amount of precipitation associated with cyclones appears to be increasing. To manage the risks associated with these hazards of nature, the PH government, through the Department of Science and Technology (DOST), started a flagship project called Nationwide Operational Assessment of Hazards (NOAH) in June 2012. Project NOAH involves the development of hydromet sensors (e.g. automatic rain gauges, water level sensors, stream gauges) as well as high resolution geo-hazard maps. The latter can provide national and local chief executives with early warning lead time (i.e., 6 hours or less) to minimize the costs to lives, property and livelihood from these hazards of nature. Project NOAH uses topographic maps generated by light detection and ranging (LiDAR) for flood modeling, but currently the maps generated are limited to selected locations around the country’s major river basins. These maps and other weather information are shared publicly through the NOAH website noah.dost.gov.ph and some mirror sites. Undoubtedly, these high velocity data have led national and local governments to become more disaster prepared. In Cagayan de Oro alone, we see evidence of how information has brought about disaster readiness. In 2011, typhoon Sendong led to 676 deaths in Cagayan de Oro. A year later, a typhoon of similar strength (Pablo) had only one death reported.
In the Philippines, the data revolution has begun. Aside from the development of hazard maps and other useful information by DOST through Project NOAH, the PSS has also been using extensively improved technologies in the design, production, and release of official statistics. In the on-going re-design of its master sample of household surveys, the National Statistics Office (NSO) has made extensive use of Google maps, and is testing out the use of tablets for faster collection and processing of information. The NSCB Technical Staff is extensively using the web and social media (Facebook, Twitter, livestream) for dissemination of online articles, and its releases.
Big Data: Big News or Big Mess?
There is undoubtedly growing enthusiasm about this data revolution and its possibilities for making use of Big Data, especially for measuring and monitoring progress in societies. Many have come to realize that the data revolution and its effects are here to stay. Official statisticians are taking note of this emerging alternative data source, but with some degree of caution, as bigger data need not always mean better data. There is undoubtedly some tension between official statistics and big data, as the latter were not tailor made for statistical purposes (see Table 1). Big data is largely unstructured, unfiltered data exhaust from digital products, such as electronic and online transactions, social media, sensors (GPS, climate sensors), and consequently, analytics can be poor, unlike traditional data sources utilized for official statistics that are well-structured, but of high cost, and typically infrequent with time lags.
Table 1. Traditional data sources for official statistics compared with Big Data
Traditional data sources | Big Data
1. Structured and planned product | 1. Largely unstructured, unfiltered “data exhaust”, i.e., by-product of digital products (transactions, web, social media, sensors)
2. Methodological and clear concepts | 2. Poor analytics
4. Macro-level but typically based on high volume primary data | 4. Micro-level huge volume with high velocity (or frequency) and variety
5. High cost | 5. Generally little, or no cost
6. Centralized; point in time | 6. Distributed; real-time
At the 44th Session of the UN Statistical Commission held in February 2013 in New York, a Seminar on Emerging Issues entitled “Big Data for Policy, Development and Official Statistics” was held. In this seminar, the High-Level Group for the Modernisation of Statistical Production and Services released a white paper discussing legislative, financial, management, methodological, and technological challenges in the use of Big Data in official statistics, not least the major concern of privacy, given the extent of personally identifiable information available in social media and transactional data that could potentially allow “Big Brother” to watch over us.
Much of Big Data being generated includes very personal information. Precise, geo-location-based information certainly pushes the boundary of privacy/confidentiality. It is clear that Amazon, Visa, Mastercard are watching our shopping preferences; Google is watching our browsing habits; Twitter is watching what’s on our minds; Facebook is watching various information about us, including our social relationships; and mobile providers are watching whom we talk to, what we say to them, and even who is nearby.
Privacy raises not only legal issues but also technological ones. While users of technology routinely tick a box to consent to the collection and use of web-generated data and may decide to have some information put on public view, it is unclear whether they actually consent to having their data analyzed, especially if it can be to their disadvantage. Can users give “informed consent” to an unknown use? For instance, when Google Flu Trends was developed in 2008, did Google have to contact all its users for approval to use old search queries for this project? Even if that were possible, the time and cost of doing that would have been enormous for Google. So, should users be asked to agree to any possible future use of their data? Of course, there are other ways to protect privacy and confidentiality, but these are still imperfect. Providers of data could opt out (but this can still leave a trace), and the same goes for anonymization (as “re-identification” is still possible). Mr. Johannes Jutting, Manager of the PARIS21 Secretariat, further illustrated the ill effects of privacy issues when big data are interlinked, similar to what is shown in the video below.
Even after getting through the hurdles of addressing privacy issues, and even when national statistical systems like the PSS are able to access Big Data, or a component of Big Data, there are still other methodological challenges in ensuring the representativeness of the information gathered. In his new book “The Signal and The Noise”, statistics guru Nate Silver (who is credited not only for his excellent analysis of sports statistics, but also for accurately predicting the results of the last presidential election in the United States) points out that “[Big Data] is sometimes seen as a cure-all, as computers were in the 1970s. Chris Anderson… wrote in 2008 that the sheer volume of data would obviate the need for theory, and even the scientific method…. [T]hese views are badly mistaken. The numbers have no way of speaking for themselves. We speak for them. ... If the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information almost certainly isn't. Most of it is just noise, and the noise is increasing faster than the signal.”
There are those who think that the big gains in velocity and cost outweigh the sacrifice in precision and accuracy, i.e., Big Data may not be completely accurate, but it is “good enough.” But how good is “good enough”? Recent work (D. Butler, Nature, Feb. 2013), for instance, reports that Google Flu Trends over-estimated flu levels (11% of the US public this flu season, almost double the CDC’s estimate of about 6%).
A study of Twitter and Foursquare data before, during and in the aftermath of Hurricane Sandy (Grinberg, et al., 2013) revealed interesting results: (i) grocery shopping peaked the night before the storm; (ii) nightlife picked up the day after; (iii) the greatest number of tweets about Hurricane Sandy came from Manhattan. The latter creates the illusion that Manhattan was the hardest hit in the US by Hurricane Sandy, and it certainly wasn’t.
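The Manhattan illusion is, at heart, a problem of counting without a denominator. The sketch below uses made-up numbers (they are NOT from Grinberg et al., and the area names and user counts are purely hypothetical) to show how ranking areas by raw tweet counts can differ from ranking them by tweets per 1,000 active users.

```python
# Hypothetical numbers for illustration only; they are NOT from Grinberg et al.
areas = {
    # area name: (storm-related tweets, estimated active users)
    "Area A (dense, well connected)": (50_000, 1_000_000),
    "Area B (hard hit, fewer users)": (6_000, 40_000),
}


def tweets_per_1000_users(tweets: int, users: int) -> float:
    """Normalize a raw tweet count by the size of the user base."""
    return 1000 * tweets / users


def rank_by(metric: dict) -> list:
    """Return area names sorted from the highest to the lowest metric value."""
    return sorted(metric, key=metric.get, reverse=True)


raw = {area: tweets for area, (tweets, users) in areas.items()}
rate = {area: tweets_per_1000_users(tweets, users)
        for area, (tweets, users) in areas.items()}

if __name__ == "__main__":
    print("Ranked by raw tweet count:    ", rank_by(raw))
    print("Ranked by tweets per 1,000:   ", rank_by(rate))
```

With these assumed figures, Area A leads on raw counts while Area B leads once the user base is taken into account. Even this adjustment is only partial: areas that lose power or connectivity may tweet less precisely because they were hit harder, which is the kind of coverage problem that survey weighting in official statistics is designed to confront.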
There is also danger when Big Data is perilously used to predict the future. Those of you who watched the movie Minority Report may remember a scene about someone getting arrested for a crime he was supposedly going to commit! One might say that this is too far-fetched. But is it? Parole boards in the US are using “predictions” from data analysis to decide whether or not to give parole to inmates. The City of Memphis, Tennessee uses a program called Blue CRUSH (Crime Reduction Utilizing Statistical History) to concentrate police resources in a specific area at a specific time. They report that crimes have fallen by a quarter since CRUSH’s inception in 2006, but was it due to CRUSH??? Also, we must know that the US Department of Homeland Security uses FAST (Future Attribute Screening Technology) to identify potential terrorists. This is reportedly 70% accurate (but how this rate was obtained is certainly baffling!).
While I have often said that statistics tell a story, in the case of Big Data, the story we get may not be very clear, especially since the individual pieces of information are not weighted. While it is clear that Big Data is here to stay, it should also be clear that the data revolution does not mean the end of official statistics.
Statistical systems such as the PSS are merely challenged to come up with better statistics for a better society. The challenge is for the PSS to be more forward looking and open to making use of non-traditional data sources, such as Big Data. Clearly, there will also be a need to identify legal protocols and institutional arrangements so that the PSS can get access to Big Data. There will be a need for Public-Private Partnerships, including bilateral arrangements between the PSS and those that own Big Data holdings. But there will also be a need to address privacy issues with Big Data, in order to prevent its misuse. Capacity building will also be required for the PSS and its partners to harness Big Data, so that the official statistics community can help identify “signals” within “noise”, certify quality and ultimately decipher truth from falsehood, so that in this country, statistics can truly matter to every Filipino.
Reactions and views are welcome thru email to the author at firstname.lastname@example.org
1 Secretary General of the National Statistical Coordination Board (NSCB). The NSCB, a statistical agency functionally attached to the National Economic and Development Authority (NEDA), is the highest policy making and coordinating body on statistical matters in the Philippines. Immediately prior to his appointment at NSCB, Dr. Albert was a Senior Research Fellow at the Philippine Institute for Development Studies, a policy think tank attached to NEDA. Dr. Albert finished summa cum laude with a Bachelor of Science degree in Applied Mathematics from the De La Salle University in 1988. He completed a Master of Science in Statistics from the State University of New York at Stony Brook in 1989 and a Ph.D. in Statistics from the same university in 1993. He has taught at various higher educational institutions, and is currently a Professorial Lecturer at the Decision Sciences and Innovation Department of Ramon V. Del Rosario College of Business, De La Salle University. He is also a past President of the Philippine Statistical Association, a Fellow of the Social Weather Stations, and an Elected Regular Member of the National Research Council of the Philippines.
The author thanks Director Candido J. Astrologo, Jr., Director Regina S. Reyes, Noel S. Nepomuceno and Sonny U. Gutierrez for the assistance in the preparation of the article. The views expressed in the article are those of the author and do not necessarily reflect those of the NSCB and its Technical Staff.
2 The NSCB and NSO, together with the Bureau of Agricultural Statistics and the Bureau of Labor and Employment Statistics, are effectively abolished and merged into the Philippine Statistics Authority as a result of the Philippine Statistical Act of 2013 (Republic Act 10625), which was signed into law on September 12, 2013 by President Benigno Simeon C. Aquino III and took effect fifteen days later. Pending the full operationalization of the merger, which awaits the implementing rules and regulations (IRRs) of the law, these major statistics agencies will continue to carry out their statistical activities and programs under a hold-over status.
- Butler, D. (2013). When Google got flu wrong. Nature. Available on the Internet: http://www.nature.com/polopoly_fs/1.12413!/menu/main/topColumns/topLeftColumn/pdf/494155a.pdf
- Choi, H. and Varian, H. (2009). Predicting the Present with Google Trends. Available on the Internet: http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en//googleblogs/pdfs/google_predicting_the_present.pdf
- Fellegi, I. (1996). Characteristics of an Effective Statistical System. International Statistical Review, Vol. 64, pp. 165-197.
- Ginsberg, J. et al. (2009). Detecting Influenza Epidemics Using Search Engine Query Data. Nature, 457, pp. 1012-14. Available on the Internet: http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
- Grinberg, N., Naaman, M., Shaw, B., and Lotan, G. Extracting Diurnal Patterns of Real World Activity from Social Media. Available on the Internet:
- Letouzé, E. (2012) “Big Data for Development: Challenges & Opportunities”. UN Global Pulse. Available on the Internet: http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-UNGlobalPulseJune2012.pdf
- Mayer-Schonberger, V. and Cukier, K. (2013). Big Data: A Revolution that Will Transform How We Live, Work and Think. New York: Houghton Mifflin Harcourt Publishing Company.
- Silver, N. (2013). The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. United Kingdom: Penguin.
- Thomas, V., Albert, J.R., Perez, R. (2013). Climate-Related Disasters in Asia and the Pacific. ADB Economics Working Paper Series No. 358 , July 2013 Available on the Internet : http://www.adb.org/sites/default/files/pub/2013/ewp-358.pdf
- UNECE. (2013). What Does “Big Data” Mean For Official Statistics? United Nations Economic Commission For Europe. Available on the Internet: http://www1.unece.org/stat/platform/pages/viewpage.action?pageId=77170614