Measuring the information society using big data

27-11-2014

How can we take advantage of big data (1) to improve the availability of currently existing global ICT statistics, and (2) to improve the measurement of the information society?

The opportunities for using big data in the production of ICT statistics lie in combining a big data source with a big data collection method. Nine categories of big data sources that could be of interest have been identified.

Telecommunication networks

Mobile operators possess a huge amount of telecommunications data, such as traffic data, service access and call detail records (CDRs), location and movement data, device characteristics, tariff data and customer details. The data can be used in several research areas such as monitoring the information society, epidemiology, migration patterns and socio-economic analyses. The first research area is especially promising because detailed data about user consumption can be derived. The potential of telecommunications data is enormous: by combining data from telecom operators, it is in principle possible to obtain insights into every subscriber in the world. Obtaining this data, however, requires the cooperation of all mobile operators around the world.
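As a rough illustration of how consumption figures could be derived from CDRs, the sketch below sums voice minutes and data volume per subscriber. The record layout (subscriber, seconds, bytes) and the sample values are invented for this example; real CDR formats differ per operator and contain far more fields.

```python
from collections import defaultdict

# Made-up sample records; field names are assumptions, not a real CDR format.
CDRS = [
    {"subscriber": "A", "seconds": 120, "bytes": 0},
    {"subscriber": "A", "seconds": 0, "bytes": 5_000_000},
    {"subscriber": "B", "seconds": 60, "bytes": 0},
    {"subscriber": "B", "seconds": 0, "bytes": 1_000_000},
]


def consumption_per_subscriber(records):
    """Sum voice minutes and data volume (in MB) for each subscriber."""
    totals = defaultdict(lambda: {"voice_minutes": 0.0, "data_mb": 0.0})
    for record in records:
        entry = totals[record["subscriber"]]
        entry["voice_minutes"] += record["seconds"] / 60
        entry["data_mb"] += record["bytes"] / 1_000_000
    return dict(totals)


if __name__ == "__main__":
    for subscriber, usage in consumption_per_subscriber(CDRS).items():
        print(subscriber, usage)
```

In practice such aggregates would be computed by the operators themselves and shared only at an anonymised, aggregated level.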

Mobile devices

Mobile devices such as smartphones, tablets and e-readers present an opportunity for gathering data on the usage of these devices and the applications that run on them. The owners of the software platforms of these devices (e.g. Google Android, Apple iOS, Microsoft Windows Phone) have mechanisms in place in their operating systems (OS) to automatically collect diagnostic data upon consent of the end-user. The collected data can be divided into three categories: wireless statistics (e.g. amount of data transmitted), usage statistics (e.g. how many times the device was ‘woken up’) and market statistics (e.g. number of app downloads by country). The wireless statistics could be used to extend and complement the current statistics on the mobile cellular network, while the usage and market statistics could be used to produce new indicators.

Legally speaking, there do not seem to be many problems with this type of data collection, since users have to give their consent. The collection is, however, biased towards users who: (1) have a smartphone, (2) use the vendor’s operating system, and (3) choose to enable diagnostic data collection. The continuity of the data collection is another issue, since platforms come and go.

Social networks and instant messaging services

Social networks have enormous amounts of knowledge about their users. Facebook, for instance, has demographic information on its users such as gender, age and location, and in some cases also on users’ online activities. The information can be used to profile internet users in different countries. It does not, however, capture users who use a local alternative social network (such as Weibo or VKontakte) or no social network at all. Social networks also come and go.

Instant messaging services do not capture a lot of data about their users. The relevant indicator would be the number of instant messages sent, which can be useful to capture the transition from SMS and MMS to instant messaging. Instant messaging services suffer from similar issues as social networks. Many different services are available, such as WhatsApp, WeChat and Line, and competition between them is fierce.

Telecommunications equipment

A distinctly different method is to skip the operators and retrieve data directly from telecommunications equipment such as core routers and switches. The data relates to the performance of the network, such as the amount of traffic and the number of errors. The main benefit is that the number of actors who manufacture telecommunications equipment is relatively small, making the data collection procedure less complex. The collection of the data is legally difficult because the manufacturers must have permission from their customers (e.g. the mobile operators).

Auto-update services

Modern software packages (on both mobile and desktop devices) often have a mechanism in place to automatically download updates from the software vendor over the Internet. This could be a useful source of information, as the software vendor can keep statistics on update frequency at device and country level. One of the most important pieces of software on mobile as well as desktop devices is the Internet browser. Information from browser manufacturers can be used to estimate the number of devices connected to the Internet. The data collection should not pose an issue from a legal point of view, since only high-level data, and not individual user data, is transmitted to the manufacturers. Continuity may, however, be an issue.

Content delivery networks

A CDN (Content Delivery Network) is a system that allows for efficient and fast content distribution from content provider to end-user. Most CDNs distribute frequently requested content from the content provider to servers placed locally (e.g. at popular internet exchanges) to improve response time and throughput. CDN providers are able to retrieve, among other things, end-user download speeds and end-user connection latency. There should be neither significant technical nor legal issues, as some CDN providers are already publishing part of the statistics themselves.

Internet as a data source (IaD)

The internet itself can be used as a source of data. One notable example is the use of web scrapers. Web scraping is the practice of parsing a website’s pages in order to extract structured data. This practice can be used to find mobile phone and broadband tariffs on the websites of mobile operators around the world. The method makes it possible to obtain data much faster and without the intervention of national regulatory authorities (NRAs) and mobile operators. National statistical offices (NSOs) are already experimenting with this method to estimate the Consumer Price Index.
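To make the idea of scraping tariffs concrete, here is a minimal sketch using only Python’s standard library. The HTML snippet, the table layout and the names TariffTableParser and extract_tariffs are invented for this illustration; a real operator website will be structured differently and would need its own parsing rules.

```python
from html.parser import HTMLParser

# Hypothetical tariff page; real operator sites differ in structure.
SAMPLE_HTML = """
<table id="tariffs">
  <tr><th>Plan</th><th>Data (GB)</th><th>Price (EUR)</th></tr>
  <tr><td>Basic</td><td>1</td><td>9.99</td></tr>
  <tr><td>Plus</td><td>5</td><td>19.50</td></tr>
</table>
"""


class TariffTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def extract_tariffs(html):
    """Turn the table rows into structured records (assumes three columns)."""
    parser = TariffTableParser()
    parser.feed(html)
    return [
        {"plan": row[0], "data_gb": float(row[1]), "price_eur": float(row[2])}
        for row in parser.rows
        if len(row) == 3
    ]


if __name__ == "__main__":
    for tariff in extract_tariffs(SAMPLE_HTML):
        print(tariff)
```

A production scraper would additionally have to fetch the pages, respect the site’s terms of use and cope with layout changes over time.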

Other types of Internet as a data source are network-centric (where measurements are performed by sending data over the internet and tracing the path travelled) and site-centric (where users’ behavior is measured using tracking software installed on websites).

Security vendors

The security of ICT infrastructures has become increasingly relevant in the past few years. Data from virus scanners, such as the percentage of computers with antivirus software, can be used to make statements about people’s awareness of computer viruses. The use of virus scanner data is biased towards computers with Windows installed and requires the cooperation of the vendors. Spamhaus is an international non-profit organization whose mission is to provide real-time anti-spam protection. Its data can be used to identify countries that house a lot of spammers. The data is already being collected by Spamhaus, but no link has yet been made with these countries.

Mobile payment platforms

Payment platforms provide an opportunity for big data, especially considering the ongoing and rapid introduction of mobile payment. Mobile payment systems can provide a new category of indicators, such as the number of mobile payment transactions. The indicators can be used to measure the penetration of mobile banking in different countries. A variety of parties are exploring the potential of mobile payments, each with its own platform. Examples are Apple (Apple Pay), eBay (PayPal), Vodafone (M-Pesa) and Alibaba (Alipay). This variety makes it difficult to collect data from all users of mobile payment platforms.

Conclusion

Big data can either augment or improve existing statistics, or introduce new statistics. The following figure gives an overview of all the big data sources discussed earlier. The various indicator topics are shown horizontally and the different points at which data is produced are displayed vertically. The figure shows big data sources (blue bars) as well as the currently collected indicators (green circles). Where a blue bar overlaps a green circle, big data is used to improve existing statistics. In other cases, big data provides a new indicator.

Big data for information society statistics

Source: Research I conducted while at Dialogic in 2014.