Google Search, at 62 petabytes, ranks last among well-known big data sources

As is well known, algorithms, computing power, and data are the troika driving artificial intelligence (AI). Scholars such as Andrew Ng often speak of data-centric or data-driven AI.

The surge in data volume in recent years has been one of the driving forces behind AI's take-off, and data plays a central role in AI. So just how big is the "big data" people so often talk about?


Out of curiosity, Luca Clissa, an Italian physics researcher, investigated the sizes of several well-known big data sources (Google Search, Facebook, Netflix, Amazon, etc.) in 2021 and compared them with the data detected by the electronics of the Large Hadron Collider (LHC).

Paper address:

https://arxiv.org/pdf/2202.07659.pdf

There is no doubt that the LHC's data volume is staggering: up to 40k EB of raw data per year. But the data volumes of commercial companies should not be underestimated. The data stored in Amazon S3, for example, has reached about 500 EB, roughly 8,000 times the size of Google Search (62 PB).
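As a quick sanity check on that ratio, here is a one-off Python sketch using only the two figures quoted above (decimal units assumed):

```python
# Quick check of the ratio quoted above: Amazon S3 (~500 EB) vs Google Search (~62 PB),
# in decimal units (1 EB = 1000 PB).
s3_bytes = 500e18            # ~500 EB
google_search_bytes = 62e15  # ~62 PB
print(f"~{s3_bytes / google_search_bytes:,.0f}x")  # ~8,065x, i.e. roughly 8,000 times
```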

Streaming data also holds its own in the big data market: services such as Netflix, along with electronic communications, generate one to two orders of magnitude more traffic than pure data producers.

LHC data volume

According to Luca Clissa's survey, the approximate volumes of the major well-known data sources in 2021 are as follows:

The largest, shown in the upper right corner of the comparison chart (in grey), is the data detected by the electronics of the Large Hadron Collider (LHC) experiments at CERN. In its last run (2018), the LHC produced about 2.4 billion particle collisions per second in each of the four main experiments (ATLAS, ALICE, CMS, and LHCb), with each collision delivering about 100 MB of data, so the estimated annual raw data volume is about 40k EB (1 EB = 1 billion gigabytes).

With current technology and budgets, however, storing 40k EB of data is impossible. Moreover, only a fraction of the data is actually meaningful, so there is no need to record all of it. The volume of recorded data is therefore reduced to about 1 PB per day, and the last run in 2018 collected only 160 PB of real data and 240 PB of simulated data.

The collected data is also continuously transferred through the WLCG (Worldwide LHC Computing Grid), which generated 1.9k PB of annual traffic in 2018. Meanwhile, CERN (the European Organization for Nuclear Research) is working to boost the LHC's capacity through the HL-LHC upgrade, which is expected to increase the amount of data produced more than 5-fold, with an estimated 800 PB of new data annually by 2026.

Big tech data volume comparison

Large companies' data volumes are hard to track, and the figures are usually not made public. To get around this, Luca Clissa used Fermi estimation: breaking the data production process down into its atomic components and making reasonable guesses for each.

For example, for a given data source, he retrieved the amount of content produced within a given time window, then extrapolated the total data volume by making reasonable guesses about the unit size of that content, such as the average size of an email or image, or the average data traffic for one hour of video.
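To make the recipe concrete, here is a minimal Python sketch of this kind of Fermi estimate; the helper name and the example inputs are illustrative and not taken from the paper.

```python
# A minimal sketch of the Fermi-estimation recipe described above:
# (content items produced per day) x (assumed bytes per item) x (days),
# expressed in petabytes. The helper name and example inputs are
# illustrative, not taken from the paper.
PB = 1e15  # 1 petabyte = 10^15 bytes


def fermi_estimate_pb(items_per_day: float, avg_item_bytes: float,
                      days: int = 365) -> float:
    """Extrapolate a yearly data volume (in PB) from per-day content counts."""
    return items_per_day * avg_item_bytes * days / PB


# Hypothetical example: 1 million 2 MB photos uploaded per day
print(f"{fermi_estimate_pb(1e6, 2e6):.2f} PB/year")  # ~0.73 PB/year
```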

He estimated the data volumes of Google Search, YouTube, Facebook, and other sources, with the following conclusions:

Google Search: A recent analysis estimated that Google's search index contains between 3 and 50 billion web pages. Taking the average web page size of about 2.15 MB reported by the Web Almanac, the total data size of Google's search engine in 2021 comes to about 62 PB.
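A hedged sketch of the arithmetic: the figure of roughly 29 billion indexed pages below is my assumption (it sits inside the 3-50 billion range quoted above and reproduces the stated total); the 2.15 MB average page size is the Web Almanac figure.

```python
# Sketch of the Google Search estimate. The ~29 billion indexed pages is an
# assumption; the 2.15 MB average page size is the Web Almanac figure.
indexed_pages = 29e9
avg_page_bytes = 2.15e6
print(f"~{indexed_pages * avg_page_bytes / 1e15:.0f} PB")  # ~62 PB
```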

YouTube: According to Backlinko, users uploaded 720,000 hours of video per day to YouTube in 2021. Assuming an average size of 1 GB per hour (standard definition), YouTube's 2021 data volume is about 263 PB.
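A minimal sketch of the same calculation, using only the figures quoted above:

```python
# Sketch of the YouTube estimate, using only the figures quoted above.
hours_uploaded_per_day = 720_000
gb_per_hour = 1            # standard definition, as assumed in the text
days = 365
total_pb = hours_uploaded_per_day * gb_per_hour * days / 1e6  # 1 PB = 1e6 GB
print(f"~{total_pb:.0f} PB")  # ~263 PB
```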

Facebook and Instagram: Domo's Data Never Sleeps 9.0 report estimates that in 2021, 240k and 65k images were uploaded per minute to Facebook and Instagram, respectively. Assuming an average size of 2 MB per image, that comes to roughly 252 PB and 68 PB for the year.
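The same estimate in a few lines of Python, again using only the per-minute counts and the 2 MB average quoted above:

```python
# Sketch of the Facebook and Instagram estimates, using the figures above.
minutes_per_year = 60 * 24 * 365
avg_image_bytes = 2e6  # 2 MB per image
for name, images_per_minute in [("Facebook", 240_000), ("Instagram", 65_000)]:
    pb = images_per_minute * minutes_per_year * avg_image_bytes / 1e15
    print(f"{name}: ~{pb:.0f} PB")  # ~252 PB and ~68 PB
```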

Dropbox: While Dropbox does not itself produce data, it provides cloud storage to host users' content. In 2020 the company announced 100 million new users, of whom 1.17 million were paid subscribers. Assuming the free (2 GB) and paid (2 TB) plans are filled to 75% and 25% of capacity, respectively, these new Dropbox users would need about 733 PB of storage in 2020.
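A sketch of the arithmetic, assuming (as above) that free 2 GB plans are 75% full and paid 2 TB plans are 25% full:

```python
# Sketch of the Dropbox estimate. The 75% / 25% fill rates for free and paid
# plans are the assumptions stated above.
GB, TB, PB = 1e9, 1e12, 1e15
new_users = 100e6
paid_users = 1.17e6
free_users = new_users - paid_users

free_storage = free_users * 2 * GB * 0.75   # free plan: 2 GB, 75% full
paid_storage = paid_users * 2 * TB * 0.25   # paid plan: 2 TB, 25% full
print(f"~{(free_storage + paid_storage) / PB:.0f} PB")  # ~733 PB
```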

Email: According to Statista, approximately 13.1 trillion electronic communications (7.1 trillion legitimate emails and 6.0 trillion spam messages) were sent from October 2020 to September 2021. Assuming average sizes of 75 KB for standard mail and 5 KB for spam, total email traffic comes to around 5.7k PB.

Netflix: Domo estimates that Netflix subscribers consumed 140 million hours of streaming per day in 2021. Assuming 1 GB per hour (standard definition), that is a total of approximately 51.1k PB.
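The corresponding back-of-the-envelope calculation, using the figures quoted above:

```python
# Sketch of the Netflix estimate, using the figures quoted above.
hours_streamed_per_day = 140e6
gb_per_hour = 1            # standard definition, as assumed in the text
days = 365
total_pb = hours_streamed_per_day * gb_per_hour * days / 1e6  # 1 PB = 1e6 GB
print(f"~{total_pb:,.0f} PB")  # ~51,100 PB, i.e. about 51.1k PB
```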

Amazon: As of 2021, Amazon S3 (Simple Storage Service) stores more than 100 trillion objects, according to Jeff Barr, chief evangelist for Amazon Web Services (AWS). Assuming an average object size of 5 MB, the total size of the files stored in S3 is approximately 500 EB.
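And the S3 arithmetic, using the object count and the assumed average object size quoted above:

```python
# Sketch of the Amazon S3 estimate, using the figures quoted above.
objects = 100e12           # >100 trillion objects
avg_object_bytes = 5e6     # assumed average object size of 5 MB
print(f"~{objects * avg_object_bytes / 1e18:.0f} EB")  # ~500 EB
```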

Overall, scientific data is comparable in volume to that of commercial data sources.
