1.13 How Big Is Big Data?

For computer scientists and data scientists, data is now as important as writing programs

  • According to IBM, approximately 2.5 quintillion bytes (2.5 exabytes) of data are created daily, and 90% of the world’s data was created in the last two years
  • According to IDC, the global data supply will reach 175 zettabytes (equal to 175 trillion gigabytes or 175 billion terabytes) annually by 2025

Megabytes (MB)

  • One megabyte is about one million (actually 2²⁰) bytes
  • Many of the files we use on a daily basis require one or more MBs of storage
    • MP3 audio files—High-quality MP3s range from 1 to 2.4 MB per minute
    • Photos—JPEG format photos taken on a digital camera can require about 8 to 10 MB per photo
    • Video—Smartphone cameras can record video at various resolutions
      • Each minute of video can require many megabytes of storage
      • On one of our iPhones, the Camera settings app reports that 1080p video at 30 frames-per-second (FPS) requires 130 MB/minute and 4K video at 30 FPS requires 350 MB/minute

Gigabytes (GB)

  • One gigabyte is about 1000 megabytes (actually 2³⁰ bytes)
  • A dual-layer DVD can store up to 8.5 GB, which translates to:
    • as much as 141 hours of MP3 audio
    • approximately 1000 photos from a 16-megapixel camera
    • approximately 65 minutes of 1080p video at 30 FPS
    • approximately 24 minutes of 4K video at 30 FPS
  • Highest-capacity Ultra HD Blu-ray discs can store up to 100 GB of video
  • Streaming a 4K movie can use between 7 and 10 GB per hour (highly compressed)
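These equivalents are simple division. A minimal Python sketch using the per-minute rates quoted earlier in this section (1 MB/minute for high-quality MP3, 130 MB/minute for 1080p, 350 MB/minute for 4K); the 8.5 MB photo size is an assumed midpoint of the 8 to 10 MB range:

```python
# Back-of-the-envelope arithmetic: what fits on an 8.5 GB dual-layer DVD,
# using the per-minute storage rates quoted in this section.
DVD_MB = 8.5 * 1000           # 8.5 GB expressed in MB (decimal units)

MP3_MB_PER_MIN = 1            # high-quality MP3, low end of 1-2.4 MB/minute
PHOTO_MB = 8.5                # assumed midpoint of the 8-10 MB JPEG range
HD_MB_PER_MIN = 130           # 1080p video at 30 FPS
UHD_MB_PER_MIN = 350          # 4K video at 30 FPS

print(f'MP3 audio:   {DVD_MB / MP3_MB_PER_MIN / 60:.0f} hours')
print(f'photos:      {DVD_MB / PHOTO_MB:.0f} photos')
print(f'1080p video: {DVD_MB / HD_MB_PER_MIN:.1f} minutes')
print(f'4K video:    {DVD_MB / UHD_MB_PER_MIN:.1f} minutes')
```

The same division, scaled to a 15 TB drive, yields the terabyte-drive equivalents in the next section.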

Terabytes (TB)

  • One terabyte is about 1000 gigabytes (actually 2⁴⁰ bytes)
  • Recent disk drives for desktop computers come in sizes up to 15 TB, which is equivalent to
    • approximately 28 years of MP3 audio
    • approximately 1.68 million photos from a 16-megapixel camera
    • approximately 1920 hours of 1080p video at 30 FPS
    • approximately 714 hours of 4K video at 30 FPS
  • Nimbus Data now has the largest solid-state drive (SSD) at 100 TB, which can store 6.67 times the 15-TB examples of audio, photos and video listed above
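The "about 1000, actually a power of 2" caveat in each unit's definition comes from the gap between decimal (SI) sizes and the nearby binary powers of two. A quick Python illustration:

```python
# Decimal (SI) unit sizes vs. the nearby binary powers of two.
units = {'KB': 1, 'MB': 2, 'GB': 3, 'TB': 4}

for name, k in units.items():
    decimal = 1000 ** k      # e.g., 1 MB = 1,000,000 bytes (SI)
    binary = 2 ** (10 * k)   # e.g., 2**20 = 1,048,576 bytes
    pct = (binary - decimal) / decimal * 100
    print(f'{name}: {decimal:>16,} vs {binary:>16,} (+{pct:.1f}%)')
```

The gap widens with each unit: about 2.4% at the kilobyte level but 10% at the terabyte level, which is why drive makers' decimal capacities look smaller once an operating system reports them in binary units.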

Petabytes, Exabytes and Zettabytes

  • There are nearly four billion people online creating about 2.5 quintillion bytes of data each day
    • 2500 petabytes (each petabyte is about 1000 terabytes) or 2.5 exabytes (each exabyte is about 1000 petabytes)
  • According to a March 2016 AnalyticsWeek article, within five years there will be over 50 billion devices connected to the Internet, and by 2020 we’ll be producing 1.7 megabytes of new data every second for every person on the planet
  • At today’s numbers (approximately 7.7 billion people), that’s about
    • 13 petabytes of new data per second
    • 780 petabytes per minute
    • 46,800 petabytes (46.8 exabytes) per hour
    • 1,123 exabytes per day—that’s 1.123 zettabytes (ZB) per day (each zettabyte is about 1000 exabytes)
  • At the video and photo rates above, that’s the equivalent of over 53 billion hours of 4K video every day, or well over 100 trillion photos every day!
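The per-second through per-day figures are straight multiplication. A sketch in Python, assuming 7.7 billion people each generating 1.7 MB per second and decimal units (1 PB = 10¹⁵ bytes); the bullets above round the per-second value to 13 PB before scaling, so their later figures come out slightly lower:

```python
# Scaling 1.7 MB per person per second up to planetary totals.
MB = 10 ** 6
PB = 10 ** 15
EB = 10 ** 18

people = 7.7e9                       # approximate world population
per_person_per_sec = 1.7 * MB        # new data per person per second

per_sec = people * per_person_per_sec
print(f'per second: {per_sec / PB:.1f} PB')
print(f'per minute: {per_sec * 60 / PB:.0f} PB')
print(f'per hour:   {per_sec * 3600 / EB:.1f} EB')
print(f'per day:    {per_sec * 86400 / EB:.0f} EB')
```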

Additional Big-Data Stats

  • For an entertaining real-time sense of big data, check out https://www.internetlivestats.com, with various statistics, including the numbers so far today of
    • Google searches
    • Tweets
    • Videos viewed on YouTube
    • Photos uploaded on Instagram

Additional Big-Data Stats (cont.)

  • Every hour, YouTube users upload 24,000 hours of video, and almost 1 billion hours of video are watched on YouTube every day
  • Every second, there are 51,773 GB (51.773 TB) of Internet traffic, 7,894 tweets sent, 64,332 Google searches and 72,029 YouTube videos viewed
  • On Facebook each day there are 800 million “likes,” 60 million emojis are sent, and there are over two billion searches of the more than 2.5 trillion Facebook posts since the site’s inception

Additional Big-Data Stats (cont.)

  • In June 2017, Will Marshall, CEO of Planet, said the company has 142 satellites that image the whole planet’s land mass once per day
    • They add one million images and seven TBs of new data each day
    • They’re using machine learning on that data to improve crop yields, see how many ships are in a given port and track deforestation
    • With respect to Amazon deforestation, he said: “Used to be we’d wake up after a few years and there’s a big hole in the Amazon. Now we can literally count every tree on the planet every day.”

Additional Big-Data Stats (cont.)

  • Domo, Inc. has a nice infographic called “Data Never Sleeps 6.0” showing how much data is generated every minute, including:
    • 473,400 tweets sent
    • 2,083,333 Snapchat photos shared
    • 97,222 hours of Netflix video viewed
    • 12,986,111 text messages sent
    • 49,380 Instagram posts
    • 176,220 Skype calls
    • 750,000 Spotify songs streamed
    • 3,877,140 Google searches
    • 4,333,560 YouTube videos watched

Computing Power Over the Years

  • Data is getting more massive and so is the computing power for processing it
  • Performance of today’s processors is measured in terms of FLOPS (floating-point operations per second)
  • In the early to mid-1990s, the fastest supercomputer speeds were measured in gigaflops (10⁹ FLOPS)
  • Late 1990s: Intel produced the first teraflop (10¹² FLOPS) supercomputers
  • Early-to-mid 2000s: Speeds reached hundreds of teraflops
  • 2008: IBM released the first petaflop (10¹⁵ FLOPS) supercomputer
  • Currently, the fastest supercomputer—the IBM Summit, located at the Department of Energy’s (DOE) Oak Ridge National Laboratory (ORNL)—is capable of 122.3 petaflops
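One way to feel these orders of magnitude is to time a fixed workload at each era's speed. A Python sketch, assuming an arbitrary workload of 10¹⁸ floating-point operations:

```python
# Time to complete 10**18 floating-point operations at each era's peak speed.
ops = 10 ** 18

speeds = {                            # peak speeds in FLOPS
    'gigaflop (early 1990s)':  10 ** 9,
    'teraflop (late 1990s)':   10 ** 12,
    'petaflop (2008)':         10 ** 15,
    'Summit (122.3 petaflops)': 122.3 * 10 ** 15,
}

for name, flops in speeds.items():
    seconds = ops / flops
    print(f'{name}: {seconds:,.1f} seconds')
```

At gigaflop speed the job takes a billion seconds (about 32 years); Summit finishes it in roughly 8 seconds.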

Computing Power Over the Years (cont.)

  • Distributed computing can link thousands of personal computers via the Internet to produce even more FLOPS
  • 2016: The Folding@home network—a distributed network in which people volunteer their personal computers’ resources for use in disease research and drug design—was capable of over 100 petaflops
  • Companies like IBM are now working toward supercomputers capable of exaflops (10¹⁸ FLOPS)

Computing Power Over the Years (cont.)

  • Quantum computers now under development theoretically could operate at 18,000,000,000,000,000,000 times the speed of today’s “conventional computers”!
  • In one second, a quantum computer theoretically could do staggeringly more calculations than the total that have been done by all computers since the world’s first computer appeared.
    • Could wreak havoc with blockchain-based cryptocurrencies like Bitcoin
    • Engineers are already rethinking blockchain to prepare for such massive increases in computing power

Computing Power Over the Years (cont.)

  • Computing power’s cost continues to decline, especially with cloud computing
  • People used to ask the question, “How much computing power do I need on my system to deal with my peak processing needs?”
  • That thinking has shifted to “Can I quickly carve out on the cloud what I need temporarily for my most demanding computing chores?”
    • Pay for only what you use to accomplish a given task

Processing the World’s Data Requires Lots of Electricity

  • Data from the world’s Internet-connected devices is exploding, and processing that data requires tremendous amounts of energy.
  • According to a recent article, energy use for processing data in 2015 was growing at 20% per year and consuming approximately three to five percent of the world’s power
    • That total data-processing power consumption could reach 20% by 2025
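The projected jump to 20% is consistent with simple compound growth. A sketch in Python; the 3% starting share in 2015 and the ten-year horizon to 2025 are assumptions taken from the bullets above:

```python
# Compound growth: share of world power used for data processing,
# starting at 3% in 2015 and growing 20% per year through 2025.
share = 0.03
for year in range(2015, 2025):       # ten years of 20% annual growth
    share *= 1.20

print(f'projected 2025 share: {share:.1%}')
```

This prints a share of roughly 18.6%; starting from the 5% end of the stated range, the same growth rate passes 20% around 2023.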

Processing the World’s Data Requires Lots of Electricity (cont.)

  • Another enormous electricity consumer is the blockchain-based cryptocurrency Bitcoin
    • Processing just one Bitcoin transaction uses approximately the same amount of energy as powering the average American home for a week!
    • The energy use comes from the process Bitcoin “miners” use to prove that transaction data is valid

Big-Data Opportunities

  • Big data’s appeal to big business is undeniable given the rapidly accelerating accomplishments
  • Many companies are making significant investments and getting valuable results through technologies in this book, such as big data, machine learning, deep learning and natural-language processing
  • This is forcing competitors to invest as well, rapidly increasing the need for computing professionals with data-science and computer-science experience

1.13.1 Big Data Analytics

  • The term “data analysis” was coined in 1962, though people have been analyzing data using statistics for thousands of years going back to the ancient Egyptians
  • Big data analytics is a more recent phenomenon—the term “big data” was coined around 2000
  • Four of the V’s of big data:
    1. Volume—the amount of data the world is producing is growing exponentially.
    2. Velocity—the speed at which that data is being produced, the speed at which it moves through organizations and the speed at which the data changes are all growing quickly.
    3. Variety—data used to be alphanumeric (that is, consisting of alphabetic characters, digits, punctuation and some special characters)—today it also includes images, audio, video and data from an exploding number of Internet of Things sensors in our homes, businesses, vehicles, cities and more.
    4. Veracity—the validity of the data—is it complete and accurate? Can we trust that data when making crucial decisions? Is it real?

1.13.1 Big Data Analytics (cont.)

  • Most data is now being created digitally in a variety of types, in extraordinary volumes and moving at astonishing velocities
  • Digital data storage has become so vast in capacity, cheap and small that we can now conveniently and economically retain all the digital data we’re creating

1.13.1 Big Data Analytics (cont.)

To get a sense of big data’s scope in industry, government and academia, check out the high-resolution graphic

http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png

1.13.2 Data Science and Big Data Are Making a Difference: Use Cases

  • The data-science field is growing rapidly because it’s producing significant results that are making a difference
  • The following table lists some data-science and big-data use cases
Data-science use cases
anomaly detection
assisting people with disabilities
auto-insurance risk prediction
automated closed captioning
automated image captions
automated investing
autonomous ships
brain mapping
caller identification
cancer diagnosis/treatment
carbon emissions reduction
classifying handwriting
computer vision
credit scoring
crime: predicting locations
crime: predicting recidivism
crime: predictive policing
crime: prevention
CRISPR gene editing
crop-yield improvement
customer churn
customer experience
customer retention
customer satisfaction
customer service
customer service agents
customized diets
cybersecurity
data mining
data visualization
detecting new viruses
diagnosing breast cancer
diagnosing heart disease
diagnostic medicine
disaster-victim identification
drones
dynamic driving routes
dynamic pricing
electronic health records
emotion detection
energy-consumption reduction
facial recognition
fitness tracking
fraud detection
game playing
genomics and healthcare
Geographic Information Systems (GIS)
GPS Systems
health outcome improvement
hospital readmission reduction
human genome sequencing
identity-theft prevention
immunotherapy
insurance pricing
intelligent assistants
Internet of Things (IoT) and medical device monitoring
Internet of Things and weather forecasting
inventory control
language translation
location-based services
loyalty programs
malware detection
mapping
marketing
marketing analytics
music generation
natural-language translation
new pharmaceuticals
opioid abuse prevention
personal assistants
personalized medicine
personalized shopping
phishing elimination
pollution reduction
precision medicine
predicting cancer survival
predicting disease outbreaks
predicting health outcomes
predicting student enrollments
predicting weather-sensitive product sales
predictive analytics
preventative medicine
preventing disease outbreaks
reading sign language
real-estate valuation
recommendation systems
reducing overbooking
ride sharing
risk minimization
robo financial advisors
security enhancements
self-driving cars
sentiment analysis
sharing economy
similarity detection
smart cities
smart homes
smart meters
smart thermostats
smart traffic control
social analytics
social graph analysis
spam detection
spatial data analysis
sports recruiting and coaching
stock market forecasting
student performance assessment
summarizing text
telemedicine
terrorist attack prevention
theft prevention
travel recommendations
trend spotting
visual product search
voice recognition
voice search
weather forecasting

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 1 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.