BIG DATA: How much data is produced every day?
Today, with the advancement of technology, we are generating enormous amounts of data. Take an example: have you ever noticed how much data is generated on your mobile phone? Every action, even a single video sent through WhatsApp or any other messenger app, generates data, and we don’t even know the exact count of it!
What is BIG DATA?
Big Data is the term for a collection of datasets so large and complex that it becomes difficult to process them using on-hand database system tools or traditional data processing applications.
How to classify data as Big Data?
1. VOLUME: Volume refers to the unimaginable amounts of information generated every second from social media, cell phones, cars, credit cards, images, video, etc. We currently use distributed systems to store this data across several locations, and it is brought together by a software framework like Hadoop.
If we see big data as a pyramid, volume is the base. The volume of data that companies manage skyrocketed around 2012, when they began collecting more than three million pieces of data every day. “Since then, this volume doubles about every 40 months,” Herencia said.
2. VARIETY: Big Data is generated in multiple varieties. A company can obtain data from many different sources: from in-house devices to smartphone GPS technology or what people are saying on social networks. The importance of these sources of information varies depending on the nature of the business. For example, a mass-market service or product should be more aware of social networks than an industrial business.
These data can have many layers, with different values. As Muñoz explained, “When launching an email marketing campaign, we don’t just want to know how many people opened the email, but more importantly, what these people are like.”
Various types of data (a short sketch follows this list):
→ Structured data: Data in a tabular format with a fixed schema, such as the rows and columns of a relational database.
→ Semi-structured data: Data in formats such as .csv or .json files, which carries tags or delimiters but is not arranged into a strict tabular schema.
→ Unstructured data: Data with no predefined arrangement or structure, such as free text, images, audio, and video.
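To make the distinction concrete, here is a minimal sketch in Python; the file names are purely illustrative and not from the article:

```python
import csv, json

# Structured: tabular rows with a fixed set of columns.
with open("customers.csv", newline="") as f:     # hypothetical file
    rows = list(csv.DictReader(f))               # each row maps column -> value

# Semi-structured: tagged/nested fields, but no rigid table schema.
with open("events.json") as f:                   # hypothetical file
    events = json.load(f)                        # nesting can vary per record

# Unstructured: no inherent arrangement; any structure must be inferred later.
with open("support_emails.txt") as f:            # hypothetical file
    raw_text = f.read()
```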
3. VELOCITY: Velocity plays a major role alongside the other Vs. A key aspect of Big Data is providing data on demand and at a faster pace.
In addition to managing data, companies need the information to flow quickly — as close to real-time as possible. So much so that the MetLife executive stressed that: “Velocity can be more important than volume because it can give us a bigger competitive advantage. Sometimes it’s better to have limited data in real time than lots of data at a low speed.”
As the figure suggests, earlier we used only mainframe systems; when clients became involved we needed servers (the client-server model); then came the internet for file transfer; and then mobile, social media and the cloud to store data and share it with multiple users. More users, more devices and more apps therefore mean far more data.
4. VALUE: V for value sits at the top of the big data pyramid. It refers to the ability to turn a tsunami of data into business value, and it is the issue we most need to concentrate on. What matters is not just the amount of data we store or process, but the amount of valuable, reliable and trustworthy data that needs to be stored, processed and analyzed to find insights.
5. VERACITY: Last but never least, veracity, which in this context is equivalent to quality. We may have all the data, but could we be missing something? Is the data “clean” and accurate? Does it really have something to offer?
Veracity essentially means the degree of reliability the data has to offer. Since a major part of the data is unstructured or irrelevant, Big Data systems need ways to filter it out or translate it, because trustworthy data is crucial for business decisions.
Analysis of Big Data
IBM Big Data Analytics (IBM — Use Cases)
Consider the electricity meters installed in our homes, shown in the figure. A traditional meter sends its readings only once a month, but IBM came up with smart meters that collect a reading every 15 minutes. Whatever energy you consume in each 15-minute interval is sent immediately, and this is where big data is generated: about 96 million reads per day for every million meters. IBM realised this would produce an enormous amount of data, i.e. BIG DATA, so they started analysing it, which made it easier to store and maintain.
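As a quick sanity check on those figures, here is a minimal back-of-the-envelope sketch in Python (the one-million meter count is simply the figure quoted above):

```python
# Back-of-the-envelope check of the smart-meter numbers above.
reads_per_hour = 60 // 15                       # one reading every 15 minutes
reads_per_meter_per_day = reads_per_hour * 24   # = 96 reads per meter per day

meters = 1_000_000                              # the "per million meters" figure
print(f"{reads_per_meter_per_day * meters:,} reads per day")   # 96,000,000
```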
Apache Hadoop: Framework to Process Big Data
Hadoop is a framework that allows us to store and process large datasets in a parallel and distributed way. The Apache Hadoop project develops open-source software for reliable, distributed computing. In other words, Hadoop is software for distributed storage and processing used to solve the problem of big data: a huge amount of data.
Hadoop addresses the two major problems that occur while dealing with big data, storing it and processing it, through two core components:
- HDFS: HDFS stands for Hadoop Distributed File System. Whatever large amount of data we dump into it gets distributed across the connected machines. These inter-connected machines, with the data distributed among them, are together called a Hadoop cluster.
- MapReduce: This is the processing (programming) unit of Hadoop. It allows distributed processing of the data lying in the Hadoop cluster: each connected machine works on its share of the data (the map phase), and the intermediate outputs are then combined into a final result (the reduce phase). A small sketch of this pattern follows the list.
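To make the map and reduce phases concrete, here is a minimal word-count sketch in Python in the spirit of Hadoop Streaming; the local stdin test is illustrative only, not how a production job is submitted:

```python
import sys
from collections import defaultdict

def mapper(lines):
    """Map phase: each machine emits (word, 1) for every word in its split."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: the intermediate (word, 1) pairs are summed per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    # Local stand-in for a cluster run: stdin plays the role of one input split.
    print(reducer(mapper(sys.stdin)))
```

On a real cluster the mapper and reducer run as separate tasks on many machines; locally you could test the same logic with `echo "big data big value" | python wordcount.py`.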
Hadoop: Master/Slave Architecture
Hadoop follows a master/slave architecture for data storage and distributed data processing, using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data with Hadoop MapReduce is the JobTracker. The slave nodes are the other machines in the Hadoop cluster, which store data and perform the complex computations. Every slave node runs a TaskTracker daemon and a DataNode, which synchronize their work with the JobTracker and the NameNode respectively. In a Hadoop implementation, the master and slave systems can be set up in the cloud or on-premise.
What is Distributed Storage?
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
A distributed file system is a client/server-based application that allows clients to access and process data stored across multiple systems. With distributed storage, instead of storing a large file sequentially, you can split it into pieces, sometimes called blocks, and scatter those pieces across many disks; the illustration shows a file split into blocks that are distributed across multiple disks. With the help of distributed storage we can split up the work, which increases velocity and reduces the problem of volume. Here, all the PCs contributing RAM and CPU are called slaves, and the machine at the centre coordinating them is the master. Together this is known as a HADOOP CLUSTER.
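As a rough illustration of how a file maps onto blocks, here is a small sketch; the 128 MB block size and replication factor of 3 are common HDFS defaults used as assumptions, not figures from the article:

```python
import math

BLOCK_SIZE_MB = 128      # common HDFS default block size (assumed here)
REPLICATION = 3          # typical HDFS replication factor (assumed here)

def hdfs_footprint(file_size_mb: float):
    """Number of blocks a file is split into and its raw storage cost."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION   # each block copied to 3 DataNodes
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1024)   # a hypothetical 1 GB file
print(f"1 GB file -> {blocks} blocks, ~{raw:.0f} MB of raw cluster storage")
```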
How much data do MNCs handle?
MNCs like Google, Amazon, Facebook, Twitter, etc. handle their data with Apache Hadoop. They simply split the data among many systems and maintain a proper record of where each piece lives. Now one question arises: which data, and where do they get this amount of data from?
ANSWER: Take a site like Amazon, one of the largest online shopping platforms, where you purchase something, order it, and pay online or by cash on delivery. Before any of this, a user or customer has to register, entering details such as name, email address, mobile number, date of birth, gender, password, etc.; an OTP then arrives at the registered mobile number or email address, and only after that can they log in. After this, the customer searches for what they need and scrolls through many items before ordering, and often many customers just scroll and browse without buying a single item. Because Amazon tracks this behaviour, the next time the customer logs in it shows them their earlier preferences or recommends items similar to those in their wishlist. In this way Amazon learns what kind of products each customer wants to buy; around 87% of the population casually browses it daily, sometimes to check sales in the festive season or a 90% discount offer. Now imagine how many people register their data like this every day on Google, Facebook and other platforms.
Let’s talk about another MNC: Google…
How much data does Google handle?
This is one of those questions whose answer can never be exact. On a lighter note, it is like a child asking how many stars there are up in the sky, which is much the same as asking “how much data does Google handle?”
Commonly a PC holds about 500 GB of storage and a smartphone about 32 GB, though newer PCs and smartphones keep arriving with bigger storage than this. We all know that Google can seemingly answer any kind of question, so we casually conclude that Google knows everything, and everything really means everything! Now you must be wondering how much data Google handles to answer all these questions.
Yes, it holds a whole lot of data to answer any question you ask it, yet Google doesn’t publish numbers on how much data it stores.
Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.
The place where Google stores and handles all its data is the data center. Google doesn’t operate the biggest individual data centers, but it still handles a huge amount of data. A data center normally holds petabytes to exabytes of data.
Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. The average MapReduce job ran across approximately 400 machines in September 2007, and these jobs crunched through approximately 11,000 machine-years in a single month.
Google is also very interested in collecting user data, such as photos, to improve its ad delivery system. Now, can you answer: WHICH IS THE LARGEST ‘BIG DATA’ COMPANY IN THE WORLD?
Unsurprisingly, the answer to this question is Google. Perhaps more surprising are some of the figures behind the company… For example, did you know Google processes 3.5 billion requests per day? Or that Google stores 10 exabytes of data (10 billion gigabytes!)? Facebook, Microsoft, and Amazon all give Google a run for their money; Facebook alone has 2.5 billion pieces of content, 2.7 billion ‘likes’ and 300 million photos — all of which adds up to more than 500 terabytes of data.
Now, what are these new terms, petabytes and exabytes? The largest data size most of us have heard of until now is the terabyte (TB).
1 Petabyte (PB) = 1024 Terabytes (TB)
1 Exabyte (EB) = 1024 Petabytes (PB)
An exabyte can therefore be understood as roughly a million terabytes. From this we can slowly grasp the scale of data involved. Google uses its own data centers and also collaborates with other data centers to store its data; each data center would cover the area of about 20 football fields combined. It’s hard to calculate this amount of data exactly, but with some educated guessing based on the capital expenditure at remote locations, the electricity consumption at each data center and the number of servers they hold, we can conclude that Google holds 10–15 exabytes of data. This equals the data of about 30 million PCs combined. So now, when someone stops you somewhere and asks how much data Google handles, you can boldly answer that Google handles 10–15 exabytes of data.
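Here is a tiny sketch of those conversions in Python, using the binary units and the 500 GB-per-PC figure quoted above (the 15 EB input is just the upper end of the estimate):

```python
TB = 1024 ** 4           # bytes in one terabyte (binary units, as above)
PB = 1024 * TB           # 1 petabyte = 1024 TB
EB = 1024 * PB           # 1 exabyte  = 1024 PB = 1,048,576 TB

estimate_eb = 15                                   # upper end of the 10-15 EB guess
pcs_500gb = estimate_eb * EB // (500 * 1024 ** 3)  # PCs of 500 GB each, as above
print(f"1 EB = {EB // TB:,} TB")
print(f"15 EB holds the data of about {pcs_500gb:,} PCs of 500 GB each")
```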
Google processes its data on standard cluster nodes, each consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. This type of machine costs approximately $2,400 through providers such as Penguin Computing or Dell, or approximately $900 a month through a managed hosting provider such as Verio (for startup comparisons).
Google uses this data to improve its products, like its search engine and Google Maps.
The average MapReduce job runs across a $1 million hardware cluster, not including bandwidth fees, data center costs, or staffing.
The January 2008 MapReduce paper provides new insights into Google’s hardware and software for processing tens of petabytes of data per day. Google converted its search indexing systems to MapReduce in 2003, processing over 20 terabytes of raw web data. It’s fascinating large-scale processing that makes your head spin and makes you appreciate the years of distributed-computing fine-tuning applied to today’s large problems.
HOW MUCH DATA IS PRODUCED EVERY DAY?
The amount of data is growing exponentially. Today, our best estimates suggest that at least 2.5 quintillion bytes of data is produced every day (that’s 2.5 followed by a staggering 18 zeros!). As the infographic points out, that’s everything from data collected by the Curiosity Rover on Mars, to your Facebook photos from your latest vacation.
How does BIG DATA affect our daily lives?
Big Data is very useful in our daily lives. Earlier we had telephones on which people could only call and talk, with no sharing of images or video; as technology advanced we arrived at smartphones, where everything is possible. People still drive automobiles manually, but in our AI (Artificial Intelligence) world, self-driving vehicles are being developed and are working successfully. Some other examples are shown in the figure below:
CONCLUSION:
Big data isn’t just an important part of the future, it may be the future itself. The future of big data analytics promises to change the way businesses operate in finance, healthcare, manufacturing, and other industries.
The overwhelming size of big data may create additional challenges in the future, including data privacy and security risks, shortage of data professionals, and difficulties in data storage and processing.
However, most experts agree that big data will mean big value. It will give rise to new job categories and even entire departments responsible for data management in large organizations. New regulatory structures and standards of conduct will emerge, as companies continue to use consumers’ personal data. Also, most companies will shift from being data-generating to data-powered, making use of actionable data and business insights.
Hope you found this post informative and useful. Please do not hesitate to clap for it, and feel free to share it and follow me for more content.
THANK YOU:-)