Big Data Generation and How to Handle It

Over 2.3 quintillion bytes of data are created every single day, and it’s only going to grow from there. In 2020, it’s estimated that more than 1.7MB of data will be created every second for every person on Earth!
What is Big Data?
Big data is a field concerned with ways to analyze and systematically extract information from data sets that are too large or complex to be handled by traditional data-processing methods of collection and manipulation.

Big data refers to datasets that are not only big, but also high in variety and velocity, which makes them difficult to handle using traditional tools and techniques. Due to the rapid growth of such data, solutions need to be studied and provided in order to handle and extract value and knowledge from these datasets.
Big Data includes two types of data:
- Structured Data
- Unstructured Data
Structured Data is the type of data that usually resides in relational databases (RDBMS). Fields store length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length, like names, are contained in records, making them a simple matter to search.
Data may be human- or machine-generated, as long as it is created within an RDBMS structure. This format is eminently searchable, both with human-generated queries and via algorithms that use the data types and field names, such as alphabetic or numeric, currency or date.
Structured Query Language (SQL) enables queries on this type of structured data within relational databases.
Some relational databases do store or point to unstructured data, as in customer relationship management (CRM) applications. The integration can be awkward at best, since memo fields do not lend themselves to traditional database queries. Still, most CRM data is structured.
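To make the contrast concrete, here is a minimal sketch of querying structured data with SQL from Python, using the built-in sqlite3 module and an in-memory database. The table and column names are purely illustrative, not taken from any particular system.

```python
# Minimal sketch: structured data in a relational table, queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (name TEXT, phone TEXT, zip_code TEXT)")
conn.execute("INSERT INTO customers VALUES ('Alice', '555-0100', '94105')")
conn.execute("INSERT INTO customers VALUES ('Bob',   '555-0199', '10001')")

# Because every record follows the same schema, searching is a simple query.
for row in conn.execute(
    "SELECT name, phone FROM customers WHERE zip_code = '94105'"
):
    print(row)  # ('Alice', '555-0100')
```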
Unstructured Data is essentially everything else. Unstructured data has internal structure but is not organized via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It may also be stored in a non-relational (NoSQL) database.
It includes:
- Text Files
- Email
- Social Media
- Websites
- Mobile Data & Media
- Satellite Imagery
- Scientific Data
- Sensor Data
- Surveillance Data
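For contrast with the SQL example above, here is a small, self-contained sketch of what schema-less records look like. In practice such documents might live in a NoSQL document store; plain JSON via Python's json module is enough to show that each record can have a different shape, which is what makes this data hard to query with traditional tools.

```python
# Schema-less "documents": no fixed set of fields, unlike rows in an RDBMS.
import json

documents = [
    {"type": "email", "from": "a@example.com", "body": "Quarterly numbers attached"},
    {"type": "tweet", "user": "@sensor_fan", "text": "Love this new IoT gadget!"},
    {"type": "sensor", "device_id": 42, "readings": [21.5, 21.7, 22.0]},
]

# Each document can be serialized and stored as-is, with no predefined schema.
for doc in documents:
    print(json.dumps(doc))
```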
What are the characteristics of Big Data?
Big Data is commonly described by four V's:
1. Velocity: the speed at which data arrives and must be stored and processed. In practice this is an Input/Output (I/O) problem: the rate at which data can be written to and read from disk, which becomes harder to sustain as the data grows.
2. Variety: the nature and type of the data. Traditional RDBMSs handle structured data efficiently and effectively, but much of today's data is unstructured.
3. Volume: the quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data at all.
4. Veracity: a later addition to the original three V's, referring to the quality of the data and hence its value.
What are the impacts of Big Data on Industry?

All the big giants of the IT industry, such as Google, Facebook, and Microsoft, provide services that are used globally. They hold huge amounts of data in their data centers and need to manage and manipulate that data efficiently and effectively.
Big data has increased the demand of information management specialists so much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.
Facebook has revealed some big stats on big data to reporters at its headquarters: its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half hour.
A data center normally holds petabytes to exabytes of data. Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.
Hence, to solve these I/O problems in Big Data, we use a master-slave topology and a distributed storage cluster.
What is Distributed Storage?
A distributed data store is a computer network in which information is stored on more than one node, often in a replicated fashion. The term usually refers either to a distributed database, where users store information on a number of nodes, or to a computer network in which users store information on a number of peer network nodes. The nodes to which the storage is distributed are known as slave nodes (data nodes), and the node from which the storage is distributed to the slave nodes is known as the master node (name node).
The slave nodes and the master node collectively form an infrastructure known as a cluster; in the big data world, this is known as a distributed storage cluster.
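As a rough illustration (not real HDFS code), the toy sketch below shows the idea: a master splits a file into fixed-size blocks and assigns replicated copies of each block to slave (data) nodes. The block size, replication factor, and node names are made up for the example.

```python
# Toy sketch of block placement in a distributed storage cluster.
from itertools import cycle

DATA_NODES = ["datanode-1", "datanode-2", "datanode-3"]
BLOCK_SIZE = 4     # bytes, tiny on purpose; HDFS defaults to 128 MB
REPLICATION = 2    # how many copies of each block to keep

def place_blocks(data: bytes):
    """Split the data into blocks and assign each block to REPLICATION nodes."""
    nodes = cycle(DATA_NODES)
    placement = {}
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = offset // BLOCK_SIZE
        placement[block_id] = [next(nodes) for _ in range(REPLICATION)]
    return placement

print(place_blocks(b"hello big data world"))
# e.g. {0: ['datanode-1', 'datanode-2'], 1: ['datanode-3', 'datanode-1'], ...}
```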

What is Hadoop?
Apache Hadoop is a software framework for clustered file systems and the handling of big data. It processes big data sets using the MapReduce programming model.
Hadoop is an open-source framework written in Java, and it provides cross-platform support.
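To give a feel for the MapReduce programming model, here is a minimal word-count sketch simulated locally in plain Python. In a real Hadoop job the map and reduce functions would run in parallel across the nodes of the cluster (typically written in Java, or in other languages via Hadoop Streaming).

```python
# Minimal local simulation of MapReduce word count: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for the same key.
    return word, sum(counts)

documents = [
    "big data needs distributed storage",
    "hadoop processes big data with mapreduce",
]

# Shuffle: group the intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)  # e.g. {'big': 2, 'data': 2, 'hadoop': 1, ...}
```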
HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop.
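For a sense of how HDFS is used day to day, here is a hedged sketch that drives the standard `hdfs dfs` shell commands from Python. It assumes a working Hadoop installation with the `hdfs` command on the PATH; the paths and file names are illustrative.

```python
# Sketch: basic HDFS operations via the hdfs shell, called from Python.
import subprocess

def hdfs(*args):
    # Run an HDFS shell command and raise an error if it fails.
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/demo")                         # create a directory in HDFS
hdfs("-put", "-f", "local_data.csv", "/user/demo/")        # copy a local file into the cluster
hdfs("-setrep", "-w", "3", "/user/demo/local_data.csv")    # keep 3 replicas for fault tolerance
hdfs("-ls", "/user/demo")                                  # list the directory
```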

No doubt, this is the topmost big data tool. In fact, over half of the Fortune 50 companies use Hadoop. Some of the big names include Amazon Web Services, Hortonworks, IBM, Intel, Microsoft, Facebook, etc.
Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
- Computing power: Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
- Fault tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically (see the sketch after this list).
- Flexibility: Unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability: You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
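As a purely illustrative sketch of the fault-tolerance point above: because every block is replicated, a read can fall back to another copy when a node fails. None of this is Hadoop code; the node and block names are invented.

```python
# Toy illustration of fault tolerance through replication.
block_replicas = {
    "block-0": ["datanode-1", "datanode-2"],
    "block-1": ["datanode-2", "datanode-3"],
}
failed_nodes = {"datanode-2"}   # pretend this node just crashed

def read_block(block_id):
    # Pick the first replica that lives on a healthy node.
    for node in block_replicas[block_id]:
        if node not in failed_nodes:
            return f"reading {block_id} from {node}"
    raise RuntimeError(f"all replicas of {block_id} are unavailable")

print(read_block("block-0"))   # served by datanode-1
print(read_block("block-1"))   # falls back to datanode-3
```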
Examples of how some MNCs are using Big Data Analytics
1. Facebook
There's a lot of data stored on Facebook, and much of it is its users' own content. That content is the most important asset on the service, and users need to believe it is secure; otherwise they won't share. Getting storage right is critical, and it is helping define how Facebook designs its data centers.
Instead of servers that combine compute, memory, flash storage and HDD storage, Facebook's disaggregated server model splits the various server components across separate racks, allowing it to tune the components for specific services and to use what Qin calls "smarter hardware refreshes" to extend useful life. By separating server resources, mixes of compute, memory and storage from different racks can be combined, for example, to deliver a set of servers that can run Hadoop. As loads and usage change, the balance of components that power a service can be adjusted, keeping inefficiencies to a minimum.
Qin notes that the key to this approach is faster networking, with the latest technologies used to build Facebook's first fabric-based data center in Iowa. The system is designed to work at speeds up to the network card line rate, though it's not yet operating at that speed, as the service doesn't need the bandwidth. Qin expects this approach to extend the life of storage modules, as Facebook can swap out memory and CPU on a different, faster schedule.
2. Amazon
The online retail giant has access to a massive amount of information on its customers; names, addresses, payments, and search histories are all recorded in its data bank. While this data is put to use in advertising algorithms, Amazon also uses it to improve customer relations, an area that many big data users overlook.
Whenever you contact the Amazon help desk with a query, don't be surprised when the agent on the other end already has most of the relevant information about you at hand. This allows for a faster, more efficient customer-service experience that does not involve spelling out your name multiple times.
3. American Express
The American Express Company is using big data to analyze and predict consumer behavior. By looking at historical transactions and incorporating more than 100 variables, the company employs sophisticated predictive models rather than traditional business intelligence based on hindsight.
This allows a more accurate forecast of potential churn and customer loyalty. American Express claims that, in its Australian market, it can predict 24% of accounts that will close within four months.
4. Netflix
The entertainment streaming service has a wealth of data and analytics providing insight into the viewing habits of millions of international customers. Netflix uses this data to commission original programming content that appeals globally, as well as to acquire the rights to films and series box sets that it knows will perform well with specific audiences.
For example, Adam Sandler had proven unpopular in the US and UK markets in recent years, yet Netflix green-lit four new films with the actor in 2015, armed with the knowledge that his previous work had been successful in Latin America.
Within three months of introducing House of Cards, Netflix added 2 million subscribers in the US and 1 million additional subscribers internationally.
This meant that an estimated $72 million was added to the company’s bottom line, nearly paying off its initial investment in the House of Cards show in mere months.
With a 93 percent renewal rate for its shows after the first season, the success of House of Cards isn’t an isolated incident. Other series like Orange Is The New Black, Arrested Development, and The Crown were introduced to acclaim using a similar process that relies on big data.