Big Data

Big data is a term used to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with.

Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.

Big data challenges include capturing data, data storage, curation, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source.

Big data was originally associated with three key concepts: volume, variety, and velocity. Other concepts later attributed to big data are veracity (i.e., how much noise is in the data) and value.

Why Big Data?

  • Traditional RDBMS queries aren't sufficient to extract useful information from such huge volumes of data
  • To search it with traditional tools to find out if a particular topic was trending would take so long that the result would be meaningless by the time it was computed
  • Big data approaches address this by storing the data in novel ways that make it more accessible, and by providing methods for performing analysis on it

Challenges

  • Capturing
  • Storing
  • Searching
  • Sharing
  • Analysing
  • Visualization

Big data enabling technologies

  • Apache Hadoop
  • Hadoop Ecosystem
  • HDFS Architecture
  • YARN
  • NoSQL
  • Hive
  • MapReduce
  • Apache Spark
  • ZooKeeper
  • Cassandra
  • HBase
  • Spark Streaming
  • Kafka
  • Spark MLlib
  • GraphX
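
To make a couple of these concrete, below is a minimal word-count sketch in PySpark that follows the MapReduce pattern listed above (map each word to a count, then reduce by key). It assumes a local Spark installation and a hypothetical input file named words.txt.

```python
# Word count with Spark's RDD API, following the MapReduce pattern:
# map each line to words, emit (word, 1) pairs, then reduce by key.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("words.txt")   # hypothetical input file
counts = (
    lines.flatMap(lambda line: line.split())       # map: one record per word
         .map(lambda word: (word, 1))              # map: (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)          # reduce: sum counts per word
)

for word, count in counts.collect():
    print(word, count)

spark.stop()
```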

Steps for Data Platform

  1. Data
  2. Query
  3. Merge
  4. Wrangle
  5. Visualize
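
A minimal sketch of these five steps using pandas; the CSV file, SQLite database, and column names here are hypothetical stand-ins for whatever sources a real platform would pull from.

```python
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd

# 1. Data: load a raw CSV export (hypothetical file)
orders = pd.read_csv("orders.csv")

# 2. Query: pull a reference table from a database (hypothetical schema)
with sqlite3.connect("warehouse.db") as conn:
    customers = pd.read_sql("SELECT customer_id, region FROM customers", conn)

# 3. Merge: join the two sources on a shared key
df = orders.merge(customers, on="customer_id", how="left")

# 4. Wrangle: coerce types, drop bad rows, aggregate
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])
by_region = df.groupby("region")["amount"].sum()

# 5. Visualize: plot the aggregated result
by_region.plot(kind="bar", title="Order amount by region")
plt.show()
```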

Data Wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, "munging" the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.

https://en.wikipedia.org/wiki/Data_wrangling
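
A small Python sketch of the munging steps described above: extract raw records from a source, parse them into a predefined structure, sort them, and deposit the result into a data sink. The pipe-delimited log format and file names are made up for illustration.

```python
import csv
from datetime import datetime

# Extract: raw data from the source, e.g. one "2023-05-01|alice|42.5" per line
with open("raw_events.log") as src:
    raw_lines = src.read().splitlines()

# Munge: parse each line into a predefined structure, then sort by timestamp
records = []
for line in raw_lines:
    ts, user, value = line.split("|")
    records.append({
        "timestamp": datetime.fromisoformat(ts),
        "user": user,
        "value": float(value),
    })
records.sort(key=lambda r: r["timestamp"])

# Deposit: write the cleaned records into a data sink (here, a CSV file)
with open("events_clean.csv", "w", newline="") as sink:
    writer = csv.DictWriter(sink, fieldnames=["timestamp", "user", "value"])
    writer.writeheader()
    writer.writerows(records)
```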

MotherDuck: Big Data is Dead

  • Data sizes may have gotten marginally larger, but hardware has gotten bigger at an even faster rate
  • The era of Big Data is over. It had a good run, but now we can stop worrying about data size and focus on how we’re going to use it to make better decisions
  • MOST PEOPLE DON’T HAVE THAT MUCH DATA
  • THE STORAGE BIAS IN SEPARATION OF STORAGE AND COMPUTE.
  • Instead of "shared nothing" architectures, which are hard to manage in real-world conditions, shared-disk architectures let you grow your storage and your compute independently. The rise of scalable and reasonably fast object storage like S3 and GCS meant that you could relax a lot of the constraints on how you built a database (a minimal sketch of this idea follows this list).
  • WORKLOAD SIZES ARE SMALLER THAN OVERALL DATA SIZES
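
One way to picture the shared-disk point above: compute runs locally while the data sits in object storage, and a query only scans what it needs. Below is a hedged DuckDB sketch against a hypothetical S3 bucket; it assumes the httpfs extension is available and S3 credentials are configured.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # enables reading from S3/GCS object storage
con.execute("LOAD httpfs")

# The workload scans only the columns and files this query needs, which is
# typically far smaller than the total data stored in the bucket.
rows = con.execute("""
    SELECT region, SUM(amount) AS total
    FROM read_parquet('s3://example-bucket/orders/*.parquet')
    GROUP BY region
    ORDER BY total DESC
    LIMIT 10
""").fetchall()

for region, total in rows:
    print(region, total)
```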

Zero ELT could be the death of the Modern Data Stack | by Hugo Lu | May, 2023 | Medium