Big Data Testing
This blog is for techies who want to learn about the Big Data ecosystem, or who want to know which cases we test when doing Big Data Testing. It covers the following topics:
- What is the meaning of Big Data?
- Characteristics of Big Data (10 V’s of Big Data)
- Need of Testing
- How to Perform Big Data Testing?
- Performance Test in Big Data
- Big Data Testing Environment
- Challenges we Face while Testing Big Data
- Tools we use in Big Data Testing
Meaning of Big Data
Big data has become a common term in the software industry because of the explosion in data volumes.
Big Data refers to huge, complex data sets drawn from new data sources. These data sets are so large and complex that traditional data processing software cannot store or process them.
Sources of Big Data – Data collected from
- Sensors
- Devices
- Video/audio
- Networks
- Log files
- Transactional applications
- Web
- Social media
Types of big data are:
- Structured: Organized data in the form of fixed format. Ex: RDBMS
- Semi-Structured: Partially organized data which does not have a fixed format. Ex: XML, JSON
- Unstructured: Data with unknown format or structure is called Unstructured data. Ex: Audio, video files etc.
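The three categories above can be illustrated with a toy classifier. This is a minimal sketch, not how real systems work (real pipelines rely on declared schemas rather than content sniffing), and the CSV heuristic here is deliberately crude:

```python
import json
import xml.etree.ElementTree as ET

def classify(record: str) -> str:
    """Illustrative only: guess whether a text record is structured,
    semi-structured, or unstructured, mirroring the categories above."""
    try:
        json.loads(record)
        return "semi-structured"   # JSON carries tags but no fixed schema
    except ValueError:
        pass
    try:
        ET.fromstring(record)
        return "semi-structured"   # XML, likewise
    except ET.ParseError:
        pass
    if "," in record and "\n" in record:
        return "structured"        # crude stand-in for fixed-format CSV/RDBMS rows
    return "unstructured"          # free text, media, binary blobs

print(classify('{"id": 1, "name": "a"}'))  # semi-structured
```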
Characteristics of Big Data (10 V’s of Big Data)
Need of Big Data Testing
In big data systems, data plays a crucial role. If a big data system is not tested correctly, the business is affected: it becomes difficult to understand an error, its cause, and where it occurred, and finding a solution becomes even harder.
How to Perform Big Data Testing?
The testing process is divided into the following three phases:
- Data Ingestion
- Data Processing
- Validation of the Output
Data Ingestion: In this phase, we load data from various sources into the big data system using extraction tools. Hadoop Distributed File System (HDFS), MongoDB, etc. are examples of storage.
We then test the ingested data for errors and for corrupt or missing records.
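The ingestion check described above can be sketched in a few lines. The `id`/`timestamp`/`value` schema here is a hypothetical example, not a real one from any system:

```python
import csv
import io

REQUIRED = {"id", "timestamp", "value"}  # hypothetical required fields

def validate_ingested(raw_csv: str):
    """Split ingested rows into good and bad: missing fields, empty values,
    and unparseable numbers all count as corrupt/missing data."""
    good, bad = [], []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not REQUIRED.issubset(row) or any(not row[f] for f in REQUIRED):
            bad.append(row)
            continue
        try:
            row["value"] = float(row["value"])  # type check a numeric field
        except ValueError:
            bad.append(row)
            continue
        good.append(row)
    return good, bad

raw = "id,timestamp,value\n1,2024-01-01,3.5\n2,,7\n3,2024-01-02,oops\n"
good, bad = validate_ingested(raw)
print(len(good), len(bad))  # 1 2
```

In practice this kind of check runs over samples or row counts per source, since the full volume is too large to inspect row by row.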
Data Processing: This phase, also known as ‘MapReduce validation’, is where key-value pairs are generated for the data. MapReduce runs across the various nodes, and we check whether the algorithms work as expected.
A data validation step is performed here to check that the generated data matches expectations.
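The idea of MapReduce validation can be shown with the classic word-count job, written in plain Python as a stand-in for a real cluster job. The expected counts act as the oracle the validation compares against:

```python
from collections import Counter
from itertools import chain

def map_phase(line: str):
    """Map step: emit a (word, 1) key-value pair for every word."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data big tests", "data pipelines"]
result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))

# Validation: an independently computed expectation for the same input
expected = {"big": 2, "data": 2, "tests": 1, "pipelines": 1}
assert result == expected
```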
Validation of the Output: The final step of big data testing is to store the output data in HDFS or another storage system (such as a data warehouse).
In this phase, transformation logic and data integrity are verified, and the key-value pairs are validated for accuracy.
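One common way to verify transformation logic and data integrity is reconciliation: recompute the transformation from the source and diff it against what actually landed in storage. The per-key sum below is a hypothetical transformation chosen for illustration:

```python
source_rows = [("eu", 10), ("eu", 5), ("us", 7)]   # sample source records
stored_output = {"eu": 15, "us": 7}                 # what landed in HDFS / warehouse

def reconcile(rows, output):
    """Recompute the transformation (a per-key sum here) from the source
    and return every key whose recomputed value disagrees with storage."""
    recomputed = {}
    for key, value in rows:
        recomputed[key] = recomputed.get(key, 0) + value
    return {
        k: (recomputed.get(k), output.get(k))
        for k in set(recomputed) | set(output)
        if recomputed.get(k) != output.get(k)
    }

print(reconcile(source_rows, stored_output))  # {} -> no mismatches
```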
Performance Testing Approach
In a very short span of time, a big data system processes a huge amount of structured and unstructured data. This can lead to performance issues.
So, in big data, it is essential to run performance tests to avoid such bottlenecks. We focus on the points below while performance testing a big data system:
Data Loading and Throughput: Here we test how quickly data can be consumed from the various data sources and the rate at which data is written into the data store.
Data Processing Speed: The rate at which Map Reduce Jobs are executed is calculated at this stage.
Sub-System Performance: As the system has multiple components, it is necessary to test the components individually.
Parameters for Performance Testing
- Data Storage: How data is stored across the different nodes.
- Commit logs: The maximum size the commit log is allowed to grow to.
- Concurrency: How many threads can perform read and write operations.
- Caching: Tune the cache setting “row cache” and “key cache.”
- Timeouts: Values for connection timeout, query timeout, etc.
- JVM Parameters: Heap size, GC collection algorithms, etc.
- Map-reduce performance: Sorts, merge, etc.
- Message queue: Message rate, size, etc.
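A performance-test run typically fixes these parameters in a baseline configuration so results are reproducible. The sketch below captures the knobs listed above as a Python dict; every value is a hypothetical assumption to be tuned per cluster, not a recommendation:

```python
# Hypothetical baseline for one performance-test run.
# All numbers are illustrative assumptions, not recommended settings.
perf_params = {
    "data_storage": {"replication_factor": 3},
    "commit_log": {"max_size_mb": 512},
    "concurrency": {"read_threads": 32, "write_threads": 32},
    "caching": {"row_cache_mb": 256, "key_cache_mb": 64},
    "timeouts": {"connection_ms": 5000, "query_ms": 10000},
    "jvm": {"heap_gb": 8, "gc_algorithm": "G1GC"},
    "message_queue": {"max_message_kb": 64},
}
```

Recording the full parameter set alongside each run makes it possible to compare results across runs and attribute any change in throughput to a specific knob.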
Big Data Testing Environment
Below are the basic requirements for setting up a big data testing environment:
- Space should be available for storing, processing, and validating huge volumes of data.
- It should have a responsive cluster with distributed nodes and data.
- It should have powerful CPUs and enough memory to keep performance high.
Challenges we Face while Testing Big Data
Testing Tools we use in Big Data Testing
| Process | Tools |
| --- | --- |
| Data Ingestion | Zookeeper, Kafka, Sqoop |
| Data Processing | MapR, Hive, Pig |
| Data Storage | Amazon S3, HDFS |
| Data Migration | Talend, Kettle, CloverDX |