Posted on

Understanding ETL development with a case study on terrorist activity and its connection to country demographics

ETL (Extract, Transform, Load) development is an essential process in data management. It enables organizations to collect, transform, and transfer data from various sources into a central location, usually a data warehouse or database, for further analysis and reporting. In this post, we will explore a case study on ETL development, where we examine the connection between terrorist activity and country demographics using large datasets.

Data Sources: For this case study, we used two CSV datasets; one containing information on terrorist activity from 1972 to 2022 and another containing information on country demographics, economic status, and military expenditures. We used web scraping to supplement the project from the New York Times API.

Data Extraction: To extract the data from the sources, we used Python and its Pandas library to read the CSV files. Web scraping was done using the BeautifulSoup library. The extracted data was then cleaned and preprocessed to ensure that it was in the correct format and ready for analysis.

Data Transformation: The extracted data was then transformed to make it usable for analysis. This involved cleaning and preprocessing the data, removing duplicate values, filling in missing values, and transforming the data into the correct format. I mainly used the Pandas library to perform these tasks.

Data Loading: The transformed data was then loaded into a MongoDB database, which is a document-oriented NoSQL database. We used PyMongo, a Python driver for MongoDB, to connect to the database and load the data.

Data Analysis: We used Python’s libraries such as Pandas, Plotly, PyMongo, and Scikit-learn to analyze the data through a machine learning model and answer the following questions:

  • Which organizations are responsible for the most deaths and injuries from 1972-2022?
  • Which states/regions of a country had the most terrorist attacks from 1972-2022?
  • What are the most common categories of terrorism within each region from 1972-2022?
  • On what particular decade where the most terroist acts committed across the world?
  • What is the demographic breakdown (gender ratio and percentage of each age range in a population) of each state where terrorism was committed?

To showcase the insights obtained from the analysis, we built a web-based dashboard using Flask, a Python web framework. The dashboard contains interactive visualizations including graphs, tables, a map, and web-scraped articles.

Conclusion: This case study demonstrates how ETL development can be used to analyze large datasets and answer complex questions. It also highlights the importance of the different steps involved in ETL development and the tools and technologies used to perform each step. With the increasing availability of data, ETL development will continue to play a critical role in data management and decision-making.

Github Repo: https://github.com/namin1993/Capstone_Project