Skip to content

Commit

Permalink
Add README file
Browse files Browse the repository at this point in the history
  • Loading branch information
majawasielewska committed Jan 26, 2024
1 parent d106845 commit 2656989
Showing 1 changed file with 148 additions and 0 deletions.
148 changes: 148 additions & 0 deletions train/exploratory_data_analysis/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,148 @@
### Exploratory data analysis

=========================================

The notebook is designed to analyze and derive insights from 'metadata.csv' file and ultimately on analyzing the basic statistics of several .jpg images.

Data Structure
--------------

The data for model training should be placed in the `/train/data` directory, also referred to as `DATA_DIR` in `consts.py`. The structure of the data directory is as follows:

.. code-block:: none

├── data
│ ├── imdb_crop
│ │ ├── **
│ │ │ ├── **.jpg
│ │ ├── imdb.mat
│ ├── wiki_crop
│ │ ├── **
│ │ │ ├── **.jpg
│ │ ├── wiki.mat

Source:

1. imdb_crop: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar

2. wiki_crop: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/wiki_crop.tar

To use this notebook:

1. Ensure that the data is structured as mentioned in the "Data Structure" section.
2. Run each cell in the notebook sequentially to perform exploratory data analysis.

Most important first steps - preparation for EDA
------------------------------------------------

1. Imports
----------

.. code-block:: python

import os
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from PIL import Image
from tabulate import tabulate

This cell imports various libraries necessary for data analysis and visualization. Key libraries include:

- ``pandas`` and ``numpy`` for data manipulation,
- ``matplotlib`` and ``seaborn`` for data visualization,
- ``statsmodels`` and ``scipy`` for statistical analysis,
- ``PIL`` for image processing,
- ``tabulate`` for presenting tabular data.

2. Setting up Data Path
------------------------

.. code-block:: python

path_to_metadatacsv = os.path.realpath('../data/metadata.csv')
print(path_to_metadatacsv)

This cell calculates the real path of the 'metadata.csv' file, which is presumably the main dataset for analysis. The resolved path is printed for verification.

3. Reading CSV File
-------------------

.. code-block:: python

rows = []
with open(path_to_metadatacsv, 'r') as file:
read_metadatacsv = csv.reader(file)
column_names = next(read_metadatacsv)
for row in read_metadatacsv:
rows.append(row)
print("Column names:", column_names)
print("\nThe first few sample rows:\n")
for row in rows[:10]:
print(row)

In this cell, the notebook:

- Opens and reads the CSV file specified in the previous cell.
- Extracts and prints the column names of the dataset.
- Reads the first few rows of the dataset for a preliminary overview.

4. Loading CSV File into DataFrame
-----------------------------------

.. code-block:: python

df_metadata = pd.read_csv(path_to_metadatacsv)
df_metadata.head()

This cell loads the CSV file (referenced by `path_to_metadatacsv`) into a Pandas DataFrame named `df_metadata`. It then displays the first few rows of the DataFrame using the `head()` method. This is a common practice in data analysis to get a quick glimpse of the dataset structure and contents.

**General overview of Code Cells**
----------------------------------

The notebook includes various code cells for tasks such as data loading, cleaning, analysis, and visualization. Key code cells include:

1. **Imports** - explained above
2. **Data Path Setup** - explained above
3. **Data Loading** - explained above
4. **Preliminary Analysis**
5. **Exploratory Data Analysis**
6. **Image Analysis**

Preliminary Analysis
--------------------

The notebook provides analysis for subsequent data cleansing approaches, including:

- Collection size analysis.
- Checking whether NaN values are present.
- Analysis of outliers, including values impossible in biology.

Exploratory Data Analysis
-------------------------

This section covers:

- Techniques for uncovering patterns, trends, and relationships in the data.
- Statistical summaries and visualizations for understanding the data

Image Analysis
--------------

This section delves into:

- Loading and analyzing image data.
- Extracting basic statistics like format, size, and dimensions from images.
- Visualizing image data.

Conclusion
----------

**The most important thing** after analyzing this data set is that it contains **values that are impossible** for the ages of men and women. In the world of biology this is impossible. We set the upper possible age limit for a human to be 122 years, the lower one to 1, according to the oldest living person.

Article: https://en.wikipedia.or/wiki/List_of_the_verified_oldest_people

0 comments on commit 2656989

Please sign in to comment.