Report Seminar 6741
On
2023-2024
HYDERABAD INSTITUTE OF TECHNOLOGY AND
MANAGEMENT
CERTIFICATE
This is to certify that the Technical Seminar entitled "Data Science Tools and Libraries"
is being submitted by Sagina Vijay bearing hall ticket number 20E51A6741 in partial
fulfillment of the requirements for the degree BACHELOR OF TECHNOLOGY in
COMPUTER SCIENCE AND ENGINEERING by the Jawaharlal Nehru Technological
University, Hyderabad, during the academic year 2023-2024. The matter contained in this
document has not been submitted to any other University or institute for the award of any
degree or diploma.
HYDERABAD INSTITUTE OF TECHNOLOGY AND
MANAGEMENT
(UGC Autonomous, Affiliated to JNTUH, Accredited by NAAC (A+) and NBA)
DECLARATION
ACKNOWLEDGEMENT
An endeavor of a long period can be successful only with the advice of many
well-wishers.
We would like to thank our chairman, SRI. ARUTLA PRASHANTH, for
providing all the facilities to carry out the Technical Seminar successfully.
We would like to thank our Principal, DR. P. RAJESH KUMAR, who has inspired us
greatly through their speeches and provided this opportunity to carry out our
Technical Seminar successfully.
We are very thankful to our Head of the Department, DR. Ila chandana Kumari,
and our B-Tech Technical Seminar Coordinator, Dr. P. Madhuri. We would like to
specially thank our internal supervisor, Dr. P. Madhuri, ASSOCIATE PROFESSOR,
for technical guidance. We wish to convey our gratitude and express sincere
thanks to all D.C (DEPARTMENTAL COMMITTEE) and T.R.C (TECHNICAL
REVIEW COMMITTEE) members and non-teaching staff for their support and
cooperation rendered for the successful submission of our Technical Seminar.
We also want to express our sincere gratitude to all our family members and
friends for their individual care and everlasting moral support.
TABLE OF CONTENTS
1. CHAPTER - 01
● INTRODUCTION
2. CHAPTER - 02
● DATA SCIENCE TOOLS AND LIBRARIES
3. CHAPTER - 03
● DATA SCIENCE TOOLS
3.1.1 Programming Languages
3.1.2 Integrated Development Environments
3.1.3 Data Visualization Tools
3.1.4 Notebook Sharing and Collaboration
4. CHAPTER - 04
● DATA SCIENCE LIBRARIES
4.1 Introduction to Libraries
4.2 Advanced tools and libraries
5. CONCLUSION
6. REFERENCES
LIST OF FIGURES
Sl.no CAPTION
ABSTRACT
Data science tools and libraries are essential resources that empower data professionals to
work efficiently with data, extract insights, and build predictive models. They encompass
a variety of software solutions for different stages of the data science workflow, from data
acquisition and preprocessing to analysis, visualization, and model deployment. These
tools abstract away complex underlying processes, enabling data scientists to focus on
extracting meaningful information from data. The tools and libraries covered are broken
down as follows:
Tools                                   Libraries
Programming Languages                   Pandas
Integrated Development Environments     NumPy
Data Warehousing and Processing         Scikit-Learn
Data Visualization Tools                PyTorch
Notebook Sharing and Collaboration      Matplotlib
1. INTRODUCTION
In modern society, technology has emerged and evolved as a robust tool for solving
present-day problems and challenges. Computers, which were initially used as calculating
devices for mathematics, have become compatible with other machines and now support a
wide range of operations across many kinds of applications. This computing revolution
drove rapid growth in every industry through better performance, quick improvements,
and new ways of overcoming challenges. It also produced several sub-fields of computer
science: data science, which uses statistics, probability, and related methods to analyze
data and extract insights; machine learning, which supports exploratory data analysis and
builds models by training on data; artificial intelligence (AI), which is used to build
intelligent systems; and deep learning, which uses multiple layers in a network to make
predictions. These technologies have become essential in the technology industry for
solving ever more challenging problems. The last decade witnessed an extraordinary
increase in the amount of stored data in every industry, including healthcare, automotive,
manufacturing, finance, and food processing. With it came a desire to use this data to
build new products, improve existing ones, and enhance customer experience in each
field. Handling such amounts of data requires mathematical tools such as statistics,
calculus, and probability, which play a prominent role in understanding, interpreting,
and converting data into information. It also requires a good programming language that
is powerful and versatile enough to implement the methods needed for data science
applications, while remaining simple to use and popular among developers. Python is a
high-level, general-purpose programming language with built-in data types such as lists,
tuples, and dictionaries. Python source code is compiled to bytecode automatically,
without a separate compilation step. In recent years, mathematical libraries such as
NumPy, Pandas, SciPy, and Scikit-learn have made Python ready for machine learning
and deep learning.
Data Extraction:
Data extraction is the process of obtaining data from a database or SaaS platform so
that it can be replicated to a destination, such as a data warehouse, designed to
support online analytical processing (OLAP). Data science operations start with
extracting data from the real world; this data can be in any format, shape, or size.
Python provides many libraries for extracting data from the web and from local machines,
such as requests, Beautiful Soup, Scrapy, and pypdf. You can also extract data from SQL
files and databases using the Pandas library, either by opening a database or by
running an SQL query.
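As a minimal sketch of the SQL route, the example below loads a query result into a Pandas DataFrame. It assumes an in-memory SQLite database; the table name `sales` and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database so the sketch needs no external server.
# The table name "sales" and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 45.0)],
)
conn.commit()

# Run an SQL query and load the result directly into a DataFrame.
df = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region",
    conn,
)
conn.close()
print(df)
```

The same `pd.read_sql_query` call should work with other database connections that Pandas supports; only the connection object changes.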
Data Processing:
This operation entails the steps that transform raw data into usable information.
Missing values, corrupted values, time zone differences, and date range issues are all
crucial checks to make during this procedure. Python provides the NumPy and Pandas
libraries for data processing, which is also known as data cleaning. Raw data is the
unprocessed input, and processing converts it into a form that a computer can work
with.
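The checks listed above can be sketched with Pandas as follows. The readings, the column names, and the choice of the `Asia/Kolkata` time zone are assumptions made for illustration, and filling gaps with the column mean is only one of several possible strategies.

```python
import pandas as pd

# Hypothetical raw readings: one missing value, one corrupted value,
# and timestamps recorded without a time zone.
raw = pd.DataFrame(
    {
        "timestamp": [
            "2024-01-01 09:00",
            "2024-01-01 10:00",
            "2024-01-01 11:00",
            "2024-01-01 12:00",
        ],
        "reading": ["21.5", None, "error", "23.0"],
    }
)

clean = raw.copy()
# Corrupted strings become NaN instead of raising an error.
clean["reading"] = pd.to_numeric(clean["reading"], errors="coerce")
# Fill missing values with the column mean.
clean["reading"] = clean["reading"].fillna(clean["reading"].mean())
# Parse timestamps, mark them as UTC, then convert to a local time zone.
clean["timestamp"] = (
    pd.to_datetime(clean["timestamp"])
    .dt.tz_localize("UTC")
    .dt.tz_convert("Asia/Kolkata")
)
# Keep only rows inside the expected date range.
start = pd.Timestamp("2024-01-01", tz="Asia/Kolkata")
end = pd.Timestamp("2024-01-02", tz="Asia/Kolkata")
clean = clean[(clean["timestamp"] >= start) & (clean["timestamp"] < end)]
print(clean)
```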
Data Modelling:
After data analysis, many machine learning algorithms are available for building a
model from the data. The design of these models relies heavily on statistics and
probability. Python provides the Scikit-learn library, which has built-in methods for
machine learning models such as linear regression and logistic regression, covering
supervised and unsupervised learning.
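A minimal sketch of a supervised model in Scikit-learn is shown below. The data is synthetic and follows y = 3x + 2 exactly, so it illustrates the API rather than a realistic modelling task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3x + 2 exactly (values are illustrative).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

# Fit the model: learn the slope and intercept from the data.
model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
# Predict the target for an unseen input.
prediction = model.predict(np.array([[5.0]]))[0]
print(slope, intercept, prediction)
```

Other estimators, such as `LogisticRegression`, follow the same `fit` and `predict` pattern.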
Scientific Computations:
For scientific computations by researchers, students, and scientists, Python provides
a library called SciPy, which has methods for many mathematical and scientific
operations.
2.2 Which tool is most used for Data Science?
Python:
Python is the most widely used data science programming language and is also
considered a data science tool. It helps data science professionals analyze large
datasets and data of different kinds. Because data science involves a large number of
integrated platforms and environments, it requires a language with a clean, basic
syntax that is also flexible and robust at integration. Python satisfies all these
requirements and is also easy to learn.
Let's discuss some important characteristics of Python.
Integration:
Python is well known for its ability to integrate with other languages and
systems. It can be used alongside programming languages such as C, C++, and Java,
with technologies such as CORBA and TensorFlow, and with a wide range of computer
science and machine learning tools, including Google Cloud ML Engine and Amazon
Machine Learning. Python not only interacts with platforms and programming
language interfaces, but also has a library stack that demonstrates the strength of
its integration capabilities.
Ease of Use:
Python is simple to use because its syntax reads much like plain English rather
than relying on complicated syntax rules, which makes the language easy to learn.
Downloading and installing Python is also simple.
OOPS:
Object-oriented programming (OOP) in Python is a programming paradigm that
uses objects and classes. It aims to model real-world entities in code through
concepts such as inheritance, polymorphism, and encapsulation. The main idea of OOP
is to bind data and the functions that operate on it into a single unit, so that other
parts of the code access the data only through that unit.
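The three concepts named above can be sketched in a few lines. The class names `Account` and `SavingsAccount` are hypothetical. Note that Python marks internal data by convention (a leading underscore) rather than enforcing access restrictions.

```python
class Account:
    """Bundles a balance with the methods allowed to change it (encapsulation)."""

    def __init__(self, owner, balance=0.0):
        self.owner = owner
        self._balance = balance  # leading underscore: internal by convention

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount

    @property
    def balance(self):
        # Read-only view of the internal balance.
        return self._balance

    def describe(self):
        return f"{self.owner}: {self._balance:.2f}"


class SavingsAccount(Account):
    """Inherits from Account and overrides describe() (polymorphism)."""

    def __init__(self, owner, balance=0.0, rate=0.04):
        super().__init__(owner, balance)
        self.rate = rate

    def describe(self):
        return f"{self.owner}: {self._balance:.2f} at {self.rate:.0%}"


accounts = [Account("Asha", 100.0), SavingsAccount("Ravi", 200.0)]
accounts[0].deposit(50.0)
# The same call dispatches to a different method depending on the class.
descriptions = [a.describe() for a in accounts]
print(descriptions)
```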
Python's Built-in Data Structures: Python has a variety of mutable and immutable
data structures: lists, sets, and dictionaries are mutable, while strings and tuples
are immutable. We can easily organize data and perform operations on it using
these data structures.
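A short sketch of which built-in types can be changed in place (lists, sets, dictionaries) and which cannot (tuples, strings). The variable names and values are illustrative only.

```python
numbers = [1, 2, 3]          # list: mutable
numbers.append(4)

tags = {"a", "b"}            # set: mutable
tags.add("c")

ages = {"asha": 30}          # dictionary: mutable
ages["ravi"] = 25

point = (1, 2)               # tuple: immutable
try:
    point[0] = 99            # item assignment is not allowed on tuples
    tuple_changed = True
except TypeError:
    tuple_changed = False

name = "python"              # string: immutable
upper_name = name.upper()    # returns a new string; name is unchanged

print(numbers, tags, ages, tuple_changed, name, upper_name)
```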
Compilation:
Python is generally called an interpreted language; however, it combines compiling
and interpreting. When we execute source code, Python first compiles it into
bytecode. Bytecode is a low-level, platform-independent representation of the
source code. It is not binary machine code and cannot be run directly by the target
machine. Instead, bytecode is a set of instructions for the Python Virtual Machine
(PVM), which executes it. Bytecode is thus a compact, platform-independent
intermediate form.
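The compile-then-interpret sequence can be observed with the standard library. This is a small sketch using the built-in `compile` function and the `dis` module. The source string is made up for illustration, and the exact instruction names vary between Python versions.

```python
import dis

source = "total = price * quantity"

# compile() turns source text into a code object holding bytecode.
code_obj = compile(source, "<example>", "exec")
raw_bytecode = code_obj.co_code  # bytes, not native machine code

# dis lists the virtual machine instructions (names vary by Python version).
instruction_names = [ins.opname for ins in dis.get_instructions(code_obj)]
print(instruction_names)

# The interpreter (virtual machine) executes the code object.
namespace = {"price": 4, "quantity": 3}
exec(code_obj, namespace)
print(namespace["total"])
```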
There are several benefits associated with data science tools and libraries:
▪ Automation:
Data science tools and libraries can automate many of the repetitive and time-consuming
tasks involved in data science, such as data cleaning, preparation, and
analysis. This frees up data scientists to focus on more strategic and creative work.
▪ Accuracy and reliability:
Data science tools and libraries are typically well-tested and maintained, which
helps to ensure that the results of data analysis are accurate and reliable.
▪ Reproducibility:
Data science tools and libraries make it easier to reproduce data science workflows,
which is essential for scientific research and for ensuring that data science models are
used responsibly.
▪ Collaboration:
Data science tools and libraries are often open source, which makes it easy for data
scientists to share their work and collaborate.
▪ Reduced costs:
Data science tools and libraries can help to reduce the costs associated with data
science projects by automating tasks and reducing the need for custom development.
Data science tools are used for diving into raw and complicated data (structured or
unstructured) and processing, extracting, and analyzing it to uncover valuable insights
by applying techniques from statistics, computer science, predictive modeling and
analysis, and deep learning.
Fig3: Data scientist tools
3.2 Integrated Development Environments (IDEs):
IDEs provide an interactive environment for writing, executing, and documenting
code. Jupyter Notebook and RStudio are popular choices that allow code,
visualizations, and explanatory text to be combined in a single document.
4. DATA SCIENCE LIBRARIES
4.1 Introduction to Libraries:
4.2 PANDAS:
Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and
manipulation tool, built on top of the Python programming language.
Pandas provides a fast and efficient DataFrame object for data manipulation with
integrated indexing. Its features include:
Reading and writing data: Pandas has tools for moving data between in-memory
data structures and different formats, including CSV and text files, Microsoft Excel,
SQL databases, and the fast HDF5 format.
Data alignment and missing data: Intelligent data alignment and integrated
handling of missing data, with automatic label-based alignment in computations,
make it easy to turn disordered data into a structured form.
Reshaping: Pandas supports flexible reshaping and pivoting of data sets.
Fast and efficient data structures: Pandas uses high-performance data
structures, such as NumPy arrays, to store and manipulate data efficiently. This
makes Pandas well-suited for working with large datasets.
Powerful data manipulation tools: Pandas provides a number of powerful
tools for data manipulation, such as filtering, sorting, grouping, and aggregation.
These tools make it easy to clean, prepare, and analyze data.
Flexible data analysis tools: Pandas also provides flexible data analysis tools,
such as statistical functions and time series analysis, and its DataFrames can be
passed to machine learning libraries such as Scikit-learn. These tools make it easy
to perform complex data analysis tasks.
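The manipulation tools named above (filtering, sorting, grouping, and aggregation) can be sketched on a small DataFrame. The order data below is hypothetical.

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame(
    {
        "city": ["Hyderabad", "Chennai", "Hyderabad", "Pune", "Chennai"],
        "amount": [250, 120, 400, 90, 310],
    }
)

# Filtering: keep orders of at least 100.
large = orders[orders["amount"] >= 100]

# Sorting: largest orders first.
ranked = large.sort_values("amount", ascending=False)

# Grouping and aggregation: total and mean amount per city.
summary = large.groupby("city")["amount"].agg(["sum", "mean"])

print(ranked)
print(summary)
```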
4.3 Matplotlib:
Matplotlib is a comprehensive Python library for building static, animated, and
interactive visualizations. It makes easy things easy and hard things possible.
Matplotlib is a low-level graph plotting library written in Python that serves as a
tool for visualizing data.
Matplotlib is open source, so we can use it freely. It is written primarily in Python,
with a few parts written in C, Objective-C, and JavaScript for platform
compatibility.
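Example:
    import matplotlib.pyplot as plt
    import numpy as np
    # Sample points over one period of the sine wave (range assumed for illustration)
    x = np.linspace(0, 2 * np.pi, 200)
    y = np.sin(x)
    # Create a figure with a single set of axes and plot y against x
    fig, ax = plt.subplots()
    ax.plot(x, y)
    plt.show()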
Fig4.3: Matplotlib Plot Result
4.4 SciPy:
SciPy (Scientific Python) is a library for scientific computing that works with
N-dimensional arrays. It is built on top of NumPy and provides numerous methods for
scientific computations such as optimization, linear programming, and distance
calculation.
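The three kinds of computation mentioned above can be sketched with `scipy.optimize` and `scipy.spatial.distance`. The objective function, constraints, and points are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog, minimize
from scipy.spatial.distance import euclidean

# Optimization: minimize f(x) = (x - 3)^2, whose minimum is at x = 3.
result = minimize(lambda x: (x[0] - 3.0) ** 2, x0=np.array([0.0]))

# Linear programming: maximize x + 2y subject to x + y <= 4, x <= 3,
# and x, y >= 0. linprog minimizes, so the objective is negated.
lp = linprog(
    c=[-1.0, -2.0],
    A_ub=[[1.0, 1.0], [1.0, 0.0]],
    b_ub=[4.0, 3.0],
    bounds=[(0, None), (0, None)],
)

# Distance between two points in the plane.
d = euclidean([0.0, 0.0], [3.0, 4.0])

print(result.x, lp.x, -lp.fun, d)
```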
Apache Spark:
Apache Spark is a powerful framework for distributed data processing and analytics.
Hadoop:
Hadoop is a distributed storage and processing framework for big data.
scikit-learn-extensions:
This library provides additional functionality on top of Scikit-Learn, offering tools for
feature engineering, preprocessing, and model evaluation.
DEEP LEARNING
▪ Many Python packages, modules, and libraries are available for artificial intelligence.
One such library for neural networks is neurolab. Single-layer and multi-layer neural
networks are among its primary functionalities. It is used together with the
NumPy, SciPy, and Matplotlib libraries.
Convolutional Neural Networks (CNNs):
▪ CNNs are a class of deep neural networks primarily used for image analysis and
computer vision tasks. They are designed to automatically learn hierarchical features
from images by using convolutional layers that apply filters to capture spatial
patterns.
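To illustrate how a convolutional layer applies a filter to capture a spatial pattern, here is a minimal NumPy sketch. It assumes a single channel, no padding, and a stride of 1. The filter is hand-picked for the example, whereas a real CNN learns its filter values during training.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide kernel over image with no padding and stride 1.

    Like most deep learning libraries, this computes cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window by the kernel element-wise and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 4x4 image: dark left half, bright right half.
image = np.array(
    [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

# Hand-picked filter that responds to a dark-to-bright vertical edge.
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the edge lies, which is the sense in which a filter captures a spatial pattern.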
5. CONCLUSION
In this paper we have discussed the characteristics of the Python
programming language and the reasons Python has become the most
popular language. We also discussed various Python libraries and their
functionality for developing data science applications and analysis. We
discussed the disadvantages of using Python in data science projects
and the improvements required to meet the future needs of the industry. We also
discussed deep learning and artificial neural networks and the Python
libraries that support them.
Machine learning is a rapidly growing area, and its sub-branches such as deep
learning and neural networks are headed towards new innovations and
advancements. Every technology needs to evolve to meet future
machine learning needs, either by advancing existing systems or by
identifying their limitations and improving on them. Many other
technologies are still in development and are working towards greater
computational speed, flexibility, and robustness. But today Python
libraries are more popular in the data science industry for their dynamic
usage and functionality.
6. REFERENCES
1. python-oops-concepts @ https://www.javatpoint.com/
2. www.Tutorialspoint.com
3. Top-Python-Libraries-for-Data-Science
4. https://www.w3schools.com/
5. pypi.org
6. https://www.w3schools.com/python/matplotlib_pyplot.asp
7. Jupyter.Org
8. www.spyder-ide.org
9. Matthew Mayo, KDnuggets, November 2, 2020, in Automated
Machine Learning, AutoML, Data Exploration, Data Processing, Data
Science, Data Visualization, Explainability, Machine Learning, Python.
https://www.researchgate.net/publication/347444225_Python_And_Its_libraries_in_Data_Science_and_Related_fields