Minor Report Content
1. INTRODUCTION
1.1 INTRODUCTION TO CRICKET:
Cricket is a sport that originated in England in the 16th century and later spread to
her colonies. The first international game however did not feature England but
was played between Canada and the United States in 1844 at the grounds of the
St. George’s Cricket Club in New York. In time, in both of these countries,
cricket took a back seat to other, faster sports like ice-hockey, basketball, and
baseball.
Figure 1.1: A cricket field showing the location of the pitch and some possible
fielding positions for players.
Cricket is played on an oval-shaped playing field and, apart from baseball, is the
only major international sport that does not define an exact size for the playing
field. The main action takes place on a rectangular 22-yard area called the pitch
in the middle of the large playing field. A diagram showing a cricket field is
given in Figure 1.1 and the pitch is magnified in Figure 1.2.
Cricket is a game played between two teams of 11 players each, where the two
teams alternate scoring (batting) and defending (fielding). A player (bowler) from
the fielding team delivers a ball to a player (batsman) from the batting team, who
should strike it with a bat in order to score while the rest of the fielding team
(fielders) defend the scoring.
Furthermore, though it is a team sport, the bowler and batsman in particular, and
fielders to some extent, act on their own, each carrying out certain solitary actions
independently. A similar sport with respect to individual duties is baseball. The
simplicity of these actions (relative to sports such as hockey, soccer and basketball)
facilitates statistical modelling. In the process of batting, batsmen can get dismissed
(get out) due to a variety of lapses on their part. When all the batsmen from the
batting team have been dismissed, or the batting side has faced its allotted number
of overs (each over normally consists of six balls), that team’s turn (called an innings)
is concluded and its score (number of runs) is recorded. A photo of some game action
is provided in Figure 1.3. The teams then change places, and the fielding team now
gets to wield the bat and try to overtake the score of the team that batted first. At the
end of one such set of innings (in the shorter versions of cricket) or two such sets of
innings (in the longer version of cricket), the winner is the team that has scored the
most runs. This is a very simplified explanation of a very complex game, and
there are many variables and constraints that come into play. For more details on
cricket, see http://www.icc-cricket.com/cricket-rules-and-regulations/.
When international cricket matured, the standard format was a match that could last up
to five whole days. This format is called a test match. But even after five days of play
the match could end in a draw, which means that there is no winner. This was fine in
a more leisurely age when both players and spectators had more time, when playing
the game was more important than winning, and when most cricketers were amateur
players. But as lifestyles became faster, spectators became ever more reluctant or
unable to spend five days watching one match (sometimes with no result).
Meanwhile, other faster sports became crowd pullers and earned much more in the
way of ticket sales and TV rights. As a result, cricketers became the poor relatives in
the sports world. In the 1960s a shorter version of cricket, called ‘one-day cricket’,
was developed, with each batting side given 65 overs, and later 50 overs, in which to
score runs. International matches played in this format became known as
One-Day Internationals, or ODIs.
Figure 1.3: The pitch in use, with bowler, batsman, fielders and Umpire (referee).
This version of cricket was much more exciting to watch, as the batsmen had to
wield the bat aggressively. Compared to the five-day long test matches, the advent of
the 50-over format was a dramatic improvement in terms of spectator entertainment.
However, even a 50-over match lasted about 8 hours and could not compete with the
two-to-three-hour match times of ice hockey, football, baseball and basketball, which
better suited the attention spans of their fans. As competition increased for the sports
fans’ dollar, and TV advertising income was linked directly to the number of viewers,
it was inevitable that a shorter format for cricket would emerge. With declining ticket
sales and dwindling sponsorships, the England and Wales Cricket Board (ECB) discussed
the options for a shorter and more entertaining game limited to twenty overs per side,
and the first official game in this format was played on June 13, 2003. This version
became known as Twenty20. Since then, Twenty20 cricket has exploded in
popularity with the label “Twenty20” being shortened to T20. In 2008 the Indian
Premier League (IPL) was inaugurated using the T20 format.
Cricket Data Analysis is a field that combines the sport of cricket and the use of data
analysis techniques to understand and improve performance. Cricket is a highly
competitive sport that has millions of fans worldwide, and as a result, there is a vast
amount of data generated from every match. This data provides valuable information
about the performance of teams and players that can be used to make informed
decisions. The use of data analysis in cricket has been growing in recent years. From
player performance analysis to team strategy and tactical decision-making, the
insights generated from data analysis can greatly improve the performance of teams
and players. This project aims to analyze the performance of cricket teams and
players using statistical methods and provide insights into their strengths and
weaknesses. The data used for this analysis will be collected from various sources
and will include information such as runs scored, and other relevant performance
metrics. The results of this analysis will be presented in a meaningful and concise
manner to provide valuable insights for coaches, managers and analysts. Overall, this
project will demonstrate the importance of data analysis in the sport of cricket and
how it can be used to gain a competitive advantage.
2. Strategic Decision-Making: Teams can use historical match data to
formulate effective game strategies. Analysis of opposition players and teams
helps in devising specific tactics, such as field placements, bowling
variations, and batting approaches, based on past performances.
6. Media and Broadcasting: Media outlets and broadcasters use data analysis
to enhance their coverage by providing in-depth insights, statistics, and
visualizations during live broadcasts. This adds a layer of analysis and
commentary that enriches the viewing experience for fans.
7. Team Management and Planning: Coaches and team management can use
data to plan match strategies, make informed selections, and assess the
overall team dynamics. This helps in optimizing team composition and
fostering a cohesive and effective playing unit.
Cricket, beyond its athletic prowess, is a game deeply intertwined with statistics.
Every ball bowled, every run scored, and every wicket taken contributes to an
intricate tapestry of data that holds the key to understanding player performance,
team dynamics, and strategic nuances. In this project, we embark on an end-to-end
data analytics journey, leveraging web scraping, Python, Pandas, and Power BI to
unravel the mysteries hidden within the vast expanse of cricket statistics.
D. Statistical Analysis:
Calculation of essential performance metrics, including batting averages,
strike rates, and team win-loss ratios, forms the bedrock of statistical
analysis. This stage incorporates the implementation of statistical tests to
validate assumptions and unearth significant differences within the dataset.
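As a minimal sketch of the metrics named above, the following Pandas snippet derives batting average (runs per dismissal) and strike rate (runs per 100 balls faced) from innings-level rows. The column names and the sample values are assumptions made for illustration and are not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical innings-level records; column names are assumptions.
innings = pd.DataFrame(
    {
        "player": ["A", "A", "A", "B", "B"],
        "runs": [40, 25, 55, 10, 30],
        "balls": [30, 20, 40, 12, 28],
        "dismissed": [True, True, False, True, True],
    }
)


def batting_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate runs, balls and dismissals per player, then derive the metrics."""
    totals = df.groupby("player").agg(
        runs=("runs", "sum"),
        balls=("balls", "sum"),
        dismissals=("dismissed", "sum"),
    )
    # Batting average: runs per dismissal (left as NaN if never dismissed).
    dismissals = totals["dismissals"].where(totals["dismissals"] > 0)
    totals["batting_average"] = totals["runs"] / dismissals
    # Strike rate: runs scored per 100 balls faced.
    totals["strike_rate"] = 100 * totals["runs"] / totals["balls"]
    return totals


summary = batting_summary(innings)
print(summary)
```

A team win-loss ratio could be computed in the same style by counting wins and losses per team and dividing one by the other.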
E. Feature Engineering:
Adding a layer of complexity to the analysis, the project aims to create new
features that offer additional insights into player form, team dynamics, and
recent performance. The challenge lies in defining features that not only
enhance the analysis but align seamlessly with the overarching project
objectives.
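One possible "recent form" feature is a rolling mean of a player's latest scores. The sketch below assumes a table with one row per player per match; the column names, the three-innings window and the sample values are all illustrative choices, not the project's actual definitions.

```python
import pandas as pd

# Hypothetical chronological scores for two players; names are assumptions.
scores = pd.DataFrame(
    {
        "player": ["A"] * 5 + ["B"] * 5,
        "match_no": [1, 2, 3, 4, 5] * 2,
        "runs": [10, 20, 30, 40, 50, 60, 0, 30, 0, 30],
    }
)


def add_recent_form(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Add the mean of the last `window` innings (current one included) per player."""
    out = df.sort_values(["player", "match_no"]).copy()
    out["recent_form"] = out.groupby("player")["runs"].transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    return out


featured = add_recent_form(scores)
print(featured)
```

Because the window includes the current innings, the feature describes form up to and including each match; shifting the series by one row before rolling would give form going into the match instead.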
I. Documentation:
A comprehensive documentation strategy is imperative for ensuring the
reproducibility and clarity of the entire analytics process. This includes
documenting each step from data collection to analysis, with a keen focus on
creating materials accessible to both technical and non-technical audiences.
J. Presentation:
The culmination of the project involves the preparation of a presentation that
encapsulates the key aspects for diverse stakeholders. The challenge here is
to communicate technical findings in a clear and engaging manner, bridging
the gap between the intricacies of data analytics and the understanding of a
broader audience.
K. Portfolio Enhancement:
Beyond the analytical journey itself, the project seeks to contribute to the
individual's professional growth. Uploading the project code, documentation,
and presentation to a GitHub repository serves not only as a testament to the
acquired skills but also as a valuable addition to the individual's data science
portfolio.
1.4 Modules
This project aims to analyze cricket data using web scraping, Python, Pandas, and
Power BI. We will scrape match, player batting, and bowling data from ESPN
Cricinfo, clean and transform it in Python and Pandas, and finally, create interactive
dashboards in Power BI for insightful data visualization.
1. Data Scraping:
a. Target website:
ESPN Cricinfo is the chosen website for data scraping due to its
comprehensive coverage of cricket matches and player statistics.
b. Web scraping tools:
Bright Data’s data collector is used for parsing HTML content and
extracting relevant data in the form of JSON files.
c. Data scraped:
Match data (match details: location, teams, winner, date, scorecard,
etc.)
Player batting data (player name, runs scored, boundaries, strike rate,
etc.)
Player bowling data (player name, wickets taken, economy rate, etc.)
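Once the JSON files exist, they have to be flattened into tables. The sketch below shows one way to do that with pandas.json_normalize. The JSON layout, the key names (match, battingSummary, batsmanName) and the values are assumptions invented for illustration; the real collector output may be organised differently.

```python
import json

import pandas as pd

# A made-up sample in an assumed shape: one object per match with a nested
# list of batting rows. The real scraper output may look different.
raw = """
[
  {"match": "Team X vs Team Y",
   "battingSummary": [
     {"batsmanName": "Player 1", "runs": "34", "balls": "20"},
     {"batsmanName": "Player 2", "runs": "7", "balls": "9"}
   ]}
]
"""

records = json.loads(raw)

# Flatten the nested list so each batting row becomes one DataFrame row,
# carrying the parent match label along as an extra column.
batting = pd.json_normalize(records, record_path="battingSummary", meta=["match"])
print(batting)
```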
2. Data cleaning & transformation:
a. Import libraries:
Pandas library is used for data cleaning, manipulation and analysis.
b. Data cleaning:
In data cleaning, HTML tags and unwanted characters are removed.
Missing values and inconsistencies are handled. Data formats are
standardized.
c. Data transformation:
Creating new features or derived variables (e.g., average score,
bowling average).
Combining data from different sources (match and player data).
Aggregating data for specific analyses (e.g., player performance over a
specified period).
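The cleaning and transformation steps listed above can be sketched in Pandas as follows. The table layout, the column names, the "*" marker for a not-out score and the choice to fill missing runs with zero are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw tables; column names and the "*" not-out marker are assumptions.
batting_raw = pd.DataFrame(
    {
        "match_id": ["M1", "M1", "M2"],
        "player": [" Player 1 ", "Player 2", "Player 1"],
        "runs": ["34*", "7", None],
    }
)
matches = pd.DataFrame({"match_id": ["M1", "M2"], "ground": ["Ground A", "Ground B"]})

clean = batting_raw.copy()
clean["player"] = clean["player"].str.strip()  # remove stray whitespace
clean["not_out"] = clean["runs"].str.endswith("*", na=False)  # keep the flag
clean["runs"] = (
    clean["runs"]
    .str.replace("*", "", regex=False)  # drop unwanted characters
    .pipe(pd.to_numeric, errors="coerce")  # standardise to numbers
    .fillna(0)  # one possible policy for missing values
    .astype(int)
)

# Combine player rows with match details, then aggregate per player.
combined = clean.merge(matches, on="match_id", how="left")
per_player = combined.groupby("player", as_index=False)["runs"].sum()
print(per_player)
```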
3. Data analysis and insights:
a. DAX: calculate essential performance indicators for players and teams,
such as:
Batting average, strike rate, highest score for batters.
Bowling average, economy rate, most wickets for bowlers.
Win/loss ratio, average score for teams.
b. Descriptive statistics:
Analyse data using descriptive statistics to understand data distribution.
c. Data visualization:
The data is organized in a snowflake schema, and charts and graphs are created to
visualize trends and patterns in player performance and team strategies.
d. Hypothesis testing:
Perform statistical tests to identify significant differences between players
or teams based on specific criteria.
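The report does not name a specific statistical test, so the sketch below swaps in a simple two-sided permutation test for a difference in means, written with the Python standard library only. The function name, the sample strike rates and the number of resamples are illustrative assumptions.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the share of random relabellings
    whose absolute difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[: len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            hits += 1
    # The add-one correction keeps the estimate away from an impossible zero.
    return observed, (hits + 1) / (n_resamples + 1)


# Hypothetical strike rates of two players over six innings each.
player_a = [150, 162, 148, 171, 158, 166]
player_b = [121, 130, 118, 127, 135, 124]
diff, p_value = permutation_test(player_a, player_b)
print(f"difference = {diff:.1f}, p = {p_value:.4f}")
```

A small p-value suggests that the gap between the two players is unlikely to arise from random labelling alone; with real data, the sample sizes and match conditions would need to be considered before drawing conclusions.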
4. Power BI dashboards:
a. Data import:
Import the cleaned and transformed data into Power BI.
b. Relationship building:
Define relationships between the different data tables for accurate analysis.
c. Visualization creation:
Create interactive dashboards with various visuals such as:
Bar charts and line charts to compare player performance.
Scatter plots to identify trends and outliers.
Maps to analyse team performance across different venues.
2. REQUIREMENTS ANALYSIS WITH SRS:
2.1 WHAT IS A SOFTWARE REQUIREMENTS SPECIFICATION (SRS) DOCUMENT?
A software requirements specification (SRS) is a description of a software
system to be developed. It lays out functional and non-functional
requirements and may include a set of use cases that describe the user
interaction that the software must provide. The SRS document lists the
necessary and sufficient requirements for developing the project. To derive
these requirements, we need a clear and thorough understanding of the
product being developed. This understanding is achieved and refined through
detailed and continuous communication with the project team and the customer
until the software is complete. The SRS is also a communication tool between
stakeholders and software designers.
The specific goals of the SRS are:
• Facilitating reviews
• Describing the scope of work
• Providing a reference to software designers (i.e. navigation aids, document
structure)
• Providing a framework for testing primary and secondary use cases
• Including features to the customer requirements
• Providing a platform for the ongoing refinement (via incomplete specs
or questions)
o Data Cleaning and Transformation: Cleaning scraped
data, handling missing values, and transforming data into a
suitable format for analysis.
o Exploratory Data Analysis (EDA): Analyzing data
trends, identifying patterns, and generating insights.
o Building Statistical Models: Building models to predict
match outcomes, player performance, etc.
Pandas:
• Purpose:
o Data Manipulation: Efficiently handle and organize data
in data frames.
o Data Analysis: Perform calculations, aggregations, and
statistical analysis on the data.
o Data Visualization: Create various charts and graphs to
visualize data insights.
Power BI:
• Purpose:
o Building Interactive Dashboards: Create visually appealing
and interactive dashboards to showcase analysis results in
a user-friendly way.
o Sharing Insights: Share dashboards and reports with
stakeholders for deeper understanding and decision-
making.
Additional technologies:
• Jupyter Notebook: Interactive environment for developing and
analysing data in Python.
• Git: Version control system for managing project code and data.
Parameters for selecting players:
A. OPENERS (2 PLAYERS):
The openers are the two players who bat first in an innings. This is a
crucial role, as they are responsible for setting the tone for the innings
and providing a solid foundation for the rest of the batting team.
These five players will give an average of 100-120 runs
Parameter              Description                                          Criteria
Innings bowled         Total innings bowled by the bowler.                  >2
Bowling economy        Average runs allowed per over.                       >7
Bowling strike rate    Average number of balls required to take a wicket.   <20
Table 2.4: Parameters for selecting allrounders for a match.
Table 2.5: Parameters for selecting specialist fast bowlers for a match.
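Threshold rules of this kind translate directly into a row filter. The sketch below is a hypothetical helper, not the project's implementation: the function name, column names and player rows are assumptions, and the three thresholds are copied as printed in the rows above (innings bowled >2, bowling economy >7, bowling strike rate <20).

```python
import operator

import pandas as pd

# Map the comparison symbols used in the table to Python operators.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def select_players(stats: pd.DataFrame, criteria: dict) -> pd.DataFrame:
    """Keep the rows that satisfy every (symbol, threshold) rule in `criteria`."""
    mask = pd.Series(True, index=stats.index)
    for column, (symbol, threshold) in criteria.items():
        mask &= OPS[symbol](stats[column], threshold)
    return stats[mask]


# Thresholds copied as printed in the table; change the symbol if a different
# direction is intended. Column names are assumptions.
criteria = {
    "innings_bowled": (">", 2),
    "bowling_economy": (">", 7),
    "bowling_strike_rate": ("<", 20),
}

# Hypothetical per-player aggregates.
stats = pd.DataFrame(
    {
        "player": ["P1", "P2", "P3"],
        "innings_bowled": [5, 1, 4],
        "bowling_economy": [7.5, 8.0, 7.2],
        "bowling_strike_rate": [15.0, 12.0, 24.0],
    }
)
selected = select_players(stats, criteria)
print(selected["player"].tolist())
```

Keeping the rules in a dictionary means each player role can have its own set of thresholds without changing the filtering code.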
3. SYSTEM DESIGN
3.1 REQUIREMENT ANALYSIS DIAGRAMS:
3.1.2 ENTITY-RELATIONSHIP DIAGRAM:
3.1.3 USE CASE DIAGRAM:
3.1.4 DATA FLOW DIAGRAMS:
LEVEL 0:
Explanation:
1. Cricket Data Analytics Project:
2. External Entities:
3. Data Stores:
4. Data Flows:
➔ Arrows indicate the flow of data between processes, entities, and data
stores.
• From Raw Cricket Data to Pandas: Represents the flow of data for
manipulation.
• From Processed Data to Power BI: Represents the flow of data for
visualization.
This Level 0 DFD provides a high-level overview of the major components and their
interactions in the cricket data analytics project. Depending on the complexity of
the project, more detailed DFDs at lower levels may be needed to further
elaborate on subprocesses and data flows within each component.
Level 1:
1. EXTERNAL ENTITIES:
2. PROCESSES:
▪ Detailed Flow: Arrows show the detailed extraction process,
indicating the flow of data from the web source to the Local
File System (Raw Cricket Data).
3. DATA STORES:
➔ Power BI Dataset:
4. ANNOTATIONS:
➔ Descriptions:
▪ Descriptions provide clarity on the purpose of each
external entity, process, and data store.
➔ External Annotations:
4. TEST PLAN
4.1 TEST CASE 1:
Fig 4.1: List of 11 players selected on the basis of different parameter values
4.2 TEST CASE 2:
Fig 4.2: List of 11 players selected on the basis of different parameter values
4.3 TEST CASE 3:
5. BODY OF THESIS
5.1 LITERATURE REVIEW:
Cricket, a captivating sport with millions of fans globally, generates a wealth of
data, presenting a fertile ground for the application of data analytics. This
comprehensive literature review delves deep into the current landscape of cricket
analytics, exploring the diverse methodologies employed, scrutinizing the
challenges faced, and charting a path towards promising future directions.
Identifying player matchups: Analyzing data to understand individual
player strengths and weaknesses against specific opponents assists in
formulating optimal strategies and predicting potential outcomes.
Simulating match scenarios: Utilizing data-driven simulations enables
teams to test different strategies and predict potential outcomes under
various conditions, fostering informed decision-making and risk
management.
1.3. Predictive Modeling:
Predicting match winners: By analyzing historical data, team
compositions, and current playing conditions, predictive models can
forecast match winners with increasing accuracy, aiding in strategic
planning and resource allocation.
Predicting individual performances: Utilizing data-driven models,
analysts can predict individual player performance metrics like runs
scored or wickets taken, providing valuable insights for coaches and team
management.
Predicting impactful moments: Identifying statistically significant factors
that lead to pivotal moments in a match, such as sixes or dismissals,
empowers teams to capitalize on these moments and strategize
accordingly.
1.4. Talent Identification:
Identifying promising young players: Analyzing youth academy data and
player performance in lower leagues helps identify promising young
talent with the potential to excel at the highest level, facilitating early
recruitment and development.
Assessing fitness and injury risk: Utilizing data to analyze player fitness
levels and injury history assists in managing player workloads, preventing
injuries, and maximizing player availability.
Evaluating player development programs: Data analysis facilitates the
evaluation of player development programs, measuring their effectiveness
and identifying areas for improvement, optimizing talent development
initiatives.
2. Unveiling Insights: A Tapestry of Methodologies:
Cricket analytics utilizes a diverse range of methodologies, each offering its
unique strengths and insights:
2.1. Traditional Statistical Analysis:
Descriptive statistics: Calculating averages, medians, and standard
deviations provides a basic understanding of player and team
performance, enabling initial analysis and identifying trends.
Correlation analysis: Identifying relationships between different variables,
such as a batsman's strike rate and the size of the boundary, unveils
underlying patterns and informs strategic decision-making.
Regression analysis: Predicting continuous variables like a batsman's
score or a bowler's wickets based on other relevant factors allows for
deeper analysis of player performance and assists in talent identification.
2.2. Machine Learning Techniques:
Decision trees: These models classify data based on a series of rules,
aiding in predicting match outcomes, player roles, and impactful moments
within a match.
Random forests: Comprised of multiple decision trees, these models offer
improved accuracy and robustness compared to single decision trees,
enhancing the reliability of predictions.
Neural networks: These complex models learn from large amounts of data
and can identify complex patterns and relationships, providing deeper
insights into player performance and match dynamics.
2.3. Advanced Techniques:
Natural Language Processing (NLP): Analyzing commentary, news
articles, and social media using NLP techniques helps understand public
sentiment, identify trends, and gauge the impact of specific events.
Computer vision: Analyzing video footage to track player movements,
ball trajectories, and predict outcomes based on real-time data
revolutionizes performance analysis and tactical decision-making.
Big data integration: Utilizing diverse data sources, including weather
data, player bios, and social media interactions, provides a more
comprehensive view of the sport and its complexities, uncovering hidden
patterns and insights.
3. Navigating Challenges: Towards Responsible and Ethical Data
While cricket analytics offers immense potential, it also faces challenges that
require careful consideration:
3.1. Data Availability and Quality:
Accessing comprehensive and accurate data, particularly for historical
matches and lower-tier leagues, remains a challenge, hindering the scope
and depth of analysis.
Inconsistency in data collection and recording methods, especially for
older datasets, can impact the accuracy of results and limit the
generalizability of findings.
3.2. Model Generalizability and Explainability:
Models trained on specific data sets may not generalize well to different
playing conditions, teams, or players, leading to inaccurate predictions
and limited applicability.
Lack of transparency in complex AI models can hinder trust and raise
ethical concerns, requiring the development of explainable AI techniques
for greater understanding and accountability.
3.3. Ethical Considerations and Data Privacy:
Bias in data sets and potential misuse of information require careful
consideration.
Ensuring data privacy and protecting personal information of players and
other stakeholders is paramount.
3.4. Transparency and Collaboration:
Sharing data and collaborating between different stakeholders, including
cricket boards, teams, and researchers, can accelerate progress and lead to
more comprehensive and reliable insights.
Open access to data and research findings can promote
transparency, encourage innovation, and benefit the sport as a whole.
Utilizing large datasets from diverse sources, including social
media, weather data, and player bios, will provide a more comprehensive
view of the sport and its complexities, uncovering hidden patterns and
insights.
Advanced analytics techniques, such as natural language processing
(NLP) and computer vision, will offer deeper insights into player
performance, team dynamics, and match outcomes.
4.2. Explainable AI and Responsible Data Governance:
Developing explainable AI models that can provide clear explanations for
their predictions will enhance transparency and trust in their
results, fostering informed decision-making.
Implementing responsible data governance practices will ensure ethical
data utilization, safeguard privacy, and promote fairness and
transparency.
4.3. Democratization of Cricket Analytics:
Making data and analytics tools more accessible and user-friendly will
empower coaches, players, and fans to gain valuable insights and
personalize their approach to the sport.
Open-source data platforms and collaboration initiatives will promote
innovation and accelerate the development of new applications and
analytical tools.
5.2 Enhanced player development:
Tailoring training programs based on individual strengths and
weaknesses, optimizing player performance, and fostering
personalized growth.
6. Improved team performance:
As cricket analytics continues to evolve, its impact will extend beyond the field,
influencing coaching methods, broadcasting strategies, and the overall fan
experience. In conclusion, embracing a data-driven approach promises to
revolutionize the sport, fostering innovation, enhancing performance, and
enriching the experience for players, fans, and stakeholders alike.
5.2 METHODOLOGY:
5.2.1 DATA COLLECTION USING WEB SCRAPING:
➔ Evaluate the features and functionalities of each tool based on the
complexity of the website structure, dynamic content presence, and
desired scraping depth.
➔ Choose a library that offers suitable data parsing capabilities and
allows for efficient extraction of relevant information.
5. Grouping Data:
➔ Group data based on specific criteria to perform aggregate
calculations and analysis.
➔ Power Query's Group By function allows grouping by one or
more columns and calculating various aggregate functions like
sum, average, or count.
➔ This helps identify trends and patterns within different groups of
players, teams, or matches.
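The same grouping idea can be expressed in Pandas, which the project also uses. This is an illustrative analogue of a Group By step rather than Power Query code, and the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical batting rows; column names are assumptions.
batting = pd.DataFrame(
    {
        "team": ["X", "X", "Y", "Y", "Y"],
        "player": ["P1", "P2", "P3", "P4", "P3"],
        "runs": [30, 50, 20, 40, 60],
    }
)

# Group by one column and compute several aggregates at once,
# mirroring the sum, average and count options described above.
by_team = batting.groupby("team")["runs"].agg(
    total="sum", average="mean", innings="count"
)

# Grouping by more than one column works the same way.
by_team_player = batting.groupby(["team", "player"], as_index=False)["runs"].sum()

print(by_team)
print(by_team_player)
```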
➔ This enables interactive exploration of the data and facilitates user-
driven insights.
➔ Parameters can be based on specific filters, date ranges, or other
criteria.
4. Testing and Refining Model:
➔ Test the data model for accuracy and performance, and ensure that the
desired calculations and measures function correctly.
➔ Utilize Power BI's testing tools and data validation capabilities to
identify any errors or inconsistencies.
➔ Refine the model based on testing results and optimize performance
for efficient analysis.
5. Documenting the Model:
➔ Create documentation for the data model, including:
o Descriptions of tables, columns, and relationships.
o Definitions of calculated measures and parameters.
o Explanation of cleaning and transformation procedures.
o Assumptions and limitations of the data and analysis.
➔ This documentation ensures clarity, transparency, and
reproducibility of the analysis for future reference and
collaboration.
functions and algorithms for various tasks, including data
cleaning, exploration, transformation, and modeling.
➔ Marketing: Understanding customer behavior, segmenting target
audiences, and optimizing marketing campaigns.
5.3.2 PANDAS
o Data cleaning and preprocessing
o Data visualization
• Financial Analysis: Analyzing stock market data, calculating financial
ratios, and building trading strategies.
• Healthcare: Analyzing patient data, identifying risk factors for
diseases, and developing treatment plans.
• Marketing: Analyzing customer behavior, segmenting target
audiences, and optimizing marketing campaigns.
• Social Sciences: Analyzing survey data, studying social
interactions, and exploring trends in social media.
• Science and Engineering: Analyzing experimental data, modeling
scientific phenomena, and designing simulations.
➔ Visual Query Editor: The visual query editor provides a user-
friendly interface for building complex data transformation steps
without writing code.
➔ Data Modeling: Power Query allows for creating data models by
establishing relationships between different datasets. This facilitates
multi-dimensional analysis and simplifies complex data exploration.
➔ M Language: For advanced users, Power Query offers the M
language, which provides a powerful scripting environment for
creating custom functions and automating complex tasks.
➔ Integration with Excel: The transformed data seamlessly integrates
with Excel spreadsheets, enabling users to leverage Excel's familiar
functionalities for further analysis and visualization.
➔ Marketing: Cleaning and combining customer data from different
sources, segmenting target audiences, and analyzing marketing
campaign performance.
➔ Operations: Analyzing operational data to identify
inefficiencies, improve processes, and optimize resource allocation.
➔ Human Resources: Transforming employee data for
analysis, calculating HR metrics, and creating custom dashboards.
➔ Sales: Analyzing sales data to identify trends, forecast sales, and
improve customer service.
5.3.4 DAX
➔ Time Intelligence Functions: DAX offers a powerful set of time
intelligence functions that enable you to analyze and visualize
trends, patterns, and seasonality in your data over time.
➔ Relationships and Filters: DAX formulas can leverage relationships
between tables and apply filters to specific data subsets, enabling you
to focus your analysis on relevant information.
➔ Logical and Conditional Operations: DAX supports various logical
and conditional operators, allowing you to build complex formulas
that perform different calculations based on specific conditions within
your data.
➔ Integration with Other Tools: DAX integrates seamlessly with other
Power BI features and tools, such as Power Query and Power
Pivot, enabling you to create a complete data analysis workflow.
➔ Marketing: Analyzing campaign performance, identifying customer
segments, and measuring marketing ROI.
➔ Operations: Monitoring key performance indicators, identifying
bottlenecks, and optimizing business processes.
➔ Human Resources: Analyzing employee data, calculating HR
metrics, and predicting workforce trends.
➔ Sales: Analyzing sales performance, forecasting future sales, and
identifying customer churn risks.
5.3.5 POWER BI
➔ Data Visualization: Power BI offers a rich collection of built-in
visualizations, including charts, graphs, maps, and custom visuals. These
visualizations enable users to explore their data in interactive and
visually appealing ways, facilitating deeper understanding and faster
decision making.
➔ Data Analysis: Power BI provides powerful data analysis
tools, including calculated columns, measures, and DAX
functions. These tools allow users to perform complex
calculations, aggregations, and logical operations on their data to
uncover hidden patterns and trends.
➔ Interactive Dashboards: Power BI facilitates the creation of interactive
dashboards that present key insights at a glance. Users can filter
data, drill down into details, and customize dashboards to meet their
specific needs.
➔ Collaborative Sharing: Power BI allows users to easily share their
dashboards and reports with others within their organization. This
enables collaboration and promotes informed decision making across
teams.
➔ Security and Governance: Power BI offers robust security features that
enable organizations to control access to data and ensure compliance
with regulatory requirements.
➔ Scalability and Flexibility: Power BI is built on a scalable cloud
architecture that can handle large datasets and complex analytical
workloads. It also offers flexible deployment options, including on-
premises, cloud, and hybrid deployments.
➔ Enhanced collaboration: Power BI facilitates collaboration by
allowing users to share insights and work together on data-driven
projects.
➔ Better communication: Power BI's interactive visualizations
effectively communicate complex data to stakeholders at all levels.
➔ Reduced costs: Power BI offers a cost-effective solution for data
analysis and visualization, reducing the need for more expensive BI tools
and consultants.
5.3.6 BEAUTIFULSOUP4
➔ Parsing: Beautiful Soup parses HTML and XML documents into tree-
like structures, allowing you to navigate and access elements based on
their tags, attributes, and relationships.
➔ Selection: You can easily select specific elements within the document
using various methods based on tags, classes, IDs, attributes, and text
content.
➔ Extraction: Beautiful Soup facilitates the extraction of specific data
points from the parsed document, such as text content, attribute
values, and table data.
➔ Navigation: You can efficiently navigate the document structure by
traversing through parent-child relationships, siblings, and descendants
of elements.
➔ Modification: Beautiful Soup allows you to modify the parsed
document structure by adding, removing, or editing elements and
attributes.
➔ Customization: You can customize the parsing process and extraction
behavior by defining filters, regular expressions, and custom functions.
➔ Integrations: Beautiful Soup integrates seamlessly with other Python
libraries, such as Requests and Selenium, for more complex web
scraping tasks.
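The parsing, selection, navigation and extraction features listed above can be seen together in a short sketch. The HTML fragment, its id and class names, and the player rows are made up for illustration; real scorecard pages will use different markup.

```python
from bs4 import BeautifulSoup

# A made-up scorecard fragment; real pages will have different markup.
html = """
<table id="batting">
  <tr><th>Batter</th><th>Runs</th></tr>
  <tr class="row"><td>Player 1</td><td>34</td></tr>
  <tr class="row"><td>Player 2</td><td>7</td></tr>
</table>
"""

# Parsing: build a navigable tree with the built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Selection: a CSS selector combining tag, id and class.
rows = soup.select("table#batting tr.row")

scores = []
for row in rows:
    cells = row.find_all("td")  # navigation to the child cells
    name = cells[0].get_text(strip=True)  # extraction of text content
    runs = int(cells[1].get_text(strip=True))
    scores.append((name, runs))

print(scores)
```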
5.3.7 NumPy
➔ Broadcasting: NumPy automatically broadcasts arrays of different
shapes to facilitate operations between them, eliminating the need for
explicit looping and reshaping.
➔ Linear Algebra: NumPy includes a comprehensive set of functions
for performing linear algebra operations like matrix
multiplication, inversion, and eigenvalue decomposition.
➔ Random Number Generation: NumPy provides functions for
generating random numbers from various distributions like
uniform, normal, and binomial, facilitating simulations and statistical
analysis.
➔ Integration with Other Libraries: NumPy seamlessly integrates with
other scientific libraries like scikit-learn, Matplotlib, and
pandas, enabling collaborative data analysis and visualization.
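As a brief illustration of the features listed above, the following sketch
shows broadcasting, basic linear algebra, and seeded random number generation.
It assumes NumPy is installed. The run and ball counts are hypothetical values
chosen for the example, not figures from the project's dataset.

```python
import numpy as np

# Hypothetical runs scored and balls faced by three batsmen in two matches.
runs = np.array([[45, 30], [12, 60], [8, 25]])    # shape (3, 2)
balls = np.array([[30, 24], [15, 40], [10, 20]])  # shape (3, 2)

# Broadcasting: the scalar 100 is applied to every element of the array.
strike_rate = runs / balls * 100

# Broadcasting a (2,) array against a (3, 2) array: every row is divided
# by the per-match totals without an explicit loop.
match_totals = runs.sum(axis=0)       # shape (2,)
share_of_total = runs / match_totals  # shape (3, 2)

# Linear algebra: inverse, matrix product, and eigenvalues.
a = np.array([[2.0, 0.0], [0.0, 4.0]])
a_inv = np.linalg.inv(a)
identity = a @ a_inv
eigenvalues = np.linalg.eigvals(a)

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
simulated_runs = rng.binomial(n=6, p=0.5, size=5)  # five draws between 0 and 6
```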
Matplotlib is a fundamental and widely used Python library for creating
publication-quality data visualizations. Its versatility, ease of use, and
extensive functionality make it a vital tool for researchers, analysts, and
anyone who wants to communicate data insights effectively. This section
outlines the key features and benefits of Matplotlib.
Matplotlib is an open-source Python library that empowers users to create a
wide range of data visualizations, from simple line graphs and bar charts to
complex heatmaps and scatter plots. It provides a comprehensive API with
various functions and options for customizing plots, annotations, and styles.
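A minimal sketch of this API is shown below: a scatter plot of batting average
against strike rate with annotated points. It assumes Matplotlib is installed.
The player names, the figures, and the output file name are hypothetical and
used only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Hypothetical batting figures, for illustration only.
players = ["Player A", "Player B", "Player C", "Player D"]
batting_average = [42.0, 35.5, 28.0, 51.2]
strike_rate = [145.0, 160.3, 120.8, 138.4]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(strike_rate, batting_average)

# Annotate each point with the player's name, offset slightly from the marker.
for name, sr, avg in zip(players, strike_rate, batting_average):
    ax.annotate(name, (sr, avg), textcoords="offset points", xytext=(5, 5))

# Customize axis labels and title, then write the figure to disk.
ax.set_xlabel("Strike rate")
ax.set_ylabel("Batting average")
ax.set_title("Batting average vs. strike rate")
fig.savefig("avg_vs_strike_rate.png", dpi=150)
```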
Benefits of using Matplotlib:
➔ Versatility: Matplotlib's extensive plot types and customization
options make it suitable for visualizing a wide range of data and
conveying diverse information.
➔ Ease of Use: The library offers a user-friendly API and comprehensive
documentation, allowing beginners and experienced users alike to
create effective visualizations.
➔ Large Community and Resource Base: Matplotlib boasts a vast and
active community, providing extensive documentation, tutorials, and
examples to support users at all skill levels.
➔ Open-source and Free: Being open-source and freely
available, Matplotlib eliminates licensing costs and enables
accessibility for a broad range of users.
➔ Publication-quality Output: Matplotlib generates high-resolution and
publication-ready visualizations, making it ideal for scientific
papers, presentations, and reports.
This facilitates collaboration and knowledge sharing among team
members.
3. Streamline Data Analysis: Effective database management reduces
the time and effort required to process and analyze data. This allows
analysts to focus on extracting insights and generating valuable
reports.
2. NoSQL Databases: These databases are designed for handling large
amounts of unstructured data and can be more efficient in scaling up.
Examples include MongoDB, Cassandra, and HBase.
3. Cloud Databases: Cloud-based databases offer scalability,
flexibility, and cost-effectiveness. Amazon Redshift, Microsoft Azure
SQL Database, and Google Cloud SQL are popular choices.
5.5.1 LIMITATIONS:
These limitations can hinder the accuracy and effectiveness of analysis,
ultimately impacting decision-making and strategic planning.
1. Data Availability and Quality
One of the biggest challenges in cricket data analysis is the availability
and quality of data. This is particularly true for historical data, where
scoring systems and data recording methods have changed significantly
over time. Furthermore, access to granular data, such as detailed ball-by-
ball information, can be limited, especially for lower levels of the sport.
Data Availability Issues:
➔ Historical data: Scoring systems and data recording methods have
evolved significantly over time, making it difficult to compare data
across different eras.
➔ Limited access to granular data: Ball-by-ball data is often
unavailable for lower levels of cricket, hindering analysis of
individual performances and tactical nuances.
➔ Incomplete or inaccurate data: Data entry errors and
inconsistencies can significantly impact the reliability of analysis.
➔ Proprietary data: Data ownership rights can restrict access to
valuable information, limiting the scope of research and analysis.
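The incomplete and inaccurate data issue above can be made concrete with a
small validation pass over scraped records. This is only a sketch using the
Python standard library. The record layout, the field names, and the sample
values are hypothetical and do not come from the project's dataset.

```python
# Hypothetical batting records; None marks a value missing from the source.
records = [
    {"batsman": "Player A", "runs": "52", "balls": "30"},
    {"batsman": "Player B", "runs": None, "balls": "12"},
    {"batsman": "Player C", "runs": "17", "balls": "-"},
    {"batsman": "", "runs": "8", "balls": "10"},
]


def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not record.get("batsman"):
        problems.append("missing batsman name")
    for field in ("runs", "balls"):
        value = record.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif not value.isdigit():
            problems.append(f"non-numeric {field}: {value!r}")
    return problems


# Map each record index to its problems, then keep only the clean records.
report = {i: find_problems(r) for i, r in enumerate(records)}
clean = [r for i, r in enumerate(records) if not report[i]]

for index, problems in report.items():
    if problems:
        print(index, problems)
```

Flagging problems instead of silently dropping rows makes it easier to judge
how much of the dataset is affected before any analysis is run.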
➔ Measurement bias: The way data is collected and recorded can
introduce bias, such as favouring certain types of players or
performances.
➔ Confirmation bias: Analysts may unconsciously interpret data in a
way that supports their existing beliefs.
4. Ethical Considerations:
As cricket data analysis becomes increasingly sophisticated, ethical
considerations become more important. These include:
➔ Data privacy: Protecting the privacy of players and other
individuals whose data is collected and analysed is essential.
➔ Algorithmic bias: Ensuring that algorithms used for analysis are
fair and unbiased is crucial to prevent discrimination and
unintended consequences.
➔ Transparency and accountability: Analysts should be
transparent about their methods and data sources and be held
accountable for the accuracy and fairness of their results.
official cricket boards, sports news platforms, and fantasy cricket
websites.
➔ Live data streaming: Integrate live data streaming APIs to
capture match details and analyze player performance in real
time.
➔ Social media analysis: Capture and analyze social media data
related to cricket matches to understand fan sentiment and trends.
2. Data analysis and visualization:
➔ Advanced statistical analysis: Implement advanced statistical
techniques such as predictive modeling to forecast match
outcomes and player performance, and to analyze player strengths
and weaknesses.
➔ Visualization enhancements: Utilize Power BI's advanced
visualization capabilities to create interactive
dashboards, heatmaps, and network graphs for deeper insights.
➔ Sentiment analysis: Analyze textual data from news
articles, social media, and fan forums to understand public opinion
about teams, players, and specific matches.
3. Machine learning and AI:
➔ Develop predictive models: Utilize machine learning algorithms
to predict match outcomes and player performances, and to identify
potential upsets or unexpected wins.
➔ Player recommendation systems: Develop AI-powered systems
to recommend players for specific roles or team compositions
based on their past performances and playing styles.
➔ Automated data analysis: Implement machine learning
algorithms to automate data analysis tasks, saving time and
resources.
4. Additional enhancements:
➔ Interactive storytelling: Use Power BI's storytelling features to
create interactive reports that engage the audience and
communicate insights effectively.
➔ Mobile accessibility: Develop mobile-friendly dashboards and
reports for easy access and analysis on the go.
➔ Natural language interaction: Integrate natural language
processing to enable users to interact with the data and generate
insights using voice commands or natural language queries.
5. Ethical considerations:
➔ Data privacy: Ensure data is collected and used
ethically, respecting user privacy and complying with relevant data
regulations.
➔ Bias and fairness: Be aware of potential biases in data sources and
algorithms and strive to develop fair and unbiased models.
➔ Transparency and explainability: Explain the data analysis
methods and results clearly to ensure transparency and build trust
with users.
6. RESULT
This dynamic dashboard helps users visualise the performance and growth
patterns of different players.
Users can not only visualise the data but also build their own teams, either
from the performance analysis or according to their own preference. This
encourages users to explore the records of different cricket players.
The dashboard has a place for each role, such as openers and power hitters.
For each role there is a graph that shows consistency and detailed
information, and a scatter plot that shows how each player's batting average
fares against their strike rate.
For example, in this plot Jos Buttler has the highest average and is also a
good striker, scoring at a strike rate of 140 plus, which is one of the
selection parameters. We therefore select Jos Buttler first, because he is
consistent and played decently in all the matches. Alex Hales is a candidate
for his partner, but we also need a left-hand option in the combination with
a better strike rate, so we choose Rilee Rossouw because he strikes the ball
very well. Together they give about 40 runs on average at a strike rate of
150 plus and can bat for about 4 overs on average, which is all we need. In
this way the dashboard can show the combined performance of two players
simultaneously.
Next we select three anchors. Virat Kohli is first, as the scatter plot shows
that he scores a lot of runs. The second is Suryakumar Yadav, because his
average is good and he strikes at 180. Based on the statistics, the fifth
player is Glenn Phillips, because his strike rate is very high.
At number six we choose Marcus Stoinis because of his batting average, and at
number seven Sikandar Raza because he has a high strike rate and batting
average. At number eight we choose Shadab Khan, because he is consistent and
both his bowling average and his batting are good. For the next place we
choose an all-rounder, Sam Curran, as his bowling average is very good.
Anrich Nortje and Shaheen Shah Afridi are also selected because they are
excellent fast bowlers. The final team on the basis of this analysis is as
follows:
Total players: 11
1. Jos Buttler (wicket-keeper)
2. Rilee Rossouw (batsman)
3. Virat Kohli (batsman)
4. Suryakumar Yadav (batsman)
5. Glenn Phillips (batsman)
6. Marcus Stoinis (all-rounder)
7. Sikandar Raza (all-rounder)
8. Shadab Khan (all-rounder)
9. Sam Curran (bowler)
10. Anrich Nortje (bowler)
11. Shaheen Shah Afridi (bowler)
We can also search for any player and see a detailed result and analysis,
which includes their batting style, playing role, bowling style, strike rate
and so on. The dashboard thus analyses the overall performance of each
player.
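The selection reasoning in this section relies on two standard batting
statistics: batting average (runs per dismissal) and strike rate (runs per 100
balls faced). The following sketch shows how a shortlist for a role could be
filtered on both. The player figures and the minimum average of 30 are
hypothetical; only the strike rate threshold of 140 comes from the discussion
above. The dashboard itself performs this filtering in Power BI, so this is an
illustration and not the project's implementation.

```python
def batting_average(runs, dismissals):
    """Runs per dismissal; None if the player was never dismissed."""
    return runs / dismissals if dismissals else None


def strike_rate(runs, balls):
    """Runs scored per 100 balls faced; None if no balls were faced."""
    return 100 * runs / balls if balls else None


# Hypothetical aggregate figures, not taken from the tournament data.
players = [
    {"name": "Player A", "runs": 225, "balls": 150, "dismissals": 5},
    {"name": "Player B", "runs": 120, "balls": 100, "dismissals": 6},
    {"name": "Player C", "runs": 180, "balls": 100, "dismissals": 6},
]


def shortlist(players, min_average, min_strike_rate):
    """Return (name, average, strike rate) for players meeting both thresholds."""
    selected = []
    for p in players:
        avg = batting_average(p["runs"], p["dismissals"])
        sr = strike_rate(p["runs"], p["balls"])
        if avg is None or sr is None:
            continue  # skip players whose statistics are undefined
        if avg >= min_average and sr >= min_strike_rate:
            selected.append((p["name"], round(avg, 1), round(sr, 1)))
    # Highest batting average first.
    return sorted(selected, key=lambda item: item[1], reverse=True)


openers = shortlist(players, min_average=30, min_strike_rate=140)
print(openers)
```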
7. SUMMARY & CONCLUSIONS:
7.1 Summary
This dashboard is designed to analyze the performance of players in cricket. The
core features of this platform are designed to create a holistic analysis
experience through the use of different technologies.
Cricket Data Analysis is a field that combines the sport of cricket and the use of
data analysis techniques to understand and improve performance. Cricket is a
highly competitive sport that has millions of fans worldwide, and as a result,
there is a vast amount of data generated from every match. This data provides
valuable information about the performance of teams and players that can be
used to make informed decisions.
The use of data analysis in cricket has been growing in recent years. From player
performance analysis to team strategy and tactical decision-making, the insights
generated from data analysis can greatly improve the performance of teams and
players. This project aims to analyze the performance of cricket teams and
players using statistical methods and provide insights into their strengths and
weaknesses. The data used for this analysis will be collected from various
sources and will include information such as runs scored and other relevant
performance metrics.
Cricket data analysis in modern sports holds immense significance as it provides
a systematic and data-driven approach to understanding and improving player
performance, team strategies, and overall game dynamics.
This project will demonstrate the importance of data analysis in the sport of
cricket and how it can be used to gain a competitive advantage.
Cricket, beyond its athletic prowess, is a game deeply intertwined with statistics.
Every ball bowled, every run scored, and every wicket taken contributes to an
intricate tapestry of data that holds the key to understanding player performance,
team dynamics, and strategic nuances. In this project, we embark on an end-to-
end data analytics journey, leveraging web scraping, Python, Pandas, and Power
BI to unravel the mysteries hidden within the vast expanse of cricket statistics.
The dashboard acts as a central hub, providing users with an intuitive interface to
access the players of different teams and their performance. It allows users to
visually track the performance and growth patterns of various players. Users can
also build a new team of their own, either from the performance analysis or
according to their own preference, which encourages them to explore different
cricket players.
7.2 Conclusion:
In conclusion, we have built a cricket dashboard using technologies such as Power
BI, Git, Jupyter Notebook, pandas, and web scraping.
APPENDIX A
Report View
Model View
Dataset View
Match Summary:
Batting Summary:
Bowling Summary:
Player Information:
APPENDIX B
// Collect one record per row of the first-innings batting table.
var battingSummary = []
firstInningRows.each((index, element) => {
    var tds = $(element).find('td');
    battingSummary.push({
        "match": matchInfo,
        "teamInnings": team1,
        "battingPos": index+1,
        "batsmanName": $(tds.eq(0)).find('a > span > span').text().replace(' ', ''),
        "dismissal": $(tds.eq(1)).find('span > span').text(),
        "runs": $(tds.eq(2)).find('strong').text(),
        "balls": $(tds.eq(3)).text(),
        "4s": $(tds.eq(5)).text(),
        "6s": $(tds.eq(6)).text(),
        "SR": $(tds.eq(7)).text()
    });
});
BOWLING SUMMARY:
/* -------------- STAGE 1 ------------ */
collect(parse());
// Collect one record per row of the bowling table.
var bowlingSummary = []
firstInningRows.each((index, element) => {
    var tds = $(element).find('td');
    bowlingSummary.push({
        "match": matchInfo,
        "bowlingTeam": team2,
        "bowlerName": $(tds.eq(0)).find('a > span').text().replace(' ', ''),
        "overs": $(tds.eq(1)).text(),
        "maiden": $(tds.eq(2)).text(),
        "runs": $(tds.eq(3)).text(),
        "wickets": $(tds.eq(4)).text(),
        "economy": $(tds.eq(5)).text(),
        "0s": $(tds.eq(6)).text(),
        "4s": $(tds.eq(7)).text(),
        "6s": $(tds.eq(8)).text(),
        "wides": $(tds.eq(9)).text(),
        "noBalls": $(tds.eq(10)).text()
    });
});
return {"bowlingSummary": bowlingSummary}
MATCH RESULTS:
/* -------------- STAGE 1 ------------ */
navigate('https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?id=14450;type=tournament');
collect(parse());
//Step3: Loop through each row and read the data from its cells (td)
allRows.each((index, element) => {
    const tds = $(element).find('td'); // find the td cells of this row
    matchSummary.push({
        'team1': $(tds[0]).text(),
        'team2': $(tds[1]).text(),
        'winner': $(tds[2]).text(),
        'margin': $(tds[3]).text(),
        'ground': $(tds[4]).text(),
        'matchDate': $(tds[5]).text(),
        'scorecard': $(tds[6]).text()
    })
})
PLAYER INFORMATION:
navigate('https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?id=14450;type=tournament');
// Build the list of scorecard URLs from the match results table.
let links = []
const allRows = $('table.engineTable > tbody > tr.data1');
allRows.each((index, element) => {
    const tds = $(element).find('td');
    const rowURL = "https://www.espncricinfo.com" + $(tds[6]).find('a').attr('href');
    links.push(rowURL);
})
return {
    'matchSummaryLinks': links
};
"team": team2,
"link": "https://www.espncricinfo.com" + $(tds.eq(0)).find('a').attr('href')
});
});
});
});
// Open the player's page and collect the parsed profile fields.
navigate(input.url);
const final_data = parse();
collect({
    "name": input.name,
    "team": input.team,
    "battingStyle": final_data.battingStyle,
    "bowlingStyle": final_data.bowlingStyle,
    "playingRole": final_data.playingRole,
    "description": final_data.content,
});
//---------- 3.b Parser Code ---------//
// Each profile field is a grid cell whose first <p> holds the field label.
const battingStyle = $('div.ds-grid > div').filter(function(index){
    return $(this).find('p').first().text() === String('Batting Style')
})
const bowlingStyle = $('div.ds-grid > div').filter(function(index){
    return $(this).find('p').first().text() === String('Bowling Style')
})
const playingRole = $('div.ds-grid > div').filter(function(index){
    return $(this).find('p').first().text() === String('Playing Role')
})
return {
    "battingStyle": battingStyle.find('span').text(),
    "bowlingStyle": bowlingStyle.find('span').text(),
    "playingRole": playingRole.find('span').text(),
    "content": $('div.ci-player-bio-content').find('p').first().text()
}
PROCESS BATTING SUMMARY:
PROCESS BOWLING SUMMARY:
PROCESS PLAYER INFORMATION: