Data Glossary - Michael Dillon
I post free tips to help you transform your career via online networking. Connect with me on
LinkedIn here. Feel free to drop by and ask me anything about your career!
In other exciting news, I recently self-published my first e-book, How to Find a Job in Data Analytics.
I'm a big fan of online networking. I love learning from others. I love shortcuts.
I know you are super busy. So, I condensed this strategy and advice into bite-sized interviews that
you can read in 5 minutes.
Secondly, I just released a guide ‘How to use ChatGPT to Transform your Career’.
If you are one of the first readers of this glossary, I want to thank you.
So, the first 25 people who grab one can use this discount code for 25% off!
Code: Thanks25
Enjoy,
Michael
Data Glossary:
• Alteryx can be used to speed up your processes, automate your processes, and enable
• Apache Hadoop is an open-source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large
computer to store and process the data, Hadoop allows clustering multiple computers to
• Apache HBase is an open-source, NoSQL, distributed big data store, effective for handling
• Apache Phoenix: many customers use this data store for deploying machine learning-based
applications, web-scale and mobile apps, customer-facing dashboards, etc. It
abstracts the underlying data store and allows you to query data using standard SQL.
• API stands for application programming interface. They allow different applications to talk to
• ArcGIS is a cloud-based mapping and analysis solution. Use it to make maps, analyse data,
• Array: A data structure that contains a group of elements. Typically, these elements are all the
• Applicant Tracking Software (ATS) is a software application that helps companies organize
• Amazon Web Services (AWS) is Amazon’s cloud computing platform. Amazon controls a
• Azure Storage Explorer is a GUI-based tool IT teams can use to oversee storage operations
• Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on
• Azure Synapse Analytics is an analytics service that brings together data integration,
• BuildWithAI is an initiative which leverages data science and AI to help humanity solve the
• Business intelligence (BI) uses business analytics, data mining and data visualization to help a
classification or image segmentation model that has been trained using Caffe.
• Cloud computing is a technology for storing and managing data on remote servers so it can be
• Comma-separated values (CSV) files are delimited text files that use a comma to separate
• Customer relationship management (CRM) is a technology for managing all your company's
• Dataiku is an artificial intelligence and machine learning company which was founded in
2013.
• Dashboards are business intelligence (BI) reports. They are set up by BI Developers and
• Databricks combines the best of data warehouses and data lakes into a lakehouse architecture.
• Data Factory (Azure Data Factory) is a cloud-based ETL and data integration service that allows you to create
data-driven workflows for orchestrating data movement and transforming data at scale.
• Data ingestion pipelines move streaming data and batched data from pre-existing databases
• Data mining is the process of finding anomalies, patterns and correlations within large data
operational systems and made available for ad-hoc queries and scheduled reporting.
• Data lake: A storage repository for raw data. It holds structured and unstructured data. A data
• Data Management Capability Assessment Model (DCAM) is an industry-standard framework for
data management.
• Deep learning is a subset of machine learning, which is essentially a neural network with
three or more layers. These neural networks attempt to simulate the behaviour of the human
• ETL, which stands for extract, transform and load, is a data integration process that
combines data from multiple data sources into a single, consistent data store that is loaded
• Enterprise resource planning (ERP) is software that manages a company's financials, supply
• Full-stack technology refers to the entire depth of a computer system application. Full stack
developers straddle two separate web development domains: the front end and the back end.
The front end includes everything that a client, or site viewer, can see and interact with.
• Google Cloud Platform (GCP) is a suite of public cloud computing services run by Google
• ggplot2 (GG Plot) is a plotting package that provides helpful commands to create complex plots from
• Go is an open-source programming language that makes it easy to build simple, reliable, and
efficient software.
• Google Data Studio is a free tool that turns your data into informative, easy to read, easy to
• HackerRank is the market-leading technical assessment and remote interview solution for
hiring developers.
• Java can be used to build applications for a wide range of platforms. Desktops, servers,
mobile phones, tablets, Blu-ray players, televisions, and web browsers all use Java.
• JavaScript is used to develop interactive web applications. JavaScript can power features like
• Jupyter Notebook is an open-source web application that allows data scientists to create and
share documents that integrate live code, equations, computational output, visualizations, etc.
• Kaizen Blitz Leader practices the Kaizen philosophy & takes note of feedback and insights
also used for distributed training of deep learning models. Keras is used by companies such as
• KNIME is a free and open-source data analytics, reporting and integration platform.
• Lakehouse: A new, open architecture that combines the best elements of data lakes and data
warehouses.
visualizations in Python.
• MarTech otherwise known as Marketing Technology, is the term for the software and tech
• Machine Learning (ML) is the study of algorithms that can improve by being exposed to more
data. It is seen as a part of artificial intelligence. ML can automate predictive model building.
• Metric: a system or standard of measurement. For example, among marketing metrics, ‘email
opens’ and ‘clicks’ can show engagement level, while the ‘unsubscribe rate’ can indicate if
• Natural Language Toolkit (NLTK) is a leading platform for building Python programs to
• NumPy is a library for the Python programming language used for working with arrays.
intelligence concerned with the interactions between computers and human language, how to
program computers to process and analyse large amounts of natural language data.
relationships in a set of data through a process that mimics the way the human brain operates.
• Python is an open-source programming language. Python has many libraries, and its
flexibility makes it one of the most popular programming languages. It can be used for
scripting, web scraping, data analysis, data visualisation and much more.
• PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark
applications using Python APIs, but also provides the PySpark shell for interactively
• QlikView lets users freely explore data on any device using powerful global search and
• React is a free and open-source front-end JavaScript library for building user interfaces based
on UI components.
• Selenium is an open-source tool for automating web browsers, with bindings for Python and
other languages. One common task is web scraping, which extracts useful data and information
that may otherwise be unavailable.
• SciPy is a free and open-source Python library used for scientific computing and technical
computing.
• spaCy is an open-source software library for advanced natural language processing, written in
• SPSS Modeler is a data mining and text analytics software application from IBM. It is used to
• Scikit-learn is a free software machine learning library for the Python programming language.
• Structured data is data in a standardized format. It's easily accessible and ready to be
• SQL Database Modeler (SqlDBM) is an online browser-based database design tool that also
• A SQL script is a set of SQL commands saved as a file in SQL Scripts. You can use SQL
manage relational databases and perform various operations on the data in them.
data warehousing applications. This tool provides support for multiple data warehouse
operations simultaneously.
• Towards Data Science is a ‘Medium’ publication sharing concepts, ideas and code.
• LSTMs (long short-term memory networks) use a series of 'gates' which control how the information in a sequence of data comes
into, is stored in and leaves the network. These gates can be thought of as filters and are each
• Looker is a business intelligence software and big data analytics platform that helps you
• Machine Learning is a method of data analysis that automates analytical model building. It is
a branch of artificial intelligence based on the idea that systems can learn from data, identify
to help developers get started and get good at applied machine learning.
• Metabase is the open-source tool that makes everyone feel like they've got data superpowers.
• SAS develops and markets a suite of analytics software, which helps access, manage, analyse
• Snowflake is a cloud computing-based data warehouse. Snowflake is used for data storage,
processing and analytics. It’s provided as Software-as-a-Service (SaaS) and it runs on cloud
infrastructure.
• Spark (aka Apache Spark) is a multi-language engine for executing data engineering, data
• SQL (Structured Query Language) is a programming language that allows the user to write
code to communicate with a database. By writing lines of code, a user can update data and
programmers
• Software engineer is a person who applies the principles of software engineering to design,
intelligence.
• TensorFlow is a free and open-source software library for machine learning and artificial
intelligence.
• Unstructured data is data that doesn’t fit in a spreadsheet with rows and columns. A social
media post, a video or a text message are all examples of unstructured data.
• Vision (Computer Vision) is a field of artificial intelligence (AI) that enables computers and
systems to derive meaningful information from digital images, videos and other visual
• Web developer is someone who creates reliable and high-performing web-based applications
and services.
• Weka is a collection of machine learning algorithms for data mining tasks. It contains tools
for data preparation, classification, regression, clustering, association rules mining, and
visualization.
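
A few of the terms above (CSV, ETL, SQL and structured data) are easier to see working together than to read about. Here is a minimal Python sketch using only the standard library. The sample data, the table name `engagement` and the function names are all made up for illustration, so treat it as a rough picture of the idea and not how any particular tool does it.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file exported from another system.
RAW_CSV = """name,email_opens,clicks
alice,10,3
bob,4,0
carol,7,5
"""


def extract(text):
    """Extract: parse comma-separated text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: capitalise names and convert the counts from text to integers."""
    return [
        (row["name"].title(), int(row["email_opens"]), int(row["clicks"]))
        for row in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into a relational (structured) table."""
    conn.execute(
        "CREATE TABLE engagement (name TEXT, email_opens INTEGER, clicks INTEGER)"
    )
    conn.executemany("INSERT INTO engagement VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run extract, transform and load against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


if __name__ == "__main__":
    conn = run_etl()
    # SQL query: who clicked at least once, most clicks first?
    query = "SELECT name, clicks FROM engagement WHERE clicks > 0 ORDER BY clicks DESC"
    for row in conn.execute(query):
        print(row)
```

Once the data sits in a table with named columns, it is structured data, and a single SQL query can answer questions about it. Real pipelines usually swap the in-memory database for a data warehouse and the hard-coded text for files or streams.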