Data Glossary - Michael Dillon


Hello there!

Thanks for downloading this data glossary! I’m Michael Dillon.

I post free tips to help you transform your career via online networking. Connect with me on
LinkedIn here. Feel free to drop by and ask me anything about your career!

How to Find a Job in Data Analytics

In other exciting news, I recently self-published my first e-book, How to Find a Job in Data Analytics.

I'm a big fan of online networking. I love learning from others. I love shortcuts.

I interviewed 64 data experts because I wanted to show you:

How to LAND a job
How to fix your RESUME
Which data job is right for YOU
Which TOOLS are best to learn
How to land a job PROMOTION
How to perform better in INTERVIEWS
How to do the data PROJECTS to land a job

I know you are super busy. So, I condensed this strategy and advice into bite-sized interviews that
you can read in 5 minutes.

Secondly, I just released a guide, ‘How to use ChatGPT to Transform your Career’.

You can find both e-books here: https://app.gumroad.com/michae1di11on.

If you are one of the first readers of this glossary, I want to thank you.

So, the first 25 people who grab one can use this discount code for 25% off!

Code: Thanks25

If you enjoy this content, reach out and say hello.

Feel free to ask me anything!

Enjoy,

Michael
Data Glossary:

• Alteryx is a data analytics platform that can be used to speed up and automate your processes

and to enable predictive and geospatial solutions.

• Apache Hadoop is an open-source framework that is used to efficiently store and process

large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large

computer to store and process the data, Hadoop allows clustering multiple computers to

analyse massive datasets in parallel more quickly.

• Apache HBase is an open-source, NoSQL, distributed big data store, effective for handling

large, sparse datasets.

• Apache Phoenix is an open-source SQL layer over Apache HBase. It abstracts the underlying

data store and allows you to query data using standard SQL. Many customers use it for

machine learning-based applications, web-scale and mobile apps, customer-facing dashboards, etc.

• Amdocs is a software & services provider to communications and media companies.

• Android is a mobile operating system.

• API stands for application programming interface. APIs allow different applications to talk to

each other and exchange information.

• ArcGIS is a cloud-based mapping and analysis solution. Use it to make maps, analyse data,

and to share and collaborate.

• Array: A data structure that contains a group of elements. Typically, these elements are all the

same data type, such as an integer or string.

• Applicant Tracking Software (ATS) is a software application that helps companies organize

candidates for hiring and recruitment purposes.

• Amazon Web Services (AWS) is Amazon’s cloud computing platform. Amazon holds roughly a

third of the cloud infrastructure market.

• Artificial intelligence (AI) is intelligence demonstrated by machines: the ability of a computer,

or a robot controlled by a computer, to do tasks that are usually done by humans.


• AI modelling is the creation, training, and deployment of machine learning algorithms that

emulate logical decision-making based on available data.

• Automate means to make a process automatic.

• Automated models are AI systems that design other AI systems.

• Azure is a cloud computing service operated by Microsoft.

• Azure Storage Explorer is a GUI-based tool IT teams can use to oversee storage operations

related to the Azure public cloud.

• Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on

Azure Blob Storage.

• Azure Synapse Analytics is an analytics service that brings together data integration,

enterprise data warehousing, and big data analytics.

• BuildWithAI is an initiative which leverages data science and AI to help humanity solve the

world's biggest challenges.

• Business intelligence (BI) uses business analytics, data mining and data visualization to help a

business make data-driven decisions.

• Caffe is a deep learning framework.

• A Caffemodel file is a machine learning model created by Caffe. It contains an image

classification or image segmentation model that has been trained using Caffe.

• Cloud computing is a technology for storing and managing data on remote servers so it can be

accessed via the internet (as opposed to a database on a user’s computer).

• C++ is a popular programming language.

• C# is a general-purpose, multi-paradigm programming language.

• Comma-separated values (CSV) files are delimited text files that use a comma to separate

values. Each line of the file is a data record.

• Customer relationship management (CRM) is a technology for managing all your company's

relationships and interactions with customers and potential customers.


• Cross-Industry Standard Process for Data Mining (CRISP-DM) is an industry-proven way to

guide your data mining efforts.

• Coursera is an online course provider.

• Dataiku is an artificial intelligence and machine learning company which was founded in

2013.

• Datorama is an AI-powered marketing intelligence platform. It helps you make smarter decisions

by connecting and acting on all of your marketing data, investments, and KPIs.

• Dashboards are business intelligence (BI) reports. They are typically set up by BI developers and

update automatically as new data arrives, often daily.

• Databricks combines the best of data warehouses and data lakes into a lakehouse architecture.

• A database is an organized collection of structured information, or data, typically stored

electronically in a computer system.

• Data Factory (Azure Data Factory) is a cloud-based ETL and data integration service that allows you to create

data-driven workflows for orchestrating data movement and transforming data at scale.

• Data ingestion pipelines move streaming data and batched data from pre-existing databases

and data warehouses to a data lake.

• Data mining is the process of finding anomalies, patterns and correlations within large data

sets to predict outcomes.

• A data warehouse is a repository of an organization's electronically stored data extracted from

operational systems and made available for ad-hoc queries and scheduled reporting.

• Data lake: A storage repository for raw data. It holds structured and unstructured data. A data

warehouse is a store for structured data.

• Data Management Capability Assessment Model (DCAM) is an industry-standard framework for

data management.

• Deep learning is a subset of machine learning based on neural networks with

three or more layers. These neural networks attempt to simulate the behaviour of the human

brain, allowing them to ‘learn’ from large amounts of data.


• Embedded C is a set of language extensions for the C programming language.

• ETL, which stands for extract, transform and load, is a data integration process that

combines data from multiple data sources into a single, consistent data store that is loaded

into a data warehouse or other target system.

• Enterprise resource planning (ERP) is software that manages a company's financials, supply

chain, operations, etc.

• Full-stack technology refers to the entire depth of a computer system application. Full stack

developers straddle two separate web development domains: the front end and the back end.

The front end includes everything that a client, or site viewer, can see and interact with.

The back end covers the servers, databases and application logic behind it.

• Google Cloud Platform (GCP) is a suite of public cloud computing services run by Google.

• ggplot2 (GG Plot) is a plotting package for R that provides helpful commands to create complex plots from

data in a data frame.

• Go is an open-source programming language that makes it easy to build simple, reliable, and

efficient software.

• Google Data Studio is a free tool that turns your data into informative, easy to read, easy to

share, and fully customizable dashboards and reports.

• HackerRank is a technical assessment and remote interview platform for

hiring developers.

• Informatica is a data integration tool based on ETL architecture.

• Java can be used to build applications for a wide range of platforms. Desktops, servers,

mobile phones, tablets, Blu-ray players, televisions, and web browsers all use Java.

• JavaScript is used to develop interactive web applications. JavaScript can power features like

interactive images, carousels, and forms.

• Jupyter Notebook is an open-source web application that allows data scientists to create and

share documents that integrate live code, equations, computational output, visualizations, etc.

• A Kaizen Blitz Leader practices the Kaizen philosophy and takes note of feedback and insights

from every member of the organization.


• Keras is an open-source deep learning library for Python. It is used for creating deep learning

models, which can be deployed on smartphones, and for distributed training of those models.

Keras is used by companies such as Netflix, Yelp, Uber, etc.

• KNIME is a free and open-source data analytics, reporting and integration platform.

• KX is a data analysis software developer and vendor.

• Lakehouse: A new, open architecture that combines the best elements of data lakes and data

warehouses.

• Matplotlib is a comprehensive library for creating static, animated, and interactive

visualizations in Python.

• MATLAB is a programming and numeric computing platform used by millions of engineers

and scientists to analyse data, develop algorithms, and create models.

• MarTech, otherwise known as Marketing Technology, is the term for the software and tech

tools marketers leverage to plan, execute, and measure marketing campaigns.

• Machine Learning (ML) is the study of algorithms that can improve by being exposed to more

data. It is seen as a part of artificial intelligence. ML can automate predictive model building.

• Metric: a system or standard of measurement. For example, the marketing metrics ‘email

opens’ and ‘clicks’ can show engagement level, while the ‘unsubscribe rate’ can indicate whether

audiences find your content interesting and relevant.

• Natural Language Toolkit (NLTK) is a leading platform for building Python programs to

work with human language data.

• NumPy is a library for the Python programming language used for working with arrays.

• Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial

intelligence concerned with the interactions between computers and human language, in particular

how to program computers to process and analyse large amounts of natural language data.

• Neural networks are a series of algorithms that endeavour to recognize underlying

relationships in a set of data through a process that mimics the way the human brain operates.

• OpenCV is a library of programming functions mainly aimed at real-time computer vision.


• Pandas is a software library written for the Python programming language for data

manipulation and analysis.

• PHP is a general-purpose scripting language geared toward web development.

• Python is an open-source programming language. Python has many libraries, and its

flexibility makes it one of the most popular programming languages. It can be used for

scripting, web scraping, data analysis, data visualisation and much more.

• PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark

applications using Python APIs, but also provides the PySpark shell for interactively

analysing your data in a distributed environment.

• QlikView lets users freely explore data on any device using powerful global search and

selections, smart visualizations and more.

• Power BI is an interactive data visualization software product developed by Microsoft with

a primary focus on business intelligence.

• R is a free software environment for statistical computing and graphics.

• Reinforcement learning is a machine learning training method based on rewarding desired

behaviours and/or punishing undesired ones.

• React is a free and open-source front-end JavaScript library for building user interfaces based

on UI components.

• Salesforce is an American cloud-based software company, providing customer relationship

management software and applications focused on sales, customer service, marketing

automation, analytics, and application development.

• Selenium is a tool for automating web browsers, with libraries for Python and other languages.

One use is web scraping to extract useful data and information that may be otherwise

unavailable.

• SciPy is a free and open-source Python library used for scientific computing and technical

computing.
• spaCy is an open-source software library for advanced natural language processing, written in

the programming language Python.

• SPSS modeler is a data mining and text analytics software application from IBM. It is used to

build predictive models and conduct other analytic tasks.

• Snowflake is a cloud computing–based data warehousing company.

• Scikit-learn is a free software machine learning library for the Python programming language.

• Structured data is data in a standardized format. It is easily accessible and ready to be

used. Examples include names, dates and addresses.

• SQL Database Modeler (SqlDBM) is an online browser-based database design tool that also

imports existing DB, generates SQL, and allows collaboration.

• A SQL script is a set of SQL commands saved as a file. Tools with a SQL Scripts feature let you

create, edit, view, run, and delete script files.

• Structured Query Language (SQL) is a standardized programming language that is used to

manage relational databases and perform various operations on the data in them.

• Teradata is a proprietary relational database management system, queried with Teradata SQL, for developing large-scale

data warehousing applications. This tool provides support for multiple data warehouse

operations simultaneously.

• Towards Data Science is a Medium publication sharing concepts, ideas and code.

• Long short-term memory (LSTM) networks use a series of 'gates' which control how the information in a sequence of data comes

into, is stored in and leaves the network. These gates can be thought of as filters and are each

their own neural network.

• Looker is a business intelligence software and big data analytics platform that helps you

explore, analyse and share real-time business analytics easily.

• Machine Learning is a method of data analysis that automates analytical model building. It is

a branch of artificial intelligence based on the idea that systems can learn from data, identify

patterns and make decisions with minimal human intervention.


• Machine Learning Mastery is an online community and store that offers support and training

to help developers get started and get good at applied machine learning.

• Metabase is an open-source business intelligence tool that lets people across a company explore and visualise data.

• MongoDB is a source-available cross-platform document-oriented database program.

• SAS develops and markets a suite of analytics software, which helps access, manage, analyse

and report on data to aid in decision-making.

• Spyder is an open-source cross-platform integrated development environment for scientific

programming in the Python language.

• Snowflake is a cloud computing-based data warehouse. Snowflake is used for data storage,

processing and analytics. It’s provided as Software-as-a-Service (SaaS) and it runs on cloud

infrastructure.

• Spark (aka Apache Spark) is a multi-language engine for executing data engineering, data

science, and machine learning on single-node machines or clusters.

• SQL (Structured Query Language) is a programming language that allows the user to write

code to communicate with a database. By writing lines of code, a user can update data and

retrieve data from a database.

• Stack Overflow is a question-and-answer website for professional and enthusiast

programmers.

• A software engineer is a person who applies the principles of software engineering to design,

develop, maintain, test, and evaluate computer software.

• Tableau is an American interactive data visualization software company focused on business

intelligence.
• TensorFlow is a free and open-source software library for machine learning and artificial

intelligence.

• Unstructured data is data that doesn’t fit in a spreadsheet with rows and columns. A social

media post, a video or a text message are all examples of unstructured data.

• Vision (Computer Vision) is a field of artificial intelligence (AI) that enables computers and

systems to derive meaningful information from digital images, videos and other visual

inputs — and take actions or make recommendations based on that information.

• A web developer is someone who creates reliable and high-performing web-based applications

and services.

• Weka is a collection of machine learning algorithms for data mining tasks. It contains tools

for data preparation, classification, regression, clustering, association rules mining, and

visualization.

• YOU CANalytics: A blog exploring the power of data analytics.
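
To see how a few of these terms fit together (CSV, ETL, structured data, database and SQL), here is a minimal sketch in Python using only the standard library. The sample data, the table name `customers` and the function names `extract`, `transform` and `load` are made up for illustration; a real pipeline would usually read from actual files and load into a proper data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw data, standing in for a CSV file exported from a source system.
RAW_CSV = """name,signup_date,city
 alice ,2023-01-15,Dublin
BOB,2023-02-01,cork
"""


def extract(text):
    """Extract: read each CSV line into a dict keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim stray whitespace and standardise capitalisation."""
    return [
        (
            row["name"].strip().title(),
            row["signup_date"].strip(),
            row["city"].strip().title(),
        )
        for row in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute("CREATE TABLE customers (name TEXT, signup_date TEXT, city TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
load(transform(extract(RAW_CSV)), conn)

# A SQL query against the loaded, structured data.
result = conn.execute("SELECT name, city FROM customers ORDER BY name").fetchall()
print(result)
```

Each row of the CSV is one data record. After the transform step, the names and cities follow one consistent format, which is what makes the data "structured" and easy to query with SQL.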
