GIT
What is GIT?
Git, as you probably already know, is one of the most widely used source control systems. There are others, such as
Microsoft Team Foundation Server, Perforce, SVN and Mercurial. Git, however, is in my opinion the most powerful and
the easiest to use, which is why it’s so widely adopted.
Enter source control systems, which allow developers to collaborate easily by working on the same files and merging
the changes using intelligent algorithms, while keeping a history of every change that was made so that the
code repository can easily be reverted to any point in time.
GIT architecture
Git is designed in such a way that there is no hierarchy of roles among repositories. This means that, unlike in
other source control software (such as MS TFS or Perforce), there is no distinction between the server’s repository and
the client’s repository; they are exactly the same, functionality-wise. We will explain this in more detail.
Traditional hierarchical architecture
The first source control systems used to be designed in a different way – there was an almighty server on which the
repository along with the entire history existed and this server choreographed everything that happened on all machines
that participated in development.
That means that when a developer wanted to participate, he would have to get a copy of the code from the server, ask
the server for permission to change a file (check-out operation – that file would then become locked and no one else
could edit it), make the changes, then send them to the server (check-in operation); the server would receive the
changes, merge them into its local copy of the code and record a commit.
The developers had to have a network connection to the source control server at any moment they wanted to
make commits or edit a new file.
Two developers could not edit the same file at the same time.
There was only one location where the history existed – the server, which rendered periodic backups critical.
This means you can simply go ahead and create a local git repository on your machine, then start using it by committing
changes and so on. This has neither a „server” nor a „client” flavour, it simply is.
Let’s say later on you want to make a local fork of this repository. There are two options:
1. Just copy/paste the entire folder and you’ll now have a copy of the repository in another folder, but the two are
completely independent from this point on: neither knows about the other, so they share no data unless you later
configure one as a remote of the other.
2. Clone the repository into a different folder using the git command line; this will create a real fork, where the
new repository is identical to the original, but it has the knowledge of having been forked from the original and
changes can be pushed and pulled between the two. Note that there is no „server” or „client” involved in this
scheme, we simply use the git command line interface.
Let’s say another developer wants to participate in your project, so he wants to be able to clone this repository onto his
machine and make changes, then push them back into your original repository. Only at this point do you need to install
git server software, simply to allow network connections to your local repository. The other developer would use the git
command line to clone the repository from you; your original repository now becomes the „origin” for his local clone.
A cloned repository can also become „origin” for another fork and so on. There is no hierarchy, but only flat
relationships – see figure on next page.
In the same figure you can see that the repositories are completely agnostic of the machine they’re hosted on. The git
server is only required to facilitate cloning and pushing from one machine onto another, but not between repositories
within the same machine.
Another thing that can be observed from the figure is that all three repositories are completely identical in status and
amount of information contained within. The only difference is that some of them have their „origin” address set.
Git files
So where does Git hold all its data? There’s no dedicated database or anything complicated, everything is held in plain
files within the „.git” subdirectory of your repository (mind the dot in front of the name).
hooks/ this is where „hook” scripts can be added to intercept various git actions
HEAD contains a reference to the current branch, whose tip is the most recent commit on that branch
config contains the git configuration for this repository (see next figure)
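To look at these files on a real repository, a minimal sketch such as the following can be used. It assumes git is installed and works in a throwaway directory created with mktemp; the variable names are only illustrative.

```shell
set -e
# Hypothetical scratch location; any empty directory would do.
repo=$(mktemp -d)
git init -q "$repo"

# List the top-level entries of the .git directory (HEAD, config, hooks/ and others).
ls "$repo/.git"

# HEAD is a small text file that names the current branch.
head_content=$(cat "$repo/.git/HEAD")
echo "$head_content"
```

The exact branch name printed from HEAD depends on how git is configured on the machine.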
Git configuration
There are two types of configuration for git: one global (for all repositories belonging to the user on the current
machine) and one local (specific to the current repository).
Both can be accessed either by editing the specific config files or by using the „git config” command.
The git-config command can be used for example for setting the user’s details on the machine level before cloning or
creating any repositories. For example the following commands display the current configuration:
~/.gitconfig the global configuration file (user details and global preferences)
o This can also be edited with the command „git config --global -e”
<repository dir>/.git/config the local (repository specific) config file; this file contains configuration for
branches, origin, fetch method etc.
o This can also be edited with the command „git config -e”
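As a rough illustration of the local configuration, the sketch below sets and reads back repository-level values in a scratch repository. The user name and e-mail address are placeholders, and the global variant is shown only as a comment so that running the sketch does not touch ~/.gitconfig.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"

# Repository-level settings are written to .git/config (placeholder identity).
git config user.name "Example User"
git config user.email "user@example.com"

# Show only the settings stored in this repository's .git/config.
git config --local --list

# Read a single value back.
name=$(git config --get user.name)
echo "$name"

# Machine-wide equivalent (not run here, since it would edit ~/.gitconfig):
#   git config --global user.name "Example User"
```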
One more repository-specific piece of configuration is the “.gitignore” file, which contains patterns (one per line) for
files that git should ignore. Any untracked file that matches one of those patterns will not be reported by git
and will not be taken into consideration when running the change detection. Files that are already tracked are not affected.
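A small sketch of the ignore mechanism, assuming a scratch repository and made-up file names (app.log, notes.txt, build/out.bin); the two patterns are only examples.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"

# One pattern per line: all .log files and everything under build/.
printf '%s\n' '*.log' 'build/' > .gitignore

touch app.log notes.txt
mkdir build
touch build/out.bin

# The ignored files do not show up among the untracked files.
status=$(git status --porcelain)
echo "$status"

# check-ignore exits with status 0 when the path matches a pattern.
git check-ignore -q app.log && echo "app.log is ignored"
```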
Git clients
There are several graphical git clients out there, including ones that integrate into editors or development environments
and there is the standard command line client.
All graphical clients eventually translate all actions into standard git commands, but those are hidden from the user, so
you don’t have complete knowledge of or control over what the graphical interface is doing behind the scenes. For this
reason, I prefer to use the command line client directly, since it will do precisely what it is told and nothing else. On top
of that, graphical interfaces usually don’t offer the entire spectrum of git functionality, whereas the command line can do
everything.
In the following sections we’ll be referring to the standard git command line client.
A new SHA is computed for the new commit based on this data. Once committed, there’s no way of altering the history
without recomputing everything that comes after the point at which edits are made.
Note the history does not have to be linear. In fact, this is one of the most important features of git – non-linear
histories are used for branching and merging. Thus, several commits can share the same parent SHA.
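The link between a commit and its parent can be inspected directly. The following sketch, run in a scratch repository with a placeholder identity and made-up file names, creates two commits and prints the raw content of the second one.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "first commit"
first=$(git rev-parse HEAD)

echo "two" >> file.txt
git commit -q -a -m "second commit"

# Print the raw commit object: tree, parent, author, committer and message.
git cat-file -p HEAD

# The parent recorded in the second commit is the SHA of the first one.
parent=$(git rev-parse HEAD^)
```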
Creating a new repository
To create a new repository in a directory, go to that directory and issue the command “git init”.
A new folder “.git” is created. If the directory contained any files before running the command, those files remain
untracked for the moment and are not added automatically to the repository; they have to be added and committed explicitly.
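As a sketch of this behaviour, the commands below initialise a repository in a directory that already contains a file. The directory, the file name readme.txt and the identity are assumptions made for the example.

```shell
set -e
dir=$(mktemp -d)           # hypothetical project directory
cd "$dir"
echo "hello" > readme.txt  # a file that exists before the repository does

git init -q

# The existing file is reported with the ?? (untracked) marker.
before=$(git status --porcelain)
echo "$before"

# It only becomes part of the repository once it is added and committed.
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git add readme.txt
git commit -q -m "Add readme"
tracked=$(git ls-files)
```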
For cloning a remote repository via SSH, we issue the “git clone ssh://user@server/repo_name” command, where “user”
is the SSH username, “server” is the IP address or hostname of the remote server and “repo_name” is the name or path of
the repository as configured on that machine. Git also accepts the shorter scp-like form “user@server:repo_name”,
without the “ssh://” protocol specifier.
Git clone automatically creates a new folder with the name of the repository in the current directory, and places all of
the repository’s contents within that folder.
For cloning a local repository we issue the “git clone” command using the directory path of the original repository.
After cloning it, if we look at the “.git/config” file we can see the origin of this new fork is set to a local disk path of the
original repository:
The locally forked repository can now pull changes from the original local repository and can also push new commits.
This is not something that you would use in everyday life, but it demonstrates the flat architecture of Git and the lack of
a need of a dedicated server.
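The local fork scenario can be tried out with a sketch like the one below. The directory names original and fork, the file name and the identity are placeholders; everything happens inside a scratch directory and no server is involved.

```shell
set -e
work=$(mktemp -d)          # hypothetical scratch directory
cd "$work"

git init -q original
git -C original config user.name "Example User"       # placeholder identity
git -C original config user.email "user@example.com"
echo "v1" > original/file.txt
git -C original add file.txt
git -C original commit -q -m "first commit"

# Clone by directory path.
git clone -q original fork

# The fork remembers where it came from.
origin_url=$(git -C fork config --get remote.origin.url)
echo "$origin_url"

# A new commit made in the original can be pulled into the fork.
echo "v2" >> original/file.txt
git -C original commit -q -a -m "second commit"
git -C fork pull -q
```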
Once cloned, work can be done on a Git repository in an offline manner since all changes are recorded locally.
Ignored (if it matches a pattern from the .gitignore file) – in this case git will never report it as “modified” or
“untracked”, it will simply pretend the file is not there
Untracked – this is a new file that git doesn’t know about; it can be either added to git, removed or ignored.
Clean – this file is known to git and it has not been modified since the last commit.
Modified – this file is known to git and it has detected changes, or edits to the file since the last commit.
Mode changed – the file’s contents have not changed, but its attributes or access permissions have.
Deleted – a file that was known to git is now missing from the repository – a commit can be made to mark this
modification into the history.
A quick summary of the status of every interesting file in the current repository can be shown using the “git status”
command (interesting meaning anything different from “clean” and “ignored”).
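To see several of these states side by side, a sketch along the following lines can help. The file names are invented for the example, and the compact --short output format is used instead of the default long report.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "a" > kept.txt
echo "b" > edited.txt
echo "c" > removed.txt
git add kept.txt edited.txt removed.txt
git commit -q -m "initial commit"

echo "change" >> edited.txt    # becomes "modified"
rm removed.txt                 # becomes "deleted"
echo "new" > added.txt         # is "untracked"

# One line per interesting file; clean files such as kept.txt are omitted.
status=$(git status --short)
echo "$status"
```

In the short format, “M” marks a modified file, “D” a deleted one and “??” an untracked one.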
Staging
Staging is a concept unique to git. The “staging” area is a list of files that are marked to be committed by the next
commit operation.
If the user modifies some files and attempts to make a commit directly, nothing will be committed unless those files
were staged beforehand.
This allows the user to “stage” some files for a later commit and then continue to make changes to other files that will
not be committed. It also allows you to selectively commit only some modifications and keep others as uncommitted changes.
To add files to the staging area, we use the command “git add <file>”. To see which files are staged for commit, we
use the command “git status” which will show the staged files in green and the non-staged files in red.
We can remove a file from the staging area with the command “git restore --staged <file>”.
We can also remove all files from the staging area with the command “git reset HEAD”.
To add all modified and new (untracked) files from the current directory and its subdirectories to the staging area, we
can use “git add .”. To add only the modified files that git already tracks, leaving new files untracked, we can use
“git add -u”.
To add all modified, deleted and untracked (new) files from the whole working tree, we can use “git add -A”.
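The staging commands can be compared with a sketch such as this one. It assumes a scratch repository, made-up file names and Git 2.23 or later (for “git restore”); “git diff --cached --name-only” is used here to list what is currently staged.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "a" > a.txt
echo "b" > b.txt
git add a.txt b.txt
git commit -q -m "initial commit"

echo "edit" >> a.txt
echo "edit" >> b.txt
echo "new" > c.txt                               # untracked file

git add a.txt                                    # stage a single file
staged_one=$(git diff --cached --name-only)

git restore --staged a.txt                       # unstage it again
staged_none=$(git diff --cached --name-only)

git add -u                                       # tracked files only
staged_tracked=$(git diff --cached --name-only)

git add -A                                       # tracked and untracked files
staged_all=$(git diff --cached --name-only)
echo "$staged_all"
```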
Commits
A commit is an entry in the history blockchain. To perform a commit operation, we use the “git commit” command
with various arguments – see below.
Please note that a commit performed on your local repository does not affect any upstream (or “origin”) repositories,
it is only created in the local repository.
When you perform a commit, the current branch pointer is updated to point to the new commit. The HEAD pointer
(which references the tip of the current branch) will also indirectly be affected – it points to the current branch pointer
which points to the new commit.
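The pointer movement described above can be observed with a short sketch. It runs in a scratch repository with a placeholder identity and reads the branch name from HEAD rather than assuming it is called master.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "first commit"

# HEAD names the current branch; the branch holds the SHA of its tip.
branch=$(git symbolic-ref --short HEAD)
before=$(git rev-parse "refs/heads/$branch")

echo "two" >> file.txt
git commit -q -a -m "second commit"

# After the commit the branch pointer holds the new SHA, and HEAD resolves to it.
after=$(git rev-parse "refs/heads/$branch")
head=$(git rev-parse HEAD)
echo "$branch moved from $before to $after"
```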
3. Amend commits - TODO
Let’s say you start working on a task and make some changes to some files, but it’s still work in progress, when
something else comes up and you need to handle it NOW. You will have to restore your working directory to a pristine
condition before working on the new thing, but you don’t want to lose all the changes you made for the other task.
There are two options:
1. Commit your work in progress on a new branch and switch back to master to start working on the new task. This
has disadvantages:
a. Commits should usually be made when the code is in a working condition, which may not be the case
right now
b. It’s more work
2. Stash your changes for later and work on the new task with a clean repository. This is the preferred method
because:
a. It’s quick and easy
b. You’re not creating any partial commits with unfinished work and non-compiling code.
The stash in git is designed like a stack, so you can stash any number of changes by pushing on this stack and later
retrieve them in the reverse order by popping them off the stack.
The stash is also very useful when you start making changes to code only to realize after a while that you’re on the
wrong branch. Just stash the changes, switch to the correct branch and restore your changes. No commits involved, no
pollution of the repository, no branches created in the process.
We stash the changes using the “git stash” command, and restore the latest stashed changes using “git stash
pop”.
Using “git stash pop” restores the changes from the stash and removes the stash entry from the stack. The
changes will then appear again in the “git status” report.
We can see the stack of stashes using the “git stash list” command and we can view a summary of any stash entry
using the “git stash show <index>” command (if index is not provided, it will show the latest stash entry).
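A complete stash round trip might look like the sketch below, which assumes a scratch repository, a placeholder identity and a single made-up file.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "stable" > file.txt
git add file.txt
git commit -q -m "initial commit"

echo "work in progress" >> file.txt

git stash                                  # push the changes onto the stash stack
clean=$(git status --porcelain)            # nothing to report while stashed

git stash list                             # shows the stack of stash entries
count=$(git stash list | wc -l | tr -d ' ')

git stash pop                              # restore the changes, drop the entry
restored=$(git status --porcelain)
remaining=$(git stash list | wc -l | tr -d ' ')
```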
Discarding non-committed changes
If you have made some changes that you’d like to discard, you can do that in several different ways:
1. “git checkout <file>”: this will restore the file to its “clean” state, as it was before you changed it.
2. “git checkout --force <current_branch>”: this will restore all modified files to their “clean” state, but will not
remove any untracked files.
3. “git checkout --force HEAD”: same as the previous command; HEAD refers to the current branch’s tip.
4. “git clean --force”: this will only remove the untracked files, but keep the modified files as they are.
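A sketch of two of these commands in action, using a scratch repository and invented file names. The “--” in the checkout command is an optional separator that tells git the argument is a file path rather than a branch name.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "original" > file.txt
git add file.txt
git commit -q -m "initial commit"

echo "unwanted edit" >> file.txt           # modified tracked file
echo "temp" > scratch.tmp                  # untracked file

# Restore the tracked file; the untracked file is left alone.
git checkout -- file.txt
after_checkout=$(git status --porcelain)
echo "$after_checkout"

# Remove the untracked file.
git clean --force -q
after_clean=$(git status --porcelain)
```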
Branches in Git
As you saw, the history in Git can be branched into many parallel paths. These paths are called “branches” and can be
used to represent different histories that share a common past point and evolved differently.
A branch in Git doesn’t store any information about its path or its current state, it is simply a “pointer”. The branch is a
user-friendly name that points to a specific commit: the tip of that branch, i.e. the most recent commit on that path.
Looking at the diagram in “How Git stores history” you can see there are three such pointers displayed – one is the
“master” which represents the main branch of the repository. It just holds the SHA of the latest commit on that path.
Another one is the “branch_foo” which holds the SHA of the latest commit on its own code path and represents a
parallel branch. The last one is “HEAD” which points to the tip of the current branch.
HEAD is used so that Git knows what branch you’re currently working on, since behind the scenes all of the branches
exist at the same time.
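To view the currently active branch, you can use the command “git branch”, which lists the local branches and marks the
current one with an asterisk. The command “git branch --list” produces the same list, and “git branch --show-current”
(Git 2.22 or later) prints only the name of the current branch.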
To switch your working directory to a different branch, you use the command “git checkout <branch_name>”;
please note that your working directory should be clean, otherwise the switch may fail due to conflicts. You can clean it
up by either committing changes, stashing them, or discarding them:
The files in the working directory will automatically be updated by Git to match the state of the branch. The HEAD
pointer will also be updated to point to the new branch’s tip.
To create a new branch and switch to it immediately, you can use “git checkout -b <branch_name>”. This will
create the new branch from the tip of the previous branch and keep all your modifications. The working directory need
not be clean.
You can now commit your changes onto the new branch.
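The branch workflow above can be sketched as follows. The branch name feature_x is hypothetical, the identity is a placeholder, and the name of the starting branch is read from HEAD instead of being assumed.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "base" > file.txt
git add file.txt
git commit -q -m "initial commit"
start=$(git symbolic-ref --short HEAD)     # branch we started on

echo "draft" >> file.txt                   # uncommitted modification

# Create a branch and switch to it; the modification is carried over.
git checkout -q -b feature_x
git commit -q -a -m "work on feature_x"

git branch                                 # the asterisk marks feature_x

# Switch back; the working directory again matches the starting branch.
git checkout -q "$start"
current=$(git symbolic-ref --short HEAD)
```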
Tags in Git
A tag is also a pointer to a specific commit in git. Several aspects differentiate a tag from a branch:
A branch is dynamic – its pointer always moves to the tip of the history as new commits are added
A tag is fixed – once created it never moves, it will always identify a fixed state of the code as it was on that path
at that moment in time.
A tag can have a name and a description, whereas a branch only has a name.
For these reasons, a tag is very useful for tagging a specific state of the code for a release candidate for example, since
at any later moment you can be sure you can return to it easily. A branch is not so useful for this purpose since new
commits can be added to that branch and the state of the code can be altered.
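To illustrate that a tag stays put while the branch moves on, here is a sketch using an annotated tag (a tag with a description, created with “git tag -a”). The tag name v1.0, its description and the file name are assumptions for the example.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "release candidate" > file.txt
git add file.txt
git commit -q -m "release candidate"

# Annotated tag: a name plus a description.
git tag -a v1.0 -m "Release candidate 1"
tagged=$(git rev-parse "v1.0^{commit}")

echo "later work" >> file.txt
git commit -q -a -m "later work"

# The branch tip has moved, the tag still identifies the earlier commit.
still=$(git rev-parse "v1.0^{commit}")
tip=$(git rev-parse HEAD)
```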
The information included for each commit is the SHA of the commit, the author, timestamp and commit message.
You can view the changes that were made in a commit using the “git show <SHA>” command:
Please note that it is not required to use the entire SHA; the first 6 characters are usually sufficient to uniquely
identify a commit.
Issuing “git show” without a SHA will show you the changes from the last commit.
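As a small sketch of working with abbreviated SHAs, the commands below let git pick a short form itself with “git rev-parse --short” instead of cutting the SHA by hand. The repository, file name and commit message are made up for the example.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "content" > file.txt
git add file.txt
git commit -q -m "Add file"

full=$(git rev-parse HEAD)
short=$(git rev-parse --short HEAD)        # abbreviated, still unique

git log --oneline                          # one line per commit: short SHA and message
git show --stat "$short"                   # commit details plus a summary of changed files

# The abbreviated SHA resolves back to the full one.
resolved=$(git rev-parse "$short")
```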
Merging
Branches are useful for working on different tasks in parallel or for experimentation, but eventually you’ll want that
work to make it back into the master branch, or perhaps you want some changes from a different branch brought into
yours. For this purpose, we use the “git merge <branch_name>” command, which will bring all changes made on the
indicated branch since the last common point into your current branch. These changes are not simply copied over: the
entire history and commit information will be available in the current branch, as if they had been committed there in
the first place.
Please note that what is shown in the last figure is only the “apparent” history of the branch we merged into (as
reported by git log). The actual history remains as in the previous figure: the commits still have their original
parents since, as we said before, commits cannot be rewritten once they’re created without recomputing the
entire history that follows them.
Git will automatically merge the changes made to the files in the two branches as long as it doesn’t run into a situation
where the same portion of the same file has been modified separately in each branch (that situation is called a “conflict”
and we’ll talk about it later).
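A conflict-free merge can be tried out with the sketch below. The branch name feature_x and the file names are hypothetical; the two branches touch different files, so git can merge them automatically, and “--no-edit” accepts the default merge message.

```shell
set -e
repo=$(mktemp -d)          # hypothetical scratch repository
git init -q "$repo"
cd "$repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "initial commit"
start=$(git symbolic-ref --short HEAD)     # branch we will merge into

git checkout -q -b feature_x
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"
feature_tip=$(git rev-parse HEAD)

git checkout -q "$start"
echo "main line" > main.txt
git add main.txt
git commit -q -m "main line work"

# Bring the changes from feature_x into the current branch.
git merge -q --no-edit feature_x

# The graph shows both lines of history joined by a merge commit.
git log --oneline --graph
```

Because both branches gained commits after the common point, the merge here records a new commit with two parents, the second being the tip of feature_x.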
Conflicts
Todo…
Undoing changes
Todo…
Selectively merging commits from one branch into another
Todo… (Cherry pick)
Git bisect