Difference between revisions of "NautilusServer"

From Deep Depth 116E167 Project Documentation

Latest revision as of 15:45, 24 July 2018

Help

Accessing

Access from: ITU or ITU VPN.

VPN Help

If you are accessing from off-campus you will need to connect via the university VPN. BIDB has instructions for accessing the VPN: http://bidb.itu.edu.tr/hizmetler/vpn

SSH help

The SSH command to connect from a Unix environment:

   ssh -X -p 1542 YOURUSERNAME@SERVER_IP_ADDRESS

Switch meanings:

  • -p 1542: Connect on port 1542
  • -X: This allows you to run X applications. Omit it if you will only use the command line. You can also run X applications from Windows, but you will need to install an X server on your Windows machine.
  • YOURUSERNAME: You should have been given this along with your password when your account on the server was created.
  • SERVER_IP_ADDRESS: This is the IP address of the server. You can get it from others on the team.

Note: to check that X forwarding is working, once you have connected, try running on the server the command:

   xeyes

Or:

   dolphin

Or:

   eog

You could, for example, run spyder like this, but there can be some latency across the network.
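
To avoid retyping the port, username and -X switch each time, the connection settings above can be stored in your SSH client configuration. The following is a minimal sketch: the alias nautilus is a hypothetical name chosen for illustration, and YOURUSERNAME and SERVER_IP_ADDRESS are the same placeholders as above and must be replaced before use.

```shell
# Append a host entry to the SSH client config (default: ~/.ssh/config).
# "nautilus" is a hypothetical alias; replace the two placeholders before use.
# Run this only once, otherwise the entry is appended again.
SSH_CONFIG="${SSH_CONFIG:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$SSH_CONFIG")"
cat >> "$SSH_CONFIG" <<'EOF'

Host nautilus
    HostName SERVER_IP_ADDRESS
    Port 1542
    User YOURUSERNAME
    ForwardX11 yes
EOF
chmod 600 "$SSH_CONFIG"
```

After this, running ssh nautilus should behave like the full command above, since ForwardX11 yes corresponds to the -X switch.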

Port Tunnelling

You can use SSH to tunnel a port on your machine to a port on the server in order to access applications running there (e.g. Tensorboard) from your machine. The option used for tunnelling is:

-L <YOUR_PORT>:localhost:<SERVER_PORT>

For example:

 ssh -p 1542 -L 6006:localhost:6006 alican@SERVER_IP_ADDRESS

Here, anybody connecting to port 6006 on your local computer will be directed to port 6006 on the remote computer (the server).

The IP address of the server can be learnt from others on the team.
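
The two ports do not have to match. With -L, the first number is the port on your own machine and the last is the port on the server. The sketch below only prints the command rather than running it; the helper name tunnel_cmd and the local port 16006 are hypothetical choices for illustration, and -N simply asks ssh not to open a remote shell.

```shell
# Print (not run) the ssh command that forwards a local port to a server port.
# Arguments: local port, server port, username, server address.
tunnel_cmd() {
    echo "ssh -N -p 1542 -L ${1}:localhost:${2} ${3}@${4}"
}

# Example: reach Tensorboard (server port 6006) at http://localhost:16006
tunnel_cmd 16006 6006 YOURUSERNAME SERVER_IP_ADDRESS
```

Choosing a different local port is useful when 6006 is already taken on your own machine.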

Setting up a deep learning environment

Install Miniconda

Note (2018-07-02): this has just been updated and has not been tested. If you have tested it, let me know so I can remove this message or fix it.

   export MINICONDA_PATH_PARENT=$HOME # where you want miniconda to live
   export MINICONDA_PATH=$MINICONDA_PATH_PARENT/miniconda3
   export MINICONDA_INSTALLER=Miniconda3-latest-Linux-x86_64.sh
   export MINICONDA_URL=https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER
   mkdir -p ~/tmp
   cd ~/tmp
   mkdir -p $MINICONDA_PATH_PARENT
   wget $MINICONDA_URL
   bash $MINICONDA_INSTALLER -b -p $MINICONDA_PATH
   export PATH=$MINICONDA_PATH/bin:$PATH
   echo PATH: $PATH
   echo >> ~/.bashrc
   echo export PATH=$MINICONDA_PATH/bin:\$PATH >> ~/.bashrc
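
Since the steps above are marked as untested, a quick check that conda ended up on the PATH may be worthwhile. This is a small sketch; check_conda is a hypothetical helper name.

```shell
# Report whether conda is reachable on the current PATH.
check_conda() {
    if command -v conda >/dev/null 2>&1; then
        echo "conda found at: $(command -v conda)"
        conda --version
    else
        echo "conda not found on PATH; open a new shell or re-check ~/.bashrc"
    fi
}

check_conda
```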


Install PyTorch

PyTorch and some related packages will be installed in a conda (https://conda.io/docs/intro.html) environment called pytorch:

   export ENVNAME=pytorch
   conda create --name $ENVNAME
   source activate $ENVNAME
   conda install pytorch torchvision -c pytorch 
   conda install opencv pillow spyder matplotlib #some other cool things


To check that it is working, run the following; it should complete without error:

   python -c "import torch;torch.randn(10)"

To check the GPU is working, check this first (it should list two GPUs):

   nvidia-smi

Then make sure the following runs without error (it may take a moment):

   python -c 'import torch;torch.randn(10).to(torch.device("cuda"))'
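
To see how many GPUs PyTorch itself detects (it should agree with nvidia-smi), something like the following sketch can be used. It assumes the pytorch environment is active; check_torch_gpus is a hypothetical helper name.

```shell
# Print whether CUDA is usable and how many GPUs PyTorch can see.
# Falls back to a hint if torch cannot be imported in the current environment.
check_torch_gpus() {
    if python -c 'import torch' >/dev/null 2>&1; then
        python -c 'import torch; print("CUDA available:", torch.cuda.is_available()); print("GPU count:", torch.cuda.device_count())'
    else
        echo "torch is not importable; activate the environment first"
    fi
}

check_torch_gpus
```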


Install tensorflow and keras

These will be installed in a conda environment called deep:

   export ENVNAME=deep
   conda create --name $ENVNAME
   source activate $ENVNAME
   conda install theano keras tensorflow tensorflow-gpu opencv pillow spyder matplotlib

To check Keras is working:

   python -c "from keras.models import Sequential;Sequential()"

To check the GPU is working with tensorflow, check this first (it should list two GPUs):

   nvidia-smi

Then make sure the following script runs and finds one CPU and two GPUs: https://bitbucket.org/damienjadeduff/slurm_deep/src/master/sariyer_python3/test_tf_gpu.py

Run it like this:

   python test_tf_gpu.py

Warning: for different versions of Tensorflow, Keras or Theano you may need to use pip to install the version you need in an environment.

Easy file access (Linux)

This can be useful for getting files on and off the server by accessing your remote home directory as if it was on your local computer (mounted on your file system).

On YOUR Linux computer run:

   sudo apt-get install sshfs
   targ=~/remote/nautilus
   mkdir -p $targ
   fusermount -u $targ # only necessary to unmount if already tried
   sshfs -p 1542 -o workaround=rename YOUR_SERVER_USERNAME@SERVER_IP_ADDRESS:/home/YOUR_SERVER_USERNAME $targ

The server IP address can be learnt from others on the team.

Note: if parts of your system hang because the connection to the ssh server gets stale (a common problem), just do:

   killall sshfs

It should resolve most of your problems.
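
The mount steps above can be wrapped in small helper functions, sketched below under a few assumptions: the function names are hypothetical, the placeholders are the same as above, and the reconnect and ServerAliveInterval options are added in the hope of reducing stale connections rather than as a tested fix for this server.

```shell
# Return success if the given directory is currently a mount point.
is_mounted() {
    mountpoint -q "$1"
}

# Mount the remote home directory unless it is already mounted.
# YOUR_SERVER_USERNAME and SERVER_IP_ADDRESS are placeholders, as above.
mount_nautilus() {
    targ="${1:-$HOME/remote/nautilus}"
    mkdir -p "$targ"
    if is_mounted "$targ"; then
        echo "already mounted: $targ"
    else
        sshfs -p 1542 -o workaround=rename,reconnect,ServerAliveInterval=15 \
            YOUR_SERVER_USERNAME@SERVER_IP_ADDRESS:/home/YOUR_SERVER_USERNAME "$targ"
    fi
}

# Lazy unmount, for when the connection has gone stale.
unmount_nautilus() {
    fusermount -u -z "${1:-$HOME/remote/nautilus}"
}
```

Put these in your local ~/.bashrc and call mount_nautilus or unmount_nautilus as needed.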

Using the SSD

There is an SSD drive installed. This drive is automatically mounted at:

   /media/FASTDATA1

The drive belongs to user root and group fastdata1. If you cannot access it you need to get an admin (Hossein) to add you to the group with the command:

   sudo usermod -aG fastdata1 YOURUSERNAME

And to create a folder for you in there with the right permissions:

   sudo mkdir /media/FASTDATA1/YOURUSERNAME
   sudo chown YOURUSERNAME:YOURUSERNAME /media/FASTDATA1/YOURUSERNAME
   sudo chmod 700 /media/FASTDATA1/YOURUSERNAME
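
To check for yourself whether you have already been added to the group, a sketch like the following can be run on the server. The helper name in_group is hypothetical. Note that a new group membership normally only takes effect after you log out and back in.

```shell
# Return success if the current user belongs to the named group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_group fastdata1; then
    echo "You are in fastdata1"
else
    echo "Not in fastdata1 yet (log out and back in after being added)"
fi
```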

Compiling your own CUDA programs

To do this, add the following lines to your .bashrc file in your home folder:

   export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
   export PATH=/usr/local/cuda/bin:$PATH
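
If you prefer to add these lines from the command line, the sketch below appends each one only if it is not already present, so it can be run more than once without duplicating entries. The helper name append_once is hypothetical.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"; file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Single quotes keep $PATH and $LD_LIBRARY_PATH unexpanded in the file.
BASHRC="${BASHRC:-$HOME/.bashrc}"
append_once 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' "$BASHRC"
append_once 'export PATH=/usr/local/cuda/bin:$PATH' "$BASHRC"
```

After opening a new shell, nvcc --version should print the CUDA compiler version if the paths are correct.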

Running programs after you log out

Some processes take a long time, and if you log out while one is running, it may be terminated along with your SSH session.

To get around this, people use the tool screen or newer alternatives like tmux.

You log in over ssh, then type:

   tmux

This will put you in a little session-in-a-session. Even if you log out of ssh this tmux session will persist, and you can get back to it later with tmux attach (described below).

Once you've finished writing your commands (e.g. python my_python_program_that_takes_3_days.py), and started it running, you can get back to an ordinary shell session by typing:

   Ctrl-b d

(that is, a Ctrl-b followed by a d).

Now if you want to get back to your previous session, do not simply run tmux again, as that starts a new session. To be sure you connect to the right session, run:

   tmux list-sessions

See the session number of the session to which you want to connect (e.g. 0) then to connect to it, run:

   tmux attach -t 0

There is a lot more to tmux than that but that's the important part.

For more, see:

https://robots.thoughtbot.com/a-tmux-crash-course
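
As a further sketch of the workflow above, sessions can be given names so that you do not have to look up session numbers. The helper names start_job and rejoin and the session name train are hypothetical; the tmux options used are -d (start detached), -s (session name) and -A (attach if the session already exists).

```shell
# Start a command in a detached, named tmux session.
# Arguments: session name, then the command to run inside it.
start_job() {
    session="$1"; shift
    tmux new-session -d -s "$session" "$@"
}

# Attach to the named session, creating it if it does not exist yet.
rejoin() {
    tmux new-session -A -s "$1"
}

# Example (not run here):
#   start_job train "python my_python_program_that_takes_3_days.py"
#   rejoin train
```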

Other gotchas

Who else is using the computer?

Try running

    who

Or

   nvidia-smi

Or

   top

Temperature

The GPUs are set to slow down at 93 °C and shut down at 96 °C. Idle temperature should be about 50 °C. To see the current temperature information in full, run:

   nvidia-smi -q -d temperature

The GPUs should never reach the slowdown temperature. If they do, that's a big whoops.

I have noticed that the first GPU (GPU 0) gets hotter quicker, presumably due to its physical location. With this information, it might be preferable to use GPU 1 more of the time.
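
For a compact view, nvidia-smi can also print just the index and temperature of each GPU, which is easy to filter. The sketch below flags GPUs at or above a limit; the helper name hot_gpus and the warning limit of 85 are assumptions chosen for illustration (somewhat below the 93 °C slowdown point), not settings from this server.

```shell
# Read "index, temperature" lines and print GPUs at or above the limit.
hot_gpus() {
    awk -F', *' -v limit="$1" '$2 + 0 >= limit + 0 { print "GPU " $1 ": " $2 " C" }'
}

# Query each GPU's temperature if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 85
fi
```

No output means no GPU is at or above the limit.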

GPU and Memory Allocation

Multiple kernels and users can also run on one GPU. This may mean you don't have enough memory at some point. It is possible to set it so that a GPU can only be accessed by one user at a time. This may be necessary in the future to ensure there is enough memory for those big jobs.

Tensorflow by default seems to claim all the memory on all the GPUs, so it is considerate towards other users to configure it to take less.

Hiding unused GPUs from your program

One suggestion is to use

   nvidia-smi

to check which GPUs are available first then hide the one that you don't need from your program, using

   CUDA_VISIBLE_DEVICES=0 yourprogram

if you want to use the 1st GPU or

   CUDA_VISIBLE_DEVICES=1 yourprogram 

if you want to use the 2nd GPU.
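
The check with nvidia-smi can be partly automated. The sketch below picks the GPU with the most free memory; the helper name pick_gpu is hypothetical, and free memory is only a rough proxy for "unused", so it is still worth looking at the full nvidia-smi output.

```shell
# Read "index, free_MiB" lines and print the index with the most free memory.
pick_gpu() {
    sort -t, -k2,2nr | head -n 1 | cut -d, -f1
}

# Query free memory per GPU if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu="$(nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits | pick_gpu)"
    echo "GPU with most free memory: $gpu"
    # Then run, for example:
    #   CUDA_VISIBLE_DEVICES="$gpu" yourprogram
fi
```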

Restricting GPU memory used by your tensorflow program

Another approach when using tensorflow is to use only a certain amount of memory, as described in the following answer: https://stackoverflow.com/a/34200194/1616231

Automatic memory claiming

Alternatively you may make tensorflow take memory as it is needed by taking the steps described in the following answer: https://stackoverflow.com/a/37454574/1616231 (though this will ultimately use more memory).

Non-tensorflow approaches may have different characteristics, so the first option (hiding GPUs from your program) is the most general.
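
The TensorFlow GPU guide also documents an environment variable that requests on-demand memory growth without changing your program. Whether the TensorFlow version installed in your environment honours it is an assumption you should verify, for instance by watching nvidia-smi while your program runs.

```shell
# Ask TensorFlow to grow GPU memory on demand instead of claiming it all.
# Assumption: the installed TensorFlow release honours this variable.
export TF_FORCE_GPU_ALLOW_GROWTH=true

# It can be combined with hiding GPUs, for example:
#   CUDA_VISIBLE_DEVICES=1 python yourprogram.py
```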

More Information

Server Construction

Built by Uzmanlar PC

Software

OS

Kubuntu 16.04.3 LTS

Graphics Drivers

Nvidia 384.59 drivers installed using runfile NVIDIA-Linux-x86_64-384.59.run

Installed using (to keep using the integrated graphics as main display graphics):

   sudo ./NVIDIA-Linux-x86_64-384.59.run --no-opengl-files --no-x-check --disable-nouveau

CUDA Drivers

Installed using

   cuda_8.0.61.2_linux.run