Requirement | Version |
---|---|
VEOS | ≥ 2.7 |
NCC | ≥ 5.0 if using VE3 |
Within PyTorch, native VE tensors are supported. Program PyTorch as you would for a GPU, but replace all calls to `cuda` with `ve`. For example:

```python
model.ve()              # copy model to VE#0
input = input.ve()      # copy data to VE#0
model(input)            # gets executed on the device
torch.ve.synchronize()  # wait for execution to complete
```
The following functions are supported (see https://pytorch.org/docs/stable/cuda.html for a description of their CUDA counterparts):

- `torch.Tensor.ve()`
- `torch.Tensor.to('ve')`
- `torch.Tensor.to('ve:X')`
- `torch.nn.Module.ve()`
- `torch.ve.synchronize(device=0)`
- `torch.ve.is_available()`
- `torch.ve.current_device()`
- `torch.ve.set_device(device)`
- `torch.ve.device_count()`
- `torch.ve.memory_allocated(device=None)`
- class `torch.ve.device(device)`
- class `torch.ve.device_of(device)`
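As a small illustration of how the functions listed above fit together, the following sketch picks a device string and falls back to the CPU when no VE is usable. The helper name `pick_device` is hypothetical, and the sketch assumes that `torch.ve` only exists once the VE extension has been loaded into PyTorch; neither is taken from the SOL documentation.

```python
def pick_device(index=0):
    """Return 've:<index>' if a VE backend is usable, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so nothing can be offloaded

    # Assumption: the `ve` namespace is only present after the VE extension is loaded.
    ve = getattr(torch, "ve", None)
    if ve is None or not ve.is_available():
        return "cpu"

    if index >= ve.device_count():
        raise ValueError(f"VE index {index} requested, but only {ve.device_count()} device(s) found")
    return f"ve:{index}"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed to `torch.Tensor.to(...)` or `torch.nn.Module.to(...)`.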
Loss functions are not implemented natively for VE. Instead, use a wrapper model that adds the loss function to the SOL-optimized model.

```python
class TrainingModel(torch.nn.Module):
    def __init__(self, model, loss):
        super().__init__()
        self.model = model
        self.loss = loss

    def forward(self, input, target):
        output = self.model(input)
        loss = self.loss(output, target)
        return output, loss
```
Then adjust your training loop:

```python
import torch
import sol

# `model` and `dataset` are assumed to be defined elsewhere.
device = 've:0'
model.to(device)
optimizer = torch.optim.SGD(model.parameters())
training_model = TrainingModel(model, torch.nn.L1Loss())
training_model = sol.optimize(training_model)

for input, target in dataset:
    input, target = input.to(device), target.to(device)
    optimizer.zero_grad()  # clear the gradients of the previous step
    output, loss = training_model(input, target)
    loss.backward()
    optimizer.step()
```
`training_model` and `model` share the same weights, so you don’t need to adjust your code any further.
For optimal performance, and if you don’t need pseudo-random numbers identical to PyTorch’s, use `sol.optimize(..., determinism=sol.pytorch.determinism(sol.Determinism.Rand_Fastest))`, which enables much faster random number generators.
Due to the increasing number of unresolved issues in the TensorFlow PluggableDevice API (e.g., #55497, #57095, #60883, #60895), we decided to no longer maintain our veda-tensorflow extension. Therefore, you can no longer use `with tf.device("/VE:0"):`. Instead, please use Transparent Offloading via `sol.device.set('ve', 0)`. We are sorry for the inconvenience, but we don’t see any commitment from the TensorFlow team to accept our bugfixes or to fix the issues themselves.
To use the NEC SX-Aurora, call `sol.device.set("ve", deviceIdx)`, where `deviceIdx` is the index of the Aurora to run on, starting from 0. Furthermore, the input data must be located on the host system.

As explained in our paper "SOL: Effortless Device Support for AI Frameworks without Source Code Changes", running inference with Transparent Offloading has nearly zero impact on performance. Training performance, however, will be very low.
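The two requirements above (select the device, keep the inputs on the host) can be sketched as follows. The helper names `enable_transparent_offloading` and `is_on_host` are hypothetical; only `sol.device.set("ve", deviceIdx)` comes from the text above. The import is guarded so that the sketch also runs on machines without SOL, where it simply reports that offloading is unavailable. The host check assumes a PyTorch-style `device.type` attribute.

```python
def enable_transparent_offloading(device_idx=0):
    """Select VE `device_idx` for Transparent Offloading; return False if SOL is missing."""
    try:
        import sol  # only importable where SOL is installed
    except ImportError:
        return False
    sol.device.set("ve", device_idx)  # call taken from the documentation text
    return True


def is_on_host(tensor):
    """Return True if `tensor` lives in host memory.

    Assumption: device-aware tensors expose a PyTorch-style `.device.type`;
    objects without a `.device` attribute are treated as host data.
    """
    device = getattr(tensor, "device", None)
    if device is None:
        return True
    return getattr(device, "type", str(device)) == "cpu"


if __name__ == "__main__":
    print("offloading enabled:", enable_transparent_offloading(0))
```

With offloading enabled, the model is called as usual with host-resident inputs; inputs for which `is_on_host` returns `False` would first have to be copied back to the host.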
Option | Type/Default | Description |
---|---|---|
ve::trace | bool/false | Enables ftrace. |
ve::packed | bool/false | Enables packed vectors for float32. |
EnvVar | Default | Description |
---|---|---|
NAR | “/opt/nec/ve/bin/nar” | Path to nar |
NCXX | “/opt/nec/ve/bin/nc++” | Path to nc++ |
NOBJCOPY | “/opt/nec/ve/bin/nobjcopy” | Path to nobjcopy |
VEDA_VISIBLE_DEVICES | | see VEDA for description |
VE_NODE_NUMBER | | see VEDA for description |
VE_OMP_NUM_THREADS | | see VEDA for description |
_VENODELIST | | see VEDA for description |
VE_LD_LIBRARY_PATH | | see VEDA for description |
NCPATH | | Used as include paths |
NC_INCLUDE_PATH | | Used as include paths |
NCPLUS_INCLUDE_PATH | | Used as include paths |
NLIBRARY_PATH | | Used as library paths |
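To check which tool paths would be in effect on a given machine, the defaults from the table can be combined with the current environment. This is only a diagnostic sketch: the helper name `resolve_tools` is hypothetical, and the lookup shown here (environment variable first, then the documented default) is an assumption about how the variables are meant to be used, not SOL's actual implementation.

```python
import os

# Defaults as listed in the table above.
DEFAULT_TOOLS = {
    "NAR": "/opt/nec/ve/bin/nar",
    "NCXX": "/opt/nec/ve/bin/nc++",
    "NOBJCOPY": "/opt/nec/ve/bin/nobjcopy",
}


def resolve_tools(env=None):
    """Map each variable to (path, usable), where usable means an executable file exists."""
    env = os.environ if env is None else env
    report = {}
    for var, default in DEFAULT_TOOLS.items():
        path = env.get(var, default)  # an explicitly set variable overrides the default
        usable = os.path.isfile(path) and os.access(path, os.X_OK)
        report[var] = (path, usable)
    return report


if __name__ == "__main__":
    for var, (path, usable) in resolve_tools().items():
        print(f"{var:9s} {path} [{'ok' if usable else 'missing'}]")
```

A tool reported as missing is a first hint when investigating compiler detection problems.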
The AI framework reports that an operation is not supported by device type "VE" | |
---|---|
This happens because only a minimal subset of VE function calls can be executed “eagerly” within the framework, e.g., +, -, *, /, … If you encounter this problem, please open an issue for VEDA-PyTorch. |
SOL reports "not found" for NCC compiler. | |
---|---|
Possible Cause 1 |
SOL is unable to find |
Possible Cause 2 |
If there is a problem with your NCC license, SOL is unable to properly detect the compiler. Please run |
SOL crashes with `nc++: /opt/nec/ve/ncc/3.4.2/libexec/ccom is abnormally terminated by SIGSEGV`. | |
---|---|
On some systems NCC v3.4.2 crashes when compiling code generated by SOL. If you encounter this problem, please switch to an older version of the compiler using the |
SOL reports `VEDA_ERROR: VEDA_ERROR_CANNOT_CREATE_CONTEXT`. | |
---|---|
This error message is triggered when the VE is occupied by another process. SOL relies on AVEO, which requires exclusive access to the device. To resolve this issue, terminate all other processes on the device. You can use |
You can use the following scripts to build Singularity containers that contain SOL.
If you want to install SOL from the official repository, use this script:
```
BootStrap: docker
From: rockylinux/rockylinux:8.10

%post
    # setup OS
    dnf update -y
    dnf install -y gcc-toolset-10 python312 # SOL requirements
    dnf install -y epel-release # VEOS requirements
    dnf install -y libquadmath libdhash protobuf-c log4c # VEOS requirements

    # setup VENV
    python3 -m venv /venv
    . /venv/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install {{ PYTHON_FRAMEWORKS }}
    python3 -m pip install nec-sol
    python3 -m nec-sol install -u "{{ SOL_USERNAME }}" -p "{{ SOL_PASSWORD }}" --accept-license --devices ve
    deactivate

%environment
    # init VE paths
    export PATH=$PATH:/opt/nec/ve/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nec/ve/veos/lib64

    # init GCC/10
    . scl_source enable gcc-toolset-10
    export CC=/opt/rh/gcc-toolset-10/root/usr/bin/gcc
    export CXX=/opt/rh/gcc-toolset-10/root/usr/bin/g++

    # init VENV
    . /venv/bin/activate

    # Configure Proxy here if needed
    # export https_proxy=192.168.0.1:1234
    # export http_proxy=192.168.0.1:1234
```
Save the script as `sol4ve.def`. Then create a file `sol4ve.cfg` and set the correct values for the variables, replacing `{USERNAME}` and `{PASSWORD}` with your credentials:

```
SOL_USERNAME={USERNAME}
SOL_PASSWORD={PASSWORD}
PYTHON_FRAMEWORKS=torch torchvision
```

Then build the container with the following command:

```
sudo -E singularity build --build-arg-file sol4ve.cfg sol4ve.sif sol4ve.def
```
If you want to install SOL from a local folder, you can use the following script:

```
BootStrap: docker
From: rockylinux/rockylinux:8.10

%files
    {{ SOL_PATH }} /sol

%post
    # setup OS
    dnf update -y
    dnf install -y gcc-toolset-10 python312 # SOL requirements
    dnf install -y epel-release # VEOS requirements
    dnf install -y libquadmath libdhash protobuf-c log4c # VEOS requirements

    # setup VENV
    python3 -m venv /venv
    . /venv/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install {{ PYTHON_FRAMEWORKS }}
    python3 -m pip install --pre nec-sol-core[ve,torch] veda-pytorch -f /sol # add features if needed
    deactivate
    rm /sol/*.*
    rmdir /sol

%environment
    # init VE paths
    export PATH=$PATH:/opt/nec/ve/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nec/ve/veos/lib64

    # init GCC/10
    . scl_source enable gcc-toolset-10
    export CC=/opt/rh/gcc-toolset-10/root/usr/bin/gcc
    export CXX=/opt/rh/gcc-toolset-10/root/usr/bin/g++

    # init VENV
    . /venv/bin/activate

    # Configure Proxy here if needed
    # export https_proxy=192.168.0.1:1234
    # export http_proxy=192.168.0.1:1234
```
Create a file `sol4ve.cfg` and set the correct values for the variables, replacing `{PATH}` with the path to the SOL download folder:

```
SOL_PATH={PATH}
SOL_PASSWORD={PASSWORD}
PYTHON_FRAMEWORKS=torch torchvision
```

Then build the container with the following command:

```
sudo -E singularity build --build-arg-file sol4ve.cfg sol4ve.sif sol4ve.def
```
To run the container, execute the following command, which binds the required folders:

```
singularity shell --writable-tmpfs --bind /opt/nec:/opt/nec:ro --bind /usr/lib64/libaurlic.so.1:/usr/lib64/libaurlic.so.1:ro --bind /var/opt/nec/ve/veos/:/var/opt/nec/ve/veos/:rw sol4ve.sif
```

Instead of binding the required folders manually on the command line, you can also use the `SINGULARITY_BIND` environment variable:

```
export SINGULARITY_BIND=/opt/nec:/opt/nec:ro,/usr/lib64/libaurlic.so.1:/usr/lib64/libaurlic.so.1:ro,/var/opt/nec/ve/veos/:/var/opt/nec/ve/veos/:rw
singularity shell --writable-tmpfs sol4ve.sif
```
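Both variants above encode the same three bind mounts. If the container is launched from a script, keeping the mounts in one list and deriving both forms from it avoids the two drifting apart. The following is a minimal sketch; the helper names `singularity_bind` and `shell_command` are hypothetical, and only the paths and flags are taken from the commands above.

```python
# (source, destination, mode) triples taken from the commands above.
BINDS = [
    ("/opt/nec", "/opt/nec", "ro"),
    ("/usr/lib64/libaurlic.so.1", "/usr/lib64/libaurlic.so.1", "ro"),
    ("/var/opt/nec/ve/veos/", "/var/opt/nec/ve/veos/", "rw"),
]


def singularity_bind(binds=BINDS):
    """Build the comma-separated value for the SINGULARITY_BIND variable."""
    return ",".join(f"{src}:{dst}:{mode}" for src, dst, mode in binds)


def shell_command(image="sol4ve.sif", binds=BINDS):
    """Build the equivalent `singularity shell` argument list with explicit --bind flags."""
    cmd = ["singularity", "shell", "--writable-tmpfs"]
    for src, dst, mode in binds:
        cmd += ["--bind", f"{src}:{dst}:{mode}"]
    return cmd + [image]


if __name__ == "__main__":
    print("export SINGULARITY_BIND=" + singularity_bind())
    print(" ".join(shell_command()))
```

The list returned by `shell_command()` can be passed directly to `subprocess.run`, or `singularity_bind()` can be placed into the environment of the child process.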