27.07.2024 v0.6.0
- Added experimental support for vLLM.
- Added experimental support for CUDAGraphs in PyTorch.
- Added BFloat16 and Float16 support for X86 and NVIDIA.
- Added FlashAttn-like kernel fusion for X86, NVIDIA and SX-Aurora.
- Improved torch.compile(...) integration.
- SOL no longer aborts execution but properly throws Python exceptions that can be caught using try: ... except sol.Exception as e: ...
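The exception handling described in the last item can be sketched as follows. This is a minimal illustration rather than documented SOL usage: only sol.Exception is taken from the notes above. The helper run_guarded is hypothetical, and a stand-in exception class is substituted when SOL is not installed so that the snippet runs anywhere.

```python
try:
    import sol  # the real package, if it is installed
    SolException = sol.Exception
except ImportError:
    # Stand-in so the sketch runs without SOL (assumption, not part of SOL).
    class SolException(Exception):
        pass


def run_guarded(fn, *args, **kwargs):
    """Call fn and report a SOL failure as (False, message) instead of crashing."""
    try:
        return True, fn(*args, **kwargs)
    except SolException as e:
        # Since v0.6.0 SOL raises instead of aborting, so the process survives
        # and the caller can decide how to recover.
        return False, str(e)
```

Any other exception type still propagates, so unrelated bugs are not silently swallowed.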
Breaking Changes
- sol.config['compiler::profile'] has been deprecated. Use the environment variable SOL_PROFILER instead.
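A minimal sketch of the migration, assuming the profiler is enabled by setting SOL_PROFILER before SOL is imported. The value "1" and the need to set it before the import are assumptions; the notes above only name the variable.

```python
import os

# Before v0.6.0 (deprecated):
#   sol.config['compiler::profile'] = ...
# From v0.6.0 on, the environment variable is used instead.
# ASSUMPTION: "1" is a placeholder; check the SOL docs for accepted values.
os.environ.setdefault("SOL_PROFILER", "1")

profiler_setting = os.environ["SOL_PROFILER"]
print(f"SOL_PROFILER={profiler_setting}")
```

Using setdefault keeps a value already exported in the shell, so a setting made outside the script takes precedence.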
Known Issues
- No BFloat16 and Float16 support for SX-Aurora.
- Performance regressions on SX-Aurora (e.g., ConvNext).
- No gradient computations yet for Interpolation layers.
Closed Issues
- #1555 [NVIDIA] NAN in Albert model
- #1554 [PyTorch] Scriptparser uses deprecated imp package
- #1550 [PyTorch] torch.cross
- #1548 [PyTorch] Unknown at::device
- #1547 [VDims] Can't use the same vidx twice! (128 * #2 * #2)
- #1546 [VE] User specified an unsupported autocast device_type 've'
- #1545 [BuildChain] CentOS7 mirrors are deprecated, we might need to upgrade to manylinux_2_28
- #1544 [Python] Destroy sol_optimizer* when error is thrown
- #1543 [VE] Invalid results when large values get passed to exp
- #1542 [Core] Uncaught sol::Exception when autotuning triggers an assertion
- #1540 [VE] PY_Issue_1410 fails
- #1538 [Python] Allow determinism to be overwritten by user
- #1537 [PyTorch] Finalize Determinism
- #1536 [ISPC] Make Prefixsum as template
- #1535 [DNN] AutoTuner Cross GEMM performance
- #1534 [PyTorch] torch.rand/rand_like F16/BF16 on X86
- #1532 [Wrapper] Don't call sol_call if graph does not contain any nodes
- #1530 [DNNL] PostOp Bias
- #1529 [Core] Improve Exception Handling within TF
- #1528 [PyTorch] aten::extend
- #1527 [ONNX] change s_tensors to scope, and store s_opset in scope
- #1526 [ONNX] LRN
- #1525 [ONNX] Hub Tests
- #1520 [DFP] No Write/Read loops in 376:628
- #1519 [PyTorch] Can't expand [1, 3, 80, 80, 2] to [80, 80, 2] in YOLO perf run
- #1514 [VE] Finalize VE3 support
- #1512 [DNN] Add GEMM upcasting API to support e.g. FP16/BF16 on CPU/VE
- #1508 [Transformers] Persimmon/Qwen2/XLMRobertaXL Accuracy
- #1507 [TIMM] Upgrade to v1.0.7
- #1505 [VE] Set NCC -stdlib=compat
- #1504 [Parser] Implement lazy Numpy eval
- #1503 [DFP] Conv Accuracy
- #1502 [DNNL] Enable F16 and BF16
- #1501 [DFP] Tanh(large number) == nan
- #1498 [NVIDIA] Limit usage of TensorCores to suitable shapes
- #1497 [ISPC] Unable to store varying in uniform var
- #1496 [TIMM] fix models with lit != idims.end() error
- #1494 [HLIR] Add Constraint Registry that allows to serialize them
- #1493 [DNNL] Upgrade 3.5
- #1492 [PyTorch] Unify Script and FX parser
- #1490 [SLEEF] Upgrade v3.6.1
- #1489 [PyTorch/Lightning] Update Testcase to new API
- #1487 [PyTorch] Test v2.3.1
- #1483 [DFP] AccessorPinning fails in PY_REDUCE training
- #1481 [PyTorch] HuggingFace LLama: This model uses variable number of arguments, falling back to torch.jit.trace
- #1480 [DFP] CUDA pow(-0.5, 2.0) results in NaN
- #1477 [DNN] Add timeouts for GEMM and Transpose AutoTune
- #1476 [DFP] remove stack_alloc calls
- #1475 [AutoTuner] Re-enable caching with new AT scheme
- #1474 [HLIR] SDPAttention: Cannot modify VDims at this point anymore!
- #1472 [DFP] Revise CPU+VE cores caching
- #1471 [CUDA] Don't abort if libcuda.so is not found (e.g. in Docker build)
- #1470 [Torchvision] vgg11: T18_D0 violates DNNL post op requirements as src is of type Add
- #1469 [CUDNN] Could not load library libcudnn_cnn_infer.so.8. Error: /usr/lib64/libcudnn_cnn_infer.so.8: undefined symbol
- #1468 [Runtime] Free Persistent Data in case user reruns training fwd pass without bwd pass
- #1466 [DFP] DFPBackend::lowerIR(Layer* l, Cast* p) causes infinite loop when using VE
- #1464 [DFP/Nvidia] LLama using -dt tf32 results in illegal view transformation
- #1462 [PyTorch] Test v2.3.0
- #1459 [NCC] add -march ARCH to NCC command
- #1457 [DFP] Unvectorized OnlineSoftMax in SDP
- #1456 [PyTorch] If FlashAttn enabled for F32 SDP, then also set GEMM::TF32
- #1452 [Wrapper] Add options to wrapper::Attributes to either override or accumulate
- #1451 [VE] ve::trace=True does not compile
- #1450 [DFP] Transform Numel->Cast->Broadcast to Numel->Cast with correct dims
- #1446 [DFP] Test TensorCore-like GEMM implementations
- #1443 [DNN] Implement GEMM autotune swapping inputs
- #1442 [PyTorch] add Tensor.uniform_
- #1441 [PyTorch] add one_hot
- #1440 [DFP] HMerge identical Pooling/Conv and Reduce layers
- #1439 [DFP] Reduce Cores LoopAccessors
- #1437 [DFP] Rework Combine Input
- #1435 [Accuracy] Investigate SoftMax Gradient problem
- #1432 [Runtime] pass Model.training parameter as runtime parameter
- #1431 [DFP/ISPC] add iterations to sol_dfp_reduce_x(..., count_iterations)
- #1430 [HLIR] move Bernoulli::transform(Device&) to respective backends
- #1429 [DNNL] Upgrade 3.4.1
- #1427 [PyTorch] VLLM support
- #1425 [Docs] Update max TF version
- #1424 [HLIR] SDPAttention layer
- #1422 [DFP] Rework Reduce::transform as it underutilizes
- #1421 [ISPC] Add PyTorch_DETERMINISM compile flag, as they don't use FP64 sleef functions!
- #1419 [PyTorch] Test v2.2.2
- #1417 [DFP/X86] Accuracy GELU
- #1416 [DFP] Loop+AccLoop fusion
- #1415 [PyTorch] NaN when training MNist example on CUDA
- #1413 [PyTorch] Pass on fwargs and vdims to torch.compile Backend
- #1412 [TestBench] Add Default DType to perf Output
- #1411 [PyTorch] add tensordot
- #1410 [CUDA] Wrong result in CUDA Transpose
- #1409 [PyTorch] evaluate FP64 GEMM uses tensor cores
- #1407 [RNN] Add Determinism to API
- #1406 [Profiler] MemTrace reports wrong total
- #1405 [NVIDIA] LayerNorm accuracy
- #1403 [NVIDIA] consider to remove cross-warp-reduction support
- #1402 [DFP] OnlineSoftMax::derive
- #1399 [Runtime/HLIR] Include Model Inputs in runtime::RequiresGrad
- #1398 [DNNL] Use PostOp Activations in Inference
- #1397 [Numpy] Adopt new VDims system
- #1396 [Compiler/Runtime] Remove special "INF" case of Derivative
- #1395 [TF] Missing TF handler for DivNoNan
- #1394 [PyTorch] YOLO accuracy
- #1393 [TF] Test v2.15.1
- #1391 [Runtime] Consider separating INF and Training
- #1388 [DFP] Still unused Loop Unpacks, e.g. in AlexNet
- #1387 [DFP] Check Merged FOR-FOR loops, that cause unpacking of Loops (e.g. AlexNet)
- #1386 [TIMM] fix vit_base_patch16_384
- #1385 [Profiler] Reporting to file not working
- #1384 [NVIDIA] Evaluate new __reduce_xxx_sync function for CC >=8
- #1383 [DFP] Improve cache planning
- #1382 [DFP] Unnecessary cast in FP16 Mul->Mul
- #1381 [DFP] Expected [S64] for sol::compiler::backend::dfp::Indices but found [S32]
- #1378 [HLIR] Inherit Cast in MaxPooling.Indices -> Cast -> ...
- #1376 [Runtime] Enable to use sol.device.set(...) from different Threads to run multiple devices in parallel
- #1373 [CUDA] SOL crashes with "invalid device context" when using Streamlit
- #1372 [HLIR] Evaluate if using determinism instead of sol.hlir.RandType is sufficient
- #1371 [DFP] Upcast pure intermediate results in FP16/BF16 to FP32
- #1370 [PyTorch] Test v2.2.1
- #1369 [DFP] Add transformation to upcast internal data types from f16 to f32.
- #1367 [VE] SegFault Norms(64)
- #1365 [NVIDIA] PY_Reduce(6/12) argmin/argmax fails with F64
- #1364 [NVIDIA] PY_Issue_1316 fails with F64
- #1363 [Tests] Enable non-fp32 dtypes in testsuites
- #1360 [DNN] use sol_sync instead of sol::dnn:XXX::sync
- #1359 [CUDA] performance of cublasGEAM not optimal, e.g. in TransposeSwap testcase
- #1358 [DFP] Fix elementwise cases that don't get CoresSIMD assigned
- #1357 [Runtime] Implement sol_tensor_swap to skip Noop Reorders
- #1356 [Sleef] Upgrade v3.6
- #1355 [PyTorch] Performance Issues in CNNs with BS=1 on X86
- #1351 [PyTorch] Fix 'PY_Padding' on VE
- #1350 [PyTorch] Fix 'PY_Norms(3)' on VE
- #1349 [PyTorch] Fix 'PY_Addbmm' on VE
- #1348 [PyTorch] Fix 'PY_Matmul#T#Batched' on VE
- #1346 [PyTorch] Fix 'PY_CatRewire' -> nan on VE
- #1344 [YAAL] Return if any of the checks fail with error code
- #1343 [TF] Test v2.15.0
- #1342 [Runtime] Trigger recompilation if non-dynamic dimension changes instead of crashing
- #1341 [NVIDIA] Fix library detection if cu11 and cu12 packages are installed
- #1338 [Rand] Numpy Random Number Generator
- #1337 [HLIR] Add tensor[condition] operator
- #1336 [HLIR] Remove (Layer*) arguments from Operation, as they know their layer via m_layer!
- #1335 [BLAS] Fix decision making on AMD EPYC for new Autotuning
- #1334 [BLAS] Unify OpenBLAS, MKL, DNNL and AOCLBLAS BLAS Interface
- #1333 [OpenBLAS] Add Backend
- #1332 [AutoTuner] Consider Backend Specific "number of runs" and "not improved"
- #1331 [AutoTuner] Add option to poll performance for a layer from within another layer's tuning cycle
- #1330 [PyTorch] Capture ::c10::Error errors in handle and rethrow as sol::Exception
- #1329 [AutoTuner] AutoTuner Cache does not allow rerunning Reorder -> GEMM, profiling when identical GEMM layer but without previous Reorder was executed
- #1328 [PyTorch] Test v2.2.0
- #1327 [Numpy] Executor overrides input
- #1326 [MKL/VEBLAS] Evaluate if using GEMV when bs==1 is better
- #1323 [Profiler] Fix Total Bytes
- #1322 [DNN] Evaluate other GEMM tuning strategies
- #1320 [DNN] repair GEMM autotuning
- #1319 [VE] Attach to host profiler
- #1318 [PyTorch] Set executing device in model without inputs and parameters
- #1317 [PyTorch] Adjust executor to also copy MemberTensors that are no Buffers to device
- #1316 [PyTorch] aten::scaled_dot_product_attention
- #1315 [TF] Read accuracy modifying values, e.g. tf32 execution
- #1314 [PyTorch+VE] might cause segfault when exiting
- #1313 [PyTorch] Using shape of not used tensor
- #1312 [CAPI] Unify SOL dtypes for generated code
- #1311 [NCC] Don't expect nc++, ... to be installed in /opt/nec/ve/bin/
- #1310 [Installer] Add option to renew license
- #1309 [Installer] Download option not working, as PIP downloads only Wheels, not Source packages
- #1308 [HLIR] Tensors should only evaluate value_op if values and value_op are not None
- #1307 [HLIR] Parser implemented non existing np functions, e.g. np.erf, np.acos, ...
- #1306 [TF] Check Resnet50 CPU performance
- #1304 [Core] Progress bar breaks, when Backend Handles get compiled during optimization process
- #1298 [DNNL] "Unsupported dnnl_format_tag_t: POI/2/true" in tf/regnet
- #1296 [Python] Remove SOL_CONSTANT Params
- #1289 [DFP] Store ReAlloc not as instruction but directly within LoopStack
- #1285 [DFP] Group WriteBacks for better performance in case of MasterStacks
- #1284 [PyTorch] Read Torch accuracy + determinism values and attach them to the layers
- #1282 [DFP] Minimize Loop Index Calculations
- #1275 [ISPC] Investigate impact of setting different target gang-sizes in ISPC compilation
- #1269 [PyTorch] Implement FlashAttention
- #1268 [CUBLAS] FP16 + using advanced flags
- #1264 [DFP] Implement Online normalizer calculation for softmax
- #1250 [AOCL] Add new Backend
- #1240 [PyTorch] enable Torch Compile style fwd+bwd within one pass
- #1238 [PyTorch] Bloom Support and Optimizations
- #1236 [HLIR] Reorder -> GEMM transform
- #1223 [DNNL] Enable NVIDIA GPUs
- #1193 [PyTorch/JIT] Add torch.jit.freeze to test-suite
- #1191 [VDims] Autodetect VDims
- #1190 [Profiler] Trace Memory Allocations
- #1186 [VE] Fix AutoCast to 32Bit/64Bit vars to enable vectorization
- #1185 [NCC] v5.0.2 fails in PY_Norms, PY_Reduce and PY_TorchNorm
- #1180 [HLIR/DFP] Enable to store SOL_CONSTANT in model
- #1153 [PyTorch] Investigate torch.fx for improving the parser
- #1060 [CUDA] Implement transpose as series of calls to cublasgeam
- #1043 [TF] "Unable to find SOL_CONSTANT"
- #999 [ISPC] Improve SLEEF integration
- #991 [Python] Improve performance of sol.optimize
- #920 [PyTorch] Add more Einsum testcases
- #913 [DFP] Rework stack memory caches
- #798 [NVIDIA] Change dependencies to use NVIDIA PIP packages
- #787 [AutoTuner] Think about choosing algorithms not solely based on the performance, but also about its neighborhood, to increase chances of fusion.
- #766 [TF] Accuracy for BatchNorm in Efficientnet
- #696 [PyTorch] add aten::roll
- #651 [TF] tf.keras.applications.efficientnet.EfficientNet + V2 accuracy problems
- #622 [ISPC] ISPC wrongly casts double to uint8
- #504 [DFP] fix removal of unnecessary accumulators
- #503 [DFP] Memory Pinning
- #502 [DFP] Operation Inlining
- #497 [HLIR] Add PassThrough Node
- #495 [VEBLAS] Evaluate SOL's BatchedGEMM versus new NLC BatchedGEMM
- #390 [VEDNN] re-add GEMM
- #367 [PyTorch] Add YOLO Test Case
- #366 [ONNX] Add YOLO TestCase
- #319 [PyTorch] Add option to parse "if self.training:" diverging paths
- #197 [DFP] Reorder: Fill
- #196 [DFP] Reorder: Narrow
- #104 [All] FP16, BFloat16
- #69 [DFP] Performance: BatchNorm Welford Algorithm
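For readers unfamiliar with the algorithm named in the last entry: Welford's method computes mean and variance in a single pass and avoids the cancellation errors of the sum-of-squares formula. The sketch below is a generic textbook version in plain Python, not SOL's DFP implementation; the function name welford is hypothetical.

```python
def welford(values):
    """Single-pass mean and population (biased) variance."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean          # distance to the old mean
        mean += delta / count     # update the running mean
        m2 += delta * (x - mean)  # accumulate squared deviations
    if count == 0:
        raise ValueError("welford() needs at least one value")
    return mean, m2 / count
```

Dividing m2 by count gives the biased variance; dividing by count - 1 instead would give the unbiased estimate.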