Pulling an all-nighter

Configuring GPU libraries on Ubuntu 20.04 LTS

Dec 31, 2020

DOING DATA SCIENCE FROM SCRATCH TASK BY TASK

That is me in that photo last night (December 30th, 2020). I pulled an all-nighter to configure my Linux (Ubuntu 20.04) box for Deep Learning. It never gets old: the myriad of dependencies, libraries, environment variables, GCC versions and optimization libraries always throws a wobbler at some stage. Well, I certainly made a meal out of this one. Let me explain and provide you with some benchmarks. Since I am doing Data Science from Scratch, I couldn’t see my way to a Docker image, a pre-configured AMI, or even a VirtualBox image. That would be too easy and seemed like cheating. In practice, go with whichever option saves you from playing with configuration. Time is somebody’s cash!

Honestly, I can tell you that the configuration steps are far too many and potentially too dull to write about. I will leave you a tutorial at the end of this post if you need the details. In any case, you will always have to adjust for versions of everything anyway.

Starters

As my starter course, we served up Cuda 10, and the annual invocation of ./deviceQuery returned the usual sorry news. My box still has a GTX 750Ti. Yes, Santa didn’t bring me a new GeForce GPU this Christmas. I could go from 706 Cuda cores to 1,408 cores just by flashing my credit card. The older cards have come down a lot in price, and for about $250 I could go to a 16 series card. The chassis and power supply all check out. Since I followed all the instructions religiously, I was rewarded with a ‘deviceQuery’ output, and my starter was lovely. I enjoyed it! What’s next, asked the pain waiter. How can we torture you further?

First dish

For my first dish, I chose TensorFlow-GPU. Installation of TensorFlow 2.4 naturally complained that it really preferred Cuda 11. Yes, it was coming up to 2am, and for sure it would be a long night. I had that taste in my mouth now. Like I’d started with spicy chicken wings, and now I was expecting to taste a delicate white fish. I had a terrible taste in my mouth. Since Cuda versions can co-exist (just set your environment variables as you wish and do a bit of linking), I went ahead and installed Cuda 11. Again I issued the ./deviceQuery incantation, and yet again I was reminded about the older card. Hooray, it is now 3am! The barkeep has refused to serve another beer since I am sudo’ing.

import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means it will fall back to the CPU.
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Running some test code returned confirmation that TensorFlow could see my Cuda-capable GPU, even if it is old! Okay, TensorFlow worked with Cuda 11, but that dish just didn’t sit right with that starter. I have heartburn now, but that could be the last of the mince pies between sudos. So what’s for the next course?
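For the co-existing Cuda installs mentioned above, the switch mostly comes down to which toolkit directory appears first in PATH and LD_LIBRARY_PATH. Here is a minimal sketch of that idea, assuming the toolkits live under /usr/local/cuda-<version> (the usual location for NVIDIA's installers, yours may differ). The helper name cuda_env and the version string "11.0" are only illustrations, and CUDA_HOME is a common convention rather than something every tool reads.

```python
import os


def cuda_env(version, base_env=None, root="/usr/local"):
    """Return a copy of the environment that puts one Cuda toolkit first."""
    env = dict(os.environ if base_env is None else base_env)
    cuda_home = f"{root}/cuda-{version}"  # assumed install layout
    env["CUDA_HOME"] = cuda_home
    # Prepend, so this toolkit wins over any other version already listed.
    env["PATH"] = os.pathsep.join(
        p for p in (f"{cuda_home}/bin", env.get("PATH", "")) if p
    )
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        p for p in (f"{cuda_home}/lib64", env.get("LD_LIBRARY_PATH", "")) if p
    )
    return env


# Example with a tiny starting environment so the result is easy to read.
env = cuda_env("11.0", base_env={"PATH": "/usr/bin"})
print(env["PATH"])
```

The returned dictionary can be handed to subprocess.run(..., env=env) to launch a build or a script against one particular toolkit without touching your shell profile.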

Main course

Since it was now heading on towards 4am, I decided I needed a larger meal. I chose an Irish Whiskey with a local build of OpenCV & Cuda. The waiter raised an eyebrow; the cat let out some sort of sigh and left. After a few downloads, I went about issuing the magic incantations to trigger the build.

make
make -j8
make install

The screen immediately filled up with compiler messages. Green ones suggested success, but large areas of white compiler errors left room for another Irish Whiskey. So it is 6am, things are a little blurred now, but that screen says 25% completed. Oh lord! Might have a short nap while this carries on. Finally, the screen says done. All that was left was to create a link between my compiled cv2.so object and the Python virtual environment looking to use it. Well, it turned out there was no cv2.so, and I spent ages hunting for the actual compiled object. Eventually, the main course was over, and it was time for my dessert.
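The hunt for the compiled object can be shortened with a few lines of Python. Python 3 builds of OpenCV usually give the module a longer file name carrying an ABI tag (something like cv2.cpython-38-x86_64-linux-gnu.so) instead of a plain cv2.so, so a wildcard search helps. This is only a sketch: the function names are mine, and the directories you search and link into depend on your own install prefix and virtual environment.

```python
from pathlib import Path


def find_cv2_binaries(root):
    """Return every compiled cv2 extension module found below root."""
    # The wildcard covers both cv2.so and names that carry an ABI tag.
    return sorted(Path(root).rglob("cv2*.so"))


def link_into_venv(binary, site_packages, dry_run=True):
    """Symlink the compiled module into a virtual environment as cv2.so."""
    target = Path(site_packages) / "cv2.so"
    if not dry_run:
        target.symlink_to(Path(binary).resolve())
    return target
```

Calling find_cv2_binaries on your install prefix (for example /usr/local/lib) lists the candidates, and link_into_venv with dry_run=False creates the link inside the environment's site-packages directory.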

Dessert

When I hear a mention of dessert, I generally think of a treat.

Photo by Kobby Mendez on Unsplash

However, here in Towards Data Science, dessert for us is doing some Deep Learning or inference with that newly configured computer system. My annual GPU configuration party is over. It works again until the next

sudo apt update
sudo apt upgrade

That is when things usually stop working.

Enjoying our dessert

If you have been reading my column here on Towards Data Science, you will know that I am on a mission. I wanted to count the number of cars passing my house using Computer Vision and Motion Detection. Things started to get complicated and slow when I tried to send images and videos through neural network detection. Since I had my own Linux box, it made sense to move my Deep Learning stuff over to something that can punch harder.

I made some code changes to my class and went about doing some benchmarks. Dessert is a course best shared and enjoyed together.

Image by the Author

Line [25] shows me creating a myYolo() object and setting CUDA as the target device. Yes, I can do this as I have OpenCV built with CUDA now. [26] runs object detection over 192 photographs. My i7 Intel Linux box with a Cuda GPU took just 31 seconds to process all those pictures, with an average inference time of only 0.13 seconds. That is amazingly quick.
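For readers wondering what "setting the target device" might look like under the hood, here is a hedged sketch. OpenCV's DNN module lets you pick a backend and target on a network object with setPreferableBackend and setPreferableTarget, and the constant names below are real cv2.dnn constants. The mapping, the helper name set_device and the idea that myYolo works this way are my assumptions for illustration, not a listing of the actual class.

```python
# Names of cv2.dnn constants; the device-to-constant mapping is an illustration.
BACKENDS = {
    "cpu": ("DNN_BACKEND_OPENCV", "DNN_TARGET_CPU"),
    "cuda": ("DNN_BACKEND_CUDA", "DNN_TARGET_CUDA"),
    "myriad": ("DNN_BACKEND_INFERENCE_ENGINE", "DNN_TARGET_MYRIAD"),
}


def set_device(net, device, dnn):
    """Point an OpenCV DNN network at the requested device.

    `dnn` is the cv2.dnn module, passed in so this helper has no hard import.
    """
    backend_name, target_name = BACKENDS[device.lower()]
    net.setPreferableBackend(getattr(dnn, backend_name))
    net.setPreferableTarget(getattr(dnn, target_name))
    return backend_name, target_name
```

With OpenCV installed, usage would be along the lines of net = cv2.dnn.readNetFromDarknet(cfg_path, weights_path) followed by set_device(net, "cuda", cv2.dnn), where cfg_path and weights_path are placeholders for your own model files. The CUDA constants only do anything useful when OpenCV was built with CUDA support, which is what the all-nighter was for.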

Line [27] creates a new instance of myYolo() and leaves the default device (CPU). [28] is me once again processing those 192 photographs. Now the job took just under 1 minute, with an average inference time of 0.23 seconds. That really isn’t that slow, and that is only the native CPU. Still, CUDA is nearly twice as fast as the CPU on the inference workload. More dessert, please!
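If you want to run the same kind of comparison yourself, a small timing harness is enough. The sketch below measures both the wall-clock time for the whole batch and the average time per call, which also shows why the two numbers need not line up exactly: the batch total includes everything around the detector call as well. The names benchmark and speedup are mine, and detect stands in for whatever callable runs your model.

```python
import time


def benchmark(detect, images):
    """Run detect() over every image; return (total_seconds, mean_seconds)."""
    per_image = []
    start = time.perf_counter()
    for image in images:
        t0 = time.perf_counter()
        detect(image)
        per_image.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    mean = sum(per_image) / len(per_image) if per_image else 0.0
    return total, mean


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run is than the slow one."""
    return slow_seconds / fast_seconds


# Ratio of the CPU and CUDA average inference times quoted above.
print(round(speedup(0.23, 0.13), 2))  # prints 1.77
```

A ratio of about 1.77 is where "nearly twice as fast" comes from.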

Image by the author

Now sometimes I look around the restaurant to see what others are eating for their desserts. Must be an ego thing! So I ran the same script on my Raspberry Pi ARM processor. I do love all forms of Raspberry Pies.

[5] I create an instance of my class with the default CPU.

[6] It took 1,914 seconds to process. That is 32 minutes, and the system got really hot. That dessert isn’t so nice. Inference time is 10 seconds on average.

Image by the author

How about a diner with the Intel Neural Compute Stick?

[2] I create an instance of my class with the device set to myriad.

[3] The job finishes in 188 seconds, just 3 minutes, and a lot more enjoyable than an entire 32 minutes. Inference time is almost 0.75 of a second. That is probably an eternity for a front-end user on a mobile phone.

Closing

So how do you like my dessert now? I pulled an all-nighter and configured my Ubuntu system. Today I can do object detection on a large number of photos in under 1 minute. Yesterday I was happy with getting the job done in 3 minutes. A next step could be to build an API with FastAPI and expose the model for on-demand inference. The trick is to perform one forward pass when you load the model. The first pass has some associated overhead. After that, the Cuda device will deliver a forward pass in under 0.15 of a second. Oh yeah, and that is with only 706 cores and not 1,400 cores. Did I say I wanted a new GPU for XMAS?
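The warm-up trick is easy to bake into whatever object holds the model. Here is a minimal sketch, assuming the forward pass is available as a plain callable and that you have some dummy input of the right shape to feed it. The class name WarmModel and its attributes are made up for this example.

```python
import time


class WarmModel:
    """Wrap a forward-pass callable and pay its first-call cost at load time."""

    def __init__(self, forward, dummy_input):
        self._forward = forward
        start = time.perf_counter()
        self._forward(dummy_input)  # warm-up pass, result discarded
        self.warmup_seconds = time.perf_counter() - start

    def predict(self, image):
        """Run a normal forward pass; the one-off overhead is already paid."""
        return self._forward(image)
```

In an API service, you would create the WarmModel once when the process starts, so the first real request does not pick up the bill for the slow first pass.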

With FastAPI bringing asynchronous skills that Flask potentially lacked, I have the beginnings of an idea for a service. Presently my home server runs Python3 with uWSGI workers (5) and NGINX workers (5). I could convert to FastAPI with Gunicorn and just stick NGINX in front of that chain.
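Gunicorn reads its settings from a Python file, so the FastAPI side of that chain could start from something like the sketch below. The setting names bind, workers and worker_class are Gunicorn's own, and uvicorn.workers.UvicornWorker is the worker class the Uvicorn project has shipped for running ASGI apps under Gunicorn. The address and the worker count are assumptions to tune, not measured recommendations.

```python
# gunicorn.conf.py: a starting point, every value here is an assumption to tune
bind = "127.0.0.1:8000"  # NGINX would proxy requests to this address
workers = 5  # mirrors the five uWSGI workers mentioned above
worker_class = "uvicorn.workers.UvicornWorker"  # ASGI worker so FastAPI runs under Gunicorn
```

It would be started with gunicorn -c gunicorn.conf.py main:app, where main:app is a placeholder for your own module and FastAPI instance. Each worker is a separate process that loads its own copy of the model, so on a small GPU the worker count may need to come down.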

So the reason I pulled an all-nighter was to figure this stuff out. Nobody will wait 1 second for a mobile app to make an inference, and on top of that come the photo upload, traffic routing and the response round-trip delays. Having built OpenCV with Cuda, I can continue my project and wrap a friendly web UI around my service. There is also TensorFlow.js to consider, which would let me run inference on the actual mobile device.

Photo by The Creative Exchange on Unsplash