
Introduction to Scale-Invariant Feature Transform (SIFT)

Scale-Invariant Feature Transform Summary

SIFT Detection Paper:

Terms used in SIFT paper

LoG = Laplacian of Gaussian

DoG = Difference of Gaussian

BBF = Best-Bin-First

RANSAC = RANdom SAmple Consensus

NN = Nearest Neighbor

IT = Inferior Temporal

What can developers use the SIFT algorithm for?

  • Locate a certain object in an image of many other objects
  • Locate an object between frames in a sequence of images (video)
  • Stitch images together to create a panoramic image
  • Robot localization and mapping
  • 3D scene modeling, recognition and tracking
  • 3D SIFT-like descriptors for human action recognition
  • Analyzing the Human Brain in 3D Magnetic Resonance Images


Review of Tesla’s Short Self-driving Proof of Concept

Autopilot Full Self-Driving Hardware (Neighborhood Short) from Tesla Motors on Vimeo.

The views they provide are (left to right, top to bottom):

  1. Interior cabin & through windshield
  2. The vehicle’s left rearward camera
  3. The vehicle’s medium range (forward) camera
  4. The vehicle’s right rearward camera

Related: Tesla’s HW2 (Hardware 2) sensor suite

Object detection:

  • Motion flow
  • Lane lines (left of vehicle)
  • Lane lines (right of vehicle)
  • Road flow
  • In-path objects
  • Road lights
  • Objects
  • Road signs

The following are my observations. They are not necessarily errors, but they are worth mentioning.

My general observations:

  • On two-lane roads, the far left line is typically not detected by the medium range camera
  • Multiple bounding boxes around the same object
  • Rearward objects labeled as “in-path”
  • Brake pedal moves, but accelerator pedal does not appear to move
  • Slowed down for a crosswalk (0:12)
  • Detected pedestrian near road, but did not consider them to be in-path (0:41)
  • Stopped/slowed down for walkers/joggers near road (0:55)
  • Stopped during right turn (1:02)
  • Detects the back of road signs as signs (1:23)
  • Detected road sign as road light (1:25)
  • Stopped after right turn (1:33)
  • The camera angles not included are: forward narrow, forward wide, left and right side forward facing, and rear facing
  • From the information provided here, we cannot determine whether pedestrians are treated as any other object or separately as a “pedestrian type” object

Flip Image OpenCV Python

OpenCV provides the flip() function which allows for flipping an image or video frame horizontally, vertically, or both.



OpenCV documentation:

Exploring Udacity’s 1st 40GB driving data set

I read about the release of Udacity’s second data set yesterday and wanted to check it out. For convenience, I downloaded the original, smaller data set.

Preface: ROS is only officially supported on Ubuntu & Debian and is experimental on OS X (Homebrew), Gentoo, and OpenEmbedded/Yocto.

Getting the data

Download the data yourself: Driving Data

The data set, which is linked to from the page above, was served up from Amazon S3 and actually seemed quite slow to download, so I let it run late last night and started exploring today.

The compressed download is dataset.bag.tar.gz; after extraction it is a single 42.3 GB file, dataset.bag

.bag is a file type associated with the Robot Operating System (ROS).

Data overview

To get an overview of the file use the rosbag info <filename> command:



There are 28 data topics from on-board sensors, including 3 color cameras. The camera topics are:

  • /center_camera/image_color
  • /left_camera/image_color
  • /right_camera/image_color

Each camera topic has 15212 messages. Dividing 15212 messages by 760 seconds works out to roughly 20 frames per second.
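The frame rate arithmetic can be checked in a few lines of Python. The message count and duration are the figures quoted above; the variable names are only for illustration.

```python
# Figures quoted above for a single camera topic.
messages_per_camera = 15212
recording_seconds = 760

# Average frame rate and the implied spacing between consecutive frames.
frames_per_second = messages_per_camera / recording_seconds
seconds_between_frames = 1 / frames_per_second

print(f"{frames_per_second:.2f} frames per second")
print(f"{seconds_between_frames:.3f} seconds between frames")
```

The spacing of about 0.05 seconds between frames is worth keeping in mind when choosing a seconds-per-frame value for the extractor.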

Viewing the video streams

Converting a camera topic to a standalone video is a two-step process:

  1. export jpegs from the bag file
  2. convert the jpegs to video

Exporting the jpegs

To export the image topic to jpegs, the bag needs to be played back and the frames extracted. This can be done with a launch script. The default filename pattern (frame%04d.jpg) zero-pads frame numbers to four digits, which is too few for the 15212 frames in each topic, so we need to add the following line to change the default pattern into one that allows for 5 digits:
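To illustrate why the extra digit matters, here is a small sketch using Python's printf-style formatting. It assumes the extractor's pattern follows the same printf conventions; the helper name frame_name is hypothetical.

```python
def frame_name(pattern: str, index: int) -> str:
    """Format a frame index with a printf-style filename pattern."""
    return pattern % index


# Two consecutive frames on either side of the four-digit boundary.
indices = (9999, 10000)

four_digit = [frame_name("frame%04d.jpg", i) for i in indices]
five_digit = [frame_name("frame%05d.jpg", i) for i in indices]

# With %04d the names change width at frame 10000, so an alphabetical sort
# no longer matches frame order. With %05d the two orders agree.
print(four_digit, sorted(four_digit) == four_digit)
print(five_digit, sorted(five_digit) == five_digit)
```

Consistent-width names keep the frames in order for any tool that reads them alphabetically when assembling the video.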

The entire script, which launches the player and the extractor:

The number of resulting frames should match the number of topic messages reported by rosbag info.

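If it does not, as in our case, the default seconds-per-frame setting should be changed. It seems counter-intuitive, but raising the value (I tried “0.11” and “0.2”) reduced the number of frames extracted even further. This would be consistent with the setting acting as a minimum interval between saved frames: at roughly 20 frames per second, frames arrive about 0.05 seconds apart, so any larger value skips some of them. I settled on “0.02” seconds per frame, which resulted in the correct number of frames. Add the line to the launch script.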
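The effect can be sketched with a toy model. It assumes the seconds-per-frame value acts as a minimum gap between saved frames and that frames arrive exactly 50 ms apart. Both are simplifying assumptions, the function name is hypothetical, and real playback timing will jitter, so the counts are illustrative, not a prediction of what the extractor produces.

```python
def count_saved_frames(total_frames: int, frame_interval_ms: int, min_gap_ms: int) -> int:
    """Count frames kept when a frame is saved only if at least
    min_gap_ms has passed since the previously saved frame."""
    saved = 0
    last_saved_ms = None
    for index in range(total_frames):
        timestamp_ms = index * frame_interval_ms
        if last_saved_ms is None or timestamp_ms - last_saved_ms >= min_gap_ms:
            saved += 1
            last_saved_ms = timestamp_ms
    return saved


# 15212 frames at about 20 frames per second (50 ms apart),
# with minimum gaps of 0.02, 0.1, 0.11 and 0.2 seconds.
for min_gap_ms in (20, 100, 110, 200):
    print(min_gap_ms, count_saved_frames(15212, 50, min_gap_ms))
```

Under this model only a gap shorter than the frame spacing keeps every frame, which matches the experience described above.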

The working launch script now looks like this:

Download working Left, Center, and Right jpeg export launch scripts on GitHub

The result should be the correct number of frames saved (frame numbering starts at 00000) and the message “[rosbag-1] process has finished cleanly”

Hit Ctrl + C to exit

frame00000.jpg (640×480), extracted from topic /left_camera/image_color

Convert the jpegs to video


License: The data referenced in this post is available under the MIT license. This post is available under CC BY 3.0.

Where to next?