Model Details
Model Description
For applications in mobile robotics and augmented reality, it is critical that models can run on hardware-constrained computers. To this end, XFeat was designed as a hardware-agnostic solution focusing on both accuracy and efficiency in an image-matching pipeline. It uses compact 64-D descriptors and simple architectural components that facilitate deployment on embedded devices. Its performance is comparable to well-known deep local features such as SuperPoint, while it is significantly faster and more lightweight. XFeat also exhibits much better robustness to viewpoint and illumination changes than classic local features such as ORB and SIFT.
Developed by: VeRLab: Laboratory of Computer Vision and Robotics
Shared by:
Model type: keypoint-detection model
License:
Resources for more information:
Training Details
Training Data
The model is trained on a mix of real-scene image pairs and synthetically warped pairs generated from raw, unlabeled COCO images, in a proportion of 6:4, respectively.
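As an illustration of the second data source, below is a minimal sketch of how a synthetically warped training pair can be built from a single unlabeled image; the random-homography parameters are arbitrary assumptions, not the authors' exact augmentation:

import cv2
import numpy as np

def random_warp_pair(image: np.ndarray, max_offset: float = 0.25):
    """Create a (source, warped, H) training pair via a random homography.

    The warped image plus the homography H provide dense ground-truth
    correspondences for free, so no manual labels are required.
    """
    h, w = image.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Jitter each corner by up to `max_offset` of the image size (arbitrary choice).
    jitter = (np.random.rand(4, 2) - 0.5) * 2 * max_offset * np.float32([w, h])
    H = cv2.getPerspectiveTransform(corners, corners + jitter.astype(np.float32))
    warped = cv2.warpPerspective(image, H, (w, h))
    return image, warped, H

# Usage: any raw image (e.g. from COCO) can serve as the source.
img = cv2.imread("example.jpg")
src, dst, H = random_warp_pair(img)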
Testing Details
Metrics
MegaDepth-1500 relative camera pose estimation:

Metric     Value
AUC@5°     50.20
AUC@10°    65.40
AUC@20°    77.10
ACC@10°    85.1
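For reference, here is a minimal sketch of how pose AUC at an angular threshold is conventionally computed from per-pair pose errors; this follows the common evaluation recipe and is not taken from this model's own evaluation scripts:

import numpy as np

def pose_auc(errors_deg, threshold_deg: float) -> float:
    """Area under the recall-vs-error curve, normalized by the threshold.

    `errors_deg` holds the angular pose error (in degrees) for each
    evaluated image pair.
    """
    errors = np.sort(np.asarray(errors_deg, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    keep = errors <= threshold_deg
    # Clip the curve at the threshold and extend it horizontally to the threshold.
    e = np.concatenate(([0.0], errors[keep], [threshold_deg]))
    r = np.concatenate(([0.0], recall[keep], [recall[keep][-1] if keep.any() else 0.0]))
    # Trapezoidal integration, normalized so a perfect result gives 1.0.
    return float(np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0) / threshold_deg)

# Example with a hypothetical list of per-pair errors.
errs = np.array([1.2, 3.4, 7.8, 25.0, 2.1])
print([round(pose_auc(errs, t), 3) for t in (5, 10, 20)])
acc_at_10 = float(np.mean(errs <= 10))  # ACC@10°: fraction of pairs under 10°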
Technical Specifications
Input/Output Details
Input:
Name: images
Tensor: float32[1,3,352,640]
Info: NCHW BGR un-normalized image
Output:
Name: Multiple (please consult NN archive config.json)
Tensor: Multiple (please consult NN archive config.json)
Info: features, keypoints, and heatmaps that require additional post-processing.
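As an illustration of the input contract above, here is a minimal sketch of preparing a frame when the exported model is run outside a DepthAI pipeline (inside a pipeline the camera output is linked directly to the network); the file name and resize strategy are assumptions:

import cv2
import numpy as np

# Target shape from the input spec: float32, NCHW, BGR, no normalization.
HEIGHT, WIDTH = 352, 640

frame = cv2.imread("frame.jpg")                      # OpenCV loads images as BGR
resized = cv2.resize(frame, (WIDTH, HEIGHT))         # (352, 640, 3), uint8
tensor = resized.astype(np.float32)                  # keep raw 0-255 values (un-normalized)
tensor = np.transpose(tensor, (2, 0, 1))[None, ...]  # HWC -> CHW -> NCHW
assert tensor.shape == (1, 3, HEIGHT, WIDTH)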
Model Architecture
Backbone: composed of six convolutional blocks that progressively reduce the spatial resolution while increasing the depth of the network. The backbone includes a combination of basic layers with 2D convolutions, ReLU activations, and Batch Normalization, following a modular design to balance depth and computational efficiency.
Descriptor Head: feature pyramid and basic layers.
Keypoint Head: a minimalist head that applies 1×1 convolutions to an 8×8 grid-transformed image to efficiently regress keypoint coordinates.
Please consult the original paper for more information on the model architecture.
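To make the block structure concrete, here is a hedged PyTorch-style sketch of the kind of basic convolutional block described above; the channel counts and strides are illustrative assumptions, not the released weights' exact configuration:

import torch.nn as nn

class BasicBlock(nn.Sequential):
    """Conv2d + BatchNorm + ReLU: the repeating unit of the backbone."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

# Illustrative stem: strided blocks halve the resolution while the channel
# count grows, mirroring the "six convolutional blocks" description above.
backbone_stem = nn.Sequential(
    BasicBlock(3, 8, stride=2),
    BasicBlock(8, 16, stride=2),
    BasicBlock(16, 32, stride=2),
)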
Throughput
Platform   Throughput [infs/sec]
RVC2*      55.96
RVC4**     174.84
* Benchmarked with 2 threads;
** Benchmarked on the DSP fixed-point runtime with balanced mode.
Quantization
The RVC4 version of the model was quantized using the HubAI General dataset.
Utilization
Models converted for RVC Platforms can be used for inference on OAK devices.
DepthAI pipelines are used to define the information flow linking the device, inference model, and the output parser (as defined in model head(s)).
Below, we present the most important utilization steps for this particular model.
Please consult the docs for more information.
# Create the neural network node and attach it to a camera node
nn = pipeline.create(ParsingNeuralNetwork).build(
    <CameraNode>, model_description
)
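For orientation, here is a hedged sketch of a minimal mono setup around the snippet above; the depthai_nodes import path, the model slug, the camera construction, and the output-queue creation are assumptions that may differ from the current DepthAI v3 and depthai-nodes releases:

import depthai as dai
from depthai_nodes import ParsingNeuralNetwork  # import path may differ between releases

# Hypothetical model slug; take the exact one from this model page.
model_description = dai.NNModelDescription("luxonis/xfeat:mono-352x640")

with dai.Pipeline() as pipeline:
    camera = pipeline.create(dai.node.Camera).build()
    nn = pipeline.create(ParsingNeuralNetwork).build(camera, model_description)
    parser_output_queue = nn.out.createOutputQueue()  # queue used in the loop below
    pipeline.start()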
The mono version requires setting the reference frame to which all upcoming frames will be matched. This can be achieved by triggering the parser inside the pipeline loop, e.g. on a key press:
# Get the parser
parser: XFeatMonoParser = nn.getParser(0)

# Inside the pipeline loop
if cv2.waitKey(1) == ord('s'):
    parser.setTrigger()
Stereo version
The stereo version requires two cameras (left and right), so the usage is slightly different.
First, adjust the model description to request the stereo version and download the NN archive, as sketched below.
The model is parsed with XFeatMonoParser or XFeatStereoParser, which outputs a message with the detected features over time (age=0 for features from the left image and age=1 for features from the right image).
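A possible sketch, assuming the DepthAI v3 zoo helpers dai.getModelFromZoo and dai.NNArchive and a hypothetical stereo model slug (take the exact slug from this model page):

import depthai as dai

# Hypothetical stereo slug; take the exact one from this model page.
stereo_description = dai.NNModelDescription("luxonis/xfeat:stereo-352x640")
# Download the NN archive from the model zoo (helper names assumed from DepthAI v3).
nn_archive = dai.NNArchive(dai.getModelFromZoo(stereo_description))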
Get parsed output(s):
while pipeline.isRunning():
    parser_output: dai.TrackedFeatures = parser_output_queue.get()
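To show what the parsed message contains, here is a hedged continuation of the loop above; the field names follow DepthAI's TrackedFeatures message, and parser_output_queue is assumed to be created as in the earlier mono sketch:

while pipeline.isRunning():
    parser_output: dai.TrackedFeatures = parser_output_queue.get()
    for feature in parser_output.trackedFeatures:
        # Each tracked feature carries an id, an image position, and an age
        # (in stereo mode age 0/1 distinguishes the left/right image).
        x, y = feature.position.x, feature.position.y
        print(feature.id, feature.age, round(x, 1), round(y, 1))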
Example
You can quickly run the model using our example.
We offer two modes of operation: mono and stereo. In mono mode, a single camera is used as input and incoming frames are matched to a reference frame; the reference can be set by pressing the s key in the visualizer, and the matches between each frame and the reference are visualized. In stereo mode, two cameras are used and the matches between the left and right frames are visualized.
The example automatically downloads the model, creates a DepthAI pipeline, runs the inference, and displays the results using our DepthAI visualizer tool.