Model Details
Model Description
For applications in mobile robotics and augmented reality, it is critical that models can run on hardware-constrained computers. To this end, XFeat was designed as a hardware-agnostic solution focusing on both accuracy and efficiency in an image-matching pipeline. It uses compact 64-D descriptors and simple architectural components that facilitate deployment on embedded devices. Its performance is comparable to well-known deep local features such as SuperPoint, while it is significantly faster and more lightweight. XFeat also exhibits much better robustness to viewpoint and illumination changes than classic local features such as ORB and SIFT.
Developed by: VeRLab: Laboratory of Computer Vision and Robotics
Shared by:
Model type: keypoint-detection model
License:
Resources for more information:
Training Details
Training Data
The model is trained on a mix of real-scene image pairs and synthetically warped pairs generated from raw, unlabeled COCO images, in a proportion of 6:4, respectively.
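As an illustration of the second data source, below is a minimal sketch of how a synthetically warped training pair can be built from a single unlabeled image; the random-homography parameters are arbitrary assumptions, not the authors' exact augmentation:

import cv2
import numpy as np

def random_warp_pair(image: np.ndarray, max_offset: float = 0.25):
    """Create a (source, warped, H) training pair via a random homography.

    The warped image plus the homography H provide dense ground-truth
    correspondences for free, so no manual labels are required.
    """
    h, w = image.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Jitter each corner by up to `max_offset` of the image size (arbitrary choice).
    jitter = (np.random.rand(4, 2) - 0.5) * 2 * max_offset * np.float32([w, h])
    H = cv2.getPerspectiveTransform(corners, corners + jitter.astype(np.float32))
    warped = cv2.warpPerspective(image, H, (w, h))
    return image, warped, H

# Usage: any raw image (e.g. from COCO) can serve as the source.
img = cv2.imread("example.jpg")
src, dst, H = random_warp_pair(img)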
Testing Details
Metrics
MegaDepth-1500 relative camera pose estimation:

Metric     Value
AUC@5°     50.20
AUC@10°    65.40
AUC@20°    77.10
ACC@10°    85.1
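For reference, here is a minimal sketch of how pose AUC at an angular threshold is conventionally computed from per-pair pose errors; this follows the common evaluation recipe and is not taken from this model's own evaluation scripts:

import numpy as np

def pose_auc(errors_deg, threshold_deg: float) -> float:
    """Area under the recall-vs-error curve, normalized by the threshold.

    `errors_deg` holds the angular pose error (in degrees) for each
    evaluated image pair.
    """
    errors = np.sort(np.asarray(errors_deg, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    keep = errors <= threshold_deg
    # Clip the curve at the threshold and extend it horizontally to the threshold.
    e = np.concatenate(([0.0], errors[keep], [threshold_deg]))
    r = np.concatenate(([0.0], recall[keep], [recall[keep][-1] if keep.any() else 0.0]))
    # Trapezoidal integration, normalized so a perfect result gives 1.0.
    return float(np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0) / threshold_deg)

# Example with a hypothetical list of per-pair errors.
errs = np.array([1.2, 3.4, 7.8, 25.0, 2.1])
print([round(pose_auc(errs, t), 3) for t in (5, 10, 20)])
acc_at_10 = float(np.mean(errs <= 10))  # ACC@10°: fraction of pairs under 10°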
Technical Specifications
Input/Output Details
Input:
Name: images
Tensor: float32[1,3,352,640]
Info: NCHW BGR un-normalized image
Output:
Name: Multiple (please consult NN archive config.json)
Tensor: Multiple (please consult NN archive config.json)
Info: features, keypoints, and heatmaps that require additional post-processing.
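As an illustration of the input contract above, here is a minimal sketch of preparing a frame when the exported model is run outside a DepthAI pipeline (inside a pipeline the camera output is linked directly to the network); the file name and resize strategy are assumptions:

import cv2
import numpy as np

# Target shape from the input spec: float32, NCHW, BGR, no normalization.
HEIGHT, WIDTH = 352, 640

frame = cv2.imread("frame.jpg")                      # OpenCV loads images as BGR
resized = cv2.resize(frame, (WIDTH, HEIGHT))         # (352, 640, 3), uint8
tensor = resized.astype(np.float32)                  # keep raw 0-255 values (un-normalized)
tensor = np.transpose(tensor, (2, 0, 1))[None, ...]  # HWC -> CHW -> NCHW
assert tensor.shape == (1, 3, HEIGHT, WIDTH)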
Model Architecture
Backbone: composed of six convolutional blocks that progressively reduce the spatial resolution while increasing the depth of the network. The backbone includes a combination of basic layers with 2D convolutions, ReLU activations, and Batch Normalization, following a modular design to balance depth and computational efficiency.
Descriptor Head: feature pyramid and basic layers.
Keypoint Head: a minimalist head that applies 1×1 convolutions to an 8×8 grid-transformed image to efficiently regress keypoint coordinates.
Please consult the original paper for more information on the model architecture.
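To make the block structure concrete, here is a hedged PyTorch-style sketch of the kind of basic convolutional block described above; the channel counts and strides are illustrative assumptions, not the released weights' exact configuration:

import torch.nn as nn

class BasicBlock(nn.Sequential):
    """Conv2d + BatchNorm + ReLU: the repeating unit of the backbone."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

# Illustrative stem: strided blocks halve the resolution while the channel
# count grows, mirroring the "six convolutional blocks" description above.
backbone_stem = nn.Sequential(
    BasicBlock(3, 8, stride=2),
    BasicBlock(8, 16, stride=2),
    BasicBlock(16, 32, stride=2),
)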
Throughput
Platform   Throughput [infs/sec]
RVC2*      55.96
RVC4**     174.84
* Benchmarked with 2 threads;
** Benchmarked on the DSP fixed-point runtime with balanced mode.
Quantization
The RVC4 version of the model was quantized using the HubAI General dataset.
Utilization
Models converted for RVC Platforms can be used for inference on OAK devices.
DepthAI pipelines are used to define the information flow linking the device, inference model, and the output parser (as defined in model head(s)).
Below, we present the most important utilization steps for this particular model.
Please consult the docs for more information.
# Create the neural network node and attach it to a camera node
nn = pipeline.create(ParsingNeuralNetwork).build(
    <CameraNode>, model_description
)
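For orientation, here is a hedged sketch of a minimal mono setup around the snippet above; the depthai_nodes import path, the model slug, the camera construction, and the output-queue creation are assumptions that may differ from the current DepthAI v3 and depthai-nodes releases:

import depthai as dai
from depthai_nodes import ParsingNeuralNetwork  # import path may differ between releases

# Hypothetical model slug; take the exact one from this model page.
model_description = dai.NNModelDescription("luxonis/xfeat:mono-352x640")

with dai.Pipeline() as pipeline:
    camera = pipeline.create(dai.node.Camera).build()
    nn = pipeline.create(ParsingNeuralNetwork).build(camera, model_description)
    parser_output_queue = nn.out.createOutputQueue()  # queue used in the loop below
    pipeline.start()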
The mono version requires setting the reference frame to which all upcoming frames will be matched. This can be achieved by triggering the parser inside the pipeline loop, e.g. on a key press:
# Get the parser
parser: XFeatMonoParser = nn.getParser(0)

# Inside the pipeline loop
if cv2.waitKey(1) == ord('s'):
    parser.setTrigger()
Stereo version
The stereo version requires two cameras (left and right), so the usage is slightly different.
First, adjust the model description to request the stereo version and download the NN archive, as sketched below.
The model is parsed with XFeatMonoParser or XFeatStereoParser, which outputs a message with the detected features over time (age=0 for features from the left image and age=1 for features from the right image).
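A possible sketch, assuming the DepthAI v3 zoo helpers dai.getModelFromZoo and dai.NNArchive and a hypothetical stereo model slug (take the exact slug from this model page):

import depthai as dai

# Hypothetical stereo slug; take the exact one from this model page.
stereo_description = dai.NNModelDescription("luxonis/xfeat:stereo-352x640")
# Download the NN archive from the model zoo (helper names assumed from DepthAI v3).
nn_archive = dai.NNArchive(dai.getModelFromZoo(stereo_description))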
Get parsed output(s):
while pipeline.isRunning():
    parser_output: dai.TrackedFeatures = parser_output_queue.get()
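To show what the parsed message contains, here is a hedged continuation of the loop above; the field names follow DepthAI's TrackedFeatures message, and parser_output_queue is assumed to be created as in the earlier mono sketch:

while pipeline.isRunning():
    parser_output: dai.TrackedFeatures = parser_output_queue.get()
    for feature in parser_output.trackedFeatures:
        # Each tracked feature carries an id, an image position, and an age
        # (in stereo mode age 0/1 distinguishes the left/right image).
        x, y = feature.position.x, feature.position.y
        print(feature.id, feature.age, round(x, 1), round(y, 1))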
Example
You can quickly run the model using our example.
We offer two modes of operation: mono and stereo. In mono mode, a single camera is used as input and incoming frames are matched to a reference frame; the reference can be set by pressing the s key in the visualizer, and the matches between each frame and the reference are visualized. In stereo mode, two cameras are used and the matches between the left and right frames are visualized.
The example automatically downloads the model, creates a DepthAI pipeline, runs the inference, and displays the results using our DepthAI visualizer tool.