Building Multimodal Apps with Voice and Vision in .NET MAUI
With .NET MAUI and Azure AI Foundry, you can quickly build multimodal apps that combine voice and vision. Multimodal apps accept speech and images as input, which gives users a richer experience, and .NET MAUI's cross-platform support means the same app runs on phones and computers. With Microsoft.Extensions.AI, connecting AI models is straightforward. Try new ways to talk with your users and make your app smarter.
Key Takeaways
Use .NET MAUI and Azure AI Foundry to build apps that combine voice and vision, giving users a better experience across many devices.
Set up your project first: install the required tools, add the AI packages, and keep your API keys safe in environment variables.
Add voice features such as speech-to-text, a voice chat agent, and voice commands to make your app easier and more engaging to use.
Add vision features such as image capture, image analysis, and OCR so your app can understand pictures.
Blend voice and vision smoothly, support multiple platforms, and plan for user control, security, and performance.
Multimodal Apps Setup
Install Tools
You need a few tools before you start. Check that your computer meets these requirements:
Windows 10 (x64), Windows 11 (x64/ARM), Windows Server 2025, or macOS
At least 8GB RAM and 3GB free disk space
For best results, use 16GB RAM and 15GB free disk space
Optional: A modern GPU like NVIDIA (2000 series or newer), AMD (6000 series or newer), Intel iGPU, Qualcomm Snapdragon X Elite, or Apple silicon
You need an internet connection to download models, and administrator rights to install the tools.
Tip: Hardware acceleration makes your app faster, but you can still build and test without it.
To install the main tools, use these commands:
On Windows:
winget install Microsoft.FoundryLocal
On macOS:
brew tap microsoft/foundrylocal
brew install foundrylocal
Add AI Packages
You need to add some packages to your project. Follow these steps:
Install the .NET SDK that matches your sample repository branch.
Add Microsoft.Extensions.AI to your project using NuGet.
Get your Azure AI Foundry or Azure OpenAI endpoint and API key. Make sure your model supports tool or function calling.
Optionally, get an OpenWeatherMap API key if you want real weather data.
Register services and tools in your app. Use files like Services/HostingExtensions.cs and Services/Tools/*.cs to set up the logic, and ViewModels/ChatViewModel.cs to collect and expose tools through your chat client; the sketch below shows one way to register that client.
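Here is a minimal sketch of one way to register a chat client, assuming the Azure.AI.OpenAI and Microsoft.Extensions.AI.OpenAI packages (the AsIChatClient extension has been renamed across preview releases, so check your installed version):
using System.ClientModel;
using Azure.AI.OpenAI;
using Microsoft.Extensions.AI;
// Sketch: register an Azure OpenAI chat client with DI in MauiProgram.cs.
// Endpoint, key, and model come from the environment variables set below.
builder.Services.AddSingleton<IChatClient>(_ =>
    new AzureOpenAIClient(
            new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!),
            new ApiKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY")!))
        .GetChatClient(Environment.GetEnvironmentVariable("AZURE_OPENAI_MODEL") ?? "gpt-4o-mini")
        .AsIChatClient());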
Secure API Keys
You must keep your API keys out of your code. Environment variables are a simple way to do that. Here is how to set them for the current PowerShell session:
$env:AZURE_OPENAI_ENDPOINT = "your Azure OpenAI endpoint URL"
$env:AZURE_OPENAI_API_KEY = "your API key"
$env:AZURE_OPENAI_MODEL = "gpt-4o-mini" # optional
$env:WEATHER_API_KEY = "your OpenWeatherMap API key" # optional
Note that $env: variables last only for the current session. To persist them for your user account on Windows, use setx or [Environment]::SetEnvironmentVariable with the User scope, then restart your IDE or terminal so it picks up the new values. Never share your API keys in public code or with others.
Note: Keeping secrets safe protects your app and your users.
Voice Features
Speech-to-Text
You can add speech-to-text so users can talk instead of type. Many APIs work with .NET MAUI. Here are some popular ones:
AssemblyAI Universal-Streaming: It is fast and works well for many accents.
Deepgram Nova-3: It supports over 50 languages and can be customized.
AWS Transcribe: It covers more than 100 languages and is good for live use.
Google Cloud Speech-to-Text: It works with 125+ languages but may not be as accurate in real-time.
Microsoft Azure Speech Services: It works easily with .NET MAUI and is accurate.
WhisperX (open-source): You get control and support for 99+ languages, but setup is harder.
Modern services are accurate enough for live captions and voice commands. For .NET MAUI, Microsoft Azure Speech Services is the easiest to integrate. You can also use the Community Toolkit's ISpeechToText to turn speech into text: register the service in your MauiProgram.cs file and handle results with simple events.
builder.Services.AddSingleton<ISpeechToText, SpeechToTextImplementation>();
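Once the service is registered, a minimal sketch of recognizing speech might look like this (method names can differ slightly between Community Toolkit versions, so treat this as a starting point):
using System.Diagnostics;
using System.Globalization;
using CommunityToolkit.Maui.Media;
// Sketch: listen once and return the recognized text, or null on failure.
async Task<string?> RecognizeAsync(ISpeechToText speechToText, CancellationToken token)
{
    // Ask for microphone access first (see the tip below).
    bool granted = await speechToText.RequestPermissions(token);
    if (!granted)
        return null;
    var result = await speechToText.ListenAsync(
        CultureInfo.CurrentCulture,
        new Progress<string>(partial => Debug.WriteLine(partial)), // live partial results
        token);
    return result.IsSuccessful ? result.Text : null;
}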
Tip: Always ask users for microphone access before using speech recognition.
Voice Chat Agent
You can build a voice chat agent that listens and talks back. First, capture speech with your chosen API, send the text to your AI model, and reply using text-to-speech. Use ITextToSpeech from Microsoft.Maui.Media to speak answers, and adjust pitch, volume, and language to fit your app.
await TextToSpeech.Default.SpeakAsync("Hello! How can I help you?");
You can queue multiple speech requests, and cancellation tokens let you stop speech when needed.
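For example, a minimal sketch that speaks with custom options and supports cancellation (SpeechOptions is part of Microsoft.Maui.Media):
// Sketch: speak an answer with custom pitch and volume, and allow cancellation.
var cts = new CancellationTokenSource();
var options = new SpeechOptions
{
    Pitch = 1.2f,  // 0.0 to 2.0
    Volume = 0.8f  // 0.0 to 1.0
};
await TextToSpeech.Default.SpeakAsync("Here is your answer.", options, cts.Token);
// Call cts.Cancel() from another handler to stop speaking early.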
Voice Commands
Users can control your app with voice commands. Use the Community Toolkit's ISpeechToText to listen for commands and match the recognized text to actions in your app, as in the sketch below. For example, saying "Show weather" could display weather info. This makes your app easier to use and more accessible.
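A minimal sketch of command matching might look like this (the command phrases and helper methods are illustrative, not part of any library):
// Sketch: map recognized phrases to app actions.
void HandleVoiceCommand(string recognizedText)
{
    switch (recognizedText.Trim().ToLowerInvariant())
    {
        case "show weather":
            ShowWeatherPage();       // hypothetical navigation helper
            break;
        case "take photo":
            _ = CapturePhotoAsync(); // hypothetical capture helper
            break;
        default:
            // Unrecognized commands can fall through to the chat model.
            break;
    }
}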
Note: Test voice commands in different places to make them work better and help users.
Vision Features
Image Capture
You can let users take photos or select images from their device. .NET MAUI gives you easy tools for this. Use the MediaPicker class to open the camera or photo gallery. Here is how you can capture an image:
FileResult photo = await MediaPicker.CapturePhotoAsync();
if (photo != null)
{
Stream stream = await photo.OpenReadAsync();
// You can now use this stream for analysis
}
Tip: Always ask for camera and storage permissions before you try to capture or pick images.
You can use this feature for many things. For example, you can let users scan receipts, take profile pictures, or share moments.
Image Analysis
You can add AI to your app to understand images. Azure AI Vision and similar services can detect objects, generate captions, or recognize scenes in a photo. First, send the image stream to your AI model; then read back results such as tags, descriptions, or detected faces.
Here is a simple way to analyze an image using Azure AI Vision:
Capture or pick an image.
Convert the image to a byte array.
Send the data to your AI endpoint (see the sketch after this list).
Show the results to the user.
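Here is a minimal sketch using the Azure.AI.Vision.ImageAnalysis client library, assuming VISION_ENDPOINT and VISION_KEY environment variables (placeholder names) hold your resource details:
using System.Diagnostics;
using Azure;
using Azure.AI.Vision.ImageAnalysis;
// Sketch: caption and tag an image with Azure AI Vision.
var client = new ImageAnalysisClient(
    new Uri(Environment.GetEnvironmentVariable("VISION_ENDPOINT")!),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("VISION_KEY")!));
ImageAnalysisResult result = await client.AnalyzeAsync(
    BinaryData.FromStream(stream),                 // the MediaPicker stream from above
    VisualFeatures.Caption | VisualFeatures.Tags);
Debug.WriteLine($"Caption: {result.Caption.Text} ({result.Caption.Confidence:P0})");
foreach (DetectedTag tag in result.Tags.Values)
    Debug.WriteLine($"Tag: {tag.Name} ({tag.Confidence:P0})");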
Note: A simple table of tags, descriptions, and confidence scores works well for presenting the results.
You help users learn more about their photos with just a few lines of code.
OCR
OCR stands for Optical Character Recognition. You can use OCR to read text from images. This is useful for scanning documents, reading signs, or copying text from books. Azure AI Vision and other OCR tools work well with .NET MAUI.
To use OCR:
Capture or pick an image.
Send the image to the OCR service.
Get the text back and show it in your app.
// Pseudocode: OcrService is an app-specific wrapper around your OCR provider
string text = await OcrService.ReadTextAsync(imageStream);
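A more concrete sketch, reusing the Azure AI Vision client from the previous section and its Read feature for OCR:
using System.Linq;
// Sketch: extract printed or handwritten text from an image.
ImageAnalysisResult result = await client.AnalyzeAsync(
    BinaryData.FromStream(imageStream),
    VisualFeatures.Read);
string text = string.Join(
    Environment.NewLine,
    result.Read.Blocks.SelectMany(block => block.Lines).Select(line => line.Text));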
Tip: Always check the text for errors. OCR is powerful, but it may not be perfect. Let users edit the results if needed.
Multimodal Apps Workflow
Combine Inputs
You can combine voice and vision inputs to make your app smarter and easier to use. Start by letting users speak and show images at the same time. For example, a user can say, "What is in this picture?" and then take a photo. Your app listens to the voice, captures the image, and sends both to your AI model. The model can answer with details about the photo.
Follow these steps to combine inputs:
Set up your app to listen for voice commands.
Use the camera or gallery to get images.
Send both the voice text and image data to your AI service (see the sketch after this list).
Show the results to the user in a clear way.
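A minimal sketch of sending both inputs in one request with Microsoft.Extensions.AI (type and method names have shifted between preview releases, so adjust to your installed version):
using System.Diagnostics;
using Microsoft.Extensions.AI;
// Sketch: combine the spoken question and the captured photo in one message.
byte[] imageBytes = File.ReadAllBytes(photo.FullPath); // or buffer the MediaPicker stream
var message = new ChatMessage(ChatRole.User, new AIContent[]
{
    new TextContent(spokenText),              // e.g. "What is in this picture?"
    new DataContent(imageBytes, "image/jpeg")
});
var response = await chatClient.GetResponseAsync(new[] { message });
Debug.WriteLine(response.Text);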
A table or list that pairs the spoken question with the image results works well for display.
Tip: Combining inputs helps users get answers faster and makes your app more flexible.
User Interaction
Designing user interaction in multimodal apps is different from designing single-mode apps. You give users more ways to interact, such as speaking, tapping, or showing images. This makes your app more accurate and efficient, and users can combine voice and vision or switch between them.
Here are some benefits of multimodal user interaction:
Users can choose how they want to interact.
Your app can understand more by using both voice and images.
You make your app easier for everyone, including people with disabilities.
Users get faster and better results.
You need to plan your screens and buttons so users know what to do. Use clear icons for voice and camera. Give feedback when users speak or take a photo. Show results in a simple way, like lists or tables.
Note: Multimodal apps need careful design because you mix several input and output methods. Test your app with real users to make sure it works well.
Cross-Platform
You can run multimodal apps on many devices: phones, tablets, and computers. .NET MAUI lets you build one app that works everywhere, sharing the same code across Android, iOS, Windows, and macOS.
To make your app work well on all platforms:
Check for device features, like camera or microphone.
Ask for permissions before using voice or vision.
Use responsive layouts so your app looks good on any screen size.
Test your app on different devices to find problems early.
Here is a simple code block to check for camera support:
if (MediaPicker.Default.IsCaptureSupported)
{
// Camera is available
}
else
{
// Show message: Camera not found
}
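Permission checks follow the same pattern. A minimal sketch using MAUI's Permissions API:
// Sketch: request camera and microphone access before using voice or vision.
var cameraStatus = await Permissions.RequestAsync<Permissions.Camera>();
var micStatus = await Permissions.RequestAsync<Permissions.Microphone>();
if (cameraStatus != PermissionStatus.Granted || micStatus != PermissionStatus.Granted)
{
    // Explain why the feature is unavailable and point users to settings.
}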
Tip: Cross-platform support lets more people use your app and gives you a bigger audience.
Multimodal apps give users a rich experience by combining voice and vision. You help users get answers quickly and make your app easy to use on any device.
Best Practices
User Control
Let users pick how they use your app. They can choose voice or vision features. Make buttons for starting and stopping voice easy to find. Show icons for camera and microphone so users know what is used. If someone wants to turn off a feature, make it simple.
Add a settings page for users to change permissions.
Use easy prompts to tell why you need the camera or microphone.
Let users check and fix results from speech or image analysis.
Tip: Users trust your app more when they feel in charge.
Security
Keep user data safe and protect your app. Store API keys and secrets in environment variables, not in your code. Always ask before using the camera or microphone. Use HTTPS when sending data to AI services.
Note: Check your app’s security settings often to keep data safe.
Performance
Make sure your app works well on every device. Shrink images before sending them to AI services. Use hardware acceleration if you can. Keep audio and image files small to make things faster. Test your app on different devices to find slow spots and fix them.
Shrink images before uploading.
Use async methods so your app does not freeze.
Cache results to avoid repeated calls to the AI service.
// Example: compress the image before sending it to the AI service.
// ImageService is an app-specific helper; one possible implementation is sketched below.
var compressedImage = await ImageService.CompressAsync(originalImage);
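One way to implement such a helper is with Microsoft.Maui.Graphics, which ships with MAUI (format and quality behavior can vary by platform, so treat this as a sketch):
using Microsoft.Maui.Graphics;
using Microsoft.Maui.Graphics.Platform;
// Sketch: downscale a photo before uploading it to an AI service.
Stream Compress(Stream original, float maxSize = 1024)
{
    IImage image = PlatformImage.FromStream(original);
    IImage resized = image.Downsize(maxSize, disposeOriginal: true);
    var output = new MemoryStream();
    resized.Save(output, ImageFormat.Jpeg, quality: 0.8f);
    output.Position = 0;
    return output;
}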
Tip: Fast apps make users happy and they will use your app more.
You now know how to build multimodal apps with voice and vision using .NET MAUI and Azure AI Foundry. These apps can be smarter and run on many devices. Experiment with new AI models and test features to see what your users like best. Join the Azure AI Foundry community for help and real examples: you can explore sample projects, ask questions in the forums, and learn from people who build with these tools.
FAQ
How do you test voice and vision features in your app?
You can use emulators or real devices. Try speaking commands and taking pictures. See if your app answers the right way. Test on Android, iOS, Windows, and macOS for best results.
What should you do if your API keys stop working?
First, check if you typed the keys right. If they are old, get new ones from your provider. Change your environment variables. Restart your app to use the new keys.
Can you use local AI models instead of cloud services?
Yes, you can run some models on your device. Use Foundry Local for supported models. This saves bandwidth and keeps your data private.
How do you handle user permissions for camera and microphone?
Ask users before using the camera or microphone. Show a message that says why you need access. Respect their choice and let them change permissions in settings.
What is the best way to show results from AI analysis?
Use tables or lists to show results. Show tags, confidence scores, or text you found. Make the layout simple so users can read it fast.