Digital Clone technology for devices – Thoughts from 2018

Image recognition models trained on benchmarks such as ImageNet now match or exceed human accuracy on some classification tasks. Word vectors learned from text corpora have improved the Natural Language understanding capability of software systems. End-to-end Deep Learning has improved Speech Recognition to the point where it has been reported to reach parity with humans on some benchmarks. We can apply these advancements to redefine how customers interact with businesses and with their devices.

Users carry mobile devices all the time. The phone captures information about the user’s favorite applications, search queries, and the places the user visits. Despite this rich user information, the current generation of software and hardware is still not able to recommend the content that the user wants to see. The user still needs to go to a search engine or a portal to get information and content. The user still needs to learn how to interact with each application, as there is no universal interface that can help. Switching context to Augmented Reality (AR) applications, the user needs to painfully drag virtual objects into the real world to see how they look there. The current generation of AR applications has limited support for Natural Language interaction, so the user has to painstakingly move virtual things by hand.

In this invention disclosure, we describe how smarter hardware and software that leverage Natural Language, Image, and User Behavior analysis can improve experiences for users and businesses. We discuss a personalized behavior model of the user that simulates the user’s thinking process. The personalized behavior model takes in aggregated user behavior and the application context, captured from the pixels on the screen that the user is looking at, and helps generate action and content recommendations.

Note that this is unlike the virtual agents from Apple, Google, and Amazon, which are triggered by hot keywords in the user’s speech and don’t use the visual and previous context about the user to generate content recommendations.

Behavior processor:

The current generation of devices knows a great deal about the user. They can capture where the user was, what the user is seeing currently and has seen and read in the past, whom the user talked to, what messages the user sent to friends, etc. Deep Learning and Reinforcement Learning techniques have improved Image Understanding, Text Extraction, and Natural Language Understanding capabilities.

Despite the rich information available about the user and the technology progress we have seen, the current generation of devices is still not able to predict the content that the user likes. The user still has to go to a search engine and painfully type a search query on a small keyboard to find the information they want. The user still has to go to the browser and type to read news articles. The user can’t interact with the applications on the device in Natural Language, even though machines can now read, understand what the user is reading, and answer questions in Natural Language.

In this invention, we describe a hardware and software component called the behavior analyzer (referred to as the behavior processor in the rest of this disclosure), which can be embedded into devices. Running on the device, it can use the application context, obtained by modeling what the user is looking at, together with aggregated information about the user to generate content that the user will like and to execute actions on behalf of the user.

In an embodiment, to figure out what the user is reading and viewing on the phone, the behavior processor has a Location Analyzer, Vision Analyzer, Text Analyzer, Application Context Analyzer, Memory component, Controller component, and Model Manager. Using these components, the behavior processor forms a hypothesis about the activity of the user, continuously learns from user interactions with content on the mobile phone, and tries to generate appropriate content or a response in a timely manner.
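
One way to picture how these components could fit together is sketched below. This is only an illustration under our own assumptions: all class, method, and key names are hypothetical, each analyzer is reduced to a plain function that returns a dictionary of features, and the model is a placeholder callable rather than a trained network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types: an analyzer maps a raw device snapshot to named features.
Snapshot = Dict[str, object]
Features = Dict[str, object]
Analyzer = Callable[[Snapshot], Features]


@dataclass
class Memory:
    """Keeps the most recent feature sets so the model can see past context."""
    capacity: int = 100  # assumed to be at least 1
    history: List[Features] = field(default_factory=list)

    def remember(self, features: Features) -> None:
        self.history.append(features)
        # Drop the oldest entries once the buffer holds more than `capacity`.
        del self.history[:-self.capacity]


@dataclass
class BehaviorProcessor:
    """Controller: runs every analyzer, queries the model, then updates memory."""
    analyzers: Dict[str, Analyzer]
    model: Callable[[Features, List[Features]], str]  # placeholder for a learned model
    memory: Memory = field(default_factory=Memory)

    def step(self, snapshot: Snapshot) -> str:
        features: Features = {}
        for name, analyze in self.analyzers.items():
            # Prefix each key with the analyzer name to avoid collisions.
            for key, value in analyze(snapshot).items():
                features[f"{name}.{key}"] = value
        recommendation = self.model(features, self.memory.history)
        self.memory.remember(features)
        return recommendation


# Example wiring with two toy analyzers and a rule standing in for the model.
processor = BehaviorProcessor(
    analyzers={
        "location": lambda s: {"place": s.get("gps", "unknown")},
        "app": lambda s: {"name": s.get("foreground_app", "unknown")},
    },
    model=lambda f, history: "suggest_share" if f["app.name"] == "camera" else "none",
)
result = processor.step({"gps": "home", "foreground_app": "camera"})
```

In this sketch the Model Manager is left out; it would be the piece that swaps the `model` callable when a new general or personalized model arrives.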

In an embodiment, we can use a combination of multiple architectures as will be discussed in the disclosure below to generate content and action recommendations based on the context.

Behavior processor to recommend content:

Let us say a user is taking pictures of his family at New Year. If the user is generally active on a social network such as Facebook and usually posts pictures there after taking them, then there is a high probability that he will share the New Year pictures as well. In the current experience, the user has to go to the social network, choose the pictures taken with the camera, and then post them.

Switching context to another experience, would it not be easier for the user if the device showed search results before the user decides to go to a search engine and type a search query?

Experiences like the above can be improved substantially using the behavior processor. The behavior processor can run as an offline process, whenever the user starts an application, or as a process that executes every ‘x’ minutes or so, where ‘x’ is a configurable parameter. The Application Context Analyzer component can take the pixels on the device that the user is looking at and process them with a text recognition component to extract text embeddings. The pixels can also be fed to an object detection DNN to get image embeddings for the application. In an embodiment, we can train a general model for the behavior processor based on the user cluster associated with the device. In embodiments, the users can be clustered using a K-Means clustering algorithm.
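
To make the clustering step concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. The two per-user features in the example (social posts per day and searches per day) are made up for illustration; the disclosure does not say which features describe a user, and a real system would likely cluster on learned embeddings with a library implementation.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 50,
           seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Lloyd's algorithm: returns (centroids, cluster index for each point)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: nothing moved since the last update
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical user features: [social posts per day, searches per day].
users = [[9.0, 1.0], [8.0, 2.0], [10.0, 1.5], [1.0, 7.0], [0.5, 9.0], [1.5, 8.0]]
centroids, clusters = kmeans(users, k=2)
```

Each device would then be associated with the cluster of its user, and the general model for that cluster would be the starting point for personalization.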

The generalized model for the user cluster can be trained using a Neural Network on anonymized training data from the users in the cluster. To build the general model, we will use techniques borrowed from the provisional document 62543400, titled “Techniques to improve Content Presentation Experiences for businesses”. In an embodiment, the generalized model for predicting a sequence of actions for the user can be built by training a Recurrent Neural Network or a Sequence-to-Sequence model with attention on the user activity.

The generalized model can be built with a Deep Neural Network by feeding it training data from the Location Analyzer, Vision Analyzer, Application Context Analyzer, and Memory component, together with the application actions and content follow-ups observed within the user cluster. The DNN will be trained to predict application actions, such as entering search engine queries, sharing with social networks, sending an SMS message to a friend, or calling a merchant, and to generate content, such as proactively showing interesting news items and updates from the social network.

The trained general model for the user cluster can then be pushed to the device. In an embodiment, the model manager component will initialize the general model either during the device setup or as part of the booting process.

The general model can then be further retrained and personalized for the user. In an embodiment, this can be done using Reinforcement Learning methods. We can model content and action recommendation as a Markov Decision Process (MDP). The aggregated user behavior, along with updates from social networks and news articles, can be the state for the user. The possible actions can be to show a content recommendation, display an application action, or do nothing. The reward function can reward correctly predicting the user’s action at time t. We can then use Policy Learning or Value Iteration approaches to choose an action. To start with, a general Reinforcement Learning model can be learned offline on the user cluster, using the generalized model. The general model can then be personalized to the user by adjusting the Reinforcement Learning model to maximize explicit user interaction. The personalized user behavior model can then be persisted on a remote server over the internet. The personalized model can be used on other user devices and internet ecosystems.
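
As a rough illustration of this MDP framing, the sketch below runs tabular Q-learning over a replayed interaction log. Tabular Q-learning is our choice of a simple value-based method and is not named in the disclosure; the state names, the log format, and the hyperparameters are all hypothetical. The reward is 1 when the chosen action matches what the user actually did at that step and 0 otherwise.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# The three actions described in the text.
ACTIONS = ["show_content", "show_app_action", "do_nothing"]

# A log entry is (state, action the user actually wanted at that time step).
Log = List[Tuple[str, str]]


def train_q_table(log: Log, episodes: int = 500, alpha: float = 0.2,
                  gamma: float = 0.5, epsilon: float = 0.2,
                  seed: int = 0) -> Dict[str, Dict[str, float]]:
    """Tabular Q-learning with epsilon-greedy exploration over a replayed log."""
    rng = random.Random(seed)
    q: Dict[str, Dict[str, float]] = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    for _ in range(episodes):
        for t, (state, user_choice) in enumerate(log):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])  # exploit
            # Reward: did we predict the user's action at time t?
            reward = 1.0 if action == user_choice else 0.0
            # Value of the next logged state, or 0 at the end of the log.
            future = max(q[log[t + 1][0]].values()) if t + 1 < len(log) else 0.0
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
    return q


def greedy_action(q: Dict[str, Dict[str, float]], state: str) -> str:
    return max(ACTIONS, key=lambda a: q[state][a])


# Hypothetical log: photos are usually shared, news is read, idle means no help.
log = [("took_photos", "show_app_action"),
       ("reading_news", "show_content"),
       ("idle", "do_nothing")]
q_table = train_q_table(log)
policy = {state: greedy_action(q_table, state) for state, _ in log}
```

In this toy setup the next state comes from the log and does not depend on the chosen action, so the problem is close to a contextual bandit; a fuller model would let recommendations influence what the user does next.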

In another embodiment, we can use an end-to-end Neural Network with an architecture consisting of Policy Gradient Deep Reinforcement Learning on top of a Deep Neural Network (DNN). The DNN with attention can generate user behavior embeddings from the offline user cluster behavior data. The generic model can then be personalized for the user by adjusting the loss function in the Policy Gradient Deep Reinforcement Learning to predict the user’s actions.
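
The personalization step can be sketched with the simplest policy gradient method, REINFORCE, applied to a softmax policy. This is a deliberately reduced stand-in: instead of a DNN producing behavior embeddings, the policy here is a table of per-state logits, and the state names, starting logits, and hyperparameters are hypothetical. The general (cluster-level) logits are copied and then nudged toward the individual user's observed choices.

```python
import math
import random
from typing import Dict, List, Tuple

ACTIONS = ["show_content", "show_app_action", "do_nothing"]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def personalize(general_logits: Dict[str, List[float]],
                user_log: List[Tuple[str, str]],
                steps: int = 2000, lr: float = 0.5,
                seed: int = 0) -> Dict[str, List[float]]:
    """Fine-tune a copy of the general policy on one user's interactions."""
    rng = random.Random(seed)
    logits = {s: list(v) for s, v in general_logits.items()}  # leave input intact
    for _ in range(steps):
        state, user_choice = rng.choice(user_log)
        row = logits.setdefault(state, [0.0] * len(ACTIONS))
        probs = softmax(row)
        # Sample an action from the current policy.
        chosen = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        reward = 1.0 if ACTIONS[chosen] == user_choice else 0.0
        # REINFORCE: d/d logit_a of log pi(chosen | state) = 1[a == chosen] - probs[a].
        for a in range(len(ACTIONS)):
            grad = (1.0 if a == chosen else 0.0) - probs[a]
            row[a] += lr * reward * grad
    return logits


def best_action(logits: Dict[str, List[float]], state: str) -> str:
    row = logits[state]
    return ACTIONS[row.index(max(row))]


# The cluster-level policy prefers showing content after photos are taken,
# but this particular user tends to share them instead.
general = {"took_photos": [2.0, 0.0, 0.0]}
user_log = [("took_photos", "show_app_action"), ("idle", "do_nothing")]
personal = personalize(general, user_log)
```

Because the reward is either 0 or 1 and no baseline is used, an update only happens when the sampled action matches the user's choice, which steadily raises the probability of that action for that state.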

In yet another embodiment, we can train a general model to do imitation learning for user clusters on the behavior sequence data. We can then apply techniques from One-Shot Learning to fine-tune user behavior.

Note that we are proposing a different architecture for personalization and for simulating user behavior than the one used by the current generation of ML models. Most current systems are built on the premise of a single global model for all groups of users. Personalization is done within that single global model by adding user features as an additional input to the model. A single model for all users substantially simplifies validation and debugging. The architecture in this disclosure instead builds a separate model for each user. A per-user model has more freedom to choose parameters that are applicable to that specific user, and it can be made complex enough to mimic the complex behavior and actions of that user. It also removes the burden of optimizing a single model across all groups of users.

Behavior processor as a virtual agent for an application:

Patent Application US 15/356,512 describes a Virtual Agent for an application or website that can converse in Natural Language by using external API integration. The behavior processor can also act as a virtual agent that interacts in Natural Language or Natural Speech for an application, without the application manually adding an external API service.

The behavior processor has the application context: what the user is seeing, what the user is looking at in the application, who the user is, and the buttons and text in the application. The behavior processor will also have access to external intelligence added by manual rules and/or derived by crawling the application. The behavior processor can also use information about the user aggregated from multiple ecosystems.

In an embodiment, the behavior processor can use the information identified in the above paragraph to answer questions about the service in the application and perform actions in the application.

In an embodiment, the behavior processor can use Imitation Learning and one-shot learning approaches to execute actions in the application on behalf of the user. The behavior processor can learn from other user interactions that happen on the cloud.

Behavior processor to help with Augmented Reality application:

Companies such as Flipkart, Amazon, and Walmart sell furniture, dresses, shoes, and other merchandise in their mobile eCommerce apps. Before purchasing, the user wants to see how the furniture fits in their living room or how a dress fits on them.

The eCommerce companies use augmented reality experiences to increase user engagement with merchandise in their mobile applications. For instance, a user can choose a TV stand from the furniture category, point the camera on their mobile phone at their living room, and move the chosen TV stand around to get a physical sense of how it looks in the living room.

This painful experience of moving a virtual object such as furniture from the mobile app into the physical world can be improved by adding to the flow a software Virtual Agent that can interact in Natural Language. This virtual agent can be embedded within the app or can be triggered through a general voice agent on the phone, such as Siri on iPhone or Google Assistant on Android. The virtual agent can be embedded in the app using third-party library code or be part of the application itself. The behavior processor described above can also act as a virtual agent for the eCommerce application.

The virtual agent can take the voice input, optionally convert the voice to text, and figure out the intent of the user. The entities associated with the user utterance can be extracted using slot filling algorithms. The bitmap of the images captured by the physical camera, a textual description of the image, and the image of the object being shopped for can be provided as additional context to the virtual agent. The virtual agent can use this additional context in figuring out intents and entities.
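
To show what intent detection and slot filling produce, here is a small rule-based sketch. A production agent would use learned models; the regular expression, the intent name `place_object`, and the `Context` fields below are hypothetical and only cover utterances of the form "how X looks in Y". The example also shows how additional context can fill a slot when the user says "it" instead of naming the product.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Context:
    """Additional context passed to the agent; field names are hypothetical."""
    selected_product: Optional[str] = None   # title of the item open in the shopping app
    scene_description: Optional[str] = None  # caption of the camera image


@dataclass
class Parse:
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


# Matches "... how <object> look(s) in|on <location>"; \b avoids matching "show".
PLACE_PATTERN = re.compile(
    r"\bhow\s+(?P<object>.+?)\s+looks?\s+(?:in|on)\s+(?P<location>.+)",
    re.IGNORECASE,
)
PRONOUNS = {"it", "this", "that"}
DETERMINERS = ("my ", "the ", "a ", "an ", "this ", "that ")


def strip_determiner(text: str) -> str:
    lowered = text.lower()
    for det in DETERMINERS:
        if lowered.startswith(det):
            return text[len(det):]
    return text


def parse_utterance(utterance: str, context: Context) -> Parse:
    match = PLACE_PATTERN.search(utterance.strip().rstrip(".?!"))
    if not match:
        return Parse(intent="unknown")
    obj = match.group("object").strip()
    location = strip_determiner(match.group("location").strip())
    if obj.lower() in PRONOUNS:
        # Resolve "it" from the application context, if available.
        if context.selected_product is None:
            return Parse(intent="unknown")
        obj = context.selected_product
    else:
        obj = strip_determiner(obj)
    return Parse(intent="place_object", slots={"object": obj, "location": location})


first = parse_utterance("I want to see how Guppy fish looks in my aquarium", Context())
second = parse_utterance("Show me how it looks in the living room",
                         Context(selected_product="TV stand"))
```

The resulting intent and slots would then be handed to whatever component executes the action, such as placing the virtual object into the camera scene.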

In an embodiment, the virtual agent can use Neural Module Networks to understand the Virtual Image in the application, the title, and category of the image, the Physical Context, and the Natural Language utterance. In an implementation, the Neural Module Networks can be dynamically assembled by parsing the Natural Language utterance. In another embodiment, we can train an end to end model using Reinforcement learning.

After understanding the intent using Neural Modules, we need to complete the action. One action can be to move the virtual object from the site into the physical environment of the user. Another example of an action is to take a fish from Google Images and put it into a physical aquarium to see how the virtual fish looks in the aquarium at home.

Action sequences for an intent, such as moving an object from one location to another, can be configured manually for a Natural Language intent. A Deep Neural Network can also be used to learn actions from training data consisting of actions, Natural Language utterances, and scene input. In an embodiment, we can use a Deep Reinforcement Learning approach on top of Neural Modules for Natural Language Understanding, Object Detection, and Scene Understanding to execute actions.

In another embodiment, we can use Imitation Learning techniques to execute action sequences. We can use techniques borrowed from search engine query rewriting to gather training data for imitation learning. For instance, let us say a user says “I want to see how Guppy fish looks in my aquarium” while pointing the Augmented Reality device at his aquarium.

Let us say the behavior processor does not recognize the utterance in the context of the visual scene and says “Sorry, I can’t help you”. The user will then go to an image search engine, search for Guppy fish, and then move the Guppy fish to the aquarium.

The behavior processor can learn from this interaction for the user cluster and apply it in future interactions. This can be done by applying one-shot learning techniques to the general model that we trained for AR applications.

Unified Model for different application scenarios:

In this disclosure, we talked about how a Behavior Processor can use application and user context to simplify user interactions.

We proposed different use cases for the Behavior Processor, and different DNN architectures for those use cases. We can combine the different use cases into a unified software component. In an embodiment, we can run a simple Deep Learning classifier on the application and user context to decide which model to run. In another embodiment, we can train an end-to-end Neural Network on all the use cases and build a unified model to help the user in different application contexts.
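
The first of these two embodiments, a classifier that picks which model to run, amounts to a dispatch step. The sketch below shows only that routing structure: the classifier is a hand-written rule standing in for the learned one, and the context keys, use-case names, and model callables are all hypothetical.

```python
from typing import Callable, Dict

Context = Dict[str, object]
Model = Callable[[Context], str]


def classify_use_case(context: Context) -> str:
    """Stand-in for the learned classifier: simple rules over hypothetical keys."""
    if context.get("camera_active") and context.get("app_category") == "shopping":
        return "augmented_reality"
    if context.get("user_utterance"):
        return "virtual_agent"
    return "content_recommendation"


class UnifiedBehaviorProcessor:
    """Routes each request to the model registered for the detected use case."""

    def __init__(self, models: Dict[str, Model],
                 classifier: Callable[[Context], str] = classify_use_case) -> None:
        self.models = models
        self.classifier = classifier

    def handle(self, context: Context) -> str:
        use_case = self.classifier(context)
        return self.models[use_case](context)


# Toy models that just report which branch handled the request.
unified = UnifiedBehaviorProcessor(models={
    "augmented_reality": lambda ctx: "ar_model",
    "virtual_agent": lambda ctx: "agent_model",
    "content_recommendation": lambda ctx: "recommender_model",
})
handled_by = unified.handle({"camera_active": True, "app_category": "shopping"})
```

Swapping `classify_use_case` for a trained classifier would not change the rest of the component, which is the main appeal of this design compared with a single end-to-end model.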


In this disclosure, we propose a behavior processor on the user’s devices. The behavior processor simulates user behavior by leveraging application and user context and helps the user with different use cases using Natural Language and Vision techniques.

