Elasticsearch – The Smarter Way to Search

August 24, 2016

By: Marlene Laus – Marketing Specialist

No matter the size of your business or the industry you're in, the ability to efficiently locate content on demand is paramount. To locate an item, one must search, and to search there must be a pool of data that can be queried – an index. The act of indexing creates searchable information that users can query to locate their target. Unfortunately, as businesses grow and content volumes rise, it becomes increasingly difficult for some search systems to keep up with the demands of indexing and querying, let alone return quality results.


This is where Elasticsearch comes in. Elasticsearch changes the way your documents are both indexed and queried, making it more efficient and faster than ever to locate your content. TEAM has taken this technology to the next level by integrating it with Oracle WebCenter Content, an industry-leading Enterprise Content Management (ECM) system. By integrating Elasticsearch with WebCenter Content, you are able to get the latest, fastest and most advanced functionality.


Unmatched Speed

Traditionally, when content is checked into WebCenter Content it passes through Oracle Text Search (OTS), which indexes the data so that a document can be found by any words it contains and by the metadata applied to it. Elasticsearch takes this further by providing an indexing system that indexes your documents in real time at a remarkably fast speed, resulting in almost instant search results. Operating at an unmatched speed, this advanced indexing system is able to support big repositories with high search volumes.
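To make this concrete, here is a minimal sketch of indexing a document with the official Elasticsearch Python client (the index name, ID, and fields are illustrative, and the call style follows the newer 8.x client API):

 from elasticsearch import Elasticsearch

 es = Elasticsearch("http://localhost:9200")

 # Index a document's extracted text and metadata; by default Elasticsearch
 # refreshes the index about once per second, so the item becomes searchable
 # almost immediately after this call returns.
 es.index(index="wcc-content", id="WCC000042", document={
     "title": "Q3 Marketing Plan",
     "author": "mlaus",
     "text": "Full extracted text of the document goes here...",
 })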

Scalable

One of the most impressive features of Elasticsearch is how easily it scales, both horizontally and vertically. Scale vertically by adding memory and processors to individual servers, and the service will attempt to utilize everything you give it. Scale horizontally by adding servers to form clusters, and they'll operate in tandem to power your search efforts.


As repositories continue to grow, it can become difficult for typical indexing systems to keep up with the demand. Indexes start to break down, and it can take days or even weeks to rebuild them, possibly rendering your search capabilities inactive. This is where Elasticsearch changes the game for organizations with large repositories: what would normally take hours or days, Elasticsearch can efficiently rebuild within minutes.

Custom and Customizable

Elasticsearch enables you to get the most out of your search indexes by allowing you to customize them! At a basic level, you can choose which metadata fields to use for searching your documents and assign friendly metadata names of your choosing. For example, if you want to find a document that contains the word "Marketing" in the title, you can type "title: Marketing" into the search bar and Elasticsearch will pull all of the documents in your repository with "Marketing" in the title metadata field. At a more complex level, the Elasticsearch index allows dozens of intricate index configuration options, all of which are available to use with the integration. The different ways you can choose to index and search for documents within your repository are virtually limitless. Elasticsearch puts the power in the hands of the user.
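Behind the scenes, a "title: Marketing" search maps naturally onto Elasticsearch's query_string query. A rough sketch with the Python client (index and field names are again illustrative):

 from elasticsearch import Elasticsearch

 es = Elasticsearch("http://localhost:9200")

 # "title: Marketing" in the search bar becomes a field-scoped query.
 results = es.search(index="wcc-content", query={
     "query_string": {"query": "title:Marketing"}
 })
 for hit in results["hits"]["hits"]:
     print(hit["_score"], hit["_source"]["title"])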


Finally, Elasticsearch enables you to utilize a practice called "stemming". For example, if you query for the word "run," Elasticsearch will compile your results and include documents that contain words where "run" is a stem, such as "runner" and "running". The non-exact results are returned with lower confidence than the exact matches, but you can rest easy knowing your results will be as complete as possible.
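Stemming is controlled by the analyzer attached to a field at index time. A minimal sketch of an English-stemming analyzer (index and field names are illustrative, not the integration's actual configuration):

 from elasticsearch import Elasticsearch

 es = Elasticsearch("http://localhost:9200")

 # Create an index whose analyzer reduces English words to their stems,
 # so "runner" and "running" are indexed under the stem "run".
 es.indices.create(
     index="wcc-content",
     settings={
         "analysis": {
             "filter": {
                 "english_stemmer": {"type": "stemmer", "language": "english"}
             },
             "analyzer": {
                 "english_stemming": {
                     "tokenizer": "standard",
                     "filter": ["lowercase", "english_stemmer"],
                 }
             },
         }
     },
     mappings={
         "properties": {"text": {"type": "text", "analyzer": "english_stemming"}}
     },
 )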

Take Control of Your Searches

In a nutshell, Elasticsearch delivers faster indexing and quicker, more accurate search results, makes it easier than ever to utilize customized indexes, and remains fault- and load-tolerant throughout. The possibilities are endless with Elasticsearch supercharging the search capabilities of your WebCenter Content repository.


How to Synchronize WebCenter Capture Configurations between Environments

July 29, 2016

By: Dwayne Parkinson – Solution Architect

Anyone who has used WebCenter Enterprise Capture appreciates the power and flexibility the tool provides for scanning, importing, converting, processing and exporting content. Its power and flexibility come from its wide range of configuration options.  However, in some situations this flexibility also creates problems as system administrators try to keep Capture configurations synchronized between environments.  Thankfully Oracle has provided a relatively easy way to synchronize Capture configurations between environments.

The process of migrating WebCenter Enterprise Capture configurations involves the following steps:

1. Start the WebLogic Scripting Tool on the server running the WebCenter Capture instance that you want to export from. This is done by running wlst.cmd or wlst.sh, depending on whether you're running Windows or Linux. Sample locations for the files are:
 C:\Oracle\Middleware\Oracle_ECM1\common\bin\wlst.cmd
 /home/middleware/Oracle_ECM1/common/bin/wlst.sh

2. Find the Capture workspace ID you want to export. Capture workspaces are assigned a numeric ID that you can obtain by using the listWorkspaces command from within the WebLogic Scripting Tool.  This command will list the ID followed by the name of the workspace.

 listWorkspaces()

3. Using the ID of the workspace you want to export, issue the exportWorkspace command with the ID of the workspace and a destination path and file name.

 exportWorkspace(2,'/home/whatever/workspace2.xml')

4. Copy the file that was created in step 3 to the destination server that will be updated with the configurations.

5. Start the WebLogic Scripting Tool on the WebCenter Capture instance that you want to import the configurations into (see #1).

6. Perform a listWorkspaces command (see #2) to ensure that the name of the workspace you're importing does not already exist. If there is an existing workspace with the same name as the workspace you exported in step 3, you must go into the Capture Console and either rename it or delete it before proceeding.

7. Issue the importWorkspace command to load the workspace from the configuration file from step 4 and create the workspace.

 importWorkspace('/home/somefolder/workspace2.xml')

8. VERY IMPORTANT: be sure to verify the configuration and change Import Source settings such as the folder locations and e-mail addresses on Capture jobs so they point to the correct location for this environment.

HINT: By using similar paths and changing only the environment portion (Development to Test, QA, Production, etc.), the changes should be relatively small and easy to manage.
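Putting the scripting steps together, a full session might look like the sketch below. The connect() URLs, credentials, workspace ID, and file paths are all placeholders for your own environments:

 # Run with wlst.cmd or wlst.sh (see step 1 for sample locations).

 # On the source environment:
 connect('weblogic', 'password', 't3://capture-dev:16400')
 listWorkspaces()                          # note the numeric ID, e.g. "2 Invoices"
 exportWorkspace(2, '/tmp/Invoices.xml')   # step 3
 disconnect()

 # Copy /tmp/Invoices.xml to the target server (step 4), then:
 connect('weblogic', 'password', 't3://capture-test:16400')
 listWorkspaces()                          # confirm the workspace name is not taken
 importWorkspace('/tmp/Invoices.xml')      # step 7
 disconnect()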

Using the steps above, you will be able to quickly and easily keep your WebCenter Enterprise Capture configurations synchronized between environments, even in the most complex installations.

For more information about WebCenter Capture and how it can be leveraged to help your business, please contact TEAM Informatics.


Don’t Regret – Redact

July 6, 2016

By: Jon Chartrand – Director of Product Management

The concept of sensitive information management is germane to pretty much every business, organization, and public sector outfit in the world. Typically, this sensitive information is classified as “PII” or Personally Identifiable Information – this would be any data which could lead to someone being personally identified and includes things like social security numbers, date of birth, and phone numbers. Other data, often revolving around financial information, includes credit card numbers, bank account numbers, and account balances. All of these data points must be carefully monitored and masked before documents can potentially be made available for distribution – externally or internally. Failure to do so can lead to devastating legal and financial consequences, bankrupting corporations and governments alike. As experts in the field of content management and in bringing order to unstructured data, we felt an obligation to assist our clients with this often expensive and time-consuming effort.

Examples of PII, according to the National Institute of Standards and Technology (NIST) [1]:

  • Name, Street Address, State, Zip Code
  • Telephone Number, Email Address, Social Security Number, Medical Record Number
  • Health Plan Number, Account Number, Account Balance, ACH Number
  • Bank Account, Routing Number, Credit Card Number, CCV Code
  • Driver's License Number, Passport Number, Taxpayer ID, Date of Birth


Just these example values represent a staggering amount of data across potentially every piece of content your organization creates, updates, manages, stores, distributes, and archives. The compliance costs required to scour content for this data can be monumental in terms of both dollars and hours. However, these costs can pale in comparison to the costs associated with a data breach. A recent study found that the average total cost of a data breach in the US can exceed $7 million, with an average per-record cost of more than $200 [2]. These are some frightening numbers. So how do we help strengthen your compliance efforts while also reducing your compliance costs? That is the question we asked ourselves several months ago and the answer, we believe, is the TEAM Redaction Engine.

We built the Engine to meet three specific needs:

  • textual pattern matching in digital documents
  • integration with scanning solutions for paper documents
  • redaction of identified data in both PDFs and images

The Redaction Engine is a plugin, or component, for Oracle’s WebCenter Content (WCC) platform. This was done because WCC is a leader in the Enterprise Content Management space and it has direct integrations with powerful scanning solutions, Oracle’s cloud-based platforms, and powerful search options such as Elasticsearch. Other than enabling scanning, the component requires no additional software or hardware to perform its functions against the content in your repository – which is a revolution in the sensitive information arena.

Pattern Matching

When it comes to assisting with sensitive information compliance, the primary challenge is identifying the data in question. Between our efforts with WebCenter Content and with Elasticsearch in the enterprise content management space, we realized that we already have access to every character of every piece of digital content that's been indexed. What it boils down to is identifying patterns and developing a method for seeking those patterns in the available data. Look again at the list of examples above. Of the 20 data points described, 18 of them (90%!) can easily be identified by a likely pattern. This is where we started.

The Redaction Engine is built around a primary core – the Pattern Matching Engine. We allow you to craft a series of patterns using both Regular Expressions and Simple Patterns. To identify Social Security Numbers, for example, you'll need to take into account the common variation that lacks the dashes. You could choose to use two simple patterns if you weren't interested in the specifics of SSN rules:

  • (with dashes) ###-##-####
  • (without dashes) #########

These would pick up Social Security Numbers but would also incorrectly identify any numeric value that fits this form but doesn't actually meet certain rules for SSNs, such as that no group of digits can be all zeroes. We could instead craft a regular expression that is much more robust and is designed to meet the rules for SSNs laid out by the Social Security Administration [3]:

  • ^(?!219-09-9999|078-05-1120)(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$
  • ^(?!219099999|078051120)(?!666|000|9\d{2})\d{3}(?!00)\d{2}(?!0{4})\d{4}$

This illustrates both how simple and how robust the pattern matching can be. These same tactics can be applied to matching virtually any other predictably formatted value. The only question is the depth of complexity you want to apply to the effort. Given that Regular Expression experts are fairly rare, we also included an expression evaluator in the interface. This provides feedback on your expressions and confirms whether each pattern makes sense to the engine or not.
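As a quick demonstration, here is how the dashed-SSN expression above behaves when exercised with Python's re module (the engine itself isn't Python; this simply shows the pattern doing its job):

 import re

 # The dashed-SSN pattern from above: no 000/666/9xx area number, no 00 group,
 # no 0000 serial, and two well-known invalid example SSNs explicitly excluded.
 SSN_DASHED = re.compile(
     r'^(?!219-09-9999|078-05-1120)(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$'
 )

 for candidate in ['123-45-6789', '000-12-3456', '666-12-3456', '123-00-4567']:
     print(candidate, bool(SSN_DASHED.match(candidate)))
 # Only 123-45-6789 prints True; each of the others violates an SSA rule.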

Now that the patterns are configured, WebCenter Content does the heavy lifting during the check-in process of opening the document and extracting the text within so that the document’s contents can be indexed. This indexing means you can search for a word inside the document instead of just the title or metadata. It also means we have a readily available block of extracted text that we can quickly parse against our patterns and identify desired information. Once identified, we simply hand the PDF to an editing library which adds the redaction, burns it into the document, and saves a new copy as a “Redacted Rendition”. The new PDF even remains full-text searchable – it just has the redacted text removed! This is the simplest – and most common – scenario.
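The post's editing library isn't named, but the burn-in step can be pictured with an open-source stand-in such as PyMuPDF, which supports true redaction annotations that strip the underlying text:

 import fitz  # PyMuPDF, standing in for the editing library

 doc = fitz.open("invoice.pdf")
 for page in doc:
     # Black out every occurrence of a value the pattern engine matched.
     for rect in page.search_for("123-45-6789"):
         page.add_redact_annot(rect, fill=(0, 0, 0))
     page.apply_redactions()  # burn in: the underlying text is removed
 doc.save("invoice_redacted.pdf")  # the "Redacted Rendition"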

Scanning Integration

Less common but no less important are image-based, or scanned, documents.  As paper documents are still a fact of life, we always want to keep an eye on our methods for digitizing that physical content to bring it into the repository. Whether that’s a simple WebCenter Capture setup or some other scanning platform, the important piece is that we get this now-digital item into a managed structure such as WebCenter Content. If your choice is to stick with the WebCenter family, the Redaction Engine is specifically enhanced to work intimately with both Capture and Oracle Forms Recognition (OFR). One of the best examples of this partnership is with content that contains non-digital text, that is, handwriting.

After the paper item is digitized via the scanner and Capture, it's passed to OFR for processing. This is where we set up "markers" and instruct OFR where to look for characters in a specific location. Even if Oracle Forms Recognition cannot interpret the handwriting (via Optical Character Recognition, or OCR), it can identify the precise coordinates of the handwriting's location. Now we simply pass the digitized document and the coordinates to WebCenter Content and the Redaction Engine.


In the end we have a perfectly redacted entry even though the text wasn’t readable by a character recognition engine. This means that as long as we can find digital “landmarks” in our document, we can train Oracle Forms Recognition to look for and identify illegible entries and pass those for redaction.

If, however, your solution for scanning physical documents does not include WebCenter Capture or Oracle Forms Recognition, the Redaction Engine is happy to work with those items as well.


In fact, any image-based content can be passed through the Redaction Engine, as we've included an OCR library with the product. This means not only image-based PDFs but native TIFF, JPEG, or GIF files can be processed as well. The Redaction Engine OCR library will process the content item and scan for any machine-readable English text that it can find. Of course, as with any OCR process, there are limitations in terms of language, fonts, and file resolution; however, the vast majority of modern scanned documents will have no problems being read. If you're submitting documents sent via fax machine in 1997 and then digitized with a consumer-grade scanner a year later, you could very well run into issues.

Something extra on this front comes from the fact that we're finding text in these images – search. While WebCenter Content would not ordinarily be able to include these content items in the full-text search index, we've joined the Redaction Engine with TEAM's Elasticsearch Integration to make this happen. That means any text found when an image or image-based item is passed through the engine is submitted to the Elasticsearch index, making it fully searchable. This means, for example, that a scanned invoice could be found by searching for the vendor name, the invoice ID, or the invoice total, and not just by the metadata that was associated with the item at check-in.
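Conceptually, the flow for an image-based item looks something like this sketch, with the open-source Tesseract engine standing in for the bundled OCR library (index and field names are illustrative):

 import pytesseract
 from PIL import Image
 from elasticsearch import Elasticsearch

 # OCR the scanned image to recover any machine-readable text.
 text = pytesseract.image_to_string(Image.open("scanned_invoice.tif"))

 # Submit the recovered text to the Elasticsearch index so the item is
 # full-text searchable, not just discoverable by its check-in metadata.
 es = Elasticsearch("http://localhost:9200")
 es.index(index="wcc-content", id="WCC001234", document={
     "title": "Scanned Invoice",
     "extracted_text": text,
 })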

Responsible Redaction

We’ve now covered three specific cases where content can be redacted:

  1. via full-text matching of the document contents
  2. via sets of coordinates passed to the Engine
  3. via pattern and location matching of OCR text in an image or image-based item

In all cases the Redaction Engine creates a new, specifically-redacted content item that is separate and unique from the original file. The redactions are also "burned in" to the new file, ensuring that the underlying text is permanently removed. These steps ensure, first, that no data is lost through the redaction process and, second, that redacted items are truly secure in terms of information removal.

The last piece of what we have come to call “responsible redaction” is the auditing capability of the Redaction Engine. The product keeps a record of every redaction performed – not just at a document level but at the redaction level. A single content item with several redactions has every individual redaction logged, including the specific pattern that was matched in each case. Redaction Reports can be generated for any date range desired and can be exported as a Microsoft Excel document. This exported document can now be stored as a managed record in WebCenter Content or maintained elsewhere for legal purposes. The goal in all cases is simply to provide as much transparency as possible into a process that is built to, well, do the exact opposite!


The Redaction Engine is not only about lessening the burden on businesses that have to manually parse, identify, and redact sensitive information, but also about bolstering ongoing information compliance efforts and keeping trouble from finding the front door. As we've worked on this effort, I've come to find a much greater appreciation for the efforts that must be undertaken to try and keep our information safe and secure. As a group, we're incredibly pleased to be able to offer a solution that could very well save you and your business time, money, and headaches.


[1] "Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)", NIST, April 2010

[2] "2016 Cost of Data Breach Study: United States", Ponemon Institute, June 2016

[3] "Validating Social Security Numbers through Regular Expressions", Rion Williams, CodeProject.com, September 2013


TEAM Informatics' Intelligent Content: A Smart Solution for Businesses to Manage and Control Content

May 19, 2016

By: Jon Chartrand – Director of Product Management

Perhaps the primary conceit when it comes to content management is this: context is king. When your content or records have context, it means they can be both cataloged and discovered with much greater ease. When we talk about context, that means metadata – or data describing data. When a document is placed into your content management system it’s important to know who it came from, who it belongs to, what the data within is regarding, and every other aspect of context that can be known, implied, or assumed. This allows the system to catalog the item appropriately and other users to search for and locate the item easily. The problem is that while context is king, entering metadata can be a royal pain – and bad metadata can ruin an otherwise good system. As we all know: garbage in – garbage out.

TEAM's been working in the content management space for over a decade, so we've seen this issue arise repeatedly for our clients. Relying on end users for full, complete, and accurate metadata puts stress on them, slows down the contribution process, and can lead to human error or, even worse, human disinterest. So we set out to not only solve this problem but revolutionize how context is achieved for your content. We partnered with SmartLogic and combined the power of Oracle WebCenter Content with their extraordinary context classification software, Semaphore, to create a unified, smart solution.

This is Intelligent Content.

What is Intelligent Content and how does it work?

TEAM's Intelligent Content solution alleviates the challenges and roadblocks of requiring users to navigate the metadata process by doing the work for them. The process begins when a user simply saves their content to the WebCenter Content repository. The content can also be contributed automatically by line-of-business systems or even ingested from network drives or cloud-based file systems. The Intelligent Content engine processes the stored material and leverages an information classification model, or "ontology", rather than the traditional two-dimensional taxonomy. Intelligent Content drives the auto-classification process by opening each document at contribution time and parsing its content. It is then able to automatically populate metadata based on the rules of the classification model. By automatically tagging your materials, it makes your content easily findable across what would previously have been multiple taxonomic pathways.



Perhaps an example can help here. Imagine an overview document that describes a land use project to build a park. The document may contain sections on project planning, soil samples, a work breakdown, price estimates, and more. In the old-school method, the Project Manager checks the item into the repository and, on reflection, classifies it as a Project Document type item with a subtype of Overview. This is helpful, but it really doesn't encompass the breadth and depth of what's in the document. In the new-school method, Intelligent Content parses the text and applies predefined classifications: overview, soil samples, work breakdown, pricing, and so on. This means the item can be found by others who search based on what they're looking for, not solely on the structure of the item. The old-school method provides a single taxonomic pathway (Project Document > Overview). The new-school method enables a much more nuanced approach. When the Engineer looks for documents relating to soil samples, the item is returned. When the Construction Foreman looks for documents relating to Work Breakdown, the item is returned.

As I mentioned earlier, the ontology (a.k.a. the information classification model) is composed of a set of terms and rules, which can be maintained as needed by the information or records management SME within your organization or through TEAM. Utilizing the information model on the search side of the equation enables "semantically enhanced" search capabilities, including a "search as you type" feature and the ability to browse the model in an interactive, graphical manner. Both methods create easier, faster, and more intelligent pathways for users to find the content they're looking for in the system.


Why is this important for businesses?

Help Your Contributors.

There's a lot of room for human error when a document is manually classified. TEAM's Intelligent Content solution saves the content contributor time and effort by automatically tagging newly stored content. This ensures that every time new content is stored in any department of your business, its classification will be consistent and no longer susceptible to the vagaries of human interpretation.

Help Your Users.

Will the end user always know what keywords to search for when looking for a specific document? The auto-classification system makes finding your documents faster and easier than ever. What could potentially take hours to locate within a large system can now be found in a matter of seconds thanks to the unique ontology model utilized by Intelligent Content.

Help Your Business.

By changing the way your content is cataloged and managed, TEAM’s Intelligent Content solution is a bottom line contributor to the overall enhancement of your business.

While this sounds like a sales pitch – and I admit it kind of is – I want you to understand that we’re also incredibly excited about the results we’re already seeing from Intelligent Content; better classification, less human error, simpler contribution experience, and far faster and more accurate searching. This is the next step in the evolution of enterprise content management. If you’re interested in learning more, you can check out our YouTube video on this topic or email us directly with your questions.


Using Enterprise Manager for Troubleshooting and Optimizing your WebCenter Content Deployment

May 10, 2016

By: Raoul Miller – Enterprise Architect

When Oracle WebCenter Content made the architectural shift from a standalone J2SE application to a managed application running in WebLogic Server (WLS), the change provided a number of new capabilities for management, integration, and support.  One of these capabilities is the version of Enterprise Manager that is built into WLS which allows administrators to monitor many different aspects of the WebCenter Content application.

If you haven’t been through formal WLS or Enterprise Manager training, the interface may seem complex or confusing.  My speaking session at Collaborate 2016 in April explained how to use Enterprise Manager to monitor, optimize, and troubleshoot your WCC deployment(s) and I wanted to accompany that with a post here to provide a bit more context.

First a little background – there are multiple versions of Enterprise Manager (EM), and it’s important to be clear which one we are talking about.  Those of us who have worked with the Oracle Database will be familiar with the original EM that’s been used to manage databases since version 9i.  This is now specifically called Enterprise Manager Database Control.

At the other end of the spectrum there is the full-featured Enterprise Manager platform.  This is a multi-tier set of applications which monitor and manage all aspects of your Oracle hardware and software deployment.  We recommend it highly for large Enterprise clients, but it can be expensive and complex.

In the middle is the Enterprise Manager we will discuss today which is a set of web-based tools used to manage and monitor your WLS application server deployments.  You access this at almost the same URL as the WLS administration interface – http://<WLS servername>:7001/em – note the /em rather than /console for WLS, and it’s possible you may not be using the standard 7001 port.

Your initial screen will show you what is deployed to your domain and whether the applications / servers are running or not.


You'll notice that there are lists of application deployments and managed servers within the domain, and right-clicking on any of these will show you custom actions for each.


Before we get to what to monitor and measure, let's take a moment to review best practices when optimizing or troubleshooting WebCenter Content. As the Java application architecture has stayed much the same over the years, the standard areas to focus on have remained fairly constant. It cannot be stated strongly enough that it is vital to look at ALL these areas, measure and test performance before making any changes, change one thing at a time, and then re-test and re-measure after making that isolated change. It's very much an iterative approach; without data you are just playing around with the inputs and outputs of a black-box model.

The areas you need to monitor and measure when optimizing or troubleshooting WCC are:

  • Java virtual machine
  • File system
  • Database (metadata and indexing)
  • Network
  • Authentication / authorization
  • Customization / components
  • Hardware


(I have to credit Brian “Bex” Huff and Kyle Hatlestad for their presentations back in the day at Stellent which taught me this approach.)

Enterprise Manager can help you with many of these areas, but not all – you need other tools to look at file system I/O and utilization, network speed and routing, and (non-Oracle) hardware. However, for the other areas, EM can be extremely helpful. Let's look at a couple of examples:

JVM metrics

Right click on the managed server instance and select JVM Performance


This brings up a pre-selected set of JVM metrics and a non-standard color scheme.


This will let you monitor the heap and non-heap memory usage in real time.

**TIP** You may see that the heap is smaller than you thought you had set it – I have often seen an issue where there has been confusion over where the non-default maximum and minimum heap sizes should be set.

Lower on the page you’ll see more granular data on JVM threads, objects, etc.


Datasource Metrics

You’ll need to open the metric palette on the right side of the screen and open up the Datasource Metrics folder.


**TIP** Make sure you choose this rather than the Server Datasource Metrics, because you will need to select the "CSDS domain-level Datasource".


WebCenter Content Metrics


Navigate to the WebCenter Content Server deployment at the bottom of the folder list in the left hand area:


Select “Performance Summary” and you’ll see a pre-selected set of content-specific metrics in the graph area.  As with all of the other selections, you can add or subtract metrics as you go – this short cut just gives you a good starting point.


We have only scratched the surface here of the capabilities of Enterprise Manager and its use for optimizing WebCenter Content. For much more information, download my presentation from Collaborate 2016 or contact us through our website. We'll be happy to discuss how we can further help you optimize and troubleshoot your WCC deployments.


Taming the Paper Tiger with Oracle Forms Recognition

April 22, 2016

By: Dwayne Parkinson – Solution Architect

We all like to believe that technology makes everything somehow better, right? Our parents' watches tell time and maybe the date, while mine gives me the weather, tells me when to get up and exercise, tracks calories, integrates with email, and sends text messages. Seemingly everything from our refrigerator to our garage door opener to the latest and greatest ERP system is connected to our phones and devices these days. Yet amidst all this technology and integration, lurking somewhere in the bowels of virtually every company and organization is a massive pile of paper.

They say the first step to fixing a problem is to admit that we have one. So let's admit what the problem is: paper. It used to be that paper went back and forth between companies as a record of the various transactions. If you placed an order, it was on paper. When you got a shipment, there was more paper. When you needed to pay, there was a paper invoice. And up until recently, when you paid, someone somewhere was potentially issued a paper check. With the advent of Electronic Data Interchange (EDI), electronic transactions thankfully became the standard – or so we'd like to think. What's really happened, however, is that only transactions between electronically astute organizations have migrated to EDI, while smaller organizations and those facing significant technology challenges have unfortunately remained largely paper-based.

While many of these smaller organizations have stopped sending physical paper for these transactions, it's important to recognize that an e-mail with a PDF attachment is still a paper-based transaction in the end. Ultimately it requires a person somewhere to open the attachment, read it, extract the important information, and then enter that information into the business system. The end result is that very few organizations are completely free from the shackles of paper.

The obvious solution is to use some kind of scanning and optical character recognition (OCR) to automatically import data into your systems. The problem with this solution is that many existing OCR systems use technology that hasn't changed in twenty years. Often enough, the legacy processes – defining templates, creating scanning zones, forcing customers to use predefined forms and cryptic barcode solutions – all fail for various reasons.

Oracle Forms Recognition (OFR) approaches the problem of scanning in a very different way. First of all, the software is designed to simulate what a human might do when looking at a piece of paper. The first thing a person does is evaluate the document and figure out what it is. Is it a W2? Is it an invoice? Is it a resume? OFR does the same thing. Based on the layout of the document, the actual content, and several other metrics, OFR classifies a document automatically.

Once classified, rules are set up to define what various pieces of information look like within that document. For example, a Social Security Number is always in the same general format: three digits, a dash, two more digits, another dash, and four digits (999-99-9999). When a person looks for a Social Security Number on a piece of paper, they look for a couple of things:

  • They look for a specific format
  • They look "geographically" in the general area where they expect the Social Security Number to be, based on the document type and past experience

OFR does that exact same thing.  Here we are defining a simple rule for a social security number:


Based on that rule OFR will identify candidates on the scanned documents as shown here:


With OFR, rules can be defined to specify formats or to look next to, above, or below certain identifiers (e.g., "SSN" and "Social Security Number").

Once the rules are in place, OFR identifies candidate values on the document and OFR is then trained on sample documents so it can learn where to expect to find each value.  This process is known as creating a “learn set”.  Batches of sample documents are scanned and “taught” to OFR so that when it encounters similar documents in the future it will already know how to handle them.

Here we see the evolution from the traditional scanning/OCR model. With the OFR approach it isn’t necessary to define separate templates for each type of document that might come into the company.  Instead a single document class is created to represent a group of information that is needed from a class of documents.  For example, there may be one class for information contained on a W2 tax form and another class for health insurance information retrieved from various health provider forms.  With just two classes defined, OFR can handle all of the variations of W2 forms and all of the healthcare provider forms a company might reasonably encounter.

In the event that OFR encounters a problem such as a light scan or invalid data, there is an intuitive browser-based verification system that allows users to review the exception data and make an informed decision.  OFR can also be configured so that each piece of data it finds is measured against a certainty level.  So whenever OFR is unsure if the data it has is correct (that is, the certainty level is low), the item can be sent into the verification system where a person can review it.  Additionally, as documents go into the verification system they can be flagged to help further train the system so the accuracy of the system continues to improve over time.
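The certainty-level routing can be pictured with a small conceptual sketch; the threshold value and function below are purely illustrative, not OFR's actual API:

 CERTAINTY_THRESHOLD = 0.85  # illustrative value

 def route_extraction(field_name, value, certainty):
     """Accept high-certainty values; queue everything else for review."""
     if certainty >= CERTAINTY_THRESHOLD:
         return ('accepted', field_name, value)
     # Low certainty: a person verifies the value, and the corrected result
     # can be flagged to extend the learn set and improve future recognition.
     return ('needs_verification', field_name, value)

 print(route_extraction('ssn', '123-45-6789', 0.97))  # accepted
 print(route_extraction('ssn', 'l23-45-67B9', 0.42))  # needs_verification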

Behind all of this technology is a powerful scripting engine that provides the ability to customize the process as needed and integrate with other systems and a host of other standard OCR technologies.  These include optical mark recognition (OMR), barcode recognition, zonal OCR, floating anchors and pre-processing technologies such as box and comb removal.

We’ve seen wild success with our clients through the adoption of modern, powerful and flexible scanning solutions like Oracle Forms Recognition. From relatively simple needs of only several hundred documents a week to much larger operations, OFR and WebCenter Capture can help you evolve your processes and ultimately cage the Paper Tiger.


TEAM Informatics Introduces Their Innovative Product, DOCSConnect for Oracle WebCenter Content and Oracle Documents Cloud Service

October 26, 2015

MINNEAPOLIS, Oct. 26, 2015 — Oracle OpenWorld 2015 — TEAM Informatics ("TEAM"), a leading enterprise content management products and service provider and Oracle Gold Partner, has recently released their newest connector, DOCSConnect. The announcement comes from TEAM at Oracle OpenWorld 2015, where they are participating in the event as presenters of two unique sessions in the WebCenter space.

TEAM’s DOCSConnect joins the power of Oracle WebCenter Content 11g (WCC) and the highly developed Oracle Public Cloud offering, Documents Cloud Service. This hybrid enterprise content model provides security, compliance, and data management features with the extensive collaborative capabilities of the cloud. DOCSConnect is the first connector that functions solely with WebCenter Content 11g and Oracle DOCS rather than utilizing a third party installation or interface. TEAM developed DOCSConnect in order to provide a deeply integrated, controlled, and auditable hybrid document system to ensure content could be accessible and editable at all times from any device.

DOCSConnect is an enhancement component within Oracle WebCenter Content 11g. Not only does DOCSConnect provide improved access to enterprise content, it enables an unprecedented level of collaboration and maintains auditable version histories of files uploaded in both WCC and DOCS. DOCSConnect allows WebCenter Content 11g to serve as a Single Point of Truth (SPoT) for all enterprise content while leveraging the burgeoning power of Documents Cloud Service and Oracle’s Public Cloud platform. “Oracle’s cloud products are game-changers for the traditional enterprise software model and our DOCSConnect product is a powerful way to bridge the gap between paradigms. The best of WebCenter merged with the next generation of enterprise capabilities, enables true collaboration for our customers,” said Doug Thompson, CEO of TEAM.

For more information on DOCSConnect, watch their YouTube Video on the product, and visit www.teaminformatics.com/products/docsconnect.

About Oracle OpenWorld 2015

Oracle OpenWorld is an annual Oracle event for business decision-makers, IT management, and line-of-business end users. It is held in October in San Francisco, California. The world’s largest conference for Oracle customers and technologists, Oracle OpenWorld San Francisco attracts tens of thousands of Oracle technology users every year.

About TEAM Informatics, Inc.

TEAM Informatics, Inc. (www.teaminformatics.com) is an employee-owned, Minnesota-based software products and systems integration firm with a global customer base and offices on three continents. TEAM was formed over 10 years ago and has experienced a sustained aggressive growth rate.

TEAM is an Oracle Software Reseller and a global member of the Oracle Partner Network, specializing in areas such as WebCenter Content, WebCenter Portal, and Oracle Documents Cloud Service. Offerings include professional services, managed services, enterprise and development support, and an expanding set of custom products. In addition, TEAM is a Google Enterprise Partner and Reseller for the Google Search technologies. TEAM's suite of business applications includes a GSA Connector for WebCenter for enterprise search, TEAM Sites Connector for enabling web experience management, DOCSConnect for hybrid enterprise content management, and Intelligent Content for metadata auto-classification. Get more information on these and all of TEAM's offerings at http://www.teaminformatics.com.

Trademarks

Oracle is a registered trademark of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

To view this video on YouTube, please visit: https://www.youtube.com/watch?v=kNoJllO2VW4&feature=youtu.be

Media Contact: Doug Thompson, TEAM Informatics, Inc., 1.651.760.4802, doug.thompson@teaminformatics.com

