Text can be extracted from document files, from web pages, and even from PDF files. Image-based files, however, are less predictable. For example, how do you extract floor plan details from an image file? For images, you need optical character recognition (OCR). A few libraries exist for this purpose, but they can be quite memory intensive, so for large-scale development a custom solution is probably the better choice. Python is likely to have more such libraries available than Java.
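
As a rough illustration of the floor plan case, the sketch below runs OCR on an image and then picks room dimensions out of the recognized text. It is only one possible approach, not a recommended pipeline. It assumes the third-party packages Pillow and pytesseract are installed along with the Tesseract binary, and it assumes a label format of "room name, width x length" (for example "Kitchen 12 x 14"). That format, and the names `ocr_image`, `parse_rooms` and `ROOM_PATTERN`, are hypothetical and chosen for the example.

```python
import re
from typing import List, Tuple


def ocr_image(path: str) -> str:
    """Return the text recognized in an image file.

    Assumption: Pillow, pytesseract and the Tesseract binary are installed.
    The imports are local so parse_rooms() works without the OCR stack.
    """
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


# Hypothetical label format: "<room name> <width> x <length>" on one line,
# e.g. "Kitchen 12 x 14". Real floor plans will need their own pattern.
ROOM_PATTERN = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]+"   # room name
    r"(\d+(?:\.\d+)?)[ \t]*[xX][ \t]*"      # width, then the "x" separator
    r"(\d+(?:\.\d+)?)[ \t]*$",              # length
    re.MULTILINE,
)


def parse_rooms(text: str) -> List[Tuple[str, float, float]]:
    """Extract (room, width, length) tuples from OCR output.

    Lines that do not match the assumed label format are skipped.
    """
    return [
        (name.strip(), float(width), float(length))
        for name, width, length in ROOM_PATTERN.findall(text)
    ]
```

With an image at hand, `parse_rooms(ocr_image("floor_plan.png"))` would chain the two steps (the file name is a placeholder). Keeping the parsing separate from the OCR call means the text handling can be checked on plain strings, and the OCR step can be swapped for a custom solution later without touching the rest.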