Large-Scale Web Scraping Case for a Legal Recruitment Startup in the USA

Introduction

If you’ve looked through our site, you’ll have already noticed that we have extensive expertise in data scraping and provide a wide range of services in this sphere.

However, real cases tell a richer story, and we want to share one with you today. Unfortunately, most of our projects are under NDAs (non-disclosure agreements), so we cannot tell you the name of the client or the company he works for, but the technical details of this case are too incredible not to share.

Let’s get down to business.

Client

Our client is a US attorney who launched a legal recruitment startup back in 2015. If you have ever looked for a good lawyer in a particular field, you know that the task itself is a challenge, and recruitment in this sphere is even more complicated. So our client decided to match the right attorneys with those looking to hire them in the USA.

The project started 5 years ago, and since then we have gathered a comprehensive database of over 3,000 law firms across the USA, as well as more than 300,000 attorney profiles. No doubt, the scope of work was tremendous, and the project is a perfect example of large-scale web scraping.

Who benefits from this service?

Our client’s major customers are recruiting agencies that specialize in serving law firms, legal companies with in-house recruitment teams, and attorneys in search of a new job.

The information provided on our client’s official site is also fertile ground for marketers and market researchers, so specialists in those fields benefit from the service as well.

Read also: Job Scraping Service

The service is subscription-based: recruiters can purchase access to both the candidate and job databases to match one with the other.

Project

DataOx’s task was to scrape information about all US attorneys, then analyze, cleanse, and enrich it for recruiters. Since we continue to provide project maintenance, we still scrape and update the database twice a month.

The other aspect of our work involves parsing job listings for new legal vacancies, which we do every two hours. This gives our customers an up-to-date job database for lawyers of various specialties.
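
The client’s actual stack isn’t something we can reveal, but as a minimal illustration of those two cadences, a Python sketch using APScheduler (a library choice we’re assuming here, not the project’s real scheduler) might look like this:

```python
# Minimal scheduling sketch; job bodies are placeholders for the real pipelines.
from apscheduler.schedulers.blocking import BlockingScheduler

def refresh_attorney_database():
    # Placeholder for the full profile re-scrape and database update.
    print("Refreshing attorney database...")

def scrape_job_listings():
    # Placeholder for the legal job-listing scrape.
    print("Scraping new job listings...")

scheduler = BlockingScheduler()
# Twice a month: e.g., on the 1st and 15th at 3 a.m.
scheduler.add_job(refresh_attorney_database, "cron", day="1,15", hour=3)
# Job listings refresh every two hours.
scheduler.add_job(scrape_job_listings, "interval", hours=2)
scheduler.start()
```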

More than 3,000 DataOx scrapers operate to collect information about US attorneys. Over 30 parameters are extracted for each person, including personal and professional details, educational and work background, practice areas, specializations, bar admissions, and much more. At the same time, about a thousand other crawlers gather job information in the legal sphere all over the USA. Two people maintain the bots, since their operation is often disrupted by anti-scraping measures and unexpected changes to target websites.
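
To give a flavor of what per-profile extraction involves, here is a hypothetical Python sketch; the CSS selectors and field names are invented for illustration, since each target site needs its own parser:

```python
# Illustrative profile extraction; real markup differs from firm to firm.
from dataclasses import dataclass, field
from bs4 import BeautifulSoup

@dataclass
class AttorneyProfile:
    name: str
    firm: str
    practice_areas: list[str] = field(default_factory=list)
    bar_admissions: list[str] = field(default_factory=list)
    education: list[str] = field(default_factory=list)

def parse_profile(html: str, firm: str) -> AttorneyProfile:
    soup = BeautifulSoup(html, "html.parser")
    # Selectors below are hypothetical; in production each site gets its own.
    name = soup.select_one("h1.attorney-name").get_text(strip=True)
    areas = [li.get_text(strip=True) for li in soup.select("ul.practice-areas li")]
    bars = [li.get_text(strip=True) for li in soup.select("ul.bar-admissions li")]
    schools = [li.get_text(strip=True) for li in soup.select("ul.education li")]
    return AttorneyProfile(name, firm, areas, bars, schools)
```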

As an attorney himself, our client analyzes and enriches the information based on his legal knowledge and familiarity with the specifics of the profession and the industry as a whole.

Challenges

Personal details accuracy

In the USA, law firms publish information about their teams on their websites. However, specialists make lateral moves from time to time, so when an attorney, let’s call him John, disappears from one site and another John appears somewhere else, our task is to determine whether he is the same John.

The same issue applies to candidates who have changed their last name after getting married, moved to another state, and so on.
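
Our production matching logic is more involved and proprietary, but a toy Python heuristic conveys the idea: combine name similarity with stable attributes, such as education and bar admissions, that survive a lateral move.

```python
# Toy identity-matching heuristic; thresholds and weights are illustrative.
from difflib import SequenceMatcher

def same_attorney(a: dict, b: dict, threshold: float = 0.8) -> bool:
    name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    score = 0.5 * name_score
    # Education and bar admissions rarely change when an attorney moves firms.
    score += 0.25 if set(a["education"]) & set(b["education"]) else 0.0
    score += 0.25 if set(a["bar_admissions"]) & set(b["bar_admissions"]) else 0.0
    return score >= threshold

# "John" vanished from firm A and a similar "John" appeared at firm B:
old = {"name": "John A. Doe", "education": ["Yale Law School, JD 2008"],
       "bar_admissions": ["New York"]}
new = {"name": "John Doe", "education": ["Yale Law School, JD 2008"],
       "bar_admissions": ["New York", "New Jersey"]}
print(same_attorney(old, new))  # True
```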

Matching problem

What’s more, firms do not reveal attorneys’ ages on their websites, so one more challenge was inferring age from substitute data, such as graduation year or the year a degree was earned. An attorney can receive several degrees from different universities but rarely spells out these details in a resume. So DataOx’s task was also to match the right degree with the right school or university to provide accurate personal information about each candidate. We built complex custom parsing to address this task.
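
Purely as an illustration of the substitute-data idea (not the client’s actual enrichment logic), one can back out an approximate birth year from a JD graduation year by assuming a typical graduation age:

```python
# Rough age estimate from a JD year; the typical-age constant is an assumption.
from datetime import date

def estimate_birth_year(jd_year: int, typical_jd_age: int = 26) -> int:
    # Most US attorneys finish their JD in their mid-twenties.
    return jd_year - typical_jd_age

def estimate_age(jd_year: int) -> int:
    return date.today().year - estimate_birth_year(jd_year)

print(estimate_age(2008))  # e.g. ~43 if run in 2025
```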

Proxy blocking

This is a more or less common challenge in most of our projects. DataOx is based in the EU, though we carry out projects all over the globe. This project concerns American attorneys, so we had to obtain American IPs to keep our proxies from being blocked.

We know how to do this quickly, efficiently, and hassle-free, so the task was solved seamlessly.
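
To sketch the approach in Python, with placeholder endpoints rather than a real proxy provider:

```python
# Rotating US proxies with requests; endpoints below are placeholders.
import itertools
import requests

US_PROXIES = itertools.cycle([
    "http://user:pass@us-proxy-1.example.com:8000",
    "http://user:pass@us-proxy-2.example.com:8000",
])

def fetch(url: str) -> str:
    proxy = next(US_PROXIES)  # round-robin over the US IP pool
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    resp.raise_for_status()
    return resp.text
```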

Data management

When you deal with large-scale web scraping, data management is always a burning issue. As discussed above, the scope of information is enormous, so the DataOx team developed a custom data management system to handle it all. For this particular project, the system is built in Java.

Thus, when it comes to a data scraping project, DataOx can always offer its clients custom integrated solutions for data storage and management.

Picture storage

We scraped not only textual details about US attorneys but also extracted their photos for simpler and quicker identification of specialists. For this purpose, we had to carefully design the storage infrastructure to match the right pictures with the right profiles.
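
One simple way to make that binding robust, shown here purely as an illustration rather than the project’s actual layout, is to derive the storage key from a stable profile ID plus a content hash:

```python
# Store each photo under a key derived from the profile ID, so a picture
# can never drift to the wrong record. Layout is illustrative only.
import hashlib
from pathlib import Path

def store_photo(profile_id: str, image_bytes: bytes,
                root: Path = Path("photos")) -> Path:
    digest = hashlib.sha256(image_bytes).hexdigest()[:12]
    # Shard by hash prefix to keep any single directory small.
    path = root / digest[:2] / f"{profile_id}_{digest}.jpg"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(image_bytes)
    return path
```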

Data enrichment

To make the attorneys’ profiles comprehensive, we also monitored social networks, which helped us supplement the official information with important details. Detecting a person’s profile on social networks is not always simple; however, we have done it for most of our candidates and the firms they work for.

Job descriptions scraping

The key challenge with extracting job details was the free form in which some vacancies are published. Here, too, we used complex custom parsing to map each job requirement to the right field in our dataset.
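
In miniature, and with invented patterns, that kind of free-form field extraction looks like this:

```python
# Simplified field extraction from a free-form posting; the production
# parser uses far more patterns and per-source fallbacks.
import re

PATTERNS = {
    "experience_years": re.compile(r"(\d+)\+?\s+years?", re.I),
    "practice_area": re.compile(r"\b(litigation|corporate|tax|real estate)\b", re.I),
    "bar_admission": re.compile(r"admitted\s+in\s+([A-Z][\w\s]+?)(?:[.,;]|$)", re.I),
}

def parse_vacancy(text: str) -> dict:
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields

print(parse_vacancy("Seeking an associate with 5+ years of litigation "
                    "experience, admitted in New York."))
# {'experience_years': '5', 'practice_area': 'litigation', 'bar_admission': 'New York'}
```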


Naming standardization

When extracting information from various sources, we often found the same location written in different ways (New York, N. York, NY), and the same was true for university, degree, and company names. We had to develop a unification system to bring all these variants to a common standard.
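
At its core, such a unification system is a canonicalization table plus a lookup; the real one covers thousands of variants per category:

```python
# Toy canonicalization table; keys are lowercased raw values.
CANONICAL = {
    "location": {"new york": "New York", "n. york": "New York", "ny": "New York"},
    "degree": {"j.d.": "JD", "juris doctor": "JD", "ll.m.": "LLM"},
}

def normalize(kind: str, raw: str) -> str:
    # Fall back to the trimmed raw value when no canonical form is known.
    return CANONICAL.get(kind, {}).get(raw.strip().lower(), raw.strip())

assert normalize("location", "N. York") == "New York"
assert normalize("degree", "Juris Doctor") == "JD"
```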

Data Quality Assurance

When handling large-scale web scraping projects, data accuracy is of special importance. We have developed a custom data quality verification tool for this specific project.

Additionally, a member of our team manually checks data consistency and the accuracy of details.

On top of that, our client reviews the received data based on his legal expertise and corrects the collected details where necessary.
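
The verification tool itself is proprietary, but a sketch conveys the flavor of the checks it runs before a record is accepted (the field names here are illustrative):

```python
# Sample record-level checks; an empty list of errors means the record passes.
from datetime import date

REQUIRED_FIELDS = ("name", "firm", "bar_admissions")

def validate(profile: dict) -> list[str]:
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if not profile.get(f)]
    jd_year = profile.get("jd_year")
    if jd_year is not None and not 1950 <= jd_year <= date.today().year:
        errors.append(f"implausible JD year: {jd_year}")
    return errors

print(validate({"name": "John Doe", "firm": "",
                "bar_admissions": ["New York"], "jd_year": 1875}))
# ['missing field: firm', 'implausible JD year: 1875']
```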

Cooperation between two independent teams

A team from Poland carries out the front-end development for our client. While they work on the site and app, we handle data scraping and management at large scale. The DataOx team includes two to seven specialists, depending on the project stage; the work of our two teams has been seamlessly integrated into a single successful project.

So, keep in mind that as a client you do not always need one team to be a jack of all trades. It’s often wiser to choose the best experts in each field and have them cooperate for a perfect result.

Planned team member rotation

As we have already mentioned, we continue maintaining the project, and two specialists work on it on an ongoing basis. However, such work can become routine, and we motivate our employees to keep developing, so we have a scheduled rotation plan for this project: one specialist rotates off every six months, while the other shares project knowledge with the newcomer. In this way, we keep work on the project smooth and effective at all times.

Project data storage

The project has been in progress for five years, and the work is still ongoing. It has involved various teams and experts at different times, and we have solved many issues along the way. It’s impossible to keep all the details, technical aspects, and nuances of such an extensive project in mind, so we created a project knowledge base in Confluence, where any expert can look up the details at any time.

Read also: Large-Scale Web Scraping

Result

What about the result?

The service has operated successfully in the United States since 2015 and has already reached a turnover of about $10 million. Our client has started expanding his business beyond the USA, and we have begun scraping European and Asian job markets for this purpose.

Thanks to this service, recruiters in the USA have access to a comprehensive database of attorneys: they can search for candidates, filter by multiple parameters, and choose the best candidate from several specialists. More than that, by analyzing certain parameters, they can predict the right moment to offer a candidate a new job.

Final Thoughts

Large-scale data scraping opens huge opportunities for startups, the recruitment sphere, and job search. DataOx has valuable expertise in this field and a deep understanding of the process. We can guarantee accurate and comprehensive data, since we know how to collect, check, and enrich information with out-of-the-box solutions. If you have a startup idea or a large-scale web scraping project, schedule a free consultation with our expert to discuss it. We are always ready to take on a new challenge.
