Q:Multiple crawlers database connection in Java

Let's say I instantiate multiple crawlers for the same URL. They write the URLs they have processed to a MySQL database. Before processing a URL, each crawler checks the database for an existing record for that page, so that an already processed page is not processed again.

Here is the catch: if my logic is right, there should be some kind of lock so that only one of them can read from or write to that specific table at a time. So I instantiated only one database connection (JDBC) for them all to use. Still, I am unsure whether this is the right thing to do.

So my question is: do statements executed from a single database connection run sequentially (are they queued), or does this depend on the database engine and its configuration?

answer1:

> So, I instantiated only one database connection (JDBC) for them to use. Still, I am unsure if this is the right thing to do.

I recently faced a similar situation. I chose to use a single JDBC connection and a Lock to share that connection between threads.

A status flag is also needed to indicate that a particular URL is currently being processed.

The simple snippet below illustrates both points.

> Do statements executed from a single database connection run sequentially (are they queued), or does this depend on the database engine and its configuration?

To eliminate any doubt, let your Java code ensure that statements run sequentially. This will save you headaches from unexpected errors later.

Code snippet

CrawlerThread.java

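public class CrawlerThread extends Thread {

     // DAO shared by all crawler threads (assumed to be passed in; not declared in the original)
     private final UrlDAO urlDao;

     public CrawlerThread(UrlDAO urlDao) {
         this.urlDao = urlDao;
     }

     @Override
     public void run() {
         String url = null;
         try {
            // Assign to the variable declared above instead of redeclaring it
            url = urlDao.getNextUrlToProcess();

            if (url != null) {
                // Process URL here...
            }
         } catch (Exception e) {
              // Handle exception here...
         } finally {
            // setUrlStatus ignores a null url, so this is safe when nothing was claimed
            urlDao.markUrlAsFetched(url);
         }
     }
}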

UrlDAO.java

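import java.sql.Connection;
import java.sql.ResultSet;

public class UrlDAO {

      // Assumed to be injected elsewhere; the original snippet uses it without declaring it
      private ConnectionManager connectionManager;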

      public String getNextUrlToProcess() {
          Connection jdbcConnection = null;
          String url = null;

          try {
              jdbcConnection = connectionManager.acquireSharedConnection();

              // Perform query to get next url
              // SELECT url FROM urls WHERE status = 'NEED_FETCH' LIMIT 1
              ResultSet rs = ...
              if (rs.next()) {
                  url = ...

                  // Mark url as being processed
                  setUrlStatus(jdbcConnection, url, "BEING_FETCHED");
              }
          } catch(Exception e) {
              // Handle exception here...
          } finally {
              connectionManager.releaseSharedConnection();
          }

          return url;
      }

      public void markUrlAsFetched(String url) {
          Connection jdbcConnection=null;

          try {
              jdbcConnection = connectionManager.acquireSharedConnection();

              setUrlStatus(jdbcConnection, url, "FETCHED");
          } catch(Exception e) {
              // Handle exception here...
          } finally {
              connectionManager.releaseSharedConnection();
          }
      }

      private void setUrlStatus(Connection jdbcConnection, String url, String newStatus) {
          // UPDATE urls SET status = ? WHERE url = ?
          if (url!=null) {
             ...
          }
      }
}
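
The three status strings that UrlDAO passes around could also be modeled as an enum, so that a mistyped or out-of-order status is caught in Java before any UPDATE is issued. This is only a sketch: `UrlStatus` and `canMoveTo` are hypothetical names that do not appear in the answer, and the allowed transitions are an assumption inferred from the snippet above.

```java
/** Hypothetical enum mirroring the status strings used in the answer's SQL comments. */
enum UrlStatus {
    NEED_FETCH, BEING_FETCHED, FETCHED;

    /** Assumed lifecycle: NEED_FETCH -> BEING_FETCHED -> FETCHED, and nothing else. */
    boolean canMoveTo(UrlStatus next) {
        switch (this) {
            case NEED_FETCH:
                return next == BEING_FETCHED;
            case BEING_FETCHED:
                return next == FETCHED;
            default:
                return false; // FETCHED is terminal in this sketch
        }
    }
}
```

`UrlStatus.BEING_FETCHED.name()` yields the same string the snippet writes to the status column, so the enum could be bound as the first parameter of the UPDATE statement.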

ConnectionManager.java

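import java.sql.Connection;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/**
 * The ConnectionManager opens, closes and shares a JDBC connection among the different threads.
 * Every acquireSharedConnection() call must be paired with releaseSharedConnection() in a finally block.
 */
public class ConnectionManager {
     private final Lock connectionLock = new ReentrantLock();
     private final Connection sharedConnection = ...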

     public Connection acquireSharedConnection() {
          connectionLock.lock();
          return sharedConnection;
     }

     public void releaseSharedConnection() {
          connectionLock.unlock();
     }
}
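
The acquire/release pattern above can be tried out without a database. What follows is a minimal sketch, not the answer's code: an in-memory map stands in for the `urls` table and the shared connection, and `SharedClaimDemo`, `claimNext` and `runWorkers` are hypothetical names. It illustrates why the SELECT and the status UPDATE have to happen while the same lock is held, so that no two threads can claim the same URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch only: a map replaces the urls table, so no JDBC driver is needed. */
class SharedClaimDemo {
    private final Lock lock = new ReentrantLock();
    private final Map<String, String> statusByUrl = new LinkedHashMap<>();

    SharedClaimDemo(List<String> urls) {
        for (String u : urls) {
            statusByUrl.put(u, "NEED_FETCH");
        }
    }

    /** Select-then-update under one lock, mirroring getNextUrlToProcess(). */
    String claimNext() {
        lock.lock(); // acquired before try, so finally only unlocks a held lock
        try {
            for (Map.Entry<String, String> e : statusByUrl.entrySet()) {
                if (e.getValue().equals("NEED_FETCH")) {
                    e.setValue("BEING_FETCHED");
                    return e.getKey();
                }
            }
            return null; // nothing left to claim
        } finally {
            lock.unlock();
        }
    }

    /** Starts the given number of workers, waits for them, returns every claimed URL. */
    static List<String> runWorkers(List<String> urls, int threads) {
        SharedClaimDemo table = new SharedClaimDemo(urls);
        List<String> claimed = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                String url;
                while ((url = table.claimNext()) != null) {
                    synchronized (claimed) {
                        claimed.add(url);
                    }
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(ex);
            }
        }
        return claimed;
    }

    public static void main(String[] args) {
        List<String> claimed = runWorkers(Arrays.asList("a", "b", "c", "d"), 3);
        System.out.println(claimed.size() + " URLs claimed");
    }
}
```

If the status update were moved outside the locked region, two workers could both read the same `NEED_FETCH` row before either marked it, which is the duplicate-processing problem the question describes.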

java  mysql  database  jdbc